Weight Streaming Execution
Cerebras Systems has presented Weight Streaming Execution as a new paradigm for training giant models. Weight streaming disaggregates parameter storage from the compute units. Cerebras' implementation combines three components: a weight streaming architecture built on wafer-scale compute units; the MemoryX service, a new system for parameter storage and weight updates; and the SwarmX fabric, a novel interconnect between parameter memory and compute. Cerebras argues that, because of the storage capacity provided by the MemoryX service, weight streaming is currently the only known way to run models with hundreds of trillions of parameters.

The architecture is designed around the Cerebras WSE-2, a wafer-scale processor that provides enough compute power and on-wafer SRAM to support layer sizes an order of magnitude larger than those used in today's state-of-the-art models. Because each WSE-2 can hold an entire layer, the weight streaming architecture can scale out with pure data parallelism; Cerebras claims that such a cluster can deliver more floating-point performance than the current largest supercomputer in the world for this class of workloads.

The WSE is also a preferred platform because, unlike accelerators built around dense matrix multiplication, it can fully exploit sparsity in the weight tensors, reducing computational work and runtime by one to two orders of magnitude. For unstructured weight sparsity, the runtime reduction is nearly linear in the number of nonzero weights. Cerebras Systems contends that the combination of effective sparsity, compute units capable of storing full layers, and memory disaggregation gives researchers the only practical way to train models with trillions of parameters.
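The core idea of the disaggregation can be sketched in a few lines. The example below is a minimal, hypothetical illustration only: all parameters live in an external store (standing in for MemoryX) and the compute unit receives one layer's weights at a time, during the forward pass and again in reverse order during the backward pass, with gradients sent back to the store for the weight update. The class names and interfaces are assumptions made for illustration, not Cerebras APIs.

```python
import numpy as np

class ParameterStore:
    """External parameter storage (MemoryX stand-in): holds all layer weights off-device."""
    def __init__(self, layer_shapes, rng):
        self.weights = [rng.standard_normal(s).astype(np.float32) * 0.01
                        for s in layer_shapes]

    def stream_out(self, layer_idx):
        # Stream one layer's weights toward the compute unit.
        return self.weights[layer_idx]

    def apply_gradient(self, layer_idx, grad, lr=1e-3):
        # The weight update happens at the store, not on the compute unit.
        self.weights[layer_idx] -= lr * grad

def forward_backward(x, store, n_layers):
    """Compute unit holds only one layer's weights (plus activations) at a time."""
    activations = [x]
    for i in range(n_layers):                   # forward pass: stream weights in layer order
        w = store.stream_out(i)
        x = np.maximum(x @ w, 0.0)              # simple ReLU layer as a stand-in
        activations.append(x)

    grad = np.ones_like(x)                      # dummy loss gradient for illustration
    for i in reversed(range(n_layers)):         # backward pass: stream weights in reverse
        w = store.stream_out(i)
        grad = grad * (activations[i + 1] > 0)  # ReLU gradient
        store.apply_gradient(i, activations[i].T @ grad)  # gradient streamed back to the store
        grad = grad @ w.T

rng = np.random.default_rng(0)
shapes = [(64, 64)] * 4
store = ParameterStore(shapes, rng)
x = rng.standard_normal((8, 64)).astype(np.float32)
forward_backward(x, store, n_layers=len(shapes))
```

Because the compute unit never needs to hold more than one layer's weights, model capacity is limited by the external store rather than by on-chip memory, which is what allows the pure data-parallel scale-out described above.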

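The sparsity claim can also be made concrete with a back-of-the-envelope sketch. The snippet below is illustrative only and uses SciPy rather than any Cerebras software: a matrix-vector product over a weight matrix with 10% density performs roughly one tenth of the multiply-accumulates of the dense product, provided the hardware skips zeros, which is why the work reduction is nearly linear in the number of nonzero weights for unstructured sparsity.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
d = 4096
density = 0.1                                   # 90% of weights pruned to zero
w_sparse = sparse.random(d, d, density=density, format="csr", random_state=0)
x = rng.standard_normal(d)

dense_macs = d * d                              # multiply-accumulates for a dense matvec
sparse_macs = w_sparse.nnz                      # work when zero weights are skipped
print(f"dense MACs:  {dense_macs:,}")
print(f"sparse MACs: {sparse_macs:,} (~{dense_macs / sparse_macs:.1f}x fewer)")

y = w_sparse @ x                                # CSR matvec touches only the nonzeros
```

Hardware built around dense matrix multiplication performs the full dense work regardless of how many weights are zero; an architecture that skips zeros, as the WSE is claimed to, is what turns this arithmetic reduction into a runtime reduction.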