MemoryX
MemoryX is a component of Cerebras Systems' Weight Streaming Execution, a paradigm for training giant models that disaggregates the storage of parameters from the compute units. The MemoryX service, which stores and updates parameters, works together with the SwarmX fabric, a novel interconnect between parameter memory and compute.
Model parameters and optimizer state are stored in the MemoryX service and updated there between training iterations. The capacity of the MemoryX service can scale from 4 TB to 2.4 PB, allowing the solution to support models with up to 120 trillion parameters. Internally, the MemoryX architecture combines DRAM and flash storage in a hybrid design to achieve both high performance and high capacity. Achieving full compute utilization requires enough network and memory bandwidth to feed weights into the compute units as fast as they are consumed. Both the storage and the I/O interface of the MemoryX service can match or exceed the I/O bandwidth of a CS-2 system.
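A back-of-envelope check of the capacity figures above: dividing 2.4 PB of storage by 120 trillion parameters gives 20 bytes per parameter, roughly what is needed for a weight plus per-parameter optimizer state (the interpretation as weight-plus-state is an inference, not a figure from the source).

```python
# Back-of-envelope check of the quoted MemoryX capacity:
# 2.4 PB of storage for a 120-trillion-parameter model leaves
# 20 bytes per parameter -- room for the weight itself plus
# optimizer state such as Adam's first and second moments.
capacity_bytes = 2.4e15   # 2.4 PB (decimal petabytes)
num_params = 120e12       # 120 trillion parameters
bytes_per_param = capacity_bytes / num_params
print(bytes_per_param)    # 20.0
```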
It is important that the compute units can access parameters with minimal latency, to avoid bubbles during training. Streaming weights can be pipelined layer by layer, since the weights for most layers can be fetched before computation on the previous layer completes; the one exception occurs at the boundary between training iterations, when the weights are updated. Latency can therefore be hidden for most of a training iteration and has minimal impact on performance.

Weight updates are performed by the MemoryX service, which provides flexible compute capable of supporting any optimizer algorithm, such as SGD or Adam. The compute required for weight updates is small relative to the compute used to calculate activations and gradients: the number of weight-update operations is proportional only to the number of parameters, O(P), whereas activation and gradient compute also grows linearly with the batch size B, i.e. O(B·P). For the same reason, the compute provided by the MemoryX service can support any size of CS-2 cluster. To avoid becoming a bottleneck, weight updates must be computed at least as fast as weights are streamed out to the CS-2 systems. Each weight is streamed out of the MemoryX service twice per iteration: once in the forward pass and once in the backward pass. The MemoryX service delivers a FLOP/s rate three orders of magnitude greater than its I/O bandwidth, allowing thousands of floating-point operations per weight on each training iteration, which is ample for any commonly used optimizer algorithm.
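To illustrate why update compute is O(P) and independent of batch size, here is a minimal NumPy sketch of an Adam step of the kind such a parameter service could run; this is an illustration of the standard Adam algorithm, not Cerebras' actual implementation, and all names are hypothetical.

```python
import numpy as np

def adam_update(w, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Per-weight work is a fixed handful of FLOPs, independent of batch
    # size: the batch has already been reduced to a single gradient by
    # the compute units before it reaches the parameter store.
    m = b1 * m + (1 - b1) * grad          # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# One layer's weights, with optimizer state resident alongside them;
# only w streams out to the compute units, grad streams back in.
w = np.zeros(4)
m = np.zeros(4)
v = np.zeros(4)
grad = np.array([1.0, -1.0, 0.5, 0.0])   # hypothetical reduced gradient
w, m, v = adam_update(w, grad, m, v, t=1)
```

Keeping the moment buffers m and v resident next to the weights is what lets the forward and backward passes stream only the weights themselves, as described above.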
