As neural networks grow in size and computational demand, scaling across large, distributed systems is the dominant approach to keeping training times in check. Recent state of the art models in areas covering computer vision, natural language processing and reinforcement learning would have been impossible to develop on single machines. Even a single training run of a given model would take weeks or months on a single device, if possible at all. Distributed training is a must today to achieve state of the art results.
Distributed training on traditional hardware scales predominantly through data parallelism, where multiple copies of the model run on different devices, with different input samples. In this blog post, we look at where this approach hits its limits. We also discuss how the fundamental advantages of wafer-scale technology enable any blend of different parallel execution strategies. This includes more effective options of traditional data-parallel as well as model-parallel execution modes, where different parts of the model are distributed across multiple devices, or any combination of those.
Parallel Execution Modes with Traditional Devices
Accelerators such as GPUs typically operate on batches of data. This provides the arithmetic intensity required to avoid bottlenecks where processing cores are waiting for data to load from off-chip memory such as HBM. The inability to fit all the layers of a deep neural network in local on-chip memory, together with limited off-chip memory bandwidth makes layer-sequential execution, where layers are processed one at a time, a natural choice for GPUs.
This layer-sequential, data-parallel execution can be scaled to multiple workers in a cluster by copying the model to all workers and having each one operate on a separate mini-batch of data. After gradients are computed for each worker, they must be synchronized to update all copies of the model parameters. This requires sending model parameters over slow interfaces such as PCIe and Ethernet. With standard hardware, this often becomes a performance bottleneck.
The communication overhead can be hidden with larger per-device batch sizes, but this comes at the cost of increasing the total batch size — the product of per-worker batch size and number of workers — even more. This imposes a limit on the total number of accelerators that can be used before the total batch size becomes too large. At that point, adding more samples to the batch no longer accelerates training, models require extensive hyper-parameter tuning and special adaptive learning rate techniques to converge, and eventually fail to converge at all.
Model-parallel execution is less frequently used with traditional accelerators because it often requires even higher bandwidth for communication between devices, and a custom model-specific distribution layout. There are two main kinds of model parallel execution: 1) individual layers of a model are split across multiple devices, 2) different layers, or groups of subsequent layers, run in parallel on different devices for different input samples in a pipelined fashion. The first requires very frequent and massive communication between devices, is rarely used, and we won’t go into it here. The second kind is more feasible. It resembles an assembly line where each step of the processing is performed by a particular worker. But not all models are easily “pipelined”. It is critical for a pipeline to be balanced, so it takes approximately the same time for each worker to perform its task. For layer-pipelined execution this means engineers have to find an optimal way of dividing a model into blocks of subsequent layers, so that each block requires roughly the same amount of time to process. For models with heterogeneous layers, when every layer has different compute requirements, this is not an easy task: imagine, that you have many objects of different weights, and you need to divide them into a predefined number of sets, such that each total weight per set is approximately the same.
CS-1 Execution Modes
The Cerebras CS-1, with 400,000 cores and 60x more silicon than the biggest GPUs, is a single system designed to provide the acceleration otherwise achieved only through massive scale-out. It can perform data-parallel layer sequential execution at much smaller batch sizes and with less weight synchronization overhead than traditional clusters. This is made possible by its 9.6 PB/s memory bandwidth and a low latency, high bandwidth interconnect sharing the same silicon substrate with all the compute cores. This means data parallel training of existing models at scale “just works” without the need for large batch sizes, special optimizers or hyper-parameter tuning.
In addition to pushing small-batch, layer-sequential training to a much higher speed than otherwise possible, the CS1 greatly simplifies model-parallel scaling. Layer-pipelined execution allows multiple layers (or blocks of layers) to run in parallel for different input samples, in a pipelined manner. In this mode, each “worker” is allocated a subset of the 400,000 processing elements. These workers can now specialize, with a specific set of weights and instructions that stay local to a subset of processors. Each worker continuously runs the same operation and keeps all required parameters in fast local memory. Eliminating redundant copies of weights enables larger networks to be trained. Not needing to exchange weights or gradients avoids synchronization overhead. Instead of weights and gradients, the activations of each sample are passed locally from one worker to the next. For a large mini-batch, this may require more memory bandwidth than exchanging gradients, making layer-pipelined training particularly suited for the high memory bandwidth of the CS1.
In contrast to clusters of traditional accelerators, where the computational capacity of each worker is defined by a corresponding accelerator and is not flexible, CS-1 provides a flexible pool of compute resources which are dynamically allocated to different virtual workers. Coming back to our earlier metaphor with objects of different weights, with CS-1 there is no predefined number of sets to which these objects should be divided, and there is no requirement to have equal total weight per set. We can allocate as many virtual workers on CS-1 as needed for a given model, and every worker will get as much compute power (or as many cores) as needed for a given layer or block of layers. This optimal allocation is done automatically by the Cerebras Software Stack.
Stochastic gradient descent is implemented for layer-pipelined execution by sequentially feeding a mini-batch of inputs into the network and then waiting for all resulting gradients to be accumulated before updating the weights and processing the next batch. The pipeline is repeatedly filled with samples and then drained to perform the weight update. At the start and end of each mini-batch, some workers remain idle. Highest utilization is achieved when the batch size is large compared to the number of parallel stages. For most common networks with tens of layers, this is the case at small batch sizes. For networks where it is not, the next section describes a way to completely eliminate this overhead.
Pipelined backpropagation  is a method that avoids the fill and drain overhead of standard layer-pipelined execution. In pipelined backpropagation, the weights are updated for every sample, without ever draining the pipeline. Updating the weights without draining the pipeline can lead to slightly different weights being used on the forward and backwards passes. Cerebras researchers analyzed the implications of this and designed a simple momentum-based technique to correct these differences and enable training to full accuracy using pipelined backpropagation. This method is discussed in more detail here.
To summarize, the CS-1 supports training using many different types of parallelism. It can efficiently perform standard layer-sequential, data-parallel execution at smaller batch sizes than clusters of traditional accelerators, and it also enables more flexible layer-pipelined execution modes. Layer-sequential and layer-pipelined execution can be combined with data-parallelism at many different levels of granularity. It allows training today’s architectures faster without tuning batch sizes and learning rates. For small networks, it allows combining both layer and batch parallelism, while the largest networks can use layer-sequential execution efficiently at a batch size of one. Midsize networks can be executed in a “block-sequential” mode, when one block of layers is evaluated at a time with layer-pipelined execution within each block. This gives practitioners the freedom to train networks of all shapes and sizes efficiently, be it deep or shallow, wide or narrow, large or small.
I would like to acknowledge Atli Kosson, Vitaliy Chiley, Urs Koster, Jessica Liu and Sean Lie for their contributions to this work.
 A. Petrowski, G. Dreyfus, and C. Girault. 1993. Performance analysis of a pipelined backpropagation parallel algorithm. Trans. Neur. Netw. 4, 6 (November 1993), 970–981. DOI