Our wafer-scale processor, working in tandem with MemoryX and SwarmX technologies, enables weight streaming by disaggregating memory and compute. This results in near-linear scaling for Cerebras Wafer-Scale Clusters and huge performance gains without needing to think about complex distributed compute challenges.

Large language models (LLMs) like GPT-J, GPT-Neo and GPT-3 can produce amazing results, but training these models presents novel computational challenges. In 2018, state-of-the-art language models had 100 million parameters. Two years later, the largest model, GPT-3, had 175 billion parameters. Today, there are 1 trillion parameter models. In fact, OpenAI’s scaling laws suggest that as parameter count increases, so too does model accuracy, meaning that in the not-too-distant future we may well see models with tens or hundreds of trillions of parameters.

The impact on compute infrastructure of this extraordinary model size growth is twofold. As seen in Figure 1 below, both the compute and the memory requirements needed to train these LLMs have grown exponentially and show no sign of slowing down.

Figure 1. Memory and compute requirements for various state-of-the-art neural networks. Note that the axes are logarithmic. (Source: Cerebras.)

Today’s demands for AI compute have been met by ever-larger clusters of graphics processing units (GPUs). Figure 2, gathered from recent publications, shows the number of graphics processing units required to run the largest of the LLMs. These clusters are so large and complicated that successful training takes months of work, tens of millions of dollars of hardware, and tens of megawatts of power – and a successful run is often notable enough to merit a publication. In fact, the number of GPUs needed grows so rapidly with model size that the Y-axis has to be plotted on a logarithmic scale.

Figure 2. Number of GPUs and types of parallelism required to train the largest LLMs.

Today, the fundamental limiting factor in running LLMs is not the AI. It is the distributed compute challenge of putting LLMs on thousands of GPUs, and the scarcity of the distributed compute expertise necessary to do so.

Ironically, what’s hard about big AI isn’t the AI. It’s the distributed compute.

Given this backdrop, how do we make these models – and the future models expected to be 100 or 1,000 times larger – accessible to more than just a handful of hyperscale organizations? And how do we reduce the training time so practitioners can experiment and train their own models rather than relying on generic, pre-trained models?

Clustering is difficult and scales nonlinearly with GPUs.

Let’s look at what it takes to train state-of-the-art models on a cluster of GPUs. There are three parallelization techniques:

  • Data parallel
  • Tensor model parallel
  • Pipeline model parallel

Data Parallel

The simplest method for parallelizing work is called data parallel. Data parallel training is only possible if the model fits onto a single device – that is, the parameters fit in the device’s on-chip memory, and the largest matrix multiply in the largest layer can be done on the device. When running data parallel, multiple devices, each with the same configuration, are each presented a portion of the training set. The gradients computed on each device are then averaged before the weights are updated. This is dead simple, and it is the go-to approach for all models smaller than two billion parameters.
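
To make the mechanics concrete, here is a minimal sketch of a data parallel training step in PyTorch. This is illustrative only, not Cerebras code; the function name and structure are ours, and it assumes a torch.distributed process group has already been initialized.

```python
# Minimal data-parallel sketch (illustrative, not Cerebras code).
# Every process holds a full copy of the model and sees a different
# shard of the batch; gradients are averaged before each weight update.
import torch.distributed as dist

def data_parallel_step(model, loss_fn, inputs, targets):
    # Assumes dist.init_process_group(...) has already been called.
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    world_size = dist.get_world_size()
    for p in model.parameters():
        # Sum gradients across replicas, then average, so every copy
        # of the model takes an identical update.
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world_size
    return loss
```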

Tensor Model Parallel

With models larger than two billion parameters, the parameters no longer fit in memory and the largest matrix multiplies no longer fit on a single device. Instead, every single layer must be split and spread across several devices. This approach is called tensor model parallel. Even for the largest of networks – the 1 trillion parameter Megatron behemoth, which used more than 2,000 GPUs to train – a single layer could be spread over a maximum of only 8 GPUs. In fact, in all the published work cited in Figure 2, nobody has been able to run tensor model parallelism across more than 8 GPUs.
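
As a simplified illustration of what tensor model parallelism means in practice (our own sketch, not Megatron or Cerebras code), here is a single linear layer whose weight matrix is split column-wise across devices. Every layer split this way adds a gather step, which in a real system is a collective communication.

```python
# Illustrative column-parallel linear layer: the weight matrix is split
# column-wise across devices, each device computes a slice of the output,
# and the slices are gathered back together.
import torch

def column_parallel_linear(x, weight_shards):
    # weight_shards[i] lives on device i and holds a column slice of the
    # full weight matrix; every device needs the full input activation x.
    partials = [x.to(w.device) @ w for w in weight_shards]
    # Gathering the partial outputs reconstructs the full layer output.
    # In a real system this gather happens on every single layer, which
    # is why tensor parallelism is so bandwidth-hungry.
    return torch.cat([p.to(x.device) for p in partials], dim=-1)
```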

Pipeline Model Parallel

Starting at around 20 billion parameters, yet another form of parallelism is deployed, namely pipeline model parallel. In this mode, a sequential pipeline is formed in which the work for Layer 1 is done on a GPU or group of GPUs, then Layer 2 is done on a separate GPU or group of GPUs, and so on. This involves deciding which layers should be placed on which devices and carefully measuring the latency between them.
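
Here is a deliberately naive sketch of the idea (ours, for illustration only): consecutive groups of layers are assigned to different devices, and activations are handed from stage to stage. Real pipelines also split each batch into micro-batches so the stages can work concurrently rather than idling.

```python
# Naive pipeline-model-parallel sketch: assign an equal number of layers
# to each device, then pass activations stage to stage. In practice the
# split must be tuned by hand to balance memory and compute per stage.
import torch.nn as nn

def build_pipeline(layers, devices):
    # Assumes len(layers) divides evenly by len(devices).
    per_stage = len(layers) // len(devices)
    stages = []
    for i, device in enumerate(devices):
        stage = nn.Sequential(*layers[i * per_stage:(i + 1) * per_stage]).to(device)
        stages.append((stage, device))
    return stages

def pipeline_forward(x, stages):
    for stage, device in stages:
        x = stage(x.to(device))  # move activations to the next stage's device
    return x
```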

In sum, for models of fewer than two billion parameters, the easiest technique – data parallel – is sufficient. Models larger than two billion and smaller than 20 billion parameters require two simultaneous techniques: data parallel and the more complicated tensor model parallel. And models of 20 billion parameters and beyond require all three techniques – data parallel, tensor model parallel and pipeline model parallel – to be implemented simultaneously. This is shown in Figure 2 above, where the colored bars represent the number of graphics processing units allocated to each technique. This combination of all three approaches is known as 3D parallelism.

Figure 3. Parallel training strategies for large NLP networks using GPUs.
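
To make the bookkeeping of 3D parallelism concrete, here is a small numeric sketch with example figures of our own choosing (not taken from any specific published run): the product of the three parallelism degrees must equal the total GPU count, so every choice constrains the others.

```python
# Illustrative 3D-parallelism bookkeeping (example numbers, not from a
# specific published configuration).
num_gpus = 2048
tensor_parallel = 8       # in practice limited to a single 8-GPU server
pipeline_parallel = 32    # number of pipeline stages
data_parallel = num_gpus // (tensor_parallel * pipeline_parallel)  # = 8

# The three degrees must multiply out to the full cluster; changing any
# one of them (or the cluster size) forces the others to be re-derived.
assert tensor_parallel * pipeline_parallel * data_parallel == num_gpus
```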

If you want to learn more about these parallelization strategies (and their challenges), look at this blog post.

Is a simpler approach even possible? It is, but only with an architecture that’s co-designed from the ground up specifically for neural networks. In this post, I’m going to take a deep dive into the Cerebras architecture to show you how we do this.

Implementing 3D parallelism is not for the faint of heart

Implementing 3D parallelism is not for the faint of heart. Just look at NVIDIA’s own instructions for distributed training. New parameters are required for any change in setup. For each new model, one must determine how to split and distribute the model across multiple devices. This involves measuring and allocating device memory, latency, and device constraints. It requires the use of Horovod or OpenMPI, accounting for the impact on latency and accuracy, changing the batch size, and setting new hyperparameters.

3D parallelism is well and truly in the most complex of distributed compute domains. In addition to the list above, you must decide the degree of data parallelism, tensor model parallelism and pipeline model parallelism. You must also choose an appropriate batch size, and decide whether activation checkpointing should be used and, if so, which variant. All of these choices are interdependent, and the optimal combination works for only one model and one cluster configuration. Change your model configuration and you start again from scratch. Gain access to a different size cluster – the same, start from scratch.

As if that weren’t complicated enough, each form of parallelism has different bandwidth requirements for communication between the devices. Tensor model parallel training is very communication heavy, which is why it is not used across more than a single 8-GPU server; the number of devices that can leverage this form of parallelism is therefore limited. Data parallel and pipeline parallel both require large batch sizes to keep all the devices busy. But very large batch sizes work against convergence: larger batches often need more training epochs to reach the target accuracy, or fail to reach it at all, even when that accuracy is achievable with a smaller batch.

Finally, after all this work, the result is a model spread over hundreds or thousands of GPUs. But performance has not scaled with GPU count. It is vastly sublinear. This is why training LLMs is so power hungry and requires so many GPUs: you are only getting a small percentage of the FLOPs from each GPU (but paying all the cost in dollars and in power), and the rest are being wasted by communication overheads and latency delays.

Cerebras makes scaling to trillions of parameters dead simple.

How does Cerebras do this? We have invented technology that allows us to scale to the largest LLMs using only data parallelism. We avoid the distributed compute problem entirely and instead deliver push-button allocation of work to compute, and linear performance scaling. In fact, Cerebras Wafer-Scale Cluster users can scale a model from a single CS-2 to 192 CS-2s with a single keystroke. By avoiding all the 3D parallelism mess, we make scaling the largest models dead simple.

Data parallelism requires that all the calculations, including the largest matrix multiplies of the largest layer, fit on a single device, and that all the parameters fit in the device’s memory. The Cerebras CS-2 system, powered by the second-generation Wafer-Scale Engine (WSE-2), satisfies both requirements.

The Cerebras WSE-2 is the largest processor ever built. It is 56 times larger than the largest GPU, has 123 times more cores, 1,000 times more on-chip memory, 12,000 times more memory bandwidth, and 45,000 times more fabric bandwidth. The WSE-2 is the size of a dinner plate, while the largest GPU is the size of a postage stamp.

The sheer size and computational resources on the WSE-2 allow us to fit the largest layers of the largest neural networks onto a single device. In fact, on the Wafer-Scale Engine-2, we can fit layers 1000 times larger than the largest layer in the largest existing NLP network. This means we never need to break up work and spread it across multiple processors.

However, even a wafer-scale processor doesn’t have the on-chip memory to support trillions of parameters. For this, we invented MemoryX and SwarmX.

MemoryX enables Cerebras to disaggregate parameter storage from compute without suffering the penalty usually associated with off-chip memory. Storage for model parameters is in the separate MemoryX system, while all the compute is in the CS-2, as shown in Figure 4. By disaggregating compute from memory, MemoryX provides a nearly unbounded amount of storage for the parameters and optimizer states.

How does it work? MemoryX streams weights to the CS-2, where the activations reside. In return, the CS-2 streams back the gradients. MemoryX uses these, in combination with the stored optimizer state, to compute the weight updates for the next training iteration. This process is repeated until training is complete. MemoryX enables even a single CS-2 to support a model with trillions of parameters. The mechanism by which the latency introduced by this approach is hidden through coarse- and fine-grained parallelism is described in this whitepaper.

Figure 4. Cerebras weight streaming disaggregates memory and compute.
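
The following pseudocode is a conceptual sketch of the loop just described. It is our simplification, not Cerebras’s implementation, and every object and method name here (memoryx, cs2, stream_weights, apply_update, and so on) is hypothetical, used only to show the direction of the data flow.

```python
# Hypothetical sketch of one weight-streaming training step. MemoryX holds
# the weights and optimizer state; the CS-2 holds the activations and does
# the compute, receiving weights one layer at a time.
def weight_streaming_step(memoryx, cs2, batch):
    # Forward pass: weights stream in layer by layer; activations stay on the CS-2.
    acts = cs2.load(batch)
    for layer in memoryx.layer_ids():
        acts = cs2.forward(layer, memoryx.stream_weights(layer), acts)

    # Backward pass: the CS-2 streams weight gradients back to MemoryX,
    # which applies the optimizer update using its stored optimizer state.
    grad = cs2.loss_gradient(acts, batch)
    for layer in reversed(memoryx.layer_ids()):
        grad, weight_grads = cs2.backward(layer, memoryx.stream_weights(layer), grad)
        memoryx.apply_update(layer, weight_grads)
```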

What good is a single CS-2 for LLMs? A single CS-2 can quickly and easily train models of up to 10 billion parameters. And for models with hundreds of billions of parameters, a single CS-2 can be used for fine-tuning or for development and debugging. The model can then be trained on a cluster of arbitrary size with no code changes – you just change one parameter in the training command.

While MemoryX adds vast parameter storage, SwarmX connects MemoryX to clusters of CS-2s, enabling the cluster to scale out while running strictly data parallel. SwarmX forms a broadcast-and-reduce fabric: the parameters stored in MemoryX are replicated in hardware and broadcast across the SwarmX fabric to multiple CS-2s, and the SwarmX fabric reduces the gradients sent back from the CS-2s, providing a single gradient stream to MemoryX (Figure 5).

Figure 5. Weight streaming for a cluster of CS-2s.

The result is very powerful. The combination of SwarmX and MemoryX with weight streaming on the Wafer-Scale Engine-2 enables strict data parallel scaling. Each CS-2 is configured identically. The training set is divided into parts, each of which is sent to a corresponding CS-2. Weights stream from MemoryX through SwarmX to each of the CS-2s. The CS-2s never need to communicate with each other (which is the essence of data parallel scaling). They simply calculate the gradients, based on the parameters and the data they see, and send the gradients back through the SwarmX fabric, where they are reduced. A single stream is delivered to MemoryX, where the weight update occurs, and the process repeats. (A more detailed discussion of the elements in this solution can be found in this white paper.)
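
As a rough sketch of that flow (hypothetical helper names; the actual broadcast and reduction happen in the SwarmX hardware fabric, not in software like this): one weight stream fans out to every CS-2, and the per-system gradient streams are summed back into one.

```python
# Hypothetical sketch of the SwarmX broadcast/reduce flow.
def swarmx_step(weight_stream, cs2_systems, data_shards):
    # Broadcast: every CS-2 receives the same weights but a different
    # shard of the training data.
    gradient_streams = [
        cs2.compute_gradients(weight_stream, shard)
        for cs2, shard in zip(cs2_systems, data_shards)
    ]
    # Reduce: sum the gradient streams element-wise into the single
    # stream that is delivered to MemoryX for the weight update.
    return [sum(grads) for grads in zip(*gradient_streams)]
```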

Ease-of-use and Performance

There are several benefits to this solution. First, clusters are dead simple to set up. From a Jupyter notebook on a laptop, GPT-3 can be spread over a cluster of CS-2s with a single keystroke. The largest LLMs can be set up quickly, distributed, and debugged without worrying about complex hybrid parallelism. Running a 20B parameter model is the same as running a 1B parameter model, which is the same as running a 175B parameter model – each requires just a single number change in a configuration file (see Figure 6). It’s that simple! This is not possible on GPUs, even with sophisticated distributed software frameworks, because of the fundamental challenges of distributing the work across many small devices, as discussed above. For a walkthrough of training and reconfiguring large models on our systems, see our blog post here.
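
To illustrate the idea (the field names below are our own invention, not Cerebras’s actual configuration schema), the kind of change involved looks like this: swapping model sizes or cluster sizes is a one-value edit rather than a re-derivation of a 3D-parallelism layout.

```python
# Hypothetical run configuration; the field names are illustrative only.
run_config = {
    "model": "gpt3-xl",   # swap in a larger model variant without code changes
    "num_systems": 1,     # change 1 -> 16 to train on a 16x CS-2 cluster
    "batch_size": 512,    # per-run batch size; no parallelism layout to tune
}
```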

Data parallelism also enables strict linear scaling. That means if you go from one CS-2 to two CS-2s in a cluster, the time to train is cut in half; if you go from one CS-2 to four CS-2s, training time is cut to a quarter. This is an exceptionally rare characteristic in distributed computing, and it is profoundly cost and power efficient. Unlike GPU clusters, in a Cerebras cluster, as you add more compute, performance increases linearly and time to train falls linearly.

Consider a chart published by Purdue professor Tim Rogers and Mahmoud Khairy (Figure 7) in this article. It shows the performance scaling of TPUs and GPUs, normalized to 16 GPUs. The Y-axis shows the performance increase over a 16-GPU cluster (10X, 20X, 30X and so on), while the X-axis shows the number of GPUs necessary to achieve that speedup. You might think that to go 10 times faster than 16 GPUs, you would need 160 GPUs. The research suggests otherwise: you need 800 GPUs to achieve a 10X speedup. So, to go 10 times faster you need 50 times as many GPUs, which costs at least 50 times as much and uses more than 50 times the power. This is the essence of non-linear scaling.

Figure 7. TPU and GPU performance on MLPerf-Transformer 0.6 (Rogers and Khairy, 2021)
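
The arithmetic behind that comparison, spelled out as a quick check (the 800-GPU figure is read from the chart above):

```python
# Scaling efficiency of the GPU cluster in Figure 7 versus ideal linear scaling.
baseline_gpus = 16
gpus_for_10x = 800                 # GPUs reported to reach a 10X speedup
speedup = 10

hardware_multiple = gpus_for_10x / baseline_gpus   # 50x more hardware...
scaling_efficiency = speedup / hardware_multiple   # ...for 20% scaling efficiency
print(hardware_multiple, scaling_efficiency)       # 50.0 0.2
```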

Compare with the chart below (Figure 8). On GPT-3 XL, Cerebras shows perfect linear scaling up to 16 CS-2s – that’s perfect scaling up to 13.6 million cores. So, to go 10 times as fast as a single CS-2, you don’t need 50 CS-2s. You need exactly 10. That’s the power of the Cerebras Wafer-Scale Cluster.

Figure 8. Cerebras Wafer-Scale Cluster performance on GPT-3 XL.

Andrew Feldman, CEO and Co-Founder, Cerebras Systems