Abstract

Machine learning models such as GPT have been growing at an exponential rate, resulting in prohibitively high compute, memory, and energy requirements. This growth is unsustainable even with the ongoing impressive advances in traditional hardware design. At Cerebras, we believe the only way to truly solve this problem is by codesigning the hardware and the ML algorithms for sparsity. Fortunately, we already have hardware capable of accelerating sparsity: the Cerebras CS-2 system. What has been missing, until now, are new sparse machine learning (ML) algorithms that can harness that hardware.

Existing ML techniques already show that models can be very sparse, either with inherent sparsity, or by inducing sparsity. At Cerebras, we are building on these ML foundations and creating sparse ML techniques paired with the Cerebras CS-2 hardware. Even in our initial work, we have shown that high degrees of sparsity can be induced into large-scale GPT models while still preserving model accuracy. While we believe we are just at the beginning of what sparsity can do, these results already show that sparsity can be a fundamental key to enabling the industry to continue to grow in an efficient and sustainable way.

With the Cerebras CS-2’s unique ability to run large models easily while accelerating unstructured sparsity, we are enabling sparsity innovation at a scale not practical before. Until now, most published sparsity research has been limited to models 10x smaller than what we use in our initial work. We are excited to share the results of that work, done in only a matter of weeks thanks to the Cerebras CS-2, which already show the promise of high sparsity on large-scale GPT models. We demonstrate training acceleration of a large-scale GPT model on the Cerebras CS-2 by pretraining with high sparsity at a fraction of the training FLOPs while preserving downstream accuracy using dense fine-tuning. We also demonstrate training a large-scale GPT model using iterative pruning on the Cerebras CS-2 to create an extremely sparse model with only a fraction of the FLOPs for inference. We are excited by these initial results but know they are just the beginning of what sparsity can do on the Cerebras CS-2.

Unsustainable ML Model Growth

In 2018, state-of-the-art neural networks such as BERT had a few hundred million parameters. Then just two years later, the famous GPT-3 model had 175 billion parameters. That represents over 1000x growth in compute demand in just two years, as shown in Figure 1. The GPT-3 model, for example, famously took months and millions of dollars to train on 1,024 GPUs for a single training run. There is no end in sight for this growth as the ML community continues to demonstrate that larger models improve accuracy. Soon we will be striving to run models with trillions of parameters.

Figure 1 Compute and memory requirements for various state-of-the-art neural networks, on log-log scale. The increase of over 1000x in 2 years between BERT Base and GPT-3 is highlighted.

The compute and memory requirements are already prohibitive even for the largest companies in the world. We need a better way to grow models more efficiently, to get the advantages of larger models but with substantially less compute and memory resources.

Sparsity is the Answer

Neural network models are made up of layers of neurons and connections between them. The number of layers, the number of neurons per layer, and the pattern of connections together make up the structure of the model. The connections are represented as weights in a collection of matrices, one for each layer of the model. When there are missing connections, represented as zeros in the weight matrices, we refer to the model as sparse. For zero weights, no computation or memory is required during training or inference to achieve the correct result. However, only hardware that can accelerate sparsity, such as the Cerebras CS-2, can take advantage of the lower resource requirement. I spoke about this in some detail at this year’s Hot Chips conference, a summary of which you can read here.

Structured Sparsity

Sparsity comes in different forms. It is common for sparsity to occur naturally in the model structure itself if the pattern of connections is designed to only connect a subset of the neurons. Often, models are constructed this way intentionally with a predefined pattern and we refer to this as structured sparsity. An example of this type of sparsity is depth-wise separable convolutions. This type of structured sparsity that exists already in the model is an obvious avenue to pursue more efficient computation.
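To make the efficiency gain concrete, the sketch below compares the FLOP count of a standard convolution against a depthwise-separable one. The layer dimensions are arbitrary, chosen only for illustration:

```python
# Hypothetical FLOP comparison: standard convolution vs. a
# depthwise-separable one (structured sparsity by construction).
def conv_flops(h, w, c_in, c_out, k):
    # Dense conv: every output channel sees every input channel.
    return h * w * c_in * c_out * k * k

def depthwise_separable_flops(h, w, c_in, c_out, k):
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1x1 conv mixes channels
    return depthwise + pointwise

dense = conv_flops(56, 56, 128, 128, 3)
sparse = depthwise_separable_flops(56, 56, 128, 128, 3)
print(f"FLOP reduction: {dense / sparse:.1f}x")  # → FLOP reduction: 8.4x
```

For a 3x3 kernel, the separable form needs roughly 8x fewer FLOPs at these sizes, which is why sparsity that is designed into the model structure is such an attractive starting point.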

Unstructured Sparsity

But what if the model does not have structured sparsity? The ever-growing GPT model is an example. It turns out that even fully dense models, such as GPT, can be made sparse by inducing unstructured sparsity. In this form of sparsity, certain weights are set to zero which effectively prunes the connections within the model, as shown in Figure 2. When the pruning is done without a fixed pattern, we refer to this as unstructured sparsity.

Figure 2 Weight sparsity applied to a dense neural network by zeroing weights has the effect of pruning the connection between neurons in the model.

Although the original model is dense, inducing unstructured sparsity is also natural because not all weights are created equal. In fact, the goal of training a model is ultimately to discover which weights are more important than others. When a dense model is fully trained, many of the resulting weights are very small in magnitude and therefore less important than the high-magnitude weights. The nature of training itself suggests that unstructured sparsity can arise even in an otherwise dense model.
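A minimal NumPy sketch of this idea, assuming simple magnitude-based pruning (for illustration only, not necessarily the exact criterion used in our studies):

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the `sparsity` fraction of weights with smallest magnitude."""
    k = int(weights.size * sparsity)
    if k == 0:
        return weights.copy()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    mask = np.abs(weights) > threshold   # keep only high-magnitude weights
    return weights * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512))          # stand-in for one layer's weights
w_sparse = magnitude_prune(w, 0.75)
print(f"sparsity: {(w_sparse == 0).mean():.2%}")  # ~75% of weights zeroed
```

The surviving weights are unchanged; the zeros are simply pruned connections, with no fixed pattern imposed on where they fall.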

A key benefit of unstructured sparsity is the model retains the original baseline structure, without the need to create a new model architecture. Additionally, on top of the existing properties of the baseline model, the sparse model can provide speedup in both training and inference. The following sections provide additional details and the ML techniques we have developed to achieve each of these benefits.

Faster Training

The most obvious goal of training a sparse model is to accelerate training time and reduce training cost by skipping the computation on all zero weights. There are many techniques for sparse training but they often suffer from lower model accuracy, especially on GPT models. However, at Cerebras, even in our initial study, we demonstrated it is, in fact, possible to accelerate training of large-scale GPT models using unstructured sparsity while still preserving accuracy on downstream tasks.

Our technique accelerates training of large-scale GPT models by pretraining with extreme sparsity and preserving downstream accuracy using dense fine-tuning. We demonstrated this technique on a GPT-3 XL model with 1.3 billion parameters, using up to 75% unstructured weight sparsity on a Cerebras CS-2. At that level of sparsity, we pretrain with 2.5x fewer FLOPs, followed by fine-tuning, resulting in a final dense model without loss in accuracy on many downstream tasks. Although the fine-tuning was performed dense, for large-scale GPT models the pretraining dominates the overall training FLOPs, so using sparse pretraining reduced the overall training FLOPs substantially, as shown in Figure 3.

With only a simple sparsity technique in our first study, we already see the promise of sparse training. Going forward, the Cerebras CS-2 will enable even further improvement on even larger models with more advanced techniques.

Figure 3 FLOPs spent in pretraining GPT-3 XL 1.3B on PILE, followed by fine-tuning on Curation Corpus. FLOPs spent during pretraining dominate the overall FLOPs. Sparse pretraining at 75% sparsity leads to 60% reduction in overall FLOPs.
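The arithmetic behind Figure 3 can be sketched with a toy FLOP budget. The 4:1 pretraining-to-fine-tuning ratio below is an illustrative assumption, not the exact figure from the study:

```python
# Back-of-envelope FLOP accounting for sparse pretraining followed by
# dense fine-tuning. Budget units are arbitrary.
def total_flops(pretrain_flops, finetune_flops, sparsity):
    sparse_pretrain = pretrain_flops * (1 - sparsity)  # skip zero weights
    return sparse_pretrain + finetune_flops            # fine-tuning stays dense

dense_total = total_flops(100.0, 25.0, 0.0)    # fully dense baseline
sparse_total = total_flops(100.0, 25.0, 0.75)  # 75% sparse pretraining
reduction = 1 - sparse_total / dense_total
print(f"overall FLOP reduction: {reduction:.0%}")  # → 60%
```

Because pretraining dominates the budget, sparsifying only the pretraining phase still yields a large end-to-end reduction even though fine-tuning remains dense.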

For more information about this sparse training technique and results, see our more detailed blog post here.

Faster Inference

Large-scale GPT inference latency and cost have also become prohibitively high due to model growth. Therefore, it is extremely valuable to create a sparse model to improve inference speed and lower inference cost by using fewer weights and fewer compute FLOPs. Pruning models for inference is already a common practice in industry for small models, especially when running on edge devices. However, pruning is not yet widely adopted on large-scale GPT models because the cost of pruning on hardware without sparsity acceleration is prohibitively high for large models.

With our ability to accelerate unstructured sparsity using the Cerebras CS-2, we set out to answer the open question of whether large GPT models can be sparsified for inference. In our first attempt, we demonstrated it is, in fact, possible to train a large-scale GPT model using iterative pruning to create a sparse model without loss of accuracy. The iterative pruning process removes the lowest magnitude weights incrementally and retrains the model to accuracy on each iteration, as shown in Figure 4. We trained an extremely sparse GPT-3 XL 1.3B parameter model using iterative pruning with unstructured weight sparsity on Cerebras CS-2. The result is a sparse model with 84% sparsity, 3x fewer inference FLOPs, 4.3x fewer parameters in memory, and no loss in accuracy. We are excited by these initial results and look forward to using Cerebras CS-2 to expand to more advanced techniques on even larger models.

Figure 4 Iterative pruning example at 20% per iteration.
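The schedule in Figure 4 can be sketched as follows. The assumption that each iteration removes 20% of the remaining (rather than original) weights is ours, for illustration; the retrain step is a placeholder:

```python
# Sketch of an iterative pruning schedule: prune 20% of the remaining
# weights per iteration, retraining to accuracy between prunings.
def density_after(iterations, prune_frac=0.20):
    density = 1.0
    for _ in range(iterations):
        density *= (1 - prune_frac)  # remove 20% of remaining weights
        # retrain(model)             # placeholder: restore accuracy here
    return density

for i in range(1, 9):
    print(f"iteration {i}: sparsity = {1 - density_after(i):.1%}")
```

Under this geometric schedule, eight iterations reach roughly 83% sparsity, in the neighborhood of the 84% achieved in our study.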

For more information about this pruning technique and results, see our more detailed blog post here.

The Cerebras Architecture Is Designed for Sparsity

These ML sparsity techniques need to be codesigned with the hardware architecture to actually achieve training or inference speedup because traditional GPU architectures can only run dense operations efficiently. Unlike traditional architectures, however, the Cerebras CS-2 hardware was codesigned to support unstructured sparsity at full performance. When paired with sparse ML techniques, the Cerebras CS-2 is uniquely capable of translating the FLOP reduction into performance and speedup. There are two key mechanisms in the hardware that enable unstructured sparsity acceleration: memory bandwidth and dataflow execution.

Memory bandwidth is fundamental because to accelerate unstructured sparsity, you need enough memory bandwidth to run sparse general matrix multiply (GEMM) operations at full performance. The Cerebras CS-2 architecture has unique on-chip memory with full bandwidth to run all matrix operations out of memory at full performance across all BLAS levels, as shown in Figure 5. Traditional GPU architectures, on the other hand, with low bandwidth to off-chip DRAM memory are limited to running only dense GEMMs at full performance: that’s dense matrix-matrix multiplies only. In fact, any BLAS level below full matrix-matrix multiply requires a massive jump in memory bandwidth. That’s not possible with traditional architectures, but with the Cerebras on-chip SRAM-only memory, we can enable full performance all the way down to AXPY, which is a vector-scalar multiply. This capability is what enables unstructured sparsity acceleration because a sparse GEMM is simply a collection of AXPY operations, with one operation for every non-zero element. This level of memory bandwidth is the first key enabler for unstructured sparsity acceleration.

Figure 5 BLAS levels of linear algebra computation and memory bandwidth requirements. Note that a sparse GEMM comprises one AXPY per non-zero matrix element.
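In NumPy terms, the decomposition of a sparse GEMM into AXPY operations looks roughly like this. It is an illustrative sketch of the access pattern, not the hardware kernel:

```python
import numpy as np

def sparse_gemm_as_axpy(w, x):
    """C = W @ X, performed as one AXPY (y += alpha * x) per non-zero of W."""
    c = np.zeros((w.shape[0], x.shape[1]))
    rows, cols = np.nonzero(w)            # zero weights do no work at all
    for i, j in zip(rows, cols):
        c[i, :] += w[i, j] * x[j, :]      # AXPY: scalar times vector, accumulated
    return c

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8)) * (rng.random((8, 8)) > 0.75)  # ~75% sparse weights
x = rng.normal(size=(8, 4))                                 # dense activations
assert np.allclose(sparse_gemm_as_axpy(w, x), w @ x)        # matches dense GEMM
```

The total work is proportional to the number of non-zeros, but each AXPY streams a full row of activations, which is why memory bandwidth, not arithmetic, becomes the limiting factor.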

Dataflow execution is the other key enabler for accelerating unstructured sparsity. The Cerebras CS-2 uses fine-grained dataflow scheduling where all the computation is triggered by data. Only when the compute cores receive data, does the hardware trigger a lookup of instructions to run. This dataflow mechanism enables native sparsity acceleration because it only performs work on non-zero data. We filter out all the zero data at the sender, so the receiver does not even see it. Only non-zero data is sent, and that triggers all the computation. Not only do we save power by not performing the wasted compute, but we get acceleration by skipping it and moving on to the next useful compute. Since the operations are triggered by single data elements, this supports ultra-fine-grained, fully unstructured sparsity without any performance loss.
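A toy software analogy for this sender-side filtering, purely illustrative since the real mechanism is implemented in hardware:

```python
# Toy simulation of dataflow execution: the sender filters out zeros,
# so the receiver only triggers compute when non-zero data arrives.
def sender(values):
    for idx, v in enumerate(values):
        if v != 0.0:                 # filter zeros at the sender
            yield idx, v             # only non-zero data is sent

def receiver(stream, weights):
    acc = 0.0
    triggered = 0
    for idx, v in stream:            # each arrival triggers a compute step
        acc += weights[idx] * v
        triggered += 1
    return acc, triggered

acts = [0.0, 1.5, 0.0, 0.0, -2.0, 0.0]     # mostly-zero activations
w = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
total, ops = receiver(sender(acts), w)
print(ops, "of", len(acts), "elements triggered compute")  # 2 of 6
```

The receiver never even sees the zeros, so the skipped work turns directly into speedup rather than merely saved power.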

To learn more about the Cerebras hardware architecture and sparsity capabilities, see our detailed blog post here.

Conclusion

As ML models continue to grow at an exponential rate, sparsity acceleration is a key enabler to supporting these models efficiently and sustainably. To achieve this sparsity acceleration, we need both hardware support for sparsity and sparse ML techniques, codesigned together. At Cerebras, we have done just that. The Cerebras CS-2 hardware architecture accelerates unstructured sparsity, automatically. Paired with that hardware, we have developed and demonstrated sparse ML techniques that both accelerate training and reduce the cost of inference significantly on large-scale GPT models, while preserving model accuracy. We are extremely excited by these results and their promise to enable more efficient and larger models, but we believe we are just at the beginning of what sparsity can do.

Sean Lie, Chief Hardware Architect and Co-Founder | November 28, 2022
