Most published sparsity techniques have focused on using sparsity to reduce model compute. It is also possible to use sparsity to improve model accuracy, but that has been relatively less studied. In this article, I’ll introduce you to a simple-to-use sparse technique we have developed called Sparse Iso-FLOP Transformation, or Sparse-IFT, which can increase accuracy on computer vision models (e.g., ResNet, MobileNet) by up to 3.5% and improve perplexity on language models (e.g., GPT) by up to 0.4, all without significantly changing training FLOPs (floating point operations).

The Cerebras architecture is uniquely designed to support unstructured sparsity, making all sparse techniques easy to use and efficient to run. The highly effective Sparse-IFT technique, combined with Cerebras’ unique sparsity acceleration, enables our users to achieve model capabilities not possible otherwise.

You can download our paper, “Sparse-IFT: Sparse Iso-FLOP Transformations for Maximizing Training Efficiency,” here.

Introduction

Achieving higher accuracy generally requires more compute, most commonly by increasing the model and dataset size. This link between compute and accuracy gave rise to the modern deep learning era, resulting in an unprecedented demand for compute over the past decade. However, this trend is not sustainable because even today’s models are already prohibitively expensive to train in terms of time, cost, and energy.

At Cerebras, we believe the way forward is to break the link between accuracy and compute, by training with sparsity. Sparsity is already often used in inference, but two factors have hindered widespread use in training: (1) a lack of hardware that can train with sparsity and (2) a lack of accessible systematic ML techniques to improve sparse training.

At Cerebras, we are targeting both issues with hardware-ML co-design. The Cerebras CS-2 system is explicitly designed to accelerate training using sparsity. In fact, it is the only hardware architecture capable of accelerating unstructured sparse training at scale today. And, on the ML front, in this blog, we are introducing a simple-to-use sparsity technique that can increase model accuracy without significantly changing compute.

Figure 1. Traditional sparsity techniques primarily aim to reduce compute for a given accuracy, moving point A to point B. Sparse-IFT increases accuracy for the same compute budget, moving from point A to point C.

Most traditional sparsity techniques primarily aim to reduce model compute for a given accuracy, as shown in Figure 1. Sparsity can also be used to improve model accuracy, as shown by point C in Figure 1, but that path has been relatively little studied. At Cerebras, building on existing sparsity techniques as a foundation, we have developed a sparse transformation called the Sparse Iso-FLOP Transformation (Sparse-IFT) that can be easily applied to any existing model. As detailed in our paper, Sparse-IFT enlarges the layers of an existing dense model and makes each enlarged layer sparse, preserving the compute requirement of the original dense layers. The term “Iso-FLOP” refers to requiring the same amount of compute, measured in FLOPs, to train each layer. For example, we can create a layer with 10x more weights, but at 90% sparsity it requires the same FLOPs to train as the original dense layer (i.e., it remains Iso-FLOP). By increasing the model size, Sparse-IFT increases the model’s representational capacity which, in turn, increases accuracy. And by using sparsity, it avoids the significantly higher compute requirement traditionally associated with larger models.
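As a quick sanity check of that Iso-FLOP arithmetic (a minimal sketch in plain Python; the layer size is illustrative, not taken from the paper), counting non-zero weights, which is what FLOPs scale with for a fully-connected layer, shows the 10x larger, 90% sparse layer costs the same to train:

```python
# Back-of-the-envelope Iso-FLOP check for one fully-connected layer.
# Training FLOPs for a matrix multiply scale with the number of non-zero
# weights, so non-zero weight counts serve as a proxy for compute.

dense_weights = 1024 * 1024            # original dense layer (illustrative size)

scale = 10                             # make the layer 10x larger ...
sparsity = 0.90                        # ... but keep only 10% of the weights non-zero

sparse_weights_total = dense_weights * scale
sparse_weights_nonzero = round(sparse_weights_total * (1 - sparsity))

print(f"dense non-zero weights:       {dense_weights:,}")
print(f"sparse-wide non-zero weights: {sparse_weights_nonzero:,}")  # equal -> Iso-FLOP
```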

This technique increases model accuracy across a wide set of use cases in both the natural language and computer vision domains. Sparse-IFT has been shown to increase accuracy on computer vision models (e.g., ResNet, MobileNet) by up to 3.5% and improve perplexity on language models (e.g., GPT) by up to 0.4, all without significantly changing compute requirements.

Sparse-IFT is also easy to use. It’s a simple drop-in replacement for existing dense layers, can be used without extensive hyper-parameter tuning, and is easily accessible through the Cerebras push-button sparsity software interface.

Why Sparsity?

Neural network models are made up of layers of neurons and connections between them. The number of layers, the number of neurons per layer, and the pattern of connections together make up the structure of the model. The connections are represented as weights in a collection of matrices, one for each layer of the model. When there are missing connections, represented as zeros in the weight matrices, we refer to the model as sparse.

Zero-valued weights mathematically require no computation during training or inference to achieve the correct result, which makes sparsity an incredibly powerful tool. However, only hardware that can translate sparsity into acceleration, such as the Cerebras CS-2 system, can take advantage of the lower compute requirement. To learn more about the CS-2 hardware design for sparsity, see the Cerebras Architecture Deep Dive.
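To make that concrete, here is a toy illustration in plain Python/NumPy (a sketch of the arithmetic only, not of how the CS-2 hardware actually skips zeros): a matrix-vector product only ever needs to touch the non-zero weights to produce exactly the same result.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small weight matrix with ~90% of its entries zeroed out (unstructured sparsity).
weights = rng.standard_normal((8, 8))
weights[rng.random((8, 8)) < 0.9] = 0.0

x = rng.standard_normal(8)

# Dense evaluation touches every weight, zeros included.
y_dense = weights @ x

# Sparse evaluation only touches the non-zero weights, yet gives the same answer.
rows, cols = np.nonzero(weights)
y_sparse = np.zeros(8)
for r, c in zip(rows, cols):
    y_sparse[r] += weights[r, c] * x[c]

print(np.allclose(y_dense, y_sparse))                          # True: identical result
print(f"multiplies performed: {len(rows)} of {weights.size}")  # only ~10% of the work
```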

Until now, most published sparsity techniques have focused on using sparsity to reduce model compute. The alternative goal of increasing model accuracy is equally, if not more, valuable but has been relatively less studied. This may be because increasing accuracy generally requires running larger models, sometimes significantly larger (as with Sparse-IFT), yet most sparsity studies in the community are conducted on GPUs, which are inherently incapable of accelerating unstructured sparsity and therefore significantly limit the scale of sparse experimentation.

Thus, using a technique like Sparse-IFT, which requires significantly increasing model size, is prohibitively costly on GPUs, where even sparse models are executed densely. At Cerebras, however, the CS-2 hardware and software are co-designed to natively support sparsity, making such sparse techniques easy to use and efficient to run. The highly effective Sparse-IFT technique, combined with Cerebras’ unique sparsity acceleration, enables our users to achieve model capabilities not possible otherwise.

All popular ML models are dense by design, but even fully dense models can be made sparse by inducing weight sparsity: certain weights are set to zero, which effectively prunes the connections within the model, as shown in Figure 2. When the pruning is done without a fixed regular pattern, we refer to this as unstructured sparsity.

Figure 2.   Weight sparsity applied to a dense neural network by zeroing weights has the effect of pruning the connections between neurons in the model.

Inducing unstructured sparsity is quite natural because not all weights are equally important. In fact, training a model ultimately amounts to discovering which weights matter more than others. When a dense model is fully trained, many of the resulting weights are very small in magnitude (approximately zero) and therefore less important than the high-magnitude weights. The nature of training itself thus suggests that unstructured sparsity can be applied even to an otherwise dense model while preserving the original model characteristics.
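As a simple illustration of this idea (a sketch only, not the Sparse-IFT training recipe itself), magnitude-based pruning zeroes out the smallest-magnitude weights of a trained dense layer while keeping the important, high-magnitude ones:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend this is one weight matrix of a fully trained dense layer.
weights = rng.standard_normal((64, 64))

sparsity = 0.9  # zero out the 90% of weights with the smallest magnitude

threshold = np.quantile(np.abs(weights), sparsity)
pruned = np.where(np.abs(weights) >= threshold, weights, 0.0)

kept = np.count_nonzero(pruned) / pruned.size
print(f"fraction of weights kept: {kept:.2f}")  # ~0.10: only high-magnitude weights survive
```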

This is the key property of unstructured sparsity that enables it to be applied to any model: it reduces the model compute without the need to create a new model architecture. To see examples of how unstructured sparsity can be used to reduce compute on otherwise dense GPT models, see our blog on accelerating large GPT models with sparsity. Unstructured sparsity is also a key building block for the different members of the Sparse-IFT family, each of which can be easily applied to any model as a drop-in replacement for dense layers without changing any other architectural aspects of the model.

Sparse Iso-FLOP Transformation

The Sparse-IFT transformation performs two operations: (1) increasing model size and (2) inducing sparsity. By performing both operations simultaneously, Sparse-IFT increases the model’s representational capacity while keeping the FLOPs close to the original. Sparse-IFT is also very easy to use since there is only a single hyper-parameter to choose: the sparsity level. Given a sparsity level, each model layer is enlarged proportionally to keep the layer’s FLOPs constant.

Figure 3.   Sparse Iso-FLOP Transformation of a layer’s fully-connected matrix into a larger sparse matrix with the same number of non-zero parameters, and therefore the same FLOPs. The Sparse-Wide transform is shown.

The Sparse-IFT family expresses multiple transformations, the simplest of which is the Sparse-Wide transformation illustrated in Figure 3. In the Sparse-Wide transform, every layer of the model is widened and sparsified proportionally. While each transformed layer remains iso-FLOP, Sparse-IFT does not touch the input/output layers (to meet external constraints) or the attention operations (which have no weights of their own). Because these untouched parts grow with the widened hidden dimension while staying dense, the full model can end up with slightly more FLOPs than the original, but the difference is not significant since they make up a relatively small portion of the full model.
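To make the transform concrete, here is a minimal PyTorch-style sketch of Sparse-Wide applied to a single fully-connected layer (an illustration under simplifying assumptions, not the Cerebras implementation): for a target sparsity s, widening both dimensions by roughly 1/sqrt(1 − s) keeps the number of non-zero weights, and hence the FLOPs, approximately equal to the dense original.

```python
import math
import torch
import torch.nn as nn

def sparse_wide_linear(in_features: int, out_features: int, sparsity: float = 0.9):
    """Illustrative Sparse-Wide sketch for one fully-connected layer (not the
    Cerebras implementation). In the full transform, all hidden dimensions of
    the model are widened together; this just shows the iso-FLOP bookkeeping."""
    k = 1.0 / math.sqrt(1.0 - sparsity)          # widen both dims by k, so k^2 * (1 - s) = 1
    wide_in, wide_out = round(in_features * k), round(out_features * k)

    layer = nn.Linear(wide_in, wide_out)
    # Random unstructured mask, fixed at initialization for simplicity.
    mask = (torch.rand(wide_out, wide_in) >= sparsity).float()
    layer.weight.data.mul_(mask)                 # zero out the pruned connections

    print(f"dense weights: {in_features * out_features:,}   "
          f"sparse-wide non-zero weights: {int(mask.sum()):,}")  # roughly equal -> iso-FLOP
    return layer, mask

layer, mask = sparse_wide_linear(768, 768, sparsity=0.9)
```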

This simple Sparse-Wide transformation has been shown to be robust across different model types in both the natural language processing (NLP) and computer vision (CV) domains. For example, ResNet-18 transformed with 90% sparsity improves accuracy by 3.5% on ImageNet. That may sound like a modest improvement, but as shown in Figure 4, the resulting accuracy exceeds that of the larger, traditionally scaled dense model (ResNet-34), which requires 2x more training FLOPs, and it translates into much better real-world image processing quality.

Similarly, GPT-3 Small transformed with 50% sparsity improves perplexity by a significant 0.4 on WikiText-103. As shown in Figure 4, this also surpasses the larger, traditionally scaled dense model (GPT-3 Medium), which requires 2.4x more training FLOPs. In both cases, the training FLOPs remain essentially unchanged compared to the original models.

Figure 4. On the left, using Sparse-IFT, we increase accuracy by up to 3.5% on ImageNet for ResNet-18 and ResNet-34. ResNet-18 with Sparse-IFT outperforms the baseline dense ResNet-34 that has 2x more FLOPs. On the right, we improve perplexity by 0.4 on WikiText-103 for GPT-3 Small, outperforming the baseline dense GPT-3 Medium that has 2.4x more FLOPs.

Traditionally, achieving such accuracy gains would require larger models with much higher compute and/or significant ML engineering to change the model architecture. With Sparse-IFT, however, these gains were obtained by simply applying the transformation to the existing model, without significantly increasing compute or retuning the model and its hyperparameters.

To learn more about our results as well as other transformations in the Sparse-IFT family, see our paper.

Push-Button Software for Sparsity

To access the power of Sparse-IFT and other sparse techniques, we need both hardware capable of sparsity acceleration and software with an easy interface for using that sparsity. A good user interface is generally important, but even more so with sparsity because most existing software packages are optimized for dense computation.

At Cerebras, however, we build our software, just like our hardware, to support sparsity from the ground up. As a result, using sparsity does not require changing complex code deep in the framework, nor does it even require changing the model code itself. Building on the Cerebras software stack’s existing easy interface, any PyTorch model in the Cerebras Model Zoo can be made sparse with just a few lines changed in the configuration file, as shown in Figure 5.

Figure 5.   Software configuration interface for model and sparsity in the Cerebras software stack. This example shows the minimal changes needed to enable Sparse-IFT.

Since sparsity is natively supported in the underlying hardware and the entire software stack, it is natural to expose it to the user as a simple “knob”. The user just specifies the sparsity level, the type of sparsity, and which layers to sparsify. The model size is also defined in the same configuration file, so when pairing the sparsity change with a proportional change to the model’s hidden dimension, we have Sparse-IFT. In fact, the example in Figure 5 is the Sparse-IFT configuration used for the GPT-3 result mentioned previously and in our paper. This simple software interface makes it trivially easy to get the benefits of Sparse-IFT and other sparsity techniques.
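Figure 5 shows the actual configuration interface. Purely as an illustration of the idea, the sketch below (hypothetical keys and values, not the exact Cerebras Model Zoo schema) pairs a 50% sparsity setting with a proportionally widened hidden dimension for a GPT-3 Small-like model:

```python
import math

# Hypothetical configuration sketch (illustrative keys and values only; the real
# interface is the configuration file shown in Figure 5).

sparsity = 0.5                   # the single Sparse-IFT knob
base_hidden = 768                # GPT-3 Small hidden dimension

config = {
    "model": {
        # Widen the hidden dimension by ~1/sqrt(1 - s) to stay roughly iso-FLOP.
        "hidden_size": round(base_hidden / math.sqrt(1.0 - sparsity)),
    },
    "sparsity": {
        "sparsity_level": sparsity,        # how sparse
        "sparsity_type": "unstructured",   # what kind of sparsity
        "layers": ["projection", "ffn"],   # which layers to sparsify (hypothetical names)
    },
}
print(config)
```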

Conclusion

The current trend of using ever-increasing compute in the pursuit of more accurate models is unsustainable. Sparsity is an incredibly powerful tool that can break this trend. By using sparsity through the Sparse-IFT Sparse Iso-FLOP Transformation, we have an easy-to-use ML technique that increases model accuracy without significantly changing compute.

When paired with the unique hardware sparsity acceleration of the Cerebras CS-2 and its push-button sparsity software interface, Sparse-IFT and other sparsity techniques provide new capabilities not possible otherwise. At Cerebras, we believe it is through this level of hardware-ML co-design that we can break free from current compute limitations and scale to better, more accurate models of the future.

Sean Lie, Chief Hardware Architect and Co-Founder | March 22, 2023

Dive Deeper