Sparsity

In machine learning, sparsity describes the fraction of elements in a matrix or vector that are zero. It measures how “compact” a data structure is and can be applied to arrays with any number of dimensions. Higher sparsity generally lets machine learning algorithms process large datasets more efficiently, because fewer non-zero coefficients need to be stored and estimated, which reduces computational cost. As a result, practitioners often strive for increased sparsity in their models. In addition, certain machine learning techniques exploit sparsity directly by working with a sparse representation of the data; such representations reduce the number of parameters that need to be estimated and can make training or inference easier and more efficient.
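
As a concrete illustration, sparsity is simply the fraction of zero-valued elements in an array. A minimal NumPy sketch (the example matrix is made up for illustration):

```python
import numpy as np

def sparsity(x: np.ndarray) -> float:
    """Fraction of elements in x that are exactly zero."""
    return float(np.count_nonzero(x == 0)) / x.size

# A small example matrix in which 6 of the 8 entries are zero.
w = np.array([[0.0, 0.3,  0.0, 0.0],
              [0.0, 0.0, -1.2, 0.0]])
print(sparsity(w))  # 0.75, i.e. the matrix is 75% sparse
```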

Machine learning models such as GPT have been growing at an exponential rate, resulting in prohibitively high compute, memory, and energy requirements. This growth is unsustainable even with the ongoing impressive advances in traditional hardware design. At Cerebras, we believe the only way to truly solve this problem is by co-designing the hardware and the ML algorithms for sparsity. Fortunately, we already have hardware capable of accelerating sparsity: the Cerebras CS-2 system. What has been missing, until now, are the new sparse machine learning (ML) algorithms that can harness that hardware.

Existing ML techniques already show that models can be very sparse, whether through inherent sparsity or by inducing sparsity. At Cerebras, we are building on these ML foundations and creating sparse ML techniques paired with the Cerebras CS-2 hardware. Even in our initial work, we have shown that high degrees of sparsity can be induced in large-scale GPT models while still preserving model accuracy. While we believe we are just at the beginning of what sparsity can do, these results already show that sparsity can be fundamental to enabling the industry to continue growing in an efficient and sustainable way.
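
To make that distinction concrete, the following NumPy sketch (illustrative only, not Cerebras code) contrasts the two kinds of sparsity: inherent sparsity, such as the zeros a ReLU activation naturally produces, and induced sparsity, here created by magnitude pruning, i.e. zeroing out the smallest-magnitude weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sparsity(x):
    return (x == 0).mean()

# Inherent sparsity: ReLU zeroes out roughly half of a zero-centered random input.
activations = np.maximum(rng.normal(size=(64, 256)), 0.0)
print(f"inherent (ReLU) sparsity: {sparsity(activations):.2f}")   # ~0.50

# Induced sparsity: magnitude pruning zeroes the smallest 80% of the weights.
weights = rng.normal(size=(256, 256))
threshold = np.quantile(np.abs(weights), 0.8)
weights[np.abs(weights) <= threshold] = 0.0
print(f"induced (pruned) sparsity: {sparsity(weights):.2f}")      # ~0.80
```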

With the Cerebras CS-2’s unique ability to run large models easily while accelerating unstructured sparsity, we are enabling sparsity innovation at a scale that was not practical before. Until now, most published sparsity research has been limited to models roughly 10x smaller than the ones we use in our initial work. We are excited to share the results of that work, completed in only a matter of weeks thanks to the Cerebras CS-2, which already show the promise of high sparsity on large-scale GPT models. We demonstrate training acceleration of a large-scale GPT model on the Cerebras CS-2 by pretraining with high sparsity at a fraction of the training FLOPs, then preserving downstream accuracy with dense fine-tuning. We also demonstrate training a large-scale GPT model with iterative pruning on the Cerebras CS-2 to create an extremely sparse model that needs only a fraction of the FLOPs for inference. We are excited by these initial results, but we know they are just the beginning of what sparsity can do on the Cerebras CS-2.
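
For readers unfamiliar with iterative pruning, it generally means alternating training with the gradual removal of low-magnitude weights, rather than pruning everything at once. The sketch below is a generic, simplified illustration of that loop using magnitude pruning on a single weight matrix; the toy training step and sparsity schedule are placeholders, not the actual configuration used on the CS-2.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))      # stand-in for one weight matrix of a large model
mask = np.ones_like(w)               # 1 = keep, 0 = pruned

def train_step(w, mask):
    """Placeholder for a real optimizer update; pruned weights are kept at zero."""
    fake_grad = rng.normal(size=w.shape) * 0.01
    return (w - fake_grad) * mask

# Iterative pruning: alternate training phases with pruning the smallest-magnitude
# weights, gradually ramping the overall sparsity up to a very high level.
for target_sparsity in [0.5, 0.75, 0.9]:
    for _ in range(100):                           # (toy) training phase
        w = train_step(w, mask)
    k = int(target_sparsity * w.size)              # total number of weights to be zero
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = (np.abs(w) > threshold).astype(w.dtype)
    w = w * mask

print(f"final weight sparsity: {(w == 0).mean():.2f}")   # ~0.90
```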