Iterative Pruning

Iterative pruning is a technique for reducing the size and complexity of neural networks. Starting from a fully trained model, it repeatedly removes the parts of the network that contribute least to the final output (typically the smallest-magnitude weights) and fine-tunes the remaining parameters to recover accuracy. This process reduces the compute needed for inference and can improve performance on unseen data by acting as a form of regularization. Iterative pruning also encourages exploration of more efficient neural architectures, since it forces practitioners to identify which elements are actually necessary for a high-performing model. Pruned models can be substantially faster than their dense counterparts at comparable accuracy, making iterative pruning a valuable tool for practitioners seeking capable neural networks under limited computing resources.
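The loop described above can be sketched in a few lines. This is a minimal NumPy illustration of iterative magnitude pruning on a single weight matrix, not the training setup used in the work described below: each round removes a fixed fraction of the smallest-magnitude surviving weights, and in a real pipeline the model would be fine-tuned between rounds. The function name `magnitude_prune` and the 30% per-round fraction are illustrative choices, not from the original text.

```python
import numpy as np

def magnitude_prune(weights, mask, fraction):
    """Zero out the smallest-magnitude fraction of the surviving weights.

    Returns an updated boolean mask; True marks weights that remain.
    """
    alive = np.abs(weights[mask])
    k = int(fraction * alive.size)
    if k == 0:
        return mask
    # The k-th smallest surviving magnitude becomes the pruning threshold.
    threshold = np.sort(alive)[k - 1]
    return mask & (np.abs(weights) > threshold)

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))
mask = np.ones_like(w, dtype=bool)

for step in range(5):
    mask = magnitude_prune(w, mask, fraction=0.3)
    w = w * mask  # apply the mask so pruned weights stay zero
    # ...in practice, fine-tune the surviving weights here to recover accuracy...

sparsity = 1.0 - mask.mean()
```

Removing 30% of the survivors per round compounds: after five rounds roughly 0.7^5 of the weights remain, i.e. the matrix is over 80% sparse, which is how a gradual schedule reaches high sparsity without deleting a large share of the network in one step.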

As language models scale in size, cost, and energy consumption, and become harder to set up and use, they move out of reach for many AI groups looking to train and deploy them. With the Cerebras CS-2 enabling unstructured sparse research and language model training at scale, we have taken our first steps toward exploring sparse models and the impact they could have on industry applications. We successfully trained unstructured sparse 1.3 billion parameter GPT-3 models on Cerebras CS-2 systems and demonstrated that these models achieve competitive results at a fraction of the inference FLOPs: our 83.8% sparse model matched dense performance on the Pile with a 3x reduction in FLOPs. This sets the stage for future work in sparse training, pruning, and inference.