Abstract

In an industry where the exponential growth in the size of large GPT models has resulted in prohibitively high training costs, the ability to reduce the compute needed to train these models is a fundamental enabler. We believe sparsity is key to reducing that compute, and that there are machine learning (ML) techniques to ensure the resulting models have the quality of their dense counterparts. With the Cerebras CS-2’s unique ability to run large models easily while accelerating unstructured sparsity, we are starting to explore these ML techniques at a scale that was not practical before. Until now, most published sparsity research has been limited to models 10x smaller than the ones we use. In our first sparsity study on training full-size models, completed in a matter of weeks thanks to the Cerebras CS-2, we demonstrate that it is possible to pre-train large GPT models with high sparsity and then use dense fine-tuning to preserve accuracy on downstream tasks.

Specifically, we start by using simple, static sparsity and evaluate model sizes up to GPT-3 XL with 1.3B parameters. Our initial results show we can pre-train GPT-3 XL with up to 75% unstructured sparsity and 2.5x fewer training FLOPs on Cerebras CS-2, while using dense fine-tuning to preserve the evaluation metrics on many downstream tasks. To the best of our knowledge, this is the first time a large GPT model has been pre-trained with high sparsity without significant loss in downstream task metrics. These initial findings with static sparsity show the promise of sparse training, and we are motivated to explore more advanced sparse techniques for even further improvement on even larger models, enabled by the Cerebras CS-2.

Introduction

A simple way to increase the quality of a production-grade machine learning model is to increase the size of the training dataset. This approach works well in practice but comes with an increase in the cost of creating a large annotated dataset. To mitigate this issue, multiple approaches have been adopted in the industry: active learning [1], semi-supervised learning [2], and unsupervised learning [3, 4, 5]. In the last couple of years, we have seen most progress with unsupervised learning resulting in GPT model families pre-trained on large unannotated text datasets. GPT models have been shown to outperform state-of-the-art models across multiple tasks [3]. Building on this success, recent works have established the importance of scaling both GPT model size and training data during pre-training [6, 7, 8]. Scaling model size and dataset improves quality but also substantially increases the cost of pre-training the model. For example, the cost of pre-training GPT-3 175B is estimated to be millions of dollars [9].

In our work, we show how pre-training GPT models can be accelerated by the Cerebras CS-2, with its support for unstructured weight sparsity, to reduce the training FLOPs (floating point operations) by up to 2.5x, while retaining the benefits of pre-trained textual representations in large language models.

Recap of GPT Training and Evaluation

In standard GPT training, the model [10] is trained on a large unannotated corpus of text data in an autoregressive manner. Since there are no annotations, autoregressive training involves predicting the next word in the sequence given the preceding words. This part of training is commonly referred to as “pre-training”. After pre-training, the model can be evaluated in three different ways: zero-shot, few-shot, and supervised fine-tuning. Zero-shot and few-shot evaluation are interesting from an academic point of view, but in industry the most common setting is supervised fine-tuning. Here, a single pre-trained model is generally applied to numerous downstream tasks via fine-tuning. More specifically, during fine-tuning, the pre-trained model is further trained on a much smaller, task-specific labeled dataset, where the training objective is to predict the labeled output rather than the next word of unlabeled text. Fine-tuning is commonly used in industry because it adapts the pre-trained model to a target domain and task, producing significantly better accuracy than using the pre-trained model directly in a zero-shot or few-shot setting. For example, GPT can be fine-tuned for domain-specific natural language generation (NLG) [11] or text summarization [8].
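The next-word objective described above can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the training code used in this study: the function name `next_token_loss` and the toy token ids are assumptions, and a real model would produce the logits.

```python
import numpy as np

def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from positions 0..t.

    logits: (T, V) array; row t holds the scores produced after reading tokens 0..t.
    token_ids: length-T integer array holding the input sequence.
    """
    logits = np.asarray(logits, dtype=np.float64)
    token_ids = np.asarray(token_ids)
    preds = logits[:-1]       # predictions made at positions 0..T-2
    targets = token_ids[1:]   # the tokens that actually came next
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = preds - preds.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(targets)), targets].mean())

# A model with no knowledge (all-zero logits) scores log(V) per token.
vocab_size = 8
tokens = np.array([3, 1, 4, 1, 5])
uniform_loss = next_token_loss(np.zeros((len(tokens), vocab_size)), tokens)
```

The labels are just the input shifted by one position, which is why pre-training needs no annotation.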

Recap of Weight Sparsity

Before we go into the details of our work, here is a quick refresher on weight sparsity. We focus on unstructured sparsity, as this form of sparsity has been extensively studied in the literature, has well established benchmarks, and achieves the best trade-off between accuracy and model compression. Training with weight sparsity involves setting certain learnable weights in the layer’s weight matrix to zero (shown in Figure 1). Training with sparse weights allows us to skip the compute FLOPs during the forward and backward pass, giving speedup on hardware that supports accelerating unstructured sparsity, such as Cerebras CS-2 [12].

Figure 1: Applying weight sparsity to a dense neural network by zeroing weights effectively prunes neuron connections within the network.
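As a rough illustration of what Figure 1 depicts, the snippet below zeroes a random 75% of a small weight matrix with a fixed binary mask. The helper `random_static_mask` and the matrix shapes are hypothetical, chosen only for this sketch. NumPy still performs a dense matrix multiply here; the FLOP savings appear only on hardware that skips the zeros.

```python
import numpy as np

def random_static_mask(shape, sparsity, rng):
    """Binary mask with round(sparsity * size) zeros at random positions."""
    size = int(np.prod(shape))
    n_zero = int(round(sparsity * size))
    flat = np.ones(size, dtype=np.float32)
    flat[rng.choice(size, size=n_zero, replace=False)] = 0.0
    return flat.reshape(shape)

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 16)).astype(np.float32)
mask = random_static_mask(weight.shape, sparsity=0.75, rng=rng)
sparse_weight = weight * mask   # pruned connections contribute nothing

x = rng.standard_normal((4, 8)).astype(np.float32)
y = x @ sparse_weight           # dense matmul in NumPy; sparse hardware would skip zeros
density = float(mask.mean())    # fraction of weights that remain trainable
```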

Sparse Pre-training and Dense Fine-tuning

In the common setup of pre-training followed by fine-tuning, pre-training takes significantly longer on a larger dataset than fine-tuning on a substantially smaller dataset. The reason for the imbalance is that pre-training learns the significantly harder task of natural language understanding, while fine-tuning learns the simpler domain-specific task on top of the already established language model. Even with such different goals, the model size and capacity are generally kept the same between pre-training and fine-tuning. In our work, we break this assumption and show the benefits of altering the model capacity between pre-training and fine-tuning with weight sparsity. More specifically, during pre-training, we train a sparse GPT model instead of a dense model. During fine-tuning on downstream tasks, we densify the GPT model, which allows the previously zeroed weights to learn, thereby increasing the model capacity. By pre-training sparse, we leverage the fact that the full general learning capability of the pre-trained model is not always required to perform the simpler downstream fine-tuned task, as shown by the analysis on intrinsic dimensionality of pre-trained representations [13]. Then, by using dense fine-tuning, we can apply the full capacity of the model to the final downstream task.
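The two phases can be sketched on a toy linear model. This is a minimal sketch under stated assumptions, not the GPT training setup: the data is random, `sgd_step` is a hypothetical helper, and the only point is the mechanics of masking updates during the first phase and dropping the mask in the second.

```python
import numpy as np

def sgd_step(weight, x, y, lr, mask=None):
    """One gradient step on mean squared error for the linear model x @ weight."""
    grad = 2.0 * x.T @ (x @ weight - y) / len(x)
    if mask is not None:
        grad = grad * mask          # masked weights receive no update
    return weight - lr * grad

rng = np.random.default_rng(0)
mask = (rng.random((16, 4)) >= 0.75).astype(np.float64)   # roughly 75% zeros, fixed up front
weight = rng.standard_normal((16, 4)) * 0.1 * mask

# Phase 1: sparse "pre-training" with the static mask applied to every update.
x_pre, y_pre = rng.standard_normal((256, 16)), rng.standard_normal((256, 4))
for _ in range(50):
    weight = sgd_step(weight, x_pre, y_pre, lr=0.05, mask=mask)
zeros_kept = bool(np.all(weight[mask == 0] == 0.0))

# Phase 2: dense "fine-tuning"; dropping the mask lets the pruned weights learn.
x_ft, y_ft = rng.standard_normal((32, 16)), rng.standard_normal((32, 4))
for _ in range(50):
    weight = sgd_step(weight, x_ft, y_ft, lr=0.05)
revived = int(np.count_nonzero(weight[mask == 0]))
```

The previously zeroed weights start fine-tuning at exactly zero and become non-zero once the mask is removed, which is the extra capacity the dense phase makes available.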

Our approach has two advantages. First, it reduces computational cost and training time substantially. The computational cost of pre-training large GPT models is commonly on the order of 25x larger than the cost of fine-tuning (see Figure 2). Using sparsity during pre-training leads to a significant training speedup for the entire pipeline on hardware that can accelerate unstructured sparsity, such as Cerebras CS-2. Second, densifying weights during fine-tuning unlocks extra representational capacity, which allows the model to preserve downstream task metrics.

Figure 2: FLOPs spent in pre-training GPT-3 XL on the Pile dataset, followed by fine-tuning on Curation Corpus. FLOPs spent during pre-training dominate the overall FLOPs. Sparse pre-training at 75% sparsity leads to 2.5x reduction in overall FLOPs.

Experiment Setup and Details

We evaluated our setup on two sizes of GPT models: GPT-2 Small (117M) and GPT-3 XL (1.3B). All models are trained on the Pile dataset [14]. To obtain state-of-the-art FLOP-optimal models, we follow the training schedule advocated in Chinchilla [6]. We sparsify weights at initialization using a simple static random mask, where the defined sparsity level is distributed uniformly across layers. We only sparsify weights in projection layers (QKV projections, output attention projections) and feed-forward network layers and do not sparsify other variables such as embeddings or biases. For GPT-3 XL, at 75% sparsity, this sparsity mask reduces the overall training FLOPs by 2.5x compared to dense.
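A 75% sparsity level gives less than a 4x FLOP reduction because only some of the compute is sparsified: the attention score computation and the output logits stay dense. The back-of-the-envelope estimate below illustrates this. It is a sketch, not the accounting used for the reported number. The model shape (24 layers, hidden size 2048, sequence length 2048, vocabulary 50257) is an assumption based on the commonly cited GPT-3 XL configuration and is not stated in this post, and the count covers only forward-pass matrix multiplies per token.

```python
def forward_flops_per_token(n_layers, d_model, seq_len, vocab_size):
    """Rough forward-pass matmul FLOPs per token for a GPT-style decoder.

    Returns (sparsifiable, total). Layer norms, softmax, and biases are ignored.
    """
    # QKV (3 d^2) + attention output (d^2) + FFN with 4x expansion (8 d^2)
    # multiply-accumulates per layer, at 2 FLOPs each.
    linear = n_layers * 2 * 12 * d_model**2
    # Attention scores and the weighted sum over values act on activations, not weights.
    attention = n_layers * 2 * 2 * seq_len * d_model
    # Output logits through the (unsparsified) embedding matrix.
    logits = 2 * d_model * vocab_size
    return linear, linear + attention + logits

def flop_reduction(sparsity, sparsifiable, total):
    """Dense-to-sparse FLOP ratio when only the sparsifiable FLOPs shrink."""
    return total / (total - sparsity * sparsifiable)

# Assumed GPT-3 XL shape (not stated in this post).
sparsifiable, total = forward_flops_per_token(
    n_layers=24, d_model=2048, seq_len=2048, vocab_size=50257
)
fraction = sparsifiable / total                       # share of FLOPs in sparsified layers
reduction = flop_reduction(0.75, sparsifiable, total)
```

Under these assumptions roughly four fifths of the FLOPs sit in the sparsified layers, and the estimated reduction at 75% sparsity comes out close to the 2.5x quoted above. The backward pass scales each term by about the same factor, so the ratio should be similar for full training steps.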

We evaluate fine-tuning GPT models across structured data-to-text natural language generation (E2E [15], WebNLG [16], and DART [17]) and text summarization (Curation Corpus [18]) tasks. Figure 3 shows an example from the E2E dataset depicting the task of describing structured data using natural language. We perform a grid-search across standard fine-tuning hyperparameters: learning rate, epochs, and weight-decay. The evaluation on the test set is done using official scripts. To get our final scores for each task, we conducted multiple training runs for each model and report the average and variation on the test set.

Figure 3: An example from the E2E dataset [15]. This dataset focuses on the task of mapping structured data in the form of a meaning representation (MR) to a natural language sentence.
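The fine-tuning protocol described above (a grid over learning rate, epochs, and weight decay, with several runs per setting) can be outlined as follows. This is a hypothetical sketch: the grid values, the `grid_search` helper, and the `fake_run` scoring function are placeholders and are not the settings or code used in this study. In practice `run_fn` would launch a fine-tuning run and return a validation score.

```python
import itertools
import statistics

def grid_search(run_fn, grid, seeds):
    """Evaluate every hyperparameter combination over several seeds.

    run_fn(config, seed) -> score (higher is better) is supplied by the caller.
    Returns (best_config, mean_score, stdev_score).
    """
    names = sorted(grid)
    results = []
    for values in itertools.product(*(grid[name] for name in names)):
        config = dict(zip(names, values))
        scores = [run_fn(config, seed) for seed in seeds]
        results.append((statistics.mean(scores), statistics.stdev(scores), config))
    mean, stdev, config = max(results, key=lambda r: r[0])
    return config, mean, stdev

# Hypothetical search space; the values are placeholders.
grid = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "epochs": [3, 5],
    "weight_decay": [0.0, 0.1],
}

def fake_run(config, seed):
    # Stand-in for a real fine-tuning run that returns a validation score.
    return 60.0 - abs(config["learning_rate"] - 5e-5) * 1e5 + 0.1 * seed

best_config, best_mean, best_stdev = grid_search(fake_run, grid, seeds=[0, 1, 2])
```

Reporting the mean together with the spread across seeds is what makes it possible to compare dense and sparse models within run-to-run variation.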

Results

In this section, we evaluate our hypotheses across several dimensions and discuss the detailed experimental results. First, the results show that a high degree of sparsity, up to 75% in some cases, can be used during pre-training while downstream task metrics are preserved with dense fine-tuning. Second, the results validate our expectation that the simpler data-to-text NLG tasks can tolerate higher degrees of sparsity during pre-training than the more complex summarization task. Finally, the results show that the larger GPT-3 XL model can be sparsified to a higher degree than the smaller GPT-2 Small model in some cases. We show all fine-tuning results for GPT-2 Small and GPT-3 XL at various pre-training sparsity levels and across various downstream tasks in Table 1 and Table 2.

Figure 4: Evaluation of GPT-2 Small and GPT-3 XL sparse pre-training and dense fine-tuning on downstream tasks E2E (left) and Curation Corpus (right). E2E is evaluated with BLEU score (higher is better) and Curation Corpus is evaluated with perplexity (lower is better).

Hypothesis 1: High degrees of sparsity can be used during pre-training while preserving the downstream accuracy with dense fine-tuning.

Our results indicate that we can pre-train these GPT models with up to 75% sparsity without degradation on the E2E task. As shown in Table 2, at 75% sparsity, the GPT-3 XL model obtains a BLEU score of 67.66±0.59, which matches the BLEU score of the dense model of 68.10±0.46, within the statistical seed variation. Similarly, on WebNLG, we observe the same trend at 50% sparsity for the GPT-3 XL model. Overall, our findings show that these GPT models can be pre-trained with 50%-75% sparsity without losing significant accuracy on these downstream tasks.
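One simple way to read "within the statistical seed variation" is to check whether the two reported intervals overlap. The check below uses the E2E BLEU numbers quoted above. Treating the ± values as symmetric intervals is an assumption made for this sketch, and interval overlap is an informal check rather than a significance test.

```python
def intervals_overlap(mean_a, spread_a, mean_b, spread_b):
    """True if [mean - spread, mean + spread] for a and b share any point."""
    return (
        (mean_a - spread_a) <= (mean_b + spread_b)
        and (mean_b - spread_b) <= (mean_a + spread_a)
    )

# E2E BLEU for GPT-3 XL, as quoted above from Table 2.
dense_mean, dense_spread = 68.10, 0.46
sparse_mean, sparse_spread = 67.66, 0.59

bleu_gap = dense_mean - sparse_mean
within_variation = intervals_overlap(
    dense_mean, dense_spread, sparse_mean, sparse_spread
)
```

The gap between the two means is smaller than either reported spread, so the intervals overlap.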

Hypothesis 2: Simpler downstream tasks can tolerate higher degrees of sparsity during pre-training.

In Figure 4, we compare the GPT models across two downstream fine-tuning tasks: E2E and Curation Corpus. The E2E task focuses on mapping structured data content to a text describing this content. The Curation Corpus task focuses on summarizing the text description. While both tasks involve generating semantically coherent natural language, the summarization task additionally requires understanding long sequences and compressing them without loss of information. On the E2E task, GPT-3 XL can be pre-trained up to 75% sparsity without loss in BLEU score, as discussed previously. In contrast, on Curation Corpus, GPT-3 XL pre-trained at 50% sparsity is worse by 0.83 perplexity. In general, all data-to-text NLG tasks (WebNLG, E2E, and DART) show lower degradation than the more difficult Curation Corpus summarization task at higher levels of sparsity.

Hypothesis 3: Large GPT models are more amenable to higher levels of sparsity during pre-training.

Recent works on sparsity such as the Lottery Ticket Hypothesis [19] have shown that modern deep neural networks are overparameterized. The degree of overparameterization increases with model size, and therefore we expect that a large model should be more amenable to higher levels of sparsification during training. Existing work [20] has shown that the quality of a network trained with random static sparsity (even at high sparsity levels) improves quickly to match its dense counterpart as the network grows wider and deeper. Our results validate this hypothesis for the E2E and Curation Corpus tasks. As shown in Figure 4, on the E2E task, when comparing the dense model to the 75% sparse model, the larger GPT-3 XL model loses 0.44 BLEU compared to a worse 0.98 BLEU loss in the smaller GPT-2 Small model. This trend becomes more evident on the more difficult Curation Corpus task at 75% sparsity, where relative to the dense baseline, the larger GPT-3 XL model has a perplexity delta of +2.75 compared to a worse +3.76 delta observed in the smaller GPT-2 Small model.

However, we found an exception to this trend in the WebNLG and DART datasets, where the larger GPT-3 XL model has a higher degradation at 75% sparsity compared to the smaller GPT-2 Small model. On both datasets, however, GPT-2 Small has a very low degradation at high sparsity, so we hypothesize that the GPT-3 XL result can be improved with better hyperparameter tuning and extended fine-tuning. This is an area we will investigate in our future work.

Table 1: Downstream accuracy of GPT-2 Small across various tasks at different sparsity levels during pre-training. In the metric column, the direction of the arrow indicates the better result (e.g., up indicates higher is better).
Table 2: Downstream accuracy of GPT-3 XL across various tasks at different sparsity levels during pre-training. In the metric column, the direction of the arrow indicates the better result (e.g., up indicates higher is better).

Conclusion

In this work, we introduced Sparse Pre-training and Dense Fine-tuning to reduce the FLOPs of training GPT models using weight sparsity. Multiple studies [21] have previously shown that using sparsity during training leads to a loss in downstream metrics compared to dense training. However, in our work, we showed that sparse pre-training followed by dense fine-tuning on downstream tasks can be on par with the accuracy of a dense pre-trained model on many tasks, while significantly lowering overall training FLOPs. To the best of our knowledge, this is the first time a large GPT model has been pre-trained with high sparsity (50%-75%) without significant loss in downstream task metrics.

In this initial work, we only use simple static sparsity, which is arguably the most naïve way to induce sparsity in neural networks. As next steps, we are investigating several directions for improving our results on even larger models, including dynamic sparsity methods (e.g., RigL [22]), better optimization techniques for sparse training, and architectures amenable to sparse training. Moreover, to limit the computational cost of our first study, we trained our GPT models following the Chinchilla scaling law [6]. Although the Chinchilla pre-training schedule has been shown to be FLOP-optimal for dense models, it is unclear how well it transfers to sparse models especially at extreme sparsity levels. It may be more FLOP-optimal to pre-train sparse models longer to achieve even higher accuracy improvements on downstream tasks. Our future work will also investigate sparse scaling outside of the Chinchilla dense scaling laws.

Even in our initial results, we see the tremendous promise of sparsity to accelerate large GPT training. Enabled by the Cerebras CS-2’s ability to train large models with unstructured sparsity, we plan to explore several directions to improve on our results by using even higher degrees of sparsity on even larger models.

To learn more about how Cerebras CS-2 and sparsity can empower your AI research or to learn more about this study, contact us.

References

[1] Nvidia AI, https://medium.com/nvidia-ai/scalable-active-learning-for-autonomous-driving-a-practical-implementation-and-a-b-test-4d315ed04b5f

[2] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, David Lopez-Paz. mixup: Beyond Empirical Risk Minimization. In ICLR 2018.

[3] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever. Improving Language Understanding by Generative Pre-training. 2018.

[4] Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko. Bootstrap your own latent: A new approach to self-supervised Learning. In NeurIPS 2020.

[5] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, Russ Webb. Learning from Simulated and Unsupervised Images through Adversarial Training. In CVPR 2017.

[6] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre. An empirical analysis of compute-optimal large language model training. In NeurIPS 2022.

[7] Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv 2022.

[8] Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, Eliza Rutherford, et al. Scaling Language Models: Methods, Analysis & Insights from Training Gopher. arXiv 2022.

[9] Lambda Labs. https://lambdalabs.com/blog/demystifying-gpt-3.

[10] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. Attention Is All You Need. In NeurIPS 2017.

[11] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen. LoRA: Low-Rank Adaptation of Large Language Models. In ICLR 2022.

[12] Cerebras Systems. https://www.cerebras.net/blog/cerebras-architecture-deep-dive-first-look-inside-the-hw/sw-co-design-for-deep-learning

[13] Armen Aghajanyan, Luke Zettlemoyer, Sonal Gupta. Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning. In ACL 2021.

[14] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, Shawn Presser, Connor Leahy. The Pile: An 800GB Dataset of Diverse Text for Language Modeling. arXiv 2020.

[15] Jekaterina Novikova, Ondřej Dušek, Verena Rieser. The E2E Dataset: New Challenges for End-to-End Generation. In SIGDIAL 2017.

[16] Claire Gardent, Anastasia Shimorina, Shashi Narayan, Laura Perez-Beltrachini. The WebNLG Challenge: Generating Text from RDF Data. In ACL 2017.

[17] Linyong Nan, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand Sivaprasad, Chiachun Hsieh, Xiangru Tang, Aadit Vyas, Neha Verma, Pranav Krishna, Yangxiaokang Liu, Nadia Irwanto, Jessica Pan, Faiaz Rahman, Ahmad Zaidi, Mutethia Mutuma, Yasin Tarabar, Ankit Gupta, Tao Yu, Yi Chern Tan, Xi Victoria Lin, Caiming Xiong, Richard Socher, Nazneen Fatema Rajani. DART: Open-Domain Structured Data Record to Text Generation. In NAACL 2021.

[18] Curation. Curation corpus base, 2020. https://github.com/CurationCorp/curation-corpus

[19] Jonathan Frankle, Michael Carbin. The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. In ICLR 2019.

[20] Shiwei Liu, Tianlong Chen, Xiaohan Chen, Li Shen, Decebal Constantin Mocanu, Zhangyang Wang, Mykola Pechenizkiy. The Unreasonable Effectiveness of Random Pruning: Return of the Most Naive Baseline for Sparse Training. In ICLR 2022.

[21] Anonymous. Sparsity May Cry: Let Us Fail (Current) Sparse Neural Networks Together! Under review at ICLR 2023.

[22] Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen. Rigging the Lottery: Making All Tickets Winners. In ICML 2020.

Contributors

Vithursan Thangarasa led the productization effort, evaluated the technique in different FLOP efficient training setups, and brought up multiple downstream tasks.
Abhay Gupta helped with the bring-up of reference models to validate our training, evaluation, and fine-tuning setup.
William Marshall brought up various downstream tasks and ran the majority of experiments on CS-2 involving fine-tuning large GPT models.
Tianda Li helped with running fine-tuning experiments.
Kevin Leong ran most of the pre-training on CS-2 and provided crucial help in debugging issues.
Dennis DeCoste proposed the original idea.
Sean Lie coordinated the bring-up of GPT on CS-2 and was involved in experimental validation and analysis.
Shreyas Saxena advised the entire effort, experimented with different sparsity techniques, and brought up the initial proof of concept.
Vithursan, Sean, and Shreyas summarized the insights and contributed to the writing.

The contributors would like to thank Rob Schreiber for his review and guidance in the writing of this post, and to thank the Cerebras software and hardware teams who have made large language model training and sparse research a reality on the CS-2.

November 28, 2022
