Accelerating Large Language Model Training with Variable Sparse Pre-training and Dense Fine-tuning

Abstract

The growing commercialization of large language models for myriad tasks such as text generation and retrieval-augmented search is leading to an exponential growth in training new language models. As model and dataset sizes scale, the ability to reduce the prohibitive costs of training is a fundamental enabler. At Cerebras, we believe unstructured sparsity is the answer for lowering the compute for training foundation models. With the Cerebras CS-2’s unique ability to accelerate large language models with unstructured sparsity, co-designed with our in-house sparse training algorithms, we can enable models at an unprecedented scale at a fraction of the cost of dense training.

This work extends our SPDF: Sparse Pre-training and Dense Fine-tuning framework to incorporate better task-agnostic representations using Variable Sparsity. Our new framework, Variable SPDF, can be incorporated into any existing GPT model development lifecycle, agnostic of model architecture, datasets, and training setups. We show that by applying the Variable SPDF framework, we can match the downstream metrics of a 6.7 billion parameter, dense GPT model while using 64% fewer FLOPs for pre-training.

Introduction

Large Language Models (LLMs) are becoming ubiquitous and powering a new wave of AI assistants such as ChatGPT, Claude, and Pi. Training these models involves a long, compute-intensive pre-training phase, where an LLM is trained on mixtures of large data corpora, followed by fine-tuning either for specific tasks such as structured language generation [1] or for multiple tasks by training to follow instructions [2, 3]. While recent advances in hardware such as the Cerebras Wafer-Scale Cluster and H100 DGX enable rapid pre-training of these models, the high compute, memory, and energy requirements are still prohibitive for most organizations. Recent open-source models, such as Falcon [4] and Cerebras-GPT [5], are seeing large-scale adoption for downstream fine-tuning. However, a user has no control over how these models are pre-trained. Pre-training models from scratch provides several benefits, such as data privacy (e.g., BloombergGPT [6]), domain customization (e.g., BioMedLM [7]), and fine-grained control over model updates.

For GPT pre-training, scaling models [16] and datasets [17] can predictably improve performance for a wide range of downstream tasks. However, naïve scaling for pre-training leads to an exponential increase in compute requirements, necessitating efficient scaling. With this objective, in this work, we explore unstructured weight sparsity to reduce the compute needs of pre-training. Training with weight sparsity involves masking certain learnable weights in the layer’s weight matrix, as shown in Figure 1. As explained in an earlier blog post, training with sparse weights allows us to skip floating point operations (FLOPs), i.e., compute during the forward and backward pass, giving speedup on hardware that supports accelerating unstructured sparsity, such as the Cerebras CS-2 system.

Figure 1: Applying weight sparsity to a dense neural network by masking weights effectively prunes neuron connections within the network. The light blue connections in the sparse network indicate the masked weights.
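
As a rough illustration of the masking idea in Figure 1, the NumPy sketch below applies a fixed binary mask to a layer's weight matrix in the forward pass and to the weight gradient in the backward pass. The function names and shapes are hypothetical, chosen only for this example. Note that this dense emulation does not skip any FLOPs by itself; the savings appear only on hardware that can exploit the zeros.

```python
import numpy as np

def make_random_mask(shape, sparsity, rng):
    """Return a binary mask with round(sparsity * size) entries set to 0."""
    num_weights = int(np.prod(shape))
    num_masked = int(round(sparsity * num_weights))
    flat = np.ones(num_weights, dtype=np.float32)
    # Choose the masked positions without replacement so the count is exact.
    masked_idx = rng.choice(num_weights, size=num_masked, replace=False)
    flat[masked_idx] = 0.0
    return flat.reshape(shape)

def sparse_linear_forward(x, weight, mask):
    """Forward pass y = x @ (W * M): masked weights contribute nothing."""
    return x @ (weight * mask)

def sparse_linear_backward(x, grad_y, weight, mask):
    """Backward pass: masked weights receive zero gradient and stay masked."""
    grad_w = (x.T @ grad_y) * mask
    grad_x = grad_y @ (weight * mask).T
    return grad_x, grad_w

rng = np.random.default_rng(0)
weight = rng.standard_normal((8, 4)).astype(np.float32)
mask = make_random_mask(weight.shape, sparsity=0.75, rng=rng)
x = rng.standard_normal((2, 8)).astype(np.float32)
y = sparse_linear_forward(x, weight, mask)
grad_x, grad_w = sparse_linear_backward(x, np.ones_like(y), weight, mask)
```

Because the same mask multiplies both the weights and their gradients, a masked connection never influences the output and never gets updated, which is what makes the sparsity static.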

Previously, we introduced Sparse Pre-training and Dense Fine-tuning (SPDF) [8], which is a simple way to accelerate end-to-end wall-clock training times for GPT [9, 10] models. In that work, we used weight sparsity to pre-train models, reducing the compute and memory requirements. We followed it up with dense fine-tuning to match the accuracy of fully dense baselines on downstream tasks. However, there were two limitations of that work: (1) we restricted densification to the fine-tuning phase, which might not be sufficient to recover accuracy on complex downstream tasks, and (2) our study in [8] was limited to models at a smaller scale (up to 1.3B parameters). Our current study addresses these limitations. First, we introduce densification during the pre-training phase, which we refer to as Variable Sparse Pre-training, to improve the model quality during pre-training. Next, we scale Variable Sparse Pre-training to larger models (6.7B parameters) and show that we can match the downstream metrics of the dense baseline while requiring 64% fewer pre-training FLOPs.

In the rest of the blog, we will discuss the idea behind Variable Sparse Pre-training and Dense Fine-tuning (Variable SPDF) and show the simplicity of enabling this method using the Cerebras Software Platform (CSoft). We will follow this up with experimental results highlighting how Variable SPDF improves over SPDF and can accelerate GPT training lifecycles end-to-end.

Variable Sparse Pre-training and Dense Fine-tuning

In this section, we first recap the key idea of SPDF and then extend it to the new Variable SPDF framework. In the SPDF setup, we first pre-train an unstructured sparse GPT model and then densify the model during fine-tuning, allowing the previously masked weights to learn, thereby increasing model capacity. Figure 2 shows the setup of SPDF [8]. By pre-training sparse, we leverage the fact that the full general learning capability of the pre-trained model is not always required to perform the simpler downstream fine-tuning task, as shown by the analysis of the intrinsic dimensionality of pre-trained representations [11]. Dense fine-tuning then makes the full capacity of the model available for the final downstream task.

Figure 2: Sparse Pre-training and Dense Fine-tuning (SPDF) framework. In this framework, we pre-train a sparse GPT model followed by dense fine-tuning (green connections indicate newly activated weights) on downstream tasks.

Variable Sparse Pre-training builds on a similar sparse-to-dense framework, but the transition to dense happens during pre-training. Figure 3 depicts this pre-training framework, which splits the standard pre-training into two stages. In the first stage, Stage-1, the model is trained using unstructured sparsity. This is followed by an adaptation phase called Stage-2, where the model is densified, allowing the masked weights to learn. The choice of the sparsity level and the duration of pre-training for Stage-1 depends on the desired reduction in training time and the available compute resources. A longer Stage-1 at a higher sparsity level lowers the total pre-training FLOPs and therefore shortens overall training time.

Figure 3: Variable Sparse Pre-training and Dense Fine-tuning framework. In this framework, we break the pre-training phase into two stages, characterized by an initial sparse phase, followed by a dense pre-training phase (green connections indicate newly activated weights). We follow pre-training with a dense fine-tuning phase for downstream tasks.

The Variable SPDF approach has two advantages over SPDF. First, densification during pre-training will help the model learn generic task-agnostic representations, allowing for improved performance and accuracy on different downstream tasks. Second, by regulating both the sparsity level in Stage-1 and the duration of both pre-training stages, one gets fine-grained control over the total FLOPs during pre-training. In SPDF, we can only set the sparsity level for pre-training, which decides the FLOPs saved over purely dense training.
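
To make this FLOP trade-off concrete, the sketch below estimates pre-training FLOPs relative to a dense run of the same length. It rests on simplifying assumptions that are ours for illustration and not taken from the experiments: FLOPs in sparsified layers scale linearly with weight density, Stage-2 is fully dense, "effective sparsity" is the step-weighted average sparsity over both stages, and only a fixed share of the model's FLOPs (`sparsifiable_share`, a hypothetical parameter) sits in sparsified layers.

```python
def stage1_fraction_for_target(stage1_sparsity, effective_sparsity):
    """Fraction of pre-training steps to spend in sparse Stage-1.

    Assumes Stage-2 is fully dense (sparsity 0), so the step-weighted
    average sparsity equals stage1_fraction * stage1_sparsity.
    """
    if not 0.0 < stage1_sparsity <= 1.0:
        raise ValueError("stage1_sparsity must be in (0, 1]")
    if not 0.0 <= effective_sparsity <= stage1_sparsity:
        raise ValueError("effective_sparsity cannot exceed stage1_sparsity")
    return effective_sparsity / stage1_sparsity

def relative_pretraining_flops(stage1_sparsity, stage1_fraction,
                               sparsifiable_share=1.0):
    """Pre-training FLOPs as a fraction of a dense run of the same length."""
    # Stage-1: sparsified layers cost (1 - sparsity); the rest stays dense.
    stage1_cost = (sparsifiable_share * (1.0 - stage1_sparsity)
                   + (1.0 - sparsifiable_share))
    # Stage-2: everything is dense, so its relative cost is 1.
    return stage1_fraction * stage1_cost + (1.0 - stage1_fraction)

frac = stage1_fraction_for_target(stage1_sparsity=0.90, effective_sparsity=0.75)
flops = relative_pretraining_flops(0.90, frac)
```

Under these assumptions, 90% Stage-1 sparsity with a 75% effective target puts roughly 83% of the steps in Stage-1, and the relative cost is 0.25 when every FLOP is sparsifiable. Lowering `sparsifiable_share` below 1 raises the relative cost, which is one plausible reason a measured FLOP saving can be smaller than the effective weight sparsity.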

Experiments and Results

This section demonstrates the advantages of the Variable SPDF framework applied to large GPT models. First, we show how using Variable SPDF can achieve better metrics for downstream tasks compared to SPDF for the same reduction in pre-training FLOPs. Next, we apply the Variable SPDF framework to larger models (6.7B parameters) and show no loss in downstream metrics compared to dense training while using 64% fewer pre-training FLOPs.

We evaluate both frameworks (SPDF and Variable SPDF) for model configurations released in Cerebras-GPT [5], pre-trained on the Pile dataset [12]. We follow this by evaluating fine-tuning on structured language generation (DART [13]). For pre-training, we sparsify weights at initialization using a simple static random mask, where the defined sparsity level is distributed uniformly across layers. We only sparsify weights in projection layers (QKV projections, output attention projections) and feed-forward network layers. We do not sparsify other variables such as embeddings, normalizations, or biases. For fine-tuning, we perform a grid search across standard fine-tuning hyperparameters: learning rate and number of epochs. Evaluation on the test set uses the official scripts. We conduct multiple training runs for each model and report the mean and standard deviation of the test-set scores as our final result for each task.
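
The masking policy described above can be sketched as follows. The parameter names and the substring-based selection rule are hypothetical and serve only to illustrate applying one uniform sparsity level to projection and feed-forward weights while leaving embeddings, normalizations, and biases dense. This is not the CSoft implementation.

```python
import numpy as np

# Hypothetical name fragments identifying the layers we sparsify.
SPARSIFIED_KEYS = ("attn.qkv", "attn.out_proj", "ffn.")

def should_sparsify(name):
    """Select only weight tensors of projection and feed-forward layers."""
    return name.endswith(".weight") and any(key in name for key in SPARSIFIED_KEYS)

def build_static_masks(params, sparsity, seed=0):
    """Create one fixed random mask per selected tensor, same sparsity for each."""
    rng = np.random.default_rng(seed)
    masks = {}
    for name, tensor in params.items():
        if not should_sparsify(name):
            continue  # embeddings, normalizations, and biases stay dense
        num_masked = int(round(sparsity * tensor.size))
        flat = np.ones(tensor.size, dtype=bool)
        flat[rng.choice(tensor.size, size=num_masked, replace=False)] = False
        masks[name] = flat.reshape(tensor.shape)
    return masks

params = {
    "embedding.weight": np.zeros((100, 16)),
    "block0.ln.weight": np.ones(16),
    "block0.attn.qkv.weight": np.zeros((16, 48)),
    "block0.attn.qkv.bias": np.zeros(48),
    "block0.attn.out_proj.weight": np.zeros((16, 16)),
    "block0.ffn.fc1.weight": np.zeros((16, 64)),
    "block0.ffn.fc2.weight": np.zeros((64, 16)),
}
masks = build_static_masks(params, sparsity=0.75)
```

The masks are drawn once at initialization and never change, which matches the static random masking used here; `True` marks a weight that remains trainable.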

Figure 4: An example from the DART dataset [13]. This dataset focuses on the task of mapping structured data in the form of DAta-Records (stored as modified triplesets) to natural language Text Generation.

Variable SPDF improves over SPDF for Better Downstream Performance

In this section, we demonstrate how Variable SPDF improves over the SPDF framework for GPT training. We train a 256M GPT model using both frameworks. For the SPDF setup, we pre-train with 75% static unstructured sparsity. For the Variable SPDF setup, we pre-train with 90% unstructured sparsity during Stage-1 and follow it up with dense pre-training (Stage-2) such that the effective weight sparsity after pre-training is 75%. We tune the learning rate during pre-training in both setups for improved convergence.

We evaluate the pre-trained models on zero-shot evaluation using the Eleuther eval harness [18]. Table 1 compares the average accuracy across seven tasks (reported in Cerebras-GPT [5]) for the pre-trained models. We see that at the same effective weight sparsity (75%), the model pre-trained with variable sparsity outperforms the model trained with static sparsity by 0.9 points. We follow pre-training with dense fine-tuning for both frameworks and report the downstream metrics (BLEU score) on the DART language generation task. Table 2 shows that the improvement in model quality from pre-training with variable sparsity translates to fine-tuning, with a 0.3 gain in BLEU score over pre-training with static sparsity. At the same effective weight sparsity, pre-training with variable sparsity enables the model to learn better task-agnostic representations, which leads to better performance on downstream tasks.

Zero-Shot Evaluation of Pre-trained Model using Eleuther Harness
Metric	Sparse Pre-training (75% sparsity)	Variable Sparse Pre-training (effective 75% sparsity)
Accuracy (↑)	32.7	33.6

Table 1: Comparison of Sparse Pre-training and Variable Sparse Pre-training using the Eleuther eval harness with a 256M GPT model. We report the average zero-shot accuracy (higher is better, indicated by an up arrow) across seven tasks from Cerebras-GPT [5]. Pre-training with variable sparsity outperforms pre-training with static sparsity by 0.9 points. A detailed breakdown of the per-task metrics is available in Appendix A.

Evaluation of Fine-tuned Model on DART
Metric	SPDF (75% sparsity)	Variable SPDF (effective 75% sparsity)
BLEU (↑)	45.3 ± 0.07	45.6 ± 0.04

Table 2: Fine-tuning results on DART for the 256M GPT model using the SPDF and Variable SPDF frameworks at the same effective weight sparsity (75%). For DART, we report the BLEU score (higher is better, indicated by an up arrow). After fine-tuning on DART, the model pre-trained with variable sparsity outperforms the model pre-trained with static sparsity.

Scaling Variable SPDF to 6.7B parameters

Results presented in the previous section help validate our hypothesis, i.e., GPT models trained with the Variable SPDF framework can learn better task-agnostic features than those trained with the SPDF framework. In this section, we extend the Variable SPDF framework to a larger scale (6.7B parameters) and compare it against a dense training setup. In this experiment, we evaluate whether the Variable SPDF framework can deliver a significant reduction in pre-training FLOPs while matching the downstream metrics of a dense model.

We follow the Variable SPDF setup from the previous experiment and target 75% effective weight sparsity. After pre-training, we fine-tune the model on the DART language generation task. In Table 3 and Figure 5, we report the downstream metrics for the two models and the FLOPs spent in pre-training on the Pile dataset, respectively. Our results demonstrate that the 6.7B GPT model trained with the Variable SPDF framework requires 64% fewer pre-training FLOPs and can match the downstream metrics of the dense baseline.

Evaluation of Fine-tuned Model on DART
Metric	Dense 6.7B	Variable SPDF 6.7B (effective 75% sparsity)
BLEU (↑)	49.3 ± 0.1	49.1 ± 0.1

Table 3: Fine-tuning results on DART for the 6.7B Cerebras-GPT model using the Variable SPDF framework at 75% effective weight sparsity. For DART, we report the BLEU metric (higher is better, indicated by an up arrow). We can see that the model pre-trained with variable sparsity matches the downstream metrics of the dense pre-trained model.

Figure 5: FLOPs spent on pre-training Cerebras-GPT 6.7B on the Pile dataset. Variable Sparse Pre-training at effective 75% sparsity uses 64% fewer FLOPs than the dense pre-training setup. The Dense and Variable SPDF pre-trained models are fine-tuned on DART for the same number of tokens, utilizing the same FLOPs. Note that we do not show the fine-tuning FLOPs in this graph since they are less than 0.05% of the total pre-training FLOPs.

Push Button Software for Variable SPDF Training

The Cerebras Software Platform (CSoft) makes it extremely simple to pre-train models using unstructured sparsity. Figure 6 shows the YAML configuration files with changes to enable sparse training. For more detailed options to configure sparsity, refer to the developer docs.

Figure 6: Example YAML configuration changes required to train with unstructured sparsity. In this example, we show how to enable training a model with 75% sparsity by randomly initializing the sparsity masks for the model weights. As a rule of thumb, we do not sparsify the normalization layers, embedding layers, or biases in the model. This is effective in ensuring better model quality with minimal additional FLOPs. The documentation has more examples to enable fine-grained configuration of sparsity.

After Stage-1, i.e., sparse pre-training, is finished, a conversion utility is available to enable the sparse-to-dense transition. Figure 7 shows an example call of this utility, with more details in our docs.

Figure 7: Conversion utility for the sparse-to-dense transition. This utility converts the sparse sentinel in the weights to 0, enabling further dense training. After conversion, we can either fine-tune directly, which gives the SPDF framework, or continue with dense pre-training, which gives the Variable SPDF framework.
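
Conceptually, the sparse-to-dense transition amounts to replacing each masked entry with 0 so that every weight can receive gradient updates afterwards. The NumPy sketch below assumes, purely for illustration, that masked weights are stored as NaN. The actual sentinel value and checkpoint format used by the conversion utility are not specified in this post, and `densify_checkpoint` is a hypothetical name, not part of CSoft.

```python
import numpy as np

def densify_checkpoint(state):
    """Return a copy of `state` with NaN sentinels replaced by 0.0."""
    dense = {}
    for name, tensor in state.items():
        tensor = np.asarray(tensor, dtype=np.float32)
        # Masked positions (NaN here, by assumption) restart from zero.
        dense[name] = np.where(np.isnan(tensor), 0.0, tensor).astype(np.float32)
    return dense

sparse_state = {"ffn.fc1.weight": np.array([[0.5, np.nan], [np.nan, -1.2]])}
dense_state = densify_checkpoint(sparse_state)
```

Starting the newly activated weights at zero means the densified model initially computes the same function as the sparse one, so training can continue without a jump in loss.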

To ensure the best model quality after resuming dense pre-training, i.e., Stage-2, we should not repeat data from Stage-1, as this can be detrimental to training LLMs [15]. As part of our training pipeline, the deterministic restart capability provides this feature and should be enabled from the beginning of pre-training.
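
One simple way to avoid repeating data across the two stages is to fix the sample order with a seed and resume from the number of samples already consumed. The sketch below illustrates that idea with hypothetical names and a toy dataset of ten samples; it is not how the deterministic restart capability is implemented.

```python
import numpy as np

def sample_order(num_samples, seed):
    """Deterministic permutation of sample indices for a given seed."""
    return np.random.default_rng(seed).permutation(num_samples)

def batches(num_samples, batch_size, seed, samples_consumed=0):
    """Yield index batches, skipping samples consumed before a restart."""
    order = sample_order(num_samples, seed)
    for start in range(samples_consumed, num_samples, batch_size):
        yield order[start:start + batch_size]

# Stage-1 consumes the first 6 samples; Stage-2 resumes right after them.
stage1 = [b for _, b in zip(range(3), batches(10, 2, seed=42))]
stage2 = list(batches(10, 2, seed=42, samples_consumed=6))
seen_stage1 = set(np.concatenate(stage1).tolist())
seen_stage2 = set(np.concatenate(stage2).tolist())
```

Because both stages derive the order from the same seed, tracking a single counter of consumed samples is enough to guarantee that Stage-2 sees only data that Stage-1 did not.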

Conclusion

In this work, we revisited the Sparse Pre-training and Dense Fine-tuning (SPDF) framework to reduce the computational FLOPs of training GPT models using weight sparsity. We introduced a simple yet powerful extension of the SPDF approach, called Variable SPDF, which derives its name from the densification during pre-training, and showed how the fine-grained control this framework offers leads to better overall performance than the standard SPDF approach at the same effective weight sparsity. Finally, we showed how this framework applied to large models (6.7B parameters) matches the dense baseline while using 64% fewer FLOPs during pre-training.

Our initial results show the tremendous promise of sparsity to accelerate training large-scale foundation models. Enabled by the Cerebras CS-2’s ability to train large models with unstructured sparsity, we plan to explore several directions to improve our results on even larger models. Contact us to learn more about this study or how Cerebras CS-2 and sparsity can empower your AI research.

References

[1] Novikova, Jekaterina, Ondřej Dušek, and Verena Rieser. “The E2E dataset: New challenges for end-to-end generation.” arXiv preprint arXiv:1706.09254 (2017).
[2] Wei, Jason, et al. “Finetuned language models are zero-shot learners.” arXiv preprint arXiv:2109.01652 (2021).
[3] Ouyang, Long, et al. “Training language models to follow instructions with human feedback.” Advances in Neural Information Processing Systems 35 (2022): 27730-27744.
[4] Falcon LLM, 2023. https://falconllm.tii.ae
[5] Dey, Nolan, et al. “Cerebras-GPT: Open compute-optimal language models trained on the Cerebras wafer-scale cluster.” arXiv preprint arXiv:2304.03208 (2023).
[6] Wu, Shijie, et al. “BloombergGPT: A large language model for finance.” arXiv preprint arXiv:2303.17564 (2023).
[7] Bolton, Elliot, et al. “BioMedLM.” https://crfm.stanford.edu/2022/12/15/biomedlm.html
[8] Thangarasa, Vithursan, et al. “SPDF: Sparse Pre-training and Dense Fine-tuning for Large Language Models.” In Proceedings of UAI (2023).
[9] Radford, Alec, et al. “Language models are unsupervised multitask learners.” OpenAI blog 8 (2019): 9.
[10] Brown, Tom, et al. “Language models are few-shot learners.” Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
[11] Aghajanyan, Armen, Luke Zettlemoyer, and Sonal Gupta. “Intrinsic dimensionality explains the effectiveness of language model fine-tuning.” arXiv preprint arXiv:2012.13255 (2020).
[12] Gao, Leo, et al. “The Pile: An 800GB dataset of diverse text for language modeling.” arXiv preprint arXiv:2101.00027 (2020).
[13] Hu, Edward J., et al. “LoRA: Low-rank adaptation of large language models.” arXiv preprint arXiv:2106.09685 (2021).
[14] Curation corpus base, 2020. https://github.com/CurationCorp/curation-corpus
[15] Lee, Katherine, et al. “Deduplicating training data makes language models better.” arXiv preprint arXiv:2107.06499 (2021).
[16] Chowdhery, Aakanksha, et al. “PaLM: Scaling language modeling with pathways.” arXiv preprint arXiv:2204.02311 (2022).
[17] Touvron, Hugo, et al. “LLaMA: Open and efficient foundation language models.” arXiv preprint arXiv:2302.13971 (2023).
[18] Gao, Leo, et al. “A framework for few-shot language model evaluation.” Version v0.0.3 (Dec. 2022).

Appendix

A. Detailed Evaluation Metrics for GPT Models using SPDF and Variable SPDF Frameworks

Table 4 reports the task-level breakdown of the evaluation scores for the models from Table 1. Following standard evaluation guidelines, we use the Eleuther evaluation harness for the ARC-Easy, ARC-Challenge, HellaSwag, Lambada (OpenAI), OpenBookQA, PiQA, and Winogrande tasks.

Task Accuracy (↑)	Sparse Pre-training	Variable Sparse Pre-training
ARC (e)	38.5	40.0
ARC (c)	17.9	17.2
HellaSwag	26.85	26.8
Lambada (OpenAI)	21.9	21.8
OpenBookQA	14.2	16.8
PiQA	59.9	59.6
Winogrande	49.6	53.0
Average	32.7	33.6

Table 4: Zero-shot accuracy (higher is better) using the Eleuther eval harness on 7 tasks for the GPT (256M) model trained using the SPDF and Variable SPDF frameworks at the same pre-training FLOPs. Variable sparsity improves the average accuracy, driven by gains on ARC-Easy, OpenBookQA, and Winogrande; on the remaining tasks, the two models are within 0.7 points of each other.

Contributions

Abhay Gupta led the research effort, evaluated the SPDF and Variable SPDF techniques in different FLOP efficient training setups, and ran most pre-training experiments. Mahmoud Salem and Vithursan Thangarasa ran fine-tuning on various downstream tasks for different setups. Kevin Leong supported training infrastructure, babysat the 6.7B pre-training, and provided debugging support throughout the project. Sean Lie coordinated the setup of the training infrastructure and was involved in experimental analysis. Shreyas Saxena advised the entire effort and presented the initial proof of concept. Abhay and Shreyas summarized the insights and contributed to the writing.

Author(s): Abhay Gupta, Mahmoud Salem, Vithursan Thangarasa, Kevin Leong, Sean Lie, Shreyas Saxena