State-of-the-art language models are extremely challenging to train: they require huge compute budgets, complex distributed computing techniques, and large teams with deep expertise in machine learning and parallel programming. At Cerebras, we have built a simple, data-parallel weight streaming architecture that lets users and researchers train high-quality models efficiently and with push-button scaling. This blog highlights the ease of using Cerebras to train large-scale GPT models and reproduce a state-of-the-art external baseline.

Introduction

Large Language Models (LLMs) like OpenAI’s GPT-3 [3] show impressive language understanding and text generation capabilities. However, the ability and expertise to train these models from scratch and at scale are limited to a handful of large organizations. Training language models from scratch has a range of benefits, including data privacy, domain customization, and finer-grained control over model updates. Today, there are many challenges to training LLMs at scale, including but not limited to intricate modeling issues (e.g., activation checkpointing, 3D parallelism for training) and hardware and system issues (e.g., hardware failures, multi-node orchestration).

At Cerebras, our mission is to reduce the cost of curiosity for our customers. Anyone should be able to harness the power of large language models, whether general-purpose models or specialized domain models in fields such as genomics [8], without spending months setting up infrastructure or hiring large teams with deep technical expertise. With this in mind, we designed our weight streaming execution mode from the ground up to build on the flexibility of the underlying architecture. For ML practitioners, the Cerebras AI Model Studio cloud service provides simple, push-button access to the performance of the Cerebras Wafer-Scale Cluster.

This blog highlights how easy it is to use Cerebras to train large-scale GPT models and reproduce a state-of-the-art external baseline. As a candidate for our replication, we picked a configuration from the exhaustive study done by the BigScience Architecture and Scaling group [1], which consists of authors from multiple organizations, including Hugging Face, EleutherAI, New York University and the Allen Institute for AI. In that study, the authors evaluated the impact of various modeling choices on the zero-shot performance of GPT models. In our work, we replicated the pre-training of a GPT-3 XL (1.3 billion parameters) model on the Pile [2] dataset and reported the zero-shot evaluation accuracy across 24 downstream tasks.

Scaling Pre-Training on a Cerebras Wafer-Scale Cluster

Following the setup and hyperparameters from BigScience [1], we pre-trained a GPT-3 XL (1.3B) model on the Pile [2] dataset for 112B tokens. The model uses learned position embeddings, GELU [9] activations, and no layer norm [10] after the embedding layer. An implementation of the model can be found in our public Cerebras Model Zoo repository.
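
To make the architecture concrete, the following is a minimal sketch that approximates the GPT-3 XL (1.3B) shape using the Hugging Face GPT2Config, which already matches the choices above (learned position embeddings, GELU, no layer norm after the embedding layer). The specific hyperparameter values are illustrative assumptions on our part, not the exact configuration from the Cerebras Model Zoo or [1].

```python
# Illustrative sketch only: a GPT-3 XL-sized decoder expressed with Hugging Face's
# GPT2Config, which matches the architectural choices described above (learned
# position embeddings, GELU, no layer norm after the embedding layer). The
# hyperparameter values are assumptions chosen to land near 1.3B parameters.
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,                 # assumed GPT-2 BPE vocabulary size
    n_positions=2048,                 # assumed maximum sequence length
    n_embd=2048,                      # assumed hidden size
    n_layer=24,                       # assumed number of decoder blocks
    n_head=16,                        # assumed number of attention heads
    activation_function="gelu_new",   # GELU activation [9]
)
model = GPT2LMHeadModel(config)
num_params = sum(p.numel() for p in model.parameters())
print(f"~{num_params / 1e9:.2f}B parameters")  # roughly 1.3B with these values
```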

While a single CS-2 system can seamlessly pre-train GPT models of up to 175 billion parameters, to speed up training for our experiment we leveraged a Cerebras Wafer-Scale Cluster equipped with multiple CS-2 systems. We scaled pre-training efficiently on an 8 x CS-2 cluster without having to worry about the significant complexities of the different parallelism strategies required on GPUs. The remarkable ease of scaling is shown in Figure 1. A more detailed discussion of our scaling properties can be found in this blog post.

Figure 1. Distributing training across multiple CS-2 systems in a Wafer-Scale Cluster is as easy as specifying the number of systems in the run command. No programming for distributed training is necessary.
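
Under the hood this is pure data parallelism: every CS-2 holds the full model and processes an equal slice of the global batch, so the only knob that changes between a 1-system and an 8-system run is the system count. The sketch below walks through that bookkeeping; the batch size and throughput figures are illustrative assumptions, not measured Cerebras performance.

```python
# Back-of-the-envelope sketch of data-parallel scaling. All numbers here are
# illustrative assumptions, not measured Cerebras throughput or the actual
# training batch size.
GLOBAL_BATCH = 528                    # assumed global batch size (samples/step)
BASE_TOKENS_PER_SEC = 1.0e6           # assumed single-system throughput
TOTAL_TOKENS = 112e9                  # training budget from the experiment

def data_parallel_plan(num_systems: int) -> dict:
    """Each system sees the full model and an equal slice of the global batch."""
    per_system_batch = GLOBAL_BATCH // num_systems
    # Ideal (linear) scaling: N systems process N batch slices concurrently.
    tokens_per_sec = BASE_TOKENS_PER_SEC * num_systems
    return {
        "per_system_batch": per_system_batch,
        "tokens_per_sec": tokens_per_sec,
        "days_for_112B_tokens": TOTAL_TOKENS / tokens_per_sec / 86400,
    }

for n in (1, 2, 4, 8):
    print(n, data_parallel_plan(n))
```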

Zero-Shot Generalization

We evaluated the quality of our pre-trained model following the tasks and setup described in BigScience [1]. Table 1 reports the average accuracy across 24 tasks, evaluated with the EleutherAI (EAI) evaluation harness [6]. Note that we did not evaluate the model on SST-2 [7], since the evaluation metric on this task exhibits high variance across different pre-training setups (see Appendix E in [1]).

Setup            Model Size   Training Tokens   Average EAI Results
BigScience [1]   1.3B         112B              41.43
Cerebras         1.3B         112B              41.75

Table 1: Average EAI results (higher is better) for the Cerebras-trained GPT-3 XL model compared against the results reported in the paper [1] across 24 zero-shot tasks.
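
For readers who want to reproduce this kind of comparison, per-task numbers can be generated with the EleutherAI lm-evaluation-harness. Below is a minimal sketch using the harness's current Python entry point; the checkpoint name and task subset are placeholders rather than the exact 24-task setup from [1], and the API of the harness version cited in [6] may differ.

```python
# Minimal zero-shot evaluation sketch with the EleutherAI lm-evaluation-harness
# (pip install lm-eval). The checkpoint and task list are placeholders, not the
# exact setup behind Table 1, and the API shown is from recent harness releases.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                        # Hugging Face model backend
    model_args="pretrained=EleutherAI/gpt-neo-1.3B",   # placeholder 1.3B checkpoint
    tasks=["lambada_openai", "piqa", "hellaswag", "winogrande"],  # placeholder subset
    num_fewshot=0,                                     # zero-shot evaluation
)

for task, metrics in results["results"].items():
    print(task, metrics)
```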

Figure 2 shows a detailed comparison of each task’s accuracy between the BigScience [1] reference and our trained model. Our zero-shot results are comparable to the reference results (within standard error) for most tasks. For some tasks our results differ from the published metrics; this is primarily due to minor differences in our pre-training setup, such as training on a different subset of 112B tokens.

Figure 2: Comparison of zero-shot results across 24 tasks from the EleutherAI harness [6] between the reference model from BigScience [1] and the Cerebras-trained GPT-3 XL model.

Conclusion

The Cerebras AI Model Studio has made the training of large language models accessible to a broader audience. Users can efficiently train high-quality models of different sizes (1-175 billion parameters) and effortlessly reproduce external baselines. Learn more about Cerebras’ NLP innovations here.

Author:

Abhay Gupta, Senior Research Scientist | May 24, 2023

Resources

[1] Scao, Teven Le, et al. “What Language Model to Train if You Have One Million GPU Hours?” arXiv preprint arXiv:2210.15424 (2022).

[2] Gao, Leo, et al. “The Pile: An 800GB dataset of diverse text for language modeling.” arXiv preprint arXiv:2101.00027 (2020).

[3] Brown, Tom, et al. “Language models are few-shot learners.” Advances in neural information processing systems 33 (2020): 1877-1901.

[4] Loshchilov, Ilya, and Frank Hutter. “Decoupled weight decay regularization.” arXiv preprint arXiv:1711.05101 (2017).

[5] Child, Rewon, et al. “Generating long sequences with sparse transformers.” arXiv preprint arXiv:1904.10509 (2019).

[6] Gao, Leo, et al. “A framework for few-shot language model evaluation.” Version v0.0.1, September 2021.

[7] Socher, Richard, et al. “Recursive deep models for semantic compositionality over a sentiment treebank.” Proceedings of the 2013 conference on empirical methods in natural language processing. 2013.

[8] Zvyagin, Maxim, et al. “GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics.” bioRxiv (2022).

[9] Hendrycks, Dan, and Kevin Gimpel. “Gaussian error linear units (gelus).” arXiv preprint arXiv:1606.08415 (2016).

[10] Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer normalization.” arXiv preprint arXiv:1607.06450 (2016).