Abstract

State-of-the-art language models are extremely challenging to train; they require huge compute budgets, complex distributed compute techniques, and deep ML expertise. As a result, few organizations train large language models (LLMs) from scratch. And increasingly, those that have the resources and expertise are not open-sourcing the results, a significant change from even a few months ago.

At Cerebras, we believe in fostering open access to the most advanced models. With this in mind, we are proud to announce the release to the open source community of Cerebras-GPT, a family of seven GPT models ranging from 111 million to 13 billion parameters. Trained using the Chinchilla formula, these models provide the highest accuracy for a given compute budget. Cerebras-GPT offers faster training times, lower training costs, and lower energy consumption than any publicly available model to date.

All models were trained on CS-2 systems that are part of the Andromeda AI supercomputer using our simple, data-parallel weight streaming architecture. By not having to worry about model partitioning, we were able to train these models in just a few weeks. Training these seven models has allowed us to derive a new scaling law. Scaling laws predict model accuracy based on the training compute budget and have been hugely influential in guiding AI research. To the best of our knowledge, Cerebras-GPT provides the first scaling law that predicts model performance for a public dataset.

Today’s release is designed to be usable and reproducible by anyone. All models, weights, and checkpoints are available on Hugging Face and GitHub under the Apache 2.0 license. Additionally, we provide detailed information on our training methods and performance results in our paper, “Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster.” The Cerebras CS-2 systems used for training are also available on-demand via the Cerebras AI Model Studio.

Cerebras-GPT: A New Model For Open LLM Development

Artificial intelligence has the potential to transform the world economy, but its access is increasingly gated. The latest large language model – OpenAI’s GPT-4 – was released with no information on its model architecture, training data, training hardware, or hyperparameters. Companies are increasingly building large models using closed datasets and offering model outputs only via API access.

For LLMs to be an open and accessible technology, we believe it’s important to have access to state-of-the-art models that are open, reproducible, and royalty free for both research and commercial applications. To that end, we have trained a family of transformer models using the latest techniques and open datasets that we call Cerebras-GPT. These models are the first family of GPT models trained using the Chinchilla formula and released via the Apache 2.0 license.

Figure 1. A comparison of different large language models and their openness and training philosophy.

Large language models can be broadly categorized into two camps. The first group includes models such as OpenAI’s GPT-4 and DeepMind’s Chinchilla, which are trained on private data to achieve the highest level of accuracy. However, the trained weights and source code of these models are not available to the public. The second group includes models such as Meta’s OPT and Eleuther’s Pythia, which are open source but not trained in a compute-optimal manner.

By “compute-optimal,” we refer to DeepMind’s finding that large language models achieve the highest accuracy for a fixed compute budget when 20 data tokens are used for every parameter in the model. Therefore, a one billion parameter model should be trained on 20 billion data tokens to reach optimal results for a fixed training budget. This is sometimes referred to as the “Chinchilla recipe.”

An implication of this finding is that it’s not optimal to use the same amount of training data when training a family of model sizes. For instance, training a small model with too much data results in diminishing returns and smaller accuracy gains per FLOP – it would be better to use a larger model with less data. In contrast, a large model trained on too little data does not reach its potential – it would be better to reduce the model size and feed it more data. In each case, using 20 tokens per parameter is optimal, per the Chinchilla recipe.
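To make the arithmetic concrete, here is a minimal back-of-the-envelope sketch of the recipe, using the common approximation of roughly 6·N·D training FLOPs for N parameters and D tokens; the numbers it prints are illustrative, not figures from our paper.

```python
def chinchilla_budget(n_params: float, tokens_per_param: int = 20) -> dict:
    """Return the Chinchilla-style token count and rough training FLOPs."""
    tokens = tokens_per_param * n_params
    flops = 6 * n_params * tokens        # common approximation: ~6 * N * D FLOPs
    return {"params": n_params, "tokens": tokens, "approx_train_flops": flops}

# The seven Cerebras-GPT sizes mentioned above (parameter counts are nominal).
for n in [111e6, 256e6, 590e6, 1.3e9, 2.7e9, 6.7e9, 13e9]:
    b = chinchilla_budget(n)
    print(f"{b['params']:.3g} params -> {b['tokens']:.3g} tokens, "
          f"~{b['approx_train_flops']:.3g} training FLOPs")
```

For example, a one billion parameter model comes out to 20 billion tokens, matching the recipe described above.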

Figure 2. Cerebras-GPT vs. Pythia. Lower curves show greater compute efficiency for a given loss level.

EleutherAI’s Pythia open-source model suite is highly valuable for the research community because it provides a wide range of model sizes using the public Pile dataset under a controlled training methodology. However, Pythia was trained with a fixed number of tokens across all model sizes with the objective of providing an apples-to-apples baseline across all models.

Designed to be complementary to Pythia, Cerebras-GPT covers a wide range of model sizes using the same public Pile dataset and establishes a training-efficient scaling law and family of models. Cerebras-GPT consists of seven models with 111M, 256M, 590M, 1.3B, 2.7B, 6.7B, and 13B parameters, all of which are trained using 20 tokens per parameter. By using the optimal number of training tokens for each model size, Cerebras-GPT achieves the lowest loss per unit of compute across all model sizes (Figure 2).

New Scaling Law

Training a large language model can be an expensive and time-consuming process. It requires a significant amount of computational resources and expertise to optimize the model’s performance. One way to address this challenge is to train a family of models with varying sizes, which can help establish a scaling law that describes the relationship between training compute and model performance.

Figure 3. Cerebras-GPT scaling law

Scaling laws are vital to LLM development since they allow researchers to predict the expected loss of a model before training, thus avoiding costly hyperparameter search. OpenAI was the first to establish a scaling law showing a power-law relationship between compute and model loss. DeepMind followed with the Chinchilla study, demonstrating an optimal ratio between compute and data. However, these studies were performed using closed datasets, making it difficult to apply their results to other datasets.

Cerebras-GPT continues this line of research by establishing a scaling law based on the open Pile dataset. The resulting scaling law provides a compute-efficient recipe for training LLMs of any size using Pile. By publishing our findings, we hope to provide a valuable resource for the community and further advance the development of large language models.
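As an illustration of what such a law looks like in practice, here is a minimal sketch of fitting a compute-to-loss power law of the form L(C) = a·C^(-b) + c. The data points below are synthetic placeholders generated from a made-up law, not the measured Cerebras-GPT results reported in the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(c, a, b, irreducible):
    # Loss as a function of training compute C: L(C) = a * C**(-b) + irreducible
    return a * np.power(c, -b) + irreducible

# Synthetic (compute, loss) points standing in for a sweep of compute-optimal runs.
rng = np.random.default_rng(0)
compute = np.logspace(18, 23, 7)                     # training FLOPs
true_a, true_b, true_c = 40.0, 0.06, 1.7             # made-up coefficients
loss = power_law(compute, true_a, true_b, true_c) + rng.normal(0, 0.01, compute.size)

# Recover the coefficients from the observed points.
(a, b, c), _ = curve_fit(power_law, compute, loss, p0=[10.0, 0.05, 1.0], maxfev=20000)
print(f"fitted law: L(C) ~= {a:.3g} * C^(-{b:.3g}) + {c:.3g}")

# Use the fitted law to predict loss at a larger, untrained compute budget.
print("predicted loss at 1e24 FLOPs:", power_law(1e24, a, b, c))
```

Once fitted on real training runs, a law like this lets you estimate the final loss of a larger model before committing the compute to train it.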

Model Performance on Downstream Tasks

We evaluated the performance of Cerebras-GPT on several downstream language tasks, such as sentence completion and question answering. These are important because even though the models may have good natural language understanding, that may not translate to specialized downstream tasks. We show that Cerebras-GPT preserves state-of-the-art training efficiency for most common downstream tasks, as shown in the examples in Figure 4. Notably, while previous scaling laws have shown scaling for pre-training loss, this is the first time results have been published showing scaling for downstream natural language tasks.

Figure 4. Example downstream task performance comparison of Cerebras-GPT and other open-source models. Cerebras-GPT preserves the training efficiency advantage across downstream tasks.
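For readers who want to try the released checkpoints themselves, here is a minimal sketch of running a sentence-completion prompt through Hugging Face transformers. The model ID is taken from the Hugging Face release; the prompt and generation settings are purely illustrative, and this is not the evaluation-harness setup behind Figure 4.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cerebras/Cerebras-GPT-111M"   # smallest model in the released family
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "Generative AI is"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,                       # greedy decoding for a deterministic demo
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```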

Cerebras CS-2: Simple, Data-Parallel Training

It takes substantial technical expertise to train very large models on GPUs. In the recently released GPT-4 Technical Report, OpenAI credits over thirty contributors just for compute infrastructure and scaling. To understand why, we will look at the existing LLM scaling techniques for GPUs shown in Figure 5.

The simplest way to scale is data parallel. Data-parallel scaling replicates the model on each device and uses different training batches on those devices, averaging their gradients. Clearly, this does not address the issue of model size – it fails if the entire model does not fit on a single GPU.
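As a point of reference, here is a minimal sketch of plain data-parallel training with PyTorch DistributedDataParallel, assuming a torchrun launch with one process per GPU. It is illustrative only, not the Cerebras training code, and it inherits the limitation just described: the whole replica must fit on one device.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU; torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for us.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a full LLM; data parallelism requires the whole model
    # (plus optimizer state) to fit in a single GPU's memory.
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])   # replicate the model on each device
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    for step in range(10):
        # Each rank draws a different batch; DDP averages gradients across ranks
        # during backward(), keeping the replicas in sync.
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).pow(2).mean()
        loss.backward()
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```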

A common alternative approach is pipelined model parallel, which runs different layers on different GPUs as a pipeline. However, as the pipeline grows, the activation memory increases quadratically with the pipeline depth, and this can be prohibitive for large models. To avoid that, another common approach is to split layers across GPUs, called tensor model parallel, but this imposes significant communication between the GPUs, which complicates the implementation and can be slow.
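To see where that communication comes from, here is a single-process toy illustration of tensor parallelism: a linear layer's weight is split across two hypothetical devices, each computes a partial output, and the shards must be recombined before the next layer. In a real cluster that recombination is inter-GPU communication on every layer.

```python
import torch

d_in, d_out, batch = 1024, 4096, 8
x = torch.randn(batch, d_in)
full_weight = torch.randn(d_out, d_in)

# Split the output dimension of the layer across two "devices".
w0, w1 = full_weight.chunk(2, dim=0)

# Each shard computes its slice of the output independently...
y0 = x @ w0.t()
y1 = x @ w1.t()

# ...then the slices must be exchanged and combined before the next layer,
# which is where the heavy GPU-to-GPU communication comes from.
y_parallel = torch.cat([y0, y1], dim=1)

# Sanity check: the sharded computation matches the unsharded layer.
assert torch.allclose(y_parallel, x @ full_weight.t(), atol=1e-5)
```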

Because of these complexities, there is no single way to scale on GPU clusters today. Training large models on GPUs requires a hybrid approach with all forms of parallelism; the implementations are complicated and hard to bring up, and there are significant performance issues.

Figure 5. Existing scaling techniques on distributed GPU clusters and their challenges. Scaling on GPU clusters requires a complex combination of all forms of parallelism.
Figure 6. GPU scaling requires the use of multiple parallelism techniques. Cerebras CS-2 uses data parallel scaling for any model size.

Two recent large language models illustrate the complexities involved in splitting large language models across many GPUs (Figure 6). Meta’s OPT model, ranging from 125M to 175B parameters, was trained on 992 GPUs using a combination of data parallelism and tensor parallelism along with various memory optimization techniques. Eleuther’s 20B-parameter GPT-NeoX used a combination of data, tensor, and pipeline parallelism to train the model across 96 GPUs.

Cerebras-GPT was trained using standard data parallelism on 16 CS-2 systems. This is possible because the Cerebras CS-2 systems are fitted with enough memory to run even the largest models on a single device without splitting the model. We then designed the purpose-built Cerebras Wafer-Scale Cluster around the CS-2 to enable easy scale-out. It uses a HW/SW co-designed execution mode called weight streaming that enables independent scaling of model size and cluster size, without model parallelism. With this architecture, scaling to larger clusters is as simple as changing the number of systems in a configuration file, as shown in Figure 7.

Figure 7. Push-button scaling to multiple CS-2 systems in the Cerebras Wafer-Scale Cluster using only simple data parallel scaling.

We trained all Cerebras-GPT models on a 16x CS-2 Cerebras Wafer-Scale Cluster called Andromeda. The cluster enabled all experiments to be completed quickly, without the traditional distributed systems engineering and model parallel tuning needed on GPU clusters. Most importantly, it enabled our researchers to focus on ML design instead of the distributed system. We believe the capability to easily train large models is a key enabler for the broad community, so we have made the Cerebras Wafer-Scale Cluster available on the cloud through the Cerebras AI Model Studio.

Conclusion

At Cerebras, we believe democratizing large models requires both solving the training infrastructure challenge and opening more models to the community. To that end, we have designed the Cerebras Wafer-Scale Cluster with push-button scaling, and we are open-sourcing the Cerebras-GPT family of large generative models. We hope that, as the first public large GPT model suite with state-of-the-art training efficiency, Cerebras-GPT will serve as a recipe for efficient training and as a reference for further community research. Additionally, we are making both the infrastructure and models available on the cloud through the Cerebras AI Model Studio. We believe it’s through better training infrastructure and more community sharing that we can, together, further advance the large generative AI industry.

Authors

Nolan Dey, Research Scientist; Joel Hestness, Principal Research Scientist; Sean Lie, Chief Hardware Architect and Co-founder | Mar 28, 2023

Contributing Authors

Nolan Dey, Gurpreet Gosal, Charles Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, Joel Hestness.

Additional Resources