Training Multi-Billion-Parameter Models on a Single Cerebras System is Easy - Cerebras

Natalia Vassilieva, Director of Product, Machine Learning | June 22, 2022

Configuring models | Training GPT with 1.3 billion parameters | The pain and suffering of GPU clusters
Switching to 6.7 billion parameters | Benefits of training on Cerebras | Summary

The Cerebras Software Platform (CSoft) makes it easy to train large-scale Transformer-style natural language processing (NLP) models on a single Cerebras CS-2 system. How easy? Let me show you with GPT-3 XL 1.3B and GPT-3 6.7B parameter models as examples. The same codebase, the same command to launch a training job, just different model configurations. No need to worry about how to distribute the training across multiple conventional devices, no complicated hybrid 3D parallelism. Switching from 1.3B to 6.7B to 13B to 20B model training just by changing a few parameters in a configuration file.

Configuring the models

Our first example is GPT-3 XL model with 1.3 billion parameters, and the second example is GPT-3 model with 6.7 billion parameters. Both use the same standard Python implementation relying on TensorFlow. All the main model parameters are defined in an easy-to-read and modify YAML parameter file, which are passed to the python code to define model configuration. You can visit our reference implementation repository to take a look at our TensorFlow implementation of GPT-J model and examples of model configuration YAML files, and we will make GPT-2 and GPT-3 implementations available shortly too.[i]

Each YAML configuration file contains all the information that a typical researcher would care about: model depth in terms of number of hidden layers (decoder blocks in case of GPT models), hidden layer sizes, number of attention heads, optimizer settings, checkpoint frequency, etc.

Training a 1.3 billion parameter GPT-3 model on the CS-2

CS-2 is a network-attached accelerator. To launch a training job, you need your code and data on a standard CPU node, connected to a CS-2 system. This adjacent conventional server will run compilation and coordinate the whole run. We use a standard job orchestrator, slurm to manage and allocate hardware resources for the runs, and to launch any job on a CS-2 system, we use a bash script csrun_wse, which executes a standard slurm srun command with a few pre-defined settings. csrun_wse allocates the required hardware resources and executes a command passed as an argument to this script.

Figure 1 shows a command used to launch training of GPT-3 XL model on a CS-2 system.

Figure 1. Command line to train GPT-3 XL using CSoft

We are launching a python training script run.py using Cerebras python build for Weight Streaming execution python-ws. The training python script run.py takes a few additional command line arguments: execution mode (train), where to store checkpoints and other training artifacts (in a directory gpt3_xl_pile, which will be created during the run), which model configuration to use (the one described in a YAML configuration file configs/params_gpt3_xl_pile.yaml), which CS-2 system to run own (the one with IP address stored in the environment variable $CS_IP), and how many training iterations to run (2000).

When this command line is entered, slurm allocates required hardware resources and compilation begins. The compute graph is extracted from TensorFlow and compiled into an executable that can be loaded onto the CS-2 system. Once this is done, training starts and the first losses begin to stream out of the WSE and appear in the console (Figure 2).

Figure 2. As compilation completes, the first losses are streamed to the console.

During the training, all training artifacts are stored in a specified output directory. You can find there the checkpoints, standard outputs of the training run with TensorFlow, such as tfevents, which can be used to hook a TensorBoard and monitor your run (Figure 3).

The pain and suffering of training large models on a cluster of GPUs

Before discussing how to train the 6.7 billion parameter model on a CS-2 system, let me talk you through what it would take to train the model on a cluster of GPUs. To train large-scale models on clusters of GPUs, several distribution strategies are required. For a model as large as six billion parameters or beyond, this requires both simple ways of distribution, such as data parallel, and more complicated pipelined model parallel and tensor model parallel (Figure 4). This combination of all three approaches is known as 3D parallelism or hybrid parallelism.

If you want to learn more about these parallelization strategies (and their challenges), take a look at this blog post.

Implementing this 3D parallelism is not for the faint of heart. A lot of effort has been spent to create libraries which are supposed to make distributed training easy. But it is very far from easy even with these libraries. Just take a look at NVDIA’s own instructions for distributed training. A ton of new parameters are required for distributed runs. Every time a new model needs to be trained, work must be done to determine how to split and distribute the model across multiple devices.

The simplest method is to run it with data parallel training only, but that is only possible if the model fits into a single device – not the case with models larger than three-four billion parameters even with the largest GPUs with 80GB of HBM. Instead, every single layer must be split into several devices (tensor parallel) and decisions must be made about which layers belong to which devices (pipeline parallel). How to decide what the values of those parameters defining degree of data parallelism, tensor model parallelism and pipeline model parallelism should be? How to choose an appropriate batch size? Should activation checkpointing be used or not? If yes, which option for activation checkpointing?

Choices of all these parameters are inter-dependent. And the optimal choice only works for one model and for one cluster configuration. If you change your model configuration you have to start again from scratch. If you get access to a bigger cluster – the same, redo all your settings.

Don’t forget to keep in mind that different forms of parallelism have different bandwidth requirements for communication between the devices. For example, tensor parallel training is very communication heavy, so, typically, it is not used across multiple GPU-equipped servers. This means that the number of devices that can leverage this form of parallelism is limited. And two others, data parallel and pipeline parallel, need sufficiently larger batch size to keep all the devices busy. And very large batch sizes are not good for convergence. With larger batch sizes you may need to run training for more epochs to get to the target accuracy, or you may not reach that target accuracy at all (while that accuracy can be reached when trained with a smaller batch).

To illustrate this point, Figure 5 shows data collected from ML Perf 1.1 training logs. As more processors are added to accelerate training of a single model, overall batch size is increased to keep those additional processors busy: X axis shows relative increase in batch size. But as training is done with a larger batch, models have to be trained for more epochs to reach target accuracy: Y axis shows relative increase in number of training samples a model has to see to reach the target accuracy. BERT is trained with an optimizer which has been specifically designed to deal with very large batch sizes (LAMB), but even for BERT we see that 15x increase in batch size leads to 1.5x more work to be done to reach the target accuracy.

And now given this picture, here is a tricky question: how to pick an optimal batch size given a cluster of conventional processors, to minimize overall time-to-accuracy? Table 1 below shows some choices made for distributed BERT training ML Perf 1.1, which relies on data parallel distribution only. As number of GPU systems used increases, chosen per-GPU batch size decreases (likely to ensure convergence), leading to decreased GPU utilization. But per-GPU batch size isn’t decreased enough to maintain constant global batch size across all the runs – global batch is increased and as a result there is an increase in number of samples required to train the model to the target accuracy. And this is with data parallel training only. When model parallel is added into the mix, the choice of optimal parameters becomes even more complicated. And now add an option of gradient checkpointing. Gradient checkpointing reduces memory requirements, thus enabling running with a larger batch size and potentially increasing GPU utilization, but comes at expense of additional compute. Will the increased utilization compensate for the additional computation? It depends. It depends on model, on cluster configuration, on the choice of other parameters defining distributed training. There is no simple and intuitive answer.

Table 1. Impact of batch size choices with distributed training.

All these choices for best batch size, gradient checkpointing options, data and model parallelism settings are required to train any large-scale model on a cluster of GPUs. The 6-billion parameter model fits in this category, which would require at least tensor parallel and/or pipeline parallel in addition.

Switching to 6.7 billion parameters on the CS-2 is trivial

Training the 6.7 billion-parameter model on a single CS-2 system follows exactly the same process as with the smaller model. Examining the changes in configuration files between the models, there are only a few differences (Figure 6). The few lines that are different are exactly the lines which define the model. For example, in increasing the hidden size from two to four, the depth of the model increases from 24 to 32. The size of the forward layer also increases, and a higher frequency of logging is used for convenience because a six billion parameter model takes longer to train; to see feedback from the system, more frequent losses should be sent.

Figure 6. Comparison of configuration file differences between GPT-3 XL and GPT-3 5.7B.

Notably, the code is not changed. The exact same commands are used as in the 1.3 billion-parameter model. Only the four model configuration parameters are changed, and the exact same training script is run.

Figure 7 shows the progress of the run with GPT-3 6.7B parameter model, trained on a single CS-2 system.

Benefits of training on the CS-2

As demonstrated, the Cerebras CS-2 makes it easy to train large scale transformer models.

Figure 8 lists all configurations of that model that the OpenAI team has tested, including 1.3 and 6.7 billion parameter models that we’ve discussed. Figure 9 shows the two YAML configuration files, with the changes highlighted to switch from the 1.3 to the 6.7 billion parameter model. The only changes made are the size of the model, the depth, the number of attention heads. In this implementation, the size of the feedforward layer was also changed. To reiterate, no changes to the Python code were made, no need to worry about how the model should be distributed, no additional parameters to choose and set, just choose a desired model definition.

In contrast, attempting the same switch from a 1.3 billion parameter model to a 6.7 billion parameter model on a GPU with the same code base is unlikely to work. Even if sophisticated parallelism libraries are used, the decision of how to split the training of the larger model must be done manually. The burden falls on the ML researcher to decide how many GPUs to use for data parallel, for model parallel, and which flavor of model parallel to be used, etc. Depending on the model configuration, these need to be tweaked. This is a brittle system, and every iteration depends on the model configuration and infrastructure. It’s all custom to the model configuration and to the cluster. It takes expertise, and, moreover, it takes time. Distributed training is painful to debug.

With the Cerebras CS-2, there is no need to worry about any of that. The CS-2 complier decides the most optimal way to distribute all the computations across hundreds of thousands of cores on the WSE. When the code runs on the WSE, it runs as though it’s running on a single device. Because it does run on a single device. For a researcher, this gives the opportunity to easily switch between different model sizes.

For example, in the GPT-3 configuration file discussed here, if a 20 billion parameter model is desired instead of the 6 billion parameter model, only the depth of the model and the size of the model must be changed (Figure 9). This can all be done without needing to think about how to distribute the model across multiple devices.

Summary

To sum up, it’s easy to train, and switch between, large NLP models with the Cerebras CS-2. Whether it’s GPT-3 model variant with 1.3 or 6.7 parameters, or GPT-J models with 6 billion or 20 billion parameters the process is just as simple.

There are a many applications where these large language models outperform their smaller cousins. Now, for the first time, all ML users are able to take advantage of the possibilities these large models unlock.

Please contact us and get in touch if this is an area you are interested in. We will be happy to help you and to get you going with the training of those large-scale models on the single system without you needing to think how to distribute your computations across clusters of many GPUs.

References

[i] Cerebras reference implementation repository https://github.com/Cerebras/cerebras_reference_implementations

[ii] T. Brown et al., “Language Models are Few-Shot Learners”, arXiv, 2020 https://arxiv.org/abs/2005.14165