Training large ML models at scale is difficult because of the complexity of 3D parallelism techniques. Moreover, standard scaling methods suffer from bottlenecks at various points in the system and therefore fail to reach linear scaling. In this article, I’ll show how Cerebras has solved these challenges through “appliance mode”, enabling push-button linear scaling on Cerebras Wafer-Scale Clusters of up to 192 CS-2 systems.

Picture this: you’re sitting at your desk, it’s the Friday before a long weekend, and you need to kick off a training run for a GPT-style large NLP model. It’s a multi-billion parameter model, and you’ve just started working on getting it to train on an NVIDIA cluster. The command used to run training is arcane, and your model code is filled with details about chunk sizes and GPU mappings for pipeline parallelism. You pray that there are no bugs. The hours march on.

Meanwhile, your colleague, who used a Cerebras Wafer-Scale Cluster, has long gone home.

Getting large language models to train at scale is no easy feat on GPUs.

Often, for experimentation, models are written to train on a single multi-GPU server. Once it’s time to scale out, in order to speed up training, multiple servers are involved, and things get tricky. Inter-process communication libraries have to be considered. We have to use batch-scheduling systems. And because the model, and often even individual layers, won’t fit on a single device, we have to resort to complex training strategies like pipelined or tensor model parallelism (Figure 1). For a deeper dive on parallelism strategies, I highly recommend this post on our developer blog by my colleague Natalia Vassilieva.
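To make the idea concrete, here is a toy, framework-free Python sketch of GPipe-style pipeline parallelism. The two stage functions and the microbatch count are purely illustrative stand-ins for real model layers living on real devices; this is a sketch of the scheduling idea, not an implementation.

```python
# Toy sketch of pipeline parallelism (illustrative only).
# Each "stage" stands in for a slice of the model that lives on one
# device; a batch is split into microbatches that flow stage by stage.

def stage0(x):  # pretend this slice of the model lives on device 0
    return x + 1

def stage1(x):  # pretend this slice of the model lives on device 1
    return x * 2

STAGES = [stage0, stage1]

def pipeline_forward(batch, num_microbatches=4):
    # Split the batch into microbatches so stages could overlap work.
    size = len(batch) // num_microbatches
    micro = [batch[i * size:(i + 1) * size] for i in range(num_microbatches)]
    # In a real system each stage runs concurrently on its own device;
    # here we run them in order just to show the data flow.
    for stage in STAGES:
        micro = [[stage(x) for x in mb] for mb in micro]
    # Re-assemble the full batch from the processed microbatches.
    return [x for mb in micro for x in mb]

print(pipeline_forward(list(range(8))))  # [2, 4, 6, 8, 10, 12, 14, 16]
```

The bookkeeping here is trivial; in practice, overlapping microbatches across devices while keeping activations alive for the backward pass is exactly where the complexity (and the "bubble" of idle time) comes from.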

Many papers have been written about efforts to make distributed cluster training faster, such as this one by Narayanan et al., which describes their effort to train Megatron-LM across more than 3,000 GPUs (!) using complex interleaving techniques. Even so, latency at any one stage can bottleneck the entire system, and debugging that latency is time-consuming, especially as the number of systems increases.

Figure 2, which is drawn from the Microsoft DeepSpeed blog, elegantly illustrates the complexity involved in training large models on GPUs. It is necessary to leverage 3D parallelism: data, model, and pipeline parallelism combined. Training throughput is limited by the slowest of these elements, and this is the source of much of the difficulty in distributed training.


Figure 2: Mapping of workers in Figure 1 to GPUs on a system with eight nodes, each with four GPUs. Coloring denotes GPUs on the same node (Microsoft Research).

Scaling models is turning out to be a full-time engineering problem for many organizations. Standard approaches include using the Torch Pipe API, which provides out-of-the-box pipeline parallelism. Unfortunately, even something simple like adding skip connections becomes difficult. There are limitations on input and output shapes, and conditional control flow is impossible, as described in this very clear article from Hugging Face.

Even once pipeline parallel strategies are in place, problems at cluster scale arise. If your cluster is in the cloud, it’s difficult to guarantee uniform connectivity between machines, and training speed will be bottlenecked by the slowest connection. This is, incidentally, why very large NLP models are typically trained on dedicated hardware, outside of traditional cloud infrastructure.

These issues make it difficult even to start training large language models, and tasks like hyperparameter tuning become more challenging still due to the brittleness of the system. Debugging training latency and restarting from checkpoints are likewise difficult processes on such systems.

Cerebras to the rescue!

Cerebras solves these problems through its unique weight-streaming execution mode, which enables training extreme-scale models on a single machine. Because even the largest layers fit onto the Wafer-Scale Engine at the heart of our CS-2 systems, the 3D parallelism structure collapses into purely data-parallel. Training in this fashion utilizes the Cerebras Weight-Streaming Cluster, consisting of CS-2s, MemoryX, and SwarmX nodes. MemoryX nodes store the model weights: we are limited only by their memory capacity. Internally, MemoryX uses both DRAM and Flash storage to hold weights, and it also computes weight updates during each step. SwarmX nodes distribute weights and aggregate gradients. Because of its bidirectional tree topology, SwarmX can service many CS-2s with minimal added latency. For a deep dive into this, take a look at this white paper by my colleagues Stewart Hall, Rob Schreiber and Sean Lie.
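As a rough mental model of one training step under weight streaming, consider the toy Python sketch below. The function names, scalar "layers", and toy gradient are my own illustration, not Cerebras APIs: a central store streams weights layer by layer, each worker computes gradients on its own data shard, and the reduced gradients flow back for the update.

```python
# Conceptual sketch of weight streaming (illustrative, not Cerebras APIs).
# A central store (MemoryX) holds the weights and streams them, layer by
# layer, to data-parallel workers (CS-2s); a broadcast/reduce fabric
# (SwarmX) fans weights out and sums gradients on the way back.

weights = {"layer0": 1.0, "layer1": -0.5}   # toy per-layer scalar weights
shards = [[1.0, 2.0], [3.0, 4.0]]           # one data shard per worker
LR = 0.1

def worker_grad(w, shard):
    # Toy gradient: d/dw of 0.5*(w*x)^2, averaged over the shard.
    return sum(w * x * x for x in shard) / len(shard)

for name in weights:                         # stream one layer at a time
    w = weights[name]                        # fan out via SwarmX
    grads = [worker_grad(w, s) for s in shards]  # each CS-2 works locally
    g = sum(grads) / len(grads)              # reduce via SwarmX
    weights[name] = w - LR * g               # MemoryX applies the update
```

The point of the sketch is structural: because each worker holds a full copy of the computation for the current layer and only gradients and weights cross the fabric, there is no pipeline or tensor partitioning to reason about.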

The beauty of such a layout is that we can scale up the number of CS-2s in a cluster with minimal overhead, a feat not easily obtainable within GPU clusters, where data, activations, and gradients have to all communicate between GPUs and between nodes.

More importantly, we can leverage this cluster design with remarkable ease using Cerebras’ Appliance Mode, standard on all Cerebras Wafer-Scale Clusters.

The complete guide to scale-out on Cerebras Wafer-Scale Clusters

All that’s required is to execute run.py with the flag --num_csx=1. This can be switched from 1 to n, and all the heavy lifting is handled under the hood by the CS-Estimator.

In order to run in appliance mode, we need to first complete these two steps:

      1. Make sure the cluster management software is installed on the cluster. This should already be installed in a multi-box ready setup, or can be enabled by your sysadmin for an existing single system installation.
      2. Activate the proper Python environment: source venv_appliance/bin/activate

Then to run a model on a single system, use this simple command:

python run.py --params params.yaml --num_csx=1 --model_dir=model_dir --num_steps=10 --mode train

To spread model training over four systems, all you have to do is change one parameter!

python run.py --params params.yaml --num_csx=4 --model_dir=model_dir --num_steps=10 --mode train

And that’s it! The model will train in data-parallel mode across four CS-2s. No more worrying about pipeline or tensor parallelism strategies, even for models with trillions of parameters!
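For scaling studies, one convenient pattern is a simple sweep over cluster sizes. The shell loop below is a dry-run sketch that only echoes the commands rather than launching them, and the per-run model_dir naming is my own convention, not a Cerebras requirement:

```shell
# Dry-run sweep over cluster sizes (echoes commands instead of running them).
# The flags mirror the examples above; adjust paths for your setup.
for n in 1 2 4; do
  echo python run.py --params params.yaml --num_csx=$n \
       --model_dir=model_dir_$n --num_steps=10 --mode train
done
```

Dropping the echo turns the sketch into a real sweep, with each run writing to its own model directory.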

(In time-honored fashion, I will leave the steps needed to train on n systems as an exercise for the reader.)

If you are curious to learn more about our software stack, visit the developer page on our website. If you have questions, visit our Discourse page. To get a demo of our cluster capabilities and appliance workflow, contact Cerebras support by sending email to support@cerebras.net.

Learn more

Darius Lam, Product Lead, Computer Vision | September 14, 2022