Today we are excited to announce Cerebras software release 2.0. R2.0 is the biggest software update to our platform this year, bringing a generational leap in performance, programmability, and model support. Crucially, R2.0 moves the Cerebras software stack to PyTorch 2.0, making it easy to work with the latest models and training techniques on Cerebras hardware.

50% Faster – it’s as if we released a new chip – but it’s all software!

In 2022 we introduced the Cerebras Wafer Scale Cluster, a new system architecture that lets us connect up to 192 Cerebras CS-2s into a single AI supercomputer. To distribute work across the cluster, we introduced Weight Streaming, a new software stack that enables data-parallel-only training. Weight Streaming was a complete rewrite of our software stack, and since then we have made enormous strides in optimizing performance for single-node and multi-node clusters. Training large language models is now almost six times faster than it was a year ago. R2.0 is the largest step-function improvement of them all, boosting performance by 50% on large GPT models.

The key feature enabling this huge performance boost is dynamic memory offload. To briefly recap: the Cerebras Wafer Scale Cluster extends the 40GB of on-chip memory of each CS-2 with over 12 terabytes of external memory. Unlike GPUs, which must fit all weights and activations into roughly 80GB of device memory, we store weights in external memory and activations in on-chip memory. The CS-2 works on one layer of the model at a time, and weights are streamed in as needed, hence the name Weight Streaming. This design gives us over 100x more aggregate memory capacity than GPUs, allowing us to natively support trillion-parameter models without resorting to complex model-partitioning schemes such as tensor and pipeline parallelism.
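
To make the execution model concrete, here is a toy sketch of the layer-at-a-time pattern in plain NumPy. It is purely illustrative: the `ExternalMemory` class and its `fetch` method are hypothetical stand-ins for the external memory service and the interconnect, not our actual runtime.

```python
import numpy as np

class ExternalMemory:
    """Toy stand-in for external weight memory: all layer weights live off-chip."""
    def __init__(self, layer_sizes, rng):
        self.weights = [rng.standard_normal((m, n)) / np.sqrt(m) for m, n in layer_sizes]

    def fetch(self, layer_id):
        # In the real system this is a stream over the interconnect; here it is a lookup.
        return self.weights[layer_id]

def forward(x, external_memory, num_layers):
    """Process one layer at a time: weights are streamed in as needed,
    while activations stay resident in (simulated) on-chip memory."""
    activations = [x]
    for layer_id in range(num_layers):
        w = external_memory.fetch(layer_id)  # stream this layer's weights in
        x = np.maximum(x @ w, 0.0)           # compute the layer "on-chip"
        activations.append(x)                # activations remain on-chip
    return x, activations

rng = np.random.default_rng(0)
mem = ExternalMemory([(512, 512)] * 4, rng)
out, acts = forward(rng.standard_normal((8, 512)), mem, num_layers=4)
```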

Weight Streaming offloads weights to external memory, but activations can still take up a substantial amount of on-chip memory. Activation checkpointing relieves some of this pressure by storing only selected activations and recomputing the rest, but the checkpoints themselves can still consume a large amount of memory. With R2.0, we now analyze activation checkpoints and offload them to external memory when their values are not immediately required. As the compute graph progresses, we copy the values back on-chip just before they are needed again, so they can be used to compute gradients during the backward pass.
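
For readers who want to experiment with the general pattern, stock PyTorch exposes an analogous hook via `torch.autograd.graph.saved_tensors_hooks`, which intercepts activations as autograd saves them and restores them only when the backward pass needs them. The sketch below uses that generic PyTorch mechanism with a CPU offload target; it illustrates the idea, not our on-wafer implementation, which performs the offload automatically in the compiled graph.

```python
import torch

def pack_to_cpu(tensor):
    # Called when autograd saves an activation for the backward pass:
    # remember its device and park the data in host memory.
    return tensor.device, tensor.to("cpu", non_blocking=True)

def unpack_from_cpu(packed):
    # Called when the backward pass actually needs the value again:
    # copy it back to the original device just in time.
    device, cpu_tensor = packed
    return cpu_tensor.to(device, non_blocking=True)

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024)
)
x = torch.randn(32, 1024, requires_grad=True)

with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    loss = model(x).pow(2).mean()   # saved activations are offloaded here
loss.backward()                     # ...and reloaded here to compute gradients
```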

Dynamic memory offload is especially helpful for large models that generate large volumes of activations. By offloading activations and reloading them on the fly, we can spend the memory savings on larger batch sizes, resulting in higher throughput. This lets us train GPT-3 models 50% faster than in the previous release.

For context, GPUs are typically released every two years, with each generation on average doubling the performance of its predecessor. That works out to roughly a 42% compounded annual improvement, so R2.0 delivers the equivalent of a full year of GPU hardware progress in a single software release. And R2.0 is immediately available and free for all our customers, with no hardware upgrade needed!

PyTorch 2.0 Support With LTC + Torch-MLIR Backend

R2.0 brings PyTorch 2.0 support to the Cerebras platform, giving developers a generalized way to implement models, optimizers, and training scripts. Until now, LLM support was provided through reference implementations of popular models in our Model Zoo GitHub repository. Developers could modify existing models and training scripts, but it was difficult to create projects from scratch.

With R2.0, we have implemented an entirely new backend and accompanying API that provides a standard PyTorch 2.0 interface to Cerebras hardware for the first time. Our new backend is based on Lazy Tensor Core (LTC), replacing the old PyTorch/XLA implementation. Unlike XLA, LTC is part of standard PyTorch, which will allow us to support new PyTorch versions much faster going forward.

As part of R2.0 we are releasing the Cerebras PyTorch API, our new interface for programming Cerebras hardware. Prior to R2.0, Cerebras hardware was programmed via the Model Zoo runner API, which required developers to convert their models and optimizers; custom runner support was limited, making it difficult to customize training. The new Cerebras PyTorch API brings direct, generalized access to PyTorch to Cerebras hardware for the first time: users can write fully custom training scripts and models without relying on our reference implementations.
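
To give a flavor of what "fully custom" means, the sketch below is an ordinary PyTorch training loop of the kind you can now write directly. The Cerebras-specific pieces, such as compiling the model for the CS-2 and constructing the optimizer and dataloader through the Cerebras PyTorch API, are covered in the documentation and omitted here so the example runs anywhere.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A plain PyTorch 2.0 training step. On Cerebras hardware, the model and
# optimizer would additionally be wrapped through the Cerebras PyTorch API
# (see the documentation); that wrapping is omitted so the sketch stays portable.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()

data = TensorDataset(torch.randn(1024, 128), torch.randint(0, 10, (1024,)))
loader = DataLoader(data, batch_size=64, shuffle=True)

for features, labels in loader:
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
```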

As part of our work to implement PyTorch 2.0 support for Cerebras hardware, we made a number of fundamental contributions to the PyTorch backend for accelerators, including the primary code path that integrates LTC with Torch-MLIR as well as key improvements to PyTorch’s LTC path. These contributions help not just Cerebras; they enable the broader PyTorch ecosystem to support a diverse range of accelerators beyond GPUs.

Diffusion Transformers

In Release 2.0, we are adding support for an exciting new model class: diffusion transformers (DiT). Transformers are not only incredibly powerful for language tasks; they are also growing in popularity for vision and multimodal tasks thanks to their proven scalability and robustness. While today’s popular diffusion applications use the U-Net architecture, diffusion transformers show comparable performance and benefit from the growing body of literature on scaling, sparsity, and inference techniques.

As part of our initial release, we support the Adaptive Layer Norm (AdaLN-Zero) variant of the DiT model. You can train models of various sizes, including Small (~33M), Base (~130M), Large (~458M), and XL (~675M), as well as a 2 billion parameter DiT model. A few samples generated from our checkpoint trained on the Flowers-102 dataset are shown above. To train your own DiT model, check out our Model Zoo reference implementation and accompanying instructions.
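
For readers curious about what AdaLN-Zero actually does inside a transformer block, here is a minimal, self-contained sketch in the spirit of the DiT paper. It is illustrative only and is not our Model Zoo implementation; the layer sizes and conditioning embedding are made up for the example.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Sketch of an AdaLN-Zero block: the conditioning vector `c` (timestep plus
    class embedding) produces per-block shift, scale, and gate parameters, and
    the gate projection is zero-initialized so each block starts as the identity."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Six modulation vectors: shift/scale/gate for attention and for the MLP.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim))
        nn.init.zeros_(self.ada[1].weight)  # the "Zero" in AdaLN-Zero
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, x, c):
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = self.ada(c).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + scale_a.unsqueeze(1)) + shift_a.unsqueeze(1)
        x = x + gate_a.unsqueeze(1) * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + scale_m.unsqueeze(1)) + shift_m.unsqueeze(1)
        return x + gate_m.unsqueeze(1) * self.mlp(h)

block = AdaLNZeroBlock(dim=256, num_heads=4)
tokens = torch.randn(2, 64, 256)  # (batch, image patches, hidden dim)
cond = torch.randn(2, 256)        # conditioning embedding
out = block(tokens, cond)
```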

Dynamic Sparsity Support

With Release 2.0, we are introducing support for Sparse Pretraining and Dense Finetuning (SPDF), a technique designed to accelerate pretraining by incorporating high levels of weight sparsity while maintaining downstream task accuracy through dense finetuning. To get started with SPDF, we have provided a comprehensive how-to guide, along with reference configurations in the Model Zoo that you can use directly in your projects.
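
As a rough illustration of the two training phases, the sketch below mimics the SPDF recipe with stock PyTorch pruning utilities. On Cerebras hardware, sparsity is configured through the Model Zoo configurations rather than through this API; the sketch only shows the sparse-then-dense structure of the recipe, not the performance benefit.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
linear_layers = [m for m in model.modules() if isinstance(m, nn.Linear)]

# Phase 1: sparse pretraining. Mask out 75% of each weight matrix so that
# training only updates the remaining 25% of the weights.
for layer in linear_layers:
    prune.random_unstructured(layer, name="weight", amount=0.75)

# ... run the pretraining loop here ...

# Phase 2: dense finetuning. Remove the pruning re-parameterization: the
# zeroed weights are kept as the starting point, but every weight is
# trainable again for the downstream task.
for layer in linear_layers:
    prune.remove(layer, "weight")

# ... run the dense finetuning loop on downstream-task data ...
```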

Expanded Model Support

  • Jais: a state-of-the-art Arabic-English model using maximal update parameterization, SwiGLU activations, and ALiBi position encodings
  • BTLM: our state-of-the-art 3B model with an 8K context window and ALiBi extrapolation
  • LLaMA v2 7B, 13B, and 70B are supported for training from scratch, continuous pretraining, or fine-tuning from a pretrained checkpoint.
  • Falcon 40B is supported for training from scratch, continuous pretraining, or fine-tuning from a pretrained checkpoint. 
  • StarCoder 15B is supported for training from scratch, continuous pretraining, or fine-tuning from a pretrained checkpoint. 
  • U-Net 3D: we now support 3D U-Net models, making it easy to train on large volumetric images such as MRI data. 

Release 2.0 is a huge milestone for the Cerebras platform, bringing a 50% performance improvement, a new PyTorch 2.0 stack, and diffusion transformer support. It is available for customer upgrades today.