Natalia Vassilieva, Director of Product | June 1, 2022

We’re excited to announce the release of Cerebras Software Platform (CSoft) version 1.3. This release enables training and fine-tuning of the 6-billion-parameter GPT-J model, further expands PyTorch support, and delivers new optimizations that make training smaller transformers (up to one billion parameters, like the original Transformer and BERT models) even faster.

GPT-J Made Easy with Cerebras’ Weight Streaming Technology

Just four years ago, state-of-the-art natural language processing (NLP) models had 100 million parameters, and we thought that was massive. With CSoft 1.3, it is now possible to train autoregressive language models with billions of parameters on a single CS-2 system using our groundbreaking weight streaming execution mode. Easily!

Probably the most exciting use case for this capability is continuous pre-training and fine-tuning of GPT-J. GPT-J is an open autoregressive language model with 6B parameters, trained and released by EleutherAI. The availability of trained weights for this model, and its lower serving cost compared to larger models such as OpenAI’s GPT-3, have made it very popular in the NLP community. GPT-J is considered a competitive open alternative to GPT-3, and the generic version has demonstrated reasonably good results on a number of natural language tasks without any further training, in a zero-shot setting. (Zero-shot learning is a seemingly magical method where a model learns how to make predictions about things it hasn’t seen before.)

However, to use the full power of this model for a domain-specific task, it is best to adapt it to the task at hand via fine-tuning. If your task is in healthcare, a model that doesn’t know the difference between the outcome of a disease and the outcome of a tennis match isn’t very useful!

Fine-tuning refers to the process of continuing training from a pre-existing checkpoint with domain-specific or task-specific data. Usually, datasets used for fine-tuning are significantly smaller than those used for initial pre-training, and thus the fine-tuning step is not as computationally expensive. However, a model with 6B parameters doesn’t fit into GPU memory and is challenging even to fine-tune. And by challenging, we mean really expensive in time, hardware and expertise.
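
To make that workflow concrete, here is a minimal PyTorch sketch of fine-tuning from a checkpoint. The tiny model, random data and file names are illustrative stand-ins, not the Cerebras GPT-J reference implementation; what matters is the pattern: load pre-trained weights, continue training on domain data, save the adapted checkpoint.

```python
import torch
import torch.nn as nn

# Tiny stand-in for a large autoregressive LM; the workflow, not the scale, is the point.
class TinyCausalLM(nn.Module):
    def __init__(self, vocab=100, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        # Causal mask so each position only attends to earlier positions.
        causal = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        hidden = self.encoder(self.embed(input_ids), mask=causal)
        return self.head(hidden)

model = TinyCausalLM()
# Fine-tuning starts from a pre-existing checkpoint rather than from random weights, e.g.:
# model.load_state_dict(torch.load("pretrained_checkpoint.pt"))

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # a small learning rate is typical

# A few random token sequences stand in for the (smaller) domain-specific dataset.
batch = torch.randint(0, 100, (4, 16))
for _ in range(10):
    logits = model(batch[:, :-1])                            # predict each next token
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)), batch[:, 1:].reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

torch.save(model.state_dict(), "finetuned.pt")               # the domain-adapted checkpoint
```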

With our 1.3 release, we have made GPT-J fine-tuning easy to do. It runs on a single Cerebras CS-2 system – no need to think about how to fit the model or which libraries to use for distributed training across dozens or hundreds of ordinary computers. You get total control over the fine-tuning process of this very large model without the usual pain of dealing with very large models.

Expanded PyTorch Support

With this new release we’re also continuing to expand our PyTorch support by adding reference model implementations. This time we focused on BERT fine-tuning tasks, adding reference implementations for sequence classification, extractive summarization (BERTSUM) and question answering (SQuAD).

For the sequence classification example, we provide data loaders for two tasks from the GLUE benchmark: sentiment analysis with the Stanford Sentiment Treebank (SST-2) dataset and entailment prediction with the Multi-Genre Natural Language Inference (MNLI) corpus. The same reference implementation can be used for other sequence classification tasks, such as mapping proteins to possible protein folds (remote homology detection) with a BERT model pre-trained on proteins as sequences of amino acids, and many other tasks where one needs to assign a label to a whole sequence.
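
To illustrate the general shape of such a model, here is a minimal, self-contained PyTorch sketch of sequence classification: a linear head over the encoder output at the [CLS] position, trained with cross-entropy. The toy embedding encoder and random token ids are placeholders for a pre-trained BERT and an SST-2 or MNLI data loader, not our reference implementation.

```python
import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Encoder output at the [CLS] position -> linear head -> class logits."""
    def __init__(self, encoder, hidden, num_labels):
        super().__init__()
        self.encoder = encoder                         # placeholder for a pre-trained BERT encoder
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids):
        hidden_states = self.encoder(input_ids)        # (batch, seq_len, hidden)
        cls = hidden_states[:, 0]                      # representation of the [CLS] token
        return self.classifier(cls)                    # (batch, num_labels)

# Toy stand-ins: an embedding layer as the "encoder", random token ids as SST-2 sentences.
hidden, num_labels = 32, 2                             # 2 labels for SST-2; MNLI would use 3
toy_encoder = nn.Embedding(1000, hidden)
model = SequenceClassifier(toy_encoder, hidden, num_labels)

input_ids = torch.randint(0, 1000, (8, 64))
labels = torch.randint(0, num_labels, (8,))
loss = nn.functional.cross_entropy(model(input_ids), labels)
loss.backward()
```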

As an example of extractive summarization, we have implemented a BERTSUM model with a data loader for CNN/Daily Mail stories from the DeepMind Q&A dataset. A trained model takes a text document as input and forms a short summary by extracting and concatenating the most important spans (usually sentences) from the input document.
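
A brief sketch of the extractive idea, with random sentence vectors standing in for the per-sentence [CLS] representations that BERTSUM takes from BERT: score every sentence, then keep the top-scoring ones in document order.

```python
import torch
import torch.nn as nn

# Placeholder: in BERTSUM, sentence vectors come from BERT outputs at the [CLS]
# token inserted before each sentence; here they are random for illustration.
num_sentences, hidden, top_k = 12, 32, 3
sentence_vectors = torch.randn(num_sentences, hidden)

scorer = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())  # per-sentence "keep" probability
scores = scorer(sentence_vectors).squeeze(-1)               # (num_sentences,)

# The extractive summary is the concatenation of the highest-scoring sentences,
# kept in their original document order.
keep = torch.topk(scores, k=top_k).indices.sort().values
print("sentences selected for the summary:", keep.tolist())
```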

And for the question answering fine-tuning task, we provide an example with the Stanford Question Answering Dataset (SQuAD), one of the most common fine-tuning benchmarks for natural language processing models.
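
Extractive question answering boils down to predicting the start and end positions of the answer span inside the passage. A minimal sketch of that span head follows; the random hidden states stand in for BERT outputs over the concatenated question and passage.

```python
import torch
import torch.nn as nn

# SQuAD-style span prediction: one logit per token each for "answer starts here"
# and "answer ends here". Random tensors stand in for real BERT hidden states.
batch, seq_len, hidden = 4, 128, 32
hidden_states = torch.randn(batch, seq_len, hidden)

span_head = nn.Linear(hidden, 2)                  # start and end logits per token
start_logits, end_logits = span_head(hidden_states).split(1, dim=-1)
start_logits, end_logits = start_logits.squeeze(-1), end_logits.squeeze(-1)

# Training: cross-entropy against the annotated start/end token positions.
start_pos = torch.randint(0, seq_len, (batch,))
end_pos = torch.randint(0, seq_len, (batch,))
loss = (nn.functional.cross_entropy(start_logits, start_pos)
        + nn.functional.cross_entropy(end_logits, end_pos)) / 2
loss.backward()
```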

We have also upgraded CSoft 1.3 to the latest PyTorch release, 1.11, enabling ML researchers to use the latest PyTorch features with Cerebras hardware.

Introducing Variable Tensor Shape Computations in PyTorch

Another exciting new feature available in CSoft 1.3 is Variable Tensor Shape (VTS) computation, which unlocks even faster training of transformer models. This capability, unique to Cerebras, performs computation efficiently on data samples with heterogeneous shapes, without wasting compute on padded elements. The result is a significant performance boost when a model is trained on data samples of varied sizes, such as sequences of different lengths. With VTS we observe a speed-up that is linear in the ratio between the longest sequence in a dataset and the average sequence length. This is very meaningful for datasets with a wide range of sequence lengths, such as text documents processed sentence by sentence (different sentences have different lengths) or proteins represented as sequences of amino acids, which can vary enormously in length. Read this blog post to learn more about VTS computation on Cerebras systems.
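
To see where that speed-up comes from, here is a back-of-the-envelope illustration in Python (the sequence lengths are made up): with padding, every sample costs as much as the longest one; with VTS, only the real tokens are computed on, so to first order the speed-up is the maximum length divided by the average length.

```python
# Back-of-the-envelope illustration of the VTS speed-up (made-up sequence lengths).
lengths = [12, 35, 48, 60, 128, 23, 77, 41]          # tokens per sample in a batch
max_len = max(lengths)
avg_len = sum(lengths) / len(lengths)

padded_tokens = max_len * len(lengths)               # every sample padded to the longest one
useful_tokens = sum(lengths)                         # what VTS actually computes on

print(f"fraction of padded work that is useful: {useful_tokens / padded_tokens:.2f}")
print(f"approx. speed-up from skipping padding:  {max_len / avg_len:.2f}x")
```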

Accelerating Training of Smaller Transformers with Multi-Replica Data Parallel Distribution

Lastly, we have expanded support for multi-replica data parallel training to transformer-style models in TensorFlow. Smaller models, with tens to hundreds of millions of parameters, such as BERT Base (110M parameters), Transformer Base (65M parameters) and Transformer Big (213M parameters), do not use the whole area of our Wafer-Scale Engine (WSE). In multi-replica mode, as the name implies, we copy the model until the wafer is completely filled, then train all the copies in parallel. For example, for BERT Base this feature boosts training speed by 1.7x.
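
Conceptually, multi-replica is data parallelism within a single wafer: identical copies of the model each process their own slice of the global batch at the same time. A toy PyTorch sketch of that idea follows; the replica count and the linear model are purely illustrative, not the WSE mapping.

```python
import torch
import torch.nn as nn

# Toy illustration of data-parallel replicas: N identical copies of a model,
# each processing a different slice of the global batch.
num_replicas, global_batch, features = 4, 32, 16
replicas = [nn.Linear(features, 2) for _ in range(num_replicas)]
for r in replicas[1:]:                                # all replicas start from the same weights
    r.load_state_dict(replicas[0].state_dict())

x = torch.randn(global_batch, features)
shards = x.chunk(num_replicas)                        # one micro-batch per replica
outputs = [r(shard) for r, shard in zip(replicas, shards)]

# On the WSE the replicas run concurrently on different regions of the wafer;
# here they run sequentially, but the per-sample math is the same.
print(torch.cat(outputs).shape)                       # (32, 2): outputs for the full batch
```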