In genomics, a physical characteristic (or phenotype) can be coded for by genes that are far apart in the genetic sequence. Such long-range genotype-phenotype relationships are difficult to identify. The genome of SARS-CoV-2, the virus that causes COVID-19, is a sequence of roughly 30,000 nucleotides, or 10,000 codons. In a sequence that long, finding the segments of interest is like finding a needle in a haystack. Large language models (LLMs) can make sense of a variety of sequence data, from human languages to genomic sequences. A key to their success is the self-attention mechanism, which relates different positions of a sequence to one another. Conventional machine learning systems have limited the input length that can be fed to an LLM. Now, however, the Cerebras CS-2, a wafer-scale system built for training neural networks, has the capacity to train state-of-the-art LLMs with input sequences longer than the SARS-CoV-2 genome. That allows researchers to study genomics from a whole new perspective.

Cindy Orozco Bohorquez and I worked with the lab of Dr. Arvind Ramanathan at Argonne National Laboratory (ANL) to develop genome-scale language models (GenSLM) that reveal evolutionary dynamics of SARS-CoV-2. For the first time, we trained LLMs with between 123 million and 25 billion parameters on the full SARS-CoV-2 genome at a sequence length of 10,240, on a single CS-2 system; to speed up training, we used as many as 16 CS-2s in a Cerebras Wafer-Scale Cluster.

We’re thrilled to say that our joint work with ANL and NVIDIA won the 2022 Gordon Bell Special Prize for HPC-Based COVID-19 Research.

You can read our paper, “GenSLMs: Genome-scale language models reveal SARS-CoV-2 evolutionary dynamics,” on bioRxiv (the link is at the end of this post).

Understanding the SARS-CoV-2 genome using AI

An organism’s genome holds secrets of how it survives and, in the case of a virus, how it infects a host. COVID-19 is no different. SARS-CoV-2, the virus that causes COVID-19, has been picking up mutations in its genome since early in the pandemic, and thousands of variants now circulate. Some variants spread faster than others or lead to more severe symptoms, and are therefore categorized as variants of concern (VOCs). If we can decipher the genetic patterns of different SARS-CoV-2 variants, we will be able to quickly identify VOCs in new sequencing data and implement timely public health interventions.

Huge scientific efforts on COVID-19 have paved the way to our deeper understanding and the improved clinical outlook that we see today. One important contributor is genetic sequencing. This technology has provided abundant data: over 13 million SARS-CoV-2 genomes have been sequenced and made available to the research community. But challenges remain: how do we most effectively use the SARS-CoV-2 genome sequence data to gain insights and unravel its hidden secrets? Two key pillars make this possible: software and hardware.

Traditional methods study genome sequences and identify variants by scrutinizing short candidate regions of high suspicion – a tedious and resource-intensive process. Now, artificial intelligence (AI) in the form of LLMs can help automate aspects of the process, making it easier, faster, and cheaper. LLMs, with a self-attention mechanism that relates regions across a sequence, help us reason about a genomic sequence by finding patterns in data. Training LLMs on genomes lets them learn what to look for, without human-crafted heuristics.
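A tiny, pure-Python sketch of that mechanism (illustrative only, not the GenSLM implementation): every output position is a softmax-weighted mix of all positions, which is how attention can relate far-apart regions of a sequence.

```python
import math

def self_attention(x):
    """Single-head dot-product self-attention over a list of vectors.
    Returns contextualized outputs and the (L x L) attention weights."""
    d = len(x[0])
    out, attn = [], []
    for q in x:
        # affinity of this position to every position in the sequence
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        e = [math.exp(s - m) for s in scores]
        w = [ei / sum(e) for ei in e]        # softmax over ALL positions
        attn.append(w)
        out.append([sum(wj * v[i] for wj, v in zip(w, x)) for i in range(d)])
    return out, attn

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # a toy 3-token sequence
out, attn = self_attention(x)
print(len(out), len(attn[0]))               # 3 3
```

Because every position attends to every other, the weight matrix is L × L – the source of the quadratic memory cost discussed below.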

Importantly, the longer the sequence, the wider the context and the better LLMs perform. We therefore want LLMs to view genomic sequences in their entirety. When LLMs see the whole of a SARS-CoV-2 genome at once, they can capture subtle interactions between blocks regardless of how far apart they are in the sequence. But the SARS-CoV-2 genome is long – it has over 30,000 nucleotides, which translate to roughly 10,000 codons. An LLM that can accept this entire sequence as input is quite wide (i.e., its layers have a large number of neurons), and is therefore costly in memory and compute, and difficult to train.
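The nucleotide-to-codon arithmetic above can be made concrete with a minimal sketch (the helper name is ours, for illustration):

```python
def to_codons(seq: str) -> list[str]:
    """Split a nucleotide string into non-overlapping 3-mers (codons)."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]

genome = "ATG" * 10_000                  # stand-in for a ~30kb viral genome
tokens = to_codons(genome)
print(len(genome), len(tokens))          # 30000 10000
```

A 30,000-nucleotide genome thus becomes a 10,000-token input – still far beyond the context windows most LLMs are trained with.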

That brings us to a challenge for AI hardware – it takes tremendous compute power to train LLMs with dense attention and very long sequence lengths. Memory usage is high – it scales as L² with sequence length L. Compute demand is extreme – it scales as L³. This memory and compute growth limits the batch size that can fit on a single device. (To a first approximation, a bigger batch size means fewer training cycles and thus a faster time to train.) Developers have to invent and orchestrate complex hybrid parallelism approaches – splitting the network’s parameters across devices, combined with data parallelism that splits the input batch – in order to optimize the usage of clusters of CPUs and GPUs. Are we blocked by the compute challenge, then?
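To make the quadratic memory growth concrete, here is a back-of-envelope sketch of the attention score matrix alone, assuming it is fully materialized in fp16 per head per layer (a simplifying assumption; real implementations vary):

```python
# Memory for one L x L attention score matrix in fp16 (2 bytes/element).
# Parameters and other activations are excluded; this only illustrates
# the L-squared growth of the attention computation.
def attn_score_mib(L: int, bytes_per_elem: int = 2) -> float:
    return L * L * bytes_per_elem / 2**20

print(attn_score_mib(2_048))     # 8.0 MiB
print(attn_score_mib(10_240))    # 200.0 MiB -> 5x the length, 25x the memory
```

Multiply that by the number of heads and layers, and the batch size that fits on a single GPU shrinks quickly at genome-scale sequence lengths.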

A solution: the Cerebras Wafer-Scale Cluster

Here is what we know: powerful compute enables effective algorithms; powerful compute and effective algorithms together drive new possibilities of scientific discovery. To provide the compute power needed to work with ever larger LLMs (and more capabilities to come), we built the Cerebras Wafer-Scale Cluster. The cluster allows many CS-2 systems to be used efficiently, in parallel, to speed up training dramatically. The largest Cerebras Cluster built to date is Andromeda, which has a remarkable 13.5 million AI-optimized compute cores spread across 16 CS-2 nodes.

With the Weight Streaming execution mode, the cluster offloads the tens of billions of network parameters to a high-capacity memory subsystem called MemoryX, freeing the compute devices to be used for compute rather than parameter storage. The Cerebras Wafer-Scale Cluster can now train models of an unprecedented scale, extending the ability of researchers in genomics and other fields to use AI to understand the data that holds the information they seek.

The challenge of training genome-scale LLMs

As explained previously, training LLMs with the full SARS-CoV-2 genome is challenging. The Ramanathan Lab developed genome-scale language models, or GenSLMs, with brilliant algorithmic ideas that do not rely on pretraining on the entire SARS-CoV-2 genome. As explained in the preprint on bioRxiv, they pretrained on shorter genetic sequences with a maximum sequence length (MSL) of 2,000 codons, significantly fewer than the 10,000 codons in the full SARS-CoV-2 genome. That work was carried out on Polaris, a GPU-based supercomputer located at the Argonne Leadership Computing Facility (ALCF), and on the Selene system at NVIDIA®.

Even with these powerful compute resources, the researchers still ran into difficulties. As the paper says: “… these training runs frequently take >1 week on dedicated GPU resources (such as the Polaris supercomputer at ALCF) …”. The paper goes on to say that “… for the larger model sizes (2.5B and 25B), training on the 10,240 length SARS-CoV-2 data was infeasible on GPU-clusters due to out-of-memory errors during attention computation”.

Cerebras successfully overcame those problems, training larger models on the full SARS-CoV-2 genome. As the paper notes: “To enable training of the larger models on the full sequence length (10,240 tokens), we leveraged AI-hardware accelerators such as Cerebras CS-2, both in a stand-alone mode and as an inter-connected cluster…”.

Now let’s look at what Cerebras has achieved in this work.

Our GenSLM time-to-solution is less than one day

By using a cluster with 1 to 4 CS-2 nodes, we trained large models on the SARS-CoV-2 genomes to convergence in less than one day. The two GenSLMs we trained to convergence have GPT-2 architectures with 123M and 1.3B parameters and dense attention. Recall that, as stated in the preprint, ANL took more than one week to train these models on their GPU clusters. The training times on 1 or 4 CS-2 nodes are reported in Table 1.

We deploy data parallelism across multiple CS-2s, with a constant batch size per device. On the Cerebras Wafer-Scale Cluster of four CS-2s, we observed convergence in fewer training steps, as the cluster enables a larger effective batch size. Additionally, the compute throughput for the two GenSLMs, in samples per second, scales linearly from 1 to 2 to 4 CS-2s, as seen in Table 2. These two factors together result in a shorter time to solution when training with larger Cerebras Clusters.
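The data-parallel arithmetic above can be sketched as follows (the numbers are illustrative stand-ins, not our measured batch sizes or throughputs):

```python
# Data parallelism with a constant per-device batch size: the effective
# batch and the ideal throughput both grow linearly with the number of
# CS-2 systems. All numbers below are illustrative.
def effective_batch(per_device_batch: int, n_systems: int) -> int:
    return per_device_batch * n_systems

def ideal_throughput(samples_per_sec_one_system: float, n_systems: int) -> float:
    return samples_per_sec_one_system * n_systems

print(effective_batch(32, 4))        # 128 samples per step on 4 systems
print(ideal_throughput(1.5, 16))     # 24.0 samples/s under linear scaling
```

The larger effective batch reduces the number of steps to convergence, while linear throughput scaling reduces the wall-clock time per step – the two factors that shorten time to solution.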

Table 1. Cerebras GenSLM time to solution.
Table 2. Throughput scaling of two GenSLMs trained to convergence on the Cerebras Wafer-Scale Cluster.

We can train GPT models, up to 25B parameters with MSL 10,240, on a single CS-2 and achieve linear throughput scaling up to 16 CS-2s.

The Ramanathan Lab aimed to train GenSLMs of 2.5 billion and 25 billion parameters on the full SARS-CoV-2 genome. Yet on the ANL Polaris supercomputer, with its thousands of A100 GPUs, scaling the vanilla model with dense attention is challenging and remains a work in progress. Algorithmic advances in attention and custom kernels are required to overcome this challenge on GPUs. The Cerebras Wafer-Scale Cluster, however, enables easy scaling of these models with vanilla dense attention layers. With the cluster, we trained GPT-J variants with 250 million, 2.5 billion, and 25 billion parameters, similar to the Ramanathan Lab’s foundation model configurations. We can train them with MSL 10,240 on a single CS-2 and achieve linear throughput scaling up to 16 CS-2s, as shown in Figure 1.

Let’s compare our throughput to that of the GPU clusters to get a more concrete idea of what these throughput numbers mean. For the 250 million parameter GenSLM – the largest model the Ramanathan Lab could train with MSL 10,240 – 16 CS-2s run faster than 512 A100s, about a quarter of the GPU resources of Polaris, the #14 supercomputer on the June 2022 Top500 list.

And at the risk of repeating ourselves, the Ramanathan Lab has so far been unable to train the multi-billion parameter models with long sequence lengths on the Polaris supercomputer. Cerebras allowed the team to do something previously intractable, and do it with ease.

Figure 1. Linear throughput scaling of GPT-J up to 25B parameters with MSL 10,240 on Cerebras Wafer-Scale Cluster, and throughput comparison of GPT-J 250M between Cerebras and Polaris@ALCF.

The Cerebras Wafer-Scale Cluster makes massive language models easy to train

You can train massive language models too – the Cerebras Wafer-Scale Cluster is very easy to use. The cluster uses our appliance workflow to provide a simple job interface. No complicated code changes or additional libraries are needed to use one or multiple CS-2s in the cluster. With a keystroke, you can distribute LLM training across the cluster using data parallelism alone. All the GenSLM models trained on the Cerebras cluster used the GPT-2 and GPT-J implementations in the Cerebras Model Zoo public repository. We simply adapted the dataloader to process genomic data and changed the configuration files.
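To give a flavor of what “adapting the dataloader to process genomic data” can look like, here is a minimal sketch. The 64-codon vocabulary, function names, and padding scheme are our illustrative assumptions, not the actual Model Zoo code.

```python
# Illustrative only: map a nucleotide sequence to a fixed-length array
# of codon token IDs, as a GPT-style dataloader would consume.
CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
VOCAB = {codon: i for i, codon in enumerate(CODONS)}   # 64 codon IDs
PAD_ID = len(VOCAB)                                    # one extra ID for padding

def encode(seq: str, msl: int = 10_240) -> list[int]:
    """Codon-tokenize a nucleotide string, then pad/truncate to MSL."""
    ids = [VOCAB[seq[i:i + 3]] for i in range(0, len(seq) - len(seq) % 3, 3)]
    ids = ids[:msl]
    return ids + [PAD_ID] * (msl - len(ids))

sample = encode("ATGGCG" * 100)   # stand-in for a sequenced genome
print(len(sample))                # 10240
```

With encoding handled in the dataloader, the model itself needs no genomics-specific changes – only the vocabulary size and sequence length in the configuration files.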

If you have a few minutes to spare, here is our live demo from a recent Cerebras Developer Community meeting. If you would like to dive into the technical details behind our cluster and our linear scaling achievement, we have blog posts with comprehensive explanations.

“I am amazed by this and can’t wait to take it to the next level…”. – ANL colleague

Summary

This work demonstrates how the Cerebras Wafer-Scale Cluster unlocks new science. An ANL colleague said of our 16 CS-2 scaling results, “I am amazed by this and can’t wait to take it to the next level…”. We now have the compute capacity to study a viral genome all at once, enabling the discovery of long-range gene interactions. We expect to see more groundbreaking work made accessible, made easy, and made fast with Cerebras’ current and future wafer-scale computing systems.

This work was made possible with support from many parts of the Cerebras team, and we are happy to take on the next challenge together. There is so much more to come from us. Stay tuned!

Claire Zhang, Machine Learning Solutions Engineer
Rebecca Lewington, Technology Evangelist
November 14, 2022

Paper link: https://www.biorxiv.org/content/10.1101/2022.10.10.511571v1