Sia Rezaei, Machine Learning Solutions Engineer | August 16, 2022

Since the advent of the transformer architecture, an ongoing area of research and development has focused on techniques that allow transformers to process longer sequences. In this post we share our results on how extending sequence length helps to improve the accuracy of GPT-2. We then show that with Cerebras’ technology it is now possible to increase sequence length by more than an order of magnitude without massive increases in computational intensity or compromising accuracy.


Today the use of Deep Learning for Natural Language Understanding (NLU) is delivering value to all kinds of real-world applications. The chances are you are interacting with multiple applications every day that are powered by NLU.

NLP vs NLU

The term NLU may be new to some readers. You might think of NLU as taking natural language processing (NLP) to the next level, to derive actual meaning and context from the source. Consider this sentence:

Please crack the window, it’s getting hot in here.

With NLP alone, the services of a glazier will probably be needed! NLU, by contrast, is able to infer that a little ventilation is needed.

The Rise of the Transformer

This proliferation of NLU has been mainly propelled by the surprising performance of “attention-based” neural networks, such as the Transformer, GPT, and BERT, in tasks such as question answering, translation, reasoning, and text generation. Even the researchers who built these networks were quite surprised by their ability to generate coherent sentences or hold a conversation without it being painfully obvious that the speaker is not a human.

Attention (or attention layer) is an architectural motif that enables neural networks to learn patterns over sets or sequences of tokens. Tokens could be words in a sentence in NLU applications or they could be atoms in a molecule for applications in molecular biology.
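To make the idea concrete, here is a minimal sketch of single-head scaled dot-product attention, the standard form used in Transformer-style networks. It is an illustration only: it leaves out the learned query, key, and value projections and the multiple heads that real models use, and the function names and toy token vectors are assumptions, not part of any Cerebras code.

```python
import math

def softmax(row):
    """Turn a list of scores into weights that are positive and sum to 1."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(queries[0])
    outputs = []
    for q in queries:
        # One score per (query, key) pair: every token looks at every token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Three toy "tokens", each a 2-dimensional vector (made up for illustration).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = dense_attention(tokens, tokens, tokens)
print(result)
```

Note that nothing in the code says which tokens should attend to which: the weights come entirely from the data.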

One of the reasons for the success of attention-based neural networks is believed to be the fact that attention layers make few presumptions about the structure and patterns in the data. Instead, the patterns, whatever they may be, are learned from massive amounts of data. But this comes at a cost.

Despite its surprising effectiveness, the attention layer is incredibly expensive to compute. This is especially true as input sequences (e.g. a text document) grow longer. In fact, if the input sequence grows by a factor of 2, the memory (Figure 1) and compute required by the attention layer grow by a factor of 4! We call this “quadratic complexity”, and it has been the major roadblock for using attention-based networks on longer sequences.
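The quadratic growth is easy to check with some back-of-the-envelope arithmetic. The sketch below counts the bytes needed to hold the attention score matrices of one layer for a single sequence. The 2 bytes per value matches the per-activation figure in the Figure 1 caption; the head count of 16 and the function name are illustrative assumptions, and the numbers cover only the score matrices, not the total training memory plotted in Figure 1.

```python
def attention_score_bytes(seq_len, num_heads=16, bytes_per_value=2):
    """Bytes for one layer's attention score matrices, for a single sequence.

    Each head stores one score per (query, key) pair, i.e. seq_len * seq_len values.
    num_heads=16 is an illustrative assumption.
    """
    return num_heads * seq_len * seq_len * bytes_per_value

for n in (1024, 2048, 4096):
    mib = attention_score_bytes(n) / 2**20
    print(f"sequence length {n}: {mib:.0f} MiB of attention scores per layer")
```

Each doubling of the sequence length multiplies the result by four.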

Figure 1. Total memory required to train GPT-J as a function of sequence length (16 bytes per parameter and 2 bytes per activation). Note that the axes are logarithmic. (Source: Cerebras.)

Why Do Long Sequences Matter?

We humans are able to read a novel (or long document) while maintaining the big context in our mind. This enables us to keep track of the overall narrative, which is required to comprehend the text. However, today even the biggest networks, trained on massive clusters of GPUs, can typically only maintain somewhere between 512 and 2048 tokens in context. For comparison, an 8-page document could exceed 8000 tokens.

To see how larger context helps inference in practice, we looked at the performance of pre-trained GPT-2 on the next token prediction task. This model was trained with a maximum sequence length of 1024. The model performed next token prediction for 15,000 passages from the BookCorpus Open dataset.

Keep in mind that GPT-style language models try to predict the next token. They do so by assigning a score to every token in the vocabulary. Top-1 accuracy measures the % of times the correct token was assigned the highest score, and top-5 accuracy measures the % of times the correct token was among the 5 tokens that got the highest scores. In addition, GPT-2 is an autoregressive model, which means that it predicts the probability of the nth token using all the previous tokens. This means that as you move along the sequence, the tokens are predicted using a bigger and bigger context.
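As a rough sketch of how these two metrics can be computed, the snippet below takes one list of scores per prediction and the index of the correct token. The function name, the tiny 6-token vocabulary, and the scores are made up for illustration; they are not the evaluation code or data used for the results in this post.

```python
def top_k_accuracy(scores, targets, k):
    """Fraction of predictions whose correct token is among the k highest scores."""
    hits = 0
    for row, target in zip(scores, targets):
        # Indices of the k highest-scoring vocabulary entries for this prediction.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += target in top_k
    return hits / len(targets)

# Toy scores over a 6-token vocabulary for three predictions (made-up numbers).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.10, 0.05],  # correct token 1 has the top score
    [0.30, 0.10, 0.25, 0.20, 0.10, 0.05],  # correct token 2 is second best
    [0.40, 0.30, 0.10, 0.10, 0.06, 0.04],  # correct token 5 has the lowest score
]
targets = [1, 2, 5]

print("top-1:", top_k_accuracy(scores, targets, k=1))
print("top-5:", top_k_accuracy(scores, targets, k=5))
```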

Figure 2 plots the average prediction accuracy at each of the 1024 token positions, averaged across all passages. As we expect, the tokens towards the end of the sequence are predicted with higher accuracy. This strongly suggests that a larger context helps to better predict which token is about to come next.
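The averaging behind a curve like this can be sketched in a few lines. The code assumes we already have, for every passage, a list of flags saying whether each position was predicted correctly; the function name and the toy flags are hypothetical and are not measurements from this experiment.

```python
def accuracy_by_position(correct_flags):
    """Average correctness at each position across passages.

    correct_flags[p][t] is True when passage p's token at position t
    was predicted correctly. All passages must have the same length.
    """
    num_passages = len(correct_flags)
    seq_len = len(correct_flags[0])
    return [
        sum(passage[t] for passage in correct_flags) / num_passages
        for t in range(seq_len)
    ]

# Four toy passages of three tokens each (made-up flags, not real results).
flags = [
    [False, True, True],
    [False, False, True],
    [True, True, True],
    [False, True, True],
]
curve = accuracy_by_position(flags)
print(curve)
```

Plotting `curve` against position gives the kind of plot shown in Figure 2.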

Figure 2. Average token prediction accuracy vs. position in the sequence for 15,000 samples from BookCorpus Open for GPT-2. (Source: Cerebras.)

Previous Attempts to Reduce Compute Complexity

Training larger attention-based networks is extremely cumbersome on GPUs. This is because the computation must be distributed across hundreds of GPUs. A lot of expertise is required for setting up such distributed training, even for short sequences. Processing long sequences would make distributed training even more cumbersome and slow, because the computation required for each layer would need to be distributed across multiple GPUs, which comes with huge communication bandwidth demands. When you consider that it takes around 2 microseconds to move a bit of data between two network nodes, but only a nanosecond to move data between two processing elements on the Cerebras Wafer-Scale Engine (WSE), the difficulty of distributed processing becomes obvious.

Therefore, the dominant approaches to solving this problem have been to sparsify the attention operation (Longformer[i]), introduce hierarchy (SMITH[ii]), or approximate attention (Linformer[iii], Performer[iv]). All these methods are attempts at reducing the quadratic complexity of attention layers to something more manageable. They do so by assuming certain constraints about the correlations between tokens.

For example, in the Longformer network, the attention operation is sparsified, or constrained, such that each token is only considered in the context of its neighboring tokens rather than the whole input sequence.
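A simplified sketch of this kind of constraint is a banded mask in which token i may attend to token j only when the two are at most `window` positions apart. The function names and sizes below are illustrative assumptions, and the sketch covers only the local-window idea, not the full Longformer design.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]

def allowed_pairs(seq_len, window):
    """Number of (query, key) pairs that survive the mask."""
    return sum(sum(row) for row in sliding_window_mask(seq_len, window))

for n in (8, 16, 32):
    print(f"n={n}: dense pairs={n * n}, windowed pairs={allowed_pairs(n, window=1)}")
```

The number of allowed pairs grows roughly linearly with sequence length rather than quadratically, which is where the savings come from. The price is that any relationship between tokens farther apart than the window is invisible to that layer.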

Similarly, approaches that use hierarchy break the input into multiple segments and constrain attention to be within those segments. The SMITH model, for example, breaks the input into 3 levels of hierarchy: word, sentence, and document. At the sentence level, each token is considered in the context of that sentence only, and so on.

Finally, methods that approximate attention, like Linformer and the Performer, are akin to applying lossy compression to the attention patterns. They assume certain attention patterns that deviate from true attention.
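To give a feel for the compression, here is a shape-level sketch in the spirit of Linformer, which projects the keys (and likewise the values) along the sequence axis down to a shorter length before computing scores. In Linformer the projection is learned; the random matrix below is only a stand-in, and all names and sizes are illustrative assumptions.

```python
import random

def matmul(a, b):
    """Multiply an (r x m) matrix by an (m x c) matrix, both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

seq_len, head_dim, proj_len = 8, 4, 2  # illustrative sizes, not from the post
rng = random.Random(0)

def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

queries = random_matrix(seq_len, head_dim)
keys = random_matrix(seq_len, head_dim)
# Stand-in for the learned projection; random here purely for illustration.
projection = random_matrix(proj_len, seq_len)

compressed_keys = matmul(projection, keys)            # (proj_len x head_dim)
scores = matmul(queries, transpose(compressed_keys))  # (seq_len x proj_len)
print(len(scores), "x", len(scores[0]))
```

The score matrix has seq_len x proj_len entries instead of seq_len x seq_len, so its size grows linearly with sequence length when proj_len is fixed. The cost is that each query now sees only proj_len mixtures of the original tokens.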

However, going back, we remember the reason the attention operation has been successful was its minimal built-in presumptions about the patterns in the data. The methods we discussed above inevitably introduce presumptions about the data. And that constrains the patterns the network can learn, compromising its performance. If deep learning has taught us a lesson,[v] it has been that it is better to let the machines learn from the data than to constrain them with our presumptions. This has been especially true for data such as language where there are many complex non-linear relationships that determine the semantics.

How Cerebras' Wafer-Scale Architecture Makes Long Sequence Computation Easy

What if we could have our cake and eat it too? The Wafer-Scale Engine (WSE) developed by Cerebras is a single massive chip containing a whopping 850,000 processing elements (PEs), each with its own local memory. The PEs are meshed together with an ultra-low latency, high bandwidth communication fabric. Data can move from PE to PE in a single nanosecond clock cycle. This remarkable memory and communication bandwidth makes it not only possible, but quick, to train better models using sequences up to 50,000 tokens long with dense attention, 20 times longer than what a modern GPU can handle!

Figure 3. Comparison of maximum sequence length.

Processing long sequences is enabled by our weight streaming feature. In the weight streaming execution mode, the activations are kept on the wafer and weights are loaded onto the wafer one layer at a time. This frees up memory and enables us to run not only much larger networks, but also much longer sequence lengths at full performance because of our enormous memory and on-chip network bandwidth.
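The control flow of this idea can be mimicked with a toy sketch: activations stay resident while a generator hands over one layer's weights at a time. This is a conceptual illustration only and does not use or represent the Cerebras software stack; the function names and the scaled-identity weights are hypothetical, chosen so the result is easy to check by hand.

```python
def layer_weight_stream(num_layers, width):
    """Hypothetical external weight store: yields one layer's weights at a time."""
    for layer in range(num_layers):
        # Toy weights: a scaled identity matrix (layer 0 scales by 1, layer 1 by 2, ...).
        scale = layer + 1
        yield [[scale if i == j else 0 for j in range(width)] for i in range(width)]

def forward(activations, weight_stream):
    """Apply layers in order; only the current layer's weights are held at any time."""
    for weights in weight_stream:
        activations = [
            [sum(a * w for a, w in zip(row, col)) for col in zip(*weights)]
            for row in activations
        ]
        # The loop keeps no reference to earlier layers' weights, so they can be freed.
    return activations

activations = [[1, 2], [3, 4]]
output = forward(activations, layer_weight_stream(num_layers=3, width=2))
print(output)
```

In this sketch the memory held for weights is bounded by the largest single layer, which leaves the rest for activations.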

This ability to process long sequences puts within reach the use of powerful neural networks in a new range of applications. They include summarizing long documents, such as academic papers and legal documents, as well as applications in the biomedical industry, such as therapeutic drug discovery and genomics, where we want to learn patterns over long chains of atoms or protein molecules.

We suspect that there are many other fields where the ability to encompass a larger context will be hugely beneficial. Intuitively, being able to ingest a whole genome, an entire candidate drug molecule or a whole long document should give the model much more context to interpret the data. But until now, because it has not been practical to train models with inputs greater than a few thousand tokens, nobody has been able to explore these opportunities.

Thanks to Cerebras, all that is about to change.


[i] Iz Beltagy, Matthew E. Peters, Arman Cohan, “Longformer: The Long-Document Transformer”, arXiv, 2020 https://arxiv.org/abs/2004.05150

[ii] Liu Yang, Mingyang Zhang, Cheng Li, Michael Bendersky, Marc Najork, “Beyond 512 Tokens: Siamese Multi-depth Transformer-based Hierarchical Encoder for Long-Form Document Matching”, CIKM 2020 https://arxiv.org/abs/2004.12297

[iii] Sinong Wang, Belinda Z. Li, Madian Khabsa, Han Fang, Hao Ma, “Linformer: Self-Attention with Linear Complexity”, arXiv, 2020 https://arxiv.org/abs/2006.04768v3

[iv] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller, “Rethinking Attention with Performers”, ICLR 2021, https://arxiv.org/abs/2009.14794v3

[v] Rich Sutton, “The Bitter Lesson”, 2019. http://www.incompleteideas.net/IncIdeas/BitterLesson.html