Cerebras Wafer-Scale Cluster

Large Language Models are transforming the art of the possible in AI. But training these models is brutally difficult. Surprisingly, it's not the AI that's hard. It's the distributed computing. Spreading these models across thousands of small graphics processors is enormously complex, and performance always scales sublinearly.

Always? No longer. The Cerebras Wafer-Scale Cluster achieves GPU-Impossible™ performance: near-perfect linear scaling across millions of cores without the pain and suffering of distributed computing.

Our approach is radically simple: distributing work across 192 CS-2 nodes works exactly the same as on a single CS-2, and can be launched with a single keystroke from a Jupyter notebook on your laptop. A new era of easy access to extreme-scale AI has begun.

Read press release

Our cluster architecture – comprising hardware, software and fabric technology – enables us to quickly and easily build clusters of up to 192 Cerebras CS-2s. Because these clusters run purely data parallel, there is no distributed-computing complexity, and they deliver near-perfect linear scaling, making them ideal for extreme-scale AI.

UNPRECEDENTED LINEAR SCALING PERFORMANCE

Nobody has seen performance like this before.

It's OK to be shocked. Sublinear cluster scaling is the norm in distributed computing. Nobody has ever delivered linear performance scaling for training extreme-scale LLMs. And yet Cerebras does.

Why is the Cerebras Wafer-Scale Cluster so fast? Because we run strictly in data-parallel mode. No other AI hardware can do this for large NLP models: running purely data parallel is only possible if the entire neural network – both the compute and the parameter storage – fits on a single processor.

Cerebras pioneered wafer-scale integration, enabling us to build the largest processor ever made, and we invented techniques to store trillions of parameters off chip while delivering performance as if they were on chip. With 100 times more cores, 1,000 times more on-chip memory, and 10,000 times more memory bandwidth, the Wafer-Scale Engine isn't forced to break big problems into little pieces, distribute them among hundreds or thousands of small processors, and then reassemble the results into a final answer. Each WSE can support all the layers in every model, including the largest layers in the largest models. And our MemoryX technology enables us to store trillions of parameters off processor. The combination of wafer-scale integration and MemoryX allows us to run strictly data parallel for even the largest NLP networks.
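To make the data-parallel idea concrete, here is a minimal NumPy sketch (illustrative only, not Cerebras code): every replica holds the full model and sees a different shard of the batch, so the only cross-replica step is a gradient reduction.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(4,))  # full model weights, replicated on every node

def local_gradient(w, x_shard, y_shard):
    """Mean-squared-error gradient for a linear model on one data shard."""
    residual = x_shard @ w - y_shard
    return 2 * x_shard.T @ residual / len(y_shard)

# Simulate 4 replicas, each training on its own shard of the global batch.
x = rng.normal(size=(64, 4))
y = x @ np.array([1.0, -2.0, 0.5, 3.0])
shards = zip(np.split(x, 4), np.split(y, 4))

grads = [local_gradient(w, xs, ys) for xs, ys in shards]
g = np.mean(grads, axis=0)  # the reduce step across replicas
new_w = w - 0.1 * g         # identical weight update applied everywhere
```

Because every shard is the same size, averaging the per-replica gradients recovers exactly the full-batch gradient – which is why pure data parallelism scales so cleanly.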

Read linear scaling blog
EXTRAORDINARY EASE OF USE

Distribute the largest NLP networks over teraflops of compute with a single keystroke.

Traditionally, distributing work over a large cluster is punishingly difficult, even for expert teams. Running a single model on thousands of devices requires scaling memory, compute and bandwidth simultaneously – an interdependent distributed constraint problem.

Cerebras cluster users can launch massive training jobs from anywhere, including from a Jupyter notebook, using a Python workflow that makes interacting with teraflops of compute and millions of compute cores as simple as running an application on a laptop.

The Cerebras Software Platform, CSoft, automatically and invisibly allocates compute resources inside a CS-2 cluster with no performance overhead and no parallel programming. So go ahead. Dream big, train big. No experience required.

Read ease of use blog
Distributing training across clusters of CS-2 systems takes a single keystroke.

EASE OF DEPLOYMENT

Easy to Install, Easy to Support

Building a high-performance cluster of hundreds or thousands of graphics processing units requires specialized expertise to optimize the network topology, switches and processors, and to implement the 3D parallelism needed to distribute NLP models. This process can take months – and is so hard that completing it is publication-worthy.

The Cerebras Wafer-Scale Cluster is different. It's dead simple to install, scales to larger clusters, and requires no parallel programming as models change.

Learn about appliance mode
UNDER THE HOOD

Innovation from Wafer Scale to Cluster Scale

The Cerebras Wafer-Scale Cluster brings together several groundbreaking Cerebras technologies:

Weight Streaming

Weight streaming is an execution mode that allows us to store all the model weights externally and stream them onto each node in the cluster without suffering the traditional penalty associated with off-chip memory. Weight streaming enables the training of models two orders of magnitude larger than the current state of the art, with a simple scaling model.

White paper

MemoryX Weight Store

We store all the model weights off chip using our MemoryX technology and stream them onto the CS-2 system as they're needed. As the weights stream through the CS-2, the WSE performs the computation using its underlying dataflow mechanism, then streams the gradients back to the MemoryX, where the weights are updated.
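The flow can be sketched in a few lines of NumPy. This is a hedged toy model, not the Cerebras runtime: the names `WeightStore`, `stream_out` and `apply_gradient` are hypothetical stand-ins for MemoryX, and the compute node only ever holds one layer's weights while that layer is being computed.

```python
import numpy as np

rng = np.random.default_rng(1)

class WeightStore:
    """Hypothetical external weight store, standing in for MemoryX."""
    def __init__(self, layer_shapes, lr=0.01):
        self.weights = [rng.normal(scale=0.1, size=s) for s in layer_shapes]
        self.lr = lr

    def stream_out(self, i):
        return self.weights[i]               # weights flow out to the wafer

    def apply_gradient(self, i, grad):
        self.weights[i] -= self.lr * grad    # the update happens at the store

store = WeightStore([(8, 8), (8, 4)])
x = rng.normal(size=(2, 8))

# Forward pass: each ReLU layer's weights stream through one at a time.
activations = [x]
for i in range(2):
    w = store.stream_out(i)
    activations.append(np.maximum(activations[-1] @ w, 0))

# Backward pass (loss = mean of outputs, purely illustrative): gradients
# stream back to the store, which performs the weight update.
grad_out = np.ones_like(activations[-1]) / activations[-1].size
for i in reversed(range(2)):
    w = store.stream_out(i)
    grad_out = grad_out * (activations[i + 1] > 0)     # ReLU mask
    grad_w = activations[i].T @ grad_out               # gradient for this layer
    grad_out = grad_out @ w.T                          # propagate upstream
    store.apply_gradient(i, grad_w)
```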

Learn more

SwarmX Fabric

SwarmX is an innovative broadcast/reduce fabric that links the MemoryX to the CS-2 cluster. Using hardware-based replication, it broadcasts weights to all CS-2s and then reduces gradients as they are sent from the CS-2s back to the MemoryX. SwarmX is more than just an interconnect – it's an active component in the training process, purpose-built for data-parallel scale-out.
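A toy model of a broadcast/reduce fabric in the spirit of SwarmX might look like this (illustrative only, assuming a simple pairwise reduction tree): weights fan out to every node, and per-node gradients are summed stage by stage on the way back.

```python
import numpy as np

def broadcast(weights, n_nodes):
    """Every CS-2 node receives an identical copy of the weights."""
    return [weights.copy() for _ in range(n_nodes)]

def tree_reduce(grads):
    """Pairwise reduction: each fabric stage halves the in-flight tensors."""
    while len(grads) > 1:
        paired = [grads[i] + grads[i + 1]
                  for i in range(0, len(grads) - 1, 2)]
        if len(grads) % 2:                  # odd one out passes through
            paired.append(grads[-1])
        grads = paired
    return grads[0]

rng = np.random.default_rng(2)
w = rng.normal(size=(4,))
replicas = broadcast(w, 8)                          # fan-out to 8 CS-2s
node_grads = [rng.normal(size=(4,)) for _ in range(8)]
reduced = tree_reduce(node_grads)                   # fan-in toward MemoryX
```

The tree-reduced result equals the plain sum of all node gradients, which is exactly what the weight store needs to apply one global update.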

Learn more

Cerebras CS-2 System

At the heart of each CS-2 is our Wafer-Scale Engine, with 850,000 AI-optimized compute cores, 40 gigabytes of on-chip high-performance memory and 20 petabytes per second of memory bandwidth, optimized for the fine-grained, sparse linear algebra underlying neural networks. Clusters of CS-2s multiply the power of the world's fastest AI processor.

Learn more

Appliance Mode

Appliance mode enables a wafer-scale cluster to be managed as a training or inference service. Simply submit jobs and the cluster takes care of the rest. You never need to think about distributed computing, Slurm, Docker containers, latency or any other infrastructure. And the cluster automatically partitions resources to accommodate multiple concurrent users.
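The partitioning behavior can be pictured with a tiny scheduler sketch (purely illustrative, not the actual Cerebras scheduler; `schedule` and the greedy FIFO policy are assumptions for the example): jobs request some number of CS-2 nodes, the appliance grants what fits and queues the rest.

```python
from collections import deque

def schedule(total_nodes, requests):
    """Greedy FIFO partitioning: grant each request while nodes remain."""
    free = total_nodes
    running, queued = [], deque()
    for name, want in requests:
        if want <= free:
            running.append((name, want))
            free -= want
        else:
            queued.append((name, want))
    return running, list(queued), free

# Three concurrent users share a 16-node cluster.
running, queued, free = schedule(16, [("alice", 8), ("bob", 8), ("carol", 4)])
```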

Learn more