Cluster-Scale Performance on a Single Chip

Programming a cluster to scale deep learning is painful. It typically requires dozens to hundreds of engineering hours and remains a practical barrier that keeps many from realizing the value of large-scale AI in their work.

On a traditional GPU cluster, ML researchers – typically using a special version of their ML framework – must figure out how to distribute their model while still achieving some fraction of their convergence and performance targets. They must navigate a complex hierarchy of per-processor memory capacity, bandwidth, interconnect topology, and synchronization, all while running a myriad of hyperparameter and tuning experiments along the way. Worse, the resulting implementation is brittle to change, and the time spent on it only delays overall time to solution.

With the WSE, there is no bottleneck. We give you a cluster-scale AI compute resource with the programming ease of a single desktop machine using stock TensorFlow or PyTorch. Spend your time in AI discovery, not cluster engineering.

850K cores

Compute Designed for AI

Each core on the WSE is independently programmable and optimized for the tensor-based, sparse linear algebra operations that underpin neural network training and inference for deep learning, enabling it to deliver maximum performance, efficiency, and flexibility.

The WSE-2 packs 850,000 of these cores onto a single processor. With that, any data scientist can run state-of-the-art AI models and explore innovative algorithmic techniques at record speed and scale, without ever touching distributed scaling complexities.

1,000x more capacity

Memory Capacity and Bandwidth: Why Choose?

Unlike traditional devices, in which the working cache memory is tiny, the WSE-2 takes 40 GB of super-fast on-chip SRAM and spreads it evenly across the entire surface of the chip. This gives every core single-clock-cycle access to fast memory at extremely high bandwidth – 20 PB/s. This is 1,000x more capacity and 9,800x greater bandwidth than the leading GPU.
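To put those totals in per-core terms, here is a rough back-of-the-envelope sketch. It assumes the 40 GB and 20 PB/s figures are decimal units and are divided exactly evenly across all 850,000 cores; the variable names are illustrative only, and the results are estimates rather than published per-core specifications.

```python
# Back-of-the-envelope per-core figures, assuming an exactly even split
# of the quoted totals across all cores (an illustrative assumption).
TOTAL_SRAM_BYTES = 40 * 10**9               # 40 GB on-chip SRAM (decimal units assumed)
TOTAL_BANDWIDTH_BYTES_PER_S = 20 * 10**15   # 20 PB/s aggregate memory bandwidth
CORES = 850_000                             # core count quoted for the WSE-2

sram_per_core_bytes = TOTAL_SRAM_BYTES / CORES
bandwidth_per_core = TOTAL_BANDWIDTH_BYTES_PER_S / CORES

print(f"SRAM per core:      {sram_per_core_bytes / 1e3:.1f} KB")
print(f"Bandwidth per core: {bandwidth_per_core / 1e9:.1f} GB/s")
```

Under these assumptions, each core sits next to roughly 47 KB of SRAM with about 23.5 GB/s of memory bandwidth.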

This means no trade-off is required. You can run large, state-of-the-art models and real-world datasets entirely on a single chip. Minimize wall clock training time and achieve real-time inference within latency budgets, even for large models and datasets.

220 Pb/s processor

High Bandwidth. Low Latency.

Deep learning requires massive communication bandwidth between the layers of a neural network. The WSE uses an innovative high-bandwidth, low-latency communication fabric that connects processing elements on the wafer at tremendous speed and power efficiency. Dataflow traffic patterns between cores and across the wafer are fully configurable in software.

The WSE-2 on-wafer interconnect eliminates the communication slowdown and inefficiencies of connecting hundreds of small devices via wires and cables. It delivers an incredible 220 Pb/s processor-processor interconnect bandwidth. That’s more than 45,000x the bandwidth delivered between graphics processors.

The Wafer-Scale Advantage

The Wafer-Scale Engine delivers unparalleled performance with no trade-offs

Specification      WSE-2               A100                 Cerebras Advantage
Chip size          46,225 mm²          826 mm²              56 X
Cores              850,000             6,912 + 432          123 X
On-chip memory     40 Gigabytes        40 Megabytes         1,000 X
Memory bandwidth   20 Petabytes/sec    1.6 Terabytes/sec    12,733 X
Fabric bandwidth   220 Petabits/sec    4.8 Terabits/sec     45,833 X
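The advantage column can be approximately reproduced by dividing the WSE-2 figure by the A100 figure in each row. The sketch below does that with the rounded values shown in the table, assuming decimal unit prefixes. Two caveats: the 123 X core ratio matches division by the 6,912 figure alone (dividing by 6,912 + 432 would give about 116), and the rounded bandwidth inputs give 12,500 rather than the listed 12,733 X, which presumably reflects less-rounded source values.

```python
# Figures copied from the comparison table, converted to base units.
# Decimal prefixes are assumed (e.g. 1 Gigabyte = 10**9 bytes).
SPECS = {
    # name: (WSE-2 value, A100 value)
    "Chip size (mm^2)": (46_225, 826),
    "Cores": (850_000, 6_912),  # ratio taken against the 6,912 figure only
    "On-chip memory (bytes)": (40 * 10**9, 40 * 10**6),
    "Memory bandwidth (bytes/s)": (20 * 10**15, 16 * 10**11),  # 1.6 TB/s
    "Fabric bandwidth (bits/s)": (220 * 10**15, 48 * 10**11),  # 4.8 Tb/s
}


def advantage(wse2: float, a100: float) -> float:
    """Return how many times larger the WSE-2 figure is than the A100 figure."""
    return wse2 / a100


ratios = {name: advantage(*pair) for name, pair in SPECS.items()}

for name, ratio in ratios.items():
    print(f"{name:<28}{ratio:>10,.0f} X")
```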

 

Big chips process more data more quickly and deliver answers in less time. The table here compares WSE-2 specifications with those of a contemporary graphics processor, illustrating how chip size translates into the performance metrics that matter for deep learning acceleration: compute cores, processor-processor interconnect bandwidth, on-chip memory capacity, and bandwidth to main model memory.