August 17, 2021
In Chip, Machine Learning, System, Blog

An AI Chip With Unprecedented Performance To Do the Unimaginable

Dhiraj Mallick, VP Engineering & Business Development | August 16, 2021

AI accelerator chips have made machine learning a reality in nearly every industry. With the unprecedented pace of growth in compute demand, model size and data, the need for higher-performance, more efficient solutions is growing rapidly. With Moore’s Law no longer keeping up with this demand, AI accelerators must innovate at the system and algorithmic level to satisfy the anticipated needs of AI workloads over the next several years.

Cerebras Systems has built the fastest AI accelerator in the industry, based on the largest processor ever made, and it is easy to use. The system is built around a 7nm device that contains 850,000 specialized AI compute cores on a single wafer-scale chip. This single-wafer compute engine is known as the Wafer Scale Engine 2 (WSE-2).

Cluster-Scale In a Single AI Chip

The WSE-2 is by far the largest silicon product available, with a total silicon area of 46,225 mm². It uses the largest square of silicon that can be cut from a 300 mm diameter wafer. That square contains 84 die of 550 mm² each, stitched together with proprietary interconnect layers to form a continuous compute fabric. By building this interconnect on a single piece of silicon, we connect the equivalent of 84 die while significantly lowering the communication overhead and the number of physical connections within the system.
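
As a quick sanity check on these figures (simple arithmetic on the rounded numbers quoted above, not additional vendor data), the per-die and total areas agree to within the rounding of the per-die figure:

    # Back-of-envelope check of the quoted die and wafer-area figures.
    die_count = 84
    die_area_mm2 = 550                       # rounded per-die area from the text
    print(die_count * die_area_mm2)          # 46,200 mm^2, consistent with the ~46,225 mm^2 total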

By connecting all 850,000 AI cores in this manner, users of our system (the CS-2) get an unprecedented 220 Pb/s of aggregate fabric bandwidth. Our proprietary interconnect and on-silicon wires lower communication overhead and deliver significantly better performance per watt than moving large AI workloads between discrete chips. This wafer-scale technology also provides 40 GB of on-“chip” (on-wafer) memory, allowing local storage of intermediate results that would normally be kept off chip, thereby reducing access time.

Designed with Sparsity In Mind

Deep neural network computations often contain a large number of zeros, which creates an opportunity to reduce the number of computations: multiplying any number by zero gives zero, and adding zero to an accumulated result has no effect, so a multiply-accumulate operation can be skipped entirely if either of its operands is zero. Tensors containing many zeros are referred to as sparse tensors. The WSE-2 is designed to harvest this sparsity from sparse tensors and vectors. In comparison, traditional GPU architectures perform the unnecessary computations anyway, wasting power and computational performance.
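
To make the saving concrete, here is a minimal sketch (plain Python, not Cerebras kernel code) of a dot product that skips multiply-accumulate steps whenever an operand is zero; the result is identical, only the amount of work changes:

    # Illustrative only: skipping MACs with a zero operand leaves the result unchanged.
    def dense_dot(weights, activations):
        acc = 0.0
        for w, x in zip(weights, activations):
            acc += w * x                      # every pair is computed, zeros included
        return acc

    def sparse_dot(weights, activations):
        acc, macs = 0.0, 0
        for w, x in zip(weights, activations):
            if w == 0.0 or x == 0.0:          # zero operand: the product is zero and
                continue                      # adding it has no effect, so skip the MAC
            acc += w * x
            macs += 1
        return acc, macs

    weights     = [0.5, 0.0, -1.25, 0.75, 0.0, 0.25]
    activations = [1.0, 2.5,  0.0,  0.0,  3.0, 0.5]
    print(dense_dot(weights, activations))    # 0.625
    print(sparse_dot(weights, activations))   # (0.625, 2): only 2 of 6 MACs were needed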

The WSE-2 harvests sparsity by taking advantage of Cerebras’ dataflow architecture and fine-grained compute engine. Compute cores communicate back and forth with their neighbors, and a core sending data filters out any zero values that would otherwise be passed along. As a result of this dataflow protocol, the receiving core never performs these unnecessary calculations; it simply skips forward to the next useful computation.
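
A toy model of that division of labor (an illustrative Python sketch; the real protocol is implemented in hardware): the sender never puts zeros on the fabric, so the receiver only computes on values that actually arrive:

    # Sender-side zero filtering in a dataflow fabric, sketched in Python.
    def sending_core(activations):
        """Emit only non-zero values, tagged with their position."""
        for i, x in enumerate(activations):
            if x != 0.0:
                yield i, x                    # zeros are filtered out here

    def receiving_core(weights, incoming):
        """Accumulate w[i] * x only for the values that arrive."""
        acc = 0.0
        for i, x in incoming:
            acc += weights[i] * x             # no work is done for zero activations
        return acc

    weights     = [0.5, -1.25, 0.75, 0.25]
    activations = [1.0,  0.0,  0.0,  0.5]
    print(receiving_core(weights, sending_core(activations)))   # 0.625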

Harvesting sparsity saves power and significantly improves performance. Operations such as ReLU in the forward pass and max pooling in the backward pass of training can be used to introduce sparsity, and small weights that are close to zero can be rounded to zero without loss of accuracy. By appropriate use of such functions, Cerebras math kernels can exploit the WSE-2’s sparsity harvesting features.
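
To give a rough sense of how much sparsity these simple steps create, here is a NumPy sketch with an illustrative pruning threshold (not a Cerebras kernel):

    import numpy as np

    rng = np.random.default_rng(0)

    # ReLU zeroes out roughly half of a zero-mean activation tensor.
    activations = rng.standard_normal((1024, 1024))
    relu_acts = np.maximum(activations, 0.0)
    print(np.mean(relu_acts == 0.0))          # ~0.50 activation sparsity

    # Rounding near-zero weights to exactly zero adds weight sparsity
    # with little effect on the values that matter.
    weights = rng.standard_normal((1024, 1024))
    threshold = 0.05                          # illustrative cutoff chosen for this example
    pruned = np.where(np.abs(weights) < threshold, 0.0, weights)
    print(np.mean(pruned == 0.0))             # ~0.04 weight sparsity at this cutoff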

Pushing the Limits of AI On-Chip Memory

The WSE-2’s 40 gigabytes of on-chip memory is divided into 48 kB sub-arrays, one associated with each of the 850,000 compute cores. This local storage is sufficient to keep reusable activations, weights, intermediate results and program code close at hand. The total memory bandwidth on the WSE-2 is 20 petabytes per second, orders of magnitude more than could be achieved with typical off-chip memory architectures. This close coupling of memory and compute keeps data as local as possible to the compute engine, driving up utilization and performance and significantly reducing the latency overhead of moving data between caches and off-chip memories.
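
The per-core figures follow directly from the totals (back-of-envelope arithmetic on the rounded numbers quoted above):

    # Dividing total on-wafer memory and bandwidth across the cores.
    total_memory_bytes = 40e9                   # 40 GB of on-wafer memory
    total_bandwidth_bytes = 20e15               # 20 PB/s of memory bandwidth
    cores = 850_000
    print(total_memory_bytes / cores / 1e3)     # ~47 kB per core, in line with the 48 kB sub-arrays
    print(total_bandwidth_bytes / cores / 1e9)  # ~24 GB/s of memory bandwidth per core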

It also saves the power needed to move data on and off chip; on-chip memory is a significant contributor to the performance-per-watt benefits of the WSE-2. Our massive on-wafer memory bandwidth also enables full performance at all BLAS levels. While GPUs are typically used for matrix-matrix operations, our engine is also optimized for matrix-vector and vector-vector operations. This gives us a significant performance advantage in both training and real-time inference.
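
The reason bandwidth matters so much at the lower BLAS levels is arithmetic intensity: a matrix-vector product performs only about two floating-point operations per element it loads, so its speed is set by memory bandwidth rather than by peak FLOPS. A rough comparison, assuming fp16 operands and that each operand is read or written once:

    # Approximate arithmetic intensity (FLOPs per byte moved) for two BLAS levels.
    BYTES = 2                                   # fp16

    def gemm_intensity(n):
        flops = 2 * n**3                        # n x n matrix-matrix multiply
        bytes_moved = 3 * n * n * BYTES         # read A and B, write C
        return flops / bytes_moved

    def gemv_intensity(n):
        flops = 2 * n**2                        # n x n matrix-vector multiply
        bytes_moved = (n * n + 2 * n) * BYTES   # read A and x, write y
        return flops / bytes_moved

    print(gemm_intensity(4096))                 # ~1365 FLOPs/byte: easily compute-bound
    print(gemv_intensity(4096))                 # ~1 FLOP/byte: memory-bandwidth-bound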

Built for Faster Time-to-Solution

The fabric between compute cores is uniform across the entire 46,225 mm² of the WSE-2. Each core has links to its north, east, south and west neighbors, and at die boundaries the fabric is continuous across the boundary. This uniformity is important for software: unlike with traditional AI chips, kernel programmers and data scientists do not need to consider where on the chip their code will be placed. Because fabric bandwidth is uniform between all compute cores, user code does not have to be optimized for its placement on the chip, which significantly shortens the user’s time-to-solution. The aggregate fabric bandwidth of 220 Pb/s is orders of magnitude larger than would be achievable with off-chip interfaces. For comparison, this is equivalent to over 2 million 100 Gb Ethernet links.
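
The Ethernet comparison is straightforward arithmetic:

    # Aggregate fabric bandwidth expressed in 100 Gb Ethernet links.
    fabric_bits_per_s = 220e15                  # 220 Pb/s
    link_bits_per_s = 100e9                     # one 100 Gb Ethernet link
    print(fabric_bits_per_s / link_bits_per_s)  # 2,200,000 links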

On-wafer wires are significantly more power- and latency-efficient than high-speed external interfaces. The massive on-wafer bandwidth enables us to map a single problem to the full wafer, including physically mapping single layers across multiple die and mapping multiple layers across the entire wafer. The Cerebras architecture can achieve very high utilization even on large matrices, and sustained throughput does not drop off with increasing model size, as it does on today’s machines. Our architecture allows us to almost perfectly overlap compute and communication, so we are far less susceptible to data movement overheads.

The Cerebras Advantage

In summary, Cerebras’ WSE-2 delivers unprecedented levels of compute, memory and interconnect bandwidth on a single, wafer-scale piece of silicon, and sparsity harvesting makes the most of that compute capability. The outcome is enormous performance in an integrated chip without bottlenecks, in which every node is programmable and independent of the others. With this revolutionary approach to AI, you get to reduce the cost of curiosity.

The net result of our innovation to date is unmatched utilization, performance levels and scaling properties that were previously unthinkable. And we’re just getting started — we have an exciting roadmap of Wafer Scale Engines that will deliver even more improvements over our market-leading WSE-2.

Interested in learning more? Sign up for a demo!

At the Hot Chips 33 conference, our co-founder, Sean Lie, unveiled our exciting new weight streaming technology, which extends the Cerebras architecture to extreme-scale AI models. Learn more here.
