Updated, April 13, 2022:
A new paper by Mathias Jacquelin from our Kernel team and TotalEnergies’ Mauricio Araya-Polo and Jie Meng presents more detail on this remarkably efficient stencil algorithm, along with updated performance results. The CS-2 delivers more than 200x the performance of an NVIDIA® A100 GPU, twice the speedup previously announced.

Earlier today, we announced that TotalEnergies Research & Technology USA chose the Cerebras CS-2 system to accelerate its multi-energy research (press release). In this article, I’ll look closer at why this first publicly announced deployment of the Cerebras CS-2 in the energy sector is so significant.

The energy business is full of big decisions. Where should we build a wind farm? Where should we drill to find oil or natural gas deposits? Is this site going to work for carbon sequestration? Which novel battery chemistry should we research? In every case, a “go” decision commits massive resources. And it won’t be clear that the correct decision was made until those resources are in place. Furthermore, those decisions can’t wait; energy is a competitive business, and the race to find the best sites to generate energy, or the next breakthrough energy storage technology, is intense. The stakes are high.

It’s no wonder that companies like TotalEnergies Research & Technology USA use high-performance computing (HPC) to create mind-bogglingly detailed models to help make those decisions. How much HPC? TotalEnergies’ Pangea III supercomputer is ranked the 29th most powerful in the world on the Top500 list.

Diminishing returns

That’s some serious horsepower, but TotalEnergies is looking for more. A big challenge, though, is one of diminishing returns. Building a supercomputer twice as big using more of the same building blocks doesn’t deliver twice the performance. Splitting workloads into pieces that can run in parallel is very difficult, and the communication overhead between the building blocks, known as nodes, eventually dominates. This is known as the scaling problem.
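
To see why, consider a toy strong-scaling model, sketched below in Python. The compute and communication figures are illustrative assumptions, not measurements of any real machine, but they capture the pattern: past a certain point, adding nodes buys almost nothing, and can even make things slower.

```python
# Illustrative strong-scaling model: one fixed-size job split across N nodes.
# All numbers are hypothetical assumptions for illustration, not measurements.

def speedup(nodes, compute_time=100.0, comm_per_node=0.05):
    """Speedup over one node when each extra node adds a fixed slice of
    halo-exchange communication overhead to every time step."""
    total = compute_time / nodes + comm_per_node * (nodes - 1)
    return compute_time / total

for n in (1, 64, 256, 1024, 4096):
    print(f"{n:5d} nodes -> {speedup(n):6.1f}x speedup")
# The speedup climbs to roughly 22x around 45 nodes, then falls: the shrinking
# compute slice is swamped by the growing communication term.
```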

Equipping nodes with accelerators – special purpose devices like graphics processing units (GPUs), which are designed to perform certain tasks very fast – helps. But only so much: GPUs, designed to excel at the dense matrix math underlying, well, graphics, aren’t a good match for many HPC and AI algorithms. The scaling problem remains. TotalEnergies wanted to find a better way.

Finding a better way

We were thrilled when TotalEnergies asked us to participate in their study to evaluate alternative hardware architectures. As a benchmark, their researchers chose a piece of their seismic modelling code that turns petabytes of data generated by a mesh of seismic sensors into three-dimensional models that extend far beneath the earth’s surface. This task is called (deep breath) “seismic wave propagation modelling using stencil-based finite difference algorithms”. Just one run of the full code on a supercomputer can take weeks. And it is standard practice to run these simulations many times with small changes to the input conditions, to build confidence that there isn’t some butterfly effect that leads to false results. (Remember: fleets of ships may or may not be launched based on these results.) Clearly, reducing the run time from weeks to hours would be a huge benefit, enabling more runs, more experiments and higher confidence in the results.
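
The production code is proprietary, but the heart of any stencil-based finite-difference wave propagation solver looks something like the minimal sketch below. Everything here (grid size, velocity model, stencil order) is an illustrative assumption, not TotalEnergies’ code.

```python
import numpy as np

# Minimal 2D acoustic wave-propagation sketch: second-order finite differences
# in space and time. All sizes and physical values are illustrative only.
nx, nz = 200, 200               # grid points
dx, dt = 10.0, 0.001            # grid spacing (m), time step (s)
nt = 500                        # number of time steps
c = np.full((nz, nx), 3000.0)   # assumed constant P-wave velocity (m/s)

prev = np.zeros((nz, nx))       # wavefield at t - dt
curr = np.zeros((nz, nx))       # wavefield at t
curr[nz // 2, nx // 2] = 1.0    # point "source" to kick off the simulation

for _ in range(nt):
    # 5-point Laplacian stencil: each interior point reads its 4 neighbours.
    lap = (curr[1:-1, :-2] + curr[1:-1, 2:] +
           curr[:-2, 1:-1] + curr[2:, 1:-1] -
           4.0 * curr[1:-1, 1:-1]) / dx**2
    nxt = np.copy(curr)
    nxt[1:-1, 1:-1] = (2.0 * curr[1:-1, 1:-1] - prev[1:-1, 1:-1]
                       + (c[1:-1, 1:-1] * dt)**2 * lap)
    prev, curr = curr, nxt
# Production codes use 3D grids, high-order stencils, absorbing boundaries and
# realistic velocity models, which is why full runs can take weeks.
```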

Of course, now more than ever, energy is about more than fossil fuels. As Dr. Vincent Saubestre, CEO and President of TotalEnergies Research & Technology USA, put it: “TotalEnergies’ roadmap is crystal clear: more energy, less emissions.” Fortunately, the same finite-difference math used for this benchmark also underpins a wide range of other algorithms, such as computational fluid dynamics (CFD), molecular dynamics and finite element analysis, that are used for wind field modelling, materials research and long-term geology simulation for CO2 reservoirs. This makes the benchmark a useful tool to compare the performance of different accelerators.

The results are in

The Cerebras CS-2 system outperformed a modern GPU on the benchmark by more than 200 times! (Updated April 13, 2022) As a result, a CS-2 system is now installed at TotalEnergies’ research facility in Houston, Texas, ready to help TotalEnergies researchers across a wide range of important projects.

Why did the CS-2 perform so strongly? Because, even compared to other novel accelerators, it takes a completely different, and much better, approach. Instead of adding many small accelerators to compute nodes, we replace entire racks of conventional nodes and miles of networking cables with one network-attached system, powered by one massive microchip, the Wafer-Scale Engine (WSE), which features an astonishing 850,000 cores. We call this approach wafer-scale integration, and it’s a perfect match for TotalEnergies’ finite-difference benchmark.

Why? Because this workload demands very high-speed data movement between memory and compute cores, which is exactly what limits the performance of conventional systems: they are said to be “memory-bound” or “communication-bound”, because they have to wait for data to travel between cores and memory across a circuit board or across a network.

The Cerebras architecture puts all the processors on the same chip, so data spends less time in flight. Even the fastest external network is a thousand times slower than the single clock cycle it takes to move from core to core on the WSE. And those processors don’t waste time accessing data – the entirety of each core’s memory is one clock cycle away.

The result is that communication is no longer a bottleneck. Our solution is, uniquely, “compute-bound”, which is a magic phrase to any HPC developer.
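
What does “memory-bound” mean in practice? A quick arithmetic-intensity estimate for the toy stencil above makes it concrete. The flop and byte counts are rough, and the peak figures are round numbers assumed for a generic accelerator, not a benchmark of any particular device.

```python
# Back-of-the-envelope roofline check for the 5-point stencil sketched above.
# All hardware numbers are round illustrative figures, not measurements.

flops_per_point = 8            # adds/multiplies per grid-point update (approx.)
bytes_per_point = 3 * 8        # read curr & prev, write next (float64, no reuse)
intensity = flops_per_point / bytes_per_point   # ~0.33 flop per byte

peak_flops = 10e12             # assumed ~10 Tflop/s peak compute
peak_bw = 1.5e12               # assumed ~1.5 TB/s memory bandwidth

achievable = min(peak_flops, intensity * peak_bw)
print(f"arithmetic intensity: {intensity:.2f} flop/byte")
print(f"achievable: {achievable/1e12:.2f} Tflop/s "
      f"of a {peak_flops/1e12:.0f} Tflop/s peak")
# At ~0.33 flop/byte, bandwidth caps this kernel at ~0.5 Tflop/s, a few percent
# of peak. Keeping every operand one clock cycle away removes that cap.
```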

More than performance

The Cerebras solution is about more than pure performance, though. The fastest accelerator in the world isn’t useful if it doesn’t have a complete, easy-to-use software stack. TotalEnergies and Cerebras engineers wrote the benchmark code using the new Cerebras Software Language (CSL), which is part of the Cerebras Software Development Kit. Developers can use the Cerebras SDK to create custom kernels for their standalone applications or modify the kernel libraries provided for their unique use cases. The SDK enables developers to harness the power of wafer-scale computing with the tools and software used by the Cerebras development team. And, of course, the SDK includes code samples, tutorials, getting started guides, a language guide, and information on how to use tools like the included debugger.

Cerebras systems are doing important AI and HPC work for customers across many fields, including Argonne National Laboratory, Lawrence Livermore National Laboratory, Pittsburgh Supercomputing Center, EPCC, Tokyo Electron Devices, and GlaxoSmithKline. We’re excited to have TotalEnergies as our first customer in the energy industry.

Diego Klahr, VP of Computational Science and Engineering at TotalEnergies Research & Technology USA, poses with the Cerebras Wafer-Scale Engine during a visit to Cerebras Systems in Sunnyvale, California.
Learn more

Mathias Jacquelin, Mauricio Araya-Polo and Jie Meng, “Massively scalable stencil algorithm”, submitted to Supercomputing 2022, https://arxiv.org/abs/2204.03775

Read the press release here.

Mauricio Araya, TotalEnergies’ Senior Research Manager HPC and ML, shared more detail of the study in a presentation titled “Benchmarking Considerations for Current Energy HPC Systems” at the 2022 Energy HPC Conference at Rice University in Houston, Texas.

For more information about the Cerebras CS-2 system and its applications in energy, please visit https://cerebras.net/industries/energy.

To learn more about the Cerebras SDK, please read our technical brief here.

If you would like to evaluate how the CS-2 system can benefit your organization, we encourage you to get in touch here.

Rebecca Lewington, Technology Evangelist | March 2, 2022