Pipelined Execution

Cerebras Systems offers two execution modes: pipelined and weight streaming. In pipelined execution mode, all of the model's weights reside on a single Wafer-Scale Engine (WSE), so the memory capacity of the WSE limits the model size. This approach works well for small to medium-sized models. (Small is a relative term; models with up to about a billion parameters work in pipelined mode, and models of that size were state-of-the-art only a few years ago.) In pipelined execution mode, the Cerebras Software Platform (CSoft) uses pipeline parallelism: different areas of the WSE are responsible for different sub-computations of the model. Samples are fed one by one through this pipeline of computations, so that at any given time multiple sub-computations are executing simultaneously, each acting on a different sample in a different physical area of the fabric.
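The scheduling idea described above can be made concrete with a small sketch. The code below is purely illustrative (it is not a Cerebras API): it computes, for a hypothetical pipeline of stages, which (stage, sample) pairs are active at each timestep. Once the pipeline fills, every stage is busy on a different sample at the same timestep.

```python
# Toy model of layer-pipelined execution. Stages stand in for the
# physical regions of the fabric that each hold a sub-computation of
# the model; samples flow through them one per timestep.

def pipeline_schedule(num_stages, num_samples):
    """Return, per timestep, the list of active (stage, sample) pairs."""
    schedule = []
    total_steps = num_stages + num_samples - 1  # fill + drain included
    for t in range(total_steps):
        active = []
        for stage in range(num_stages):
            sample = t - stage  # each stage lags the previous by one step
            if 0 <= sample < num_samples:
                active.append((stage, sample))
        schedule.append(active)
    return schedule

for t, active in enumerate(pipeline_schedule(num_stages=3, num_samples=5)):
    print(f"t={t}: " + ", ".join(f"stage{s}<-sample{x}" for s, x in active))
```

Running this shows the fill phase, a steady state in which all three stages work on three different samples simultaneously, and a drain phase at the end.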

Pipelined execution is an important component of Cerebras Systems’ technology stack. It can efficiently perform standard layer-sequential, data-parallel execution at smaller neural network batch sizes than clusters of traditional accelerators, and it also enables more flexible layer-pipelined execution modes. Layer-sequential and layer-pipelined execution can be combined with data parallelism at many different levels of granularity, allowing today’s architectures to be trained faster without retuning batch sizes and learning rates. For small networks, it allows combining both layer and batch parallelism, while the largest networks can use layer-sequential execution efficiently at a neural network batch size of one. Midsize networks can be executed in a “block-sequential” mode, in which one block of layers is evaluated at a time, with layer-pipelined execution within each block. This gives practitioners the freedom to train networks of all shapes and sizes efficiently, be they deep or shallow, wide or narrow, large or small.
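The block-sequential mode can be sketched in the same toy style. In this illustrative (non-Cerebras) example, layers are partitioned into blocks of a hypothetical size; blocks execute one after another, while samples are layer-pipelined through the layers inside each block.

```python
# Toy model of "block-sequential" execution: sequential across blocks,
# layer-pipelined within each block. Block size and trace format are
# hypothetical, chosen only to make the scheduling idea concrete.

def block_sequential_trace(num_layers, block_size, num_samples):
    """Return (block, layer, sample) tuples in execution order."""
    blocks = [list(range(i, min(i + block_size, num_layers)))
              for i in range(0, num_layers, block_size)]
    trace = []
    for b, layers in enumerate(blocks):
        # Within a block, pipeline samples through the block's layers.
        steps = len(layers) + num_samples - 1
        for t in range(steps):
            for depth, layer in enumerate(layers):
                sample = t - depth
                if 0 <= sample < num_samples:
                    trace.append((b, layer, sample))
    return trace

trace = block_sequential_trace(num_layers=4, block_size=2, num_samples=3)
```

With `block_size` equal to `num_layers` this degenerates to fully layer-pipelined execution, and with `block_size=1` it becomes layer-sequential, which is why block-sequential mode sits between the two.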