Pipelined Backpropagation

Pipelined backpropagation is a method that avoids the fill and drain overhead of standard layer-pipelined execution. In pipelined backpropagation, the weights are updated for every sample, without ever draining the pipeline. Because updates land while other samples are still in flight, a given sample's forward and backward passes can use slightly different weights. Cerebras researchers analyzed the implications of this and designed a simple momentum-based technique that corrects these differences and enables training to full accuracy with pipelined backpropagation.
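To make the fill and drain overhead concrete, here is a minimal sketch under a simplified model that is an assumption of this example, not something stated above: a pipeline of `num_stages` stages processes `num_microbatches` micro-batches per weight update, every stage takes one time unit per micro-batch per pass, and the pipeline is drained before each update. The function name and parameters are hypothetical.

```python
def fill_drain_utilization(num_stages: int, num_microbatches: int) -> float:
    """Fraction of time each stage does useful work in a fill-and-drain pipeline.

    Simplified model (an assumption of this sketch): unit-time stages, and the
    pipeline is fully drained before every weight update. The forward pass then
    takes (num_microbatches + num_stages - 1) steps, and so does the backward
    pass, while each stage is busy for only num_microbatches steps per pass.
    """
    if num_stages < 1 or num_microbatches < 1:
        raise ValueError("num_stages and num_microbatches must be >= 1")
    total_steps = num_microbatches + num_stages - 1
    return num_microbatches / total_steps
```

Under this model, a 16-stage pipeline that drains after every single micro-batch keeps each stage busy only 1/16 of the time. Pipelined backpropagation sidesteps this by never draining, at the cost of the weight mismatch described above.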

Fine-grained pipelined backpropagation (PB) has several advantages for hardware over traditional training with batch-parallel stochastic gradient descent. These properties can yield large efficiency improvements on hardware architectures that can properly exploit them, such as coarse-grained reconfigurable arrays. However, traditional PB training can suffer from accuracy degradation and instability compared to standard training, due to delayed gradients and weight inconsistency. Combined with an appropriate choice (or scaling) of hyperparameters, small micro-batches reduce the negative effects of gradient delay and weight inconsistency. Small micro-batches also reduce memory requirements that could otherwise be excessive.
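The interaction between gradient delay and hyperparameters can be illustrated with a toy model. The sketch below is not the mitigation described in this text; it is plain gradient descent on f(w) = 0.5 * w**2 where each gradient is evaluated at weights that are `delay` steps old, as an assumed stand-in for a sample that entered the pipeline several updates ago. The function name, the quadratic objective, and the chosen learning rates are all hypothetical.

```python
from collections import deque


def delayed_sgd(lr: float, delay: int, steps: int = 200, w0: float = 1.0) -> list:
    """Run gradient descent on f(w) = 0.5 * w**2 with stale gradients.

    The gradient applied at step t is computed from the weights at step
    t - delay (toy assumption standing in for pipeline gradient delay).
    Returns the full weight trajectory, starting with w0.
    """
    w = w0
    # Sliding window of the last (delay + 1) weights; index 0 is the oldest.
    history = deque([w0] * (delay + 1), maxlen=delay + 1)
    trajectory = [w]
    for _ in range(steps):
        stale_w = history[0]   # weights the in-flight sample actually saw
        grad = stale_w         # derivative of 0.5 * w**2 at the stale weights
        w = w - lr * grad
        history.append(w)      # oldest entry is dropped automatically
        trajectory.append(w)
    return trajectory


def tail_magnitude(trajectory: list, window: int = 10) -> float:
    """Largest |w| over the last `window` steps of a trajectory."""
    return max(abs(w) for w in trajectory[-window:])
```

In this toy setting, a learning rate of 0.5 converges with no delay but oscillates with growing amplitude at a delay of 4 steps, while a learning rate of 0.1 converges at the same delay. That is only an illustration of why hyperparameters may need rescaling when gradients are delayed; real networks behave less cleanly.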

Unlike traditional training, fine-grained pipelined backpropagation can be efficient with small micro-batch sizes when combined with persistent kernels that do not need to amortize weight loading. A good choice of normalization can also significantly aid PB training. Our neural network experiments with PB confirm these advantages. We find that the combined mitigation outperforms existing mitigation strategies, allowing our PB training to match the reference accuracy on both ImageNet and CIFAR-10 with minimal overhead and without the need for additional hyperparameter tuning. With our methods, PB is a promising alternative to traditional training. Future hardware architectures could reap significant efficiency gains from small-batch, fine-grained pipelined backpropagation.