My colleagues and I were excited to present **Online Normalization** at NeurIPS 2019. Online Normalization is a batch-agnostic normalization technique which demonstrates similar, and in some cases improved, performance compared to other state-of-the-art normalization techniques.

In our paper “Online Normalization for Training Neural Networks” we describe how this technique works. Even though it operates at batch size 1, Online Normalization approximates statistics using information from multiple samples. This is accomplished by using exponential moving estimates of the mean and variance. Unlike similar techniques, Online Normalization uses a control process motivated by a geometric interpretation of normalization to differentiate through the corresponding expectations.

Before we delve into the details of our new approach, let’s take a step back and remind ourselves what normalization is, why it matters, and how existing techniques normalize hidden activations.

## 1 Normalization

In deep learning, normalization of activations is known to accelerate learning [1, 2]. One hypothesis for how it does this is that normalization eliminates internal covariate shift [1].

Others conjecture that normalization speeds up training by transforming the optimization space to be better conditioned, thereby easing optimization [2] (Figure 1).

While ML practitioners have differing ideas about normalization, it is generally undisputed that it does, in fact, accelerate neural network training. Normalization, as defined by [1], is a process that z-scores data by subtracting out the distribution mean and dividing out the distribution standard deviation. Distribution statistics are rarely known in the real world; they must therefore be estimated.
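As a minimal sketch of z-scoring, the snippet below normalizes a one-dimensional array with NumPy. The function name `z_score`, the `eps` stabilizer, and the synthetic data are illustrative assumptions, not part of [1].

```python
import numpy as np

def z_score(x, eps=1e-5):
    """Subtract the mean of x and divide by its standard deviation."""
    mu = x.mean()
    # eps is an assumed small constant that guards against division by zero.
    sigma = np.sqrt(x.var() + eps)
    return (x - mu) / sigma

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=10_000)  # synthetic activations
y = z_score(x)  # approximately zero mean and unit variance
```

Here the statistics come from the same array being normalized; the sections that follow are about what to do when the true distribution statistics have to be estimated.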

### 1.1 Statistics Estimation

Given a dataset of N elements, $D = \left\{ x_i \right\}_{i=1}^N$, the minimum variance unbiased estimates of the statistics are calculated using all N elements of the dataset. In neural networks, estimating the first and second order statistics of hidden activations using the entire dataset is computationally infeasible. Normalization must therefore be done using other estimation processes.
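For reference, and assuming the N elements are indexed from 1 to N and drawn independently from the same distribution, the familiar full-dataset estimators of the mean and variance can be written as:

$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i \qquad \textrm{and} \qquad \hat{\sigma}^2 = \frac{1}{N-1} \sum_{i=1}^{N} \left( x_i - \hat{\mu} \right)^2$$

Both sums run over every element of the dataset, which is what makes them impractical for hidden activations that change with every weight update.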

Batch Normalization [1] operates on a batch of data. Using a subset of the data (a batch) to estimate the dataset statistics is a simple and often effective technique, but it is not always applicable: batching only tends to be effective when a sufficiently large batch is used [4].
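The effect of batch size on estimate quality can be illustrated with a small simulation. This is a sketch on synthetic Gaussian data, not an experiment from [4]; the helper `mean_estimate_error` and the chosen batch sizes are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=1.0, scale=2.0, size=100_000)  # stand-in "dataset"

def mean_estimate_error(batch_size, trials=1000):
    """Average absolute error of the batch mean relative to the dataset mean."""
    # Draw `trials` random batches (sampling with replacement).
    batches = rng.choice(data, size=(trials, batch_size))
    return np.abs(batches.mean(axis=1) - data.mean()).mean()

err_small = mean_estimate_error(batch_size=2)
err_large = mean_estimate_error(batch_size=256)
print(f"batch size 2: {err_small:.3f}, batch size 256: {err_large:.3f}")
```

The error of the batch mean shrinks roughly with the square root of the batch size, which is one way to see why very small batches give noisy normalization statistics.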

This works well when the hidden activations have limited dimensionality, but this is not always the case. Processing megapixel images or running a localization / segmentation network such as Mask R-CNN [5] increases the dimensionality of the hidden activations, and therefore also increases the memory requirements of the hardware being used. Batching compounds the memory requirements, making large batch training impractical in some scenarios.

Without normalization, neural networks are functions of their inputs. Normalization brings in a dependency on the entire input data set, turning networks into statistical operators. Layer Normalization [6], Group Normalization [4], and Instance Normalization [7] estimate statistical properties using only a single sample of data. This keeps the networks functionally dependent on a single input but limits the quality of the estimated statistics, typically leading to suboptimal performance. Using a batch to estimate statistics eliminates this one-to-one functional mapping and introduces a mapping based on the input set used for normalization. In many cases, Batch Normalization outperforms other standard normalization techniques.
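The difference in functional dependency can be sketched by normalizing the same sample alongside two different sets of companions. The functions below are simplified stand-ins (no learned scale or shift, an assumed `eps`) rather than the full methods of [1] and [6].

```python
import numpy as np

def batch_norm(x, eps=1e-5):
    # Statistics per feature, computed across the batch axis (axis 0).
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def layer_norm(x, eps=1e-5):
    # Statistics per sample, computed across the feature axis (axis 1).
    mu = x.mean(axis=1, keepdims=True)
    var = x.var(axis=1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch_a = rng.normal(size=(4, 8))
batch_b = batch_a.copy()
batch_b[1:] = rng.normal(size=(3, 8))  # same first sample, different companions

# Compare the normalized first sample across the two batches.
layer_same = np.allclose(layer_norm(batch_a)[0], layer_norm(batch_b)[0])
batch_same = np.allclose(batch_norm(batch_a)[0], batch_norm(batch_b)[0])
print(layer_same, batch_same)
```

With the per-sample statistics the output for the first sample is unchanged, while with batch statistics it shifts when the rest of the batch changes.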

Normalization methods such as Batch Renormalization [8] and Streaming Normalization [9] use exponential moving estimates of the mean and standard deviation. While this is a valid method for estimating the hidden activations’ statistics, Batch Renormalization still relies on batching, and both methods use heuristics for differentiating through the time-evolving process.
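To make the idea of exponential moving estimates concrete, here is a generic sketch that tracks a mean and variance one sample at a time. It is a textbook exponentially weighted update, not the exact update rule of [8], [9], or [10]; the name `ema_statistics`, the `decay` value, and the initial values are assumptions.

```python
import numpy as np

def ema_statistics(samples, decay=0.999):
    """Track exponentially weighted estimates of mean and variance."""
    mu, var = 0.0, 1.0  # assumed initial estimates
    for x in samples:
        delta = x - mu
        # Update the variance using the deviation from the previous mean.
        var = decay * var + decay * (1.0 - decay) * delta**2
        # Move the mean a small step toward the new sample.
        mu = decay * mu + (1.0 - decay) * x
    return mu, var

rng = np.random.default_rng(0)
stream = rng.normal(loc=3.0, scale=2.0, size=20_000)  # one sample per step
mu_hat, var_hat = ema_statistics(stream)
print(mu_hat, var_hat)
```

Each step touches a single sample, so the memory cost does not grow with a batch dimension, yet the estimates still pool information from many past samples.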

Using a geometric interpretation of normalization, we set out to create a normalization technique that also uses exponential moving processes to estimate the distribution mean and standard deviation but, rather than relying on batching or heuristics, differentiates through the time-evolving process in a principled way.

## 2 Online Normalization

Like some of the aforementioned heuristic methods, Online Normalization [10] uses exponential moving estimates of the mean and standard deviation for normalization, but it proposes a technique to backpropagate through normalization which provably produces an unbiased estimate of the gradients flowing through normalization. This allows Online Normalization to operate at any batch size, including batch size 1, to accelerate neural network training without any degradation when compared to other normalization methods.

Normalization z-scores the data:

$$y = \frac{x - \mu[x]}{\sigma[x]}$$

This produces output activations which have zero mean and unit variance:

$$\mu[y]=0 \quad \textrm{and} \quad \mu[y^2]=1$$

The equations above describe a unit sphere at the origin. The derivative of a sphere is tangent to the sphere. Using this logic, we can deduce that backpropagation through normalization is simply a projection of the gradients onto a plane which is tangent to the sphere; this is depicted in Figure 3. The resulting backpropagation equation is:

$$\vec{x}\,' = \frac{1}{\sigma} (I - P_{\vec{1}})(I - P_{\vec{y}})\vec{y}\,'$$

In the forward pass, Online Normalization uses exponential moving processes to get an unbiased estimate of the first and second order statistics to z-score the hidden activations. In the backward pass, Online Normalization projects the gradients such that they lie on the tangent plane using exponentially decaying inner products. Online Normalization is depicted in Figure 4. The exact form of the forward and backward equations can be seen in Equations (8) – (12) of [10].
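The projection form of the backward pass can be checked numerically in the exact (non-online) setting, where the statistics are computed over a whole vector of samples. The sketch below is an illustration under that assumption and is not the online algorithm of [10]; the helper names `normalize`, `project_out`, and `normalize_backward` are hypothetical.

```python
import numpy as np

def normalize(x):
    # Exact z-score over the whole vector (population standard deviation).
    return (x - x.mean()) / x.std()

def project_out(v, direction):
    """Remove from v its component along `direction`: (I - P_direction) v."""
    return v - (v @ direction) / (direction @ direction) * direction

def normalize_backward(x, grad_y):
    # x' = (1 / sigma) (I - P_1)(I - P_y) y'
    y = normalize(x)
    ones = np.ones_like(x)
    return project_out(project_out(grad_y, y), ones) / x.std()

rng = np.random.default_rng(0)
x = rng.normal(size=16)
grad_y = rng.normal(size=16)  # incoming gradient with respect to y

analytic = normalize_backward(x, grad_y)

# Central finite differences of the scalar loss L(x) = grad_y . normalize(x).
h = 1e-6
numeric = np.array([
    (grad_y @ normalize(x + h * e) - grad_y @ normalize(x - h * e)) / (2 * h)
    for e in np.eye(x.size)
])
max_abs_diff = np.abs(analytic - numeric).max()
print(max_abs_diff)
```

The two projections remove the gradient components along the all-ones vector and along the normalized output, which is why the resulting gradient is orthogonal to both. Online Normalization replaces these exact inner products with exponentially decaying ones.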

### 2.1 Error checking

All estimation processes are fallible and by definition have errors. In the neural network setting, these errors can compound exponentially, leading to numerical instability. To guard against this, Online Normalization proposes layer scaling (depicted in Figure 4). Layer scaling divides out the RMS of the activation set. If there are no errors in the statistical estimates, the RMS of the normalized activations is 1, so layer scaling divides by 1 and does nothing. When there are errors, layer scaling uses the RMS to stop them from compounding exponentially and causing numerical instability.
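A minimal sketch of this idea is shown below. It assumes the RMS is taken over the feature axis of each sample and adds an assumed `eps` for stability; the function name `layer_scale` and the synthetic inputs are illustrative, and the exact formulation is given in [10].

```python
import numpy as np

def layer_scale(y, eps=1e-5):
    """Divide each sample's activations by their root mean square."""
    rms = np.sqrt(np.mean(y**2, axis=-1, keepdims=True) + eps)
    return y / rms

rng = np.random.default_rng(0)
well_normalized = rng.normal(size=(2, 4096))  # RMS already close to 1
drifted = 5.0 * well_normalized               # simulated estimation error

# After layer scaling, the drifted activations are back to RMS close to 1.
rms_after = np.sqrt(np.mean(layer_scale(drifted)**2, axis=-1))
print(rms_after)
```

When the activations already have an RMS near 1 the operation is close to the identity, and when the estimates have drifted it pulls the scale back so the error cannot keep growing from layer to layer.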

## 3 Results

Online Normalization can operate at any batch size (including batch size 1) and achieves performance comparable to, and in some cases better than, state-of-the-art normalization methods. This frees deep learning practitioners to explore training paradigms and deep learning architectures which were previously infeasible due to memory constraints.

The research is published by NeurIPS and on arXiv. The code is available on GitHub.

### Acknowledgements

Online Normalization is the result of the strong collaborative work of many of us here at Cerebras Systems. I want to thank my coauthors in the research effort: Ilya Sharapov, Atli Kosson, Urs Koster, Ryan Reece, Sofia Samaniego de la Fuente, Vishal Subbiah, and Michael James.

### References

[1] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.

[2] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2483–2493. Curran Associates, Inc., 2018.

[3] Mohamed Elgendy. Deep Learning for Vision Systems. O’Reilly Media, 2020.

[4] Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), September 2018.

[5] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. arXiv preprint arXiv:1703.06870, 2017.

[6] Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. ArXiv, abs/1607.06450, 2016.

[7] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. ArXiv, abs/1607.08022, 2016.

[8] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1945–1953. Curran Associates, Inc., 2017.

[9] Qianli Liao, Kenji Kawaguchi, and Tomaso A. Poggio. Streaming normalization: Towards simpler and more biologically-plausible normalizations for online and recurrent learning. ArXiv, abs/1610.06160, 2016.

[10] Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Koster, Ryan Reece, Sofia Samaniego de la Fuente, Vishal Subbiah, and Michael James. Online normalization for training neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8431–8441. Curran Associates, Inc., 2019.