Online Normalization [1] is a normalization technique that uses moving average statistics to normalize activations in a neural network. As we have shown in our NeurIPS 2019 paper, Online Normalization outperforms or is competitive with state-of-the-art normalization methods [2, 3, 4, 5, 6] in their respective areas.

Online Normalization uses moving average statistics on the forward and backward pass and adds layer scaling (Figure 1) to guard against the effects of errors in the statistical estimates. Layer scaling divides out the root mean square (RMS) of the activation vector across all features to prevent exponential growth of activation magnitudes.
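As a rough illustration of layer scaling as described above, the sketch below divides a feature vector by its RMS taken across all features. The function name `layer_scale` and the stability constant `eps` are assumptions made for this example; they are not taken from the paper or from the Online Normalization implementation.

```python
import math

def layer_scale(y, eps=1e-5):
    """Divide a feature vector by its root mean square across all features.

    `eps` is an assumed numerical-stability constant (hypothetical value).
    """
    mean_square = sum(v * v for v in y) / len(y)
    rms = math.sqrt(mean_square + eps)
    return [v / rms for v in y]

# Toy activation vector for a single sample (values chosen for illustration).
y = [3.0, -4.0, 0.0, 5.0]
z = layer_scale(y)

# After scaling, the RMS of the vector is approximately 1.
rms_after = math.sqrt(sum(v * v for v in z) / len(z))
print(rms_after)
```

Note that every output element depends on every input feature through the shared RMS, and computing it requires a reduction over the whole feature vector.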

In this post, we introduce Activation Clamping as an improvement over the original layer scaling that performs equally well with the added benefit of being less computationally expensive.

## Activation Clamping

Layer scaling helps stabilize training by eliminating the compounding of estimation errors. Left unchecked, these estimation errors can lead to the exponential growth of activation magnitudes across layers [1]. We propose simply clamping activations to stabilize training. Activation clamping can be expressed as:

$$z = \text{clamp}(y; c) = \min(\max(-c, y), c)$$

where activations are constrained to the range $-c \le z \le c$. A statistically motivated setting for the clamping hyperparameter can be argued given the definition of normalization. The output of the affine norm y should be zero mean and unit variance. Assuming a Gaussian distribution, the chances of activations being outside of a few standard deviations shrink at the rate of the complementary error function.^{1}
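The clamp and the Gaussian tail argument above can be sketched in a few lines. This is a minimal scalar example, not the library's implementation; the helper name `gaussian_tail_probability` is hypothetical, and the calculation assumes a standard normal activation, for which P(|y| > c) = erfc(c / √2).

```python
import math

def clamp(y, c):
    """Constrain a scalar activation to the range [-c, c]."""
    return min(max(-c, y), c)

def gaussian_tail_probability(c):
    """P(|y| > c) for y ~ N(0, 1), via the complementary error function."""
    return math.erfc(c / math.sqrt(2.0))

# Probability that a perfectly normalized activation would be clamped.
for c in (1.0, 3.0, 5.0):
    print(c, gaussian_tail_probability(c))
```

Under this assumption the probabilities are roughly 0.32, 0.0027, and 5.7e-7 for c = 1, 3, and 5, so a clamp a few standard deviations out is expected to touch almost no correctly normalized activations.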

Online Normalization with activation clamping is depicted in Figure 2. If the statistical estimates of Online Normalization are accurate, clamping does nothing to the activation; clamping only modifies activations when there is a large error in the statistical estimates. Furthermore, as the network asymptotically nears convergence and the learning rate is annealed, the error in the statistical estimates approaches zero. For inference, clamping can be removed from Online Normalization.
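To make the "clamping does nothing when the estimates are accurate" point concrete, the sketch below measures what fraction of activations a clamp actually changes. The helper `clamped_fraction`, the toy values, and the 4x scale error are all invented for illustration and do not come from the experiments in this post.

```python
def clamp(y, c):
    """Constrain a scalar activation to the range [-c, c]."""
    return min(max(-c, y), c)

def clamped_fraction(activations, c):
    """Fraction of activations whose value is changed by clamping."""
    changed = sum(1 for y in activations if clamp(y, c) != y)
    return changed / len(activations)

c = 5.0

# Activations from an accurate normalizer stay within a few standard deviations.
well_normalized = [-1.2, 0.3, 2.5, -0.7]

# Hypothetical failure case: the scale estimate is off, so outputs are 4x too large.
badly_normalized = [4.0 * y for y in well_normalized]

print(clamped_fraction(well_normalized, c))   # no values are modified
print(clamped_fraction(badly_normalized, c))  # only the outlier at 10.0 is clamped
```

A quantity like this could serve as a simple diagnostic during training: if it stays near zero, the clamp is acting as the identity and removing it for inference should not change the outputs.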

Figure 2: Online Normalization with activation clamping. Adaptive bias and gain excluded for simplicity.

## Experiments

Table 1: Best validation loss and accuracy.

| Normalizer (Error Compensation Mechanism) | Loss | Accuracy |
|---|---|---|
| OnlineNorm (Activation Clamping) | 0.93 | 76.3% |
| OnlineNorm (Layer Scaling)^{2} | 0.94 | 76.3% |
| BatchNorm^{2} | 0.97 | 76.4% |

To validate that clamping works well as an alternative to layer scaling, we rerun the ImageNet [7] image classification experiments shown in Section 4 of [1], conducting a single run with ResNet-50. This experiment uses the same decay factors used in [1]. Better results could be achieved with a hyperparameter sweep. As in [1], our training procedure is based on [8], which is tuned for Batch Normalization. Table 1 summarizes the results of our experiments; training curves are included in Figure 3. The experiment uses a clamping value of c = 5. After training, we removed the clamping operation, which produced identical validation performance.

Figure 3: ImageNet / ResNet-50

To conclude, activation clamping can be used as a less computationally intensive alternative to layer scaling for error compensation in Online Normalization. Activation clamping achieves the same accuracy as layer scaling, with the added benefit that it removes the cross-channel dependence introduced by layer scaling and can itself be removed for inference. Activation clamping has been integrated into the Online Normalization implementation.

### Acknowledgements

I would like to acknowledge Atli Kosson and Urs Koster for their contribution to this work.

### Footnotes

^{1} This statistical analysis holds if the adaptive bias and gain in Online Normalization are disabled.
^{2} Results from [1].

### References

[1] Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Koster, Ryan Reece, Sofia Samaniego de la Fuente, Vishal Subbiah, and Michael James. Online normalization for training neural networks. In Advances in Neural Information Processing Systems 32, pages 8431–8441. Curran Associates, Inc., 2019.

[2] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.

[3] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1945–1953. Curran Associates, Inc., 2017.

[4] Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), September 2018.

[5] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. ArXiv, abs/1607.08022, 2016.

[6] Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. ArXiv, abs/1607.06450, 2016.

[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.