
Error Compensation Mechanism in Online Normalization

Online Normalization is a new technique for normalizing the hidden activations of a neural network.

Vitaliy Chiley, Machine Learning Engineer | April 3, 2020

Online Normalization [1] is a normalization technique that uses moving average statistics to normalize activations in a neural network. As we showed in our NeurIPS 2019 paper, Online Normalization outperforms or is competitive with state-of-the-art normalization methods [2, 3, 4, 5, 6] in their respective areas.

Online Normalization uses moving average statistics on the forward and backward pass and adds layer scaling (Figure 1) to guard against the effects of errors in the statistical estimates. Layer scaling divides out the root mean square (RMS) of the activation vector across all features to prevent exponential growth of activation magnitudes.
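As a rough illustration, layer scaling amounts to dividing each activation vector by its RMS over the feature dimension. The following is a minimal PyTorch sketch, not the reference implementation; the function name, tensor layout, and epsilon are illustrative assumptions:

```python
import torch

def layer_scaling(y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # y is assumed to have shape (batch, features).
    # Divide out the RMS of the activation vector across all features,
    # which guards against exponential growth of activation magnitudes.
    rms = torch.sqrt(torch.mean(y ** 2, dim=1, keepdim=True) + eps)
    return y / rms
```

Note that the division couples every feature of a sample to every other feature, which is the cross-feature dependence discussed in Figure 1.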

In this post, we introduce Activation Clamping, an improvement over the original layer scaling that performs equally well while being less computationally expensive.

Figure 1: Online Normalization with layer scaling. The incoming feature is represented as x, \mu and \sigma are the moving average mean and standard deviation, and y is the normalized feature. The feature z is the output of Online Normalization. Activation vectors across all features are represented by \{x\}, \{y\}, and \{z\}. Layer scaling introduces a cross-feature dependence by dividing out the RMS of \{y\}. Trainable bias and scale are excluded for simplicity.

Activation Clamping

Layer scaling helps stabilize training by eliminating the compounding of estimation errors. Left unchecked, these estimation errors can lead to the exponential growth of activation magnitudes across layers [1]. We propose simply clamping activations to stabilize training. Activation clamping can be expressed as:

z = \text{clamp}(y; c) = \min(\max(-c, y), c)

where activations are constrained to the range −c ≤ z ≤ c. A statistically motivated setting for the clamping hyperparameter c follows from the definition of normalization: the output y of the affine normalization should be zero-mean and unit-variance. Assuming a Gaussian distribution, the probability of an activation falling outside a few standard deviations shrinks at the rate of the complementary error function.1
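In code, clamping is a single elementwise operation. Here is a minimal PyTorch sketch; the function name and default value of c are illustrative assumptions (c = 5 matches the experiments below):

```python
import torch

def activation_clamp(y: torch.Tensor, c: float = 5.0) -> torch.Tensor:
    # Clamp normalized activations to [-c, c] elementwise.
    # Unlike layer scaling, this introduces no cross-feature dependence.
    return torch.clamp(y, min=-c, max=c)
```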

Online Normalization with activation clamping is depicted in Figure 2. If the statistical estimates of Online Normalization are accurate, clamping does nothing to the activation; clamping only modifies activations when there is a large error in the statistical estimates. Furthermore, as the network asymptotically nears convergence and the learning rate is annealed, the error in the statistical estimates approaches zero. For inference, clamping can be removed from Online Normalization.
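To make the training/inference distinction concrete, here is a greatly simplified forward-pass sketch, assuming activations of shape (batch, features). It omits the backward-pass error compensation and the trainable bias and gain of the real algorithm; the class name, momentum value, and batch-wise statistics update are assumptions for illustration only:

```python
import torch
import torch.nn as nn

class OnlineNormClampSketch(nn.Module):
    # Normalize with moving-average statistics, then clamp during training.
    def __init__(self, num_features: int, momentum: float = 0.999,
                 c: float = 5.0, eps: float = 1e-5):
        super().__init__()
        self.momentum, self.c, self.eps = momentum, c, eps
        self.register_buffer("mu", torch.zeros(num_features))
        self.register_buffer("var", torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize with the current moving-average estimates.
        y = (x - self.mu) / torch.sqrt(self.var + self.eps)
        if self.training:
            with torch.no_grad():  # update the moving-average statistics
                self.mu.mul_(self.momentum).add_((1 - self.momentum) * x.mean(dim=0))
                self.var.mul_(self.momentum).add_((1 - self.momentum) * x.var(dim=0, unbiased=False))
            # Clamping is a no-op when the estimates are accurate and only
            # modifies activations when the estimates carry large errors.
            y = torch.clamp(y, -self.c, self.c)
        # At inference the clamp is simply dropped.
        return y
```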

Figure 2: Online Normalization with activation clamping. Adaptive bias and gain excluded for simplicity.

Experiments

Table 1: Best validation loss and accuracy.

Normalizer (Error Compensation Mechanism) | Loss | Accuracy
OnlineNorm (Activation Clamping)          | 0.93 | 76.3%
OnlineNorm (Layer Scaling)2               | 0.94 | 76.3%
BatchNorm2                                | 0.97 | 76.4%

To validate that clamping works well as an alternative to layer scaling, we reran the ImageNet [7] image classification experiments from Section 4 of [1], conducting a single run with ResNet-50. This experiment uses the same decay factors as [1]; better results could likely be achieved with a hyperparameter sweep. As in [1], our training procedure is based on [8], which is tuned for Batch Normalization. Table 1 summarizes the results of our experiments; training curves are included in Figure 3. The experiment uses a clamping value of c = 5. After training, we removed the clamping operation, which produced identical validation performance.

Figure 3: ImageNet / ResNet-50 training curves.

To conclude, activation clamping can be used as a less computationally intensive alternative to layer scaling for error compensation in Online Normalization. Activation clamping achieves the same accuracy as layer scaling, with the added benefits that it removes the cross-feature dependence introduced by layer scaling and can itself be removed for inference. Activation clamping has been integrated into the Online Normalization implementation.

Acknowledgements

I would like to acknowledge Atli Kosson and Urs Koster for their contributions to this work.

Footnotes

1 This statistical analysis holds if the adaptive bias and gain in Online Normalization are disabled.
2 Results from [1].

References

[1] Vitaliy Chiley, Ilya Sharapov, Atli Kosson, Urs Koster, Ryan Reece, Sofia Samaniego de la Fuente, Vishal Subbiah, and Michael James. Online normalization for training neural networks. In Advances in Neural Information Processing Systems 32, pages 8431–8441. Curran Associates, Inc., 2019.
[2] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
[3] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 1945–1953. Curran Associates, Inc., 2017.
[4] Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), September 2018.
[5] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Instance normalization: The missing ingredient for fast stylization. ArXiv, abs/1607.08022, 2016.
[6] Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. ArXiv, abs/1607.06450, 2016.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
[8] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

