Memory Hierarchy
A memory hierarchy is an organizational structure for computer memory that trades speed against capacity and cost. Memory is arranged into distinct levels, and each level going up the hierarchy is smaller, faster, and more expensive per byte than the one below it. Commonly cited levels include processor caches, RAM, ROM, solid-state drives, hard disk drives, optical discs, and magnetic tape, each with its own advantages and disadvantages depending on the application. Understanding how these levels interact within a computer system's architecture makes it possible to design applications that are optimized for performance while keeping costs manageable.

The memory hierarchy also matters to anyone designing and programming software. Knowing the characteristics of each level helps developers write more efficient code, manage resources better, and ultimately improve performance: not all memory types perform equally, and exploiting the options available at each level of the hierarchy helps ensure a program runs as fast as possible without sacrificing cost efficiency.
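To make the locality point concrete, here is a minimal C sketch (not from the original article; the matrix size and the timing method are arbitrary choices) that sums the same matrix twice. The row-major pass walks consecutive addresses and benefits from cached lines, while the column-major pass strides across cache lines and, on most machines, runs several times slower even though it does the same arithmetic.

```c
/* Sketch: how traversal order interacts with the memory hierarchy.
 * Row-major order touches consecutive addresses (cache-friendly);
 * column-major order jumps one full row per access (cache-hostile). */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 4096  /* 4096 x 4096 ints = 64 MB, larger than typical caches */

static double elapsed_ms(clock_t start, clock_t end) {
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC;
}

int main(void) {
    /* One flat, zero-initialized allocation indexed as a row-major matrix. */
    int *a = calloc((size_t)N * N, sizeof *a);
    if (!a) return 1;

    long sum = 0;
    clock_t t0 = clock();
    for (int i = 0; i < N; i++)          /* row-major: sequential addresses */
        for (int j = 0; j < N; j++)
            sum += a[(size_t)i * N + j];
    clock_t t1 = clock();
    for (int j = 0; j < N; j++)          /* column-major: strided addresses */
        for (int i = 0; i < N; i++)
            sum += a[(size_t)i * N + j];
    clock_t t2 = clock();

    printf("row-major:    %.1f ms\n", elapsed_ms(t0, t1));
    printf("column-major: %.1f ms\n", elapsed_ms(t1, t2));
    printf("(checksum %ld)\n", sum);  /* keeps the loops from being optimized away */
    free(a);
    return 0;
}
```

The exact ratio between the two passes depends on cache sizes, line length, and the compiler, but the gap itself is a direct consequence of the hierarchy described above.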
Computer architects have tried for decades to address the memory bottleneck, and the primary solution has been memory hierarchies: many levels of on-chip and near-chip caches. These caches are costly, small, and provide benefits that are sometimes unpredictable. Once an access crosses the on-chip/off-chip boundary, latency to memory explodes and bandwidth plummets. This forces a downward spiral of performance, and it is one of the fundamental reasons GPUs are challenged when doing artificial intelligence work.
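That latency cliff at the on-chip/off-chip boundary can be observed directly with a classic pointer-chasing microbenchmark. The sketch below is an illustration, not a Cerebras or GPU measurement; the working-set sizes and step count are arbitrary. Each load depends on the previous one, so the time per step approximates raw memory latency, and it jumps each time the working set outgrows a cache level.

```c
/* Sketch: measure approximate load latency as the working set grows.
 * A random cyclic permutation defeats the hardware prefetcher, and the
 * chain of dependent loads prevents the CPU from overlapping them. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void chase(size_t n_elems, size_t steps) {
    size_t *next = malloc(n_elems * sizeof *next);
    if (!next) return;

    /* Build a single random cycle (Sattolo's algorithm). */
    for (size_t i = 0; i < n_elems; i++) next[i] = i;
    for (size_t i = n_elems - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;  /* j in [0, i) keeps it one cycle */
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    volatile size_t sink = 0;           /* defeats dead-code elimination */
    size_t q = 0;
    clock_t t0 = clock();
    for (size_t s = 0; s < steps; s++)
        q = next[q];                    /* each load depends on the last */
    clock_t t1 = clock();
    sink = q; (void)sink;

    double ns_per_step =
        1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / (double)steps;
    printf("%8zu KiB working set: %6.1f ns per load\n",
           n_elems * sizeof *next / 1024, ns_per_step);
    free(next);
}

int main(void) {
    /* Sweep from comfortably inside L1 to well past the last-level cache. */
    for (size_t kib = 16; kib <= 64 * 1024; kib *= 4)
        chase(kib * 1024 / sizeof(size_t), 10 * 1000 * 1000);
    return 0;
}
```

On typical hardware the output shows a few nanoseconds per load while the working set fits in cache, then roughly an order-of-magnitude jump once it spills to DRAM, which is exactly the explosion in latency described above.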
Cerebras has solved this problem. The WSE has 18 gigabytes of on-chip memory and 9.6 petabytes per second of memory bandwidth, respectively 3,000x and 10,000x more than is available on the leading GPU. As a result, the WSE can keep the entire set of neural network parameters on the same silicon as the compute cores, where they can be accessed at full speed. This is possible because memory on the WSE is distributed uniformly alongside the computational elements, allowing the system to achieve extremely high memory bandwidth at single-cycle latency, with all model parameters held in on-chip memory all of the time.
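As a rough sanity check on those multipliers (the article does not name the comparison GPU, so the baseline figures here are assumptions): a contemporary datacenter GPU carries on the order of 6 MB of on-chip SRAM and roughly 1 TB/s of memory bandwidth. Under those assumptions, 18 GB / 6 MB = 3,000 and 9.6 PB/s / 1 TB/s = 9,600 ≈ 10,000, consistent with the 3,000x and 10,000x figures quoted above.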
