[Updated April 2023: R1.8 of the Cerebras Software Platform now supports image segmentation on 50 megapixel images, up from 25 megapixels in R1.7.]

Deep learning for computer vision (CV) has progressed rapidly in recent years, with networks able to identify objects in images and generate realistic images based on text input. With the exploding availability of high-quality, high-resolution data [1], researchers must find ways to train deep neural networks on large images and take advantage of their rich contextual information.

The Cerebras CS-2 system is designed to overcome the limitations of GPUs and allow users to easily and rapidly train large models on high-resolution, 50-megapixel images.

Introduction

In 2012, the goal of the ImageNet challenge was to classify small images, typically processed at 224×224 resolution, each containing a single subject. Since then, the field of deep learning for computer vision (CV) has progressed rapidly and has tackled increasingly complex challenges.

Today, network architectures are designed to provide pixelwise segmentation, identify and locate up to hundreds of objects in a single image, and pick out small, sparsely distributed objects in large images. Similarly, there has recently been incredible progress in image generation. Diffusion models, while still relatively new, currently generate state-of-the-art image quality and are setting the bar higher for image generation tasks. In particular, diffusion models have demonstrated the ability to generate remarkably realistic images from text input.

These increasingly sophisticated tasks often require higher-resolution data, larger models, and more computation. In CV, deep neural networks (DNNs) typically consist of many convolutional layers. Each layer must apply the convolution operation to its inputs and save the high-dimensional output activations. As models become deeper, they have more layers, which require more computation and more memory. Similarly, as models get bigger, they have more parameters, which also consume memory.
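To make this scaling concrete, here is a rough back-of-the-envelope sketch in Python. All the numbers are illustrative assumptions (3×3 convolutions that preserve spatial resolution, 64 channels per layer, batch size 1, fp16 activations), not measurements of any particular model:

```python
# Rough activation memory for a stack of convolutional layers that
# preserve spatial resolution. Purely illustrative assumptions.

def activation_memory_gb(height, width, channels, num_layers, bytes_per_value=2):
    """Memory needed to keep the output activations of every layer
    (fp16 by default) for the backward pass."""
    values_per_layer = height * width * channels
    return num_layers * values_per_layer * bytes_per_value / 1e9

# A modest 16-layer, 64-channel network at increasing resolutions:
for side in (224, 5_120, 7_168):
    gb = activation_memory_gb(side, side, 64, 16)
    print(f"{side} x {side}: ~{gb:.1f} GB of activations")
# 224 x 224:   ~0.1 GB
# 5120 x 5120: ~53.7 GB
# 7168 x 7168: ~105.2 GB
```

Even under these simple assumptions, activation memory grows quadratically with image side length and quickly exceeds the capacity of any single GPU, before accounting for weights, gradients, and optimizer state.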

The hardware used to train these DNNs must have sufficient memory to store all the required activations and model parameters. Although GPUs have increased both their processing power and their memory capacity, they often remain a bottleneck when training larger models on higher-resolution data. The Cerebras CS-2 system is designed to overcome these bottlenecks and allow users to train large models on high-resolution data in a seamless, user-friendly workflow.

Medical Applications

There is particular interest in the medical field in using deep learning for computational pathology. In this use case, large, potentially gigapixel, images of tissue samples are used for cancer diagnosis, prognosis, and prediction of how a patient will respond to a particular treatment (Figure 1). In oncology, the analysis of histopathological images by an expert pathologist is considered the “gold standard” for patient diagnosis. This process requires careful, time-consuming examination which, if automated, could have a huge impact on this important field.

Figure 1. Whole slide image showing a slice of a human lymph node. The detection and classification of breast cancer metastases can be improved through large image segmentation. (From the CAMELYON17 data set.)

In addition to simply reducing costs, the advancement of computational pathology can help reduce errors and diagnostic variability, and opens up the possibility of using deep learning to integrate data from multiple sources to aid in diagnoses. To accurately detect tumors in tissue samples, it is important to examine cell-level features in the context of their surrounding tissue, not just in isolation [2]. For DNNs to accurately identify tumors, it is therefore necessary to process images large enough (often greater than 4K × 4K resolution) to capture this contextual information [3].

Because the memory required to train a DNN on these large images often surpasses the memory available on current GPUs, researchers have had to devise methods to get around these memory limitations. These workarounds generally come with either a loss in accuracy or slower training [1], and the implementer may need specialized knowledge of the dataflow between GPU and CPU [1, 4].

Remote Sensing Applications

In the field of remote sensing, high-resolution satellite imagery is becoming more and more available. The number of satellites monitoring the earth’s surface has increased dramatically over the last few decades, and there is a growing trend toward making that data freely available [5]. The Sentinel satellite program alone added more than 8,000 TB of imagery data in 2021 [6].

Along with the increased availability of satellite imagery, there is also increased demand for this data in a diverse range of fields including agriculture, climate science, surveillance, and economic development [7]. As in the medical field, processing this data may be limited by the memory available on current GPU devices. For example, in the remote sensing task of change detection, small patches must be extracted from the large, high-resolution satellite images to avoid memory constraints [8, 9]. This process throws away contextual information and may degrade accuracy.

Object counting is another area of remote sensing that benefits from the contextual information available in high-resolution images. Using remote sensors to count objects is a challenging task due to object occlusion, cluttered backgrounds, and the diversity present in natural images. One approach to tackling this problem is to break high-resolution images into small patches and directly count the objects in each patch. However, it has been shown that estimating the number of objects in an image benefits from contextual information and can be done more efficiently by processing entire images and generating the counts directly [10]. The maximum resolution that can be directly processed for object counting is determined by the training system’s available memory.
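As a minimal sketch of the patch-based workaround (the function and sizes below are our own, purely illustrative), a large image is typically split into disjoint patches, each of which is then processed with no knowledge of its neighbors:

```python
import numpy as np

def extract_patches(image, patch_size):
    """Split an H x W x C image into non-overlapping patches; any
    context that spans a patch border is lost to the model."""
    h, w, _ = image.shape
    patches = [
        image[top:top + patch_size, left:left + patch_size]
        for top in range(0, h - patch_size + 1, patch_size)
        for left in range(0, w - patch_size + 1, patch_size)
    ]
    return np.stack(patches)

# A 5,120 x 5,120 RGB image becomes 400 independent 256 x 256 patches.
image = np.zeros((5_120, 5_120, 3), dtype=np.uint8)
print(extract_patches(image, 256).shape)  # (400, 256, 256, 3)
```

Each patch is then counted (or segmented) in isolation, which is exactly where the contextual information is lost.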

Barriers to High-Resolution Computer Vision

There is a clear trend toward the increasing availability of high-quality, high-resolution data for computer vision. In the fields discussed above, there are significant advantages to being able to process these larger, higher-resolution images. Currently, however, the memory available on GPUs is often a limiting factor in fully utilizing this data. The simplest way around memory limitations is to downsample the input images. This, however, removes information and reduces network accuracy, which may be unacceptable for many applications. Similarly, splicing high-resolution images into smaller patches throws away contextual information and increases training time. There has also been work on specialized DNN architectures that combine information at different scales to avoid GPU memory limitations: in [11], the input images are both downsampled and split into patches, and this information is then combined at the feature-map level in the model.
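As a toy sketch of that multi-scale idea (our own illustration, not the architecture of [11]): one branch encodes a downsampled copy of the whole image for global context, a second branch encodes a full-resolution patch for local detail, and the two feature maps are fused before the prediction head:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoScaleNet(nn.Module):
    """Toy two-branch model: global context from a downsampled image,
    local detail from a full-resolution patch, fused at the feature level."""
    def __init__(self, channels=16, num_classes=2):
        super().__init__()
        self.local_branch = nn.Conv2d(3, channels, 3, padding=1)
        self.global_branch = nn.Conv2d(3, channels, 3, padding=1)
        self.head = nn.Conv2d(2 * channels, num_classes, 1)

    def forward(self, patch, full_image):
        local = F.relu(self.local_branch(patch))
        # Downsample the whole image to the patch's resolution, then encode it.
        small = F.interpolate(full_image, size=patch.shape[-2:],
                              mode="bilinear", align_corners=False)
        global_ctx = F.relu(self.global_branch(small))
        return self.head(torch.cat([local, global_ctx], dim=1))

model = TwoScaleNet()
patch = torch.randn(1, 3, 256, 256)    # full-resolution crop
full = torch.randn(1, 3, 1024, 1024)   # whole image (already reduced here)
print(model(patch, full).shape)        # torch.Size([1, 2, 256, 256])
```

The global branch restores some of the context the patches lose, but the downsampled branch still discards fine detail, so this remains a compromise rather than a solution.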

Other approaches developed to circumvent GPU memory limitations look at how memory can be managed efficiently to avoid out-of-memory errors. One method is to discard the output activations and recompute them as needed. This, however, requires additional computation and slows down training [11]. Others have developed methods to share memory between the CPU host and the GPU [4]. In this method, CPU memory is used as a larger pool to store feature-map activations, which are swapped between the CPU and GPU as needed. This process, however, is rather complicated: it involves rewriting the computation graph and defining how memory should be shared between devices, which is model-dependent.
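The recompute-as-needed approach is available off the shelf in frameworks such as PyTorch, where it is known as gradient (or activation) checkpointing. A minimal sketch:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# 16 convolutional blocks; with checkpointing, activations inside each
# segment are dropped in the forward pass and recomputed during backward.
model = nn.Sequential(*[
    nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
    for _ in range(16)
])

x = torch.randn(1, 32, 1024, 1024, requires_grad=True)
# Split the 16 blocks into 4 segments; only segment boundaries are stored.
out = checkpoint_sequential(model, 4, x)
out.sum().backward()
```

This saves the memory for intra-segment activations at the cost of a second forward pass through each segment: exactly the compute-for-memory trade-off noted above.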

Cerebras to the Rescue

Because the processor at the heart of the Cerebras CS-2 system, the Wafer-Scale Engine, has 40GB of on-chip SRAM, combined with our unique high-speed memory access and weight-streaming capability, the CS-2 can train large models on a single system without the need for complicated, inefficient workarounds that throw out information and degrade performance.

As shown in Figure 2, a single CS-2 system can train directly on full-resolution satellite images, at least 7K × 7K pixels in size. The CS-2 system makes training on these large images as simple as training on smaller ones. There is no need for tiling, which would require stitching the results back together, nor do we need to downsample the images at the cost of prediction accuracy. If even more performance is required, a Cerebras Wafer-Scale Cluster, featuring multiple CS-2s working together, can be used.

The full implementation of the model and training shown in Figure 2 can be found in the Cerebras Model Zoo here. You can also learn how easy it is to implement high-resolution image segmentation in this short walkthrough video.

Figure 2. Image segmentation on 5,120 × 5,120 resolution images from the Inria Aerial dataset using the CS-2 system. Left: ground truth labels displayed on top of the original satellite image. Right: pixelwise predictions on top of the same image. Predictions were generated by a UNet model [12] trained directly on 5,120 × 5,120 images on the CS-2 system without the need to crop or downsize the images.

Conclusions

As high-resolution data becomes more available in fields such as medical imaging and remote sensing, the ability to process images large enough to capture their contextual information becomes increasingly important. Currently, however, the maximum image sizes that can be used for training are often limited by the memory available on the training system, which in turn limits the insights that can be gained from such images. The Cerebras CS-2 system is uniquely built to handle the high memory and computational demands of this work, allowing researchers to more fully extract the information contained in these high-resolution images.

Authors

Jason Wolfe, ML Engineer
Aarti Ghatkesar, ML Engineer
January 27, 2023

References

[1]: Arian Bakhtiarnia, Qi Zhang, and Alexandros Iosifidis, “Efficient High-Resolution Deep Learning: A Survey”, arXiv, 2022, https://arxiv.org/abs/2207.13050

[2]: Douglas Hanahan and Robert Weinberg, “Hallmarks of Cancer: The Next Generation”, Cell, 2011, https://www.cell.com/cell/fulltext/S0092-8674(11)00127-9

[3]: Chen et al. “Scaling Vision Transformers to Gigapixel Images via Hierarchical Self-Supervised Learning”, arXiv, 2022, https://arxiv.org/abs/2206.02647

[4]: Meng et al. “Training Deeper Models by GPU Memory Optimization on Tensorflow”, Conference on Neural Information Processing Systems, 2017, http://learningsys.org/nips17/assets/papers/paper_18.pdf

[5]: Alan S. Belward, Jon O. Skøien, “Who launched what, when and why; trends in global land-cover observation capacity from civilian earth observation satellites”, ISPRS Journal of Photogrammetry and Remote Sensing, 2015, https://www.sciencedirect.com/science/article/pii/S0924271614000720

[6]: Brandon Victor, Zhen He, Aiden Nibali, “A systematic review of the use of Deep Learning in Satellite Imagery for Agriculture”, arXiv, 2022, https://arxiv.org/abs/2210.01272

[7]: Gudžius et al. “Deep learning-based object recognition in multispectral satellite imagery for real-time applications”, Machine Vision and Applications, 2021, https://link.springer.com/article/10.1007/s00138-021-01209-2

[8]: Daifeng Peng, Yongjun Zhang, Haiyun Guan, “End-to-End Change Detection for High Resolution Satellite Images Using Improved UNet++”, Remote Sensing, 2019, https://www.mdpi.com/2072-4292/11/11/1382

[9]: Hao Chen and Zhenwei Shi, “A Spatial-Temporal Attention-Based Method and a New Dataset for Remote Sensing Image Change Detection”, Remote Sensing, 2020, https://www.mdpi.com/2072-4292/12/10/1662

[10]: C. Shang, H. Ai and B. Bai, “End-to-end crowd counting via joint learning local and global count,” IEEE International Conference on Image Processing (ICIP), 2016, https://ieeexplore.ieee.org/document/7532551

[11]: Chen et al. “Training Deep Nets with Sublinear Memory Cost”, arXiv, 2016, https://arxiv.org/abs/1604.06174

[12]: Olaf Ronneberger, Philipp Fischer and Thomas Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation”, arXiv, 2015, https://arxiv.org/abs/1505.04597