ISPD 2020 Contest: Wafer-Scale Deep Learning Accelerator Placement

Contest

cerebras | December 12, 2019

The ACM International Symposium on Physical Design (ISPD) in collaboration with Cerebras Systems announce the 2020 ISPD contest in the area of placement for a wafer-scale deep learning accelerator.

The placement problem for a wafer-scale engine is similar to traditional EDA placement tools but with the added twist of optimizing for an entirely new set of performance metrics. The goal of the contest is to encourage academic research on this new class of place and route problems for wafer scale deep learning accelerators.

Registration is now closed.

Benchmark Description

A massive wafer-scale chip for Machine Learning training (code-named CS-1) is operational, containing over a trillion transistors. The engine is organized as a homogenous multiprocessor grid of 633×633 processing tiles. Each tile has a programmable execution core, memory, and a router for interconnection with other tiles. A typical deep learning model specifies tens of billions of arithmetic operations to be repeatedly evaluated. A compiled model for CS-1 must coordinate which arithmetic operation is performed on each core as well as the communication path to deliver the result of one operation to all dependent operations.

The purpose of this challenge is to develop algorithms and heuristics that produce compiled neural models that run with the highest possible performance. It is necessary to specify both how the operations will execute and to specify the location of the operations on the wafer engine.

Model operations are grouped into kernels such as matrix multiplication and image convolution. Formally, a kernel is a parametric program that performs specific tensor operations. For instance, a convolution kernel may perform three convolutions corresponding to the forward, backward, and weight update passes of a neural network.

Kernels have formal arguments that specify the shapes of the tensor operations to be performed. For the convolution example, formal arguments are: H, W, R, S, C, K. These specify the size of the input image (H, W); the size of the convolution’s receptive field (R, S); the number of input and output features (C, K). The values of formal arguments are provided and cannot be changed by the compilation procedure.

Kernels also have execution arguments that describe how the operation is parallelized across tiles. For example, the convolution kernel’s execution arguments split the various dimensions of its tensor operations independently, (h’, w’, c’, k’). The values of the execution arguments are variables for the optimization procedure.

A resource cuboid is a 4-tuple (height, width, time, memory) that describes the array of tiles necessary to execute a kernel. Each kernel provides a performance function that maps from its arguments to resource a cuboid.

The convolution kernel’s performance function is given as follows:

convperf(H, W, R, S, C, K; h’, w’, c’, k’) = {
  height = h’*w’*(k’+1)
  width  = 3*c’
  time   = ceil(H/h’) * ceil(W/w’) * ceil(C/c’) * ceil(K/k’) * R * S
  memory = C/c’ * K/k’ * R * S + (W+S-1)/w’ * (H+R-1)/h’ * K/k’
 } 

Each benchmark challenge includes a kernel graph and a kernel library. The graph lists all kernels with their formal arguments as nodes. Data-dependencies between kernels are arcs in this graph. The library is a list of performance functions that can be invoked for each kernel, and the domains for all execution arguments. A solution to the challenge specifies the following values for each kernel in the kernel graph:

  • Execution arguments
  • The x,y coordinates of kernel’s origin
  • Orientation of the kernel (as an isometry in D8)

Evaluation

Solutions will be evaluated against meeting the following constraints:

Legality constraints

  1. All kernels fit within the 633×633 area of the wafer scale engine
  2. No kernels overlap
  3. No kernel’s performance cuboid exceeds the tile’s memory limit of 48 kB
  4. Upper bound of runtime (example: < 8hrs)

Optimality objectives (scoring function)

  1. Minimize maximum compute time (… DeltaT … )
  2. Minimize L1 distance between communicating kernels
  3. Minimize runtime

Solutions are scored by the maximum time of any kernel within the kernel graph, plus an added weighted L1 communication penalty based on the distance from the centers of connected kernels in the kernel graph.

The details of the metrics and scoring system will be released on contest website.

Example Benchmark

Interested in more details? An initial example benchmark is described at this link. (note: updated February 10th to correct small error)

Contest Dates

  • To officially participate, contestants must register by Jan 27, 2020.  See Registration below. Registration has been closed.
  • Initial placement benchmarks is available at this link
  • More testing benchmarks will be made available before March 2nd, 2020
  • Teams must submit final source code by April 24th (Note: this is updated to allow extra time).
  • The contest winners will be announced in May, prizes will be handed out during the ISPD symposium (likely September 2020).

Registration

Registration is closed at this time.

For questions or comments you can contact the organizers at: ispd2020contest@cerebras.net

Prizes

Prizes will be handed out at the ISPD symposium.