Author: Jeff Forte, VP of Federal Sales, Cerebras Systems

Strategic decision-makers aren’t just challenged by too little data. Today, too much data can be an equally critical problem. As the total amount of available information grows, processing and using it effectively is more challenging, and more important, than ever before.

The Department of Defense and U.S. intelligence agencies are often tasked with analyzing information gathered from multiple sources. The ability to sort and process this data into actionable intelligence is what allows intelligence analysts and military forces to act with confidence. Analyzing these disparate streams of data quickly enough to provide useful insights is difficult, but a data processing technique known as multimodal machine learning can provide a solution.

Multimodal Data

Modality refers to the way that something is experienced. We observe the world through sounds, smells, tastes, and more. Much of human learning is multimodal. Students in a classroom may consult pictures in a textbook while simultaneously listening to a teacher or lecturer. A person watching a weather report learns by listening to the forecaster and by consulting the displayed weather map.

In AI research, learning is considered multimodal when the system learns from, and cross-references, multiple types of data. Multimodal learning is easy for humans, but not so easy for machines.

For multimodal machine learning to be effective, each data source needs to be interpreted in the context of the others. A student listening to a lecturer while consulting their textbook understands that the information on the page is contextually related to the classroom discussion. Teaching that kind of contextual understanding to an AI is more complicated.

Multimodal Machine Learning

In a multimodal environment, an AI system isn’t trained on a single input type like vision, speech, or text, but on many simultaneously. This allows the system to transfer what has been learned in one domain to another. Capturing the relationships between language, visual signals, and audio gives analysts the most complete view of a situation, from cyber activity to land operations to subsurface happenings.
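
As an illustration, the sketch below shows one common way to combine modalities: each input type gets its own projection into a shared space, and the projected features are fused before a shared prediction head. The module names, feature dimensions, and late-fusion strategy here are assumptions chosen for the example, not a description of any specific Cerebras or government system.

```python
# A minimal late-fusion sketch (illustrative only; dimensions, module
# names, and the fusion strategy are assumptions, not a reference design).
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    def __init__(self, image_dim=2048, text_dim=768, audio_dim=128,
                 hidden_dim=512, num_classes=10):
        super().__init__()
        # One projection per modality maps each input into a shared space.
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # The fused representation is the concatenation of all projections.
        self.classifier = nn.Sequential(
            nn.ReLU(),
            nn.Linear(3 * hidden_dim, num_classes),
        )

    def forward(self, image_feats, text_feats, audio_feats):
        fused = torch.cat([
            self.image_proj(image_feats),
            self.text_proj(text_feats),
            self.audio_proj(audio_feats),
        ], dim=-1)
        return self.classifier(fused)

# Example: one batch of pre-extracted features from each modality.
model = LateFusionClassifier()
logits = model(torch.randn(4, 2048), torch.randn(4, 768), torch.randn(4, 128))
print(logits.shape)  # torch.Size([4, 10])
```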

While the techniques for building large multimodal models are widely known, one of the biggest challenges is the massive amount of processing power required to train them. To speed up multimodal machine learning, many organizations use large clusters of graphics processing units (GPUs). While these clusters reduce training time, they are expensive to build, complicated to program, and can still take days to weeks to train a network. The added hardware, energy, and administrative costs slow the pace of actionable intelligence.

Large-scale pretraining is fast becoming the norm in vision-language (VL) modeling. However, prevailing VL approaches are limited by their dependence on labeled data and by complex, multi-step pretraining objectives. Yet AI and machine learning remain critical to analyzing these ever-growing data silos within a reasonable period of time.
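
One widely used way to pretrain a VL model without task-specific labels is a contrastive objective over paired images and captions, in the style of CLIP. The sketch below is a hedged illustration of that idea; the temperature value and embedding size are assumptions, and the article does not prescribe this particular objective.

```python
# A sketch of a contrastive image-text pretraining loss (CLIP-style).
# The encoders that produce these embeddings are omitted; values are assumed.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize embeddings so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Similarity of every image to every caption in the batch.
    logits = image_emb @ text_emb.t() / temperature
    # The matching image-caption pair sits on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: image-to-text and text-to-image.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Example: a batch of 8 paired image and caption embeddings.
loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```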

The Cerebras Systems Approach

Powered by a wafer-scale processor, the Cerebras CS-2 combines the compute and memory of an entire cluster onto a single chip. When machine learning researchers are training models with multiple data sources and formats, having the programming ease of a single machine becomes invaluable. This allows researchers to focus on the model and application rather than wasting time addressing the challenges that come with programming GPU clusters.

Multimodal machine learning will help your organization take multi-sensor data and piece together a complete view of the landscape, and a revolutionary system will enable it to form actionable insights in a fraction of the time. To learn more about the areas where Cerebras Systems can accelerate innovation, check out our government page.