In November 2022, ChatGPT (Trends 2023) captured the public imagination and offered a glimpse of what is possible with generative artificial intelligence (GenAI). Since then, it has been a race to invest and gain advantage. While the viral “blue duck” video introducing Gemini by Google DeepMind risked overselling the technology, multimodal AI will be an important evolution for artificial intelligence.

Multimodal AI is artificial intelligence that combines multiple types (or modes) of data input, making it possible to generate more insightful or nuanced conclusions about real-world questions. Until now, most AI systems have been unimodal: designed and trained to work exclusively with one type of data and tuned for that modality. For example, the original ChatGPT uses natural language processing (NLP) algorithms to extract meaning from text content and produce text-only output. Multimodal AI, by contrast, accepts and processes data from multiple sources, including images, video, speech, sound, code, and text. Multiple inputs allow for a more detailed, refined assessment of a particular environment or situation. A multimodal NLP system, for example, may identify signs of emotion in a user’s voice and combine them with facial expressions to better interpret a query and tailor an appropriate response. In this way, multimodal AI more closely resembles human perception.
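To make the idea concrete, the sketch below shows one simple way two modalities could be fused: per-emotion scores from a voice model and a facial-expression model are averaged, and the fused label shapes the wording of a reply. The function names, scores, and weighting are illustrative assumptions, not any particular product’s API.

```python
# Minimal late-fusion sketch: combine emotion estimates from two modalities
# (voice and facial expression) into a single label that conditions a reply.
# The per-modality "classifiers" below are placeholders, not real models.

EMOTIONS = ["neutral", "happy", "frustrated"]

def voice_emotion_scores(audio_features: dict) -> list[float]:
    # Placeholder: a real system would run an audio emotion model here.
    return [0.2, 0.1, 0.7]

def face_emotion_scores(image_features: dict) -> list[float]:
    # Placeholder: a real system would run a facial-expression model here.
    return [0.3, 0.1, 0.6]

def fuse(scores_a: list[float], scores_b: list[float], w_a: float = 0.5) -> list[float]:
    # Late fusion: weighted average of per-emotion scores from each modality.
    return [w_a * a + (1 - w_a) * b for a, b in zip(scores_a, scores_b)]

def respond(query: str, emotion: str) -> str:
    # Tailor the wording of the reply to the fused emotion estimate.
    if emotion == "frustrated":
        return f"I'm sorry this has been difficult. Let's sort it out: {query}"
    return f"Happy to help with: {query}"

fused = fuse(voice_emotion_scores({}), face_emotion_scores({}))
emotion = EMOTIONS[fused.index(max(fused))]
print(respond("reset my password", emotion))
```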
Multimodal AI will be central to the development of autonomous vehicles (Passenger Economy, Trends 2018) and robotics (Rise of the Machines, Trends 2014) that need to interact with real-world environments. Multimodal AI uses data from cameras, microphones, GPS, radar, LiDAR, and a host of other sensors to better understand and more successfully interact with its surroundings. Likewise, multimodal AI will enable more effective and intuitive human-computer interaction through the use of sensors and wearables (XR Trends 2023) that may even extend to the metaverse (Metaverse Trends 2022).
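As a toy illustration of combining sensor inputs, the sketch below blends two independent distance estimates (say, radar and LiDAR) by inverse-variance weighting. The numbers and the weighting scheme are assumptions for illustration only, not a description of any production perception stack.

```python
# Minimal sensor-fusion sketch: blend two independent range estimates
# (e.g., radar and LiDAR) by inverse-variance weighting. Values are
# illustrative; a real autonomous-vehicle stack is far more involved.

def fuse_ranges(r1: float, var1: float, r2: float, var2: float) -> tuple[float, float]:
    """Return the fused range estimate and its variance."""
    w1 = 1.0 / var1
    w2 = 1.0 / var2
    fused = (w1 * r1 + w2 * r2) / (w1 + w2)
    fused_var = 1.0 / (w1 + w2)
    return fused, fused_var

# Example: radar reports 24.0 m (noisier), LiDAR reports 23.4 m (more precise).
distance, uncertainty = fuse_ranges(24.0, 0.50, 23.4, 0.05)
print(f"fused distance: {distance:.2f} m (variance {uncertainty:.3f})")
```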