Multi-Modality in generative AI refers to the ability of a model to work with multiple types of data simultaneously. Instead of processing only images or only text, such a model can take in and generate outputs across different modalities, for example:
Text & Image: Generating images based on descriptions (like "a cat wearing sunglasses on a beach"); a minimal code sketch of this pairing follows the list below.
Image & Audio: Creating music inspired by an image, or generating soundscapes that match the visual elements of a scene.
Video & Text: Generating captions for videos, summarizing video content, or creating new video sequences with specific events and dialogue based on textual input.
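As a rough illustration of the Text & Image pairing, here is a minimal text-to-image sketch using the Hugging Face diffusers library. The specific checkpoint, the CPU device, and the output file name are assumptions for the example, not a prescribed setup.

```python
# Minimal text-to-image sketch with Hugging Face diffusers (assumed setup).
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint; float32 keeps it runnable on CPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint, any SD variant works
    torch_dtype=torch.float32,
)
pipe = pipe.to("cpu")  # move to "cuda" if a GPU is available

# The prompt from the example above drives the generation.
prompt = "a cat wearing sunglasses on a beach"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cat_sunglasses_beach.png")
```

On a CPU this will be slow (minutes per image), which is why smaller or distilled checkpoints are usually preferred on GPU-less machines.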
Multi-modal models are indeed a significant trend in the Generative AI and Machine Learning community. They reflect the growing recognition that many real-world problems and applications involve multiple types of information, and that there are potential synergies in processing different modalities together.
Why is Multi-Modality important?
It breaks down barriers between disciplines and allows AI models to understand complex relationships across different data types:
More realistic outputs: Models can generate more believable and engaging outputs by combining multiple inputs that reflect real-world scenarios.
Better understanding of context: Multi-modal AI systems can integrate information from various sources for a richer, more nuanced analysis.
Enhanced creativity: Combining different modalities unlocks new possibilities for creative expression in areas like music composition, storytelling, and design.

Here are a few examples of Multi-Modal Generative AI models:
DALL-E 2 & Stable Diffusion: These models generate images from text prompts. You can describe a scene or object, and they'll create unique variations based on your input.
Jukebox: This OpenAI model produces music in different genres, conditioned on textual inputs such as genre, artist, and lyrics.
CLIP (Contrastive Language-Image Pre-training): This model learns a shared embedding space for images and their associated text, enabling zero-shot image classification and serving as a building block for image captioning and visual question answering systems.
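To make the CLIP idea concrete, here is a small sketch using the Hugging Face transformers library to score how well a few candidate captions match an image. The checkpoint name, the sample image URL, and the captions are illustrative assumptions.

```python
# Sketch of CLIP-style image/text matching with Hugging Face transformers.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any local image works; here we fetch a sample image from a URL.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions: CLIP scores how well each one matches the image.
captions = ["a photo of a cat", "a photo of a dog", "a photo of a beach"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # similarity -> probabilities

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```

The caption with the highest probability is the one CLIP considers the best textual match for the image, which is exactly the image-text alignment that captioning and visual question answering systems build on.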
Multi-modal AI offers immense potential across various fields:
Entertainment: Creating realistic video games, immersive storytelling experiences, and personalized music production tools.
Research & Development: Analyzing complex datasets from multiple sources to gain deeper insights and accelerate innovation in healthcare, climate modeling, and other scientific disciplines.
Accessibility: Providing assistive technologies for visually impaired individuals or those with speech disabilities by generating audio descriptions of images or translating text into sign language.
At NVIDIA's #SIGGRAPH2024 event, the team unveiled a real-time rendering system for complex scenes. Using neural decoders and graphics priors, it enables film-quality visuals in real-time applications such as games.
Multi-Modal Real Case - Autonomous Vehicles…
Waabi is revolutionizing the autonomous trucking industry by leveraging generative AI to develop its self-driving solution called Waabi Driver. This Toronto-based startup has partnered with NVIDIA, utilizing the company's DRIVE Thor centralized computer and NVIDIA DRIVE OS, an operating system for safe and AI-defined autonomous vehicles.
Their approach combines two powerful generative AI systems: a "teacher" named Waabi World, which trains and validates their "student," the Waabi Driver, a single end-to-end AI system capable of human-like reasoning.
This innovative method significantly reduces on-road testing requirements while enhancing safety and efficiency. By using Generative AI to create an end-to-end system where foundation models learn from real-world observations without manual intervention, Waabi's approach allows for faster development cycles and a more scalable solution.
The combination of generative AI simulation with a foundation AI model specifically designed for physical action enables the company to rapidly deploy their technology in various environments.
Waabi is taking significant strides towards making driverless trucking a reality. Their collaboration with NVIDIA, coupled with funding from prominent investors (NVIDIA among them), highlights their commitment to driving advancements in autonomous vehicle development.
Ok, but can I do something with it too?
The answer is obviously… YES!
Over the past months, I have tried with all the might of my powerless, poor-GPU PC to run multi-modal models. What I managed to do is run both an image-to-text model and an audio-to-text model on my CPU-only mini-PC.
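Here is a minimal sketch of what such a CPU-only setup can look like with the Hugging Face transformers pipelines. The specific checkpoints (BLIP for captioning, Whisper-tiny for transcription) and the placeholder file names are just examples; other small models work too.

```python
# Minimal CPU-only sketch: image-to-text and audio-to-text with transformers.
# Checkpoints and file names below are placeholders, not a fixed recipe.
from transformers import pipeline

# Image-to-text: caption a local image file.
captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",
    device=-1,  # -1 forces CPU
)
print(captioner("my_photo.jpg")[0]["generated_text"])

# Audio-to-text: transcribe a local audio file (ffmpeg is needed for decoding).
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    device=-1,
)
print(transcriber("my_recording.wav")["text"])
```

Both models are small enough to run on a modest CPU-only machine, which is what makes this kind of experiment possible without a GPU.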
And I am sharing it with you all, with a double gift:

This is only the start!
Hope you will find all of this useful. Feel free to contact me on Medium.
I am using Substack only for the newsletter. Here, every week, I give away free links to my paid articles on Medium.
Follow me and read my latest articles: https://medium.com/@fabio.matricardi