Multi-Modality in generative AI refers to the ability of a model to work with multiple types of data simultaneously. Instead of processing only images or only text, such a model can take in and generate outputs across different modalities, for example:
Text & Image: Generating images based on descriptions (like "a cat wearing sunglasses on a beach"); a minimal code sketch of this pairing follows the list below.
Image & Audio: Creating music inspired by an image, or generating soundscapes that match the visual elements of a scene.
Video & Text: Generating captions for videos, summarizing video content, or creating new video sequences with specific events and dialogue based on textual input.
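As a rough illustration of the Text & Image pairing, here is a minimal text-to-image sketch using the Hugging Face diffusers library. The specific checkpoint, the CPU device, and the output file name are assumptions for the example, not a prescribed setup.

```python
# Minimal text-to-image sketch with Hugging Face diffusers (assumed setup).
import torch
from diffusers import StableDiffusionPipeline

# Load a Stable Diffusion checkpoint; float32 keeps it runnable on CPU.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed checkpoint, any SD variant works
    torch_dtype=torch.float32,
)
pipe = pipe.to("cpu")  # move to "cuda" if a GPU is available

# The prompt from the example above drives the generation.
prompt = "a cat wearing sunglasses on a beach"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cat_sunglasses_beach.png")
```

On a CPU this will be slow (minutes per image), which is why smaller or distilled checkpoints are usually preferred on GPU-less machines.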
Multi-modal models are indeed a significant trend in the Generative AI and Machine Learning community. They reflect the growing recognition that many real-world problems and applications involve multiple types of information, and that there are potential synergies in processing different modalities together.
Why is Multi-Modality important?
It breaks down barriers between disciplines and allows AI models to understand complex relationships across different data types:
More realistic outputs: Models can generate more believable and engaging outputs by combining multiple inputs that reflect real-world scenarios.
Better understanding of context: Multi-modal AI systems can integrate information from various sources for a richer, more nuanced analysis.
Enhanced creativity: Combining different modalities unlocks new possibilities for creative expression in areas like music composition, storytelling, and design.

Here are a few examples of Multi-Modal Generative AI models:
DALL-E 2 & Stable Diffusion: These models generate images from text prompts. You can describe a scene or object, and they'll create unique variations based on your input.
Jukebox: This OpenAI model produces music in different genres, conditioned on textual inputs such as genre, artist, and lyrics.
CLIP (Contrastive Language-Image Pre-training): This model learns a shared embedding space for images and their associated text, enabling zero-shot image classification and serving as a building block for image captioning and visual question answering systems.
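To make the CLIP idea concrete, here is a small sketch using the Hugging Face transformers library to score how well a few candidate captions match an image. The checkpoint name, the sample image URL, and the captions are illustrative assumptions.

```python
# Sketch of CLIP-style image/text matching with Hugging Face transformers.
import requests
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Any local image works; here we fetch a sample image from a URL.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions: CLIP scores how well each one matches the image.
captions = ["a photo of a cat", "a photo of a dog", "a photo of a beach"]
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)  # similarity -> probabilities

for caption, p in zip(captions, probs[0].tolist()):
    print(f"{caption}: {p:.3f}")
```

The caption with the highest probability is the one CLIP considers the best textual match for the image, which is exactly the image-text alignment that captioning and visual question answering systems build on.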
Multi-modal AI offers immense potential across various fields:
Entertainment: Creating realistic video games, immersive storytelling experiences, and personalized music production tools.
Research & Development: Analyzing complex datasets from multiple sources to gain deeper insights and accelerate innovation in healthcare, climate modeling, and other scientific disciplines.
Accessibility: Providing assistive technologies for visually impaired individuals or those with speech disabilities by generating audio descriptions of images or translating text into sign language.
At NVIDIA's #SIGGRAPH2024 event, the team unveiled a real-time rendering system for complex scenes. Using neural decoders and graphics priors, it enables film-quality visuals in real-time applications such as games.
Multi-Modal Real Case - Autonomous Vehicles…
Waabi is revolutionizing the autonomous trucking industry by leveraging generative AI to develop its self-driving solution called Waabi Driver. This Toronto-based startup has partnered with NVIDIA, utilizing the company's DRIVE Thor centralized computer and NVIDIA DRIVE OS, an operating system for safe and AI-defined autonomous vehicles.
Their approach combines two powerful generative AI systems: a "teacher" named Waabi World, which trains and validates their "student," the Waabi Driver, a single end-to-end AI system capable of human-like reasoning.
This innovative method significantly reduces on-road testing requirements while enhancing safety and efficiency. By using Generative AI to create an end-to-end system where foundation models learn from real-world observations without manual intervention, Waabi's approach allows for faster development cycles and a more scalable solution.
The combination of generative AI simulation with a foundation AI model specifically designed for physical action enables the company to rapidly deploy their technology in various environments.
Waabi is taking significant strides towards making driverless trucking a reality. Their collaboration with NVIDIA, coupled with funding from prominent investors (NVIDIA among them), highlights their commitment to driving advancements in autonomous vehicle development.
Ok, but can I do something with it too?
The answer is obviously… YES!
Over the past months, I have tried with all the might of my powerless, poor-GPU PC to run multi-modal models. What I managed to do is run both an image-to-text model and an audio-to-text model on my CPU-only mini-PC.
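Here is a minimal sketch of what such a CPU-only setup can look like with the Hugging Face transformers pipelines. The specific checkpoints (BLIP for captioning, Whisper-tiny for transcription) and the placeholder file names are just examples; other small models work too.

```python
# Minimal CPU-only sketch: image-to-text and audio-to-text with transformers.
# Checkpoints and file names below are placeholders, not a fixed recipe.
from transformers import pipeline

# Image-to-text: caption a local image file.
captioner = pipeline(
    "image-to-text",
    model="Salesforce/blip-image-captioning-base",
    device=-1,  # -1 forces CPU
)
print(captioner("my_photo.jpg")[0]["generated_text"])

# Audio-to-text: transcribe a local audio file (ffmpeg is needed for decoding).
transcriber = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-tiny",
    device=-1,
)
print(transcriber("my_recording.wav")["text"])
```

Both models are small enough to run on a modest CPU-only machine, which is what makes this kind of experiment possible without a GPU.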
And I am sharing it with you all, with a double gift:

This is only the start!
Hope you will find all of this useful. Feel free to contact me on Medium.
I am using Substack only for the newsletter. Here, every week, I give away free links to my paid articles on Medium.
Follow me and read my latest articles: https://medium.com/@fabio.matricardi