Course Syllabus


Unlock the power of multimodal AI and learn how modern systems combine text, images, speech, and video to create intelligent applications. This course teaches the foundational concepts behind multimodal GenAI applications, the challenges of integrating diverse data types, and the techniques used to build advanced, interactive systems. You’ll develop core skills in transcription, text-to-speech, image generation, video synthesis, and multimodal reasoning.

Through hands-on labs, you’ll work with generative AI models such as IBM Granite, OpenAI Whisper, DALL·E, Sora, Meta’s Llama, and Mixtral, as well as vision-language architectures, to apply multimodal AI in practical scenarios. You’ll build tools such as captioning systems, text-to-video generators, and AI-powered assistants that can process and respond across multiple data streams.

The course includes full-stack projects using Python, Flask, and Gradio, where you’ll design and deploy complete multimodal AI applications. By the end, you’ll have the technical skills needed to create next-generation AI systems used in search engines, chatbots, creative tools, and enterprise applications.




Course Summary:

Date Details Due