Multimodal Generative AI: Vision, Speech, and Assistants

This course covers the practical application of multimodal AI, from foundational concepts to advanced integration. You will begin with Vision capabilities for Image-to-Text analysis, then move to audio: generating lifelike voices with Text-to-Speech and transcribing recordings with Speech-to-Text (Whisper). The curriculum culminates in the Assistants API, where you will learn to build autonomous agents equipped with Code Interpreter, File Search, and Function Calling. By combining these pillars, you will gain the skills to develop end-to-end AI solutions that can see, hear, speak, and act on complex data.



Starts: N/A
Ends: N/A

Course Summary:

Date Details Due