Multi-Modal Models

“A picture is worth a thousand tokens.” — Modern ML proverb

Learning Objectives

By the end of this module, you will be able to:

Understand what multi-modal means and which modes are available
Identify when images, audio, or video add value to AI interactions
Use vision capabilities effectively for practical tasks
Recognize the limitations of each modality
Choose the right mode for the right task

Beyond Text

Everything so far has been text-based: you write text, AI responds with text.

But modern AI models can process multiple modalities:

Text — What we’ve been using
Images — Photos, screenshots, diagrams, documents
Audio — Speech, music, sound
Video — Moving images with or without audio

This opens new possibilities — but also new considerations.

Multi-modal = The model can process multiple types of input and/or produce multiple types of output.

Model	Input Modes	Output Modes
Text-only (older)	Text	Text
Vision-enabled	Text + Images	Text
Audio-enabled	Text + Audio	Text (+ Audio in some)
Full multi-modal	Text + Images + Audio + Video	Text + Images + Audio

Current state (late 2025):

Vision is widely available and reliable
Audio input is available (transcription, understanding)
Audio output (speech synthesis) is improving rapidly
Video understanding is emerging but limited
Image generation is separate (DALL-E, Midjourney, etc.)

Vision: The Most Useful Modality

For software development, vision capabilities are the most immediately practical.

What Vision Models Can Do

Task	Example
Read text in images	Extract text from screenshots, photos of whiteboards
Understand diagrams	Interpret flowcharts, architecture diagrams, wireframes
Analyze screenshots	Describe UI, identify issues, suggest improvements
Read documents	Process PDFs, scanned documents, handwritten notes
Understand code in images	Extract code from screenshots or photos
Compare images	Spot differences, analyze changes

When to Use Vision

Use vision when:

You have visual information that’s hard to describe in text
You want to reference something specific you can see
The image contains text/code you don’t want to manually transcribe
You’re debugging UI issues shown in screenshots
You want feedback on visual design

Don’t use vision when:

You can easily describe it in text
You need precise character-by-character accuracy
The image contains sensitive information you shouldn’t share
Text-based context would be clearer

Practical Example: Screenshot Analysis

Instead of describing a bug in words:

Text-only approach:

“There’s a button on my page that’s positioned wrong. It’s a blue button with white text that says ‘Submit’ and it’s overlapping with the form fields above it…”

Vision approach:

[Attach screenshot] “Why is the Submit button overlapping the form fields? How do I fix the positioning?”

The screenshot shows the problem instantly.

Using Vision Effectively

Good Practices

1. Provide context with images

Don’t just attach an image silently. Explain what it is and what you want:

[Attached: screenshot of my login page]

This is my current login page design. I want feedback on:
1. Is the layout intuitive?
2. Are there accessibility issues?
3. How could I improve the error state styling?

2. Crop to relevance

A full-screen screenshot has lots of irrelevant information. Crop to show just what matters:

Full desktop screenshot → Harder to process
Cropped to the specific UI element → Clearer, faster

3. Ensure readability

Text in images should be large enough to read
Diagrams should be clear, not blurry
High contrast helps

4. Reference specific parts

If the image is complex, tell the AI where to look:

“In the top-right of this screenshot, there’s an error message. What does it mean and how do I fix it?”

What Vision Struggles With

Limitation	Example
Tiny text	Small font sizes may be misread
Handwriting	Messy handwriting is unreliable
Complex diagrams	Very dense diagrams may be misinterpreted
Precise counting	”How many items in this list?” may be wrong
Subtle details	Minor visual differences may be missed
Spatial reasoning	Complex layouts may confuse the model

Audio Capabilities

Audio Input (Transcription & Understanding)

Modern models can:

Transcribe speech to text
Understand spoken questions directly
Identify speakers in conversations
Understand tone and context

Use cases:

Meeting transcription
Voice-based coding assistance
Processing podcasts or lectures
Accessibility (voice input)

Audio Output (Speech Synthesis)

Some systems can:

Read responses aloud
Generate natural-sounding speech
Match different voices or styles

Current state: Improving rapidly, but still often sounds synthetic.

Audio in Practice

For most development tasks, audio is secondary:

Text remains faster for coding
Audio helps for accessibility or hands-free work
Transcription is useful for processing recorded content

Video Understanding

Current Capabilities (Emerging)

Understand what’s happening in short video clips
Answer questions about video content
Extract information from video presentations

Limitations

Processing time is significant
Context is limited (can’t watch full movies)
Understanding is surface-level compared to images
Not all models support video

Practical Use

For now, video is most useful for:

Analyzing short UI recordings (bug reproduction)
Understanding tutorial clips
Processing recorded demos

For longer content, consider extracting key frames as images instead.

Choosing the Right Mode

Decision Framework

Is the information primarily visual?
├── Yes → Consider using an image
│   └── Is it complex/hard to describe?
│       ├── Yes → Image is probably better
│       └── No → Text might be simpler
└── No → Stick with text

Is there audio content to process?
├── Yes → Use audio input
│   └── Or transcribe first, then use text
└── No → Not needed

Is there video?
├── Yes → Does it need to be video?
│   ├── Yes → Use video (if supported)
│   └── No → Extract frames as images
└── No → Not needed

Mode Comparison

Mode	Best For	Avoid When
Text	Code, structured data, clear descriptions	Visual information hard to describe
Images	Screenshots, diagrams, UI, documents	Sensitive content, tiny details
Audio	Transcription, voice input, accessibility	When text is more efficient
Video	Motion-dependent info, UI recordings	Long content, when frames suffice

Practical Exercises

Exercise 1: Screenshot Workflow

Take a screenshot of some code in your editor
Ask the AI: “What does this code do?”
Compare: How does this compare to copy-pasting the code as text?
When would each approach be better?

Exercise 2: Diagram Understanding

Find or draw a simple flowchart
Take a photo/screenshot
Ask the AI to explain what the flowchart shows
Ask it to convert the flowchart to pseudocode

Exercise 3: Document Processing

Find a PDF or image of a document
Ask the AI to extract specific information from it
Verify accuracy — did it get the details right?

Exercise 4: Mode Comparison

Describe the same problem two ways:

In text only
With a screenshot

Compare the AI’s responses. Which was more effective?

Combining Text and Images

[Image attached: error message screenshot]

I'm seeing this error when I try to run my application.

Context:
- Python 3.11
- Flask application
- Just installed a new dependency

What's causing this error and how do I fix it?

The image shows the exact error; the text provides context the image doesn’t have.

Multiple Images

You can often attach multiple images:

[Image 1: current UI design]
[Image 2: mockup of desired design]

How do I modify my CSS to make Image 1 look like Image 2?
Focus on the navigation bar positioning and colors.

Describing What You Want from the Image

Be specific about what you need:

“Transcribe all text in this image”
“Describe the layout of this UI”
“What error message is shown?”
“Extract the data from this table”
“What’s wrong with this diagram?”

Privacy and Security Considerations

Ask yourself:

Does this image contain sensitive information?
Are there passwords, API keys, or credentials visible?
Is there personally identifiable information?
Would it be a problem if this image were stored or logged?

Sanitizing Images

Before uploading:

Blur or crop sensitive information
Use placeholders for secrets ([REDACTED])
Consider whether a description would suffice

Remember

When you send images to AI systems:

They may be processed by external servers
They may be stored (check provider policies)
They may be used for training (depends on settings/provider)

Apply the same caution you would with text containing sensitive data.

Key Insights

Concept	Practical Rule
Vision is most useful	Screenshots, diagrams, documents — use images when visual
Provide context	Don’t just attach; explain what you want
Crop to relevance	Less noise = better understanding
Know the limits	Tiny text, handwriting, complex details may fail
Consider privacy	Images can contain sensitive info — sanitize first
Choose appropriately	Text is often simpler; images when visual matters

Connection to What’s Next

Multi-modal expands what you can communicate to AI. Next:

Module 08: Reasoning models — when to use extended thinking
Module 09: Context and memory — managing long conversations

These build on your ability to provide rich context (text + images) and get sophisticated responses.

Reflection Questions

Think of a recent problem you solved with AI. Could you have solved it faster with images? Or was text sufficient?
What are the privacy implications of sharing screenshots of your work with AI systems? How would you mitigate risks?
As video understanding improves, what new use cases might emerge for software development?

Next module: Reasoning Models — when and how to use extended thinking capabilities.