Azure OpenAI Service Models Part 3 – Multimodal Models

mydinhletra1

I have introduced Azure OpenAI models and gone into detail about Language Models in previous blogs. You can find it here:

Language Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-1-language-models/

Reasoning & Pronlem-Solving Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-2-reasoning-problem-solving-models/

Now, let’s look at another type of model: Multimodal Models!

What is Multimodal Models?

Multimodal models are models that can process multiple input types (like images and audio alongside text).

Some example models:

GPT-4o Series (Omni Models)

gpt-4o (text, vision, and audio)

gpt-4o-mini

gpt-4o-audio-preview

gpt-4o-realtime-preview

gpt-4o-mini-realtime-preview

gpt-4o-mini-audio-preview

Examples

Note: Below are just the writer’s personal experiments.

Example Use Scenarios

Text + Image Input (Vision)

Prompt

“Describe what’s happening in this image and write a short story based on it.”

✅ Supported by: gpt-4o, gpt-4o-mini

Voice Input → Text Output (Speech)

Prompt

(User says: “What’s the weather in Paris today?”)

✅ Supported by: gpt-4o, gpt-4o-audio-preview, gpt-4o-realtime-preview

Multimodal Reasoning

Prompt

Provide a detailed explanation of this chart and convert the trend into a spoken summary.

✅ Supported by: gpt-4o (text + image in, audio out)

Now, let’s go through the scenarios to compare in more detail!

Scenario 1: Image + Text (Multimodal Visual Task)

Prompt

“Describe the contents of this image and suggest a creative caption for social media.”

Input: [Imagine a photo showing a cat lounging in a hammock under palm trees.]

Model	Response Output
gpt-4o ✅ Text + Image	🧠 “This image shows a relaxed tabby cat lounging in a hammock tied between two palm trees on a sunny beach. 🌴🐱 The lighting suggests late afternoon. Creative Caption: ‘Living my best 9 lives 😸☀️ #PawsAndRelax’”
gpt-4o-mini ✅ Text + Image	💡 “A cat is resting in a hammock on a beach. Caption: ‘Beach bum feline mode activated.’”
gpt-4o-audio-preview ❌	⚠️ Not applicable – image input unsupported
gpt-4o-mini-audio-preview ❌	⚠️ Not applicable – image input unsupported

Scenario 2: Audio + Text (Transcription + Summarization)

Prompt (Audio)

User speaks:

“Hey, can you please summarize today’s team meeting? We discussed the launch timeline, shifted the QA deadline to next Friday, and assigned testing to Priya.”

Model	Response Output
gpt-4o ✅ Audio	`🧠`“Summary: The team discussed the product launch timeline. The QA deadline has been moved to next Friday. Priya is now responsible for testing.”
gpt-4o-audio-preview ✅ Audio	🎧 “Meeting Summary: QA testing deadline changed to next Friday. Priya to lead testing. Launch schedule reviewed.”
gpt-4o-realtime-preview ✅ Audio	⚡ “Got it! Priya is testing, and QA deadline is next Friday. Launch timing confirmed.” (streamed response)
gpt-4o-mini-audio-preview ✅ Audio	💬 “QA moved to Friday. Priya handles testing.” (shorter, lower latency)
gpt-4o-mini ❌	⚠️ Audio input not supported

Scenario 3: Text-Only Reasoning (Shared Modality)

Prompt

“A train leaves Boston at 2:00 PM going 60 mph. Another train leaves New York at 3:00 PM going 80 mph toward Boston. When do they meet?”

Model	Response Output
All Models (Text OK)	⏱ All models (even audio/image-specialized) handle this, but gpt-4o and gpt-4o-mini provide clearer step-by-step math breakdowns.
gpt-4o	🧮 “Let the distance be D miles. Train A travels for t hours at 60 mph. Train B starts 1 hour later at 80 mph. They meet when 60t + 80(t – 1) = D…” → Final answer: 4:40 PM
gpt-4o-mini	🧠 Same answer, slightly less explanation
gpt-4o-realtime-preview	✅ Still accurate, responds conversationally

Summary Comparison

Scenario	Best Models	Why
Image + Text	gpt-4o, gpt-4o-mini	Only ones with vision support
Audio + Text	gpt-4o, gpt-4o-realtime-preview, gpt-4o-mini-audio-preview	Audio understanding and response
Text-Only Reasoning	All models	All capable, gpt-4o gives most detail
Real-Time Chat	gpt-4o-realtime-preview, gpt-4o-mini-audio-preview	Lowest latency, voice enabled

Coding Task Example: Bug Identification

Prompt (Text)

“Here’s a Python function that should return the factorial of a number, but it’s not working correctly. Please identify and fix the bug.”

def factorial(n):
    if n == 0:
        return 0
    else:
        return n * factorial(n - 1)

Objective

Detect the logic bug (it returns 0 for n = 0, but factorial of 0 is 1) and correct it.

Model Responses (Side-by-Side)

Model	Response
gpt-4o ✅	“There is a logic error: factorial of 0 should return 1. Fixed code:

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

gpt-4o-mini

✅ | Same fix, slightly more concise:
Replace return 0 with return 1 when n == 0.

gpt-4o-realtime-preview

✅ More conversational and fast:
“Oh, I see the issue. Factorial of 0 should be 1! Just change that line.”

gpt-4o-mini-audio-preview

✅ Similar fix, shorter response:
“Return 1 when n == 0, not 0.”

gpt-4o-audio-preview

✅ Provides explanation plus fixed code verbally if using voice:
“Bug found. In recursion, factorial(0) should be 1. Here’s the corrected version…” (then recites the code)

Audio Variant (Spoken Prompt Input)

User speaks

“Hey, here’s some code. Can you fix the factorial bug?”

gpt-4o, gpt-4o-realtime-preview, and gpt-4o-mini-audio-preview

gpt-4o, gpt-4o-realtime-preview, and gpt-4o-mini-audio-preview: respond instantly with a fix.

gpt-4o-audio-preview: gives a slower but detailed spoken explanation.

Observations

Model	Strength in Code Tasks	Notes
gpt-4o	⭐ Best reasoning + clear fix	Handles logic + recursion details
gpt-4o-mini	✅ Efficient fix	Fewer words, good for fast apps
gpt-4o-realtime-preview	⚡ Interactive fix via voice	Great for quick assistant use
gpt-4o-mini-audio-preview	⚡ Voice input & fix fast	Ideal for mobile/embedded systems
gpt-4o-audio-preview	🎧 Detailed voice support	Longer, audio-focused responses

Conclusion

Some general comparisons

Multimodal Models in Azure OpenAI (As of 2025)

Model Name	Modalities Supported	Context Length	Strengths	Use Cases	Notes
gpt-4o	Text, Vision, Audio	128K tokens	Fast, smart, real-time, handles all 3 modalities	Chatbots, image understanding, speech, multimodal reasoning	Flagship model, general purpose
gpt-4o-mini	Text, Vision	128K tokens	Smaller, more efficient than full gpt-4o	Lightweight multimodal reasoning	No audio yet
gpt-4o-audio-preview	Text, Audio	128K tokens	Low-latency voice input	Speech-to-text, voice understanding	Preview model for audio interaction
gpt-4o-realtime-preview	Text, Audio	128K tokens	Real-time responsiveness	Live conversation, fast chatbot	Also in preview
gpt-4o-mini-audio-preview	Text, Audio	128K tokens	Lightweight speech model	Compact voice interfaces	Mini version
gpt-4o-mini-realtime-preview	Text, Audio	128K tokens	Optimized for speed	Fast, on-device voice interface	Experimental

Key Differences

Feature	gpt-4o	gpt-4o-mini	gpt-4o-audio-preview	gpt-4o-realtime-preview
Text	✅	✅	✅	✅
Image (Vision)	✅	✅	❌	❌
Audio (Voice/Speech)	✅	❌	✅	✅
Latency	Medium	Low	Very Low	Real-time
Model Size	Large	Mini	Medium	Medium
Ideal For	Complex tasks	Edge devices	Voice UX	Live conversation

Summary

If you need comprehensive multimodal capabilities (text, images, audio): go with gpt-4o.

If you need fast response and voice input/output: try gpt-4o-realtime-preview.

If you’re targeting low-resource environments or edge devices: try gpt-4o-mini.

Note: The above results are only based on the author’s personal experience and testing, depending on the case, the results may be different. In addition, AI is constantly developing and new models will be continuously released, so in the future, the recommendations may no longer be correct.

Azure OpenAI Service Models Part 3 – Multimodal Models

mydinhletra1

Table of Contents

What is Multimodal Models?

Examples

Example Use Scenarios

Scenario 1: Image + Text (Multimodal Visual Task)

Scenario 2: Audio + Text (Transcription + Summarization)

Scenario 3: Text-Only Reasoning (Shared Modality)

Summary Comparison

Coding Task Example: Bug Identification

Prompt (Text)

Objective

Model Responses (Side-by-Side)

Audio Variant (Spoken Prompt Input)

Observations

Conclusion

Some general comparisons

Multimodal Models in Azure OpenAI (As of 2025)

Key Differences

Summary

mydinhletra1

Suggested Article

NashTech

Solutions

Useful links

Connect with us

Our achievements