
I have introduced Azure OpenAI models and gone into detail about Language Models in previous blogs. You can find them here:
- Language Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-1-language-models/
- Reasoning & Problem-Solving Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-2-reasoning-problem-solving-models/
Now, let’s look at another type of model: Multimodal Models!
What are Multimodal Models?
Multimodal models are models that can process multiple input types, such as images and audio, alongside text; a minimal call sketch follows the model list below.
Some example models:
- gpt-4o (text, vision, and audio)
- gpt-4o-mini
- gpt-4o-audio-preview
- gpt-4o-realtime-preview
- gpt-4o-mini-realtime-preview
- gpt-4o-mini-audio-preview
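Before digging into scenarios, here is a minimal sketch of calling a deployed gpt-4o model through the Azure OpenAI chat completions API with the official openai Python SDK. The endpoint, key, API version, and deployment name are placeholders you would replace with your own values:

```python
import os
from openai import AzureOpenAI  # pip install openai

# Placeholder endpoint, key, and API version - replace with your own Azure values
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",  # assumed version; use one your resource supports
)

response = client.chat.completions.create(
    model="gpt-4o",  # your *deployment* name in Azure, not necessarily the model family name
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
)
print(response.choices[0].message.content)
```

The later snippets in this post reuse this `client` object.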
Examples
Note: the examples below are just the writer’s personal experiments.
Example Use Scenarios
Text + Image Input (Vision)
Prompt
“Describe what’s happening in this image and write a short story based on it.”
✅ Supported by: gpt-4o, gpt-4o-mini
Voice Input → Text Output (Speech)
Prompt
(User says: “What’s the weather in Paris today?”)
✅ Supported by: gpt-4o, gpt-4o-audio-preview, gpt-4o-realtime-preview
Multimodal Reasoning
Prompt
“Provide a detailed explanation of this chart and convert the trend into a spoken summary.”
✅ Supported by: gpt-4o (text + image in, audio out)
Now, let’s go through the scenarios to compare in more detail!
Scenario 1: Image + Text (Multimodal Visual Task)
Prompt
“Describe the contents of this image and suggest a creative caption for social media.”
Input: [Imagine a photo showing a cat lounging in a hammock under palm trees.]
| Model | Response Output |
| --- | --- |
| gpt-4o ✅ Text + Image | 🧠 “This image shows a relaxed tabby cat lounging in a hammock tied between two palm trees on a sunny beach. 🌴🐱 The lighting suggests late afternoon. Creative Caption: ‘Living my best 9 lives 😸☀️ #PawsAndRelax’” |
| gpt-4o-mini ✅ Text + Image | 💡 “A cat is resting in a hammock on a beach. Caption: ‘Beach bum feline mode activated.’” |
| gpt-4o-audio-preview ❌ | ⚠️ Not applicable – image input unsupported |
| gpt-4o-mini-audio-preview ❌ | ⚠️ Not applicable – image input unsupported |
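As a sketch of how Scenario 1 could be sent to a vision-capable deployment (reusing the `client` from the earlier snippet): the image travels as an `image_url` content part next to the text prompt. The image URL here is a placeholder, not a real asset:

```python
response = client.chat.completions.create(
    model="gpt-4o",  # or a gpt-4o-mini deployment; both accept image input
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the contents of this image and suggest "
                            "a creative caption for social media.",
                },
                # Hypothetical URL - point this at a real, reachable image
                {"type": "image_url", "image_url": {"url": "https://example.com/cat-hammock.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```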
Scenario 2: Audio + Text (Transcription + Summarization)
Prompt (Audio)
User speaks:
“Hey, can you please summarize today’s team meeting? We discussed the launch timeline, shifted the QA deadline to next Friday, and assigned testing to Priya.”
| Model | Response Output |
| --- | --- |
| gpt-4o ✅ Audio | 🧠 “Summary: The team discussed the product launch timeline. The QA deadline has been moved to next Friday. Priya is now responsible for testing.” |
| gpt-4o-audio-preview ✅ Audio | 🎧 “Meeting Summary: QA testing deadline changed to next Friday. Priya to lead testing. Launch schedule reviewed.” |
| gpt-4o-realtime-preview ✅ Audio | ⚡ “Got it! Priya is testing, and QA deadline is next Friday. Launch timing confirmed.” (streamed response) |
| gpt-4o-mini-audio-preview ✅ Audio | 💬 “QA moved to Friday. Priya handles testing.” (shorter, lower latency) |
| gpt-4o-mini ❌ | ⚠️ Audio input not supported |
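For Scenario 2, spoken input can be passed to an audio-capable deployment as a base64-encoded `input_audio` content part. This is a sketch reusing the `client` above; the local meeting.wav file and the gpt-4o-audio-preview deployment name are assumptions:

```python
import base64

# Hypothetical local recording of the spoken meeting summary request
with open("meeting.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("ascii")

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # a deployment with audio input enabled
    modalities=["text"],           # ask for a plain text summary back
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please summarize this meeting recording."},
                {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "wav"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```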
Scenario 3: Text-Only Reasoning (Shared Modality)
Prompt
“A train leaves Boston at 2:00 PM going 60 mph. Another train leaves New York at 3:00 PM going 80 mph toward Boston. When do they meet?”
| Model | Response Output |
| --- | --- |
| All Models (Text OK) | ⏱ All models (even audio/image-specialized) handle this, but gpt-4o and gpt-4o-mini provide clearer step-by-step math breakdowns. |
| gpt-4o | 🧮 “Let the distance be D miles. Train A travels for t hours at 60 mph. Train B starts 1 hour later at 80 mph. They meet when 60t + 80(t – 1) = D…” → Final answer: 4:40 PM |
| gpt-4o-mini | 🧠 Same answer, slightly less explanation |
| gpt-4o-realtime-preview | ✅ Still accurate, responds conversationally |
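Note that the prompt never states the Boston–New York distance D, so the 4:40 PM answer only holds for one implied distance. A quick check of the arithmetic shows which:

```python
# t = hours after 2:00 PM; the trains meet when 60*t + 80*(t - 1) = D.
# The prompt never gives D (the Boston-New York distance), so solve for
# the D implied by the 4:40 PM answer above.
t = 2 + 40 / 60            # 4:40 PM is 2h40m after 2:00 PM
D = 60 * t + 80 * (t - 1)  # combined miles covered at the meeting time
print(round(D, 1))         # 293.3 - the distance the answer implicitly assumes
```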
Summary Comparison
| Scenario | Best Models | Why |
| --- | --- | --- |
| Image + Text | gpt-4o, gpt-4o-mini | Only ones with vision support |
| Audio + Text | gpt-4o, gpt-4o-realtime-preview, gpt-4o-mini-audio-preview | Audio understanding and response |
| Text-Only Reasoning | All models | All capable, gpt-4o gives most detail |
| Real-Time Chat | gpt-4o-realtime-preview, gpt-4o-mini-audio-preview | Lowest latency, voice enabled |
Coding Task Example: Bug Identification
Prompt (Text)
“Here’s a Python function that should return the factorial of a number, but it’s not working correctly. Please identify and fix the bug.”
```python
def factorial(n):
    if n == 0:
        return 0
    else:
        return n * factorial(n - 1)
```
Objective
Detect the logic bug (it returns 0 for n = 0, but factorial of 0 is 1) and correct it.
Model Responses (Side-by-Side)
| Model | Response |
| --- | --- |
| gpt-4o ✅ | 🧠 “There is a logic error: factorial of 0 should return 1.” (provides the fixed code shown below) |
| gpt-4o-mini ✅ | Same fix, slightly more concise: “Replace return 0 with return 1 when n == 0.” |
| gpt-4o-realtime-preview ✅ | More conversational and fast: “Oh, I see the issue. Factorial of 0 should be 1! Just change that line.” |
| gpt-4o-mini-audio-preview ✅ | Similar fix, shorter response: “Return 1 when n == 0, not 0.” |
| gpt-4o-audio-preview ✅ | Provides explanation plus fixed code verbally if using voice: “Bug found. In recursion, factorial(0) should be 1. Here’s the corrected version…” (then recites the code) |

The fixed code from gpt-4o:

```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)
```
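As a quick sanity check (my own addition, not part of the model responses), the corrected function can be verified with a couple of asserts:

```python
def factorial(n):
    if n == 0:
        return 1  # corrected base case: 0! = 1
    else:
        return n * factorial(n - 1)

assert factorial(0) == 1   # the buggy version returned 0 here
assert factorial(5) == 120
print("factorial fix verified")
```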
Audio Variant (Spoken Prompt Input)
User speaks
“Hey, here’s some code. Can you fix the factorial bug?”
gpt-4o, gpt-4o-realtime-preview, and gpt-4o-mini-audio-preview can all take this prompt as speech and return the fix, either spoken or as text; a sketch of requesting a spoken answer follows.
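This sketch follows the chat completions audio shape documented for gpt-4o-audio-preview (a `modalities` list plus an `audio` parameter); the voice name and Azure deployment support are assumptions:

```python
import base64

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",               # a deployment with audio output enabled
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},  # assumed voice name
    messages=[
        {"role": "user", "content": "Explain the factorial bug fix in one spoken sentence."}
    ],
)
# The spoken answer arrives base64-encoded, with a text transcript alongside it
with open("answer.wav", "wb") as f:
    f.write(base64.b64decode(response.choices[0].message.audio.data))
print(response.choices[0].message.audio.transcript)
```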
Observations
| Model | Strength in Code Tasks | Notes |
| --- | --- | --- |
| gpt-4o | ⭐ Best reasoning + clear fix | Handles logic + recursion details |
| gpt-4o-mini | ✅ Efficient fix | Fewer words, good for fast apps |
| gpt-4o-realtime-preview | ⚡ Interactive fix via voice | Great for quick assistant use |
| gpt-4o-mini-audio-preview | ⚡ Fast voice input & fix | Ideal for mobile/embedded systems |
| gpt-4o-audio-preview | 🎧 Detailed voice support | Longer, audio-focused responses |
Conclusion
Some general comparisons
Multimodal Models in Azure OpenAI (As of 2025)
| Model Name | Modalities Supported | Context Length | Strengths | Use Cases | Notes |
| --- | --- | --- | --- | --- | --- |
| gpt-4o | Text, Vision, Audio | 128K tokens | Fast, smart, real-time, handles all 3 modalities | Chatbots, image understanding, speech, multimodal reasoning | Flagship model, general purpose |
| gpt-4o-mini | Text, Vision | 128K tokens | Smaller, more efficient than full gpt-4o | Lightweight multimodal reasoning | No audio yet |
| gpt-4o-audio-preview | Text, Audio | 128K tokens | Low-latency voice input | Speech-to-text, voice understanding | Preview model for audio interaction |
| gpt-4o-realtime-preview | Text, Audio | 128K tokens | Real-time responsiveness | Live conversation, fast chatbot | Also in preview |
| gpt-4o-mini-audio-preview | Text, Audio | 128K tokens | Lightweight speech model | Compact voice interfaces | Mini version |
| gpt-4o-mini-realtime-preview | Text, Audio | 128K tokens | Optimized for speed | Fast, on-device voice interface | Experimental |
Key Differences
| Feature | gpt-4o | gpt-4o-mini | gpt-4o-audio-preview | gpt-4o-realtime-preview |
| --- | --- | --- | --- | --- |
| Text | ✅ | ✅ | ✅ | ✅ |
| Image (Vision) | ✅ | ✅ | ❌ | ❌ |
| Audio (Voice/Speech) | ✅ | ❌ | ✅ | ✅ |
| Latency | Medium | Low | Very Low | Real-time |
| Model Size | Large | Mini | Medium | Medium |
| Ideal For | Complex tasks | Edge devices | Voice UX | Live conversation |
Summary
- If you need comprehensive multimodal capabilities (text, images, audio): go with gpt-4o.
- If you need fast response and voice input/output: try gpt-4o-realtime-preview.
- If you’re targeting low-resource environments or edge devices: try gpt-4o-mini. (These selection rules are sketched as code below.)
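As a closing illustration (not an official API), the selection rules above could be sketched as a tiny helper; the returned deployment names are examples:

```python
def pick_deployment(needs_vision: bool, needs_audio: bool,
                    realtime: bool, low_resource: bool) -> str:
    """Illustrative picker mirroring the summary rules above (not an official API)."""
    if realtime and needs_audio:
        return "gpt-4o-realtime-preview"   # lowest-latency voice conversation
    if needs_vision and needs_audio:
        return "gpt-4o"                    # the only model covering all three modalities
    if needs_audio:
        return "gpt-4o-mini-audio-preview" if low_resource else "gpt-4o-audio-preview"
    if low_resource:
        return "gpt-4o-mini"               # lightweight text + vision
    return "gpt-4o"                        # general-purpose default

# Example: comprehensive multimodal needs -> gpt-4o
print(pick_deployment(needs_vision=True, needs_audio=True, realtime=False, low_resource=False))
```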
Note: The above results are based only on the author’s personal experience and testing; depending on the case, the results may differ. In addition, AI is constantly developing and new models are continuously released, so these recommendations may not remain accurate in the future.