NashTech Blog

Azure OpenAI Service Models Part 3 – Multimodal Models

Table of Contents

I have introduced Azure OpenAI models and gone into detail about Language Models in previous blogs. You can find it here:

  • Language Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-1-language-models/
  • Reasoning & Pronlem-Solving Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-2-reasoning-problem-solving-models/

Now, let’s look at another type of model: Multimodal Models!

Multimodal models are models that can process multiple input types (like images and audio alongside text).

Some example models:

GPT-4o Series (Omni Models)
  • gpt-4o (text, vision, and audio)
  • gpt-4o-mini
  • gpt-4o-audio-preview
  • gpt-4o-realtime-preview
  • gpt-4o-mini-realtime-preview
  • gpt-4o-mini-audio-preview

Text + Image Input (Vision)

Prompt

“Describe what’s happening in this image and write a short story based on it.”

✅ Supported by: gpt-4o, gpt-4o-mini

Voice Input → Text Output (Speech)

Prompt

(User says: “What’s the weather in Paris today?”)

✅ Supported by: gpt-4o, gpt-4o-audio-preview, gpt-4o-realtime-preview

Multimodal Reasoning

Prompt

Provide a detailed explanation of this chart and convert the trend into a spoken summary.

✅ Supported by: gpt-4o (text + image in, audio out)

Now, let’s go through the scenarios to compare in more detail!

Prompt

“Describe the contents of this image and suggest a creative caption for social media.”

Input: [Imagine a photo showing a cat lounging in a hammock under palm trees.]

ModelResponse Output
gpt-4o ✅ Text + Image🧠 “This image shows a relaxed tabby cat lounging in a hammock tied between two palm trees on a sunny beach. 🌴🐱 The lighting suggests late afternoon. Creative Caption: ‘Living my best 9 lives 😸☀️ #PawsAndRelax’”
gpt-4o-mini ✅ Text + Image💡 “A cat is resting in a hammock on a beach. Caption: ‘Beach bum feline mode activated.’”
gpt-4o-audio-preview⚠️ Not applicable – image input unsupported
gpt-4o-mini-audio-preview⚠️ Not applicable – image input unsupported

Prompt (Audio)

User speaks:

“Hey, can you please summarize today’s team meeting? We discussed the launch timeline, shifted the QA deadline to next Friday, and assigned testing to Priya.”

ModelResponse Output
gpt-4o ✅ Audio🧠“Summary: The team discussed the product launch timeline. The QA deadline has been moved to next Friday. Priya is now responsible for testing.”
gpt-4o-audio-preview ✅ Audio🎧 “Meeting Summary: QA testing deadline changed to next Friday. Priya to lead testing. Launch schedule reviewed.”
gpt-4o-realtime-preview ✅ Audio⚡ “Got it! Priya is testing, and QA deadline is next Friday. Launch timing confirmed.” (streamed response)
gpt-4o-mini-audio-preview ✅ Audio💬 “QA moved to Friday. Priya handles testing.” (shorter, lower latency)
gpt-4o-mini ⚠️ Audio input not supported

Prompt

“A train leaves Boston at 2:00 PM going 60 mph. Another train leaves New York at 3:00 PM going 80 mph toward Boston. When do they meet?”

ModelResponse Output
All Models (Text OK)⏱ All models (even audio/image-specialized) handle this, but gpt-4o and gpt-4o-mini provide clearer step-by-step math breakdowns.
gpt-4o🧮 “Let the distance be D miles. Train A travels for t hours at 60 mph. Train B starts 1 hour later at 80 mph. They meet when 60t + 80(t – 1) = D…” → Final answer: 4:40 PM
gpt-4o-mini🧠 Same answer, slightly less explanation
gpt-4o-realtime-preview✅ Still accurate, responds conversationally

ScenarioBest ModelsWhy
Image + Textgpt-4o, gpt-4o-miniOnly ones with vision support
Audio + Textgpt-4o, gpt-4o-realtime-preview, gpt-4o-mini-audio-previewAudio understanding and response
Text-Only ReasoningAll modelsAll capable, gpt-4o gives most detail
Real-Time Chatgpt-4o-realtime-preview, gpt-4o-mini-audio-previewLowest latency, voice enabled

“Here’s a Python function that should return the factorial of a number, but it’s not working correctly. Please identify and fix the bug.”

def factorial(n):
    if n == 0:
        return 0
    else:
        return n * factorial(n - 1)

Detect the logic bug (it returns 0 for n = 0, but factorial of 0 is 1) and correct it.

ModelResponse
gpt-4o ✅“There is a logic error: factorial of 0 should return 1. Fixed code:

def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)
gpt-4o-mini

✅ | Same fix, slightly more concise:
Replace return 0 with return 1 when n == 0.

gpt-4o-realtime-preview

✅ More conversational and fast:
“Oh, I see the issue. Factorial of 0 should be 1! Just change that line.”

gpt-4o-mini-audio-preview

✅ Similar fix, shorter response:
“Return 1 when n == 0, not 0.”

gpt-4o-audio-preview

✅ Provides explanation plus fixed code verbally if using voice:
“Bug found. In recursion, factorial(0) should be 1. Here’s the corrected version…” (then recites the code)

User speaks

“Hey, here’s some code. Can you fix the factorial bug?”

gpt-4o, gpt-4o-realtime-preview, and gpt-4o-mini-audio-preview

gpt-4o, gpt-4o-realtime-preview, and gpt-4o-mini-audio-preview: respond instantly with a fix.
gpt-4o-audio-preview: gives a slower but detailed spoken explanation.
ModelStrength in Code TasksNotes
gpt-4o⭐ Best reasoning + clear fixHandles logic + recursion details
gpt-4o-mini✅ Efficient fixFewer words, good for fast apps
gpt-4o-realtime-preview⚡ Interactive fix via voiceGreat for quick assistant use
gpt-4o-mini-audio-preview⚡ Voice input & fix fastIdeal for mobile/embedded systems
gpt-4o-audio-preview🎧 Detailed voice supportLonger, audio-focused responses

Model NameModalities SupportedContext LengthStrengthsUse CasesNotes
gpt-4oText, Vision, Audio128K tokensFast, smart, real-time, handles all 3 modalitiesChatbots, image understanding, speech, multimodal reasoningFlagship model, general purpose
gpt-4o-miniText, Vision128K tokensSmaller, more efficient than full gpt-4oLightweight multimodal reasoningNo audio yet
gpt-4o-audio-previewText, Audio128K tokensLow-latency voice inputSpeech-to-text, voice understandingPreview model for audio interaction
gpt-4o-realtime-previewText, Audio128K tokensReal-time responsivenessLive conversation, fast chatbotAlso in preview
gpt-4o-mini-audio-previewText, Audio128K tokensLightweight speech modelCompact voice interfacesMini version
gpt-4o-mini-realtime-previewText, Audio128K tokensOptimized for speedFast, on-device voice interfaceExperimental
Featuregpt-4ogpt-4o-minigpt-4o-audio-previewgpt-4o-realtime-preview
Text
Image (Vision)
Audio (Voice/Speech)
LatencyMediumLowVery LowReal-time
Model SizeLargeMiniMediumMedium
Ideal ForComplex tasksEdge devicesVoice UXLive conversation
  • If you need comprehensive multimodal capabilities (text, images, audio): go with gpt-4o.
  • If you need fast response and voice input/output: try gpt-4o-realtime-preview.
  • If you’re targeting low-resource environments or edge devices: try gpt-4o-mini.

Picture of mydinhletra1

mydinhletra1

Leave a Comment

Your email address will not be published. Required fields are marked *

Suggested Article

Scroll to Top