
I introduced Azure OpenAI models and went into detail about language models in a previous blog post. You can find it here:
- Language Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-1-language-models/
Now, let’s look at another type of model: Reasoning & Problem-Solving Models!
Reasoning & Problem-Solving Models
Focused on tasks like complex logic, math, code reasoning, and structured decision-making. Some example models:
- o4-mini (the latest at the time of writing)
- o3, o3-mini
- o1, o1-mini, o1-preview
Strengths of o-Series
- Trained for structured logic, problem-solving, and math-heavy tasks
- Ideal for use in apps where speed and reasoning matter more than long, creative outputs
- Excellent for tools like retrieval-augmented generation (RAG) and code analysis
Now, let’s go through some examples to see more clearly the strengths of these models as well as the differences between them!
Examples
Note: Below are just the writer’s personal experiments.
Example 1 – Example Prompt to Compare
Prompt
If a train leaves at 8:00 AM and travels at 60 km/h, and another leaves the same station at 9:00 AM at 90 km/h, when will the second train catch up?
Now, let’s take a look at the answers!
o4-mini
At 9:00 AM, the first train has a 60 km head start.
Relative speed = 90 – 60 = 30 km/h.
Time to catch up = 60 / 30 = 2 hours.
Answer: 11:00 AM
o3-mini
60 km head start. Catch-up rate = 30 km/h → Time = 2 hrs → 11:00 AM
o1
2 hours to catch up. So, at 11.
o1-mini
Maybe 10 or 11? (Less reliable)
Compare simulated result
| | o4-mini | o3-mini | o1 | o1-mini |
| --- | --- | --- | --- | --- |
| Efficiency | Perfect reasoning, clean steps | Correct and well-structured | Correct, but terse | Often less confident/clear |
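The head-start arithmetic all four models rely on is easy to verify programmatically. A minimal sketch (function and variable names are my own, not from any model's output):

```python
def catch_up_hours(head_start_h, slow_kmh, fast_kmh):
    # Distance gap when the faster train departs
    gap_km = slow_kmh * head_start_h
    # The gap closes at the difference of the two speeds
    return gap_km / (fast_kmh - slow_kmh)

# 1-hour head start at 60 km/h, chased at 90 km/h
hours = catch_up_hours(1, 60, 90)
print(hours)  # 2.0 -> 9:00 AM + 2 h = 11:00 AM
```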
Example 2 – Coding Task
Prompt
Write a Python function that takes a list of integers and returns a new list containing only the prime numbers, in ascending order.
Simulated Responses by Model
o4-mini
```python
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def filter_primes(numbers):
    return sorted([n for n in numbers if is_prime(n)])
```
o3-mini
```python
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def get_primes(lst):
    primes = []
    for num in lst:
        if is_prime(num):
            primes.append(num)
    return sorted(primes)
```
o1
```python
def prime_filter(nums):
    def is_prime(n):
        if n < 2:
            return False
        for i in range(2, n):
            if n % i == 0:
                return False
        return True
    return sorted([n for n in nums if is_prime(n)])
```
o1-mini
```python
def get_primes(lst):
    def prime(n):
        if n < 2:
            return False
        for i in range(2, n):
            if n % i == 0:
                return False
        return True
    primes = []
    for x in lst:
        if prime(x):
            primes.append(x)
    return sorted(primes)
```
Compare the results
| | o4-mini | o3-mini | o1 | o1-mini |
| --- | --- | --- | --- | --- |
| Efficiency | Correct, optimal, readable, clean. | Correct, slightly more verbose. | Correct, but uses less efficient range(2, n). | Works, but less optimal and lacks explanation. |
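All four responses implement the same idea, so a quick sanity check of any one of them is straightforward. Here the o4-mini version is reproduced and run on a small sample list:

```python
def is_prime(n):
    if n < 2:
        return False
    # Trial division only needs to go up to the square root of n
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def filter_primes(numbers):
    return sorted([n for n in numbers if is_prime(n)])

print(filter_primes([10, 7, 3, 4, 11, 1]))  # [3, 7, 11]
```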
Example 3 – Code-Debugging
Prompt (Buggy Code)
The following function is supposed to return the longest word in a sentence. Fix the bug:
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) < len(longest):
            longest = word
    return longest
```
Problem
The bug is in the comparison: it uses < instead of >.
Test Case
```python
print(longest_word("The elephant is big"))
```
Expected Output
Output: “elephant” (when fixed)
Simulated Model Responses
o4-mini
The logic is reversed. It should update longest when a word is longer, not shorter.
✅ Fixed Code
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Now it correctly finds the longest word in the sentence
o3-mini
The comparison is flipped. Use > instead of <.
✅ Fixes and explains
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Reasoning is solid and efficient
o1
Use > not <. Fixed:
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Correct, but minimal explanation.
o1-mini
Maybe use `>`? (No explanation, often uncertain)
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Works, but doesn’t justify change or confirm confidence
Let’s Compare!
| Model | Bug Detection | Explanation | Code Fix Quality | Confidence |
| --- | --- | --- | --- | --- |
| o4-mini | ✅ Accurate | ✅ Clear | ✅ Clean + readable | ⭐⭐⭐⭐⭐ |
| o3-mini | ✅ Accurate | ✅ Brief | ✅ Effective | ⭐⭐⭐⭐ |
| o1 | ✅ Accurate | ⚠️ Minimal | ✅ Works | ⭐⭐⭐ |
| o1-mini | ⚠️ Tentative | ❌ Poor | ✅ Lucky fix | ⭐⭐ |
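Whatever the explanation style, the one-character fix is identical in all four answers, and it is simple to confirm by running the corrected function:

```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        # '>' instead of the buggy '<': keep the word only if it is longer
        if len(word) > len(longest):
            longest = word
    return longest

print(longest_word("The elephant is big"))  # elephant
```

Note that because the comparison is strict (`>`), the function returns the first of several equally long words, and an empty sentence yields an empty string.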
Summary
Practical Differences
- o4-mini is much better at mixed tasks (e.g., logic + code + explanations), and performs close to GPT-4, but is not as cheap as o1/o3.
- o3-mini is the most reasoning-optimized of the o-Series and does well on math puzzles, logic chains, and agent planning.
- o1/o1-mini are ideal for low-cost or lightweight applications, but they’re not as precise for complex logic.
Some general comparisons
| | o4-mini | o3-mini | o1 | o1-mini |
| --- | --- | --- | --- | --- |
| Best For | General LLM + Reasoning + Vision/Audio | Pure logic/math/RAG/code agents | Lightweight reasoning agents | Budget logic with fast response |
| Reasoning Depth | ⭐⭐⭐⭐⭐ Very strong (GPT-4-level) | ⭐⭐⭐⭐ Strongest of o-Series | ⭐⭐⭐ Moderate | ⭐⭐ Basic |
| Speed | ⭐⭐⭐⭐ Fast (on par with GPT-3.5) | ⭐⭐⭐⭐⭐ Very fast | ⭐⭐⭐⭐⭐ Very fast | ⭐⭐⭐⭐⭐ Extremely fast |
| Cost (in tokens) | 💲💲💲 Higher than the other o-Series models | 💲💲 Low | 💲 Very low | 💲 Extremely low |
| Model Type | Multimodal / General Purpose LLM | Specialized Reasoning Model | Lightweight Reasoning Model | Mini Reasoning Model |
| Modality Support | Text, Vision, (Audio soon) | Text only | Text only | Text only |
Note: The results above are based only on the author’s personal experience and testing; depending on the case, your results may differ. In addition, AI is developing rapidly and new models are released continuously, so these recommendations may not remain accurate in the future.