
I introduced Azure OpenAI models and went into detail about language models in a previous blog post. You can find it here:
- Language Models: https://blog.nashtechglobal.com/azure-openai-service-models-part-1-language-models/
Now, let’s look at another type of model: Reasoning & Problem-Solving Models!
Reasoning & Problem-Solving Models
Focused on tasks like complex logic, math, code reasoning, and structured decision-making. Some example models:
- o4-mini (the latest at the time of writing)
- o3, o3-mini
- o1, o1-mini, o1-preview
Strengths of o-Series
- Trained for structured logic, problem-solving, and math-heavy tasks
- Ideal for use in apps where speed and reasoning matter more than long, creative outputs
- Excellent for tools like retrieval-augmented generation (RAG) and code analysis
Now, let’s go through some examples to see more clearly the strengths of these models as well as the differences between them!
Examples
Note: Below are just the writer’s personal experiments.
Example 1 – Example Prompt to Compare
Prompt
If a train leaves at 8:00 AM and travels at 60 km/h, and another leaves the same station at 9:00 AM at 90 km/h, when will the second train catch up?
Now, let’s take a look at the answers!
o4-mini
At 9:00 AM, the first train has a 60 km head start.
Relative speed = 90 – 60 = 30 km/h.
Time to catch up = 60 / 30 = 2 hours.
Answer: 11:00 AM
o3-mini
60 km head start. Catch-up rate = 30 km/h → Time = 2 hrs → 11:00 AM
o1
2 hours to catch up. So, at 11.
o1-mini
Maybe 10 or 11? (Less reliable)
Compare simulated result
| | o4-mini | o3-mini | o1 | o1-mini |
| --- | --- | --- | --- | --- |
| Efficiency | Perfect reasoning, clean steps | Correct and well-structured | Correct, but terse | Often less confident/clear |
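The head-start arithmetic all four models rely on is easy to verify programmatically. A minimal sketch (function and variable names are my own, not from any model's output):

```python
def catch_up_hours(head_start_h, slow_kmh, fast_kmh):
    # Distance gap when the faster train departs
    gap_km = slow_kmh * head_start_h
    # The gap closes at the difference of the two speeds
    return gap_km / (fast_kmh - slow_kmh)

# 1-hour head start at 60 km/h, chased at 90 km/h
hours = catch_up_hours(1, 60, 90)
print(hours)  # 2.0 -> 9:00 AM + 2 h = 11:00 AM
```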
Example 2 – Coding Task
Prompt
Write a Python function that takes a list of integers and returns a new list containing only the prime numbers, in ascending order.
Simulated Responses by Model
o4-mini
```python
def is_prime(n):
    if n < 2:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def filter_primes(numbers):
    return sorted([n for n in numbers if is_prime(n)])
```
o3-mini
```python
def is_prime(n):
    if n <= 1:
        return False
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def get_primes(lst):
    primes = []
    for num in lst:
        if is_prime(num):
            primes.append(num)
    return sorted(primes)
```
o1
```python
def prime_filter(nums):
    def is_prime(n):
        if n < 2:
            return False
        for i in range(2, n):
            if n % i == 0:
                return False
        return True
    return sorted([n for n in nums if is_prime(n)])
```
o1-mini
```python
def get_primes(lst):
    def prime(n):
        if n < 2:
            return False
        for i in range(2, n):
            if n % i == 0:
                return False
        return True
    primes = []
    for x in lst:
        if prime(x):
            primes.append(x)
    return sorted(primes)
```
Compare the results
| | o4-mini | o3-mini | o1 | o1-mini |
| --- | --- | --- | --- | --- |
| Efficiency | Correct, optimal, readable, clean. | Correct, slightly more verbose. | Correct, but uses less efficient range(2, n). | Works, but less optimal and lacks explanation. |
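All four responses implement the same idea, so a quick sanity check of any one of them is straightforward. Here the o4-mini version is reproduced and run on a small sample list:

```python
def is_prime(n):
    if n < 2:
        return False
    # Trial division only needs to go up to the square root of n
    for i in range(2, int(n**0.5) + 1):
        if n % i == 0:
            return False
    return True

def filter_primes(numbers):
    return sorted([n for n in numbers if is_prime(n)])

print(filter_primes([10, 7, 3, 4, 11, 1]))  # [3, 7, 11]
```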
Example 3 – Code-Debugging
Prompt (Buggy Code)
The following function is supposed to return the longest word in a sentence. Fix the bug:
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) < len(longest):
            longest = word
    return longest
```
Problem
The bug is in the comparison: it uses < instead of >.
Test Case
```python
print(longest_word("The elephant is big"))
```
Expected Output
Output: “elephant” (when fixed)
Simulated Model Responses
o4-mini
The logic is reversed. It should update longest when a word is longer, not shorter.
✅ Fixed Code
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Now it correctly finds the longest word in the sentence
o3-mini
The comparison is flipped. Use > instead of <.
✅ Fixes and explains
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Reasoning is solid and efficient
o1
Use > not <. Fixed:
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Correct, but minimal explanation.
o1-mini
Maybe use `>`? (No explanation, often uncertain)
```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest
```
=> Works, but doesn’t justify change or confirm confidence
Let’s Compare!
| Model | Bug Detection | Explanation | Code Fix Quality | Confidence |
| --- | --- | --- | --- | --- |
| o4-mini | ✅ Accurate | ✅ Clear | ✅ Clean + readable | ⭐⭐⭐⭐⭐ |
| o3-mini | ✅ Accurate | ✅ Brief | ✅ Effective | ⭐⭐⭐⭐ |
| o1 | ✅ Accurate | ⚠️ Minimal | ✅ Works | ⭐⭐⭐ |
| o1-mini | ⚠️ Tentative | ❌ Poor | ✅ Lucky fix | ⭐⭐ |
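Whatever the explanation style, the one-character fix is identical in all four answers, and it is simple to confirm by running the corrected function:

```python
def longest_word(sentence):
    words = sentence.split()
    longest = ""
    for word in words:
        # '>' instead of the buggy '<': keep the word only if it is longer
        if len(word) > len(longest):
            longest = word
    return longest

print(longest_word("The elephant is big"))  # elephant
```

Note that because the comparison is strict (`>`), the function returns the first of several equally long words, and an empty sentence yields an empty string.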
Summary
Practical Differences
- o4-mini is much better at mixed tasks (e.g., logic + code + explanations), and performs close to GPT-4, but is not as cheap as o1/o3.
- o3-mini is the most reasoning-optimized of the o-Series and does well on math puzzles, logic chains, and agent planning.
- o1/o1-mini are ideal for low-cost or lightweight applications, but they’re not as precise for complex logic.
Some general comparisons
| | o4-mini | o3-mini | o1 | o1-mini |
| --- | --- | --- | --- | --- |
| Best For | General LLM + Reasoning + Vision/Audio | Pure logic/math/RAG/code agents | Lightweight reasoning agents | Budget logic with fast response |
| Reasoning Depth | ⭐⭐⭐⭐⭐ Very strong (GPT-4-level) | ⭐⭐⭐⭐ Strongest of o-Series | ⭐⭐⭐ Moderate | ⭐⭐ Basic |
| Speed | ⭐⭐⭐⭐ Fast (on par with GPT-3.5) | ⭐⭐⭐⭐⭐ Very fast | ⭐⭐⭐⭐⭐ Very fast | ⭐⭐⭐⭐⭐ Extremely fast |
| Cost (in tokens) | 💲💲💲 Higher than the other o-Series models | 💲💲 Low | 💲 Very low | 💲 Extremely low |
| Model Type | Multimodal / General Purpose LLM | Specialized Reasoning Model | Lightweight Reasoning Model | Mini Reasoning Model |
| Modality Support | Text, Vision, (Audio soon) | Text only | Text only | Text only |
Note: The results above are based only on the author’s personal experience and testing; depending on the case, your results may differ. In addition, AI is developing rapidly and new models are released continuously, so these recommendations may not remain accurate in the future.