Observability: Throughput vs Latency — What Really Matters?

Duong Dao Viet

One day, your dashboard shows this:

CPU: 40%
Memory: Stable
Error rate: 0%
Requests per second: High and healthy

But Slack is exploding.

“The system is slow.”

How can that be?

The system is processing thousands of requests per second.
Throughput looks great.

But users are waiting.

That’s when you realize:

High throughput does not mean low latency.

And if you don’t understand the difference, your architecture decisions will hurt you later.

First, What Do These Words Actually Mean?

Throughput

Throughput is:

How much work your system can handle per unit of time.

Examples:

5,000 requests per second
200 MB/s data processing
10,000 messages per minute

Think of throughput as:

Capacity.

Latency

Latency is:

How long one request takes from start to finish.

Examples:

20 ms API response time
200 ms database query
2 seconds page load

Think of latency as:

Waiting time.

Simple analogy: Highway vs Travel Time

Imagine a highway.

Throughput = how many cars pass per hour.
Latency = how long one car takes to reach its destination.

You can have:

A highway moving 10,000 cars/hour
But each car takes 2 hours to reach its destination

That’s high throughput, high latency.

Or:

A small road moving 500 cars/hour
But each car arrives in 5 minutes

Low throughput, low latency.

They measure different things.
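The two numbers are linked by how many cars are on the road at the same time. A minimal sketch, assuming the road is in steady state so that Little's Law applies (items in the system = throughput × latency). The function name `in_flight` is ours, chosen for illustration:

```python
def in_flight(throughput_per_hour: float, latency_hours: float) -> float:
    """Little's Law: average items in the system = arrival rate x time in system."""
    return throughput_per_hour * latency_hours

# Numbers taken from the highway analogy above.
highway = in_flight(10_000, 2.0)      # 10,000 cars/hour, 2 hours per trip
small_road = in_flight(500, 5 / 60)   # 500 cars/hour, 5 minutes per trip

print(f"highway: {highway:.0f} cars on the road at once")
print(f"small road: {small_road:.1f} cars on the road at once")
```

The same arithmetic applies to a service: requests per second multiplied by seconds per request gives the average number of requests in flight.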

Why Observability Must Track Both

If you only monitor throughput:

You might think your system is healthy.
But users may still feel it’s slow.

If you only monitor latency:

You might optimize response time.
But your system may collapse under load.

True observability requires both.

The Throughput–Latency Relationship

There is a hidden relationship between them.

When load increases:

Throughput increases (to a point)
Latency eventually increases dramatically

This is queueing theory in action.

What happens under load?

Imagine your service can handle:

1,000 requests per second comfortably.

At 800 rps:

Latency = 20 ms

At 950 rps:

Latency = 40 ms

At 1,000 rps:

Latency = 100 ms

At 1,100 rps (more than the service can handle):

Latency = 2 seconds
Timeouts begin

Latency rises slowly… then explodes.

This is why systems feel “fine” until suddenly they don’t.
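The shape of that curve can be sketched with the simplest queueing model. This assumes an M/M/1 queue (one server, random arrivals, random service times), which is a textbook simplification and not a model of any particular service, so the numbers it prints will not match the illustrative ones above. The function name and the 1,000 rps capacity are assumptions for the example:

```python
import math

def mm1_latency_ms(arrival_rps: float, capacity_rps: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda)."""
    if arrival_rps >= capacity_rps:
        return math.inf  # arrivals outpace service, so the queue grows without bound
    return 1000.0 / (capacity_rps - arrival_rps)

# Latency stays flat for a long time, then climbs steeply near capacity.
for rps in (500, 800, 950, 990, 1000):
    print(f"{rps:>5} rps -> {mm1_latency_ms(rps, 1000):.1f} ms")
```

Going from 50% to 80% utilization adds a few milliseconds. Going from 95% to 99% multiplies latency by five.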

Why This Matters in Microservices

In microservice architecture:

API → Service A → Service B → Database

If each of the three hops behind the API (Service A, Service B, Database) adds:

50 ms latency

Total user latency:

50 + 50 + 50 = 150 ms

Now add:

Network overhead
TLS handshake
Retries

Suddenly:

250–400 ms

Even if throughput is high.

Latency compounds across services.
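The arithmetic above can be written down as a small calculator. This is a sketch for sequential calls only. The overhead figures (10 ms of network time per hop, a 40 ms TLS handshake, one retried hop) are made-up values for illustration, not measurements:

```python
def end_to_end_ms(hop_latencies_ms, network_ms_per_hop=0.0,
                  tls_handshake_ms=0.0, retried_hops=()):
    """Total latency of a chain of sequential calls, in milliseconds."""
    total = sum(hop_latencies_ms)
    total += network_ms_per_hop * len(hop_latencies_ms)
    total += tls_handshake_ms
    # A retried hop pays its own latency and network overhead a second time.
    total += sum(hop_latencies_ms[i] + network_ms_per_hop for i in retried_hops)
    return total

hops = [50, 50, 50]  # Service A, Service B, Database
base = end_to_end_ms(hops)
with_overhead = end_to_end_ms(hops, network_ms_per_hop=10,
                              tls_handshake_ms=40, retried_hops=[1])

print(f"services only: {base} ms")
print(f"with overhead and one retry: {with_overhead} ms")
```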

When Do You Optimize for Low Latency?

You choose low latency when:

1. User experience is critical

Examples:

Search results
Autocomplete
Trading platforms
Gaming
Real-time dashboards

In these systems:

100 ms feels instant
300 ms feels slow
1 second feels broken

Users care about responsiveness.

2. Services are chained

If your architecture has many hops:

Client → API → Auth → Profile → Order → Payment

Small latencies add up.

Keeping latency low in each service prevents the total from exploding at the edge.

3. Synchronous systems

If the caller is waiting:

Mobile app
Browser
Frontend

You care about latency.

When Do You Optimize for High Throughput?

Throughput matters more when:

1. Batch processing systems

Examples:

Data pipelines
Log processing
ETL jobs
Machine learning training

Users don’t wait per request.

They care about:

Total work completed over time.

2. Streaming systems

Examples:

Kafka consumers
Event processors
Metrics ingestion

Goal:

Handle massive volume
Without crashing

Latency per message may not matter as much.

3. Background jobs

Examples:

Email sending
Image resizing
Report generation

As long as it finishes eventually, latency is less critical.

Why You Can’t Maximize Both Easily

There is often a trade-off.

High Throughput Systems

They usually:

Use batching
Queue requests
Maximize CPU utilization
Run near capacity

But:

Queuing increases latency

Low Latency Systems

They often:

Keep spare capacity
Avoid deep queues
Prefer failing fast over retrying
Use in-memory caching

But:

May waste resources
Lower maximum throughput
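Batching shows the trade-off in miniature. A rough sketch, assuming each call to a downstream system costs a fixed 10 ms of overhead plus 1 ms per item. Both numbers and the function name are hypothetical:

```python
def batch_stats(batch_size, per_call_overhead_ms=10.0, per_item_ms=1.0):
    """Return (items per second, milliseconds to process one batch)."""
    batch_time_ms = per_call_overhead_ms + per_item_ms * batch_size
    throughput_per_s = batch_size / batch_time_ms * 1000
    return throughput_per_s, batch_time_ms

for size in (1, 10, 100):
    throughput, latency = batch_stats(size)
    print(f"batch={size:>3}: {throughput:7.1f} items/s, {latency:5.1f} ms per batch")
```

Larger batches amortize the fixed overhead, so throughput climbs. But every item in the batch now waits for the whole batch to finish, and this sketch does not even count the time spent waiting for the batch to fill.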

Observability: What Should You Measure?

At minimum:

For Latency

p50 (median)
p95
p99

Average latency is misleading.

If:

p50 = 20 ms
p99 = 2 seconds

Users will complain.
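To see why the average hides the problem, here is a small sketch using a made-up sample of 100 requests: 98 fast ones and 2 slow ones. It uses the nearest-rank definition of a percentile. Monitoring tools may interpolate or use histogram buckets, so their numbers can differ slightly:

```python
import math
import statistics

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical sample: 98 requests at 20 ms, 2 requests at 2 seconds.
latencies_ms = [20] * 98 + [2000] * 2

mean = statistics.mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={mean} ms  p50={p50} ms  p95={p95} ms  p99={p99} ms")
```

The mean of about 60 ms describes no real request. Only p99 reveals that 1 in 50 users waited two seconds.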

For Throughput

Requests per second
Messages per second
Transactions per second

Track per service, not just globally.

For System Health

Queue length
CPU utilization
Thread pool usage
Connection pool saturation
Retry rate

Often latency spikes before CPU hits 100%.

A Dangerous Pattern in Microservices

Many teams optimize for:

“Handle more traffic.”

They scale horizontally:

More pods
More containers

Throughput increases.

But they ignore:

Tail latency (p99)
Retry amplification
Network overhead

Then during peak load:

Latency spikes
Retries increase traffic
Congestion worsens
System collapses

Throughput obsession can kill latency.
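Retry amplification can be estimated with simple arithmetic. A sketch, assuming every layer in a call chain independently retries a failing downstream call the same number of times, so attempts multiply at each layer. Real clients with backoff, jitter, or retry budgets will behave better than this worst case:

```python
def worst_case_attempts(retries_per_layer: int, layers: int) -> int:
    """Calls reaching the deepest service for one user request when every attempt fails."""
    attempts_per_layer = 1 + retries_per_layer  # the original call plus its retries
    return attempts_per_layer ** layers

for layers in (1, 2, 3):
    print(f"{layers} layer(s), 3 retries each -> "
          f"{worst_case_attempts(3, layers)} calls at the bottom")
```

With three retries at each of three layers, one user request can turn into 64 calls against a service that is already struggling.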

A Simple Mental Model

Ask this question:

Is my system user-facing or work-facing?

If it is:

User-facing → prioritize latency
Work-facing → prioritize throughput

Better:

Design separate paths

Example:

API layer optimized for low latency
Background processing optimized for high throughput

Architecture Patterns That Help

1. Caching

Reduces latency dramatically.

Without cache:
App → DB → 50 ms

With cache:
App → Memory → 2 ms
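A minimal cache-aside sketch of that idea. The `CacheAside` class and `slow_db_lookup` function are hypothetical stand-ins, and the sketch deliberately leaves out expiry, eviction, and invalidation, which a real cache needs:

```python
import time

class CacheAside:
    """Look in memory first; fall back to the slow loader only on a miss."""

    def __init__(self, loader):
        self._loader = loader   # the slow path, e.g. a database query
        self._store = {}
        self.misses = 0

    def get(self, key):
        if key in self._store:
            return self._store[key]      # fast path: served from memory
        self.misses += 1
        value = self._loader(key)        # slow path: pay the full lookup cost
        self._store[key] = value
        return value

def slow_db_lookup(user_id):
    time.sleep(0.05)  # stand-in for a 50 ms database query
    return {"id": user_id}

cache = CacheAside(slow_db_lookup)
first = cache.get(42)    # miss: goes to the "database"
second = cache.get(42)   # hit: served from memory
print(f"misses: {cache.misses}")
```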

2. Asynchronous processing

Move non-critical work out of the request path.

Instead of:

User request → process everything

Do:

User request → enqueue job → respond quickly
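A sketch of that flow using only the Python standard library. An in-process queue and a single worker thread stand in for a real job queue or message broker. `handle_request` and the email job are illustrative names, not part of any framework:

```python
import queue
import threading

jobs = queue.Queue()
processed = []

def worker():
    """Background loop: take jobs off the queue and do the slow work."""
    while True:
        job = jobs.get()
        if job is None:          # sentinel value: shut the worker down
            jobs.task_done()
            break
        processed.append(f"sent email to {job}")
        jobs.task_done()

def handle_request(email):
    """Request path: enqueue the slow work and respond immediately."""
    jobs.put(email)
    return {"status": "accepted"}

thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("user@example.com")
jobs.join()        # wait for the worker, only so this demo can inspect the result
jobs.put(None)
thread.join()
print(response, processed)
```

The caller gets its response as soon as the job is enqueued. How long the email actually takes no longer affects user-facing latency.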

3. Backpressure

Prevent overload.

If the system is full:

Reject early
Don’t queue infinitely

Better to fail fast than die slowly.
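One way to reject early is a bounded queue that refuses new work when it is full. A minimal sketch built on the standard library `queue.Queue`. The `BoundedInbox` class and the limit of 2 pending jobs are assumptions for the example:

```python
import queue

class BoundedInbox:
    """Accept work only while there is room; otherwise reject immediately."""

    def __init__(self, max_pending):
        self._queue = queue.Queue(maxsize=max_pending)

    def submit(self, job) -> bool:
        try:
            self._queue.put_nowait(job)   # never blocks the caller
            return True
        except queue.Full:
            return False                  # caller can map this to an error such as HTTP 429

inbox = BoundedInbox(max_pending=2)
results = [inbox.submit(n) for n in range(4)]
print(results)
```

The first two jobs are accepted and the next two are rejected at once, so callers learn about overload immediately and the waiting time of accepted jobs stays bounded.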

The Real Goal: Predictable Performance

Users don’t need:

The absolute fastest system.

They need:

Consistent, predictable performance.

A system with:

100 ms stable latency

Feels better than:

20 ms most of the time
2 seconds sometimes

That’s why:

Tail latency (p99) matters more than average.

Final Takeaway

Throughput answers:

How much can we handle?

Latency answers:

How long must someone wait?

In distributed systems:

Network calls are expensive
Queues amplify delays
Retries increase load
Latency compounds

Good architecture is not about maximizing one metric.

It’s about choosing the right priority for your system.
