NashTech Blog

Observability: Throughput vs Latency — What Really Matters?

One day, your dashboard shows this:

  • CPU: 40%
  • Memory: Stable
  • Error rate: 0%
  • Requests per second: High and healthy

But Slack is exploding.

“The system is slow.”

How can that be?

The system is processing thousands of requests per second.
Throughput looks great.

But users are waiting.

That’s when you realize:

High throughput does not mean low latency.

And if you don’t understand the difference, your architecture decisions will hurt you later.


First, What Do These Words Actually Mean?

Throughput

Throughput is:

How much work your system can handle per unit of time.

Examples:

  • 5,000 requests per second
  • 200 MB/s data processing
  • 10,000 messages per minute

Think of throughput as:

Capacity.


Latency

Latency is:

How long one request takes from start to finish.

Examples:

  • 20 ms API response time
  • 200 ms database query
  • 2 seconds page load

Think of latency as:

Waiting time.


Simple analogy: Highway vs Travel Time

Imagine a highway.

  • Throughput = how many cars pass per hour.
  • Latency = how long one car takes to reach its destination.

You can have:

  • A highway moving 10,000 cars/hour
  • But each car takes 2 hours to reach its destination

That’s high throughput, high latency.

Or:

  • A small road moving 500 cars/hour
  • But each car arrives in 5 minutes

Low throughput, low latency.

They measure different things.
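
The two numbers are linked by how many cars are on the road at once. A minimal sketch, assuming steady traffic so that Little's law applies (average items in the system = throughput × latency). The function name `items_in_system` is made up for this illustration:

```python
def items_in_system(throughput_per_hour: float, latency_hours: float) -> float:
    """Little's law: average number in the system = arrival rate x time in system."""
    return throughput_per_hour * latency_hours


# Highway: 10,000 cars/hour, each car spends 2 hours on the road.
highway = items_in_system(10_000, 2.0)

# Small road: 500 cars/hour, each car spends 5 minutes (5/60 hour) on the road.
small_road = items_in_system(500, 5 / 60)

print(f"Highway: about {highway:.0f} cars on the road at once")
print(f"Small road: about {small_road:.1f} cars on the road at once")
```

The same arithmetic applies to a service: requests per second multiplied by average latency gives the number of requests in flight.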


Why Observability Must Track Both

If you only monitor throughput:

  • You might think your system is healthy.
  • But users may still feel it’s slow.

If you only monitor latency:

  • You might optimize response time.
  • But your system may collapse under load.

True observability requires both.


The Throughput–Latency Relationship

There is a hidden relationship between them.

When load increases:

  • Throughput increases (to a point)
  • Latency eventually increases dramatically

This is queueing theory in action.


What happens under load?

Imagine your service is sized for a peak of:

  • 1,000 requests per second.

At 800 rps:

  • Latency = 20 ms

At 950 rps:

  • Latency = 40 ms

At 1,000 rps:

  • Latency = 100 ms

At 1,100 rps:

  • Latency = 2 seconds
  • Timeouts begin

Latency rises slowly… then explodes.

This is why systems feel “fine” until suddenly they don’t.
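
The shape of that curve can be sketched with the simplest textbook queue. This assumes a single-server M/M/1 model, where mean time in the system is 1 / (capacity − arrival rate). Real services are not M/M/1 queues, so the numbers below will not match the illustrative figures above. Only the "slow, then explosive" shape carries over. The function name and the 1,000 rps capacity are assumptions for this example:

```python
def mm1_latency_ms(arrival_rps: float, capacity_rps: float) -> float:
    """Mean time in system for an M/M/1 queue, in milliseconds.

    W = 1 / (mu - lambda). At or above capacity the queue grows without
    bound, so we return infinity.
    """
    if arrival_rps >= capacity_rps:
        return float("inf")
    return 1000.0 / (capacity_rps - arrival_rps)


CAPACITY_RPS = 1000.0  # assumed service capacity

for load in (500, 800, 950, 990, 999, 1100):
    latency = mm1_latency_ms(load, CAPACITY_RPS)
    print(f"{load:>5} rps -> {latency:.1f} ms")
```

Going from 50% to 80% utilization barely moves latency. Going from 95% to 99.9% multiplies it many times over.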


Why This Matters in Microservices

In microservice architecture:

API → Service A → Service B → Database

If each of the three downstream hops (Service A, Service B, Database) takes:

  • 50 ms latency

Total user latency:

50 + 50 + 50 = 150 ms

Now add:

  • Network overhead
  • TLS handshake
  • Retries

Suddenly:

250–400 ms

Even if throughput is high.

Latency compounds across services.
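
A rough sketch of the compounding, assuming the hops are called one after another and are slow independently of each other (both are simplifications). The function names are hypothetical:

```python
def sequential_latency_ms(hop_latencies_ms: list[float]) -> float:
    """Total latency when each hop waits for the next one to finish."""
    return sum(hop_latencies_ms)


def chance_of_slow_hop(hops: int, slow_fraction: float = 0.01) -> float:
    """Probability that at least one hop lands in its slowest `slow_fraction`.

    Assumes hops are independent, which real systems only approximate.
    """
    return 1.0 - (1.0 - slow_fraction) ** hops


hops_ms = [50.0, 50.0, 50.0]  # Service A, Service B, Database
total = sequential_latency_ms(hops_ms)
p_slow = chance_of_slow_hop(len(hops_ms))

print(f"Typical total: {total:.0f} ms")
print(f"Chance that a request hits at least one p99-slow hop: {p_slow:.1%}")
```

With three hops, about 3% of user requests touch at least one hop's slowest 1%. The more hops, the more the tail dominates.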


When Do You Optimize for Low Latency?

You choose low latency when:

1. User experience is critical

Examples:

  • Search results
  • Autocomplete
  • Trading platforms
  • Gaming
  • Real-time dashboards

In these systems:

  • 100 ms feels instant
  • 300 ms feels slow
  • 1 second feels broken

Users care about responsiveness.


2. Services are chained

If your architecture has many hops:

Client → API → Auth → Profile → Order → Payment

Small latencies add up.

Keeping each service fast prevents the total from ballooning at the edge.


3. Synchronous systems

If the caller is waiting:

  • Mobile app
  • Browser
  • Frontend

You care about latency.


When Do You Optimize for High Throughput?

Throughput matters more when:

1. Batch processing systems

Examples:

  • Data pipelines
  • Log processing
  • ETL jobs
  • Machine learning training

Users don’t wait per request.

They care about:

Total work completed over time.


2. Streaming systems

Examples:

  • Kafka consumers
  • Event processors
  • Metrics ingestion

Goal:

  • Handle massive volume
  • Without crashing

Latency per message may not matter as much.


3. Background jobs

Examples:

  • Email sending
  • Image resizing
  • Report generation

As long as the job finishes within an acceptable window, latency is less critical.


Why You Can’t Maximize Both Easily

There is often a trade-off.

High Throughput Systems

They usually:

  • Use batching
  • Queue requests
  • Maximize CPU utilization
  • Run near capacity

But:

  • Queuing increases latency
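
Batching shows the trade-off in a few lines. This is a toy cost model, assuming each batch pays a fixed overhead plus a per-item cost. The 10 ms and 1 ms figures are invented for illustration, and the time an item waits for its batch to fill is ignored:

```python
def batch_stats(batch_size: int,
                overhead_ms: float = 10.0,
                per_item_ms: float = 1.0) -> tuple[float, float]:
    """Return (items per second, time to finish one batch in ms)."""
    batch_time_ms = overhead_ms + batch_size * per_item_ms
    throughput_per_s = batch_size / batch_time_ms * 1000.0
    return throughput_per_s, batch_time_ms


for size in (1, 10, 100):
    throughput, latency = batch_stats(size)
    print(f"batch={size:>3}: {throughput:6.0f} items/s, {latency:5.0f} ms per batch")
```

Bigger batches amortize the overhead, so throughput climbs, but every item in the batch now waits for the whole batch to finish.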

Low Latency Systems

They often:

  • Keep spare capacity
  • Avoid deep queues
  • Prefer failing fast over retrying
  • Use in-memory caching

But:

  • May waste resources
  • Lower maximum throughput

Observability: What Should You Measure?

At minimum:

For Latency

  • p50 (median)
  • p95
  • p99

Average latency is misleading, because it hides the slow requests inside a single number.

If:

  • p50 = 20 ms
  • p99 = 2 seconds

Roughly 1 in 100 requests takes 2 seconds or longer, and users will complain.
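
A small sketch of why the average hides this, using a made-up sample of 100 requests and a nearest-rank percentile. The `percentile` helper is written for this example, and monitoring tools may interpolate differently:

```python
import math
from statistics import mean


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical sample: 98 fast requests and 2 very slow ones.
latencies_ms = [20.0] * 98 + [2000.0] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

print(f"mean={avg:.1f} ms  p50={p50:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

The mean lands near 60 ms, which describes no request anyone actually experienced. Only p99 reveals the 2-second requests.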


For Throughput

  • Requests per second
  • Messages per second
  • Transactions per second

Track per service, not just globally.


For System Health

  • Queue length
  • CPU utilization
  • Thread pool usage
  • Connection pool saturation
  • Retry rate

Often latency spikes before CPU hits 100%.


A Dangerous Pattern in Microservices

Many teams optimize for:

“Handle more traffic.”

They scale horizontally:

  • More pods
  • More containers

Throughput increases.

But they ignore:

  • Tail latency (p99)
  • Retry amplification
  • Network overhead

Then during peak load:

  • Latency spikes
  • Retries increase traffic
  • Congestion worsens
  • System collapses

Throughput obsession can kill latency.
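
Retry amplification can be estimated with simple arithmetic. A sketch, assuming every failed attempt is retried immediately, failures are independent, and there is a fixed retry limit (the function names and the 1,000 rps figure are illustrative):

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Average attempts per request: 1 + f + f^2 + ... + f^max_retries."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


def effective_load_rps(incoming_rps: float, failure_rate: float, max_retries: int) -> float:
    """Traffic the backend actually sees once retries are included."""
    return incoming_rps * expected_attempts(failure_rate, max_retries)


healthy = effective_load_rps(1000, failure_rate=0.0, max_retries=3)
degraded = effective_load_rps(1000, failure_rate=0.5, max_retries=3)

print(f"Healthy:  {healthy:.0f} rps reaching the backend")
print(f"Degraded: {degraded:.0f} rps reaching the backend")
```

When half the calls time out, the same user traffic nearly doubles the load on a backend that is already struggling.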


A Simple Mental Model

Ask this question:

Is my system user-facing or work-facing?

If it is:

  • User-facing → prioritize latency
  • Work-facing → prioritize throughput

Better:

  • Design separate paths

Example:

  • API layer optimized for low latency
  • Background processing optimized for high throughput

Architecture Patterns That Help

1. Caching

Reduces latency dramatically.

Without cache:
App → DB → 50 ms
With cache:
App → Memory → 2 ms
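
A minimal in-process sketch using Python's standard `functools.lru_cache`. The `query_db` function and its 50 ms sleep are stand-ins for a real database call, not a real client API:

```python
import time
from functools import lru_cache


def query_db(user_id: int) -> str:
    """Stand-in for a real database query (assumed to take about 50 ms)."""
    time.sleep(0.05)
    return f"user-{user_id}"


@lru_cache(maxsize=1024)
def get_user(user_id: int) -> str:
    """Serve repeated lookups from memory; only misses reach the database."""
    return query_db(user_id)


start = time.perf_counter()
first = get_user(42)   # cache miss: pays the database cost
miss_ms = (time.perf_counter() - start) * 1000

start = time.perf_counter()
second = get_user(42)  # cache hit: served from memory
hit_ms = (time.perf_counter() - start) * 1000

print(f"miss: {miss_ms:.1f} ms, hit: {hit_ms:.3f} ms")
print(get_user.cache_info())
```

This ignores invalidation and expiry, which is where most of the real design work in caching goes.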

2. Asynchronous processing

Move non-critical work out of the request path.

Instead of:

User request → process everything

Do:

User request → enqueue job → respond quickly
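
A minimal sketch of that flow with the standard library's `queue` and `threading`. In production the queue would usually be an external broker. Here `handle_request`, `worker`, and the uppercase "processing" are placeholders:

```python
import queue
import threading

jobs = queue.Queue()
processed = []


def worker() -> None:
    """Drain jobs in the background until a None sentinel arrives."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        processed.append(job.upper())  # stand-in for slow, non-critical work
        jobs.task_done()


def handle_request(payload: str) -> str:
    """Request path: enqueue the work and answer immediately."""
    jobs.put(payload)
    return "202 Accepted"


thread = threading.Thread(target=worker, daemon=True)
thread.start()

response = handle_request("send welcome email")
print(response)  # the caller gets its answer before the job has necessarily run

jobs.join()      # for the demo only: wait until the job is done
jobs.put(None)   # tell the worker to stop
thread.join()
print(processed)
```

The user-facing path stays short, while the worker can be tuned for throughput separately.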

3. Backpressure

Prevent overload.

If the system is full:

  • Reject early
  • Don’t queue infinitely

Better to fail fast than die slowly.
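
One simple way to express this is a bounded queue that refuses new work when full. A sketch, assuming a limit of 3 pending requests chosen only for the demo. `try_accept` is a hypothetical helper:

```python
import queue

MAX_PENDING = 3  # assumed limit; size it from measured capacity in practice
pending = queue.Queue(maxsize=MAX_PENDING)


def try_accept(request_id: int) -> bool:
    """Accept the request if there is room, otherwise reject immediately."""
    try:
        pending.put_nowait(request_id)
        return True
    except queue.Full:
        return False  # the caller can map this to HTTP 429 or 503


results = [try_accept(i) for i in range(5)]
print(results)
```

The first three requests are queued and the last two are rejected right away, so waiting time stays bounded instead of growing with the backlog.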


The Real Goal: Predictable Performance

Users don’t need:

  • The absolute fastest system.

They need:

Consistent, predictable performance.

A system with:

  • 100 ms stable latency

Feels better than:

  • 20 ms most of the time
  • 2 seconds sometimes

That’s why:

  • Tail latency (p99) matters more than average.

Final Takeaway

Throughput answers:

How much can we handle?

Latency answers:

How long must someone wait?

In distributed systems:

  • Network calls are expensive
  • Queues amplify delays
  • Retries increase load
  • Latency compounds

Good architecture is not about maximizing one metric.

It’s about choosing the right priority for your system.

Duong Dao Viet
