NashTech Blog

SNAT Port Exhaustion in AKS: Lessons from an Outbound Connectivity Incident


Running backend services inside an Azure Kubernetes Service (AKS) cluster gives us a lot of scalability and flexibility. But with great scale comes great responsibility — especially when those services call external APIs.

Then we hit an unexpected roadblock: a sudden spike of failed requests to third-party APIs. What we discovered turned out to be a classic cloud networking challenge: SNAT port exhaustion. In this post, I’ll walk through what happened, what SNAT actually is, why exhaustion happens, and how we solved it.


The Incident: When Outbound Calls Started Failing

Our system runs multiple backend services in AKS. These services not only talk to each other but also to several third-party providers via REST APIs (for payments, loyalty, and other integrations).

One day, we received alerts and reports that calls to one of the third-party APIs were failing intermittently. The logs showed errors such as:

  • Connection reset by peer
  • Timeout while establishing connection
  • 503 Service Unavailable

At first, we suspected the third-party was down. But their status page looked fine. Digging deeper into AKS networking metrics, we found the real cause: our cluster had run out of SNAT ports.


Understanding SNAT and SNAT Ports

Before jumping to the fix, let’s clarify what SNAT means.

  • SNAT (Source Network Address Translation): When a pod inside AKS makes an outbound connection (say, to an external API), the source IP of the pod is translated to the node’s IP or another public IP. This is required so the request can traverse the internet and the external service knows where to respond.
  • SNAT Port: Each outbound TCP/UDP connection needs a unique combination of:
    • Source IP
    • Source port
    • Destination IP
    • Destination port
    Since the source IP is translated, the system allocates a SNAT port for each unique outbound connection.

The challenge: SNAT ports are a finite resource. By default, Azure provides 64K ports per public IP, but only a portion is actually usable for SNAT in AKS. Once you hit the limit, new outbound connections cannot be established, and requests start failing.
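To make the five pieces above concrete, here is a toy Python model of SNAT allocation. It is an illustration only (the function and its behavior are simplifications, not Azure's actual allocator, which also accounts for protocol, timeouts, and port reuse policies), but it shows why a SNAT port can be reused toward different destinations while concurrent flows to the same destination each need their own port:

```python
# Toy model of SNAT port allocation (illustration only; not Azure's
# real allocator, which also considers protocol, timeouts, and reuse).

def allocate_snat_ports(flows, available_ports):
    """Map each outbound flow to a SNAT port; return the flows that fail.

    Each flow is (dest_ip, dest_port). A SNAT port can be reused for
    different destinations, but two concurrent flows to the SAME
    destination must use distinct source ports.
    """
    in_use = {}   # (dest_ip, dest_port) -> set of SNAT ports in use
    failed = []
    for dest in flows:
        used = in_use.setdefault(dest, set())
        if len(used) < available_ports:
            used.add(len(used))   # take the next free port for this dest
        else:
            failed.append(dest)   # exhausted: connection cannot be made
    return failed

# 1,100 concurrent flows to one API endpoint, but only 1,024 SNAT ports:
flows = [("203.0.113.10", 443)] * 1100
print(len(allocate_snat_ports(flows, 1024)))  # 76 flows get no port
```

This is exactly the failure mode we saw: once concurrent flows to a hot destination outnumber the ports, the excess connections fail.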


Default SNAT Port Allocation

When using default (automatic) allocation, SNAT ports are distributed among backend VMs based on the pool size. The following table shows the SNAT port preallocations for a single frontend IP:

Backend Pool Size (VMs)    SNAT Ports per VM
1–50                       1,024
51–100                     512
101–200                    256
201–400                    128
401–800                    64
801–1,000                  32

For example, with 100 VMs in a backend pool and only one frontend IP, each VM is assigned 512 ports. 

  • Add another frontend IP → Each VM now gets an additional 512 ports, totaling 1,024 per VM.
  • Adding a third frontend IP does not increase the ports per VM beyond 1,024 (maximum limit per VM).
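The preallocation rules above can be sketched as a small Python helper. This is purely illustrative (the PREALLOCATION tiers and the 1,024-per-VM cap mirror the values stated above; it is not an Azure API):

```python
# Default SNAT port preallocation per VM, per frontend IP, following the
# table above (simplified; check current Azure Load Balancer docs).

PREALLOCATION = [            # (max pool size, ports per VM per frontend IP)
    (50, 1024), (100, 512), (200, 256),
    (400, 128), (800, 64), (1000, 32),
]

def ports_per_vm(pool_size, frontend_ips=1, cap=1024):
    for max_size, ports in PREALLOCATION:
        if pool_size <= max_size:
            # Each extra frontend IP adds another allotment, up to the cap.
            return min(ports * frontend_ips, cap)
    raise ValueError("pool size exceeds 1,000 VMs")

print(ports_per_vm(100))                   # 512
print(ports_per_vm(100, frontend_ips=2))   # 1024
print(ports_per_vm(100, frontend_ips=3))   # still 1024 (per-VM cap)
```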

Why We Hit Exhaustion

In our case, the root causes were:

  1. High concurrency: Our services opened thousands of concurrent outbound requests to the same third-party endpoints.
  2. Short connection lifetimes: The apps didn’t use proper connection pooling, so many short-lived connections were created instead of reusing existing ones.
  3. Single outbound IP: Our nodes were sharing one public IP for egress, reducing the total number of available SNAT ports.

This combination meant we quickly burned through the available port range.


Detecting SNAT Port Exhaustion

Use Azure Load Balancer metrics to monitor SNAT port usage and availability. For Standard Load Balancers, navigate to the resource in the Azure portal and select Monitoring > Metrics. 

  1. Set Time Aggregation to 1 minute.
  2. Select the metrics:
    • Used SNAT Ports
    • Allocated SNAT Ports
  3. Choose an aggregation:
    • Average for per-VM insights.
    • Sum for total load-balancer-level insights.
  4. Apply filters as needed:
    • Protocol Type
    • Backend IPs
    • Frontend IPs
  5. Use Splitting to monitor per instance (only one metric at a time).
  6. Example: to monitor TCP SNAT usage per backend VM:
    • Aggregation: Average
    • Splitting: Backend IPs
    • Filter: Protocol = TCP

Relevant Metrics:

Metric                   Resource Type           Description                                  Recommended Aggregation
SNAT Connection Count    Public Load Balancer    Number of outbound flows using SNAT          Sum
Allocated SNAT Ports     Public Load Balancer    Number of SNAT ports allocated per backend   Average
Used SNAT Ports          Public Load Balancer    Number of SNAT ports in use per backend      Average

Solutions: How We Prevented SNAT Exhaustion

Azure provides several ways to mitigate SNAT exhaustion in AKS. We applied a mix of these:

1. Connection Pooling (Application-Level Fix)

We updated our services to reuse connections instead of creating new ones per request. In .NET, this meant configuring IHttpClientFactory properly and ensuring HttpClient instances (and their underlying sockets) weren’t disposed after every call.

This alone significantly reduced the port churn.
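The effect of pooling on SNAT usage can be shown with a toy model. ConnectionCounter and the two call functions below are hypothetical, not a real HTTP client; they only count how many sockets (and therefore SNAT ports) each approach opens:

```python
# Toy model: each new outbound TCP connection consumes a SNAT port until
# it closes and the port is released; a pooled connection is opened once
# and reused for many requests.

class ConnectionCounter:
    def __init__(self):
        self.opened = 0
    def new_connection(self):
        self.opened += 1   # one socket = one SNAT port while it lives
        return self

def call_api_no_pooling(counter, requests):
    for _ in range(requests):
        counter.new_connection()    # fresh socket per request

def call_api_with_pool(counter, requests, pool_size=10):
    pool = [counter.new_connection() for _ in range(pool_size)]
    for i in range(requests):
        _ = pool[i % pool_size]     # reuse an existing connection

a, b = ConnectionCounter(), ConnectionCounter()
call_api_no_pooling(a, 10_000)
call_api_with_pool(b, 10_000)
print(a.opened, b.opened)  # 10000 vs 10 sockets opened
```

In .NET the pooled version corresponds to a long-lived HttpClient managed by IHttpClientFactory; in Python, a shared requests.Session plays the same role.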


2. Azure NAT Gateway (Infrastructure Fix)

By default, AKS clusters send outbound traffic through the cluster’s standard Load Balancer and its preallocated SNAT ports. To scale SNAT capacity, we attached an Azure NAT Gateway to the cluster subnet instead.

  • Each NAT Gateway can support 64K SNAT ports per public IP.
  • You can attach multiple public IPs to increase the available SNAT pool.
  • Provides consistent outbound IPs (good for whitelisting with third parties).
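A back-of-the-envelope comparison shows why this helped us. The numbers below come from this post (64K ports per public IP and the default preallocation table earlier); verify them against current Azure limits before relying on them:

```python
# Rough SNAT capacity comparison: default Load Balancer preallocation
# vs. NAT Gateway. NAT Gateway allocates ports on demand across the
# subnet rather than preallocating a fixed slice per VM.

NAT_PORTS_PER_PUBLIC_IP = 64_000

def nat_gateway_capacity(public_ips):
    return NAT_PORTS_PER_PUBLIC_IP * public_ips

def default_lb_capacity(pool_size):
    # default per-VM preallocation for one frontend IP (table earlier)
    for max_size, ports in [(50, 1024), (100, 512), (200, 256),
                            (400, 128), (800, 64), (1000, 32)]:
        if pool_size <= max_size:
            return ports * pool_size
    raise ValueError("pool too large")

print(default_lb_capacity(100))   # 51200 total SNAT ports
print(nat_gateway_capacity(2))    # 128000 total SNAT ports
```

On-demand allocation also means a busy workload can borrow ports an idle node isn’t using, which fixed preallocation cannot do.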

3. Modify Outbound Rules

If you’re using a public standard load balancer and experience SNAT exhaustion or connection failures, ensure you’re using outbound rules with manual port allocation. Otherwise, you’re likely relying on the load balancer’s default port allocation, which isn’t a recommended method for enabling outbound connections. You can configure outbound rules to:

  • Increase or decrease the idle timeout so that ports are held longer or released faster, depending on your traffic pattern.
  • Fine-tune port allocation per frontend IP by increasing the number of available SNAT ports per VM, ensuring workloads don’t starve each other.

This is a useful knob if you don’t want to immediately introduce NAT Gateway but need to optimize existing port usage.
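The arithmetic behind manual allocation can be sketched as follows. With manual outbound rules, ports are reserved per instance up front, so ports-per-instance times the maximum backend count cannot exceed the total pool of the rule’s frontend IPs (64,000 ports per frontend IP assumed here, as noted earlier):

```python
# Sizing constraint for manual outbound-rule port allocation:
# ports_per_instance * max_instances <= 64,000 * frontend_ips

def max_backend_instances(ports_per_instance, frontend_ips):
    total = 64_000 * frontend_ips
    return total // ports_per_instance   # largest pool size that fits

print(max_backend_instances(4_000, 1))   # 16 instances max
print(max_backend_instances(4_000, 2))   # 32 instances max
```

The trade-off is explicit: giving each instance more ports lowers the maximum backend pool size the rule can serve, unless you add frontend IPs.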


4. Add More Frontend IPs

To add more IP addresses for outbound connections, create a frontend IP configuration for each new IP. When outbound rules are configured, you’re able to select multiple frontend IP configurations for a backend pool. 

Each public IP provides its own pool of SNAT ports. By attaching multiple frontend IPs (via Load Balancer or NAT Gateway), you can:

  • Multiply the number of available ports.
  • Spread outbound traffic across IPs for better scaling.
  • Maintain resiliency if one IP gets throttled by a third-party provider.

We eventually used this approach in combination with NAT Gateway to guarantee plenty of outbound capacity.
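When planning this, a simple sizing sketch tells you how many public IPs a target level of concurrency needs, assuming 64,000 usable SNAT ports per IP as discussed above (verify against current Azure documentation):

```python
# How many frontend (or NAT Gateway) public IPs for a target number of
# concurrent outbound flows to one destination, at 64,000 ports per IP.
import math

def public_ips_needed(peak_concurrent_flows, ports_per_ip=64_000):
    return math.ceil(peak_concurrent_flows / ports_per_ip)

print(public_ips_needed(150_000))  # 3 IPs
```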


Illustration of the Problem & Solution

Without enough SNAT capacity (risk of exhaustion):

[Pod] -> [Node IP + Limited SNAT Ports] -> [3rd-party API]

With NAT Gateway and multiple frontend IPs (scalable outbound):

[Pod] -> [Node IP] -> [NAT Gateway + Multiple IPs + Tuned Outbound Rules] -> [3rd-party API]

Lessons Learned

  • SNAT ports are a hidden but critical resource when your AKS workloads make outbound calls.
  • Lack of connection pooling at the app level can waste ports quickly.
  • For production AKS clusters with heavy egress, always plan for a NAT Gateway.
  • If you need more control, tune outbound rules and add frontend IPs to expand port capacity.
  • Scaling nodes can help, but it’s usually better to handle this with networking design.

By combining application best practices (connection pooling) with infrastructure scaling (NAT Gateway, outbound rules, and extra IPs), we solved our port exhaustion issue and ensured our third-party integrations stayed reliable.

References

Troubleshoot SNAT port exhaustion on AKS nodes – Azure | Microsoft Learn

Troubleshoot Azure Load Balancer outbound connectivity issues | Microsoft Learn

Source Network Address Translation (SNAT) for outbound connections – Azure Load Balancer | Microsoft Learn


Ha Hoang
