TCP/UDP connections and port exhaustion on Windows Cells

Article ID: 297452

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Summary

Your Windows cell is reporting healthy, but your applications are slow or failing to make outgoing TCP/UDP requests due to port exhaustion. 
 

Symptom

A process running in one or more containers is consuming many outgoing ports. If you have a legacy app or system that consumes a high number of outgoing ports and is difficult to change (for example, a logging system that creates a new UDP connection for every log request can quickly consume all available ports), you can follow the steps below to debug the issue and temporarily resolve it by increasing the number of available ports.

On Windows, if an app does not reuse sessions when creating connections, each socket takes time to be cleaned up even after it has been closed. If you increase the number of available ports for the container, you may run into a different issue in which the rate of UDP session creation and closing overwhelms the CPU. If the WinNat service is at 100% CPU, the Windows cell becomes unusable even if BOSH reports it as healthy: the cell cannot create any new sessions and requests fail.
 

Background

In Windows Server 2019, the number of dynamic ports assigned to UDP and TCP at the cell level is 16384 for each protocol by default.
By default, a container is assigned a port chunk of 100 ports. Each container can use up to that many ports before it needs to request a new chunk.
After a UDP session is closed, it can take up to 300 seconds before the port can be reused.

Reference: https://docs.microsoft.com/en-us/windows/client-management/troubleshoot-tcpip-port-exhaust
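
The defaults above imply a rough ceiling on how quickly new UDP sessions can be opened before ports run out. The following is a minimal sketch of that arithmetic. It assumes every port stays reserved for the full timeout after use, which is a simplification and not a documented WinNat formula; the function name Get-SustainableSessionRate is hypothetical.

```powershell
# Rough steady-state estimate (an assumption, not a documented WinNat formula):
# if every port stays reserved for the full timeout after it is used, the cell
# can open at most Ports / TimeoutSeconds new sessions per second.
function Get-SustainableSessionRate {
    param(
        [Parameter(Mandatory)][int]$Ports,
        [Parameter(Mandatory)][int]$TimeoutSeconds
    )
    [math]::Round($Ports / $TimeoutSeconds, 1)
}

# Defaults from this article: 16384 dynamic ports, 300 second UDP timeout.
Get-SustainableSessionRate -Ports 16384 -TimeoutSeconds 300   # about 54.6 per second
```

Under this simplified model, an app that opens a new UDP session for every log line would exhaust the default range once it sustains more than roughly 55 new sessions per second across the cell.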


Resolution

Debugging

  1. Run the following command on the cell to get the current number of NAT sessions per protocol. This shows whether a large number of NAT sessions are in use and for which protocol (UDP = 17, TCP = 6). A list of friendly names for the protocol numbers can be found here: https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml

    Get-NetNatSession | Group-Object -Property Protocol -NoElement
    
  2. Enable the WinNat Service Operations Log (does not need a restart).

    $logName = 'Microsoft-Windows-WinNat/Oper'
    $log = New-Object System.Diagnostics.Eventing.Reader.EventLogConfiguration $logName
    $log.IsEnabled=$true
    $log.SaveChanges()

  3. Review the WinNat logs with the following command and look for events similar to “NAT instance XXXXXXX failed to allocate a UDP port dynamically because all ports in the instance's external address pool are in use”.

    Get-WinEvent -ProviderName "Microsoft-Windows-WinNat" | Format-List

  4. Validate that the cell has enough dynamic ports assigned to UDP and TCP. The default is 16384 for each. Check the current values with Get-NetTCPSetting and Get-NetUDPSetting.

  5. Monitor the CPU usage of the `System` process (PID 4) with Sysinternals Process Explorer (https://docs.microsoft.com/en-us/sysinternals/downloads/process-explorer). This process includes the WinNat service. If PID 4 appears to be using one full CPU (for example, 25% on a host with 4 CPUs), look at the threads of this process. A high number of `ntoskrnl.exe+0x74A90` entries indicates that the WinNat service is waiting to get new ports.

  6. If the number of NAT sessions for UDP or TCP is close or equal to the configured dynamic port range, this indicates that the application instances are opening more ports than are available, or are opening and closing sessions faster than the default UDP session timeout of 300 seconds can free them. A high number of `ntoskrnl.exe+0x74A90` entries for the PID 4 process further strengthens this assumption.

  7. If the number of NAT sessions for UDP or TCP is below the configured dynamic port range, but there are “NAT instance XXXXXXX failed to allocate a UDP port dynamically” messages in the WinNat log, this indicates that some app instances require, in bursts, more than the default 100 ports available to a container.
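
The decision in steps 6 and 7 can be sketched as a small helper. This is an illustration only: the function name Get-PortExhaustionHint is hypothetical, and the 0.9 ratio used for "close to" the range is an arbitrary assumption, since the article does not define a threshold. The commented usage lines assume that the Protocol property of Get-NetNatSession holds the protocol number listed in step 1.

```powershell
# Hypothetical helper that encodes the reasoning of steps 6 and 7.
# NearLimitRatio (0.9) is an arbitrary assumption for "close to the range".
function Get-PortExhaustionHint {
    param(
        [Parameter(Mandatory)][int]$SessionCount,
        [Parameter(Mandatory)][int]$DynamicPorts,
        [bool]$AllocationFailuresLogged = $false,
        [double]$NearLimitRatio = 0.9
    )
    if ($SessionCount -ge $DynamicPorts * $NearLimitRatio) {
        # Step 6: the cell-level dynamic port range is (nearly) used up.
        return 'Cell range nearly exhausted'
    }
    if ($AllocationFailuresLogged) {
        # Step 7: ports remain on the cell, but a container ran out of its chunk.
        return 'Container port chunk exhausted in bursts'
    }
    return 'No port exhaustion indicated'
}

# Gathering the inputs on a Windows cell (left commented out so the helper
# can be loaded anywhere):
# $udpSessions = @(Get-NetNatSession | Where-Object { $_.Protocol -eq 17 }).Count
# $udpPorts    = (Get-NetUDPSetting).DynamicPortRangeNumberOfPorts
# $failures    = [bool](Get-WinEvent -ProviderName 'Microsoft-Windows-WinNat' -ErrorAction SilentlyContinue |
#                    Where-Object { $_.Message -like '*failed to allocate a UDP port*' })
# Get-PortExhaustionHint -SessionCount $udpSessions -DynamicPorts $udpPorts -AllocationFailuresLogged $failures
```

The hint is only a starting point; the CPU and thread checks from step 5 are still needed to confirm that WinNat is waiting on ports.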

(Temporary) Workarounds

  1. Distribute application instances across a greater number of Windows cells. This may involve decreasing the size of the cells and increasing the total number of cells. This should reduce the number of requests made from each cell, which lowers the likelihood of exhausting the maximum number of available ports.
  2. Increase the PortChunkSize property, up to 2000:

    New-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Services\WinNat" -Name "PortChunkSize" -Value 2000 -PropertyType "Dword"

    This might help applications that require a large number of ports in bursts. It requires a reboot to take effect and hence has to be done in the stemcell build process.

  3. Increase the number of dynamic ports:

    Set-NetUDPSetting -DynamicPortRangeStartPort 39536 -DynamicPortRangeNumberOfPorts 26000

    This might give the cell enough headroom to clean up old UDP sessions and make room for new ones. It requires a reboot to take effect and hence has to be done in the stemcell build process.

  4. (Currently only on Windows Server 1903) Modify the WinNat UDP timeout property:

    New-ItemProperty "HKLM:\SYSTEM\CurrentControlSet\Services\WinNat" -Name "UdpSessionTimeout" -Value 30 -PropertyType "Dword"

    This decreases the time needed to clean up old UDP sessions and hence makes ports available to new sessions more quickly.

    This requires a reboot to take effect and hence has to be done in the stemcell build process.
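
Before baking a new dynamic port range into a stemcell, it can help to check that the start port plus the number of ports does not run past the highest port number, 65535. The following is a minimal sketch; the function name Test-DynamicPortRange is hypothetical, and Windows may enforce further limits that this check does not cover.

```powershell
# Hypothetical sanity check: the range must end at or below port 65535.
# Windows may apply additional restrictions that are not modeled here.
function Test-DynamicPortRange {
    param(
        [Parameter(Mandatory)][int]$StartPort,
        [Parameter(Mandatory)][int]$NumberOfPorts
    )
    $lastPort = $StartPort + $NumberOfPorts - 1
    ($StartPort -ge 1) -and ($NumberOfPorts -ge 1) -and ($lastPort -le 65535)
}

# Values from workaround 3: the range covers ports 39536 through 65535.
Test-DynamicPortRange -StartPort 39536 -NumberOfPorts 26000

# After the reboot, the registry values from workarounds 2 and 4 can be read
# back on the cell (left commented out because it requires Windows):
# Get-ItemProperty 'HKLM:\SYSTEM\CurrentControlSet\Services\WinNat' |
#     Select-Object PortChunkSize, UdpSessionTimeout
```

The values in workaround 3 use the range right up to the last port, so raising the number of ports further would also require lowering the start port.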


(Permanent) Workarounds

The most permanent solution is to fix the process so that it reuses sessions when making connections.
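
As an illustration of session reuse, the sketch below sends several messages over a single .NET UdpClient instead of creating a new client per message, so only one local port is consumed. It is written in PowerShell to match the commands in this article; the affected application's actual language and client library are unknown, and the function name Send-UdpMessages is hypothetical.

```powershell
# Hypothetical helper: all messages share ONE UdpClient, so the sender uses a
# single local port no matter how many messages it sends.
function Send-UdpMessages {
    param(
        [Parameter(Mandatory)][string]$HostName,
        [Parameter(Mandatory)][int]$Port,
        [Parameter(Mandatory)][string[]]$Messages
    )
    $client = [System.Net.Sockets.UdpClient]::new()
    $sent = 0
    try {
        # Connect once; every Send below reuses the same socket.
        $client.Connect($HostName, $Port)
        foreach ($message in $Messages) {
            $bytes = [System.Text.Encoding]::UTF8.GetBytes($message)
            [void]$client.Send($bytes, $bytes.Length)
            $sent++
        }
    }
    finally {
        # Release the socket once, after all messages have been sent.
        $client.Dispose()
    }
    return $sent
}
```

The pattern to avoid is creating and disposing a client inside the loop, which consumes a new port for every message and leaves each one waiting for the UDP session timeout before it can be reused.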