TCP Handshake Timeout error on Load balancer pool members when using TCP health-checks.

Article ID: 429405

Products

VMware NSX

Issue/Introduction

In an NSX environment configured for one-arm load balancing, where the Virtual Server is hosted on a T1 whose service interface uplink uses the same IP as the Virtual Server, you may observe:

  • LB pool members using TCP health-checks intermittently report as down with a failure reason of "TCP handshake timeout".
  • Pool members, or the entire Virtual Server (VS), may flap depending on the scenario.
  • vMotioning pool member VMs may temporarily resolve the connectivity issue.
  • Intermittent traffic "blackholing", where some requests to the Load Balancer succeed while others fail.
  • Traceflow from the pool member to the Virtual Server IP shows traffic delivered to an unexpected T1.
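
Because the failures are intermittent, a single connection test can be misleading. The following is a minimal sketch, not part of any NSX tooling, of a repeated TCP handshake probe that could be run from a test machine to see whether some attempts succeed while others time out. The function names, the default timeout, and the retry interval are assumptions chosen for illustration.

```python
import socket
import time


def tcp_probe(host, port, timeout=2.0):
    """Attempt one TCP handshake and return (ok, detail)."""
    try:
        # create_connection completes the three-way handshake or raises.
        with socket.create_connection((host, port), timeout=timeout):
            return True, "connected"
    except socket.timeout:
        # No SYN-ACK arrived within the timeout.
        return False, "handshake timeout"
    except OSError as exc:
        # Refused, unreachable, or another socket-level failure.
        return False, f"error: {exc}"


def probe_many(host, port, attempts=10, interval=0.5, timeout=2.0):
    """Repeat the probe so that intermittent failures become visible."""
    results = []
    for _ in range(attempts):
        results.append(tcp_probe(host, port, timeout))
        time.sleep(interval)
    return results
```

For example, calling probe_many with the Virtual Server IP and port (or a pool member's IP and port) and seeing a mix of "connected" and "handshake timeout" results would be consistent with the symptoms listed above.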

Environment

  • VMware NSX

Cause

This issue occurs because the IP address used for the T1 Uplink (and the Virtual Server) is also configured on one or more other Tier-1 Gateways. As a result, return traffic from the pool members is routed inconsistently: it may reach the correct T1, or it may be sent to a T1 that does not hold the active LB session. In the latter case the connection is dropped, which causes the health check, the LB session flow, or both to fail.

Resolution

To resolve this issue, ensure that no duplicate IPs exist in the return path.

  • Log in to the NSX Manager UI.
  • Use the Search function at the top of the interface to search for the specific IP address used by the Virtual Server/T1 Uplink.
  • Review the search results to see all objects associated with that IP.
  • If the IP is assigned to more than one T1 Uplink, edit the incorrect T1 Gateways to remove or change the duplicate IP.
  • Verify that the IP is assigned only to the single T1 Gateway intended to host the Load Balancer service.
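
When many T1 Gateways are involved, the same check can be run offline against an inventory list. The sketch below assumes a hypothetical list of records with "gateway", "interface", and "ip" keys, for example compiled by hand from the search results; it is not the format of any NSX API response, and the gateway names and addresses are invented for illustration.

```python
from collections import defaultdict
import ipaddress


def find_duplicate_ips(interfaces):
    """Return each IP that appears on more than one gateway, with its owners."""
    owners = defaultdict(set)
    for entry in interfaces:
        # Normalise the address so equivalent notations compare equal.
        ip = ipaddress.ip_address(entry["ip"])
        owners[ip].add(entry["gateway"])
    return {str(ip): sorted(gws) for ip, gws in owners.items() if len(gws) > 1}


# Hypothetical inventory: two T1 Gateways share the same uplink IP.
inventory = [
    {"gateway": "t1-lb", "interface": "uplink-1", "ip": "192.0.2.10"},
    {"gateway": "t1-app", "interface": "uplink-1", "ip": "192.0.2.10"},
    {"gateway": "t1-db", "interface": "uplink-1", "ip": "192.0.2.20"},
]

print(find_duplicate_ips(inventory))
# {'192.0.2.10': ['t1-app', 't1-lb']}
```

An empty result means no IP in the list is shared between gateways; any entry in the result names the gateways that need to be reviewed.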

Additional Information

Duplicate IPs are not an NSX-specific problem; an IP address should never be duplicated within a network. This article describes one specific symptom of that condition and a quick way to check for it when the error is observed.