Troubleshooting NSX Edge High Availability

Article ID: 376948

Products

VMware NSX

Issue/Introduction

When troubleshooting NSX Edge High Availability (HA) events (a failover, or a failure to fail over), a specific set of data must be gathered at the time of the event. This article details what data is required and how to gather it before opening a support request with Broadcom.

Environment

VMware NSX

Resolution

Many NSX Edge troubleshooting sessions stem from the question “Why did this Edge fail over?” or, more frequently, “Why did it NOT fail over when it was expected to?” The answer to both questions is usually found in the NSX Edge’s reaction, or lack of reaction, to environmental changes such as configuration updates, workload increases, vMotion of the Edge, physical component failover testing, or other external factors beyond the Edge’s control.

NOTE: The NSX Edge virtual machines can benefit from vSphere HA. This article is not about troubleshooting that feature.

 

Edge High Availability (HA) Requirements:

  • A TEP network is required for the HA Channel
    • The HA Channel uses the overlay/TEP and management networks of the Edge
    • A TEP network is required even when the Edge is used only for:
      • Load Balancing on VLAN logical switches
      • Layer 2 Bridging
  • Even if no virtual machines will use the overlay/TEP network, this network MUST exist.

 

Edge Failover Detection Mechanisms:

  • BFD - Used on Management and overlay/TEP networks
    • If only management network connectivity is lost, a failover will not occur
    • If only overlay network connectivity is lost, a failover will not occur
    • If both are lost, a failover will occur
  • Dynamic Routing Protocols (BGP/OSPF) on Edge uplink interfaces
    • Connectivity to ALL BGP/OSPF peers on a single uplink must be lost for a failover to take place
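
The two detection rules above can be written as simple predicates. The following is a minimal Python sketch for reasoning about expected behavior, not NSX code. The function and parameter names are hypothetical. It treats the routing rule as "every peer on the uplink is down", and treating an uplink with no peers as "no trigger" is an assumption of the sketch.

```python
from typing import Mapping


def bfd_triggers_failover(mgmt_bfd_up: bool, tep_bfd_up: bool) -> bool:
    """BFD rule: a failover is expected only when BOTH the management
    and the overlay/TEP BFD connectivity are lost."""
    return not mgmt_bfd_up and not tep_bfd_up


def routing_triggers_failover(peers_up: Mapping[str, bool]) -> bool:
    """Routing rule: every BGP/OSPF peer on the uplink must be down.

    `peers_up` maps a peer label (hypothetical) to True when that peer
    is reachable. An empty mapping is treated as "no trigger", which is
    an assumption of this sketch.
    """
    return bool(peers_up) and not any(peers_up.values())
```

For instance, `bfd_triggers_failover(False, True)` evaluates to `False`, which matches the case where only management connectivity is lost.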

 

Documentation on how Edge HA works in various implementations can be found at the following links:

 

Log locations and keywords:

  • In Edge VM/BME
    • /var/log/syslog*
    • Relevant Edge log keywords (use “grep -i” if reviewing from the NSX CLI as root)
      • node HA state transition
      • failure_reason
      • HA state
      • HA tunnel
      • state changed from
      • state changed to
      • Edge node status changed
      • Service router switches over from
      • Process DP BFD state update
      • BFD State Updated reason
      • state updated from
      • BFD session for peer
      • BGP neighbor
      • is down. Reason:
  • In NSX Manager
    • /var/log/syslog*
    • Relevant log keywords
      • tier0_gateway_failover
      • tier1_gateway_failover
      • _gateway_failover
      • All BGP sessions are down
      • All BFD sessions are down
      • Management channel to Manager Node
      • Edge node NIC fp-eth
      • All members of failure domain
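
When reviewing an extracted log bundle offline, the keyword lists above can be applied with a small script instead of repeated “grep -i” runs. The following is a minimal Python sketch. The helper name `matching_lines` is hypothetical, the keyword tuples are copied from the lists above, and matching is a case-insensitive substring test, similar in effect to “grep -i”.

```python
from typing import Iterable, Iterator, Sequence

# Keywords copied from the Edge list in this article.
EDGE_KEYWORDS = (
    "node HA state transition", "failure_reason", "HA state", "HA tunnel",
    "state changed from", "state changed to", "Edge node status changed",
    "Service router switches over from", "Process DP BFD state update",
    "BFD State Updated reason", "state updated from", "BFD session for peer",
    "BGP neighbor", "is down. Reason:",
)

# Keywords copied from the NSX Manager list in this article.
# "_gateway_failover" also covers the tier0_/tier1_ variants.
MANAGER_KEYWORDS = (
    "_gateway_failover", "All BGP sessions are down",
    "All BFD sessions are down", "Management channel to Manager Node",
    "Edge node NIC fp-eth", "All members of failure domain",
)


def matching_lines(lines: Iterable[str], keywords: Sequence[str]) -> Iterator[str]:
    """Yield each line containing at least one keyword, ignoring case."""
    lowered = [k.lower() for k in keywords]
    for line in lines:
        text = line.lower()
        if any(k in text for k in lowered):
            yield line.rstrip("\n")
```

A typical use would be to open a syslog file from the bundle with `open(path, errors="replace")` and pass the file object as `lines`; the path depends on where the bundle was extracted.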

 

CLI commands to check/verify Edge status:

  • get edge-cluster status
    • This command will show the current status of this Edge within the cluster
    • Look for Edge node Status Up/Down
    • Look for Healthcheck Sessions
  • get edge-cluster history state
    • Shows the state of the Edge and the dataplane service, the time of each state, and the reason for any changes
  • get bfd-sessions
    • Lists the BFD sessions (usually to TEP tunnel endpoints) that the Edge is aware of and uses to determine High Availability status
  • get bfd-sessions stats
    • Shows statistics for each tunnel/BFD session, such as packet counts and drop counts with reasons
  • Run the following commands in order:
    • get logical-routers
    • vrf #
    • get high-availability history state details
      • This command, executed inside a specific VRF, shows the high availability details and history of a given service router

Known Issues

Additional Information

Log Line Analysis:

Because BGP is often used to determine the up/down status of an Edge’s service routers, it is frequently involved in failovers. Some significant log lines are relevant to both BGP troubleshooting and HA troubleshooting. For example:

  • BGP neighbor ########-####-####-####-############ (###.###.###.###) is down. Reason: Network or config error.
    • The Edge only knows it tried communicating with this BGP peer and did not receive the required responses within the timeout period.
    • This is usually an indication that something external to the Edge has affected communication to this BGP peer.  Some examples of such issues include:
      • ESXi host lost network connectivity to upstream switch
      • ESXi host may be experiencing physical NIC issues that result in lost or delayed packets
      • Edge may have temporarily lost upstream connectivity due to vMotion
      • Physical upstream switch went down or experienced some other issue
      • Workload may have temporarily increased on ESXi host or Edge, causing high CPU or memory utilization, which caused dropped packets
      • Edge virtual machine may have insufficient resources (e.g., a Medium Edge footprint when a Large or Extra-Large Edge may be necessary)
      • Insufficient host resources available to the Edge
  • BGP neighbor ########-####-####-####-############ (###.###.###.###) is down. Reason: Edge is not ready.
    • This is an indication that a component inside the Edge is not fully up and capable of processing BGP updates or traffic. Possible causes include:
      • The Edge is in Maintenance Mode
      • The Edge is exiting Maintenance Mode but hasn’t fully started all necessary processes
      • A component of the Edge is in an administratively down state
      • The specified neighbor is administratively down (flag NSXA_OP_STATE_DOWN)
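
To triage many of these lines at once, the neighbor, peer address, and reason can be extracted with a regular expression. The following is a minimal Python sketch based only on the two line shapes quoted above. The names `BgpDown`, `parse_bgp_down`, and `HINTS` are hypothetical, and the hints simply restate this article's guidance rather than any NSX logic.

```python
import re
from typing import NamedTuple, Optional

# Matches: BGP neighbor <id> (<address>) is down. Reason: <text>
BGP_DOWN = re.compile(
    r"BGP neighbor (?P<neighbor>\S+) \((?P<ip>[^)]+)\) is down\. "
    r"Reason: (?P<reason>.+?)\s*$"
)

# Short restatement of the guidance given in this article.
HINTS = {
    "Network or config error": "usually external to the Edge",
    "Edge is not ready": "internal: an Edge component is not fully up",
}


class BgpDown(NamedTuple):
    neighbor: str
    ip: str
    reason: str


def parse_bgp_down(line: str) -> Optional[BgpDown]:
    """Return the parsed fields, or None if the line is not a BGP-down line."""
    match = BGP_DOWN.search(line)
    if match is None:
        return None
    # Drop the trailing period so the reason can be used as a HINTS key.
    return BgpDown(match["neighbor"], match["ip"], match["reason"].rstrip("."))
```

Looking up `HINTS.get(result.reason)` for each parsed line gives a first indication of whether to start with the surrounding environment or with the Edge itself.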

 

If you are contacting Broadcom support about this issue, please provide the following:

  • NSX Edge log bundles for all Edges in the Edge Cluster
  • Ensure the log date range covers the full date of the event(s) being investigated. When in doubt, retrieve all available logs.
  • NSX Manager log bundles
  • ESXi host log bundles for all hosts supporting affected Edge VMs
  • Text of any error messages seen in NSX GUI or command lines pertinent to the investigation

Handling Log Bundles for offline review with Broadcom support