Troubleshooting NSX Edge High Availability

Products

VMware NSX
VMware NSX-T Data Center

Issue/Introduction

When troubleshooting NSX Edge High Availability (HA), whether an unexpected failover or a failure to fail over, a specific set of data must be gathered at the time of the event. This article details what data is required and how to gather it before opening a support request with Broadcom.

Environment

VMware NSX
VMware NSX-T Data Center

Resolution

With NSX Edges, many troubleshooting sessions stem from the question “Why did this Edge fail over?” or, more frequently, “Why did it NOT fail over when it was expected to?” The answer to both questions is usually found in the NSX Edge’s reaction, or lack of reaction, to environmental changes such as configuration updates, workload increases, vMotion of the Edge, physical component failover testing, or other external factors beyond the Edge’s control.

NOTE: The NSX Edge virtual machines can benefit from vSphere HA. This article is not about troubleshooting that feature.

Edge High Availability (HA) Requirements:

A TEP network is required for the HA Channel
- The HA Channel uses the overlay/TEP and management networks of the Edge
- A TEP network is required even when the Edge is used only for VLAN-backed services, such as:
  - Load Balancing on VLAN logical switches
  - Layer 2 Bridging
- Even if no virtual machines use the overlay/TEP network, this network MUST exist

Edge Failover Detection Mechanisms:

BFD - Used on Management and overlay/TEP networks

If only management network connectivity is lost, a failover will not occur
If only overlay network connectivity is lost, a failover will not occur
If both are lost, a failover will occur
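
The BFD rule above can be written as a small truth table. The following is only an illustrative sketch of the decision described in this article, not NSX code; the function name and its boolean parameters are hypothetical.

```python
def bfd_triggers_failover(management_up: bool, overlay_up: bool) -> bool:
    """Model of the rule above: BFD-based failover occurs only when
    both the management path and the overlay/TEP path are lost."""
    return not management_up and not overlay_up


if __name__ == "__main__":
    # Print all four combinations of path state.
    for mgmt in (True, False):
        for overlay in (True, False):
            print(
                f"management_up={mgmt!s:<5} overlay_up={overlay!s:<5} "
                f"failover={bfd_triggers_failover(mgmt, overlay)}"
            )
```

Losing either path alone leaves one BFD channel intact, which is why only the last combination (both down) results in a failover.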

Dynamic Routing Protocols (BGP/OSPF) on Edge uplink interfaces

Connectivity to ALL BGP/OSPF peers on a single uplink must be lost for a failover to take place

Documentation on how Edge HA works in various implementations can be found at the following links:

Log locations and keywords:

In Edge VM/BME
- /var/log/syslog*
- Relevant Edge log keywords (use “grep -i” if reviewing from the NSX CLI as root)
  - node HA state transition
  - failure_reason
  - HA state
  - HA tunnel
  - state changed from
  - state changed to
  - Edge node status changed
  - Service router switches over from
  - Process DP BFD state update
  - BFD State Updated reason
  - state updated from
  - BFD session for peer
  - BGP neighbor
  - is down. Reason:
In NSX Manager
- /var/log/syslog*
- Relevant log keywords
  - tier0_gateway_failover
  - tier1_gateway_failover
  - _gateway_failover
  - All BGP sessions are down
  - All BFD sessions are down
  - Management channel to Manager Node
  - Edge node NIC fp-eth
  - All members of failure domain
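
For offline review of an extracted log bundle, the Edge keyword search can be scripted instead of running “grep -i” once per keyword. The sketch below is a hypothetical helper, not a Broadcom tool: the function name, the default path pattern, and the handling of compressed rotated logs are assumptions. It performs a case-insensitive substring match for each Edge keyword listed above.

```python
import glob
import gzip
import sys

# Edge keywords copied from the list above.
EDGE_KEYWORDS = [
    "node HA state transition",
    "failure_reason",
    "HA state",
    "HA tunnel",
    "state changed from",
    "state changed to",
    "Edge node status changed",
    "Service router switches over from",
    "Process DP BFD state update",
    "BFD State Updated reason",
    "state updated from",
    "BFD session for peer",
    "BGP neighbor",
    "is down. Reason:",
]


def find_ha_lines(lines, keywords=EDGE_KEYWORDS):
    """Return (line, matched_keywords) for every line containing at least
    one keyword, compared case-insensitively (like grep -i)."""
    lowered = [k.lower() for k in keywords]
    hits = []
    for line in lines:
        text = line.lower()
        matched = [k for k, low in zip(keywords, lowered) if low in text]
        if matched:
            hits.append((line.rstrip("\n"), matched))
    return hits


if __name__ == "__main__":
    # Assumption: the pattern points at syslog files, live or extracted.
    pattern = sys.argv[1] if len(sys.argv) > 1 else "/var/log/syslog*"
    for path in sorted(glob.glob(pattern)):
        # Assumption: rotated logs may be gzip-compressed.
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt", errors="replace") as handle:
            for line, _matched in find_ha_lines(handle):
                print(f"{path}: {line}")
```

To search NSX Manager logs instead, pass the Manager keyword list as the keywords argument.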

CLI commands to check/verify Edge status:

get edge-cluster status

This command will show the current status of this Edge within the cluster
Look for Edge node Status Up/Down
Look for Healthcheck Sessions

get edge-cluster history state

Shows the state of the Edge and the dataplane service, the time of that state, and the reason for any changes

get bfd-sessions

Reveals the BFD sessions, usually TEP tunnel endpoints, the Edge is aware of and uses for High-Availability status

get bfd-sessions stats

Review the statistics for each tunnel/BFD session, such as packet counts and drop counts with reasons

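get logical-routers
vrf #
get high-availability history state details

Run get logical-routers to find the VRF number of the service router, enter that VRF with vrf #, then run the last command. Executed inside a specific VRF, it shows the high availability details and history of the given service router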

Known Issues

Additional Information

Log Line Analysis:

Because BGP is often used to determine the up or down status of an Edge’s service routers, it is frequently involved in failovers. Some significant log lines are relevant to both BGP troubleshooting and HA troubleshooting. For example:

BGP neighbor ########-####-####-####-############ (###.###.###.###) is down. Reason: Network or config error.

The Edge only knows it tried communicating with this BGP peer and did not receive the required responses within the timeout period. This is usually an indication that something external to the Edge has affected communication to this BGP peer. Some examples of such issues include:
- ESXi host lost network connectivity to upstream switch
- ESXi host may be experiencing physical NIC issues that result in lost or delayed packets
- Edge may have temporarily lost upstream connectivity due to vMotion
- Physical upstream switch went down or experienced some other issue
- Workload may have temporarily increased on ESXi host or Edge, causing high CPU or memory utilization, which caused dropped packets
- Edge virtual machine may have insufficient resources (e.g. a Medium Edge footprint when a Large or Extra-Large Edge may be necessary)
- Insufficient host resources available to the Edge

BGP neighbor ########-####-####-####-############ (###.###.###.###) is down. Reason: Edge is not ready.

This is an indication that a component inside the Edge is not fully up and capable of processing BGP updates or traffic. For example:
- The Edge is in Maintenance Mode
- The Edge is exiting Maintenance Mode but hasn’t fully started all necessary processes
- A component of the Edge is in an administratively down state
- The specified neighbor is administratively down (flag NSXA_OP_STATE_DOWN)
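
When many such lines appear in a log bundle, a first-pass triage can sort them by reason. The sketch below is illustrative only: the regular expression is inferred from the two log lines shown in this section, and the function name, hint text, and sample line (a placeholder UUID and a documentation-range IP address) are hypothetical.

```python
import re

# Pattern inferred from the log lines shown in this article.
BGP_DOWN = re.compile(
    r"BGP neighbor (?P<neighbor>\S+) \((?P<address>[^)]+)\) is down\. "
    r"Reason: (?P<reason>.*?)\.?\s*$"
)

# Hints summarizing this article's guidance for each reason.
HINTS = {
    "network or config error": (
        "external to the Edge: check host NICs, upstream switches, "
        "vMotion events, and resource pressure"
    ),
    "edge is not ready": (
        "internal to the Edge: check Maintenance Mode and "
        "administratively down components or neighbors"
    ),
}


def classify_bgp_down(line):
    """Return (neighbor, address, reason, hint), or None if the line
    is not a 'BGP neighbor ... is down' message."""
    match = BGP_DOWN.search(line)
    if match is None:
        return None
    reason = match.group("reason")
    hint = HINTS.get(reason.lower(), "reason not covered in this article")
    return match.group("neighbor"), match.group("address"), reason, hint


if __name__ == "__main__":
    # Hypothetical sample line; real lines carry the actual UUID and IP.
    sample = (
        "BGP neighbor 00000000-0000-0000-0000-000000000000 (192.0.2.1) "
        "is down. Reason: Edge is not ready."
    )
    print(classify_bgp_down(sample))
```

The hint only points to where to look first; the log bundles are still needed to confirm the actual cause.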

If you are contacting Broadcom support about this issue, please provide the following:

- NSX Edge log bundles for all Edges in the Edge Cluster
  - Ensure the log date range covers the full date of the event(s) being investigated. When in doubt, retrieve logs for all time.
- NSX Manager log bundles
- ESXi host log bundles for all hosts supporting affected Edge VMs
- Text of any error messages seen in the NSX GUI or command lines pertinent to the investigation

Handling Log Bundles for offline review with Broadcom support