Failover pre-checks are not performed when ESX is put into maintenance mode manually

Products

VMware NSX

Issue/Introduction

Edge VMs are critical for data center traffic. During an upgrade, vMotion of the Edge VMs may cause more downtime than a graceful failover. Edges deployed in a non-homogeneous network may also have affinity with the underlying ESX host and need to be pinned to it. To address these problems, the edge host pinning feature allows edges to be pinned to specific host(s). The Edge VM is associated with a host group using an affinity rule, and the Best Effort Restart (BER) compute policy is configured on it. This enables host remediation in the vSphere Lifecycle Manager (vLCM) workflow to gracefully shut down edges when maintenance mode is enabled.

The sequence of steps in the host remediation flow is as follows:

The vLCM host remediation task performs pre-checks before proceeding with the host maintenance mode operation.
NSX Manager pre-checks are invoked for the host and for the edge VMs on the host entering maintenance mode. The checks verify the health of each edge's active/standby peer.
1. If the peer is not healthy for failover, the upgrade does not proceed. Pre-check errors must be resolved manually.
2. If the pre-checks pass, vSphere shuts down the edge VMs gracefully, as directed by the configured BestEffortRestart policy.
The BER policy then checks whether another compatible host exists in the host group on which the edge VMs can be powered on. If not, the edge VMs stay shut down during the host maintenance window.
Once the host exits maintenance mode, the edge VMs are powered back on.
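
The flow above can be summarized as a small decision sketch. This is only an illustrative model written in Python, not NSX or vSphere code; the function name remediation_outcome and its two boolean parameters are hypothetical.

```python
def remediation_outcome(prechecks_pass: bool, compatible_host_in_group: bool) -> list[str]:
    """Return the ordered steps the host remediation flow would take.

    Hypothetical model of the sequence described above. It does not call
    vLCM, vCenter, or NSX.
    """
    if not prechecks_pass:
        # The peer edge is not healthy for failover, so remediation stops.
        return ["pre-checks failed: resolve errors manually, upgrade does not proceed"]

    steps = ["pre-checks passed", "edge VMs shut down gracefully (BER policy)"]
    if compatible_host_in_group:
        steps.append("edge VMs powered on on another compatible host in the host group")
    else:
        steps.append("edge VMs stay shut down during the host maintenance window")
        steps.append("edge VMs powered back on once the host exits maintenance mode")
    return steps


if __name__ == "__main__":
    for step in remediation_outcome(prechecks_pass=True, compatible_host_in_group=False):
        print(step)
```

The sketch makes one point explicit: a failed pre-check is the only branch in which the edge VMs are never shut down.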

Environment

VMware VCF 9.0
VMware NSX 9.0

Cause

vLCM triggers the remediate host workflow, which runs the pre-checks and then puts the ESX host in maintenance mode. This workflow performs the necessary pre-checks for peer edge failover readiness. However, if a user puts a host in maintenance mode manually, the pre-checks are not triggered. Because the BER policy is set, the edges still shut down when the host enters maintenance mode.

The following scenarios illustrate the difference in behavior:

Scenario	Edge Shutdown Behavior	Pre-check Before Shutdown Behavior	Remarks
Host remediation via vLCM orchestrated upgrade	Edges shut down gracefully before the host enters maintenance mode	Pre-checks done on peer edge before shutdown of Active edge.	N/A
Host bits upgrade using vSphere Upgrade Manager (VUM)	Edges shut down gracefully before the host enters maintenance mode	Pre-checks not performed on peer edge before shutdown of Active edge.	Refer to the Resolution section and invoke the pre-checks API before putting ESX in maintenance mode
NSX based Host upgrade	Edges shut down gracefully before the host enters maintenance mode	Pre-checks done on peer edge before shutdown of Active edge.	N/A
Host is put in maintenance mode manually at vCenter	Edges shut down gracefully before the host enters maintenance mode	Pre-checks not performed on peer edge before shutdown of Active edge.	Refer to the Resolution section and invoke the pre-checks API before putting ESX in maintenance mode
Host crashes	Edges become unresponsive and shut down abruptly.	Pre-checks not performed on peer edge before shutdown of Active edge.	Refer to the Resolution section and invoke the pre-checks API before putting ESX in maintenance mode
NSX based edge upgrade	Same as earlier NSX versions	Pre-checks done on peer edge before shutdown of Active edge.	N/A

Table 1: Behavior in various maintenance scenarios
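
For automation, Table 1 can be encoded as a lookup that tells a script whether the workflow runs the peer edge pre-checks itself. This is a minimal sketch; the scenario keys and the helper needs_manual_precheck are hypothetical names chosen for illustration.

```python
# True means the workflow runs the peer edge pre-checks itself (per Table 1).
# The keys are hypothetical labels for the rows of Table 1.
PRECHECKS_AUTOMATIC = {
    "vlcm_host_remediation": True,
    "vum_host_upgrade": False,
    "nsx_host_upgrade": True,
    "manual_maintenance_mode": False,
    "host_crash": False,  # unplanned event, listed in Table 1 for completeness
    "nsx_edge_upgrade": True,
}


def needs_manual_precheck(scenario: str) -> bool:
    """Return True when Table 1 says the workflow does not run the pre-checks itself."""
    if scenario not in PRECHECKS_AUTOMATIC:
        raise ValueError(f"unknown scenario: {scenario!r}")
    return not PRECHECKS_AUTOMATIC[scenario]


if __name__ == "__main__":
    print(needs_manual_precheck("manual_maintenance_mode"))
```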

DRS recommendations:

Edge VMs host critical network traffic. Depending on various factors, vMotion may cause more packet drops than an NSX edge failover during maintenance mode. Hence, pinning edges to hosts is recommended.
For this feature to work, DRS must be enabled and set to at least manual mode. A few things to note regarding DRS settings:
1. DRS disabled: Disabling DRS removes the resource pools from the cluster, and the cluster’s resource pool hierarchy and affinity rules are not re-established when DRS is turned back on. To avoid losing the resource pools, save a snapshot of the resource pool tree before disabling DRS. The snapshot can be used to restore the resource pools when DRS is re-enabled. Take appropriate precautions if resource pools are in use.
2. DRS in manual mode: Initial placement and load balancing recommendations must be applied manually. DRS only gives recommendations and does not apply them, even under resource contention. Distributing the edge VMs properly across hosts also allows for better capacity planning. The reservation of shares guarantees that resource-hungry VMs get the resources they need, even when resources are shared.
3. For a comprehensive overview of DRS, see VMware DRS Overview: Optimizing Resource Allocation in Your vSphere Cluster.

VM to Host group affinity rules:

The BER policy only guarantees that the power cycle of the edge VM stays aligned with host maintenance mode. To define the placement boundary of the edge VM, an affinity rule between the edge VM and the host must also be created.
This affinity rule is configured when edge host affinity is enabled using the Edge API/UI, beginning with VCF 9.0.
Avoid changing the affinity rule from vCenter directly.
Instead, change the host group via the Edge intent APIs, or remove the edge host affinity configuration altogether via the API to stop using the feature.
The affinity rules created are mandatory rules, not preferential ("should") rules, so that VMs are not migrated unintentionally.

BER policy:

The BER policy is an override policy, not a fallback for vMotion. Even if vMotion is possible during the host maintenance mode operation, the configured BER policy gives preference to shutting down the edge VM rather than migrating it.
The system attempts to restart the VM on another compatible host.
The retry interval is 3 minutes. The corresponding tasks appear in vSphere (see figure below) and, in this case, must be ignored.

Resolution

For the cases mentioned above where the framework does not invoke the pre-checks by default, NSX Manager exposes an API that can be called manually by a client or automated. This API is documented at Execute checks before entering host into maintenance mode.

1. Get the vCenter UUID from the URL bar of a browser connected to the vCenter Server. In the following example, the vCenter UUID is 328cc7ad-####-####-####-984903321d2c.
2. Get the host moref value for the host that is to be put in maintenance mode. In the following example, with the host under consideration (10.160.232.75) selected, its moref ID is host-##.
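
When the host is selected in the vSphere Client, both values can often be read from the same browser URL. The sketch below assumes that the URL contains a segment of the form urn:vmomi:HostSystem:<host moref>:<vCenter UUID>. That URL shape, the example hostname, and the helper name parse_host_urn are assumptions for illustration, so verify them against the URL in your own environment.

```python
import re

# Assumed shape of the URN embedded in a vSphere Client URL.
_HOST_URN = re.compile(
    r"urn:vmomi:HostSystem:(?P<moref>host-\d+):"
    r"(?P<uuid>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})"
)


def parse_host_urn(url: str) -> tuple[str, str]:
    """Return (vcenter_uuid, host_moref) extracted from a vSphere Client URL."""
    match = _HOST_URN.search(url)
    if match is None:
        raise ValueError("no HostSystem URN found in URL")
    return match.group("uuid"), match.group("moref")


if __name__ == "__main__":
    # Made-up URL with placeholder identifiers.
    example = (
        "https://vcenter.example.com/ui/app/host;nav=h/"
        "urn:vmomi:HostSystem:host-42:00000000-1111-2222-3333-444444444444/summary"
    )
    print(parse_host_urn(example))
```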

Use an API call similar to the following against the NSX manager to get the pre-checks triggered for host-##:

POST https://<NSX manager IP/FQDN>/api/v1/upgrade/pre-upgrade-checks/host/planned-maintenance

Request body:

{
    "vcenter-uuid": "328cc7ad-####-####-####-984903321d2c",
    "entity-id": "host-##"
}

Here, vcenter-uuid is the vCenter UUID from step 1 and entity-id is the host moref from step 2.
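
The call can also be scripted. The following sketch builds the POST request shown above using only the Python standard library. The helper name build_precheck_request and the example hostname are hypothetical, and authentication and TLS certificate handling are deliberately left out because they depend on the environment.

```python
import json
import urllib.request

# Endpoint path taken from the API call shown above.
PRECHECK_PATH = "/api/v1/upgrade/pre-upgrade-checks/host/planned-maintenance"


def build_precheck_request(nsx_manager: str, vcenter_uuid: str, host_moref: str) -> urllib.request.Request:
    """Build (but do not send) the pre-checks request for one host."""
    body = json.dumps({"vcenter-uuid": vcenter_uuid, "entity-id": host_moref})
    return urllib.request.Request(
        url=f"https://{nsx_manager}{PRECHECK_PATH}",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


if __name__ == "__main__":
    # Placeholder values; replace with the values from steps 1 and 2.
    request = build_precheck_request(
        "nsx.example.com", "00000000-1111-2222-3333-444444444444", "host-42"
    )
    print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) requires credentials accepted by the NSX Manager, which are not part of this sketch.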

Response:

{
    "check_statuses": [
        {
            "status": "OK",
            "issues": [],
            "info": {
                "check": "com.vmware.nsxt.PeerEdgesOnSameHostCheck",
                "name": {
                    "default_message": "Peer edges on same host check",
                    "id": "",
                    "localized": "Peer edges on same host check"
                },
                "description": {
                    "default_message": "This precheck ensures edges hosting peer LRs are not on same host under upgrade",
                    "id": "",
                    "localized": "This precheck ensures edges hosting peer LRs are not on same host under upgrade"
                }
            }
        },
        {
            "status": "OK",
            "issues": [],
            "info": {
                "check": "com.vmware.nsxt.PeerEdgeHealthCheckHostCheck",
                "name": {
                    "default_message": "Peer edges health check task",
                    "id": "",
                    "localized": "Peer edges health check task"
                },
                "description": {
                    "default_message": "This precheck ensures peers of edges which are on the host under upgrade are healthy",
                    "id": "",
                    "localized": "This precheck ensures peers of edges which are on the host under upgrade are healthy"
                }
            }
        },
        {
            "status": "OK",
            "issues": [],
            "info": {
                "check": "com.vmware.nsxt.PeerLRStatusHostCheck",
                "name": {
                    "default_message": "LR status check on peer edges",
                    "id": "",
                    "localized": "LR status check on peer edges"
                },
                "description": {
                    "default_message": "This precheck ensures LRs of edges on the host under upgrade have a healthy peer if fail over happens",
                    "id": "",
                    "localized": "This precheck ensures LRs of edges on the host under upgrade have a healthy peer if fail over happens"
                }
            }
        },
        {
            "status": "OK",
            "issues": [],
            "info": {
                "check": "com.vmware.nsxt.PeerLRBgpNeighbourHostCheck",
                "name": {
                    "default_message": "BGP status check on peer edges",
                    "id": "",
                    "localized": "BGP status check on peer edges"
                },
                "description": {
                    "default_message": "This precheck ensures BGP neighbours of edges on the host under upgrade have a healthy peer if fail over happens",
                    "id": "",
                    "localized": "This precheck ensures BGP neighbours of edges on the host under upgrade have a healthy peer if fail over happens"
                }
            }
        }
    ],
    "status": "OK"
}

The above API triggers the pre-checks for the edge VMs hosted on host-## and checks the health of their active/standby peers.

If the API returns the status "OK", the peer is healthy and the host can be put into maintenance mode.
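
A script can evaluate the response in the same way. The sketch below relies only on the fields visible in the sample response (check_statuses, status, info.check); the helper names failed_checks and safe_to_enter_maintenance_mode are hypothetical, and any status other than "OK" is treated as a failure.

```python
def failed_checks(response: dict) -> list[str]:
    """Return the identifiers of pre-checks whose status is not "OK"."""
    return [
        entry.get("info", {}).get("check", "<unknown check>")
        for entry in response.get("check_statuses", [])
        if entry.get("status") != "OK"
    ]


def safe_to_enter_maintenance_mode(response: dict) -> bool:
    """True only when the overall status and every individual check are "OK"."""
    return response.get("status") == "OK" and not failed_checks(response)


if __name__ == "__main__":
    # Trimmed-down version of the sample response above.
    sample = {
        "check_statuses": [
            {"status": "OK", "issues": [],
             "info": {"check": "com.vmware.nsxt.PeerEdgeHealthCheckHostCheck"}},
        ],
        "status": "OK",
    }
    print(safe_to_enter_maintenance_mode(sample))
```

Checking the individual entries in addition to the top-level status is a conservative choice, so that a script does not proceed if any single check reports a problem.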

Additional Information

Execute checks before entering host into maintenance mode

NSX Version based behavior:

The peer checks described above are enabled by default for upgrades to 9.0 and later. During host or edge upgrades, the pre-checks run by default for edges that are pinned using the edge host affinity configuration.