Transport nodes fail to exit maintenance mode after running the enable EDP standard “enable_uens” script on a brownfield cluster

Products

VMware NSX

Issue/Introduction

Enabling EDP using the script on a cluster is successful on some of the hosts but appears to fail on others.
The “enable_uens” script is being used to automate this procedure, and the recommended steps from the following Tech Doc are being followed:

Enabling EDP Standard in Active Environments
When you check System -> Fabric -> Hosts, the hosts where it appeared to fail are reported as 'Partial Success' under the 'NSX Configuration' column.
When you check System -> Fabric -> Hosts -> Configure NSX -> Prepare Host -> Advanced Configuration, you see that the Mode is successfully enabled for EDP-Standard.
When you check the command-line output of the script, you see messages similar to the following about an error when exiting maintenance mode (MM):

INFO - Exit MM task is still running for host host-<ID> with status running. Time elapsed in seconds : 0. Total timeout in seconds 240
INFO - Exit MM task is still running for host host-<ID> with status running. Time elapsed in seconds : 60. Total timeout in seconds 240
ERROR - Exit MM task failed for host host-<ID>, task status error, Error: (vmodl.fault.HostCommunication) {
dynamicType = <unset>,
dynamicProperty = (vmodl.DynamicProperty) [],
msg = 'An error occurred while communicating with the remote host.',
faultCause = <unset>,
faultMessage = (vmodl.LocalizableMessage) []
}
INFO - Exit MM operation failed for host <HOSTNAME> Exit MM task failed for host: host-<ID>, task status: error, Message: An error occurred while communicating with the remote host.
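To find the affected hosts in a long script log, the ERROR lines above can be filtered programmatically. The following is a minimal sketch, not part of the enable_uens tooling: the regular expression is based only on the line format shown above and may need adjusting for other script versions, and the host ID `host-1012` and hostname `esx01.example.com` are hypothetical sample values.

```python
import re

# Assumption: the pattern mirrors the ERROR line format shown in this article;
# other versions of the script may log a different format.
FAILED_RE = re.compile(
    r"ERROR - Exit MM task failed for host (?P<host>[\w-]+), task status (?P<status>\w+)"
)


def failed_exit_mm_hosts(log_text):
    """Return the host IDs for which the script logged an exit-MM task failure."""
    hosts = []
    for line in log_text.splitlines():
        match = FAILED_RE.search(line)
        if match and match.group("host") not in hosts:
            hosts.append(match.group("host"))
    return hosts


# Hypothetical sample input modelled on the messages above.
sample = """\
INFO - Exit MM task is still running for host host-1012 with status running. Time elapsed in seconds : 60. Total timeout in seconds 240
ERROR - Exit MM task failed for host host-1012, task status error, Error: (vmodl.fault.HostCommunication) {
INFO - Exit MM operation failed for host esx01.example.com Exit MM task failed for host: host-1012, task status: error, Message: An error occurred while communicating with the remote host.
"""

print(failed_exit_mm_hosts(sample))
```

Only the ERROR line is matched, so the INFO summary line for the same host does not produce a duplicate entry.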
At the time ENS is enabled on the host, the /var/run/log/vmkernel logs report entries similar to the following, which confirm that the ENS mode configuration was successful:

In(182) vmkernel: cpu29:2097520)ENS: 1679: Setting ENS mode to 2 for DvsPortset-0
In(182) vmkernel: cpu2:2097520)ENS: Ens_NetWorldCreateWorlds:427: Created 8 ENS RX worlds 8 TX worlds for portset: DvsPortset-0
In(182) vmkernel: cpu2:2097520)ENS: Ens_CreateSwitch:2492: Create ENS switch DvsPortset-0: maxPorts 16384, swID 0, mode: interrupt
In(182) vmkernel: cpu2:2097520)ENS: EnsActivateSW:184: Activate ens switch: DvsPortset-0, handle: 0x450140056000, swID: 0
In(182) vmkernel: cpu2:2097520)ENS: 1890: Portset DvsPortset-0 ENS Activation with status: 0
In(182) vmkernel: cpu76:2100624)Uplink: 620: hostd-worker, Uplink vmnic0 ENS mode is set to bitmap 0x4
In(182) vmkernel: cpu76:2100624)Uplink: 1159: vmnic0, Configure uplink ENS mode to 4 on portset DvsPortset-0, status: Success
After that, the /var/run/log/hostd logs show that the host does successfully exit MM:

In(14) vobd[2097858]: [UserLevelCorrelator] 1052656325148us: [vob.user.maintenancemode.exited] The host has exited maintenance mode
In(14) vobd[2097858]: [GenericCorrelator] 1052656325148us: [vob.user.maintenancemode.exited] The host has exited maintenance mode
In(14) vobd[2097858]: [UserLevelCorrelator] 1052656325706us: [esx.audit.maintenancemode.exited] The host has exited maintenance mode.
In(14) vobd[2097858]: The event ([esx.audit.maintenancemode.exited] The host has exited maintenance mode.) was sent immediately to hostd;
However, the enable_uens script log reports that the exit MM task failed due to a communication error with the host:

INFO - Exit MM operation failed for host <HOSTNAME> Exit MM task failed for host: host-<ID>, task status: error, Message: An error occurred while communicating with the remote host.
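The mismatch described above (the host log shows the exit event, the script log shows a failure) can be checked with a short helper. This is an illustrative sketch only: the function names are hypothetical, and it assumes the host log text has already been collected from the host and that the event name matches the `esx.audit.maintenancemode.exited` entry shown above.

```python
# Event name taken from the host log excerpt in this article.
EXIT_EVENT = "esx.audit.maintenancemode.exited"


def host_exited_maintenance_mode(host_log_text):
    """Return True if the host log contains the maintenance-mode exit event."""
    return any(EXIT_EVENT in line for line in host_log_text.splitlines())


def is_false_failure(script_reported_failure, host_log_text):
    """Return True when the script reported a failure although the host exited MM."""
    return script_reported_failure and host_exited_maintenance_mode(host_log_text)


# Sample line copied from the excerpt above.
host_log = (
    "In(14) vobd[2097858]: [UserLevelCorrelator] 1052656325706us: "
    "[esx.audit.maintenancemode.exited] The host has exited maintenance mode."
)

print(is_false_failure(True, host_log))
```

A `True` result matches the symptom in this article: the exit from maintenance mode completed on the host, but the confirmation never reached the script.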
The host has i40en physical NICs running a driver version lower than 2.11.1.0.
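When checking many hosts, the driver version comparison is easy to get wrong if versions are compared as plain strings (for example, "2.9" sorts after "2.11" as text). The sketch below compares the numeric components instead. It assumes the installed version string has already been read from the host (for example, from the driver information reported by `esxcli network nic get -n <vmnic>`) and that any vendor suffix after the dotted numeric part can be ignored; the helper names are hypothetical.

```python
import re

# Fixed driver version named in this article.
FIXED_VERSION = (2, 11, 1, 0)


def parse_version(version):
    """Parse the leading dotted numeric part of a version string into a tuple of ints."""
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if not match:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def needs_driver_upgrade(installed_version):
    """Return True if the installed i40en driver is older than the fixed version."""
    return parse_version(installed_version) < FIXED_VERSION


# Hypothetical version strings for illustration.
for version in ("2.5.2.0", "2.9.2.0", "2.11.1.0"):
    print(version, needs_driver_upgrade(version))
```

Tuple comparison works component by component, so 2.9.2.0 is correctly treated as older than 2.11.1.0.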

Environment

VMware NSX 4.2.X

Cause

The cause is the 'RX Missed' i40en driver issue described in the following KB: After enabling EDP standard mode, RX missed error alarms being reported on hosts with pNIC's using the i40en driver

Because the driver on the host is dropping packets, the connection between vCenter and the host is interrupted, or at least unstable.
The enable_uens script sends the request to exit MM through vCenter, which relays the request to the host.
The host receives the request and exits maintenance mode, but the script does not receive the reply confirming that the request completed.
vCenter is unable to relay the reply to the script running on the NSX Manager because of the unstable connection between the host and vCenter.
As a result, the enable_uens script reports that the host failed to exit maintenance mode, and the host remains in a 'Partial Success' state.

Resolution

If enabling EDP fails on any of the hosts, use one of the following procedures, depending on the scenario:
- Scenario 1: The cluster of the host still has the old 'Transport Node Profile' that is configured for standard mode. To roll back the host, select the cluster in the NSX UI under System -> Fabric -> Hosts, click "ACTIONS" at the top, and select "Sync Transport Node". The TN starts changing back to standard mode; wait until the host state becomes success. If the state becomes failed/partial-success, reboot the host and wait for the automatic retry of the TN configuration, which should then succeed.
- Scenario 2: The cluster of the host has already been configured by the script with the new 'Transport Node Profile' that uses ENS mode. Select the cluster in the NSX UI under System -> Fabric -> Hosts and click "Configure NSX" to change the TNP back to the old one with standard mode. The hosts in the cluster start changing back to standard mode; wait until the host state becomes success. If the state becomes failed/partial-success, reboot the host and wait for the automatic retry of the TN configuration, which should then succeed.
After completing either Scenario 1 or Scenario 2, perform the following steps to successfully enable EDP-Standard on the cluster:
1. Upgrade the i40en driver to version 2.11.1.0 on all hosts in the cluster that use it. This fixes the 'RX Missed' dropped packets issue.
2. Download the uens-enable-423-patch-v3.tgz file attached to this article and copy it to the /tmp directory of any NSX Manager in the cluster.
3. SSH into the Manager as root.
4. cd /opt/vmware/migration-coordinator-tomcat/bin
5. tar -xvzf /tmp/uens-enable-423-patch-v3.tgz
6. chown -R umc:umc uens-adoption
7. cd uens-adoption/config/
8. Follow the instructions in the README file to execute the Python script enable_uens.py.

Additional Information

Please refer to the following Technical Documentation on enabling EDP Standard in an Active/Brownfield cluster:

Enabling EDP Standard in Active Environments

Attachments

uens-enable-423-patch-v3.tgz

Article ID: 411486
