Random Transport node intermittently going into an "Install Failed" state in the NSX Manager UI.

Article ID: 423896

Updated On:

Products

VMware NSX

Issue/Introduction

  • ESXi hosts (transport nodes) intermittently go into an "Install Failed" state in the NSX Manager UI. The issue temporarily clears when the affected host is rebooted or placed into Maintenance Mode.
  • Clicking "Install Failed" shows the following error:

Software nsx-monitoring not present on host.
Software nsx-vdpi not present on host.
Software nsx-shared-libs not present on host.
Software nsx-python-protobuf not present on host.
Software nsx-proxy not present on host.
Software nsx-netopa not present on host.
Software nsx-snproxy not present on host.
Software nsx-python-utils not present on host.
Software nsx-nestdb not present on host.
Software nsx-esx-datapath not present on host.
Software nsx-python-logging not present on host.
Software nsx-context-mux not present on host.
Software nsx-exporter not present on host.
Software vsipfwlib not present on host.
Software nsx-ids not present on host.
Software nsx-opsagent not present on host.
Software nsx-sfhc not present on host.
Software nsxcli not present on host.
Software nsx-cpp-libs not present on host.
Software nsx-proto2-libs not present on host.
Software nsx-adf not present on host.
Software nsx-platform-client not present on host.
Software nsx-cfgagent not present on host.
Software nsx-host not present on host.
Software nsx-mpa not present on host.

  • nsx-syslog indicates that an error occurred when NSX tried to retrieve the list of NSX VIBs present on the ESXi host, as in the log snippet below:

YYYY-HH-DDTXX:XX:XX.281Z  INFO PolicyTransportNodeLcmFacadeImpl-2-2 SFHCServiceImpl 5366 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] SFHC request completed. MessageType MT_SOFTWARE_STATUS, operationId e7f88ede-52bd-4503-aa9b-7da7a2752e82, clientId XXXXXXXX-XXXXX-XXXX-XXXX-XXXXXXXXXXXX

YYYY-HH-DDTXX:XX:XX.282Z  INFO PolicyTransportNodeLcmFacadeImpl-2-2 HostPrepServiceFabricDeploymentServiceImpl 5366 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Got 25 software issues for host XXXXXXXX-XXXXX-XXXX-XXXX-XXXXXXXXXXXX

YYYY-HH-DDTXX:XX:XX.282Z  INFO PolicyTransportNodeLcmFacadeImpl-2-2 HostPrepServiceFabricDeploymentServiceImpl 5366 FABRIC [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Got software status ABSENT for nsx-monitoring

  • nsx-syslog shows that retrieving the software version fails for all NSX VIBs, as in the log snippets below:

 

YYYY-HH-DDTXX:XX:XX.276Z Wa(180) nsx-sfhc[2100955]: NSX 2100955 - [nsx@6876 comp="nsx-esx" subcomp="nsxsfhc" tid="2102548" level="WARNING"] Get software version failed for nsx-monitoring.

YYYY-HH-DDTXX:XX:XX.276Z In(182) nsx-sfhc[2100955]: NSX 2100955 - [nsx@6876 comp="nsx-esx" subcomp="nsxsfhc" tid="2102548" level="INFO"] Software nsx-monitoring version: expect=4.2.X.X actual=

YYYY-HH-DDTXX:XX:XX.276Z Wa(180) nsx-sfhc[2100955]: NSX 2100955 - [nsx@6876 comp="nsx-esx" subcomp="nsxsfhc" tid="2102548" level="WARNING"] Get software version failed for nsx-vdpi.

YYYY-HH-DDTXX:XX:XX.276Z In(182) nsx-sfhc[2100955]: NSX 2100955 - [nsx@6876 comp="nsx-esx" subcomp="nsxsfhc" tid="2102548" level="INFO"] Software nsx-vdpi version: expect=4.2.X.X actual=
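When reviewing a large nsx-syslog file, it can help to list which VIBs hit this failure. The following is a minimal sketch, not an official tool: it assumes the message wording shown in the snippets above, and the function name `summarize_sfhc` and the sample lines are hypothetical.

```python
import re

# Patterns assume the nsx-sfhc message wording shown in the snippets above.
FAILED_RE = re.compile(r"Get software version failed for ([\w-]+)\.")
VERSION_RE = re.compile(r"Software ([\w-]+) version: expect=(\S+) actual=(\S*)")


def summarize_sfhc(lines):
    """Return (failed, empty_actual): VIB names whose version lookup failed,
    and VIB names logged with an empty 'actual=' version."""
    failed = []
    empty_actual = []
    for line in lines:
        match = FAILED_RE.search(line)
        if match:
            failed.append(match.group(1))
            continue
        match = VERSION_RE.search(line)
        if match and match.group(3) == "":
            empty_actual.append(match.group(1))
    return failed, empty_actual


# Hypothetical, shortened sample lines modeled on the snippets above.
sample = [
    'nsx-sfhc[2100955]: Get software version failed for nsx-monitoring.',
    'nsx-sfhc[2100955]: Software nsx-monitoring version: expect=4.2.X.X actual=',
    'nsx-sfhc[2100955]: Get software version failed for nsx-vdpi.',
]
failed, empty_actual = summarize_sfhc(sample)
print(failed)
print(empty_actual)
```

An empty `actual=` value alongside a "Get software version failed" warning matches the symptom described here: the host could not report the VIB version at all, rather than reporting a mismatched one.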

  • The vmkwarning log contains events showing that the admission check failed for the memory resource:

YYYY-HH-DDTXX:XX:XX.354Z Wa(180) vmkwarning: cpu69:57779266)WARNING: UserParam: 1571: sh: could not SetAlloc on container(470114774) -- 76800 pages: Admission check failed for memory resource

YYYY-HH-DDTXX:XX:XX.354Z Wa(180) vmkwarning: cpu69:57779266)WARNING: LinuxFileDesc: 4671: sh: Unrecoverable exec failure: Failure during exec while original state already lost

 

YYYY-HH-DDTXX:XX:XX.065Z Wa(180) vmkwarning: cpu8:12481243 opID=f9c45108)WARNING: Migrate: 4062: Can't create migrate info struct for vmmLeaderID = 12481246 : Migration failed to start due to lack of CPU or memory resources

YYYY-HH-DDTXX:XX:XX.192Z Wa(180) vmkwarning: cpu69:51241821 opID=a3250643)WARNING: Migrate: 1071: Failed to create a migrate heap of size 71982107: Admission check failed for memory resource.

YYYY-HH-DDTXX:XX:XX.192Z Wa(180) vmkwarning: cpu69:51241821 opID=a3250643)WARNING: Migrate: 371: vmmLeaderID = 51241827: Failed to allocate migration heap

YYYY-HH-DDTXX:XX:XX.192Z Wa(180) vmkwarning: cpu69:51241821 opID=a3250643)WARNING: Migrate: 4062: Can't create migrate info struct for vmmLeaderID = 51241827 : Migration failed to start due to lack of CPU or memory resources

YYYY-HH-DDTXX:XX:XX.670Z Wa(180) vmkwarning: cpu112:2944955 opID=37ee4418)WARNING: Migrate: 1071: Failed to create a migrate heap of size 80165898: Admission check failed for memory resource.

YYYY-HH-DDTXX:XX:XX.670Z Wa(180) vmkwarning: cpu112:2944955 opID=37ee4418)WARNING: Migrate: 371: vmmLeaderID = 2944962: Failed to allocate migration heap
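To confirm the pattern across a full vmkwarning log, the admission failures can be extracted and grouped. This is a minimal sketch that assumes the message formats shown above; the function name `admission_failures` and the event labels "exec" and "vmotion_heap" are illustrative names, not VMware terminology.

```python
import re

ADMISSION = "Admission check failed for memory resource"
# Patterns assume the vmkwarning message formats shown in the snippets above.
PAGES_RE = re.compile(r"(\w+): could not SetAlloc on container\(\d+\) -- (\d+) pages")
HEAP_RE = re.compile(r"Failed to create a migrate heap of size (\d+)")


def admission_failures(lines):
    """Return a list of (kind, process, amount) tuples for admission failures.

    kind is "exec" (amount in pages) or "vmotion_heap" (amount is the heap
    size as logged); both labels are illustrative.
    """
    events = []
    for line in lines:
        if ADMISSION not in line:
            continue
        match = PAGES_RE.search(line)
        if match:
            events.append(("exec", match.group(1), int(match.group(2))))
            continue
        match = HEAP_RE.search(line)
        if match:
            events.append(("vmotion_heap", None, int(match.group(1))))
    return events


# Sample lines shortened from the snippets above.
sample = [
    "WARNING: UserParam: 1571: sh: could not SetAlloc on container(470114774) "
    "-- 76800 pages: Admission check failed for memory resource",
    "WARNING: Migrate: 1071: Failed to create a migrate heap of size 71982107: "
    "Admission check failed for memory resource.",
    "WARNING: Migrate: 371: vmmLeaderID = 51241827: Failed to allocate migration heap",
]
events = admission_failures(sample)
print(events)
```

Seeing both kinds of event close together in time suggests that command execution and vMotion are competing for the same scarce memory.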

 

 

Environment

  • VMware NSX 4.2.x
  • VMware ESXi 8.x

Cause

  • Memory exhaustion on the transport node (ESXi host) causes the "esxcli software vib list" command to crash intermittently when NSX tries to retrieve the installed NSX VIBs.
  • Most of the host memory is consumed by the "user" group (VMs with memory reservations set), and memory consumption spikes during vMotion operations.
  • The memory admission check fails because the esxcli command requires more than 300 MB, and that much memory is not available.
  • As soon as memory becomes available, NSX can run the VIB list command again, finds the correct NSX VIBs, and switches the transport node back to INSTALL_SUCCESSFUL.
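The 300 MB figure lines up with the "76800 pages" allocation in the vmkwarning entry, if one assumes a 4 KiB page size. That page size is an assumption made for this sketch and is not stated in the log; the helper name `pages_to_mib` is hypothetical.

```python
# Assumption: 4 KiB (4096-byte) pages; the log entry does not state the page size.
PAGE_SIZE_BYTES = 4096


def pages_to_mib(pages, page_size=PAGE_SIZE_BYTES):
    """Convert a page count to MiB."""
    return pages * page_size / (1024 * 1024)


# 76800 is the page count from the "could not SetAlloc" vmkwarning entry.
requested_mib = pages_to_mib(76800)
print(requested_mib)  # 300.0
```

Under that assumption, the failed allocation is exactly 300 MiB, which is consistent with the memory requirement described for the esxcli command.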

Resolution

In vSphere ESXi 8.x, the localcli failure under memory contention is expected behavior, because NSX VIBs are not bundled with the ESXi image in this version. To resolve the issue, upgrade the environment to vSphere ESXi 9.0 (VCF 9.x). The ESXi 9.0 architecture integrates the NSX VIBs directly into the ESXi image, which removes the dependency on the localcli command and eliminates the "Install Failed" alerts in the NSX UI.

Workaround:

  • Free up memory on the ESXi host by reducing manual reservations or clearing "Reserve all guest memory" on high-priority virtual machines. This gives the host enough available RAM to run the NSX software discovery commands without hitting admission check failures.
  • Reduce the memory consumption of the "user" group (VMs with reservations set) by powering off VMs.

Additional Information

For more information about ESXi memory management, see the following resources:

 

Related memory troubleshooting scenarios