Intermittent VM connectivity loss observed after a newly NSX-prepared ESXi host is added to the cluster

Article ID: 376729

Updated On:

Products

VMware NSX

VMware NSX-T Data Center

Issue/Introduction

VMs experience intermittent connectivity loss after a new NSX-prepared ESXi host is added to the cluster.

In /var/run/log/nsx-syslog.log on the newly added ESXi host, the nsx-controller component (cfgAgent) reports errors from the vsipfw module and fails to get the kernel addrset and ruleset counts because the switch port is not found.

YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="E8228700" level="error"] vsipfw: VsipFWCmd.cpp:execute():220 ioctl failed because switch port is not found
YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="E8228700" level="error" errorCode="LCP01107"] dfw: Failed to get kernel addrset count: switch port not found
YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="E8228700" level="error"] vsipfw: VsipFWCmd.cpp:execute():220 ioctl failed because switch port is not found
YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="E8228700" level="error" errorCode="LCP01161"] dfw: Failed to get kernel ruleset count: switch port not found
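To check quickly whether a host shows these symptoms, the log can be scanned for the error strings quoted above. The following is a minimal sketch, not a supported tool: the function names `scan_lines` and `scan_file` and the signature labels are hypothetical, and it assumes the log is readable as text, for example a copy of nsx-syslog.log taken from the host.

```python
import re
from collections import Counter

# Error strings copied from the log excerpts in this article.
# The dictionary keys are illustrative labels, not NSX identifiers.
SIGNATURES = {
    "ioctl_port_not_found": "ioctl failed because switch port is not found",
    "addrset_count_failed": "Failed to get kernel addrset count: switch port not found",
    "ruleset_count_failed": "Failed to get kernel ruleset count: switch port not found",
}

# Matches the optional errorCode="..." attribute in the structured-data part of a line.
ERROR_CODE = re.compile(r'errorCode="([A-Z0-9]+)"')


def scan_lines(lines):
    """Count how often each known signature and error code appears."""
    hits = Counter()
    codes = Counter()
    for line in lines:
        for name, text in SIGNATURES.items():
            if text in line:
                hits[name] += 1
                match = ERROR_CODE.search(line)
                if match:
                    codes[match.group(1)] += 1
    return hits, codes


def scan_file(path="/var/run/log/nsx-syslog.log"):
    """Scan a log file; the default path is the one named in this article."""
    with open(path, errors="replace") as handle:
        return scan_lines(handle)
```

A non-zero count for any signature suggests the host matches the pattern described here; it does not by itself confirm the cause.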

Later entries in nsx-syslog.log show that the virtual interface (VIF) attached to the VM was deleted after a VIF disconnect message, and that subsequent operations on that VIF failed because the switch port was not found.

YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" s2comp="nsx-hyperbus" tid="E87B3700" level="info"] IP manager: Delete VIF [########-####-####-####-##########0d(host switch ## ## ## ## ## ## ## ##-## ## ## ## ## ## ## 43) LIP after receiving VIF disconnect message

YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="E87B3700" level="info"] ConfigApp: write to ConfigCache on VIF update (delta update)
YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="E7FA3700" level="error"] vsipfw: VsipFWCmd.cpp:execute():220 ioctl failed because switch port is not found
YYYY-MM-DDTHH:MM:SS.SSSZ cfgAgent[2134266]: NSX 2134266 - [nsx@6876 comp="nsx-controller" subcomp="cfgAgent" tid="E7FA3700" level="warn"] dfw: Failed to set mac address of vif [########-####-####-####-##########0d: switch port not found

During the incident, the VIF port of each impacted VM is detached, as the nsx-opsagent entry below shows.

YYYY-MM-DDTHH:MM:SS.SSSZ nsx-opsagent[2134410]: NSX 2134410 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="2135012" level="INFO"] [DoVifPortOperation] request=[opId:[6] op:[HOSTD_DETACH_PORT(2)] vif:[########-####-####-####-##########0d] ls:[########-####-####-####-##########b1] vmx:[/vmfs/volumes/vsan:################-##############ab/########-####-####-####-############36/<VM Name>.vmx] lp:[]]
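To identify which VMs were affected, the fields of a DoVifPortOperation entry can be extracted from the log line. This is a minimal sketch that assumes the `key:[value]` layout seen in the excerpt above; the helper names `parse_vif_operation` and `is_detach` are hypothetical.

```python
import re

# Matches key:[value] pairs such as op:[HOSTD_DETACH_PORT(2)] or lp:[].
FIELD = re.compile(r"(\w+):\[([^\]]*)\]")


def parse_vif_operation(line):
    """Return the request fields of a DoVifPortOperation line, or None."""
    if "[DoVifPortOperation]" not in line:
        return None
    _, found, body = line.partition("request=")
    if not found:
        return None
    return dict(FIELD.findall(body))


def is_detach(fields):
    """True when the parsed operation is a port detach."""
    return fields is not None and fields.get("op", "").startswith("HOSTD_DETACH_PORT")
```

For a detach entry, the `vif` and `vmx` fields give the interface ID and the VM configuration file, which can be matched against the VMs that lost connectivity.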

Environment

VMware NSX-T Data Center

VMware NSX

Cause

The issue can occur when the NSX controller component exhausts its memory, which prevents it from communicating effectively with the newly added ESXi host to synchronise the configuration.

Resolution

Verify the NSX Manager cluster status and address any memory issues:

  1. Check Cluster Status:

    • In the NSX Manager UI, navigate to System > Appliances.
    • Confirm that the cluster status is STABLE.
    • Ensure there are no high memory usage alarms reported.
  2. Address Issues:

    • If high memory usage is detected or the controller service is down on any node, reboot the affected node.
    • This action should restore the cluster health status to STABLE and normalise memory usage.
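The status check above can also be scripted. The sketch below only evaluates a cluster status document that has already been retrieved and parsed into a dictionary. It assumes the document comes from the NSX Manager cluster status API and contains `mgmt_cluster_status` and `control_cluster_status` objects, each with a `status` field; those field names are assumptions that should be confirmed against the API reference for the installed version. The function name `unstable_components` is hypothetical.

```python
# Assumed field names; verify them against the NSX API reference for your version.
CLUSTER_KEYS = ("mgmt_cluster_status", "control_cluster_status")


def unstable_components(status):
    """Return a list of 'component=state' strings for anything not STABLE."""
    problems = []
    for key in CLUSTER_KEYS:
        # A missing component is reported as UNKNOWN rather than silently passing.
        state = status.get(key, {}).get("status", "UNKNOWN")
        if state != "STABLE":
            problems.append(f"{key}={state}")
    return problems
```

An empty list corresponds to the STABLE state expected in step 1. This check does not cover memory alarms, which still need to be reviewed separately.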