Random VMs in several Host TN clusters lose networking after enabling security-only in NSX-T UI
This is specific to the security-only use case. Stale data (the LSP ID) was left on the port, so the new opaque data/extra configuration was applied to the port as an update rather than an add. KCP does not pick up updates, so it never received the new logical port ID.
This issue is fixed in NSX-T 3.2.2 and is documented in the release notes.
Any operation that triggers a delete-port and create-port workflow fixes this issue. For example:
Create a temporary dvpg, reconfigure the VM to use the temporary dvpg, and then move the VM back to the original dvpg. Moving the VM to the temporary dvpg should fix the issue by itself, because that operation deletes the port and creates a new one.
The affected workload VMs lose connectivity, resulting in a complete DP outage for those VMs.
LOGS TO LOOK FOR AND THEIR LOCATION:
# Identify the switch port of the workload VM that is facing the issue.
less net-stats_-l.txt
100663336 5 9 DvsPortset-1 00:50:##:##:##:e1 PFILEP06.eth1
# Identify the VIF:
less net-dvs_-l.txt
com.vmware.port.extraConfig.vnic.external.id = 1251497834 , propType = CONFIG
com.vmware.common.port.volatile.status = inUse linkUp portID=100663336 propType = RUNTIME
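The two lookups above can be scripted when many VMs need to be checked. The following is a minimal sketch, not an official tool: the function names are hypothetical, and it assumes the net-stats line layout and the net-dvs property lines look like the excerpts above, with net-dvs ports separated by "port <n>:" header lines.

```python
import re


def find_port_id(net_stats_text, client_name):
    """Return the switch port ID (as a decimal string) for a client name."""
    for line in net_stats_text.splitlines():
        # Assumed layout: PortNum Type SubType SwitchName MACAddress ClientName
        fields = line.split(None, 5)
        if (len(fields) == 6 and fields[0].isdigit()
                and fields[5].strip() == client_name):
            return fields[0]
    return None


def split_port_blocks(net_dvs_text):
    """Split net-dvs output into blocks, starting a new one at 'port <n>:'."""
    blocks, current = [], []
    for line in net_dvs_text.splitlines():
        if re.match(r"\s*port \d+:", line) and current:
            blocks.append(current)
            current = []
        current.append(line)
    if current:
        blocks.append(current)
    return blocks


def find_vif_id(net_dvs_text, port_id):
    """Return the vnic.external.id found in the block that holds portID=<port_id>."""
    for block in split_port_blocks(net_dvs_text):
        text = "\n".join(block)
        if re.search(r"portID=%s\b" % re.escape(port_id), text):
            match = re.search(r"vnic\.external\.id = (\S+)", text)
            return match.group(1) if match else None
    return None


if __name__ == "__main__":
    # Paths are relative to the commands directory of an extracted log bundle.
    with open("net-stats_-l.txt") as f:
        port_id = find_port_id(f.read(), "PFILEP06.eth1")
    with open("net-dvs_-l.txt") as f:
        print(port_id, find_vif_id(f.read(), port_id))
```

The VIF ID returned here is the value to search for in the nsx-opsagent and vmkernel logs.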
# Search the nsx-opsagent log for entries containing the LSP ID and VIF ID, as below:
2022-09-20T05:07:11.850Z nsx-opsagent[12416680]: NSX 12416680 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="12417026" level="INFO"] [PortOp] Adding [com.vmware.port.extraConfig.vnic.external.id] value [1251497834]
2022-09-20T05:07:11.850Z nsx-opsagent[12416680]: NSX 12416680 - [nsx@6876 comp="nsx-esx" subcomp="opsagent" s2comp="nsxa" tid="12417026" level="INFO"] [PortOp] Adding [com.vmware.port.extraConfig.logicalPort.id] value [########-####-####-####-##########37]
# Now check the vmkernel logs to confirm that the above LSP ID was pushed to and printed by the kernel. Below is an example where the LSP ID in the opsagent log above does not match the latest LSP ID at the vmkernel level:
2022-09-18T18:19:18.995Z cpu57:12103851)lsp id for switch port 0x06000022 is ########-####-####-####-##########14
2022-09-18T18:19:18.995Z cpu57:12103851)vif id for switch port 0x06000022 is 1251497834
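The comparison between the opsagent and vmkernel entries can be sketched as follows. This is an illustration only, with hypothetical function names. It assumes that the opsagent line adding logicalPort.id directly follows the line adding vnic.external.id for the same port (as in the excerpt above), and that the last entry seen in each log is the current one.

```python
import re

OPS_RE = re.compile(
    r"\[PortOp\] Adding \[com\.vmware\.port\.extraConfig\."
    r"(vnic\.external|logicalPort)\.id\] value \[([^\]]+)\]"
)
VMK_RE = re.compile(r"(lsp|vif) id for switch port (0x[0-9a-fA-F]+) is (\S+)")


def opsagent_vif_to_lsp(log_text):
    """Map VIF ID -> LSP ID from nsx-opsagent [PortOp] lines (last entry wins)."""
    vif_to_lsp, last_vif = {}, None
    for line in log_text.splitlines():
        match = OPS_RE.search(line)
        if not match:
            continue
        kind, value = match.groups()
        if kind == "vnic.external":
            last_vif = value
        elif last_vif is not None:
            # Assumption: this LSP ID belongs to the VIF logged just before it.
            vif_to_lsp[last_vif] = value
    return vif_to_lsp


def vmkernel_vif_to_lsp(log_text):
    """Map VIF ID -> LSP ID from vmkernel 'lsp id'/'vif id for switch port' lines."""
    per_port = {}
    for line in log_text.splitlines():
        match = VMK_RE.search(line)
        if match:
            kind, port, value = match.groups()
            per_port.setdefault(port, {})[kind] = value
    return {p["vif"]: p["lsp"] for p in per_port.values()
            if "vif" in p and "lsp" in p}


def find_mismatches(opsagent_text, vmkernel_text):
    """Return {vif: (opsagent_lsp, vmkernel_lsp)} where the two logs disagree."""
    ops = opsagent_vif_to_lsp(opsagent_text)
    vmk = vmkernel_vif_to_lsp(vmkernel_text)
    return {vif: (ops[vif], vmk[vif])
            for vif in ops if vif in vmk and ops[vif] != vmk[vif]}
```

For the excerpts above, find_mismatches reports VIF 1251497834, with the LSP ID ending in 37 from the opsagent log and the LSP ID ending in 14 from the vmkernel log.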
The LSP ID in the net-dvs output does not match the one in dump-cfgAgent-statecache:
commands/net-dvs_-l.txt:
port 2262:
com.vmware.port.extraConfig.vnic.external.id = 2131138743 , propType = CONFIG
com.vmware.port.extraConfig.opaqueNetwork.id = ########-####-####-####-##########8a , propType = CONFIG
com.vmware.port.extraConfig.logicalPort.id = ########-####-####-####-##########1b , propType = CONFIG
commands/dump-cfgAgent-statecache.sh.txt
PortID: 2262
LogicalSwitchID: ########-####-####-####-##########8a
LogicalSwitchPortID: ########-####-####-####-##########f7
HostSwitchID: 50 11 54 ## ## ## ## ##-## ## ## ## ## 0f 66 d7
OutOfSync: false
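To check every port in a log bundle for this mismatch, the two files can be compared per port. The sketch below is a hedged example with hypothetical function names. It assumes net-dvs ports start with a "port <n>:" header and that each statecache entry lists PortID before LogicalSwitchPortID, as in the excerpts above.

```python
import re


def dvs_port_lsps(net_dvs_text):
    """Map DVS port number -> logicalPort.id from net-dvs output."""
    result, port = {}, None
    for line in net_dvs_text.splitlines():
        header = re.match(r"\s*port (\d+):", line)
        if header:
            port = header.group(1)
            continue
        match = re.search(r"extraConfig\.logicalPort\.id = (\S+)", line)
        if match and port is not None:
            result[port] = match.group(1)
    return result


def statecache_port_lsps(statecache_text):
    """Map PortID -> LogicalSwitchPortID from the cfgAgent state cache dump."""
    result, port = {}, None
    for line in statecache_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "PortID":
            port = value
        elif key == "LogicalSwitchPortID" and port is not None:
            result[port] = value
    return result


def mismatched_ports(net_dvs_text, statecache_text):
    """Return {port: (net_dvs_lsp, statecache_lsp)} for ports that disagree."""
    dvs = dvs_port_lsps(net_dvs_text)
    cache = statecache_port_lsps(statecache_text)
    return {port: (dvs[port], cache[port])
            for port in sorted(dvs.keys() & cache.keys())
            if dvs[port] != cache[port]}


if __name__ == "__main__":
    # Paths are relative to the root of an extracted log bundle.
    with open("commands/net-dvs_-l.txt") as f:
        dvs_text = f.read()
    with open("commands/dump-cfgAgent-statecache.sh.txt") as f:
        cache_text = f.read()
    for port, (dvs_lsp, cache_lsp) in mismatched_ports(dvs_text, cache_text).items():
        print("port %s: net-dvs=%s statecache=%s" % (port, dvs_lsp, cache_lsp))
```

Any port listed in the output is a candidate for the delete-port and create-port workaround described earlier.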