Logical Switch State has failed alarm in vROps with no stale vDS entries identified

Article ID: 405619

Products

VMware NSX

Issue/Introduction

  • vROps reports a critical NSX alarm stating "Logical Switch State has failed".
  • In the Manager view, under Networking -> Logical Switches, the logical switch is in a Failed state.
    • If the Manager view is not enabled, go to System -> General Settings -> User Interface Mode Toggle -> Edit and change "Toggle Visibility" to allow either admin users or all users.
  • Following the steps in the related Knowledge Base article vROps alarm: Logical Switch State has failed, it has been confirmed that there is no stale vDS in NSX: every vDS referenced in the vCenter MOB corresponds to an existing vDS in the NSX database and vice versa (detailed in the Resolution section of the linked KB). A minimal cross-check sketch is shown below.
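The related KB describes collecting the vDS IDs from the vCenter MOB and from the NSX database and comparing them by hand. The comparison step itself can be scripted; the following is a minimal sketch only (the ID values are masked placeholders, not real data) that reports entries present on one side but not the other. When both differences are empty, there is no stale vDS, which is the scenario this article covers.

  #!/usr/bin/env python3
  # Illustrative sketch only: compare two manually collected sets of vDS IDs.
  # Replace the masked placeholder values with the IDs gathered from the
  # vCenter MOB and from the NSX database, as described in the related KB.

  vcenter_vds_ids = {
      "50 04 c2 a6 ## ## ## ea-c8 ## ## ## be 1f 5b 07",
  }

  nsx_vds_ids = {
      "50 04 c2 a6 ## ## ## ea-c8 ## ## ## be 1f 5b 07",
  }

  # Stale candidates: vDS known to NSX but no longer present in vCenter.
  print("In NSX but not in vCenter:", nsx_vds_ids - vcenter_vds_ids)
  # vDS present in vCenter but unknown to NSX.
  print("In vCenter but not in NSX:", vcenter_vds_ids - nsx_vds_ids)
  # If both differences are empty, no stale vDS exists (the case this article covers).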

Environment

NSX 4.x

Cause

In order for automated recovery tools to process NSX logical switch changes, such as removing an old segment, a backend mapping defining the vDS-to-Transport-Zone association is required.

In some corner cases, the required association between the vDS and the Transport Zone is missing from the transport nodes, and therefore stale entries cannot be automatically cleared.

If any stale entries are detected, the logical switch as a whole is considered to be in a FAILED state until those entries are removed.
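To illustrate the relationship described above, here is a minimal Python sketch (not NSX source code; the overall_switch_state function and the CONFIG_STATUS_SUCCESS label for the healthy case are assumptions) that derives an overall state from the per-vDS port-group state map: a single entry stuck in PORT_GROUP_STATE_ENUM_FAILEDFORDELETE is enough to leave the whole switch reported as FAILED.

  # Illustrative only, not NSX source code.
  def overall_switch_state(vds_port_group_state_map: dict) -> str:
      states = [entry.get("state") for entry in vds_port_group_state_map.values()]
      if any(s == "PORT_GROUP_STATE_ENUM_FAILEDFORDELETE" for s in states):
          return "CONFIG_STATUS_FAILED"
      return "CONFIG_STATUS_SUCCESS"  # assumed label for the healthy case

  example_map = {
      "vds-a": {"state": "PORT_GROUP_STATE_ENUM_SUCCESS"},
      "vds-b": {"state": "PORT_GROUP_STATE_ENUM_FAILEDFORDELETE"},  # stale entry
  }
  print(overall_switch_state(example_map))  # -> CONFIG_STATUS_FAILED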

Resolution

Workaround:

Note: The following steps require care and precision to execute correctly. If in doubt, please open a support case and a technical support engineer can implement them for you.

  1. Confirm whether any stale entries that failed to delete are listed under the vdsIdPortGroupStateMap value:
    1. SSH as root to one of the NSX Managers and run the following:
      • # corfu_tool_runner.py -n nsx -t LogicalSwitchState -o showTable | grep "PORT_GROUP_STATE_ENUM_FAILEDFORDELETE" -B 30
      • Payload:
        {
          "managedResource": {
            "displayName": "########-####-####-####-##########f7"
          },
          ...
          ...
          "logicalSwitchDisplayName": "NSX-SWITCH-VLAN100",
          "transportZoneId": {
            "uuid": {
              "left": "####################",
              "right": "####################"
            }
          },
          "stateStatus": "CONFIG_STATUS_FAILED",
          "lastUpdated": "1742816816555",
          "switchType": "LOGICAL_SWITCH_TYPE_DEFAULT",
          "vdsIdPortGroupStateMap": {
            "50 04 c2 a6 ## ## ## ea-c8 ## ## ## be 1f 5b 07": {
              "portGroup": {
                "cmId": ########-####-####-####-##########5d",
                "portGroupKey": "dvportgroup-######"
              },
              "state": "PORT_GROUP_STATE_ENUM_SUCCESS"
            },
            "50 04 5a 18 ## ## ## 5c-8e ## ## ## 7b 35 32 56": { <==== HERE
              "state": "PORT_GROUP_STATE_ENUM_FAILEDFORDELETE"
    2. In the above output, confirm that the ID of the affected vDS is listed as a second (stale) entry under the vdsIdPortGroupStateMap value, where it previously failed to delete.
  2. After confirming the above, the attached cleanup script may be run to clear the stale entry manually from the NSX database:
    1. Take an NSX FTP backup and ensure the passphrase is known
    2. Copy the attached script, delete_lsstate_stale_vds.py, to one NSX Manager
    3. As root on the NSX Manager CLI, run the script to remove the reference to the stale vdsIdPortGroupStateMap entry:
      • python3 ./delete_lsstate_stale_vds.py --vds_id "50 04 5a 18 ## ## ## 5c-8e ## ## ## 7b 35 32 56" --read_only
      • python3 ./delete_lsstate_stale_vds.py --vds_id "50 04 5a 18 ## ## ## 5c-8e ## ## ## 7b 35 32 56" --cleanup
    4. The first command (--read_only) only reports what the cleanup would do, without making any changes; the output should describe the 50 04 5a 18 ## ## ## 5c-8e ## ## ## 7b 35 32 56 vDS identified in the corfu_tool_runner.py output from Step 1, or the Transport Nodes related to it, being repaired/removed. Example:
      1. YYYY-MM-DD HH:MM:SS,sss [INFO] LogicalSwitch ########-####-####-####-##########f7 needs stale dvs cleanup, LSState {'managedResource': {'displayName': '########-####-####-####-##########f7'}, 'logicalSwitchRevison': 3, 'ccpRealizedRevison': -1, 'portGroupStateRevision': 3, 'opaqueNetworkOnComputeManagerStateRevision': -1, 'logicalSwitchDisplayName': 'NSX-SWITCH-VLAN100', 'transportZoneId': {'uuid': {'left': '####################', 'right': '####################'}}, 'stateStatus': 'CONFIG_STATUS_FAILED', 'lastUpdated': '1742816816555', 'switchType': 'LOGICAL_SWITCH_TYPE_DEFAULT', 'vdsIdPortGroupStateMap': {'50 04 c2 a6 ## ## ## ea-c8 ## ## ## be 1f 5b 07': {'portGroup': {'cmId': '########-####-####-####-##########5d', 'portGroupKey': 'dvportgroup-######'}, 'state': 'PORT_GROUP_STATE_ENUM_SUCCESS'}, '50 04 5a 18 ## ## ## 5c-8e ## ## ## 7b 35 32 56': {'state': 'PORT_GROUP_STATE_ENUM_FAILEDFORDELETE'}}}
    5. If the only entries identified in the --read_only output match the logical switches/stale entries identified in Step 1 above, proceed with running the command with the --cleanup option (a generic sketch of this preview-then-apply pattern follows these steps).
  3. After running the cleanup script, the environment should correct itself during the next vcFullSync cycle, which runs automatically and frequently, and should complete within about 5 minutes at most.
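For context, the --read_only/--cleanup pair follows a common preview-then-apply pattern: the same selection logic runs in both modes, but changes are only written when --cleanup is passed. The sketch below is a generic illustration of that pattern only; it is not the attached delete_lsstate_stale_vds.py, it does not touch the NSX database, and the find_stale_entries/remove_entry helpers are hypothetical stand-ins.

  #!/usr/bin/env python3
  # Generic preview-then-apply sketch; hypothetical helpers, no NSX interaction.
  import argparse

  def find_stale_entries(vds_id):
      # Hypothetical stand-in: a real tool would query for vdsIdPortGroupStateMap
      # entries matching vds_id in the FAILEDFORDELETE state.
      return [{"vds_id": vds_id, "state": "PORT_GROUP_STATE_ENUM_FAILEDFORDELETE"}]

  def remove_entry(entry):
      # Hypothetical stand-in for the actual delete operation.
      print(f"Removed stale entry for vDS {entry['vds_id']}")

  def main():
      parser = argparse.ArgumentParser()
      parser.add_argument("--vds_id", required=True)
      mode = parser.add_mutually_exclusive_group(required=True)
      mode.add_argument("--read_only", action="store_true")
      mode.add_argument("--cleanup", action="store_true")
      args = parser.parse_args()

      for entry in find_stale_entries(args.vds_id):
          if args.read_only:
              print(f"Would remove stale entry for vDS {entry['vds_id']}")
          else:
              remove_entry(entry)

  if __name__ == "__main__":
      main()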

Attachments

delete_lsstate_stale_vds.py