Extra read rate on VMFS datastores after upgrading to NSX-T 3.2.0.1
Article ID: 317742

Products

VMware NSX, VMware vSphere ESXi

Issue/Introduction

After upgrading to NSX-T 3.2.0.1, you experience these symptoms:

  • All ESXi hosts see a read rate increase of approximately 2 MB/s to VMFS datastores.
  • The problem occurs even when the ESXi hosts are in Maintenance Mode (MM) or when no VMs are running on the hosts.

Cause

All functions related to nestdb may be impacted if nestdb crashes because of a disk-full event. The nestdb_remedy plug-in was introduced to monitor disk usage and restart the nestdb service once disk space becomes available again.

Note: By default, nestdb_remedy performs this check every 20 seconds, which increases the disk read rate.

Resolution

To confirm you are hitting this issue:

  1. Run these two commands at the same time, in separate SSH sessions to the host, for at least 30 seconds:

    watch -n2 ps -P -t -s -c > /tmp/ps-output.txt
    watch -n2 esxcli storage core device world list -d <naa.####> > /tmp/worldlist.txt
  2. Find the world ID for python: 

    grep python /tmp/worldlist.txt
  3. Find that world ID in /tmp/ps-output.txt and check whether nestdb is listed in the command column (a scripted version of these steps is sketched after this list): 

    grep <world ID> /tmp/ps-output.txt
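
A combined sketch of the three steps above, run directly on the ESXi host. This is a convenience script, not part of the official procedure: the device identifier is a placeholder, and the world-ID column position should be verified against your actual output before relying on it.

    # Sketch only: collect both outputs for ~30 seconds, then correlate.
    DEVICE=naa.XXXXXXXXXXXXXXXX    # replace with a real VMFS device ID

    watch -n2 ps -P -t -s -c > /tmp/ps-output.txt &
    PS_PID=$!
    watch -n2 esxcli storage core device world list -d $DEVICE > /tmp/worldlist.txt &
    WL_PID=$!

    sleep 30
    kill $PS_PID $WL_PID

    # Assumes the world ID is the second column of the world-list output.
    WORLD_ID=$(grep python /tmp/worldlist.txt | awk '{print $2}' | head -1)
    grep "$WORLD_ID" /tmp/ps-output.txt | grep nestdb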

This issue is resolved in VMware NSX-T 3.2.2.

Note: This issue also does not occur in NSX-T 4.0, as the nestdb_remedy plugin is not used in 4.0.

Workaround:

Below are example steps to change the nestdb_remedy check interval to 120 seconds.
 
Step 1. Create an NSGroup, via the UI or the API, containing the desired host Transport Nodes

A) For the UI option, select the Manager view in the NSX UI and navigate to Inventory > Groups. Create an NSGroup and add the desired Transport Nodes.
Note: The NSGroup UUID needed in Step 3 is listed on the Overview tab. 

B) For the API option, host Transport Node UUIDs can be found in the UI (System > Fabric > Nodes > Host Transport Nodes) or from the get nodes output in the Manager CLI.
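
Transport node UUIDs can also be listed via the API. A minimal curl sketch follows; the manager address and admin password are placeholders, and -k skips certificate verification (use only where that is acceptable):

    # Sketch only: list transport nodes and their UUIDs via the NSX API.
    curl -k -u "admin:${ADMIN_PW}" "https://${MGR}/api/v1/transport-nodes"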

 
API to create NSGroup:
POST https://{{manager_ip}}/api/v1/ns-groups

 
{
    "display_name":"219NodeGroup",
    "members" : [ {
      "resource_type" : "NSGroupSimpleExpression",
      "target_type" : "TransportNode",
      "target_property" : "id",
      "op" : "EQUALS",
      "value" : "####eaf-xfbe-xbxc-bf##-##bxbxc####e"         <----- target host 1
     
    }, {
      "resource_type" : "NSGroupSimpleExpression",
      "target_type" : "TransportNode",
      "target_property" : "id",
      "op" : "EQUALS",
      "value" : "xdxxxexa-####-####-####-xxxxabxfxxxb",      <----- target host 2
    } ]
}
 
Response:
 
{
    "members": [
...
    ],
    "member_count": 2,
    "resource_type": "NSGroup",
    "id": "b#####f-xbxx-xbxd-bcxa-dax3####fxfd",     <----- NSGroup ID
    "display_name": "219NodeGroup",
    "_create_time": 1648532005403,
    "_create_user": "admin",
    "_last_modified_time": 1648532005403,
    "_last_modified_user": "admin",
    "_system_owned": false,
    "_protection": "NOT_PROTECTED",
    "_revision": 0
}
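
For reference, a minimal curl sketch of this call (assuming the request body above is saved as nsgroup.json; MGR and ADMIN_PW are placeholders, and -k skips certificate verification):

    # Sketch only: create the NSGroup with curl.
    curl -k -u "admin:${ADMIN_PW}" \
         -H "Content-Type: application/json" \
         -X POST "https://${MGR}/api/v1/ns-groups" \
         -d @nsgroup.json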
 
Step 2. Create a plugin profile with CHECK_INTERVAL: 120
 
POST https://{{manager_ip}}/api/v1/systemhealth/profiles/
 
{
    "display_name": "nestdb-remedy-control-profile-1",
    "enabled": true,
    "config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",      <----- set config like this
    "plugin_id":"########-fxae-xxbx-xcxx-c###cac###"       <----- use this UUID
}
 
Response:
{
    "type": "NETWORK",
    "enabled": true,
    "config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",
    "plugin_id": "########-fxae-xxbx-xcxx-c####xcac###",
    "resource_type": "SystemHealthAgentProfile",
    "id": "###fe###-fxex-###d-bxea-fxxf######",       <----- profile ID
    "display_name": "nestdb-remedy-control-profile-1",
    "_create_time": 1648540461728,
    "_create_user": "admin",
    "_last_modified_time": 1648540461728,
    "_last_modified_user": "admin",
    "_system_owned": false,
    "_protection": "NOT_PROTECTED",
    "_revision": 0
}
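
The equivalent curl sketch (assuming the profile body above is saved as profile.json; placeholders as in Step 1):

    # Sketch only: create the system health profile with curl.
    curl -k -u "admin:${ADMIN_PW}" \
         -H "Content-Type: application/json" \
         -X POST "https://${MGR}/api/v1/systemhealth/profiles/" \
         -d @profile.json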
 
Step 3. Apply the created profile to the NSGroup
 
POST https://{{manager_ip}}/api/v1/service-configs
 
{
 "display_name":"nestdb-control-service-config-1",
 "profiles":[{
  "profile_type":"SHAProfile",
  "target_id":"###fe###-fxex-###d-bxea-fxxf######"}],       <----- profile ID
 "applied_to":[{
   "target_id":"b#####f-xbxx-xbxd-bcxa-dax3####fxfd",        <----- NSGroup ID
   "target_type":"NSGroup"
 }]
}
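
And the matching curl sketch (assuming the body above is saved as service-config.json; placeholders as in the previous steps):

    # Sketch only: apply the profile to the NSGroup with curl.
    curl -k -u "admin:${ADMIN_PW}" \
         -H "Content-Type: application/json" \
         -X POST "https://${MGR}/api/v1/service-configs" \
         -d @service-config.json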
 
Step 4. Confirm the new CHECK_INTERVAL is effective for the nestdb_remedy plugin on an ESXi host, e.g. UUID: xxxxxeaf-xfbe-xbxc-bfxx-xxbxbxcxxxxe
 
GET https://{{manager_ip}}/api/v1/systemhealth/plugins/status/42492eaf-0fbe-4b2c-bf84-19b0b1c9913e
 
Response:

[
...
{
            "id": "########-fxae-xxbx-xcxx-c####cac###",
            "name": "nestdb_remedy",
            "status": "NORMAL",
            "profile": "NAME: nestdb-remedy-control-profile-1, ENABLE: True, CHECK_INTERVAL: 120, MAX_TRY_COUNT_FOR_A_CRASH: 2, MIN_INTERVAL_BETWEEN_TWO_REMEDIATION: 300",
            "detail": ""
        },
...
]
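
The same check as a curl sketch, filtering the response for the nestdb_remedy entry (python is assumed to be available on the client for pretty-printing):

    # Sketch only: query plugin status for one transport node.
    curl -k -u "admin:${ADMIN_PW}" \
         "https://${MGR}/api/v1/systemhealth/plugins/status/42492eaf-0fbe-4b2c-bf84-19b0b1c9913e" \
         | python -m json.tool | grep -A 3 nestdb_remedy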
 
Notes:

  • There is no negative impact from this workaround. The only consequence is that, in the event of a disk usage issue, nestdb_remedy will take longer to restart the nestdb service.
  • In rare cases the issue may persist after following the steps above. To resolve it, restart the proton service on all three NSX Managers by running the following command from the admin shell:

restart service manager

Additional Information

Impact/Risks:

Elevated read traffic from every datastore on all ESXi hosts configured with NSX-T.
With many ESXi hosts, the combined effect on the storage array can be significant.