Extra read rate on VMFS datastores after upgrading to NSX-T 3.2.0.1
Article ID: 317742

Products

VMware NSX, VMware vSphere ESXi

Issue/Introduction

After upgrading to NSX-T 3.2.0.1, you experience these symptoms:

  • All ESXi hosts see a read rate increase of approximately 2 MB/s to VMFS datastores.
  • The problem occurs even when the ESXi hosts are in Maintenance Mode (MM) or when no VMs are running on the hosts.

Cause

All functions related to nestdb may be impacted if nestdb crashes because of a disk-full event. The nestdb_remedy plug-in was introduced to monitor disk usage and restart the nestdb service once disk space becomes available again.

Note: By default, nestdb_remedy performs this check every 20 seconds, which increases the disk read rate.

Resolution

To confirm you are hitting this issue:

  1. Run these two commands at the same time, in separate SSH sessions to the host, for at least 30 seconds:

    watch -n2 ps -P -t -s -c > /tmp/ps-output.txt
    watch -n2 esxcli storage core device world list -d <naa.####> > /tmp/worldlist.txt
  2. Find the world ID for python: 

    grep python /tmp/worldlist.txt
  3. Find that world ID in /tmp/ps-output.txt and check whether nestdb is listed in the command column (a scripted version of these steps is sketched after this list): 

    grep <world ID> /tmp/ps-output.txt
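
A combined sketch of the three steps above, run directly on the ESXi host. This is a convenience script, not part of the official procedure: the device identifier is a placeholder, and the world-ID column position should be verified against your actual output before relying on it.

    # Sketch only: collect both outputs for ~30 seconds, then correlate.
    DEVICE=naa.XXXXXXXXXXXXXXXX    # replace with a real VMFS device ID

    watch -n2 ps -P -t -s -c > /tmp/ps-output.txt &
    PS_PID=$!
    watch -n2 esxcli storage core device world list -d $DEVICE > /tmp/worldlist.txt &
    WL_PID=$!

    sleep 30
    kill $PS_PID $WL_PID

    # Assumes the world ID is the second column of the world-list output.
    WORLD_ID=$(grep python /tmp/worldlist.txt | awk '{print $2}' | head -1)
    grep "$WORLD_ID" /tmp/ps-output.txt | grep nestdb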

This issue is resolved in VMware NSX-T 3.2.2.

Note: This issue also does not occur in NSX-T 4.0, as the nestdb_remedy plugin is not used in 4.0.

Workaround:

Below are example steps to change the nestdb_remedy check interval to 120 seconds.
 
Step 1. Create an NSGroup, via the UI or the API, containing the desired host Transport Nodes

A) For the UI option, select the Manager view in the NSX UI and navigate to Inventory > Groups. Create an NSGroup and add the desired Transport Nodes.
Note: The NSGroup UUID needed in Step 3 is listed on the Overview tab. 

B) For the API option, host Transport Node UUIDs can be found in the UI (System > Fabric > Nodes > Host Transport Nodes) or from the get nodes output in the Manager CLI.
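
Transport node UUIDs can also be listed via the API. A minimal curl sketch follows; the manager address and admin password are placeholders, and -k skips certificate verification (use only where that is acceptable):

    # Sketch only: list transport nodes and their UUIDs via the NSX API.
    curl -k -u "admin:${ADMIN_PW}" "https://${MGR}/api/v1/transport-nodes"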

 
API to create NSGroup:
POST https://{{manager_ip}}/api/v1/ns-groups

 
{
    "display_name":"219NodeGroup",
    "members" : [ {
      "resource_type" : "NSGroupSimpleExpression",
      "target_type" : "TransportNode",
      "target_property" : "id",
      "op" : "EQUALS",
      "value" : "####eaf-xfbe-xbxc-bf##-##bxbxc####e"         <----- target host 1
     
    }, {
      "resource_type" : "NSGroupSimpleExpression",
      "target_type" : "TransportNode",
      "target_property" : "id",
      "op" : "EQUALS",
      "value" : "xdxxxexa-####-####-####-xxxxabxfxxxb",      <----- target host 2
    } ]
}
 
Response:
 
{
    "members": [
...
    ],
    "member_count": 2,
    "resource_type": "NSGroup",
    "id": "b#####f-xbxx-xbxd-bcxa-dax3####fxfd",     <----- NSGroup ID
    "display_name": "219NodeGroup",
    "_create_time": 1648532005403,
    "_create_user": "admin",
    "_last_modified_time": 1648532005403,
    "_last_modified_user": "admin",
    "_system_owned": false,
    "_protection": "NOT_PROTECTED",
    "_revision": 0
}
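
For reference, a minimal curl sketch of this call (assuming the request body above is saved as nsgroup.json; MGR and ADMIN_PW are placeholders, and -k skips certificate verification):

    # Sketch only: create the NSGroup with curl.
    curl -k -u "admin:${ADMIN_PW}" \
         -H "Content-Type: application/json" \
         -X POST "https://${MGR}/api/v1/ns-groups" \
         -d @nsgroup.json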
 
Step 2. Create a plugin profile with CHECK_INTERVAL: 120
 
POST https://{{manager_ip}}/api/v1/systemhealth/profiles/
 
{
    "display_name": "nestdb-remedy-control-profile-1",
    "enabled": true,
    "config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",      <----- set config like this
    "plugin_id":"########-fxae-xxbx-xcxx-c###cac###"       <----- use this UUID
}
 
Response:
{
    "type": "NETWORK",
    "enabled": true,
    "config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",
    "plugin_id": "########-fxae-xxbx-xcxx-c####xcac###",
    "resource_type": "SystemHealthAgentProfile",
    "id": "###fe###-fxex-###d-bxea-fxxf######",       <----- profile ID
    "display_name": "nestdb-remedy-control-profile-1",
    "_create_time": 1648540461728,
    "_create_user": "admin",
    "_last_modified_time": 1648540461728,
    "_last_modified_user": "admin",
    "_system_owned": false,
    "_protection": "NOT_PROTECTED",
    "_revision": 0
}
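
The equivalent curl sketch (assuming the profile body above is saved as profile.json; placeholders as in Step 1):

    # Sketch only: create the system health profile with curl.
    curl -k -u "admin:${ADMIN_PW}" \
         -H "Content-Type: application/json" \
         -X POST "https://${MGR}/api/v1/systemhealth/profiles/" \
         -d @profile.json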
 
Step 3. Apply the created profile to the NSGroup
 
POST https://{{manager_ip}}/api/v1/service-configs
 
{
 "display_name":"nestdb-control-service-config-1",
 "profiles":[{
  "profile_type":"SHAProfile",
  "target_id":"###fe###-fxex-###d-bxea-fxxf######"}],       <----- profile ID
 "applied_to":[{
   "target_id":"b#####f-xbxx-xbxd-bcxa-dax3####fxfd",        <----- NSGroup ID
   "target_type":"NSGroup"
 }]
}
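
And the matching curl sketch (assuming the body above is saved as service-config.json; placeholders as in the previous steps):

    # Sketch only: apply the profile to the NSGroup with curl.
    curl -k -u "admin:${ADMIN_PW}" \
         -H "Content-Type: application/json" \
         -X POST "https://${MGR}/api/v1/service-configs" \
         -d @service-config.json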
 
Step 4. Confirm the new CHECK_INTERVAL is effective for the nestdb_remedy plugin on an ESXi host, e.g. UUID: xxxxxeaf-xfbe-xbxc-bfxx-xxbxbxcxxxxe
 
GET https://{{manager_ip}}/api/v1/systemhealth/plugins/status/42492eaf-0fbe-4b2c-bf84-19b0b1c9913e
 
Response:

[
...
{
            "id": "########-fxae-xxbx-xcxx-c####cac###",
            "name": "nestdb_remedy",
            "status": "NORMAL",
            "profile": "NAME: nestdb-remedy-control-profile-1, ENABLE: True, CHECK_INTERVAL: 120, MAX_TRY_COUNT_FOR_A_CRASH: 2, MIN_INTERVAL_BETWEEN_TWO_REMEDIATION: 300",
            "detail": ""
        },
...
]
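
The same check as a curl sketch, filtering the response for the nestdb_remedy entry (python is assumed to be available on the client for pretty-printing):

    # Sketch only: query plugin status for one transport node.
    curl -k -u "admin:${ADMIN_PW}" \
         "https://${MGR}/api/v1/systemhealth/plugins/status/42492eaf-0fbe-4b2c-bf84-19b0b1c9913e" \
         | python -m json.tool | grep -A 3 nestdb_remedy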
 
Notes:

  • There is no negative impact from this workaround. The only consequence is that, in the event of a disk usage issue, nestdb_remedy will take longer to restart the nestdb service.
  • In rare cases the issue may persist after following the steps above. To resolve it, restart the proton service on all three NSX Managers by running the following command from the admin shell:

restart service manager

Additional Information

Impact/Risks:

Elevated read traffic from every datastore on all ESXi hosts configured with NSX-T.
With many ESXi hosts, the combined effect on the storage array can be significant.