After upgrading to NSX-T 3.2.0.1, you experience these symptoms:
All functions that depend on nestdb may be impacted if nestdb crashes because of a disk-full event. The nestdb_remedy plug-in was introduced to monitor disk usage and attempt to restart the nestdb service when disk space becomes available again.
Note: By default, nestdb_remedy checks every 20 seconds, which increases the disk read rate.
To confirm you are hitting this issue:
Run these two commands at the same time in separate SSH sessions on the host for at least 30 seconds:
Find the world ID for python:
Find that world ID in ps-output.txt and check whether nestdb is listed in the command column:
This issue is resolved in VMware NSX-T 3.2.2.
Note: This issue also does not occur in NSX-T 4.0, as the nestdb_remedy plugin is not used in 4.0.
Workaround:
Below are example steps to change the check interval to 120 seconds.
Step 1. Create an NSGroup via the API or the UI with the desired host Transport Nodes
A) UI option: select the Manager view in the NSX UI and navigate to Inventory > Groups. Create an NSGroup and add the desired Transport Nodes.
Note: The NSGroup UUID needed in Step 3 is listed on the Overview tab.
B) API option: Host Transport Node UUIDs can be found in the UI (System > Fabric > Nodes > Host Transport Nodes), or in the output of get nodes in the NSX Manager CLI.
API to create NSGroup:
POST https://{{manager_ip}}/api/v1/ns-groups
{
"display_name":"219NodeGroup",
"members" : [ {
"resource_type" : "NSGroupSimpleExpression",
"target_type" : "TransportNode",
"target_property" : "id",
"op" : "EQUALS",
"value" : "####eaf-xfbe-xbxc-bf##-##bxbxc####e" <----- target host 1
}, {
"resource_type" : "NSGroupSimpleExpression",
"target_type" : "TransportNode",
"target_property" : "id",
"op" : "EQUALS",
"value" : "xdxxxexa-####-####-####-xxxxabxfxxxb", <----- target host 2
} ]
}
Response:
{
"members": [
...
],
"member_count": 2,
"resource_type": "NSGroup",
"id": "b#####f-xbxx-xbxd-bcxa-dax3####fxfd", <----- NSGroup ID
"display_name": "219NodeGroup",
"_create_time": 1648532005403,
"_create_user": "admin",
"_last_modified_time": 1648532005403,
"_last_modified_user": "admin",
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
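The Step 1 request body can also be assembled programmatically before posting it. The sketch below is illustrative only: the helper name, UUID placeholders, and the commented POST call are assumptions, not part of the NSX-T API.

```python
import json

# Hypothetical helper: build the Step 1 NSGroup request body.
# The transport-node UUIDs are placeholders -- substitute your own.
def build_nsgroup_body(display_name, transport_node_ids):
    return {
        "display_name": display_name,
        "members": [
            {
                "resource_type": "NSGroupSimpleExpression",
                "target_type": "TransportNode",
                "target_property": "id",
                "op": "EQUALS",
                "value": node_id,
            }
            for node_id in transport_node_ids
        ],
    }

body = build_nsgroup_body("219NodeGroup", ["<host-1-uuid>", "<host-2-uuid>"])
print(json.dumps(body, indent=2))
# POST this body to https://{manager_ip}/api/v1/ns-groups, for example with
# the requests library: requests.post(url, json=body, auth=(user, pw), verify=False)
```

The "id" field of the response is the NSGroup ID you will need again in Step 3.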
Step 2. Create a plugin profile with CHECK_INTERVAL: 120
POST https://{{manager_ip}}/api/v1/systemhealth/profiles/
{
"display_name": "nestdb-remedy-control-profile-1",
"enabled": true,
"config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}", <----- set config like this
"plugin_id":"########-fxae-xxbx-xcxx-c###cac###" <----- use this UUID
}
Response:
{
"type": "NETWORK",
"enabled": true,
"config": "{\"CHECK_INTERVAL\": 120, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",
"plugin_id": "########-fxae-xxbx-xcxx-c####xcac###",
"resource_type": "SystemHealthAgentProfile",
"id": "###fe###-fxex-###d-bxea-fxxf######", <----- profile ID
"display_name": "nestdb-remedy-control-profile-1",
"_create_time": 1648540461728,
"_create_user": "admin",
"_last_modified_time": 1648540461728,
"_last_modified_user": "admin",
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
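Note that the config field in the Step 2 body is a JSON-encoded string, not a nested object, which is why it appears escaped above. A minimal sketch of building that body (the plugin UUID is a placeholder; the values mirror the example request):

```python
import json

# The plugin profile's "config" field must be a JSON string, so the
# settings dict is serialized with json.dumps before being embedded.
settings = {
    "CHECK_INTERVAL": 120,
    "MAX_TRY_COUNT_FOR_A_CRASH": 2,
    "MIN_INTERVAL_BETWEEN_TWO_REMEDIATION": 300,
}
profile_body = {
    "display_name": "nestdb-remedy-control-profile-1",
    "enabled": True,
    "config": json.dumps(settings),              # a string, not a dict
    "plugin_id": "<nestdb_remedy-plugin-uuid>",  # placeholder UUID
}
print(json.dumps(profile_body, indent=2))
```

Passing config as a plain object instead of a string is a common mistake with this endpoint shape; serializing it first keeps the request consistent with the example above.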
Step 3. Apply the created profile to the NSGroup
POST https://{{manager_ip}}/api/v1/service-configs
{
"display_name":"nestdb-control-service-config-1",
"profiles":[{
"profile_type":"SHAProfile",
"target_id":"###fe###-fxex-###d-bxea-fxxf######"}], <----- profile ID
"applied_to":[{
"target_id":"b#####f-xbxx-xbxd-bcxa-dax3####fxfd", <----- NSGroup ID
"target_type":"NSGroup"
}]
}
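Step 3 chains the two IDs returned earlier: the profile ID from the Step 2 response goes in profiles[].target_id, and the NSGroup ID from the Step 1 response goes in applied_to[].target_id. A placeholder sketch of assembling that body (the variable names and UUID placeholders are illustrative):

```python
import json

# Hypothetical values -- use the "id" fields from the Step 1 and Step 2 responses.
profile_id = "<profile-uuid-from-step-2>"
nsgroup_id = "<nsgroup-uuid-from-step-1>"

service_config = {
    "display_name": "nestdb-control-service-config-1",
    "profiles": [{"profile_type": "SHAProfile", "target_id": profile_id}],
    "applied_to": [{"target_id": nsgroup_id, "target_type": "NSGroup"}],
}
print(json.dumps(service_config, indent=2))
# POST to https://{manager_ip}/api/v1/service-configs
```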
Step 4. Confirm the new CHECK_INTERVAL is effective for the "nestdb_remedy" plugin on an ESXi host, e.g. transport node UUID xxxxxeaf-xfbe-xbxc-bfxx-xxbxbxcxxxxe
GET https://{{manager_ip}}/api/v1/systemhealth/plugins/status/42492eaf-0fbe-4b2c-bf84-19b0b1c9913e
Response:
[
...
{
"id": "########-fxae-xxbx-xcxx-c####cac###",
"name": "nestdb_remedy",
"status": "NORMAL",
"profile": "NAME: nestdb-remedy-control-profile-1, ENABLE: True, CHECK_INTERVAL: 120, MAX_TRY_COUNT_FOR_A_CRASH: 2, MIN_INTERVAL_BETWEEN_TWO_REMEDIATION: 300",
"detail": ""
},
...
]
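Because the GET response is a list of plugin-status entries, the check can be scripted by filtering for the nestdb_remedy entry and inspecting its profile string. A minimal sketch against a sample shaped like the response above (the sample and helper name are illustrative, not an official client):

```python
import json

# Sample shaped like the Step 4 response above, trimmed to one entry.
response_text = """
[
  {
    "id": "<plugin-uuid>",
    "name": "nestdb_remedy",
    "status": "NORMAL",
    "profile": "NAME: nestdb-remedy-control-profile-1, ENABLE: True, CHECK_INTERVAL: 120, MAX_TRY_COUNT_FOR_A_CRASH: 2, MIN_INTERVAL_BETWEEN_TWO_REMEDIATION: 300",
    "detail": ""
  }
]
"""

def check_interval_applied(plugins, expected=120):
    """Return True if the nestdb_remedy entry reports the expected interval."""
    for plugin in plugins:
        if plugin.get("name") == "nestdb_remedy":
            return f"CHECK_INTERVAL: {expected}" in plugin.get("profile", "")
    return False

plugins = json.loads(response_text)
print(check_interval_applied(plugins))  # True when the new profile is effective
```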
Notes:
If needed, the manager service can be restarted by running restart service manager from the NSX Manager CLI.
Impact/Risks:
High volume of disk reads on all ESXi servers configured with NSX-T, from every datastore.
With many ESXi hosts, the combined effect on the storage layer will be high.