After upgrading to NSX-T 3.2.0.1, you experience these symptoms:
All functions related to nestdb may be impacted if nestdb crashes because of a disk-full event. The nestdb_remedy plug-in was introduced to monitor disk usage and attempt to restart the nestdb service when disk space becomes available.
Note: By default, nestdb_remedy checks every 20 seconds, which increases the disk read rate.
To confirm this issue:
Run the following two commands at the same time, in separate SSH sessions to the host, for at least 30 seconds:
Find the world ID for python:
Find that world ID in ps-output.txt and check whether nestdb is listed in the command column:
This issue is resolved in VMware NSX-T 3.2.2.
Note: This issue also does not occur in NSX-T 4.0, as the nestdb_remedy plugin is not used in 4.0.
Workaround:
The example steps below disable the nestdb_remedy plugin.
Step 1. Create an NSGroup via the API or UI with the desired host Transport Nodes
A) UI option: select the Manager view in the NSX UI and navigate to Inventory > Groups. Create an NSGroup and add the desired Transport Nodes.
Note: The NSGroup UUID needed in Step 3 is listed on the Overview tab.
B) API option: Host Transport Node UUIDs can be found in the UI (System > Fabric > Nodes > Host Transport Nodes) or in the output of "get nodes" in the Manager CLI.
API to create NSGroup:
POST https://{{manager_ip}}/api/v1/ns-groups
{
"display_name":"219NodeGroup",
"members" : [ {
"resource_type" : "NSGroupSimpleExpression",
"target_type" : "TransportNode",
"target_property" : "id",
"op" : "EQUALS",
"value" : "####eaf-xfbe-xbxc-bf##-##bxbxc####e" <----- target host 1
}, {
"resource_type" : "NSGroupSimpleExpression",
"target_type" : "TransportNode",
"target_property" : "id",
"op" : "EQUALS",
"value" : "xdxxxexa-####-####-####-xxxxabxfxxxb", <----- target host 2
} ]
}
Response:
{
"members": [
...
],
"member_count": 2,
"resource_type": "NSGroup",
"id": "b#####f-xbxx-xbxd-bcxa-dax3####fxfd", <----- NSGroup ID
"display_name": "219NodeGroup",
"_create_time": 1648532005403,
"_create_user": "admin",
"_last_modified_time": 1648532005403,
"_last_modified_user": "admin",
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
Step 2. Create a plugin profile with "enabled": false
POST https://{{manager_ip}}/api/v1/systemhealth/profiles/
{
"display_name": "nestdb-remedy-control-profile-1",
"enabled": false, <----- set to false
"config": "{\"CHECK_INTERVAL\": 20, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}", <----- set config like this
"plugin_id": "08878948-f2ae-42b6-8c63-c03091cac158" <----- use this UUID
}
Response:
{
"type": "NETWORK",
"enabled": false,
"config": "{\"CHECK_INTERVAL\": 20, \"MAX_TRY_COUNT_FOR_A_CRASH\": 2, \"MIN_INTERVAL_BETWEEN_TWO_REMEDIATION\": 300}",
"plugin_id": "08878948-f2ae-42b6-8c63-c03091cac158",
"resource_type": "SystemHealthAgentProfile",
"id": "###fe###-fxex-###d-bxea-fxxf######", <----- profile ID
"display_name": "nestdb-remedy-control-profile-1",
"_create_time": 1648540461728,
"_create_user": "admin",
"_last_modified_time": 1648540461728,
"_last_modified_user": "admin",
"_system_owned": false,
"_protection": "NOT_PROTECTED",
"_revision": 0
}
Step 3. Apply the created profile to the NSGroup
POST https://{{manager_ip}}/api/v1/service-configs
{
"display_name":"nestdb-control-service-config-1",
"profiles":[{
"profile_type":"SHAProfile",
"target_id":"###fe###-fxex-###d-bxea-fxxf######"}], <----- profile ID
"applied_to":[{
"target_id":"b#####f-xbxx-xbxd-bcxa-dax3####fxfd", <----- NSGroup ID
"target_type":"NSGroup"
}]
}
Step 4. Confirm the "nestdb_remedy" plugin is disabled on an ESXi host.
GET https://{{manager_ip}}/api/v1/systemhealth/plugins/status/{{esx_host_id}}
Response:
[
...
{
"id": "08878948-f2ae-42b6-8c63-c03091cac158",
"name": "nestdb_remedy",
"status": "NORMAL",
"profile": "NAME: nestdb-remedy-control-profile-1, ENABLE: False, CHECK_INTERVAL: 20, MAX_TRY_COUNT_FOR_A_CRASH: 2, MIN_INTERVAL_BETWEEN_TWO_REMEDIATION: 300",
"detail": "Plugin is disabled."
},
...
]
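When checking the status response, note that the disabled state is embedded in the "profile" string ("ENABLE: False") rather than exposed as a separate boolean. A small sketch for picking out and checking the nestdb_remedy entry (function names are illustrative):

```python
def find_nestdb_remedy(status_entries):
    """From GET /api/v1/systemhealth/plugins/status/{host_id} output,
    return the nestdb_remedy entry, or None if absent."""
    for entry in status_entries:
        if entry.get("name") == "nestdb_remedy":
            return entry
    return None

def is_disabled(entry):
    """The applied profile string embeds 'ENABLE: False' when the plugin is off."""
    return "ENABLE: False" in entry.get("profile", "")
```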
Note: You may need to run "restart service manager" from the NSX Manager CLI.
Impact/Risks:
High read data volume from every datastore on all ESXi servers configured with NSX-T. With many ESXi hosts, the combined effect on storage can be significant.