During an ESXi host software patch installation, all services were stopped and started, and the NestDB service took too long to shut down (about 1 minute). As a result, the NestDB watchdog was not started.
Multiple VMs on the host cluster may become unreachable because they are unable to obtain IP addresses from DHCP.
Log lines of interest:
vmksummary.log
2024-09-11T03:56:36.871Z bootstop[113068089]: Host is rebooting
2024-09-11T04:01:03.415Z bootstop[2107041]: Host has booted
The initial NestDB shutdown for the reboot looks normal (it completes in about 1 second), and the service starts again after the host boots.
syslog
2024-09-11T03:56:57.165Z NSX[113069267]: Shutting down NSX-NESTDB watchdog
2024-09-11T03:56:57.201Z watchdog-NSX-NESTDB[112020713]: Terminating watchdog process with PID 14442553
2024-09-11T03:56:57.206Z watchdog-NSX-NESTDB[113069293]: [14442553] Signal received: exiting the watchdog
2024-09-11T03:56:57.226Z NSX[113069299]: Shutting down NSX-NESTDB service
2024-09-11T03:56:58.277Z NSX[113069389]: NSX-NESTDB service is stopped
2024-09-11T04:00:20.607Z secpolicytools[2101948]: Getting realpath failed: /var/run/vmware/watchdog-NSX-NESTDB.PID
2024-09-11T04:00:36.491Z watchdog-NSX-NESTDB[2104985]: [2104970] Begin '/opt/vmware/nsx-nestdb/bin/nestdb-server ++securitydom=25 --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /var/lib/vmware/nsx/nestdb/db --txn_log_size 10485760 --prof_prefix /var/log/vmware/nsx-nestdb/nsx-nestdb --mem_stats_interval 300 --mem_release_interval 1800 --metrics_rpc_publisher --metrics_text_publisher --listen tcp://127.0.0.1:2480', min-uptime = 60, max-quick-failures = 20, max-total-failures = 1000000, bg_pid_file = '', reboot-flag = '0'
2024-09-11T04:00:36.495Z watchdog-NSX-NESTDB[2104988]: Executing '/opt/vmware/nsx-nestdb/bin/nestdb-server ++securitydom=25 --schema /opt/vmware/nsx-nestdb/schema/nestdb.schema --database /var/lib/vmware/nsx/nestdb/db --txn_log_size 10485760 --prof_prefix /var/log/vmware/nsx-nestdb/nsx-nestdb --mem_stats_interval 300 --mem_release_interval 1800 --metrics_rpc_publisher --metrics_text_publisher --listen tcp://127.0.0.1:2480'
2024-09-11T04:00:36.515Z NSX[2104997]: NSX-NESTDB started
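NestDB is then shut down again, but this time the stop takes about 57 seconds (04:01:32 to 04:02:29). A start attempt at 04:02:25, while the stop is still in progress, reports that NSX-NESTDB is already running.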
2024-09-11T04:01:32.689Z NSX[2109784]: Shutting down NSX-NESTDB watchdog
2024-09-11T04:01:32.723Z watchdog-NSX-NESTDB[2109815]: Terminating watchdog process with PID 2104970
2024-09-11T04:01:32.728Z watchdog-NSX-NESTDB[2109826]: [2104970] Signal received: exiting the watchdog
2024-09-11T04:01:32.748Z NSX[2109833]: Shutting down NSX-NESTDB service
2024-09-11T04:02:04.422Z secpolicytools[2113778]: Getting realpath failed: /var/run/vmware/watchdog-NSX-NESTDB.PID
2024-09-11T04:02:10.065Z secpolicytools[2113851]: Getting realpath failed: /var/run/vmware/watchdog-NSX-NESTDB.PID
2024-09-11T04:02:25.713Z NSX[2114450]: NSX-NESTDB is already running
2024-09-11T04:02:29.747Z NSX[2114646]: NSX-NESTDB service is stopped
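The two stop durations discussed above (about 1 second before the reboot, about 57 seconds after it) can be derived from the syslog timestamps. The following is a minimal sketch rather than a supported tool: the function name nestdb_stop_durations is hypothetical, it assumes the log line format shown in this article, and it compares only the time-of-day part of each timestamp.

```sh
#!/bin/sh
# nestdb_stop_durations FILE
# Hypothetical helper: for each "Shutting down NSX-NESTDB service" line,
# print its timestamp and the number of seconds until the next
# "NSX-NESTDB service is stopped" line.
nestdb_stop_durations() {
    awk '
    function secs(ts,    t) {
        # ts looks like 2024-09-11T04:01:32.748Z; keep HH:MM:SS.mmm only
        split(substr(ts, 12, 12), t, ":")
        return t[1] * 3600 + t[2] * 60 + t[3]
    }
    /Shutting down NSX-NESTDB service/ { start = $1; next }
    /NSX-NESTDB service is stopped/ && start != "" {
        d = secs($1) - secs(start)
        if (d < 0) d += 86400   # assumption: a stop spans at most one midnight
        printf "%s %.0f\n", start, d
        start = ""
    }' "$1"
}
```

For example, assuming the host syslog is available at /var/log/syslog.log, running "nestdb_stop_durations /var/log/syslog.log" against the lines quoted in this article would report 1 second for the stop at 03:56:57 and 57 seconds for the stop at 04:01:32.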
VMware NSX-T 3.x
In NSX 3.x releases, if the ESXi host services are restarted (stopped and started) again after an ESXi host reboot, stopping the NestDB service may take longer than usual (about 1 minute). This slow stop can interfere with the NestDB start sequence, so that neither the watchdog nor the NestDB service is launched.
The extra ESXi services restart could be triggered after a host reboot by custom entries added to the /etc/rc.local.d/local.sh file in ESXi.
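To review whether local.sh contains custom entries, it can help to list only its active lines. The sketch below is illustrative: the function name local_sh_custom_entries is hypothetical, and it assumes that an unmodified local.sh contains only comments, blank lines, and a final "exit 0". It does not know which commands restart services, so the listed lines still need to be reviewed manually.

```sh
#!/bin/sh
# local_sh_custom_entries [FILE]
# Hypothetical helper: print "line number: text" for every line that is
# not a comment, not blank, and not a bare "exit 0".
local_sh_custom_entries() {
    awk '
    /^[ \t]*(#|$)/ { next }                 # skip comments and blank lines
    /^[ \t]*exit[ \t]+0[ \t]*$/ { next }    # skip the closing "exit 0"
    { printf "%d: %s\n", NR, $0 }
    ' "${1:-/etc/rc.local.d/local.sh}"
}
```

An empty result suggests that the file has no custom entries under the assumption above.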
Resolution:
A code change has been added to NSX versions 4.x to address this issue.
Workaround:
Check the NestDB service status with "/etc/init.d/nsx-nestdb status". If the output shows that the service is stopped, start it with "/etc/init.d/nsx-nestdb start".
# /etc/init.d/nsx-nestdb status
NSX-NestDB is not running.
# /etc/init.d/nsx-nestdb start
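The check-then-start sequence above could be wrapped in a small script. This is a minimal sketch under stated assumptions: the function name ensure_nestdb_running and the NESTDB_INIT variable are hypothetical, and the script relies only on the "is not running" status message shown above. Any other status output is treated as "no action needed", which is an assumption.

```sh
#!/bin/sh
# Init script named in the workaround. The variable exists only so the
# sketch can be tried against a stand-in script (hypothetical name).
NESTDB_INIT="${NESTDB_INIT:-/etc/init.d/nsx-nestdb}"

# Start NestDB only when the status output says it is not running.
ensure_nestdb_running() {
    status_output=$("$NESTDB_INIT" status 2>&1)
    case "$status_output" in
        *"is not running"*)
            echo "NestDB is stopped, starting it"
            "$NESTDB_INIT" start
            ;;
        *)
            # Assumption: any other message means no start is needed.
            echo "No action taken, status was: $status_output"
            ;;
    esac
}
```

After the function is defined, calling ensure_nestdb_running in an ESXi shell session performs the same two steps as the manual commands.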