nestdb agent is down on an NSX prepared ESXi Host

Article ID: 322542

Products

VMware NSX Networking

Issue/Introduction

  • The ESXi host is prepared for NSX.
  • VMs on the host cannot send or receive traffic.
  • NSX configuration changes cannot be made on the host.
  • The host-to-controller connection is down.
  • vMotion to the host is not possible.
  • The following log messages are observed in /var/run/log/nsx-syslog:

<date-time> nestdb-server[390039091]: NSX 390039091 - [nsx@6876 comp="nsx-esx" subcomp="nsx-nestdb" tid="390039091" level="ERROR" errorCode="NST0103"] leveldb::DB::Write() failed: IO error: /var/lib/vmware/nsx/nestdb/db/8437570.ldb: No space left on device

<date-time-1> nestdb-server[390040348]: NSX 390040348 - [nsx@6876 comp="nsx-esx" subcomp="nsx-nestdb" tid="390040348" level="ERROR" errorCode="NST0103"] leveldb::DB::Write() failed: IO error: /var/lib/vmware/nsx/nestdb/db/8437575.ldb: No space left on device

<date-time-2> nestdb-server[390040382]: NSX 390040382 - [nsx@6876 comp="nsx-esx" subcomp="nsx-nestdb" tid="390040382" level="ERROR" errorCode="NST0103"] leveldb::DB::Write() failed: IO error: /var/lib/vmware/nsx/nestdb/db/8437578.ldb: No space left on device

  • The directory /var/lib/vmware/nsx/nestdb/db/lost contains many files, which consume all of the ramdisk space.
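To check whether a host is logging this symptom, the log can be filtered for the error shown above. The following is a minimal Python sketch, not a supported tool: the function names are hypothetical, and the pattern is derived only from the three sample lines in this article (subcomp "nsx-nestdb", errorCode "NST0103", and the text "No space left on device").

```python
import re

# Pattern built from the sample log lines in this article (an assumption:
# other builds may format the message differently).
NESTDB_FULL = re.compile(
    r'subcomp="nsx-nestdb".*errorCode="NST0103".*No space left on device'
)


def find_nestdb_full_errors(lines):
    """Return the lines that report a nestdb write failing for lack of space."""
    return [line.rstrip("\n") for line in lines if NESTDB_FULL.search(line)]


def scan_log(path="/var/run/log/nsx-syslog"):
    """Scan the log file named in this article and return matching lines."""
    with open(path, errors="replace") as log:
        return find_nestdb_full_errors(log)
```

Calling scan_log() on the affected host returns an empty list when no such errors have been logged.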

 

Environment

VMware NSX-T Data Center

Cause

If the nestdb agent experiences an unrecoverable error, it saves a copy of the current nestdb in /var/lib/vmware/nsx/nestdb/db/lost before restarting.
Over time, if nestdb continues to encounter errors, it creates many files in /var/lib/vmware/nsx/nestdb/db/lost, and the ramdisk runs out of space.
Once the ramdisk is full, nestdb can no longer be restarted and stays down, which causes the symptoms described above.
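To see how much of the ramdisk the saved copies occupy, the files under the lost directory can be counted and their sizes summed. This is a minimal Python sketch under stated assumptions: the helper name summarize_dir is hypothetical, and only the directory path comes from this article.

```python
import os

LOST_DIR = "/var/lib/vmware/nsx/nestdb/db/lost"  # path from this article


def summarize_dir(path=LOST_DIR):
    """Return (file_count, total_bytes) for files under path, recursively.

    A missing directory yields (0, 0) because os.walk produces nothing.
    """
    count = 0
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            try:
                total += os.lstat(full).st_size
                count += 1
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
    return count, total
```

A steadily growing file count across repeated runs would be consistent with the cause described above.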

Resolution

This issue is resolved in VMware NSX 3.2.3.1 and 4.1.1, available at VMware downloads.

 


Workaround:

  • Use the API /api/v1/transport-nodes/<uuid>/status?source=realtime to monitor RAM disk utilization on a transport node. A sample result:

...

                {
                    "file_system": "nestdb",
                    "mount": "/var/lib/vmware/nsx/nestdb/db",
                    "total": 524288,
                    "type": "ramdisk", 
                    "used": 10548
                },

...

  • If the value of "used" is more than 400000, delete all files under /var/lib/vmware/nsx/nestdb/db/lost to prevent this issue from occurring.
  • If the issue has already occurred and nestdb is down, delete all files under /var/lib/vmware/nsx/nestdb/db/lost, then restart the nsx-nestdb service on the ESXi host by running '/etc/init.d/nsx-nestdb restart'.
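The threshold check in the workaround can be automated against the decoded API response. The sketch below is illustrative only. It does not assume where the nestdb entry sits in the response, since the sample above is a fragment. Instead it searches the decoded JSON recursively for entries whose "file_system" is "nestdb". The function names are hypothetical, and fetching the response (authentication, manager address) is left out.

```python
import json

THRESHOLD = 400000  # "used" value above which this article advises cleanup


def find_nestdb_entries(node):
    """Recursively collect dicts whose "file_system" field is "nestdb"."""
    found = []
    if isinstance(node, dict):
        if node.get("file_system") == "nestdb":
            found.append(node)
        for value in node.values():
            found.extend(find_nestdb_entries(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_nestdb_entries(item))
    return found


def nestdb_needs_cleanup(status, threshold=THRESHOLD):
    """True if any nestdb ramdisk entry reports "used" above the threshold."""
    return any(
        entry.get("used", 0) > threshold
        for entry in find_nestdb_entries(status)
    )


# The sample entry from this article, wrapped in a list for illustration.
sample = json.loads("""
[{"file_system": "nestdb", "mount": "/var/lib/vmware/nsx/nestdb/db",
  "total": 524288, "type": "ramdisk", "used": 10548}]
""")
print(nestdb_needs_cleanup(sample))  # False: 10548 is below 400000
```

Searching recursively keeps the sketch independent of the surrounding response layout, at the cost of scanning the whole document.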

Note: Restarting nsx-nestdb has no impact on the ESXi host, because nestdb is not used as a persistent store. When nestdb restarts, it performs a full sync with the CCP (Central Control Plane).
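The cleanup and restart steps from the workaround could be scripted along the following lines. This is a hedged sketch rather than a supported procedure: the function names are hypothetical, it assumes a Python interpreter is available in the ESXi shell, and it removes only regular files directly inside the lost directory (an assumption about the directory's layout). It defaults to a dry run so that nothing is deleted until dry_run=False is passed.

```python
import os
import subprocess

LOST_DIR = "/var/lib/vmware/nsx/nestdb/db/lost"       # path from this article
RESTART_CMD = ["/etc/init.d/nsx-nestdb", "restart"]   # command from this article


def clear_lost(path=LOST_DIR, dry_run=True):
    """Delete regular files directly under path and return their names, sorted.

    With dry_run=True (the default) nothing is deleted; the names that
    would be removed are returned so they can be reviewed first.
    """
    handled = []
    if not os.path.isdir(path):
        return handled
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_file(follow_symlinks=False):
                if not dry_run:
                    os.remove(entry.path)
                handled.append(entry.name)
    return sorted(handled)


def restart_nestdb():
    """Restart the nsx-nestdb service; raises CalledProcessError on failure."""
    subprocess.run(RESTART_CMD, check=True)
```

A cautious sequence is to call clear_lost() first to review the list, then clear_lost(dry_run=False), and to call restart_nestdb() only if nestdb is already down.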