nestdb agent is down on an NSX prepared ESXi Host

Article ID: 322542


Products

VMware NSX

Issue/Introduction

  • The ESXi host is prepared for NSX.
  • VMs on the host cannot send or receive traffic.
  • NSX configuration changes cannot be made on the host.
  • The nsx-nestdb service is reported as not running on the ESXi host.
     
    Example :

    [root@esxi-host:~] /etc/init.d/nsx-nestdb status
    NSX-NESTDB is not running

  • The host-to-controller connection is down, as shown by the 'get controllers' command on the ESXi host.

    Example :

    [root@esxi-host:~] nsxcli -c get controllers

     Controller IP    Port     SSL         Status       Is Physical Master   Session State  Controller FQDN           Failure Reason

      #.#.#.18     1235   enabled      not used            false              null              NA                       NA
      #.#.#.17     1235   enabled      not used            false              null              NA                       NA
      #.#.#.19     1235   enabled    disconnected           true              down              NA              CONNECTION_TIMED_OUT

  • vMotion to the host is not possible.
  • The following log messages appear in /var/run/log/nsx-syslog:

<date-time> nestdb-server[390039091]: NSX 390039091 - [nsx@6876 comp="nsx-esx" subcomp="nsx-nestdb" tid="390039091" level="ERROR" errorCode="NST0103"] leveldb::DB::Write() failed: IO error: /var/lib/vmware/nsx/nestdb/db/8437570.ldb: No space left on device

<date-time-1> nestdb-server[390040348]: NSX 390040348 - [nsx@6876 comp="nsx-esx" subcomp="nsx-nestdb" tid="390040348" level="ERROR" errorCode="NST0103"] leveldb::DB::Write() failed: IO error: /var/lib/vmware/nsx/nestdb/db/8437575.ldb: No space left on device

<date-time-2> nestdb-server[390040382]: NSX 390040382 - [nsx@6876 comp="nsx-esx" subcomp="nsx-nestdb" tid="390040382" level="ERROR" errorCode="NST0103"] leveldb::DB::Write() failed: IO error: /var/lib/vmware/nsx/nestdb/db/8437578.ldb: No space left on device

  • The directory /var/lib/vmware/nsx/nestdb/db/lost contains many files which are consuming all of the ramdisk space.

Environment

VMware NSX-T Data Center

Cause

If the nestdb agent experiences an unrecoverable error, it saves a copy of the current nestdb in /var/lib/vmware/nsx/nestdb/db/lost before restarting.
If nestdb continues to hit errors over time, these copies accumulate in /var/lib/vmware/nsx/nestdb/db/lost until the ramdisk runs out of space.
Once the ramdisk is full, nestdb can no longer be restarted and stays down, which causes the symptoms listed above.
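
To see how much the saved copies contribute, the files under the lost directory can be counted and their sizes summed. The following is a minimal sketch, assuming Python 3 is available on the host; the function name directory_usage is illustrative and is not part of any NSX tooling.

```python
import os

# Path taken from the Cause description above.
LOST_DIR = "/var/lib/vmware/nsx/nestdb/db/lost"


def directory_usage(directory):
    """Return (file_count, total_bytes) for all regular files below `directory`."""
    count = 0
    total = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue  # do not follow or count symbolic links
            total += os.path.getsize(path)
            count += 1
    return count, total


if __name__ == "__main__":
    files, size = directory_usage(LOST_DIR)
    print("{} files, {} bytes".format(files, size))
```

If the directory does not exist, os.walk yields nothing and the sketch reports 0 files and 0 bytes.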

Resolution

This issue is resolved in VMware NSX 3.2.3.1 and 4.1.1, available at Broadcom Downloads.
If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.


Workaround:

  • Use the API /api/v1/transport-nodes/<uuid>/status?source=realtime to monitor the RAM disk utilization on a transport node. A sample result:

...

                {
                    "file_system": "nestdb",
                    "mount": "/var/lib/vmware/nsx/nestdb/db",
                    "total": 524288,
                    "type": "ramdisk", 
                    "used": 10548
                },

...

  • If the value of used exceeds 400000, delete all files under /var/lib/vmware/nsx/nestdb/db/lost to prevent this issue from occurring.
  • If the issue has already occurred and nestdb is down, delete all files under /var/lib/vmware/nsx/nestdb/db/lost, then restart the nsx-nestdb service by running '/etc/init.d/nsx-nestdb restart' on the ESXi host.
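
The threshold check on the status response can be automated. The sketch below is one possible approach, not an official tool: it takes the already-parsed JSON body of the status call, searches it for the entry whose mount is /var/lib/vmware/nsx/nestdb/db, and compares used against 400000. Because the sample result above is truncated, the sketch does not assume where the entry sits in the response and searches recursively instead. The function names and the "file_systems" wrapper key in the demo are assumptions made for illustration, and fetching the response (authentication, NSX Manager address) is left out.

```python
import json

NESTDB_MOUNT = "/var/lib/vmware/nsx/nestdb/db"
USED_THRESHOLD = 400000  # value given in the workaround


def find_filesystem(node, mount):
    """Depth-first search of parsed JSON for a dict whose "mount" equals `mount`."""
    if isinstance(node, dict):
        if node.get("mount") == mount:
            return node
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_filesystem(child, mount)
        if found is not None:
            return found
    return None


def nestdb_needs_cleanup(status, threshold=USED_THRESHOLD):
    """Return True when the nestdb ramdisk "used" value exceeds `threshold`."""
    entry = find_filesystem(status, NESTDB_MOUNT)
    if entry is None:
        raise LookupError("no nestdb ramdisk entry found in the status response")
    return entry["used"] > threshold


if __name__ == "__main__":
    # The "file_systems" wrapper key is hypothetical; only the inner object
    # comes from the sample result in this article.
    sample = json.loads("""
    {"file_systems": [{"file_system": "nestdb",
                       "mount": "/var/lib/vmware/nsx/nestdb/db",
                       "total": 524288,
                       "type": "ramdisk",
                       "used": 10548}]}
    """)
    print(nestdb_needs_cleanup(sample))  # prints False: 10548 is below 400000
```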

Note: Restarting nsx-nestdb has no impact on the ESXi host, because nestdb is not used as a persistent store. When nestdb restarts, it performs a full sync with the CCP (Central Control Plane).
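
The cleanup and restart steps from the workaround can be sketched as follows. This is an illustration under stated assumptions rather than a supported script: it assumes Python 3 is available on the ESXi host, and the helper names delete_files and restart_nestdb are hypothetical. The directory path and the restart command are the ones quoted in this article.

```python
import os
import subprocess

LOST_DIR = "/var/lib/vmware/nsx/nestdb/db/lost"
RESTART_CMD = ["/etc/init.d/nsx-nestdb", "restart"]


def delete_files(directory):
    """Delete every file below `directory` and return the number removed.

    Subdirectories, if any exist, are left in place but emptied.
    """
    removed = 0
    for root, _dirs, files in os.walk(directory):
        for name in files:
            os.remove(os.path.join(root, name))
            removed += 1
    return removed


def restart_nestdb():
    """Run the restart command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(RESTART_CMD, check=True)
```

On an affected host, calling delete_files(LOST_DIR) and then restart_nestdb() mirrors the manual procedure. For the preventive case, where nestdb is still running, calling only delete_files(LOST_DIR) is sufficient.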