Edge Node is unable to establish BGP peering, TEP tunnels to ESXi hosts are down, and Controller Connectivity shows down

Article ID: 306212

Products

VMware NSX

Issue/Introduction

Purpose: To provide a procedure to identify and resolve an issue that will cause a datapath outage in an NSX-T environment.

Symptoms:
  • NSX-T 3.0.2 or earlier
  • Edge shows TEP Tunnels down
  • Edge is unable to peer with upstream BGP
  • Edge nestdb service is stopped
  • The Edge shows controller connectivity is down

  • A very large number of reload-#### files can be seen in the /config/vmware/edge/frr folder of the NSX-T Edge. These files can exhaust the inodes on the /config filesystem, and the lack of free inodes is what causes the symptoms above.


With root access, check inode usage on /config with df -i. On an affected Edge, no free inodes remain.

On a healthy system:

# df -i /config
Filesystem              Inodes IUsed   IFree IUse% Mounted on
/dev/mapper/nsx-config 1250928   109 1250819    1% /config

 
On an affected Edge:

# df -i /config
Filesystem              Inodes   IUsed IFree IUse% Mounted on
/dev/mapper/nsx-config 1250928 1250928     0  100% /config
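
The inode check above can be reduced to a single number for quick comparison. The following is a minimal sketch, not part of the original procedure: the function name inode_use_pct and the 95% threshold in the usage comment are illustrative assumptions, and it assumes a df that accepts the -P and -i options together (as GNU coreutils does).

```shell
# Print the inode usage percentage (without the % sign) for the
# filesystem that holds the given path.
# NOTE: inode_use_pct is a hypothetical helper name, not an NSX command.
inode_use_pct() {
    # -P keeps each filesystem on a single output line;
    # -i reports inode counts instead of block counts.
    # Field 5 of the second line is the IUse% column.
    df -Pi "$1" | awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}

# Hypothetical usage on an Edge, as root (the threshold is an assumption):
#   pct=$(inode_use_pct /config)
#   [ "$pct" -ge 95 ] && echo "WARNING: /config inode usage at ${pct}%"
```

On an affected Edge as shown above, the function would be expected to print 100 for /config.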



 

Log entries will indicate nestdb is not functioning:
 
In /var/log/syslog* you see lines similar to:
 

2022-04-22T12:18:02.166Z S3BPKS2BME03 NSX 4752 - [nsx@6876 comp="nsx-edge s2comp="nsx-net" tid="4988" level="WARNING"] StreamConnection[56065 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:56065] Couldn't connect to 'unix:///var/run/vmware/nestdb/nestdb-server.sock' (error: 111-Connection refused)
 
2022-04-22T12:18:02.166Z S3BPKS2BME03 NSX 4752 - [nsx@6876 comp="nsx-edge s2comp="nsx-net" tid="4988" level="WARNING"] StreamConnection[56065 Error to unix:///var/run/vmware/nestdb/nestdb-server.sock sid:-1] Error 111-Connection refused
 
2022-04-22T12:18:02.166Z S3BPKS2BME03 NSX 4752 - [nsx@6876 comp="nsx-edge s2comp="nsx-rpc" tid="4988" level="WARNING"] RpcConnection[56065 Connecting to unix:///var/run/vmware/nestdb/nestdb-server.sock 0] Couldn't connect to unix:///var/run/vmware/nestdb/nestdb-server.sock (error: 111-Connection refused)
 
2022-04-22T12:18:02.166Z S3BPKS2BME03 NSX 4752 - [nsx@6876 comp="nsx-edge s2comp="nsx-rpc" tid="4988" level="WARNING"] RpcTransport[2] Unable to connect to unix:///var/run/vmware/nestdb/nestdb-server.sock: 111-Connection refused
 
2022-04-22T12:18:02.166Z S3BPKS2BME03 NSX 4752 - [nsx@6876 comp="nsx-edge s2comp="nestdb-client" tid="4989" level="WARNING"] NestDbClient: failed to get stub to unix:///var/run/vmware/nestdb/nestdb-server.sock, retrying in 5000 ms…

 
 
In /var/log/syslog* you see lines similar to:

2022-04-26T10:28:19.204325-04:00 PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Failed to execute: rc=1, out=Traceback (most recent call last):#012  File "/usr/lib/frr/frr-reload.py", line 1524, in <module>#012    with open(filename, 'w') as fh:#012OSError: [Errno 28] No space left on device: '/config/vmware/edge/frr/reload-ELSSSB.txt'#012, err=Command '['/usr/lib/frr/frr-reload.py', '--debug', '--reload', '/config/vmware/edge/frr/frrbasecfg.txt']' returned non-zero exit status 1"

 
In /var/log/rcpm/frr-reload.log you see lines similar to:
 

2022-04-22 05:24:38,959 WARNING: frr-reload.py failed due to
b'% Nexthop interface cannot be Null0, reject or blackhole\nline 36: Failure to communicate[13] to staticd, line:  ip route #.#.#.#/22 blackhole tag 4001 nexthop-vrf default\n\n% Nexthop interface cannot be Null0, reject or blackhole\nline 40: Failure to communicate[13] to staticd, line:  ip route #.#.#.#/22 blackhole tag 4001 nexthop-vrf default\n\n' cmds on file /config/vmware/edge/frr/reload-XEZNZH.txt

 
Note: These log lines in /var/log/rcpm/frr-reload.log also appear on a healthy, working NSX-T Edge. Follow the Resolution below only if they appear together with all of the following: a large number of files in /config/vmware/edge/frr, the Edge is out of inodes, and the nestdb service is not running.
 
In /var/log/rcpm/frr-config.log* you see lines similar to:

2022-04-26T14:35:50Z PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="INFO"] "Reading the routing proto from file /var/run/vmware/edge/routing-pb.cfg"
 
2022-04-26T14:35:51Z PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="INFO"] "Inter-SR routing is enabled"
 
2022-04-26T14:35:51Z PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Unable to open FRR Config File. error(28): No space left on device"
 
2022-04-26T14:35:51Z PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Failed to save the base FRR config, "
 
2022-04-26T14:35:51Z PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Failed to copy/tar log files <type 'exceptions.IOError'> [Errno 28] No space left on device: '/config/vmware/edge/frr/frrproto.2022-04-26T10.35.51.624749.cfg'"
 
2022-04-26T14:35:51Z PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Failed to remove file /config/vmware/edge/frr/frrproto.2022-04-26T10.35.51.624749.cfg error <type 'exceptions.OSError'>"
 
2022-04-26T14:35:51Z PC1PKS2BME01 NSX 3481 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Error in applying the config to FRR"
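
The "No space left on device" errors shown in the log excerpts above can be counted rather than read line by line. This is a minimal sketch under stated assumptions: count_enospc is a hypothetical helper name, and it only handles plain-text log files (compressed rotated logs would need to be decompressed first).

```shell
# Count log lines containing the ENOSPC message across the given files.
# NOTE: count_enospc is a hypothetical helper name, not an NSX command.
count_enospc() {
    # cat merges all files so that grep -c prints one total;
    # "|| true" prevents a zero-match result from being treated as a failure.
    cat -- "$@" 2>/dev/null | grep -c 'No space left on device' || true
}

# Hypothetical usage on an Edge, as root:
#   count_enospc /var/log/syslog /var/log/rcpm/frr-config.log
```

A non-zero count alone does not confirm this issue; it should be read together with the inode, file-count, and nestdb checks described in this article.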


Environment

VMware NSX-T Data Center 3.x
VMware NSX-T Data Center

Resolution

This issue is fully resolved in NSX-T 3.0.3.1, 3.1.2.3, 3.1.3.1, 3.1.3.3 and newer, and in 3.2.0 and newer.
 
To delete the excess reload-* files, log in to the NSX-T Edge, gain root access, and enter the following command:

# time perl -e 'for(</config/vmware/edge/frr/reload-*>){((stat)[9]<(unlink))}'
 
NOTE: Do not copy and paste the above line into the CLI of the Edge. HTML to text translation may alter the text and make the command non-functional.  Type the command manually.
 
Example output (this example uses a relative path, so it must be run from within /config/vmware/edge/frr):
# time perl -e 'for(<reload-*>){((stat)[9]<(unlink))}'
 
real    0m0.008s
user    0m0.007s
sys     0m0.000s
 
NOTE: A system with a large number of files to delete will take longer than the times shown above. At least one instance took over 40 minutes to run to completion. Be patient: do not reboot the Edge or cancel the command while it is running.

NOTE: These files must be deleted before upgrading to a newer version of NSX-T. 


Workaround:
A reboot of the Edge will temporarily resolve the issue. To fully resolve it, the excess reload-#### files must be deleted manually.

Additional Information

To verify that the frr folder is using an unusually large amount of space:

  • Log into the NSX-T Edge as admin and gain root access
  • At the root cli: cd /config/vmware/edge
  • du -hs *
4.0K config.json
4.0K dns
4.9G frr  <--This is usually much smaller, measured in KB.
32K ike
1.4M lb
4.0K mdproxy
8.0K rcpm
12K reverse-proxy
180K waf

 
To verify there is a large number of files in /config/vmware/edge/frr:

  • # ls -ltr /config/vmware/edge/frr/reload-* | wc -l
1250532 <----This is usually much smaller, in the range of a few thousand.

 
Note: This will always return the number of files inside this folder. A few thousand files is fine.
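
The ls command above relies on the shell expanding reload-* into one argument per file, which may exceed the system's argument-size limit when the folder holds a very large number of files. As an alternative sketch (not the article's method), find can match the pattern itself. The function name count_reload_files is a hypothetical helper, and the sketch assumes a find that supports -maxdepth and file names without embedded newlines.

```shell
# Count regular files named reload-* directly inside the given directory.
# NOTE: count_reload_files is a hypothetical helper name, not an NSX command.
count_reload_files() {
    # find applies the name pattern internally, so no file names are
    # passed on a command line; -maxdepth 1 skips subdirectories.
    # tr strips any padding that some wc implementations add.
    find "$1" -maxdepth 1 -type f -name 'reload-*' -print | wc -l | tr -d ' '
}

# Hypothetical usage on an Edge, as root:
#   count_reload_files /config/vmware/edge/frr
```

This only counts files and deletes nothing, so it can be run before and after the cleanup to compare results.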
 
To verify whether nestdb is running:

  • # get service nestdb
Wed Apr 27 2022 UTC 20:25:21.328
Service name:      nestdb
Service state:     stopped
 
 
Impact/Risks:
While this event is occurring, the datapath through the Edge will be affected.  ESXi hosts will not be able to communicate with the Edge. North/South traffic through the Edge will be down.