New namespaces or load balance services fail to route properly in PKS environment

Products

VMware Cloud PKS

Issue/Introduction

Symptoms:

· When you create a new namespace or a load balancer service in PKS, the new networks are unreachable. A traceroute run from an external host to the load balancer's IP address shows a routing loop similar to the following:
H:\>tracert 10.49.7.7 <===== Load balancer IP
Tracing route to 10.49.7.7 over a maximum of 30 hops
1    <1 ms    <1 ms    <1 ms 10.2.13.1
2    <1 ms    <1 ms    <1 ms 10.2.154.6
3    <1 ms    <1 ms    <1 ms 10.100.254.18
4    <1 ms    <1 ms    <1 ms 10.5.2.58
5    <1 ms    <1 ms    <1 ms 10.5.2.81 <=====Physical router
6    <1 ms    <1 ms    <1 ms 172.20.108.5 <===== T0 Uplink IP
7    <1 ms    <1 ms    <1 ms 172.20.108.1 <===== Back to the Physical Gateway for the T0
8    <1 ms    <1 ms    <1 ms 172.20.108.5 <===== Back to the T0 Uplink
9     1 ms    <1 ms    <1 ms 172.20.108.1
10    <1 ms    <1 ms    <1 ms 172.20.108.5
11     1 ms    <1 ms     1 ms 172.20.108.1
12    <1 ms    <1 ms    <1 ms 172.20.108.5

· One or both edges log memory allocation errors from the frr-config subcomponent ("Exception while generating FRR Config") in the journal or in syslog. For example:
journalctl | grep "Cannot allocate memory"
Apr 14 18:42:44 HQ2-PCF-EDGEN01 frr.py[2549]: 1 2020-04-14T18:42:44Z HQ2-PCF-EDGEN01 NSX 2549 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Exception while generating FRR Config: ['Traceback (most recent call last):\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 356, in apply_config_to_frr\n    ret = frr_cfg.push_to_frr()\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 288, in push_to_frr\n    if self._push_to_frr():\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 284, in _push_to_frr\n    return self.run_shell_command(cmdline)\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 205, in run_shell_command\n    close_fds=True)\n', ' File "/usr/lib/python2.7/subprocess.py", line 567, in check_output\n    process = Popen(stdout=PIPE, *popenargs, **kwargs)\n', ' File "/usr/lib/python2.7/subprocess.py", line 711, in __init__\n    errread, errwrite)\n', ' File "/usr/lib/python2.7/subprocess.py", line 1235, in _execute_child\n    self.pid = os.fork()\n', 'OSError: [Errno 12] Cannot allocate memory\n']"

tail syslog -f | grep frr-con
<179>1 2020-04-23T18:50:01.044136+00:00 HQ2-PCF-EDGEN02 NSX 2550 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Exception while generating FRR Config: ['Traceback (most recent call last):\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 356, in apply_config_to_frr\n    ret = frr_cfg.push_to_frr()\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 288, in push_to_frr\n    if self._push_to_frr():\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 284, in _push_to_frr\n    return self.run_shell_command(cmdline)\n', ' File "/opt/vmware/nsx-edge/bin/frr.py", line 205, in run_shell_command\n    close_fds=True)\n', ' File "/usr/lib/python2.7/subprocess.py", line 567, in check_output\n    process = Popen(stdout=PIPE, *popenargs, **kwargs)\n', ' File "/usr/lib/python2.7/subprocess.py", line 711, in __init__\n    errread, errwrite)\n', ' File "/usr/lib/python2.7/subprocess.py", line 1235, in _execute_child\n    self.pid = os.fork()\n', 'OSError: [Errno 12] Cannot allocate memory\n']
<179>1 2020-04-23T18:50:01.044362+00:00 HQ2-PCF-EDGEN02 NSX 2550 SYSTEM [nsx@6876 comp="nsx-edge" subcomp="frr-config" username="frr" level="ERROR"] "Error in applying the config to FRR"

· The /var/log partition on one or more edges is currently full, or it was full at some point and the edge has not been rebooted since space was freed. For example:
root@edge01:~# df -h
Filesystem                   Size  Used Avail Use% Mounted on
udev                         4.4G     0  4.4G   0% /dev
tmpfs                        1.6G  8.8M  1.6G   1% /run
/dev/sda3                     19G  3.5G   15G  20% /
tmpfs                        7.9G  137M  7.7G   2% /dev/shm
tmpfs                        5.0M     0  5.0M   0% /run/lock
tmpfs                        7.9G     0  7.9G   0% /sys/fs/cgroup
/dev/sda1                    945M  8.3M  872M   1% /boot
/dev/mapper/nsx-tmp          3.7G  8.4M  3.5G   1% /tmp
/dev/mapper/nsx-config__bak   19G  103M   18G   1% /config
/dev/mapper/nsx-image         19G   44M   18G   1% /image
/dev/mapper/nsx-var+log       16G   15G  297M  99% /var/log

Note: If the /var/log partition is currently full, syslog cannot record new entries, so the errors may not appear there. To see them, either watch the journal while creating a new namespace, which triggers the errors again, or free up space first.

· Running get route from the VRF of the stateful T1 router does not show a route for the subnet newly created for the namespace or load balancer service.
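
The symptom checks above can be sketched as small shell helpers. This is only an illustration, not NSX-T or PKS tooling: the function names are hypothetical, and the log match assumes each log entry arrives on a single line carrying both the frr-config tag and the error text.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not NSX-T or PKS tooling.

# Print each hop address that shows up a second time in tracert/traceroute
# output read from stdin; repeated hops suggest a routing loop.
find_repeated_hops() {
  awk '
    $1 ~ /^[0-9]+$/ {                      # hop lines begin with a hop number
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/) {
          if (++seen[$i] == 2) print $i    # report once, on the first repeat
          break
        }
      }
    }'
}

# Count log lines on stdin that carry both the frr-config subcomponent tag
# and the fork failure text. Prints 0 when nothing matches.
count_frr_alloc_errors() {
  grep 'subcomp="frr-config"' | grep -c 'Cannot allocate memory' || true
}

# Read `df -P` output on stdin and print the Use% value (digits only)
# for the filesystem mounted at the path given as $1.
usage_pct_for_mount() {
  awk -v mnt="$1" 'NR > 1 && $NF == mnt { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# Example use (run each where the respective command is available):
#   traceroute -n 10.49.7.7 | find_repeated_hops
#   journalctl --no-pager | count_frr_alloc_errors
#   df -P | usage_pct_for_mount /var/log
```

Any address printed by find_repeated_hops, a non-zero count from count_frr_alloc_errors, or a /var/log usage value close to 100 matches the corresponding symptom described above.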

Environment

VMware PKS 1.x

Cause

Once the /var/log directory is full, the logs of the nsx-edge-rcpm-python process and its child processes remain in memory because they cannot be written to disk. For every config push, nsx-edge-rcpm-python (frr.py) spawns a child subprocess (frr-reload.py), which in turn spawns a few child subprocesses of its own. Each subprocess requests as much memory as its parent process is consuming, which is a large amount because of the logs that could not be written to /var/log. The OS denies the request, and the "Cannot allocate memory" error is logged.
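
One way to observe the memory held by the parent process is to read its resident set size from /proc. The sketch below is illustrative only: the helper name is hypothetical, and matching the process with pgrep -f on the name used in this article is an assumption that has not been verified on an edge node.

```shell
#!/bin/sh
# Sketch only: helper name is hypothetical.

# Print the resident set size in kB from /proc/<pid>/status content on stdin.
rss_kb_from_status() {
  awk '$1 == "VmRSS:" { print $2 }'
}

# Example use on the edge (the process name match is an assumption):
#   for pid in $(pgrep -f nsx-edge-rcpm-python); do
#     printf '%s %s kB\n' "$pid" "$(rss_kb_from_status < "/proc/$pid/status")"
#   done
```

A value that keeps growing across config pushes while /var/log is full would be consistent with the cause described above.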

Resolution

This issue is resolved in VMware NSX-T Data Center 2.5.1, available at VMware Downloads.

Workaround:

To work around this issue, free up disk space in the /var/log directory. So far, /var/log has only been seen to fill up because of a known issue in which an expired root password prevents log rotation (since fixed in NSX-T 2.5.1). For more information, see the VMware KB article NSX-T Manager and Edge node log rotate has stopped (76114).

To free up space, do the following:

1. SSH to the NSX-T Edge VM.

2. Activate the bash shell by running st en and entering the root user password when prompted.

3. List the contents of the log directory sorted by size by running ls -larSh /var/log. With these options the largest files appear at the bottom of the listing.

4. Remove the largest syslog file, which is typically syslog.1. The remaining syslog files can stay. For example: rm /var/log/syslog.1

5. Follow the remaining steps in NSX-T Manager and Edge node log rotate has stopped (76114) to fix the root password expiration.

6. Reboot the appliance to reclaim the memory held by the subprocesses. If a reboot is not possible right away, restart the rcpm process by running service nsx-edge-rcpm-python restart from the bash shell on the edge, then reboot at the next available maintenance window.
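
The file-selection part of the steps above can be sketched as a small shell helper. The function name is hypothetical, and the destructive commands are left commented out so that nothing is removed or restarted without review.

```shell
#!/bin/sh
# Sketch only: helper name is hypothetical.

# Print the path of the largest syslog* file in the directory given as $1.
largest_syslog() {
  # ls -S sorts by size, largest first; head keeps only the first entry.
  ls -dS "$1"/syslog* 2>/dev/null | head -n 1
}

# Example use from the root shell on the edge:
#   target=$(largest_syslog /var/log)
#   echo "Largest syslog file: $target"
#   rm "$target"                              # step 4
#   service nsx-edge-rcpm-python restart      # step 6, if a reboot must wait
#   df -h /var/log                            # confirm that space was freed
```

Printing the candidate before removing it gives a chance to confirm that it is a rotated syslog file, such as syslog.1, and not a file that should be kept.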