NSX-T MANAGER and HTTPS services go down frequently due to incorrect NAT configuration


Article ID: 317800


Updated On:

Products

VMware NSX

Issue/Introduction

  • Access to the NSX-T manager GUI fails or is intermittent.
  • REST API calls to the NSX-T manager may fail.
  • A new NAT (SNAT or DNAT) rule was added with address: 0.0.0.0/0.
  • CPU usage on the NSX-T Manager is high.
  • In the NSX-T Manager CLI, you run the command 'get cluster status'.

You will notice the HTTPS (reverse-proxy) and MANAGER (proton) services are down at random on different managers, as in the example below:

Group Type: MANAGER
Group Status: DEGRADED
Members:
UUID FQDN IP STATUS
3a6c0842-####-####-####-##########6f manager01 192.168.110.51 UP
6dd22d42-####-####-####-##########47 manager02 192.168.110.52 UP
0b172d42-####-####-####-##########bb manager03 192.168.110.53 DOWN

Group Type: HTTPS
Group Status: DEGRADED
Members:
UUID FQDN IP STATUS
3a6c0842-####-####-####-##########6f manager01 192.168.110.51 UP
6dd22d42-####-####-####-##########47 manager02 192.168.110.52 UP
0b172d42-####-####-####-##########bb manager03 192.168.110.53 DOWN

A few minutes later, the down status may shift to a different manager:

Group Type: MANAGER
Group Status: DEGRADED
Members:
UUID FQDN IP STATUS
3a6c0842-####-####-####-##########6f manager01 192.168.110.51 UP
6dd22d42-####-####-####-##########47 manager02 192.168.110.52 DOWN
0b172d42-####-####-####-##########bb manager03 192.168.110.53 UP

Group Type: HTTPS
Group Status: DEGRADED
Members:
UUID FQDN IP STATUS
3a6c0842-####-####-####-##########6f manager01 192.168.110.51 UP
6dd22d42-####-####-####-##########47 manager02 192.168.110.52 DOWN
0b172d42-####-####-####-##########bb manager03 192.168.110.53 UP
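
The same information is available over the NSX-T REST API (which may itself be intermittent while this issue is occurring). Below is a minimal sketch in Python; it assumes the standard GET /api/v1/cluster/status endpoint with basic authentication, and the response field names used here ('detailed_cluster_status', 'groups', 'member_fqdn', ...) may differ between NSX-T versions.

# Minimal sketch: poll the NSX-T cluster status over the REST API.
# Response field names may vary between versions; adjust as needed.
import requests

NSX_MANAGER = "https://192.168.110.51"   # any reachable manager node
AUTH = ("admin", "password")             # replace with real credentials

resp = requests.get(f"{NSX_MANAGER}/api/v1/cluster/status",
                    auth=AUTH, verify=False, timeout=30)
resp.raise_for_status()

# Print every service group and its members so DEGRADED groups stand out.
for group in resp.json().get("detailed_cluster_status", {}).get("groups", []):
    print(group.get("group_type"), group.get("group_status"))
    for member in group.get("members", []):
        print("   ", member.get("member_fqdn"), member.get("member_status"))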


You may see entries in '/var/log/proton/proton-tomcat-wrapper.log' indicating a Java out-of-memory condition:

STATUS | wrapper | 2020/02/03 13:31:31 | The JVM has run out of memory. Requesting thread dump.
STATUS | wrapper | 2020/02/03 13:31:31 | Dumping JVM state.
STATUS | wrapper | 2020/02/03 13:31:31 | The JVM has run out of memory. Restarting JVM.
INFO | jvm 1 | 2020/02/03 13:31:31 | Dumping heap to /image/core/proton_oom.hprof ...
ERROR | wrapper | 2020/02/03 13:32:06 | Shutdown failed: Timed out waiting for signal from JVM.
STATUS | wrapper | 2020/02/03 13:32:06 | Dumping JVM state..
       
STATUS | wrapper | 2020/02/03 13:44:34 | The JVM has run out of memory. Restarting JVM.
INFO | jvm 2 | 2020/02/03 13:44:34 | Dumping heap to /image/core/proton_oom.hprof ...
INFO | jvm 2 | 2020/02/03 13:44:34 | Unable to create /image/core/proton_oom.hprof: File exists
ERROR | wrapper | 2020/02/03 13:45:08 | Shutdown failed: Timed out waiting for signal from JVM.
STATUS | wrapper | 2020/02/03 13:45:08 | Dumping JVM state.
ERROR | wrapper | 2020/02/03 13:45:13 | JVM did not exit on request, termination requested.
STATUS | wrapper | 2020/02/03 13:45:13 | JVM received a signal SIGKILL (9).
STATUS | wrapper | 2020/02/03 13:45:13 | JVM process is gone.
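
If you want to check the wrapper log programmatically, the sketch below simply counts the out-of-memory markers shown above; the log path and marker string are taken from this article.

# Minimal sketch: count JVM out-of-memory events in the proton wrapper log.
LOG_PATH = "/var/log/proton/proton-tomcat-wrapper.log"
MARKER = "The JVM has run out of memory"

with open(LOG_PATH, errors="replace") as log:
    hits = [line.rstrip() for line in log if MARKER in line]

print(f"{len(hits)} out-of-memory events found")
for line in hits[-5:]:    # show the most recent few
    print(line)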



Environment

VMware NSX-T Data Center

Cause

This issue occurs because of the way the manager evaluates a NAT (SNAT or DNAT) rule that contains the 0.0.0.0/0 address range.
Note: this is the full IPv4 address range.
The manager currently tries to implement each of these addresses on the loopback interface of the system.
This amount of processing causes Java to run out of memory, and the HTTPS and MANAGER services crash continuously.
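
For scale, the sketch below uses Python's standard ipaddress module to show how many addresses the 0.0.0.0/0 range covers.

# Illustration only: 0.0.0.0/0 is the entire IPv4 address space.
import ipaddress

full_range = ipaddress.ip_network("0.0.0.0/0")
print(full_range.num_addresses)   # 4294967296 addresses (2**32)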

Resolution

This issue is resolved in VMware NSX-T Data Center 3.0, available at Broadcom Downloads.


Workaround:
Do not use the range '0.0.0.0/0' when creating NAT rules; use the word 'ANY' instead.
Once the '0.0.0.0/0' address is changed or removed, all three NSX-T Managers need to be restarted at the same time.
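
To locate existing rules that still use the problematic range, the minimal sketch below queries the Policy API for the NAT rules of a Tier-1 gateway. It assumes the /policy/api/v1/infra/tier-1s/{id}/nat/USER/nat-rules endpoint and the source_network / destination_network / translated_network rule fields, which may differ in your NSX-T version; 'example-t1' and the credentials are placeholders.

# Minimal sketch: flag NAT rules on a Tier-1 gateway that use 0.0.0.0/0.
# Endpoint and field names assume the NSX-T Policy API; verify for your version.
import requests

NSX_MANAGER = "https://192.168.110.51"
AUTH = ("admin", "password")       # replace with real credentials
TIER1_ID = "example-t1"            # placeholder Tier-1 gateway ID

url = f"{NSX_MANAGER}/policy/api/v1/infra/tier-1s/{TIER1_ID}/nat/USER/nat-rules"
resp = requests.get(url, auth=AUTH, verify=False, timeout=30)
resp.raise_for_status()

for rule in resp.json().get("results", []):
    fields = ("source_network", "destination_network", "translated_network")
    bad = [f for f in fields if rule.get(f) == "0.0.0.0/0"]
    if bad:
        print(f"Rule '{rule.get('display_name')}' uses 0.0.0.0/0 in: {', '.join(bad)}")

Any rule flagged this way should be edited as described in the workaround above, and all three NSX-T Managers restarted together afterwards.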