NSX-T Upgrade to 3.x gets stuck due to "NonConfig" and the cluster fails to come up

Article ID: 312627

Updated On:

Products

VMware NSX Networking

Issue/Introduction

Symptoms:
  • The upgrade to NSX-T 3.x gets stuck at the restore_datastore_cluster step.
  • The output of get upgrade progress-status shows that the upgrade fails at restore_datastore_cluster:
Upgrade steps:
download_os [2022-01-14 01:51:23 - 2022-01-14 01:51:48] SUCCESS
shutdown_manager [2022-01-14 01:51:57 - 2022-01-14 01:53:42] SUCCESS
install_os [2022-01-14 01:53:42 - 2022-01-14 01:54:50] SUCCESS
migrate_manager_config [2022-01-14 01:54:50 - 2022-01-14 01:54:55] SUCCESS
switch_os [2022-01-14 01:54:55 - 2022-01-14 01:55:01] SUCCESS
reboot [2022-01-14 01:55:01 - 2022-01-14 01:55:49] SUCCESS
run_migration_tool [2022-01-14 01:57:26 - 2022-01-14 02:28:33] SUCCESS
start_manager [2022-01-14 02:28:34 - 2022-01-14 02:32:49] SUCCESS
resume_other_nodes [2022-01-14 02:32:49 - 2022-01-14 02:41:20] SUCCESS
restore_datastore_cluster [2022-01-14 02:41:20 - 2022-01-14 03:07:57] FAILED
restore_datastore_cluster [2022-01-14 03:23:07 - 2022-01-14 03:34:50] FAILED
Status: Failed to restore datastore cluster. None, 500

 
  • In var/log/corfu-nonconfig/corfu.9040.log, the NSX-T Manager repeatedly fails to connect to another NSX-T Manager on TCP port 9040:
2022-01-14T06:59:48.539Z | DEBUG |                      client-16 |      o.c.r.c.NettyClientRouter | connectAsync[10.104.21.130:9040]: Channel connection failed, reconnecting...
2022-01-14T06:59:48.539Z | INFO  |                      client-16 |      o.c.r.c.NettyClientRouter | Connect Async 10.104.21.130:9040
2022-01-14T06:59:48.852Z | WARN  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 10.104.21.130:9040 but failed by timeout
2022-01-14T06:59:48.852Z | WARN  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 8664 times, systemDownHandlerTriggerLimit = 60
2022-01-14T06:59:48.852Z | INFO  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | fetchLayout: Invoking the systemDownHandler.
2022-01-14T06:59:48.852Z | WARN  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 10.104.21.130:9040 but failed by timeout
2022-01-14T06:59:48.852Z | WARN  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 8663 times, systemDownHandlerTriggerLimit = 60
2022-01-14T06:59:48.852Z | INFO  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | fetchLayout: Invoking the systemDownHandler.
2022-01-14T06:59:49.355Z | WARN  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 10.104.21.130:9040 but failed by timeout
2022-01-14T06:59:49.355Z | WARN  |                 CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 10307 times, systemDownHandlerTriggerLimit = 60
 
  • In system/ss_-ip, the TCP connections to port 9040 on the other NSX-T Manager are stuck in SYN-SENT, while the local connections are established:
tcp     SYN-SENT         0          1        10.126.195.8:51582              10.124.25.30:9040           users:(("java",pid=28844,fd=49))
tcp     ESTAB            0          0        10.126.195.8:38030              10.126.195.8:9040           users:(("java",pid=31025,fd=136))
tcp     SYN-SENT         0          1        10.126.195.8:51576              10.124.25.30:9040           users:(("java",pid=15546,fd=230))
tcp     ESTAB            0          0        10.126.195.8:9040               10.126.195.8:38030          users:(("java",pid=28844,fd=23))
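
The SYN-SENT check above can be automated when reviewing a large ss capture. The following is a minimal sketch, not a VMware tool: it assumes the column layout shown in the excerpt above (protocol, state, two queue counters, local address, peer address), and the function name stuck_peers is hypothetical.

```python
import re

# Matches ss lines laid out as in the excerpt above:
# tcp  <state>  <recv-q>  <send-q>  <local addr:port>  <peer addr:port>  ...
SS_LINE = re.compile(
    r"^tcp\s+(?P<state>\S+)\s+\d+\s+\d+\s+(?P<local>\S+)\s+(?P<peer>\S+)"
)


def stuck_peers(ss_text, port=9040):
    """Return the sorted peer addresses that have SYN-SENT connections to `port`."""
    peers = set()
    for line in ss_text.splitlines():
        match = SS_LINE.match(line.strip())
        if not match or match.group("state") != "SYN-SENT":
            continue
        # Split "address:port" on the last colon.
        host, _, peer_port = match.group("peer").rpartition(":")
        if peer_port == str(port):
            peers.add(host)
    return sorted(peers)


# The four lines quoted from system/ss_-ip in this article.
SAMPLE = """\
tcp     SYN-SENT         0          1        10.126.195.8:51582              10.124.25.30:9040           users:(("java",pid=28844,fd=49))
tcp     ESTAB            0          0        10.126.195.8:38030              10.126.195.8:9040           users:(("java",pid=31025,fd=136))
tcp     SYN-SENT         0          1        10.126.195.8:51576              10.124.25.30:9040           users:(("java",pid=15546,fd=230))
tcp     ESTAB            0          0        10.126.195.8:9040               10.126.195.8:38030          users:(("java",pid=28844,fd=23))
"""

if __name__ == "__main__":
    print(stuck_peers(SAMPLE))  # prints ['10.124.25.30']
```

Any address this returns is a peer that never answered the TCP handshake on port 9040, which is the pattern expected when a firewall silently drops the traffic.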


 


Environment

VMware NSX-T Data Center
VMware NSX-T Data Center 3.x

Cause

This issue can occur when the NSX-T Managers are assigned to different IP subnets and a firewall sits between them. If the firewall blocks TCP port 9040 between the NSX-T Managers, the corfu-nonconfig cluster fails to come up.

Resolution

Make sure that the NSX-T Managers can communicate with each other over TCP port 9040, as documented at the following URL:
https://ports.esp.vmware.com/home/NSX-T-Data-Center
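
After the firewall rule is adjusted, reachability of port 9040 can be verified with a plain TCP connect test. The following is a minimal sketch using only the Python standard library. It is not part of NSX-T: the function name can_connect and the peer addresses are hypothetical placeholders, and it assumes the script runs on a host whose traffic takes the same path through the firewall as the NSX-T Manager's.

```python
import socket


def can_connect(host, port=9040, timeout=3.0):
    """Return True if a TCP connection to host:port completes within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts (dropped SYN), refusals, and unreachable networks.
        return False


if __name__ == "__main__":
    # Hypothetical addresses: replace with the IPs of the other NSX-T Managers.
    peers = ["192.0.2.11", "192.0.2.12"]
    for peer in peers:
        status = "reachable" if can_connect(peer) else "NOT reachable"
        print(f"{peer}:9040 is {status}")
```

A timeout here corresponds to the SYN-SENT state seen in the symptoms. The test only exercises one direction, so it should be repeated from each manager's subnet toward the others.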