- Upgrading NSX-T to 3.x gets stuck at the restore_datastore_cluster step.
- The output of get upgrade progress-status shows that the upgrade fails at restore_datastore_cluster:
Upgrade steps:
download_os [2022-01-14 01:51:23 - 2022-01-14 01:51:48] SUCCESS
shutdown_manager [2022-01-14 01:51:57 - 2022-01-14 01:53:42] SUCCESS
install_os [2022-01-14 01:53:42 - 2022-01-14 01:54:50] SUCCESS
migrate_manager_config [2022-01-14 01:54:50 - 2022-01-14 01:54:55] SUCCESS
switch_os [2022-01-14 01:54:55 - 2022-01-14 01:55:01] SUCCESS
reboot [2022-01-14 01:55:01 - 2022-01-14 01:55:49] SUCCESS
run_migration_tool [2022-01-14 01:57:26 - 2022-01-14 02:28:33] SUCCESS
start_manager [2022-01-14 02:28:34 - 2022-01-14 02:32:49] SUCCESS
resume_other_nodes [2022-01-14 02:32:49 - 2022-01-14 02:41:20] SUCCESS
restore_datastore_cluster [2022-01-14 02:41:20 - 2022-01-14 03:07:57] FAILED
restore_datastore_cluster [2022-01-14 03:23:07 - 2022-01-14 03:34:50] FAILED
Status: Failed to restore datastore cluster. None, 500
- In var/log/corfu-nonconfig/corfu.9040.log, the NSX-T Manager repeatedly fails to connect to another NSX-T Manager on port 9040:
2022-01-14T06:59:48.539Z | DEBUG | client-16 | o.c.r.c.NettyClientRouter | connectAsync[10.##.##.##:9040]: Channel connection failed, reconnecting...
2022-01-14T06:59:48.539Z | INFO | client-16 | o.c.r.c.NettyClientRouter | Connect Async 10.##.##.##:9040
2022-01-14T06:59:48.852Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 10.##.##.##:9040 but failed by timeout
2022-01-14T06:59:48.852Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 8664 times, systemDownHandlerTriggerLimit = 60
2022-01-14T06:59:48.852Z | INFO | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | fetchLayout: Invoking the systemDownHandler.
2022-01-14T06:59:48.852Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 10.##.##.##:9040 but failed by timeout
2022-01-14T06:59:48.852Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 8663 times, systemDownHandlerTriggerLimit = 60
2022-01-14T06:59:48.852Z | INFO | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | fetchLayout: Invoking the systemDownHandler.
2022-01-14T06:59:49.355Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Tried to get layout from 10.##.##.##:9040 but failed by timeout
2022-01-14T06:59:49.355Z | WARN | CorfuRuntime-0 | o.corfudb.runtime.CorfuRuntime | Couldn't connect to any up-to-date layout servers, retrying in PT1S, Retried 10307 times, systemDownHandlerTriggerLimit = 60
- In system/ss_-ip, it can be seen that TCP connections to port 9040 are stuck in SYN-SENT:
tcp SYN-SENT 0 1 10.##.##.##:51582 10.##.##.##:9040 users:(("java",pid=28844,fd=49))
tcp ESTAB 0 0 10.##.##.##:38030 10.##.##.##:9040 users:(("java",pid=31025,fd=136))
tcp SYN-SENT 0 1 10.##.##.##:51576 10.##.##.##:9040 users:(("java",pid=15546,fd=230))
tcp ESTAB 0 0 10.##.##.##:9040 10.##.##.##:38030 users:(("java",pid=28844,fd=23))
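
The SYN-SENT check above can be scripted when reviewing a long socket listing. The following is a minimal sketch, not an official tool: it assumes the listing uses the same column order as the excerpt above (protocol, state, Recv-Q, Send-Q, local address, peer address, process). The function name `stuck_connections` and the sample IP addresses are hypothetical, since the addresses in this article are masked.

```python
CORFU_PORT = 9040  # port seen in the log and socket excerpts above


def stuck_connections(ss_text, port=CORFU_PORT):
    """Return (local, peer) pairs for sockets in SYN-SENT toward `port`.

    Assumes whitespace-separated columns in the order:
    protocol, state, Recv-Q, Send-Q, local address, peer address, process.
    """
    stuck = []
    for line in ss_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or unexpected lines
        state, local, peer = fields[1], fields[4], fields[5]
        # The peer port is everything after the last colon.
        if state == "SYN-SENT" and peer.rsplit(":", 1)[-1] == str(port):
            stuck.append((local, peer))
    return stuck


if __name__ == "__main__":
    # Hypothetical addresses standing in for the masked ones.
    sample = (
        'tcp SYN-SENT 0 1 10.0.0.1:51582 10.0.0.2:9040 users:(("java",pid=28844,fd=49))\n'
        'tcp ESTAB 0 0 10.0.0.1:38030 10.0.0.3:9040 users:(("java",pid=31025,fd=136))\n'
        'tcp SYN-SENT 0 1 10.0.0.1:51576 10.0.0.2:9040 users:(("java",pid=15546,fd=230))\n'
        'tcp ESTAB 0 0 10.0.0.1:9040 10.0.0.3:38030 users:(("java",pid=28844,fd=23))\n'
    )
    for local, peer in stuck_connections(sample):
        print(f"SYN-SENT: {local} -> {peer}")
```

Any pair this returns is a connection attempt whose SYN has not been answered, which matches the connection timeouts reported in the Corfu log.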