ESXi host status reported as Unknown after upgrade from NSX 3.2 to 4.2.1.3.

Products

VMware NSX

Issue/Introduction

The upgrade from NSX version 3.2.3 to 4.2.1.3 completed without any errors reported during the process.
Upgrade pre-checks for the ESXi transport nodes did not flag any issues.
However, after the upgrade, the transport node status for all hosts in the cluster is displayed as "Unknown."
The environment has a single NSX Manager appliance deployed in the small form factor.
The "View Details" section for the ESXi hosts shows the "Controller Connectivity" status as "Unknown." In the Monitor section, the Agent Status is not reported.

On the ESXi hosts, running the nsxcli commands "get managers" and "get controllers" lists all of the configured manager nodes.
The manager and controller connections are reported as active and connected.
The NSX Manager is reachable from the ESXi hosts, and jumbo packet connectivity also works as expected.
Running esxcli network ip connection list | grep -iE "1234|1235" reports the connections on ports 1234 and 1235 as ESTABLISHED.

[ESXi] esxcli network ip connection list | grep -iE "1234|1235"
tcp         0       0  192.###.###.11:50475              192.###.###.1:1235    ESTABLISHED    264890  newreno  nsx-proxy
tcp         0       0  192.###.###.12:19099              192.###.###.1:1234    ESTABLISHED    264890  newreno  nsx-proxy
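
The port check above can also be scripted when many hosts need to be verified. The following is a minimal sketch, assuming the esxcli output has already been captured as text and that the columns appear in the order shown above. The function name established_ports and the 192.0.2.x sample addresses are placeholders for illustration, not part of any NSX tooling.

```python
def established_ports(esxcli_output, ports=(1234, 1235)):
    """Return the subset of `ports` that have an ESTABLISHED connection."""
    found = set()
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Assumed column order: proto, recv-q, send-q, local, foreign, state, ...
        if len(fields) < 6 or fields[5] != "ESTABLISHED":
            continue
        remote_port = fields[4].rsplit(":", 1)[-1]
        if remote_port.isdigit() and int(remote_port) in ports:
            found.add(int(remote_port))
    return found


# Placeholder sample in the same layout as the output shown above.
sample = """\
tcp  0  0  192.0.2.11:50475  192.0.2.1:1235  ESTABLISHED  264890  newreno  nsx-proxy
tcp  0  0  192.0.2.12:19099  192.0.2.1:1234  ESTABLISHED  264890  newreno  nsx-proxy
"""

missing = {1234, 1235} - established_ports(sample)
print("missing ports:", sorted(missing))
```

An empty "missing ports" list corresponds to the healthy connectivity described in this article, which is why the Unknown status cannot be explained by a network problem.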

Rebooting the NSX Manager and the ESXi hosts does not update the status of the nodes.
Running the command "get transport-nodes status" on the NSX Manager reports the Connection-State as OPENED.

MANAGER> get transport-nodes status
TransportNode-ID                       Remote-Address                                   Controller                             Manager         SSL-Enabled  Connection-State  Supported-Versions   Node-Type    Name
212############################f6dab   192.###.###.11:57658                             aa##########################ca3   192.###.###.1           true         OPENED            [4.1, 4.0, 3.2]      ESXi      ESXi1
123############################faa82   192.###.###.12:50475                             aa##########################ca3   192.###.###.1           true         OPENED            [4.1, 4.0, 3.2]      ESXi      ESXi2
fec############################259f0   192.###.###.13:59819                             aa##########################ca3   192.###.###.1           true         OPENED            [4.1, 4.0, 3.2]      ESXi      ESXi3
495############################e5e97   192.###.###.14:50166                             aa##########################ca3   192.###.###.1           true         OPENED            [4.1, 4.0, 3.2]      ESXi      ESXi4
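
To confirm quickly that every transport node is in the OPENED state, the table above can be parsed as text. This is a rough sketch that assumes the column order shown above (Connection-State in the sixth column, Name in the last). The function name nodes_not_opened and the sample identifiers are hypothetical.

```python
def nodes_not_opened(status_output):
    """Return (name, state) pairs for rows whose Connection-State is not OPENED."""
    result = []
    for line in status_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 7 or fields[0] == "TransportNode-ID":
            continue
        # Assumed layout: state is the 6th column, name is the last column.
        state, name = fields[5], fields[-1]
        if state != "OPENED":
            result.append((name, state))
    return result


# Placeholder rows in the same layout as the output shown above.
sample = """\
TransportNode-ID  Remote-Address  Controller  Manager  SSL-Enabled  Connection-State  Supported-Versions  Node-Type  Name
node-id-1  192.0.2.11:57658  ctrl-id  192.0.2.1  true  OPENED  [4.1, 4.0, 3.2]  ESXi  ESXi1
node-id-2  192.0.2.12:50475  ctrl-id  192.0.2.1  true  OPENED  [4.1, 4.0, 3.2]  ESXi  ESXi2
"""

print(nodes_not_opened(sample))
```

An empty list matches the symptom in this article: the connections are open even though the UI reports the hosts as Unknown.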

On the NSX Manager appliance, running "get cluster status" shows the MONITORING group as UNAVAILABLE, with its member in the DOWN state.

Group Type: MONITORING
Group Status: UNAVAILABLE

Members:
    UUID                                       FQDN                                       IP               IPv6                                             STATUS
    ade#############################db98       Manager_FQDN                               192.###.###.1    -                                                DOWN
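
The group check above can be automated by pairing each "Group Type" line with the "Group Status" line that follows it. The sketch below assumes the cluster status output is available as text in the layout shown above. The function name group_statuses is hypothetical.

```python
def group_statuses(cluster_output):
    """Map each 'Group Type' to the 'Group Status' reported after it."""
    statuses = {}
    current = None
    for raw in cluster_output.splitlines():
        line = raw.strip()
        if line.startswith("Group Type:"):
            current = line.split(":", 1)[1].strip()
        elif line.startswith("Group Status:") and current is not None:
            statuses[current] = line.split(":", 1)[1].strip()
            current = None
    return statuses


# Sample taken from the output shown above.
sample = """\
Group Type: MONITORING
Group Status: UNAVAILABLE
"""

unavailable = [g for g, s in group_statuses(sample).items() if s == "UNAVAILABLE"]
print(unavailable)
```

For the sample above, the list contains only the MONITORING group, which is the group served by the phonehome-coordinator service.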

After the upgrade, the phonehome-coordinator service (responsible for the Monitoring group) on the NSX Manager appliance reports its status as stopped.
Restarting the service with /etc/init.d/phonehome-coordinator restart brings the service up, but the Monitoring group is still reported as down.
The phonehome-coordinator (Monitoring) service fails to initialize because of an out-of-memory (OOM) condition, which causes the service to go down.
Log entries similar to the following example can be found in /var/log/phonehome-coordinator/phonehome-coordinator-tomcat-wrapper.log:

| java.lang.OutOfMemoryError: Java heap space
| The JVM has run out of memory.  Requesting thread dump.
| Dumping JVM state.
| Dumping heap to /image/core/phc_oom.hprof ...
| Unable to create /image/core/phc_oom.hprof: File exists
| Terminating due to java.lang.OutOfMemoryError: Java heap space
| The JVM has run out of memory.  Requesting thread dump.
| Dumping JVM state.
| JVM exited unexpectedly.
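
The log signature above can be searched for with a short script. This is a minimal sketch that assumes it is run where the wrapper log is readable. The marker strings are copied from the example entries above, and the function name find_oom_lines is hypothetical.

```python
import os

LOG_PATH = "/var/log/phonehome-coordinator/phonehome-coordinator-tomcat-wrapper.log"
# Marker strings taken from the example log entries above.
MARKERS = ("java.lang.OutOfMemoryError", "JVM exited unexpectedly")


def find_oom_lines(lines):
    """Return the lines that contain any of the OOM-related markers."""
    return [ln.rstrip("\n") for ln in lines if any(m in ln for m in MARKERS)]


# Only read the log if it exists on this machine.
if os.path.exists(LOG_PATH):
    with open(LOG_PATH, errors="replace") as fh:
        hits = find_oom_lines(fh)
    print(f"{len(hits)} matching line(s) in {LOG_PATH}")
```

Any matches indicate that the service has hit the heap exhaustion described in this article at least once.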

After increasing the form factor of the NSX manager from small to medium, the transport node status is reflected correctly as Up.

Environment

VMware NSX

Cause

This issue is commonly observed when the NSX Manager is deployed in the small form factor, where limited memory may contribute to the failure.
The memory shortage can trigger an out-of-memory (OOM) or race condition that prevents the phonehome-coordinator (Monitoring) service from starting during initialization.
As a result, the service crashes during startup, and the Monitoring group does not come up until the issue is resolved.

Resolution

To fix the underlying issue:

If the NSX Manager VM was deployed in the small form factor (not supported for production environments):

Resize the NSX Manager using any one of the three options described in: Resize an NSX Manager Node
For NSX Manager VM and host requirements, refer to: NSX Manager VM and Host Transport Node System Requirements

If the NSX Manager VM was deployed in the medium or a larger form factor:

Reboot the NSX Manager to clear the underlying out-of-memory condition.