NSX-T Edge node status shows Unknown due to phonehome-coordinator service taking too long to start

Article ID: 381442

Products

VMware NSX-T Data Center

Issue/Introduction

  • In the NSX Manager UI, the NSX-T Edge node status shows Unknown and the controllers show as Not Available.
  • On the Edge node, AlarmsProvider cannot create a stub for the Master APH:

    2024-10-18T02:16:02.780Z edge.####.### NSX 3209 - [nsx@6876 comp="nsx-edge" subcomp="mpa-client" tid="3209" level="WARNING"] [AlarmsProvider] getMPStubs No stub present for APH (4######3-####-####-####-b###########e)
    2024-10-18T02:16:02.781Z edge.####.### NSX 3209 - [nsx@6876 comp="nsx-edge" subcomp="mpa-client" tid="3209" level="INFO"] [AlarmsProvider] MsgHandler : Invalid stub for Master APH
    2024-10-18T02:16:02.781Z edge.####.### NSX 3209 - [nsx@6876 comp="nsx-edge" subcomp="mpa-client" tid="3209" level="INFO"] [AlarmsProvider] SendRequest: Failed to send msg Master APH, Publish, type (com.vmware.nsx.monitoring.CollectorMpMsg), correlationId (), trackingIdStr (b######7-####-#####-####-5########c), ret (-1)
    2024-10-18T02:16:02.782Z edge.####.### NSX 3209 - [nsx@6876 comp="nsx-edge" subcomp="opsagent" s2comp="alarmsprovider" tid="3209" level="ERROR" errorCode="OPS60004"] failed to send message (type:com.vmware.nsx.monitoring.CollectorMpMsg) to mpa
  • The phonehome-coordinator service took a long time to come up after its last restart (about 65 minutes in this example, per "Server startup in 3918046 ms"):
    STATUS | wrapper  | 2024/09/23 06:45:30 | Launching a JVM...
    INFO   | jvm 1    | 2024/09/23 06:45:31 | WrapperManager: Initializing...
    INFO   | jvm 1    | 2024/09/23 06:45:40 | 2024-09-23T06:45:40.182Z INFO org.apache.catalina.startup.Catalina load Initialization processed in 8192 ms
    INFO   | jvm 1    | 2024/09/23 07:50:58 | 2024-09-23T07:50:58.229Z INFO org.apache.catalina.startup.Catalina start Server startup in 3918046 ms
  • CollectorMpService could not be registered because the phonehome-coordinator service took a long time to come up:
    var/log/vmware/appl-proxy-rpc.log.10.gz:2024-10-17T21:13:41.174Z a#####b.####.#####.com NSX 1709 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-rpc" tid="1735" level="ERROR" errorCode="RPC503"] RpcTransport[1]::RemoteService[vmware.nsx.monitoring.CollectorMpService] Failed to resolve service: 6-No such device or address
    var/log/vmware/appl-proxy-rpc.log.10.gz:2024-10-17T21:13:41.456Z a####b.####.####.com NSX 1709 - [nsx@6876 comp="nsx-manager" subcomp="appl-proxy" s2comp="nsx-rpc" tid="1735" level="ERROR" errorCode="RPC503"] RpcTransport[1]::RemoteService[vmware.nsx.monitoring.CollectorMpService] Failed to resolve service: 6-No such device or address
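
The startup delay can be read from the "Server startup in N ms" line in the phonehome-coordinator log excerpt above. The following is a minimal shell sketch, not an official VMware tool, that extracts that value from log lines on standard input and converts it to whole minutes. The function name startup_durations and the log path in the usage comment are hypothetical; substitute the actual phonehome-coordinator wrapper log location on the NSX Manager.

```shell
#!/bin/sh
# startup_durations (hypothetical helper name)
# Reads log lines on stdin and, for every Catalina
# "Server startup in N ms" line, prints N and N in whole minutes.
startup_durations() {
    # Keep only the millisecond value from matching lines.
    sed -n 's/.*Server startup in \([0-9][0-9]*\) ms.*/\1/p' |
    while read -r ms; do
        # Integer division: 60000 ms per minute, remainder discarded.
        echo "${ms} ms (~$(( $ms / 60000 )) min)"
    done
}

# Usage (the path is an assumption, not taken from this article):
# startup_durations < /path/to/phonehome-coordinator-wrapper.log
```

For the log line quoted in this article, the function prints "3918046 ms (~65 min)", which matches the gap between the 06:45:40 and 07:50:58 timestamps.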

Environment

VMware NSX-T Data Center

Cause

The Edge node status of Unknown is caused by an aggService heartbeat timeout, which occurs because the phonehome-coordinator service takes a long time to start.

Resolution

This is a known issue affecting NSX 3.2.1.2. It is fixed in NSX 3.2.2, as described in the release notes:

https://docs.vmware.com/en/VMware-NSX/3.2.2/rn/vmware-nsxt-data-center-322-release-notes/index.html#Release-Note-Section-8278

Workaround:
1) Restart the phonehome-coordinator service on the affected NSX Manager:
/etc/init.d/phonehome-coordinator status
/etc/init.d/phonehome-coordinator restart
/etc/init.d/phonehome-coordinator status

2) If the issue persists, restart the nsx-opsagent service on the Transport Node host or Edge:
/etc/init.d/nsx-opsagent status
/etc/init.d/nsx-opsagent restart
/etc/init.d/nsx-opsagent status
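
The status, restart, status sequence in both workaround steps can be wrapped in a small helper. This is only a sketch under stated assumptions: the function name restart_service is hypothetical, and it assumes the init scripts return a non-zero exit code when restart or status fails, which this article does not confirm. Run it as root on the node named in each step.

```shell
#!/bin/sh
# restart_service INIT_SCRIPT (hypothetical helper name)
# Shows the service status, restarts it, then shows the status again.
# Returns 2 if the init script is missing, 1 if the restart fails,
# otherwise the exit code of the final status call.
restart_service() {
    init_script=$1
    if [ ! -x "$init_script" ]; then
        echo "init script not found or not executable: $init_script" >&2
        return 2
    fi
    # State before the restart; a failing status must not stop the sequence.
    "$init_script" status || true
    "$init_script" restart || return 1
    # State after the restart; its exit code becomes the return value.
    "$init_script" status
}

# Step 1, on the affected NSX Manager:
# restart_service /etc/init.d/phonehome-coordinator
# Step 2, on the Transport Node host or Edge, only if the issue persists:
# restart_service /etc/init.d/nsx-opsagent
```

The usage lines are left commented out so that nothing is restarted until the correct node and step have been confirmed.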