Unknown Services: When running "get cluster status", several services are reported as UNKNOWN (SEARCH, APPLIANCE_PROXY, and SHA).

Article ID: 429705

Products

VMware NSX

Issue/Introduction

When you run the get cluster status command from the NSX CLI, the cluster status is reported as DEGRADED. Multiple management services, including SEARCH, APPLIANCE_PROXY, and SHA, show an UNKNOWN status on one or more nodes.

Investigation of /var/log/syslog reveals the following error signatures:

  • 404 Not Found errors for SHA metrics: ShaMetricStatsServiceImpl ... Got exception when querying metric data, detail 404 Not Found 
  • Onboard failure for RPC stubs: LmMetricRpcStub Onboard fails for APH [UUID] 
  • Wait thread timeouts occurring approximately 4 seconds after an onboarding response is received
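To check a node's syslog for the first two signatures above, a small scan like the following Python sketch can help. It assumes the log lines contain the quoted substrings verbatim, treats the "..." in the SHA signature as arbitrary intervening text, and uses hypothetical names (SIGNATURES, find_signatures). The wait-thread timeout signature is not matched because the article does not quote its exact log text.

```python
import re
from typing import Dict, Iterable

# Hypothetical patterns built from the error signatures quoted in this article.
# ".*" stands in for the text the article elides with "...".
SIGNATURES = {
    "sha_metric_404": re.compile(
        r"ShaMetricStatsServiceImpl.*Got exception when querying metric data, detail 404 Not Found"
    ),
    "aph_onboard_failure": re.compile(r"LmMetricRpcStub.*Onboard fails for APH"),
}


def find_signatures(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many lines match each known error signature."""
    counts = {name: 0 for name in SIGNATURES}
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

For example, `with open("/var/log/syslog") as f: print(find_signatures(f))` prints a count per signature; non-zero counts for both suggest this issue rather than an unrelated service failure.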

Environment

NSX 9.0

Cause

The root cause is a race condition in the Service Health Agent (SHA) onboarding process. It occurs when the Management Plane (MP) server responds to an onboarding request before the SHA sending thread has entered its "wait" state. Because the response arrives while the thread is still active, the thread misses it: the thread then enters the wait state and remains there until it times out. The failed onboarding prevents health metrics from being stored, so subsequent status queries fail with a 404 error and the affected services report as UNKNOWN.
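The race described above is a classic "lost wakeup". The Python sketch below is a generic illustration only, not the SHA code (the article does not describe its implementation or fix); naive_wait and fixed_wait are hypothetical names, and the 0.05 s and 0.2 s delays are arbitrary. A threading.Condition notification wakes only threads that are already waiting, so a reply delivered before wait() is lost and wait() times out. A threading.Event records the reply as state, so a thread that starts waiting late still sees it.

```python
import threading


def naive_wait(reply_first: bool, timeout: float = 0.2) -> bool:
    """Wait for a reply that is signalled only by Condition.notify_all()."""
    cond = threading.Condition()

    def reply() -> None:
        # Simulated MP response: wakes only threads that are already waiting.
        with cond:
            cond.notify_all()

    if reply_first:
        reply()  # response arrives before the sender has entered its wait state
        with cond:
            # Nothing is left to wake this thread, so wait() returns False on timeout.
            return cond.wait(timeout)
    with cond:
        # The timer's reply() can only take the lock after wait() releases it,
        # so the notification is delivered to an already-waiting thread.
        threading.Timer(0.05, reply).start()
        return cond.wait(timeout)


def fixed_wait(reply_first: bool, timeout: float = 0.2) -> bool:
    """Wait for a reply recorded in an Event flag, which persists after set()."""
    replied = threading.Event()
    if reply_first:
        replied.set()  # an early response is stored as state, not just a wake-up
    else:
        threading.Timer(0.05, replied.set).start()
    return replied.wait(timeout)  # True as soon as the flag is set
```

With a fast reply, naive_wait(True) returns False after the full timeout, mirroring the stuck onboarding thread, while fixed_wait(True) returns True immediately.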

Resolution

This is a known issue impacting VMware NSX.

Workaround 

  1. Access the affected NSX Manager via SSH as admin.
  2. Temporarily disable the remote syslog server configuration to work around the timing issue.
  3. Restart the SHA service: restart service sha
  4. Run get cluster status and confirm that the services return to UP and the cluster status returns to STABLE.

Further Assistance 

If you believe you have encountered this issue and are unable to complete the workaround, open a support case with Broadcom Support and refer to this KB article.
For more information, see Creating and managing Broadcom support cases.