Unknown Services: When running "get cluster status", several services are reported as UNKNOWN (SEARCH, APPLIANCE_PROXY, and SHA).

Article ID: 429705

Products

VMware NSX

Issue/Introduction

When running the get cluster status command from the NSX CLI, the cluster status is reported as DEGRADED. Multiple management services, including SEARCH, APPLIANCE_PROXY, and SHA, show an UNKNOWN status on one or more nodes.

Investigation of /var/log/syslog reveals the following error signatures:

  • 404 Not Found errors for SHA metrics: ShaMetricStatsServiceImpl ... Got exception when querying metric data, detail 404 Not Found .
  • Onboard failure for RPC stubs: LmMetricRpcStub Onboard fails for APH [UUID].
  • Wait thread timeouts occurring approximately 4 seconds after an onboarding response is received.
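To check quickly whether a node shows the signatures above, the log can be scanned with a short script. The following is a minimal sketch, not a supported tool: the function names and the regular expressions are illustrative assumptions built only from the fragments quoted in this article, and the full syslog line format may differ on your system.

```python
import re
from collections import Counter

# Signature fragments taken from the log excerpts quoted in this article.
# The layout of the rest of each log line is an assumption.
SIGNATURES = {
    "sha_metric_404": re.compile(r"ShaMetricStatsServiceImpl.*404 Not Found"),
    "rpc_onboard_failure": re.compile(r"LmMetricRpcStub.*Onboard fails for APH"),
}


def count_signatures(lines):
    """Return a Counter of how many lines match each known signature."""
    counts = Counter()
    for line in lines:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


def scan_file(path="/var/log/syslog"):
    """Scan a log file (hypothetical helper; reading it may require root)."""
    with open(path, errors="replace") as handle:
        return count_signatures(handle)
```

Nonzero counts for both signatures on the same node are consistent with the issue described here, but they are not conclusive on their own.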

Environment

NSX 9.0

Cause

The root cause is a race condition in the Service Health Agent (SHA) onboarding process.

  • This occurs when the Management Plane (MP) server responds to an onboarding request before the SHA sending thread has entered its "wait" state. Because the response arrives while the thread is still active, nothing is left to wake the thread once it does enter the wait state, so it remains there until it times out.
  • This failure prevents health metrics from being stored, causing subsequent status queries to fail with a 404 error and the service to report as UNKNOWN.
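The race described above is an instance of the general "lost wakeup" pattern. The sketch below is a generic illustration in Python and is not the SHA implementation; the class and method names are hypothetical. It shows a waiter that misses a response delivered before it starts waiting, and a variant that records the response in a flag so that a late waiter still sees it.

```python
import threading


class NaiveWaiter:
    """Waits for a notification but keeps no record that one arrived."""

    def __init__(self):
        self._cond = threading.Condition()

    def on_response(self):
        with self._cond:
            # If no thread is waiting yet, this notification is lost.
            self._cond.notify_all()

    def wait_for_response(self, timeout):
        with self._cond:
            # Returns False when the timeout expires without a notification.
            return self._cond.wait(timeout)


class StatefulWaiter:
    """Records the response in a flag, so early responses are not lost."""

    def __init__(self):
        self._cond = threading.Condition()
        self._responded = False

    def on_response(self):
        with self._cond:
            self._responded = True
            self._cond.notify_all()

    def wait_for_response(self, timeout):
        with self._cond:
            # Checks the flag before waiting, then again on each wakeup.
            return self._cond.wait_for(lambda: self._responded, timeout)


if __name__ == "__main__":
    naive = NaiveWaiter()
    naive.on_response()                    # response arrives "too early"
    print(naive.wait_for_response(0.2))    # times out

    stateful = StatefulWaiter()
    stateful.on_response()                 # same ordering
    print(stateful.wait_for_response(0.2)) # returns without waiting
```

Pairing the wait with a state check is the usual remedy for this class of bug. Whether the product fix takes this form is not stated in this article.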

Resolution

This issue is resolved in VCF 9.1, available at Broadcom downloads.

If you are having difficulty finding and downloading software, please review the Download Broadcom products and software KB.

Workaround:

  1. Access the affected NSX Manager via SSH as admin.
  2. Temporarily disable the remote syslog server configuration to resolve the timing issue.
  3. Restart the SHA service (this command makes changes to your system; review it carefully before running): restart service sha
  4. Confirm services return to UP and cluster returns to STABLE using get cluster status.