Embedded orchestrator is unresponsive and shows endpoint 'not found' error message

Article ID: 393612

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • The embedded Orchestrator instance is unresponsive: its trust is broken and the endpoint is reported as 'not found'.
  • You see an error message similar to:
    930003: Automation Orchestrator Endpoint with id 'xxxx-example-UUID' not found
  • The log on the Automation appliance at /services-logs/prelude/tango-vro-gateway-app/file-logs/tango-vro-gateway-app.log contains entries similar to:
    [EMBEDDED_ENDPOINT] Error while trying to retrieve uuid for endpoint embedded-VRO
  • The Integrations tab for the embedded Orchestrator shows:
    PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
  • The log at /services-logs/prelude/postgres-<exampleUUID>/file-logs/postgres.log contains:
    FATAL:  the database system is shutting down
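
The log checks above can be sketched as a small script. This is a minimal illustration only, not a supported Broadcom tool: the function names and signature labels are hypothetical, and the patterns are copied from the messages quoted in this article, so real log lines may differ in detail.

```python
import re
from pathlib import Path

# Message signatures quoted in this article.
# The dictionary keys are hypothetical labels chosen for this sketch.
SIGNATURES = {
    "endpoint_not_found": re.compile(
        r"930003: Automation Orchestrator Endpoint with id '[^']*' not found"
    ),
    "embedded_uuid_error": re.compile(
        r"\[EMBEDDED_ENDPOINT\] Error while trying to retrieve uuid for endpoint"
    ),
    "pkix_failure": re.compile(r"PKIX path building failed"),
    "postgres_shutting_down": re.compile(
        r"FATAL:\s+the database system is shutting down"
    ),
}


def find_symptoms(lines):
    """Return {label: count} for every signature found in the given lines."""
    counts = {}
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] = counts.get(label, 0) + 1
    return counts


def scan_log(path):
    """Scan one log file, replacing any undecodable bytes instead of failing."""
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return find_symptoms(handle)
```

For example, calling scan_log on the tango-vro-gateway-app.log path listed above returns the number of matching lines per signature. An empty result means none of the quoted messages were found in that file.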

Environment

Aria Automation 8.x

Aria Orchestrator 8.x

Cause

In VMware Aria Automation (formerly vRealize Automation), X509TrustManagerImpl is the Java class that manages SSL/TLS trust and verifies the authenticity of certificates, which is required for secure communication with the Aria Automation server.

These certificates are stored in the PostgreSQL database.

If vPostgres clustering is broken, transactions are not replicated.

The cluster breaks down because of network isolation, which produces a split-brain condition: one node believes it holds the Master role when another node actually holds it, so VCO-APP pod transactions to the database are not synchronized correctly.
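
The split-brain condition described above comes down to more than one node claiming the Master role at the same time. The following is a minimal sketch of that check, assuming the role each node reports has already been collected by some other means. The node names, the role strings, and both function names are hypothetical. How to collect the roles is not covered in this article.

```python
def claimed_masters(reported_roles):
    """Return the sorted names of nodes that report themselves as master.

    reported_roles maps a node name to the role that node reports for
    itself, for example {"node-1": "master", "node-2": "replica"}.
    """
    return sorted(
        node
        for node, role in reported_roles.items()
        if role.strip().lower() == "master"
    )


def is_split_brain(reported_roles):
    """Return True when more than one node claims the master role."""
    return len(claimed_masters(reported_roles)) > 1
```

A healthy cluster yields exactly one name from claimed_masters. Two or more names match the situation described above, where writes accepted by the wrong node are never replicated.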

Resolution

Orchestrator instances showing the errors above indicate Postgres split-brain activity. To resolve it, reset the master of the internal Postgres service as outlined in the Broadcom documentation:

https://knowledge.broadcom.com/external/article/317721/network-isolation-causes-splitbrain-scen.html