PSC Health Status changes from Green to Red in the vCenter due to vmware-postgres-archiver service

Products

VMware vCenter Server

Issue/Introduction

The pschealth health status changes to red because the vmware-postgres-archiver service is reported as unhealthy.

On the vCenter Server, the journalctl.log file shows the following events:

MM DD HH:MM:SS <vCenter server fqdn> vpxd[<process ID>]: Event [<Event ID>] [1-1] [YYYY-MM-DDTHH:MM:SS.SSSSSZ] [vim.event.HealthStatusChangedEvent] [info] [Vmonuser] [] [<Event ID>] [pschealth status changed from green to red]
MM DD HH:MM:SS <vCenter server fqdn> vpxd[<process ID>]: Event [<Event ID>] [1-1] [YYYY-MM-DDTHH:MM:SS.SSSSSZ] [vim.event.EventEx] [info] [] [] [<Event ID>] [Alarm 'Health status changed alarm' on Datacenters triggered by event <Event ID> 'pschealth status changed from green to red']
MM DD HH:MM:SS <vCenter server fqdn> vpxd[<process ID>]: Event [<Event ID>] [1-1] [YYYY-MM-DDTHH:MM:SS.SSSSSZ] [vim.event.HealthStatusChangedEvent] [info] [Vmonuser] [] [<Event ID>] [pschealth status changed from red to green]
MM DD HH:MM:SS <vCenter server fqdn> vpxd[<process ID>]: Event [<Event ID>] [1-1] [YYYY-MM-DDTHH:MM:SS.SSSSSZ] [vim.event.EventEx] [info] [] [] [<Event ID>] [Alarm 'Health status changed alarm' on Datacenters triggered by event <Event ID> 'pschealth status changed from red to green']

On the vCenter Server, /var/log/vmware/vmon/vmon.log shows:

YYYY-MM-DDTHH:MM:SS.SSSSSZ Wa(03) host-####<vmware-postgres-archiver> Service api-health command's stderr: Service health xml file is stale. Current time: 5978037, expiration time: 5978030. Treating service health state RED.
YYYY-MM-DDTHH:MM:SS.SSSSSZ Wa(03)+ host-####
YYYY-MM-DDTHH:MM:SS.SSSSSZ Wa(03) host-#### <vmware-postgres-archiver> Service api-health command's stderr: <?xml version="1.0" encoding="UTF-8" standalone="yes"?><healthStatus schemaVersion="1.0" xmlns="http://www.vmware.com/cis/cm/common/jaxb/healthstatus"><status>GREEN</status><message messageKey="cis.vmware-postgres-archiver.health.healthy" defaultMessage="VMware Archiver service is healthy."></message><expirationMonoSec>####</expirationMonoSec></healthStatus>
YYYY-MM-DDTHH:MM:SS.SSSSSZ Wa(03) host-#### <vmware-postgres-archiver> Health of service failed. Health data: {"localizable_msgs": [{"id": "com.vmware.vmon.svc_health_timeout", "default_message": "Service is in an unhealthy state.", "args": []}], "_service_name": "vmware-postgres-archiver", "_trigger_threaddump_on_failure": 0}
YYYY-MM-DDTHH:MM:SS.SSSSSZ In(05) host-#### <vmware-postgres-archiver> Recover from service api health check failure. Fail count 0
YYYY-MM-DDTHH:MM:SS.SSSSSZ In(05) host-#### <vmware-postgres-archiver> Restarting service.
YYYY-MM-DDTHH:MM:SS.SSSSSZ In(05) host-#### <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata vmware-postgres-archiver,UNHEALTHY,HEALTHY,1
YYYY-MM-DDTHH:MM:SS.SSSSSZ Wa(03) host-#### <pschealth> Health of service failed. Health data:
YYYY-MM-DDTHH:MM:SS.SSSSSZ In(05) host-#### <pschealth> Recover from service api health check failure. Fail count 0
YYYY-MM-DDTHH:MM:SS.SSSSSZ In(05) host-#### <event-pub> Constructed command: /usr/bin/python /usr/lib/vmware-vmon/vmonEventPublisher.py --eventdata pschealth,UNHEALTHY,HEALTHY,1
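
To check whether a given vmon.log contains this signature, the entries can be filtered from the appliance shell. The following is a minimal sketch, not an official tool: the function name find_archiver_health_failures is hypothetical, and it assumes the log messages match the wording of the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: print vmon.log lines where the vmware-postgres-archiver
# health file was reported stale or where vMon restarted the service.
# Assumes the message text matches the excerpt shown in this article.
find_archiver_health_failures() {
    grep 'vmware-postgres-archiver' "$1" |
        grep -E 'Service health xml file is stale|Restarting service'
}
```

For example, find_archiver_health_failures /var/log/vmware/vmon/vmon.log lists the matching lines with their timestamps, which can then be compared with the times of the pschealth events.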

On the vCenter Server, /var/log/vmware/envoy/envoy.log shows:

YYYY-MM-DDTHH:MM:SS.SSSSSZ info envoy[####] [Originator@6876 sub=Default] YYYY-MM-DDTHH:MM:SS.SSSSSZ POST /sdk 500 via_upstream - 540 585 - 5034 5033 0 <Monitoring Solution IP>:33868 HTTP/1.1 TLSv1.2 <vCenter server IP>:443 127.0.0.1:38422 HTTP/2 - 127.0.0.1:8085 - "ns1:Login>
YYYY-MM-DDTHH:MM:SS.SSSSSZ info envoy[####] [Originator@6876 sub=Default] YYYY-MM-DDTHH:MM:SS.SSSSSZ POST /sdk 500 via_upstream - 540 585 - 4035 4034 0 <Monitoring Solution IP>:48724 HTTP/1.1 TLSv1.2 <vCenter server IP>:443 127.0.0.1:38422 HTTP/2 - 127.0.0.1:8085 - "ns1:Login>
YYYY-MM-DDTHH:MM:SS.SSSSSZ info envoy[####] [Originator@6876 sub=Default] YYYY-MM-DDTHH:MM:SS.SSSSSZ POST /sdk 500 via_upstream - 540 585 - 5026 5026 0 <Monitoring Solution IP>:35214 HTTP/1.1 TLSv1.2 <vCenter server IP>:443 127.0.0.1:38436 HTTP/2 - 127.0.0.1:8085 - "ns1:Login>
YYYY-MM-DDTHH:MM:SS.SSSSSZ info envoy[####] [Originator@6876 sub=Default] YYYY-MM-DDTHH:MM:SS.SSSSSZ POST /sdk 500 via_upstream - 540 585 - 5034 5034 0 <Monitoring Solution IP>:50060 HTTP/1.1 TLSv1.2 <vCenter server IP>:443 127.0.0.1:38422 HTTP/2 - 127.0.0.1:8085 - "ns1:Login>
YYYY-MM-DDTHH:MM:SS.SSSSSZ info envoy[####] [Originator@6876 sub=Default] YYYY-MM-DDTHH:MM:SS.SSSSSZ POST /sdk 500 via_upstream - 540 585 - 4035 4034 0 <Monitoring Solution IP>:36568 HTTP/1.1 TLSv1.2 <vCenter server IP>:443 127.0.0.1:38436 HTTP/2 - 127.0.0.1:8085 - "ns1:Login>

Cause

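This issue occurs when a third-party monitoring tool or backup solution makes high-frequency API calls to the vCenter Server. In envoy.log, this appears as repeated POST /sdk login requests from the Monitoring Solution IP that return HTTP 500.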

Resolution

To resolve the issue, identify the Monitoring Solution IP in envoy.log, determine which product it belongs to, and work with that vendor to reduce or manage the sessions it opens against the vCenter Server, or to investigate further.
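
One way to find the source IP is to count the failed /sdk requests per client in envoy.log. The following is a minimal sketch under stated assumptions: the function name count_sdk_500_by_client is hypothetical, the log lines are assumed to follow the layout of the envoy.log excerpt above (the client IP:port is the first address-and-port token on the line), and only IPv4 clients are matched.

```shell
#!/bin/sh
# Hypothetical helper: count "POST /sdk 500" entries per client IP.
# Assumes the first IPv4:port token on each matching line is the client,
# as in the envoy.log excerpt shown in this article.
count_sdk_500_by_client() {
    awk '
        / POST \/sdk 500 / {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+:[0-9]+$/) {
                    split($i, parts, ":")   # drop the ephemeral port
                    hits[parts[1]]++
                    break                   # later tokens are vCenter-side addresses
                }
            }
        }
        END { for (ip in hits) print hits[ip], ip }
    ' "$1" | sort -rn
}
```

Running count_sdk_500_by_client /var/log/vmware/envoy/envoy.log prints one line per client with the request count first, highest count at the top. The IP with a disproportionately high count is the likely candidate to raise with the vendor.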

After the vendor has resolved the issue on their side, restart the vmware-postgres-archiver service to clear the degraded state:

service-control --stop vmware-postgres-archiver
service-control --start vmware-postgres-archiver

Workaround

If the third-party tool cannot be adjusted immediately, temporarily block the source IP at the network firewall or via the vCenter Appliance Firewall (VAMI > Firewall) to allow the internal vmware-postgres-archiver service to recover.