PG_AUTO_FAILOVER: all nodes down after enabling SSL due to certificate issue

Article ID: 296415

Updated On:

Products

VMware Tanzu Greenplum

Issue/Introduction

In pg_auto_failover, you can strengthen network security by enabling SSL (via the command: # pg_autoctl enable ssl xxxx); see the pg_auto_failover documentation for details.
In some cases, after SSL is enabled, all data nodes are marked as unhealthy and shown in a down state.

This article discusses one possible cause of this issue: invalid permissions on the client certificate/key files on the monitor.

 

Environment

Product Version: 14.5

Resolution

Troubleshooting steps:

When a data node has been marked as unhealthy, check the Postgres logs on that data node. Below is an example:

From the logs, we can see:

  1. The monitor tries to connect to the data node over SSL, but the connection fails: "could not accept SSL connection: EOF detected"
  2. The monitor then falls back to a non-SSL connection. Because the customer requires a high level of security, non-SSL connections are not allowed in pg_hba.conf, so this connection fails as well.
  3. As a result, the monitor cannot get any information from the data node and marks it as unhealthy.
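
To confirm step 2 on a data node, it can help to list the pg_hba.conf rules that still admit non-SSL TCP connections. In PostgreSQL, a rule of type "host" matches both SSL and non-SSL connections, "hostssl" matches SSL only, and "hostnossl" matches non-SSL only. The following is a minimal sketch; the function name hba_non_ssl_rules is hypothetical, and the path to pg_hba.conf is an assumption that depends on your installation (it can be looked up with SHOW hba_file; in psql).

```shell
# Hypothetical helper: print the active pg_hba.conf rules whose type
# admits non-SSL TCP connections ("host" or "hostnossl").
# Commented-out lines are skipped because their first field is not a rule type.
hba_non_ssl_rules() {
    # $1: path to pg_hba.conf (assumption: pass the path used by your data node)
    awk '$1 == "host" || $1 == "hostnossl"' "$1"
}

# Usage (the path is a placeholder):
#   hba_non_ssl_rules /path/to/pg_hba.conf
```

If the function prints nothing, no rule admits a non-SSL TCP connection, which is consistent with the fallback attempt being rejected.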

To find out why the SSL connection failed, use the psql client to connect to the Data Node from the Monitor Node.
- Note that the client certificate and key are located under ~/.postgresql/ by default.
- Run the following command from the Monitor Node:

psql -h <DataNode> -U pgautofailover_monitor "dbname=postgres sslmode=verify-full sslcert=<CLIENT CERT FILE> sslkey=<CLIENT KEY FILE> sslrootcert=<ROOT CERT>"

In this example, the command returns the following error:

- This shows that the monitor cannot connect to the data node because of invalid permissions on the client certificate files under ~/.postgresql/. Once the permissions are fixed, the cluster returns to normal.
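
As a rough illustration of the permission fix: libpq, the client library used by psql, rejects a private key file that is accessible by group or others. The key must have permissions 0600 or stricter when owned by the connecting user (0640 or stricter when owned by root). The sketch below tightens the key files and lists the directory for review. The function name fix_client_key_perms and the *.key pattern are assumptions (the libpq default key name is postgresql.key), and the right fix in your environment depends on the exact error psql reports.

```shell
# Hypothetical helper: restrict client private keys to owner read/write
# and show the resulting permissions and ownership.
fix_client_key_perms() {
    # $1: directory holding the client certificate and key
    #     (assumption: defaults to ~/.postgresql of the current OS user)
    dir="${1:-$HOME/.postgresql}"
    for key in "$dir"/*.key; do
        if [ -f "$key" ]; then
            chmod 600 "$key"
        fi
    done
    # Review owner and mode of every file; the files should belong to
    # the OS user that runs the monitor.
    ls -l "$dir"
}

# Usage (run on the Monitor Node as the user that runs the monitor):
#   fix_client_key_perms
```

After the permissions are corrected, rerunning the psql command above is a quick way to verify that the SSL connection now succeeds.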