Too many authentication failures blocks CSP patching on vIDM

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Attempts to apply the vIDM cumulative patches (e.g., CSP, Patch 5, Patch 7) fail due to cluster health validation failures.
- Pre-patch cluster validation indicates Pgpool Master and PostgreSQL Primary roles are not aligned.
- Execution of poolnodes reports a standby node in a DOWN or Detached state (Status 1).
- The /db/data directory on the failing standby node is completely empty, indicating a failed replication initialization.
- Neither the native auto-recovery.sh polling script nor manual execution of pcp_recovery_node completes the database restoration.
- Manual SSH validation from the active Master node to the failing standby node prompts for a password or disconnects after exhausting the allowed authentication attempts: Received disconnect from <Node_IP> port 22:2: Too many authentication failures
- Verbose SSH validation (ssh -v root@<target-node>) shows the destination node rejecting the offered public key, forcing a fallback to interactive methods:

debug1: Offering public key: RSA SHA256:<Sanitized> /root/.ssh/id_rsa
debug1: Authentications that can continue: publickey,password,keyboard-interactive
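
The debug lines above can be checked mechanically. The following is a minimal sketch, not part of the product tooling: the function name classify_ssh_debug is hypothetical, and it assumes the standard OpenSSH client debug messages ("Offering public key", "Server accepts key", "Authentication succeeded (publickey)"). It reads ssh -v output on stdin and reports whether a key was offered but never accepted.
```shell
#!/bin/sh
# classify_ssh_debug (hypothetical helper): classify `ssh -v` output from stdin.
# Prints "accepted" if the server accepted a public key,
#        "rejected" if a key was offered but never accepted,
#        "unknown"  if no key offer appears in the output at all.
classify_ssh_debug() {
    awk '
        /Offering .*public key/                   { offered = 1 }
        /Server accepts key/                      { accepted = 1 }
        /Authentication succeeded \(publickey\)/  { accepted = 1 }
        END {
            if (accepted)     print "accepted"
            else if (offered) print "rejected"
            else              print "unknown"
        }'
}

# Example invocation from the Master node (ssh writes debug output to stderr):
#   ssh -v -o BatchMode=yes root@<target-node> true 2>&1 | classify_ssh_debug
```
A result of rejected matches the symptom described in this article. BatchMode=yes is used in the example so the client does not stop at a password prompt.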

Environment

VMware Identity Manager (vIDM) 3.3.x

VMware Aria Suite Lifecycle 8.18.x

Cause

The internal Pgpool-II recovery mechanism (specifically the recovery_1st_stage script) requires working SSH public key authentication to orchestrate the database rebuild. It uses this connection to purge the target data directory and stream the PostgreSQL data files and Write-Ahead Log (WAL) via pg_basebackup. If the target node's /root/.ssh/authorized_keys file is missing the Master node's public key, or if the file or directory permissions violate the OpenSSH StrictModes checks, the automated SSH connection is rejected. Recovery cannot proceed until trust is restored: the node stays detached, its data directory stays empty, and LCM lifecycle operations remain blocked.
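
Both failure conditions can be checked on the standby node before changing anything. The sketch below is illustrative only: check_ssh_prereqs is a hypothetical helper, it assumes GNU stat (stat -c '%a'), and it only approximates StrictModes by flagging group- or world-writable paths. The sshd daemon also checks ownership and the home directory itself, which this sketch does not.
```shell
#!/bin/sh
# check_ssh_prereqs <ssh_dir> <public_key_line>   (hypothetical helper)
# Prints one line per problem found, or "OK"; returns non-zero on any problem.
check_ssh_prereqs() {
    ssh_dir="$1"
    pub_key="$2"
    auth_file="$ssh_dir/authorized_keys"
    rc=0

    if [ ! -f "$auth_file" ]; then
        echo "MISSING_FILE: $auth_file"
        return 1
    fi

    # Compare only the base64 key blob (second field), so a differing
    # trailing comment does not cause a false negative.
    key_blob=$(printf '%s\n' "$pub_key" | awk '{print $2}')
    if [ -z "$key_blob" ] || ! grep -qF -- "$key_blob" "$auth_file"; then
        echo "KEY_ABSENT: $auth_file"
        rc=1
    fi

    # Flag any path whose octal mode has the group or other write bit set.
    for path in "$ssh_dir" "$auth_file"; do
        mode=$(stat -c '%a' "$path")
        case "$mode" in
            *[2367]?|*[2367])
                echo "WRITABLE_BY_GROUP_OR_OTHER: $path (mode $mode)"
                rc=1
                ;;
        esac
    done

    [ "$rc" -eq 0 ] && echo "OK"
    return "$rc"
}

# Example invocation on the standby node, with the Master key pasted in quotes:
#   check_ssh_prereqs /root/.ssh '<contents of the Master id_rsa.pub>'
```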

Resolution

To unblock the native Pgpool recovery sequence, re-establish SSH public key trust from the active Master node to the failing standby node.

Extract Primary Key: Execute the following on the active Master node to retrieve its public key string.
```
cat /root/.ssh/id_rsa.pub
```
Inject Key on Target: Open a separate terminal session (e.g., a direct PuTTY connection) to the failing standby node. Append the exact string copied from the Master node in the previous step to the authorized keys file. Ensure the key remains on a single, unbroken line.
```
vi /root/.ssh/authorized_keys
```
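As an alternative to editing the file by hand, the append can be scripted so that it is safe to repeat and cannot split the key across lines. This is a sketch under assumptions: append_key_once is a hypothetical helper, and it expects the Master node's public key to have been saved to a file on the standby node first (the path /tmp/master_id_rsa.pub in the example is illustrative only).
```shell
#!/bin/sh
# append_key_once <public_key_file> <authorized_keys_file>   (hypothetical helper)
# Appends the first line of the key file unless an identical line already exists.
append_key_once() {
    pub_key_file="$1"
    auth_file="$2"

    key_line=$(head -n 1 "$pub_key_file")
    if [ -z "$key_line" ]; then
        echo "empty key file: $pub_key_file" >&2
        return 1
    fi

    mkdir -p "$(dirname "$auth_file")"
    touch "$auth_file"

    if grep -qxF -- "$key_line" "$auth_file"; then
        echo "already present"
        return 0
    fi

    # If the file does not end with a newline, add one so the new key
    # is not glued onto the end of the previous entry.
    if [ -s "$auth_file" ] && [ -n "$(tail -c 1 "$auth_file")" ]; then
        printf '\n' >> "$auth_file"
    fi
    printf '%s\n' "$key_line" >> "$auth_file"
    echo "appended"
}

# Example invocation on the standby node:
#   append_key_once /tmp/master_id_rsa.pub /root/.ssh/authorized_keys
```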
Enforce StrictModes Permissions: Execute the following on the failing standby node to satisfy the OpenSSH daemon's strict permission requirements.
```
chmod 700 /root/.ssh && chmod 600 /root/.ssh/authorized_keys && chown -R root:root /root/.ssh
```
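Validate Cryptographic Trust: Execute the following on the Master node to confirm that the standby node accepts the key. IdentitiesOnly=yes restricts the client to the key given with -i, so identities offered by an SSH agent cannot use up the server's authentication attempts. Because a remote command is supplied, no TTY is allocated.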
```
ssh -o IdentitiesOnly=yes -i /root/.ssh/id_rsa root@<target-node-ip> 'echo Trust Established'
```
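For unattended checks, the same validation can be wrapped so that it fails cleanly instead of waiting at a password prompt. The sketch below is an assumption-laden illustration: verify_trust and the SSH_BIN override are hypothetical names, and BatchMode=yes and ConnectTimeout=10 are standard OpenSSH client options added here that are not part of the command above.
```shell
#!/bin/sh
# verify_trust <target> [key_file]   (hypothetical helper)
# Returns 0 only if key-based login succeeds and the remote echo comes back.
# SSH_BIN may be set to substitute another client command (used for testing).
verify_trust() {
    target="$1"
    key_file="${2:-/root/.ssh/id_rsa}"

    # BatchMode=yes disables password and passphrase prompts, so a rejected
    # key produces a non-zero exit instead of an interactive fallback.
    reply=$("${SSH_BIN:-ssh}" -o BatchMode=yes -o IdentitiesOnly=yes \
        -o ConnectTimeout=10 -i "$key_file" "root@${target}" \
        'echo Trust Established' 2>/dev/null)

    [ "$reply" = "Trust Established" ]
}

# Example invocation on the Master node:
#   verify_trust <target-node-ip> && echo "trust OK" || echo "trust FAILED"
```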
Re-initiate Recovery: Once the command returns Trust Established, SSH trust is in place. If networkService is active, the auto-recovery.sh loop detects the failed node and re-initiates the replication stream on its next polling cycle. If you manually stopped networkService, run the PCP recovery from the Master node:
```
pcp_recovery_node -w -h localhost -U pgpool -p 9898 -n <Target_Node_ID>
```

Execute poolnodes upon completion to verify that all nodes report an UP status and that quorum is restored. Then proceed with the vRSLCM patching operation.