1. When trying to log in to VCF Operations using SSO, the following error is shown: "ERROR: VCF Identity Broker encountered an issue during authentication. Please contact your VCF Admin with the below details for resolution."
2. After rebooting all vIDB nodes and retrying the login, the error "no healthy upstream" is shown.
VCF Operations 9.x
VCF Identity Broker 9.x
The majority of the data resides in a /var/lib/kubelet/pods subdirectory whose path ends in /mount/pgroot/data/pg_wal. The pg_wal files created by internally connected sessions saturated the partition, causing it to run out of space at 100% utilization:
Example:
/dev/sdd 10G 10G 32K 100% /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~csi/<pvc-uid>/mount
Note: It could also be a different partition, for example, /dev/sdc, or /dev/sdb.
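The full pg_wal directory can also be located directly with du. The snippet below is a sketch: the pod and PVC UID path components are placeholders that vary per environment, so it builds a throwaway tree mirroring the kubelet CSI volume layout to demonstrate the glob safely.

```shell
# Sketch (paths are illustrative; real pod/PVC UIDs differ per environment).
root=$(mktemp -d)
mkdir -p "$root/pods/pod-uid/volumes/kubernetes.io~csi/pvc-uid/mount/pgroot/data/pg_wal"
: > "$root/pods/pod-uid/volumes/kubernetes.io~csi/pvc-uid/mount/pgroot/data/pg_wal/000000010000000000000001"
# On a live node, the same glob can be run against /var/lib/kubelet:
#   du -sh /var/lib/kubelet/pods/*/volumes/kubernetes.io~csi/*/mount/pgroot/data/pg_wal
du -sh "$root"/pods/*/volumes/kubernetes.io~csi/*/mount/pgroot/data/pg_wal
rm -rf "$root"
```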
Follow the workaround steps below to clean up old pg_wal session files so the vIDB cluster can be re-established and working again:
1. Log in to each vIDB node as vmware-system-user, then execute sudo su to change to the root user.
2. Execute df -h on all nodes and identify any node with a partition at 100% utilization.
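To surface only the full filesystems, the df output can be filtered; the awk expression below is a sketch that assumes the standard df -h column layout, where Use% is the fifth field.

```shell
# Print the header plus any filesystem at 100% utilization (Use% is field 5).
df -h | awk 'NR==1 || $5=="100%"'
```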
3. On the node with the full partition, execute the following command to enable kubectl commands.
export KUBECONFIG=/etc/kubernetes/admin.conf
4. Execute the following command to check the pods' status.
kubectl -n vidb-external get pods
Note: In this example, the output shows that vidb-postgres-instance-0 is not yet 2/2 (fully Ready).
5. Execute the following command to confirm the nodes are in the proper state. In the State column, the Leader instance should show "running", and the other two replica instances should show "in archive recovery", "starting", or "streaming" rather than "start failed".
kubectl exec -n vidb-external vidb-postgres-instance-0 -- patronictl list
6. In the output, note the instance name in the Member column for the Leader and Replica rows. In this example, the Leader is vidb-postgres-instance-1, and the other two instances are the replicas.
7. From the Leader pod, execute the following commands to reinitialize the Postgres service on both replica instances. Wait a few minutes for the status to change.
kubectl exec vidb-postgres-instance-1 -n vidb-external -- patronictl reinit vidb-postgres-instance vidb-postgres-instance-0 --force
kubectl exec vidb-postgres-instance-1 -n vidb-external -- patronictl reinit vidb-postgres-instance vidb-postgres-instance-2 --force
Note: The above step should clean up the 100% partition.
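Rather than re-running patronictl list by hand while waiting, the check can be scripted. The helper below is a sketch: the function name is hypothetical, the interval and timeout values are arbitrary, and the commented example reuses the leader pod name from the steps above.

```shell
# Sketch: poll a command until its output matches a pattern, or time out.
poll_until() {
  pattern=$1; timeout=$2; shift 2
  end=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$end" ]; do
    if "$@" | grep -q "$pattern"; then
      return 0
    fi
    sleep 5
  done
  return 1
}
# Example on a live cluster: wait up to 5 minutes for a replica to reach "streaming":
#   poll_until streaming 300 kubectl exec -n vidb-external vidb-postgres-instance-1 -- patronictl list
```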
8. Execute df -h on the node identified in step 2 to verify that the partition now has free space.
9. Execute the following command to verify that all vidb pod services are in the Running state.
kubectl get pods -A | grep vidb
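The READY column can also be checked programmatically. The awk filter below is a sketch assuming the default kubectl get pods -A column layout, where READY (e.g. 2/2) is the third field; the sample lines piped in here are hypothetical output standing in for a live cluster.

```shell
# Flag any vidb pod whose READY count (field 3, e.g. "1/2") is not full.
# Sample lines below are hypothetical kubectl get pods -A output.
printf '%s\n' \
  'vidb-external vidb-postgres-instance-0 1/2 Running 0 5m' \
  'vidb-external vidb-postgres-instance-1 2/2 Running 0 5m' |
awk '/vidb/ { split($3, r, "/"); if (r[1] != r[2]) print "NOT READY:", $0 }'
# On a live node: kubectl get pods -A | awk '...same filter...'
```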
10. Confirm SSO login is working again on VCF Operations.