bosh -d cf-<DEPLOYMENT_ID> vms' command

/var/vcap/sys/log/pxc-mysql/mysql.err.log:
2025-11-08T01:27:41.989239Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection to peer ########-9f67 with addr ssl://###.###.###.###:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1223 rttvar: 83 rto: 1632000 lost: 50 last_data_recv: 3112 cwnd: 1 last_queued_since: 98793192 last_delivered_since: 3100388083 send_queue_length: 15 send_queue_bytes: 34648 segment: 0 messages: 15 (gmcast.peer_timeout)
2025-11-08T01:27:43.017312Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection established to ########-9f67 ssl://###.###.###.###:4567
2025-11-08T01:27:43.908117Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2025-11-08T01:27:43.908182Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 inactive (evs.inactive_timeout)
....
2025-11-08T01:27:46.388996Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,########-9976,695)
This problem is caused by intermittent network connectivity failures. In the observed scenario, the network initially dropped packets between all nodes; then, while the nodes were attempting to resync, the network dropped again mid-operation. The resulting timeouts on the node attempting synchronization exceeded the maximum of 3 retry attempts, leaving the cluster unsynced and unable to reestablish quorum even after network connectivity stabilized.
Clustered MySQL depends on consistent, low-latency network connectivity. The first corrective action is to verify ping connectivity between all 3 MySQL cluster nodes. If ping succeeds, verify that each node can reach the other nodes on the internal Galera replication port 4567 using netcat: 'nc -vz <PEER_NODE_IP_ADDRESS> 4567'.
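The port checks above can be scripted so all peers are tested in one pass. A minimal sketch; the peer IPs below are placeholders, so substitute the MySQL node IPs reported by the 'bosh -d cf-<DEPLOYMENT_ID> vms' command:

```shell
#!/bin/sh
# Hypothetical peer IPs -- substitute the MySQL node IPs from
# 'bosh -d cf-<DEPLOYMENT_ID> vms'.
PEERS="10.0.1.10 10.0.1.11"
CHECKED=0
UNREACHABLE=""
for ip in $PEERS; do
  CHECKED=$((CHECKED + 1))
  # -z: port scan only, send no data; -w 3: give up after 3 seconds
  if nc -vz -w 3 "$ip" 4567 2>/dev/null; then
    echo "OK: $ip:4567 reachable"
  else
    UNREACHABLE="$UNREACHABLE $ip"
  fi
done
if [ -n "$UNREACHABLE" ]; then
  echo "Port 4567 unreachable on:$UNREACHABLE"
fi
```

Run the script from each MySQL VM in turn so every node-to-node path is verified in both directions.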
Once network connectivity between the MySQL VMs has been restored, use the mysql-diag tool from the mysql-monitor VM to assess cluster health. Depending on the cluster's current status and condition, follow the Recovering from MySQL Cluster Downtime documentation for cluster recovery. It details how to determine whether the bootstrap errand can be run or a manual cluster bootstrap is required.
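Reaching the mysql-diag tool over BOSH SSH can be sketched as below. The deployment name and instance group 'mysql-monitor/0' are assumptions for illustration; confirm both against 'bosh deployments' and 'bosh -d <DEPLOYMENT> vms' in your environment before running:

```shell
#!/bin/sh
# Assumed names for illustration only -- verify against your foundation.
DEPLOYMENT="cf-<DEPLOYMENT_ID>"
INSTANCE="mysql-monitor/0"
# Build the command rather than executing it here, so it can be reviewed first.
CMD="bosh -d $DEPLOYMENT ssh $INSTANCE -c mysql-diag"
echo "$CMD"
```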
A code modification will be released in version 11 of Elastic Application Runtime (TAS) that increases the retry limit 'max_install_timeouts' (currently 3) applied when network interruptions occur between nodes. This will allow the MySQL cluster to attempt recovery for a longer period after a network interruption; with the current limit, retries are exhausted after approximately 30 seconds of failure.
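For reference, this retry limit is a Galera provider option. A my.cnf-style sketch of where it is set follows; the value 5 is illustrative only, and on TAS this file is rendered by the tile and should not be edited by hand:

```ini
; /var/vcap/jobs/pxc-mysql/config/my.cnf (fragment, illustrative only)
[mysqld]
; evs.max_install_timeouts bounds the retries discussed above;
; gmcast.peer_timeout is the PT3S timeout seen in the log messages.
wsrep_provider_options = "evs.max_install_timeouts=5; gmcast.peer_timeout=PT3S"
```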
When reviewing the mysql.err.log, the following message can be broken down to identify network failures:
2025-11-08T01:27:41.989239Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection to peer ########-9f67 with addr ssl://###.###.###.###:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1223 rttvar: 83 rto: 1632000 lost: 50 last_data_recv: 3112 cwnd: 1 last_queued_since: 98793192 last_delivered_since: 3100388083 send_queue_length: 15 send_queue_bytes: 34648 segment: 0 messages: 15 (gmcast.peer_timeout)
Breakdown of 'socket stats' fields and the network details they reveal:
2025-11-08T01:27:41.989239Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection to peer ########-9f67 with addr ssl://###.###.###.###:4567 timed out, no messages seen in PT3S,
socket stats:
rtt: 1223 -------------> Round Trip Time in microseconds. In this case 1.2ms
rttvar: 83 -------------> Round Trip Time variance in microseconds, as reported by the Linux kernel.
rto: 1632000 -------------> TCP Retransmission Timeout in microseconds. In this case 1.63s, a very high value indicating repeated retransmissions.
lost: 50 -------------> Lost Packets as indicated by Linux Kernel. In this case 50.
last_data_recv: 3112 -------------> Time since last data received in milliseconds. 3.1s in this case.
cwnd: 1 -------------> TCP_INFO congestion window reported from Linux. 1 Max Segment Size in this case, indicating Slow-start mode.
last_queued_since: 98793192 -------------> Time in nanoseconds since Galera last queued a message for this peer.
last_delivered_since: 3100388083 -------------> Time in nanoseconds since a message was last delivered from this peer. 3.1s here, matching the PT3S timeout.
send_queue_length: 15 -------------> Number of messages currently waiting in the send queue for the peer node.
send_queue_bytes: 34648 -------------> Number of bytes currently waiting to be sent to Peer node.
segment: 0 -------------> Galera network segment (gmcast.segment) the peer belongs to.
messages: 15 -------------> Number of messages Galera has queued.
(gmcast.peer_timeout)
The high RTT and lost-packet counts above indicate network connectivity failures. This is confirmed by the 'last_data_recv' and 'cwnd' values, which indicate the TCP session is being reset. The send-queue values are large because messages are backing up behind the failing connection. Together these point to an underlying network failure.
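As a quick triage aid, the two most telling fields can be extracted from such a line with standard tools. A minimal sketch using the 'socket stats' fragment from the log line above; in practice the input would come from grepping 'gmcast.peer_timeout' out of /var/vcap/sys/log/pxc-mysql/mysql.err.log:

```shell
#!/bin/sh
# Sample fragment copied from the mysql.err.log line discussed above.
LINE="socket stats: rtt: 1223 rttvar: 83 rto: 1632000 lost: 50 last_data_recv: 3112 cwnd: 1"
# Extract the rtt and lost values; 'rtt: ' does not match 'rttvar:', so
# the first pattern captures only the round-trip time.
RTT=$(echo "$LINE" | sed -n 's/.*rtt: \([0-9]*\).*/\1/p')
LOST=$(echo "$LINE" | sed -n 's/.*lost: \([0-9]*\).*/\1/p')
echo "rtt=${RTT}us lost=${LOST}"
if [ "${LOST:-0}" -gt 0 ]; then
  echo "WARNING: kernel reports $LOST lost packets to this peer"
fi
```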