bosh -d cf-<DEPLOYMENT_ID> vms' command

/var/vcap/sys/log/pxc-mysql/mysql.err.log:
2025-11-08T01:27:41.989239Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection to peer ########-9f67 with addr ssl://###.###.###.###:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1223 rttvar: 83 rto: 1632000 lost: 50 last_data_recv: 3112 cwnd: 1 last_queued_since: 98793192 last_delivered_since: 3100388083 send_queue_length: 15 send_queue_bytes: 34648 segment: 0 messages: 15 (gmcast.peer_timeout)
2025-11-08T01:27:43.017312Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection established to ########-9f67 ssl://###.###.###.###:4567
2025-11-08T01:27:43.908117Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 suspected, timeout PT5S (evs.suspect_timeout)
2025-11-08T01:27:43.908182Z 0 [Note] [MY-000000] [Galera] declaring node with index 1 inactive (evs.inactive_timeout)
....
2025-11-08T01:27:46.388996Z 0 [Note] [MY-000000] [Galera] Current view of cluster as seen by this node
view (view_id(NON_PRIM,########-9976,695)
This problem is caused by intermittent network connectivity failures. In the observed scenario, the network initially dropped packets between all nodes; then, while the nodes were attempting to resync, the network dropped again mid-operation. The resulting timeouts on the node attempting synchronization exceeded the maximum of 3 retry attempts, leaving the cluster unsynced and unable to reestablish quorum even after network connectivity stabilized.
Clustered MySQL depends on consistent, low-latency network connectivity. The first corrective action is to verify ping connectivity between all 3 MySQL cluster nodes. If ping succeeds, verify that each node can reach the other nodes on the internal Galera replication port 4567 using netcat: 'nc -vz <PEER_NODE_IP_ADDRESS> 4567'.
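The port checks above can be scripted so all peers are tested in one pass. A minimal sketch; the peer IPs below are placeholders, so substitute the MySQL node IPs reported by the 'bosh -d cf-<DEPLOYMENT_ID> vms' command:

```shell
#!/bin/sh
# Hypothetical peer IPs -- substitute the MySQL node IPs from
# 'bosh -d cf-<DEPLOYMENT_ID> vms'.
PEERS="10.0.1.10 10.0.1.11"
CHECKED=0
UNREACHABLE=""
for ip in $PEERS; do
  CHECKED=$((CHECKED + 1))
  # -z: port scan only, send no data; -w 3: give up after 3 seconds
  if nc -vz -w 3 "$ip" 4567 2>/dev/null; then
    echo "OK: $ip:4567 reachable"
  else
    UNREACHABLE="$UNREACHABLE $ip"
  fi
done
if [ -n "$UNREACHABLE" ]; then
  echo "Port 4567 unreachable on:$UNREACHABLE"
fi
```

Run the script from each MySQL VM in turn so every node-to-node path is verified in both directions.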
Once network connectivity between the MySQL VMs has been restored, use the mysql-diag tool from the mysql-monitor VM to assess cluster health. Depending on the cluster's current status and condition, follow the Recovering from MySQL Cluster Downtime documentation for cluster recovery. It details how to determine whether the bootstrap errand can be run or a manual cluster bootstrap is required.
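Reaching the mysql-diag tool over BOSH SSH can be sketched as below. The deployment name and instance group 'mysql-monitor/0' are assumptions for illustration; confirm both against 'bosh deployments' and 'bosh -d <DEPLOYMENT> vms' in your environment before running:

```shell
#!/bin/sh
# Assumed names for illustration only -- verify against your foundation.
DEPLOYMENT="cf-<DEPLOYMENT_ID>"
INSTANCE="mysql-monitor/0"
# Build the command rather than executing it here, so it can be reviewed first.
CMD="bosh -d $DEPLOYMENT ssh $INSTANCE -c mysql-diag"
echo "$CMD"
```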
A code modification will be released in version 11 of Elastic Application Runtime (TAS) that increases the retry limit 'max_install_timeouts' (currently 3) applied when network interruptions occur between nodes. This will allow the MySQL cluster to attempt recovery for a longer period after a network interruption; with the current limit, retries are exhausted after approximately 30 seconds of failure.
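For reference, this retry limit is a Galera provider option. A my.cnf-style sketch of where it is set follows; the value 5 is illustrative only, and on TAS this file is rendered by the tile and should not be edited by hand:

```ini
; /var/vcap/jobs/pxc-mysql/config/my.cnf (fragment, illustrative only)
[mysqld]
; evs.max_install_timeouts bounds the retries discussed above;
; gmcast.peer_timeout is the PT3S timeout seen in the log messages.
wsrep_provider_options = "evs.max_install_timeouts=5; gmcast.peer_timeout=PT3S"
```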
When reviewing the mysql.err.log, the following message can be broken down to identify network failures:
2025-11-08T01:27:41.989239Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection to peer ########-9f67 with addr ssl://###.###.###.###:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1223 rttvar: 83 rto: 1632000 lost: 50 last_data_recv: 3112 cwnd: 1 last_queued_since: 98793192 last_delivered_since: 3100388083 send_queue_length: 15 send_queue_bytes: 34648 segment: 0 messages: 15 (gmcast.peer_timeout)
Breakdown of 'socket stats' fields and the network details they reveal:
2025-11-08T01:27:41.989239Z 0 [Note] [MY-000000] [Galera] (########-9976, 'ssl://0.0.0.0:4567') connection to peer ########-9f67 with addr ssl://###.###.###.###:4567 timed out, no messages seen in PT3S,
socket stats:
rtt: 1223 -------------> Round Trip Time in microseconds. In this case 1.2ms
rttvar: 83 -------------> Round Trip Time variance in microseconds, as reported by the Linux kernel.
rto: 1632000 -------------> TCP Retransmission Timeout in microseconds. In this case 1.63s, a very high value indicating repeated retransmissions.
lost: 50 -------------> Lost Packets as indicated by Linux Kernel. In this case 50.
last_data_recv: 3112 -------------> Time since last data received in milliseconds. 3.1s in this case.
cwnd: 1 -------------> TCP_INFO congestion window reported from Linux. 1 Max Segment Size in this case, indicating Slow-start mode.
last_queued_since: 98793192 -------------> Time in nanoseconds since Galera last queued a message for this peer.
last_delivered_since: 3100388083 -------------> Time in nanoseconds since a message was last delivered from this peer. 3.1s here, matching the PT3S timeout.
send_queue_length: 15 -------------> Number of messages currently waiting in the send queue for the peer node.
send_queue_bytes: 34648 -------------> Number of bytes currently waiting to be sent to Peer node.
segment: 0 -------------> Galera network segment (gmcast.segment) the peer belongs to.
messages: 15 -------------> Number of messages Galera has queued.
(gmcast.peer_timeout)
The high RTT and lost-packet counts above indicate network connectivity failures. This is confirmed by the 'last_data_recv' and 'cwnd' values, which indicate the TCP session is being reset. The send-queue values are large because messages are backing up behind the failing connection. Together these point to an underlying network failure.
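As a quick triage aid, the two most telling fields can be extracted from such a line with standard tools. A minimal sketch using the 'socket stats' fragment from the log line above; in practice the input would come from grepping 'gmcast.peer_timeout' out of /var/vcap/sys/log/pxc-mysql/mysql.err.log:

```shell
#!/bin/sh
# Sample fragment copied from the mysql.err.log line discussed above.
LINE="socket stats: rtt: 1223 rttvar: 83 rto: 1632000 lost: 50 last_data_recv: 3112 cwnd: 1"
# Extract the rtt and lost values; 'rtt: ' does not match 'rttvar:', so
# the first pattern captures only the round-trip time.
RTT=$(echo "$LINE" | sed -n 's/.*rtt: \([0-9]*\).*/\1/p')
LOST=$(echo "$LINE" | sed -n 's/.*lost: \([0-9]*\).*/\1/p')
echo "rtt=${RTT}us lost=${LOST}"
if [ "${LOST:-0}" -gt 0 ]; then
  echo "WARNING: kernel reports $LOST lost packets to this peer"
fi
```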