This issue occurs when one of the nodes in the TAS internal MySQL HA cluster becomes unhealthy during an upgrade. It is caused by a bug in the Percona XtraDB Cluster (pxc) component on these VMs and affects only specific pxc versions.
Affected versions are TAS releases earlier than TAS 4.0.31, TAS 6.0.11, or TAS 10.0.1, which ship with pxc versions earlier than 1.0.33.
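One of the nodes in the MySQL cluster receives an idle connection to Galera, for example when something probes the Galera gcomm port (4567). While that connection stays open, Galera on that node stops accepting new connections. The logs will show errors similar to the following: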
202x-01-10T21:34:36.398945Z 0 [Note] [MY-000000] [Galera] (xxxxxxxx-xxxx, 'ssl://0.0.0.0:4567') connection to peer 00000000-0000 with addr ssl://xxx.xx.xx.xx:4567 timed out, no messages seen in PT3S, socket stats: rtt: 1187 rttvar: 654 rto: 204000 lost: 0 last_data_recv: 3000 cwnd: 10 last_queued_since: 12553177320334 last_delivered_since: 12553177320334 send_queue_length: 0 send_queue_bytes: 0 (gmcast.peer_timeout)
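To check how often a node is logging this timeout, the MySQL error log can be searched for the `(gmcast.peer_timeout)` marker that ends the log line above. The following is a minimal sketch: the helper name `count_peer_timeouts` is hypothetical, and the log path in the usage comment is an assumption that may differ in your deployment.

```shell
# Count Galera peer-timeout entries in a log file.
# count_peer_timeouts is a hypothetical helper, not part of TAS or pxc.
count_peer_timeouts() {
  # -c prints the number of matching lines; -F matches the pattern literally.
  # grep exits non-zero when nothing matches, so "|| true" keeps the helper
  # safe under "set -e" while still printing 0.
  grep -cF '(gmcast.peer_timeout)' -- "$1" || true
}

# Usage (the path is an assumption; adjust it to where your MySQL VM writes its error log):
#   count_peer_timeouts /var/vcap/sys/log/pxc-mysql/mysql.err.log
```

A count that keeps growing on one node while a peer is trying to rejoin is consistent with the symptom described here, but it is not proof of this specific bug on its own.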
It can be reproduced by establishing a single idle connection to Galera.
## Given a deployed cluster, connect to one of the nodes
$ bosh ssh mysql/0
$ nc localhost 4567 &
## Observe that Galera no longer accepts connections until this connection disconnects (or is eventually accepted)
$ openssl s_client -connect localhost:4567 ...
...connection hangs forever...
...tcpdump / wireshark shows the "TLS Client Hello", but the server never responds...
...Similarly, a node rejoining the cluster will fail connecting to this node with identical symptoms to this issue...
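Because the `openssl s_client` call in the reproduction above hangs indefinitely on an affected node, it can help to bound the probe with a time limit. The sketch below wraps the same command in GNU `timeout`. The helper name `probe_gcomm` and the 10-second default are illustrative assumptions, and any extra `s_client` arguments elided in the steps above (for example certificate options) are not reproduced here.

```shell
# Probe the Galera gcomm port with a bounded TLS connection attempt.
# probe_gcomm is a hypothetical helper, not part of TAS or pxc.
probe_gcomm() {
  local host="$1" port="$2" limit="${3:-10}"
  # GNU timeout stops openssl after $limit seconds and exits with status 124
  # in that case. Reading stdin from /dev/null lets s_client exit on its own
  # once the connection is established instead of waiting for input.
  timeout "$limit" openssl s_client -connect "${host}:${port}" </dev/null >/dev/null 2>&1
  local status=$?
  case "$status" in
    0)   echo "openssl connected and exited normally" ;;
    124) echo "no response within ${limit}s (connection attempt hung)" ;;
    *)   echo "connection failed (status ${status})" ;;
  esac
  return "$status"
}

# Usage on the MySQL VM:
#   probe_gcomm localhost 4567 10
```

A status of 124 matches the "connection hangs forever" symptom. Other non-zero statuses, such as a refused connection or a rejected handshake, point to a different condition.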
For a short-term workaround, the customer can:
For a long-term solution, upgrade TAS to a version that ships with a pxc version containing the fix for this bug. The fix is included in pxc 1.0.33 and later.
The TAS versions containing the patched pxc versions are:
Details of the Percona bug and fix can be found here:
Asio acceptor stops accepting connections: https://github.com/percona/galera/commit/4453168d78a1d0bbe2f2cfdfcb059f2366ae2a6f