In some versions of TAS, the mysql cluster runs several applier threads in parallel. These threads can hit a race condition when a server node joins the cluster. When that happens, the node fails to join and is left in a bad state.
The condition was introduced in pxc/1.0.12, which was bundled with TAS 4.0.5. The workaround that sets wsrep_applier_threads to 1 is included with TAS 4.0.21+ and 6.0.1+. The underlying bug that caused the deadlock with values > 1 was fixed in pxc/1.0.29, which shipped with TAS 4.0.25.
Preferred workaround
Edit the appropriate config file on the Ops Manager VM, then apply changes to the TAS tile.
Find the config file by running this command with root privilege on the Ops Man VM:
sudo find /var/tempest/workspaces/default/metadata -exec grep -l "^name: cf" {} \;
Find these lines:
engine_config:
  galera:
    enabled: true
Add this line directly under "enabled: true", aligned with the first "e" in enabled. Correct alignment is essential to YAML syntax; otherwise you will see a 500 server error in the Ops Man UI. (The indentation shown here is illustrative; match whatever indentation the file already uses.)
    wsrep_applier_threads: 1
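The alignment requirement above can also be checked mechanically after a manual edit. The following is a minimal sketch, not part of the original procedure: check_alignment is a hypothetical helper name, and it assumes the new line sits immediately after an "enabled: true" line. It only compares leading whitespace; it does not validate the rest of the YAML.

```shell
# check_alignment FILE   (hypothetical helper, for illustration only)
# Exits 0 only if some "enabled: true" line is immediately followed by a
# wsrep_applier_threads line that starts with the same leading whitespace.
check_alignment() {
  awk '
    /^[ \t]*enabled: true[ \t]*$/ {
      # Remember the indentation of the "enabled: true" line.
      match($0, /^[ \t]*/)
      indent = substr($0, 1, RLENGTH)
      armed = 1
      next
    }
    armed && /^[ \t]*wsrep_applier_threads:/ {
      # Compare the indentation of the new line with the remembered one.
      match($0, /^[ \t]*/)
      if (substr($0, 1, RLENGTH) == indent) ok = 1
    }
    { armed = 0 }
    END { exit (ok ? 0 : 1) }
  ' "$1"
}
```

Example use, with the path returned by the find command above: check_alignment /path/to/metadata.yml && echo "aligned"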
The same change can be made with these automated steps:
$ ssh $opsmgr_vm # <- e.g. "smith ssh" in a Shepherd environment
$ sudo -i
# cf_metadata="$(grep -lr '^name: cf' /var/tempest/workspaces/default/metadata/)"
# cp "$cf_metadata" $HOME/original_cf_metadata.yml # <- just a backup to be safe
# cp "$cf_metadata" /tmp/cf.yml
# sed -i'' -r -e '/innodb_buffer_pool_size_percent:/i\ wsrep_applier_threads: 1' /tmp/cf.yml
# diff -u "$cf_metadata" /tmp/cf.yml
--- /var/tempest/workspaces/default/metadata/cbd28b16e356.yml 2024-04-08 11:55:30.312043459 +0000
+++ /tmp/cf.yml 2024-04-08 15:12:51.448307401 +0000
@@ -8799,6 +8799,7 @@
engine_config:
galera:
enabled: true
+ wsrep_applier_threads: 1
innodb_buffer_pool_size_percent: 50
innodb_flush_log_at_trx_commit: 2
innodb_strict_mode: true
### If the change looks correct and similar to the above output, replace the metadata
# mv /tmp/cf.yml "$cf_metadata"
Inspect the formatting of the file. Make sure the wsrep_applier_threads line starts directly under "enabled: true", aligned with the first "e" in enabled. Correct alignment is essential to YAML syntax; otherwise you will see a 500 server error in the Ops Man UI.
### Make sure the file ownership is correct:
# chown tempest-web:tempest-web "$cf_metadata"
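Because the inserted line must carry exactly the right indentation, one alternative to hard-coding spaces in the sed expression is to copy the indentation from the neighbouring line. The sketch below is an assumption-laden illustration, not the documented procedure: add_applier_threads is a hypothetical helper, and it assumes innodb_buffer_pool_size_percent is a sibling of "enabled: true" (as in the diff above) and therefore has the indentation the new line needs. It writes the modified file to standard output and leaves the input untouched.

```shell
# add_applier_threads FILE   (hypothetical helper, for illustration only)
# Prints FILE with "wsrep_applier_threads: 1" inserted before the first
# innodb_buffer_pool_size_percent line, reusing that line's indentation.
add_applier_threads() {
  awk '
    /^[ \t]*innodb_buffer_pool_size_percent:/ && !inserted {
      # Leading whitespace of the anchor line becomes the new line indent.
      match($0, /^[ \t]*/)
      print substr($0, 1, RLENGTH) "wsrep_applier_threads: 1"
      inserted = 1
    }
    { print }
  ' "$1"
}
```

It would stand in for the sed step only, for example: add_applier_threads "$cf_metadata" > /tmp/cf.yml, followed by the same diff review, mv, and chown as above.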
Once this change is saved and you run Apply Changes to the TAS tile, the fix is in place. It’s also possible to edit the my.cnf file (mysql configuration) on each mysql server node; this would avoid the need to apply changes to the whole TAS deployment.
Faster but less persistent workaround
ssh $opsmgr_vm # <- then connect to each mysql server node from there, e.g. with "bosh ssh"
echo -e "[mysqld]\nwsrep_applier_threads = 1" >> /var/vcap/jobs/pxc-mysql/config/my.cnf
Running monit restart galera-init afterwards applies the change right away; otherwise it is only picked up on the next restart.
This method is less persistent than the preferred workaround; it will not be retained after a stemcell upgrade, or after any bosh recreate operations.
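Note that the echo command above appends to my.cnf every time it is run. A minimal sketch of a repeat-safe variant follows. ensure_applier_threads is a hypothetical helper name. It relies on MySQL using the last occurrence of an option in an option file, and it only recognises the underscore spelling of the variable with a numeric value.

```shell
# ensure_applier_threads FILE   (hypothetical helper, for illustration only)
# Appends the override only when the last wsrep_applier_threads assignment
# in FILE is missing or is not already 1.
ensure_applier_threads() {
  local cnf="$1" last
  # Value of the last assignment in the file, or empty if there is none.
  last="$(sed -n -E 's/^[[:space:]]*wsrep_applier_threads[[:space:]]*=[[:space:]]*([0-9]+).*/\1/p' "$cnf" | tail -n 1)"
  if [ "$last" = "1" ]; then
    return 0   # already effective; do not append a duplicate
  fi
  printf '[mysqld]\nwsrep_applier_threads = 1\n' >> "$cnf"
}
```

On a mysql node this would be run as root, for example: ensure_applier_threads /var/vcap/jobs/pxc-mysql/config/my.cnf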