MySQL Cluster has two failing nodes after attempting to upgrade


Article ID: 293308


Updated On:

Products

VMware Tanzu SQL

Issue/Introduction

While upgrading and patching, the customer found that one of the MySQL service instances came up with only one node running after the upgrade-all-service-instances errand. Re-running the errand and manually deploying the service instance from its BOSH manifest were both attempted, but two of the VMs did not recover. The manual deploy failed as follows:

Using environment '10.200.216.139' as client 'ops_manager' 
Task 8331165 
Task 8331165 | 20:14:23 | Preparing deployment: Preparing deployment 
Task 8331165 | 20:14:25 | Warning: DNS address not available for the link provider instance: nats/71661674-57f0-4b74-a188-800bc6de5db0 
Task 8331165 | 20:14:25 | Warning: DNS address not available for the link provider instance: nats/e6e1f161-f19e-418b-959a-ac3b534668d6 
Task 8331165 | 20:14:25 | Warning: DNS address not available for the link provider instance: clock_global/1b66650f-d3d2-4c69-a93b-4d185232b639 
Task 8331165 | 20:14:25 | Warning: DNS address not available for the link provider instance: clock_global/d30d3f9b-a344-4528-b0fa-f67310c113db 
Task 8331165 | 20:14:25 | Warning: DNS address not available for the link provider instance: doppler/3d6b1ace-0a50-4667-8a4b-e2451afb658c 
Task 8331165 | 20:14:25 | Warning: DNS address not available for the link provider instance: doppler/7eda361a-4971-40ef-ad99-d31cd728ca66 
Task 8331165 | 20:14:25 | Warning: DNS address not available for the link provider instance: doppler/8cbd1279-b704-4db1-a7ee-c7a290bc1bde 
Task 8331165 | 20:14:34 | Preparing deployment: Preparing deployment (00:00:11) 
Task 8331165 | 20:14:34 | Preparing deployment: Rendering templates (00:00:10) 
Task 8331165 | 20:14:44 | Preparing package compilation: Finding packages to compile (00:00:00) 
Task 8331165 | 20:14:47 | Updating instance mysql: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) 
Task 8331165 | 20:14:49 | L executing pre-stop: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) 
Task 8331165 | 20:14:49 | L executing drain: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) 
Task 8331165 | 20:15:35 | L stopping jobs: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) 
Task 8331165 | 20:15:36 | L executing post-stop: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) 
Task 8331165 | 20:15:41 | L installing packages: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) 
Task 8331165 | 20:15:46 | L configuring jobs: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) 
Task 8331165 | 20:15:46 | L executing pre-start: mysql/6835f206-daea-4466-a928-730def6c30a6 (0) (canary) (00:06:01) 
                          L Error: Action Failed get_task: Task f5c6bd53-6e0b-42c5-5f41-c1d72e1233b5 result: 1 of 8 pre-start scripts failed. Failed Jobs: pxc-mysql. Successful Jobs: loggregator_agent, mysql-restore, pre-start-script, bpm, bosh-dns, user_add, antivirus. 
Task 8331165 | 20:20:48 | Error: Action Failed get_task: Task f5c6bd53-6e0b-42c5-5f41-c1d72e1233b5 result: 1 of 8 pre-start scripts failed. Failed Jobs: pxc-mysql. Successful Jobs: loggregator_agent, mysql-restore, pre-start-script, bpm, bosh-dns, user_add, antivirus. 
Task 8331165 Started  Sun Jan 29 20:14:23 UTC 2023 
Task 8331165 Finished Sun Jan 29 20:20:48 UTC 2023 
Task 8331165 Duration 00:06:25 

Task 8331165 error 

Capturing task '8331165' output: 
  Expected task '8331165' to succeed but state is 'error' 

Exit code 1
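
To understand why the pxc-mysql pre-start script is failing, review its logs on one of the failing VMs. A minimal sketch, using the instance ID from the task output above and assuming the standard BOSH log location for lifecycle scripts:

bosh -d <service-instance-deployment> ssh mysql/6835f206-daea-4466-a928-730def6c30a6

# log path assumed from standard BOSH lifecycle-script conventions
sudo tail -n 100 /var/vcap/sys/log/pxc-mysql/pre-start.stderr.log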

Environment

Product Version: 2.10

Resolution

The failing VMs were repaired by scaling the service instance down to the single healthy node (by editing its BOSH manifest) and then scaling back up to three nodes with the original manifest.
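
If the BOSH deployment name for the affected service instance is not already known, it can usually be found by listing deployments; on-demand MySQL service instances are typically deployed as service-instance_<GUID> (naming convention assumed here), where the GUID comes from the cf CLI:

cf service <service-instance-name> --guid

bosh deployments | grep service-instance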

bosh -d <service-instance-deployment> manifest > si.yaml

cp si.yaml si.yaml.orig

Edit si.yaml and change the number of instances for the mysql instance group (which runs the pxc-mysql job) from 3 to 1.
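
For illustration, the relevant portion of si.yaml looks roughly like the following (instance group name taken from the task output above; all other fields left unchanged):

instance_groups:
- name: mysql
  instances: 1    # reduced from 3 so only the healthy node remains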

bosh -d <service-instance-deployment> ignore mysql/<running-instance-id>
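
The healthy node can be identified from the instance listing; it should be the only mysql instance whose processes all report as running:

bosh -d <service-instance-deployment> instances --ps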

bosh update-resurrection off

bosh -d <service-instance-deployment> deploy si.yaml --fix

This deploy should delete the two unhealthy mysql instances, leaving only the ignored, healthy node.
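
Before continuing, verify the scale-down by re-listing the deployment; only the ignored, healthy mysql instance should remain:

bosh -d <service-instance-deployment> instances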

Scale back up to 3 nodes:

bosh -d <service-instance-deployment> deploy si.yaml.orig --fix

This should create two new, healthy nodes that rejoin the remaining node to restore the three-node cluster.
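
Cluster membership can also be confirmed from MySQL itself: on a healthy three-node Galera (PXC) cluster, wsrep_cluster_size should report 3. A sketch, assuming the admin credentials are retrieved from CredHub or a service key (placeholders below are examples, not actual values):

mysql -h <node-or-proxy-address> -u <admin-user> -p -e "SHOW STATUS LIKE 'wsrep_cluster_size';"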

Revert settings:

bosh -d <service-instance-deployment> unignore <instance-id-ignored-above>

bosh update-resurrection on