Foundations may become unstable when a MySQL node's state transfer stalls. Symptoms you may observe include the NCP plugin being down, the API and CLI being unavailable, and an inability to elect a master among the Diego Database nodes. Additional behaviors are described below:
When attempting to recreate the Diego Database, pre-start fails due to the policy-server job:
Task 160204 | 14:48:16 | Error: Action Failed get_task: Task ed55499a-fc50-45b4-7bc2-aec1e805032a result: 1 of 8 pre-start scripts failed. Failed Jobs: policy-server. Successful Jobs: loggregator_agent, cfdot, bpm, syslog_forwarder, bosh-dns, qualys-cloud-agent-linux, ncp.
Pre-start Error Log
+ '[' -f /var/vcap/jobs/bosh-dns/bin/wait ']'
+ /var/vcap/jobs/bosh-dns/bin/wait
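To see why the policy-server pre-start failed, its pre-start logs on the affected instance can be inspected; a minimal sketch, assuming the standard BOSH pre-start log layout (/var/vcap/sys/log/<job>/pre-start.*.log):
$ bosh -d $(bosh ds --column=name | grep ^cf-) ssh diego_database/0 -c "sudo tail -n 100 /var/vcap/sys/log/policy-server/pre-start.stderr.log"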
Check the communication status between NCP and the NSX API or the BBS API:
$ bosh -d $(bosh ds --column=name | grep ^cf-) ssh -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r diego_database | grep instance
This instance is not the NCP master
This instance is not the NCP master
This instance is not the NCP master
Policy Server Logs
[mysql] 2025/06/10 14:40:15 connection.go:49: read tcp 10.###.###.22:37384->10.###.###.16:3306: i/o timeout
[mysql] 2025/06/10 14:42:16 connection.go:49: read tcp 10.###.###.22:53672->10.###.###.17:3306: i/o timeout
[mysql] 2025/06/10 14:44:17 connection.go:49: read tcp 10.###.###.22:45822->10.###.###.16:3306: i/o timeout
[mysql] 2025/06/10 14:46:19 connection.go:49: read tcp 10.###.###.22:33106->10.###.###.16:3306: i/o timeout
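The repeated i/o timeouts indicate the policy-server cannot complete reads against the MySQL backends. Basic TCP reachability from the same VM can be checked with a quick sketch like the one below (the IPs are the masked examples from the log; substitute your own, and use bash's /dev/tcp if nc is not available on the stemcell):
# run on the VM emitting the timeouts
nc -zv -w 5 10.###.###.16 3306
nc -zv -w 5 10.###.###.17 3306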
The "NCP down" alarm in the NSX alarm dashboard indicates that the NSX Manager is unable to communicate with the Network Container Plugin.
MySQL proxy logs showing backend unhealthy errors:
{"timestamp":"2025-06-10T17:12:42.644321002Z","level":"error","source":"/var/vcap/packages/proxy/bin/proxy","message":"/var/vcap/packages/proxy/bin/proxy.active-monitor.Healthcheck failed on backend","data":{"backend":{"host":"10.###.###.19","port":3306,"status_port":9201,"healthy":false,"name":"mysql/533dd8d6-####-####-####-####","currentSessionCount":0},"endpoint":"https://10.###.###.19:9201/api/v1/status","error":"Backend reported as unhealthy","resp":"HTTP 200 OK: {\"wsrep_local_state\":1,\"wsrep_local_state_comment\":\"Joining\",\"wsrep_local_index\":18446744073709551615,\"healthy\":false}","session":"1"}}
Examine MySQL cluster health with mysql-diag:
mysql_monitor/0dd523f9-####-####-####-####:~# mysql-diag
Tue Jun 10 17:10:57 UTC 2025
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+
| INSTANCE | STATE | CLUSTER STATUS | SEQNO | PERSISTENT DISK USED | EPHEMERAL DISK USED |
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+
| [0] mysql/9a6be4d7-####-####-####-#### | Synced | Primary | 827608924 | 70.2G / 195.8G (35.8%) | 1.5G / 283.1G (0.5%) |
| [1] mysql/ac0869db-####-####-####-#### | Synced | Primary | 827608924 | 66.7G / 195.8G (34.1%) | 1.4G / 283.1G (0.5%) |
| [2] mysql/533dd8d6-####-####-####-#### | Joining: receiving State Transfer | Primary | 801841521 | 92.3G / 195.8G (47.1%) | 1.4G / 283.1G (0.5%) |
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+
NOTE: Proxies will currently attempt to direct traffic to "mysql/9a6be4d7-1f6e-4249-8a5a-4a7136464609"
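The Galera state can also be checked directly from the mysql client on any node; a minimal sketch (these are roughly the wsrep status variables behind the mysql-diag columns above):
mysql> SHOW GLOBAL STATUS WHERE Variable_name IN ('wsrep_local_state_comment','wsrep_cluster_status','wsrep_local_recv_queue','wsrep_last_committed');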
SSH into the MySQL node and run SHOW PROCESSLIST;
UAA connections were running UPDATE statements for over 12,000 seconds (~3.5 hours):
id: 141511164, Time: 12742
id: 141511417, Time: 12048
id: 141511392, Time: 10626
Diego and Locket processes are stuck in "wsrep: preparing for TO isolation":
Last_errno: 1205 → Lock wait timeout (classic sign of a deadlock or blocking)
Query_time: ~50s, Lock_time: ~50s → All time spent waiting for a lock
InnoDB_rec_lock_wait: ~50s → Confirms waiting on a record-level lock
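To identify the long-running UAA statements and the sessions blocked on locks, a hedged query against information_schema can be run from the same mysql client session (the 600-second threshold is an arbitrary example):
mysql> SELECT id, user, host, db, time, state, LEFT(info, 80) AS query
       FROM information_schema.processlist
       WHERE time > 600
       ORDER BY time DESC;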
To recover, clear the stalled state transfer and then recreate the Diego Database.

On the MySQL node stuck in "Joining: receiving State Transfer", restart its jobs so the state transfer can retry:
monit restart all

SSH into the mysql-monitor VM and confirm all nodes report Synced:
bosh -d <deployment> ssh mysql-monitor
sudo -i
mysql-diag

Once the cluster is healthy, recreate the affected Diego Database instance:
bosh -d <cf> recreate diego_database/0 --no-converge

Verify the VMs are running and the API and CLI are reachable again:
bosh vms
cf login

Confirm that an NCP master has been elected:
bosh -d $(bosh ds --column=name | grep ^cf-) ssh -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r diego_database | grep instance