MySQL Node Stuck in Joining: Receiving State Transfer: i/o timeout

Article ID: 400591


Products

VMware Tanzu Application Service

Issue/Introduction

Foundations may become unstable when a MySQL node's state transfer stalls. Observable symptoms include the NCP plugin being down, the API and CLI being unavailable, and an inability to elect an NCP master among the Diego Database nodes. Additional behaviors can be seen below:

 

When attempting to recreate the Diego Database, the pre-start script fails due to the policy-server job:

Task 160204 | 14:48:16 | Error: Action Failed get_task: Task ed55499a-fc50-45b4-7bc2-aec1e805032a result: 1 of 8 pre-start scripts failed. Failed Jobs: policy-server. Successful Jobs: loggregator_agent, cfdot, bpm, syslog_forwarder, bosh-dns, qualys-cloud-agent-linux, ncp.

Pre-start Error Log

  • diego_database/###:/var/vcap/sys/log/policy-server# tail -10f pre-start.stderr.log
+ '[' -f /var/vcap/jobs/bosh-dns/bin/wait ']'
+ /var/vcap/jobs/bosh-dns/bin/wait
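
The pre-start output above ends at the bosh-dns wait step. If it appears hung there, the same wait script can be run by hand to rule out a DNS problem (a minimal sketch; the path is taken from the log above):

# Run the step the pre-start script is blocked on; it should return promptly
# when bosh-dns is healthy.
$ sudo /var/vcap/jobs/bosh-dns/bin/wait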

Check the communication status between the NSX API or BBS API and NCP:

$ bosh -d $(bosh ds --column=name | grep ^cf-) ssh -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r diego_database | grep instance

This instance is not the NCP master
This instance is not the NCP master
This instance is not the NCP master

Policy Server Logs

  •  /var/vcap/packages/policy-server/bin/migrate-db -config-file=/var/vcap/jobs/policy-server/config/policy-server.json
[mysql] 2025/06/10 14:40:15 connection.go:49: read tcp 10.###.###.22:37384->10.###.###.16:3306: i/o timeout
[mysql] 2025/06/10 14:42:16 connection.go:49: read tcp 10.###.###.22:53672->10.###.###.17:3306: i/o timeout
[mysql] 2025/06/10 14:44:17 connection.go:49: read tcp 10.###.###.22:45822->10.###.###.16:3306: i/o timeout
[mysql] 2025/06/10 14:46:19 connection.go:49: read tcp 10.###.###.22:33106->10.###.###.16:3306: i/o timeout
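
The repeated i/o timeouts show the policy-server migration cannot complete a query against MySQL. A quick port check from the same VM separates a network problem from a stalled backend (a minimal sketch, assuming netcat is available on the VM; the IPs and port come from the timeout messages above):

# If the TCP connect succeeds but queries still time out, the problem is a
# stalled MySQL backend rather than network reachability.
$ nc -zv -w 5 10.###.###.16 3306
$ nc -zv -w 5 10.###.###.17 3306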

The "NCP down" alarm in the NSX alarm dashboard indicates that the NSX Manager is unable to communicate with the Network Container Plugin.

MySQL proxy showing backend unhealthy errors:

{"timestamp":"2025-06-10T17:12:42.644321002Z","level":"error","source":"/var/vcap/packages/proxy/bin/proxy","message":"/var/vcap/packages/proxy/bin/proxy.active-monitor.Healthcheck failed on backend","data":{"backend":{"host":"10.###.###.19","port":3306,"status_port":9201,"healthy":false,"name":"mysql/533dd8d6-####-####-####-####","currentSessionCount":0},"endpoint":"https://10.###.###.19:9201/api/v1/status","error":"Backend reported as unhealthy","resp":"HTTP 200 OK: {\"wsrep_local_state\":1,\"wsrep_local_state_comment\":\"Joining\",\"wsrep_local_index\":18446744073709551615,\"healthy\":false}","session":"1"}}

Examine MySQL cluster health with mysql-diag:

mysql_monitor/0dd523f9-####-####-####-####:~# mysql-diag
Tue Jun 10 17:10:57 UTC 2025
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+
|                    INSTANCE                     |               STATE               | CLUSTER STATUS |   SEQNO   |  PERSISTENT DISK USED  | EPHEMERAL DISK USED  |
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+
|  [0] mysql/9a6be4d7-####-####-####-####         | Synced                            | Primary        | 827608924 | 70.2G / 195.8G (35.8%) | 1.5G / 283.1G (0.5%) |
|  [1] mysql/ac0869db-####-####-####-####         | Synced                            | Primary        | 827608924 | 66.7G / 195.8G (34.1%) | 1.4G / 283.1G (0.5%) |
|  [2] mysql/533dd8d6-####-####-####-####         | Joining: receiving State Transfer | Primary        | 801841521 | 92.3G / 195.8G (47.1%) | 1.4G / 283.1G (0.5%) |
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+

NOTE: Proxies will currently attempt to direct traffic to "mysql/9a6be4d7-1f6e-4249-8a5a-4a7136464609"

SSH into the mysql node and run SHOW PROCESSLIST;
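
For example, the longest-running sessions can be listed with a filter on run time (a sketch; admin credentials are deployment-specific and the one-hour threshold is an arbitrary assumption):

# Show the longest-running sessions; multi-hour statements during a state
# transfer are suspect.
$ mysql -u root -p -e "SELECT id, user, time, state, LEFT(info, 80) AS query_text FROM information_schema.processlist WHERE time > 3600 ORDER BY time DESC;"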

UAA connections running update statements for over 12,000 seconds (~3.5 hours):

id: 141511164, Time: 12742
id: 141511417, Time: 12048
id: 141511392, Time: 10626
 

Diego and Locket processes are stuck in wsrep: preparing for TO isolation:

diego  id: 141511467, Time: 11950
locket id: 141511469, Time: 11948
 
Slow Query Log
  • Last_errno: 1205 → Lock wait timeout (a classic sign of a deadlock or blocking)
  • Query_time: ~50s, Lock_time: ~50s → All time spent waiting for a lock
  • InnoDB_rec_lock_wait: ~50s → Confirms waiting on a record-level lock
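
These fields can be pulled straight from the slow query log (a sketch; the log path below is an assumption and varies by tile version):

# The path is an assumption; check the mysql job's log directory on your node.
$ grep -E 'Last_errno|Query_time|Lock_time|InnoDB_rec_lock_wait' /var/vcap/sys/log/pxc-mysql/mysql_slow_query.log | tail -20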

Cause

  • The MySQL service instance was stuck undergoing a State Transfer, a heavy operation that copies data from another node, the same node to which the mysql-proxy was directing traffic
  • The heavy operation and long-running queries caused communication issues between the policy-server job and the mysql-proxy
  • This cascaded into timeouts during the Diego database recreation, which performs a database migration during pre-start, and prevented proper NCP master election

Resolution

  • Perform a monit restart on the node that the mysql-proxy is forwarding traffic to and on the node that is stuck in the transfer state.
    • monit restart all
  • Validate MySQL cluster synchronization
    • bosh -d <deployment> ssh mysql-monitor
      sudo -i
      mysql-diag
  • Recreate the Diego database VM
    • bosh -d <cf-deployment> recreate diego_database/0 --no-converge
  • Confirm the Cloud Controller and Clock Global VMs are up and the cf CLI is operational
    • bosh vms
      cf login
  • Verify that NCP master election completed
    • $ bosh -d $(bosh ds --column=name | grep ^cf-) ssh -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r diego_database | grep instance
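
Overall VM and process health can then be confirmed across both deployments (a sketch; the deployment names are placeholders):

# All instances and their processes should report "running" once recovery is
# complete.
$ bosh -d <mysql-deployment> instances --ps
$ bosh -d <cf-deployment> instances --ps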