MySQL Node Stuck in Joining: Receiving State Transfer: i/o timeout

Article ID: 400591


Products

VMware Tanzu Application Service

Issue/Introduction

Foundations may become unstable when a MySQL node's state transfer stalls. Observable symptoms include the NCP plugin being down, the API and CLI being unavailable, and an inability to elect an NCP master among the Diego Database nodes. Additional behaviors can be seen below:

 

When attempting to recreate the Diego Database, the pre-start script fails due to the policy-server job:

Task 160204 | 14:48:16 | Error: Action Failed get_task: Task ed55499a-fc50-45b4-7bc2-aec1e805032a result: 1 of 8 pre-start scripts failed. Failed Jobs: policy-server. Successful Jobs: loggregator_agent, cfdot, bpm, syslog_forwarder, bosh-dns, qualys-cloud-agent-linux, ncp.

Pre-start Error Log

  • diego_database/###:/var/vcap/sys/log/policy-server# tail -10f pre-start.stderr.log
+ '[' -f /var/vcap/jobs/bosh-dns/bin/wait ']'
+ /var/vcap/jobs/bosh-dns/bin/wait
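
The pre-start output above ends at the bosh-dns wait step. If it appears hung there, the same wait script can be run by hand to rule out a DNS problem (a minimal sketch; the path is taken from the log above):

# Run the step the pre-start script is blocked on; it should return promptly
# when bosh-dns is healthy.
$ sudo /var/vcap/jobs/bosh-dns/bin/wait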

Check the communication status between the NSX API or BBS API and NCP:

$ bosh -d $(bosh ds --column=name | grep ^cf-) ssh -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r diego_database | grep instance

This instance is not the NCP master
This instance is not the NCP master
This instance is not the NCP master

Policy Server Logs

  •  /var/vcap/packages/policy-server/bin/migrate-db -config-file=/var/vcap/jobs/policy-server/config/policy-server.json
[mysql] 2025/06/10 14:40:15 connection.go:49: read tcp 10.###.###.22:37384->10.###.###.16:3306: i/o timeout
[mysql] 2025/06/10 14:42:16 connection.go:49: read tcp 10.###.###.22:53672->10.###.###.17:3306: i/o timeout
[mysql] 2025/06/10 14:44:17 connection.go:49: read tcp 10.###.###.22:45822->10.###.###.16:3306: i/o timeout
[mysql] 2025/06/10 14:46:19 connection.go:49: read tcp 10.###.###.22:33106->10.###.###.16:3306: i/o timeout
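
The repeated i/o timeouts show the policy-server migration cannot complete a query against MySQL. A quick port check from the same VM separates a network problem from a stalled backend (a minimal sketch, assuming netcat is available on the VM; the IPs and port come from the timeout messages above):

# If the TCP connect succeeds but queries still time out, the problem is a
# stalled MySQL backend rather than network reachability.
$ nc -zv -w 5 10.###.###.16 3306
$ nc -zv -w 5 10.###.###.17 3306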

The "NCP down" alarm in the NSX alarm dashboard indicates that the NSX Manager is unable to communicate with the Network Container Plugin.

MySQL proxy showing backend unhealthy errors:

{"timestamp":"2025-06-10T17:12:42.644321002Z","level":"error","source":"/var/vcap/packages/proxy/bin/proxy","message":"/var/vcap/packages/proxy/bin/proxy.active-monitor.Healthcheck failed on backend","data":{"backend":{"host":"10.###.###.19","port":3306,"status_port":9201,"healthy":false,"name":"mysql/533dd8d6-####-####-####-####","currentSessionCount":0},"endpoint":"https://10.###.###.19:9201/api/v1/status","error":"Backend reported as unhealthy","resp":"HTTP 200 OK: {\"wsrep_local_state\":1,\"wsrep_local_state_comment\":\"Joining\",\"wsrep_local_index\":18446744073709551615,\"healthy\":false}","session":"1"}}

Examine MySQL cluster health with mysql-diag:

mysql_monitor/0dd523f9-####-####-####-####:~# mysql-diag
Tue Jun 10 17:10:57 UTC 2025
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+
|                    INSTANCE                     |               STATE               | CLUSTER STATUS |   SEQNO   |  PERSISTENT DISK USED  | EPHEMERAL DISK USED  |
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+
|  [0] mysql/9a6be4d7-####-####-####-####         | Synced                            | Primary        | 827608924 | 70.2G / 195.8G (35.8%) | 1.5G / 283.1G (0.5%) |
|  [1] mysql/ac0869db-####-####-####-####         | Synced                            | Primary        | 827608924 | 66.7G / 195.8G (34.1%) | 1.4G / 283.1G (0.5%) |
|  [2] mysql/533dd8d6-####-####-####-####         | Joining: receiving State Transfer | Primary        | 801841521 | 92.3G / 195.8G (47.1%) | 1.4G / 283.1G (0.5%) |
+-------------------------------------------------+-----------------------------------+----------------+-----------+------------------------+----------------------+

NOTE: Proxies will currently attempt to direct traffic to "mysql/9a6be4d7-1f6e-4249-8a5a-4a7136464609"

SSH into the mysql node and run SHOW PROCESSLIST;
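
For example, the longest-running sessions can be listed with a filter on run time (a sketch; admin credentials are deployment-specific and the one-hour threshold is an arbitrary assumption):

# Show the longest-running sessions; multi-hour statements during a state
# transfer are suspect.
$ mysql -u root -p -e "SELECT id, user, time, state, LEFT(info, 80) AS query_text FROM information_schema.processlist WHERE time > 3600 ORDER BY time DESC;"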

UAA connections running update statements for over 12,000 seconds (~3.5 hours):

id: 141511164, Time: 12742
id: 141511417, Time: 12048
id: 141511392, Time: 10626
 

Diego and Locket processes are stuck in wsrep: preparing for TO isolation:

diego  id: 141511467, Time: 11950
locket id: 141511469, Time: 11948
 
Slow Query Log
  • Last_errno: 1205 → Lock wait timeout (a classic sign of a deadlock or blocking)
  • Query_time: ~50s, Lock_time: ~50s → All time spent waiting for a lock
  • InnoDB_rec_lock_wait: ~50s → Confirms waiting on a record-level lock
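
These fields can be pulled straight from the slow query log (a sketch; the log path below is an assumption and varies by tile version):

# The path is an assumption; check the mysql job's log directory on your node.
$ grep -E 'Last_errno|Query_time|Lock_time|InnoDB_rec_lock_wait' /var/vcap/sys/log/pxc-mysql/mysql_slow_query.log | tail -20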

Cause

  • The MySQL service instance was stuck undergoing a State Transfer, a heavy operation that copies data from another node, the same node to which the mysql-proxy was directing traffic
  • The heavy operation and long-running queries caused communication issues between the policy-server job and the mysql-proxy
  • This cascaded into timeouts during the Diego database recreation, which performs a database migration during pre-start, and prevented proper NCP master election

Resolution

  • Perform a monit restart on the node that the mysql-proxy is forwarding traffic to and on the node that is stuck in the transfer state.
    • monit restart all
  • Validate MySQL cluster synchronization
    • bosh -d <deployment> ssh mysql-monitor
      sudo -i
      mysql-diag
  • Recreate the Diego database VM
    • bosh -d <cf-deployment> recreate diego_database/0 --no-converge
  • Confirm the Cloud Controller and Clock Global VMs are up and the cf CLI is operational
    • bosh vms
      cf login
  • Verify that NCP master election completed
    • $ bosh -d $(bosh ds --column=name | grep ^cf-) ssh -c "sudo /var/vcap/jobs/ncp/bin/nsxcli -c get ncp-master status" -r diego_database | grep instance
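
Overall VM and process health can then be confirmed across both deployments (a sketch; the deployment names are placeholders):

# All instances and their processes should report "running" once recovery is
# complete.
$ bosh -d <mysql-deployment> instances --ps
$ bosh -d <cf-deployment> instances --ps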