This article provides guidance for troubleshooting a Diego cell that BOSH reports as failing because a process is hung. In our example, the process was in D state (uninterruptible sleep, usually I/O), but a process hung in a different state could produce the same symptom.
The bosh vms command reports a diego_cell as failing. However, when we bosh ssh to the Diego cell, all services are up and running. Even if we stop and start all services (monit stop all, then monit start all), BOSH still reports the diego_cell as failing.
Note: The diego_cell may also be reported as running with this symptom, but deployment changes can hang while trying to update the instance.
diego_cell/5 (7a9e0058-33a8-409f-ad50-2604c92b2b22) | failing
All requests are queued up due to the process stuck in D state:
diego_cell/b051d335-3fab-4150-8726-6ef1f0461015:~# ps -efly | grep ^D
D root 4030014 2 0 80 0 0 0 msleep Jul05 ? 00:02:18 [kworker/u16:1]
D root 4043789 4043780 0 70 -10 188 2224 copy_n Jul05 ? 00:00:00 /proc/self/exe init
D root 4043812 4043803 0 70 -10 188 2224 copy_n Jul05 ? 00:00:00 /proc/self/exe init
diego_cell/b051d335-3fab-4150-8726-6ef1f0461015:~# ps -ef | grep 4043789
root 2112360 2110853 0 14:19 pts/1 00:00:00 grep --color=auto 4043789
root 4043789 4043780 0 Jul05 ? 00:00:00 /proc/self/exe init
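The check above can also be scripted when several cells need to be inspected. The following is a minimal sketch, not part of the original procedure: it assumes the text comes from ps -efly (columns S UID PID PPID C PRI NI RSS SZ WCHAN STIME TTY TIME CMD), and the names StuckProcess and find_d_state are hypothetical helpers introduced here for illustration.

```python
from typing import List, NamedTuple


class StuckProcess(NamedTuple):
    """One process reported in D state (hypothetical helper type)."""
    pid: int
    ppid: int
    wchan: str
    command: str


def find_d_state(ps_output: str) -> List[StuckProcess]:
    """Return the processes whose state column is D in `ps -efly` output."""
    stuck = []
    for line in ps_output.splitlines():
        # Assumed column order for `ps -efly`:
        # S UID PID PPID C PRI NI RSS SZ WCHAN STIME TTY TIME CMD
        fields = line.split(None, 13)
        if len(fields) < 14 or fields[0] != "D":
            continue  # header, short line, or a process not in D state
        stuck.append(
            StuckProcess(
                pid=int(fields[2]),
                ppid=int(fields[3]),
                wchan=fields[9],
                command=fields[13],
            )
        )
    return stuck


# Sample lines copied from the output shown in this article.
SAMPLE = """\
D root 4030014 2 0 80 0 0 0 msleep Jul05 ? 00:02:18 [kworker/u16:1]
D root 4043789 4043780 0 70 -10 188 2224 copy_n Jul05 ? 00:00:00 /proc/self/exe init
D root 4043812 4043803 0 70 -10 188 2224 copy_n Jul05 ? 00:00:00 /proc/self/exe init
"""

if __name__ == "__main__":
    for proc in find_d_state(SAMPLE):
        print(proc.pid, proc.ppid, proc.wchan, proc.command)
```

On a live cell, the same function could be fed the text captured from ps -efly instead of the sample. The WCHAN value (msleep or copy_n in the example) shows which kernel function each process is sleeping in.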
Garden logs report the following:
{"timestamp":"1499693976.045541048","source":"guardian","message":"guardian.start.looking-for-properties.failed-restoring-container","log_level":2,"data":{"error":"loading 39f44873-db7a-49c9-42f7-be2f734870f8: property not found: kawasaki.host-interface","handle":"39f44873-db7a-49c9-42f7-be2f734870f8","session":"7.3"}}
{"timestamp":"1499693976.045624018","source":"guardian","message":"guardian.start.looking-for-properties.failed-restoring-container","log_level":2,"data":{"error":"loading 63d9ae5d-1dca-4821-61f7-0ffa44797213: property not found: kawasaki.host-interface","handle":"63d9ae5d-1dca-4821-61f7-0ffa44797213","session":"7.4"}}
{"timestamp":"1499693976.045707941","source":"guardian","message":"guardian.start.looking-for-properties.failed-restoring-container","log_level":2,"data":{"error":"loading 69043dd0-bc30-4b55-5463-b26ed81e5fba: property not found: kawasaki.host-interface","handle":"69043dd0-bc30-4b55-5463-b26ed81e5fba","session":"7.5"}}
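To see how many containers failed to restore, the Garden log entries above can be summarized with a short script. This is a sketch under stated assumptions: it expects one JSON object per log line with the fields shown above (message, data.handle, data.error), and failed_restores is a hypothetical helper name. The log text is supplied by the caller, since the log location depends on the deployment.

```python
import json
from typing import Dict

# Message value taken from the Garden log entries shown in this article.
RESTORE_FAILURE = "guardian.start.looking-for-properties.failed-restoring-container"


def failed_restores(log_text: str) -> Dict[str, str]:
    """Map container handle to error text for each failed-restore entry."""
    failures: Dict[str, str] = {}
    for line in log_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: lines that are not JSON can be ignored
        if not isinstance(entry, dict):
            continue
        if entry.get("message") != RESTORE_FAILURE:
            continue
        data = entry.get("data") or {}
        failures[data.get("handle", "<unknown>")] = data.get("error", "")
    return failures


# One entry copied from the log excerpt in this article.
SAMPLE_LOG = (
    '{"timestamp":"1499693976.045541048","source":"guardian",'
    '"message":"guardian.start.looking-for-properties.failed-restoring-container",'
    '"log_level":2,"data":{"error":"loading 39f44873-db7a-49c9-42f7-be2f734870f8: '
    'property not found: kawasaki.host-interface",'
    '"handle":"39f44873-db7a-49c9-42f7-be2f734870f8","session":"7.3"}}\n'
)

if __name__ == "__main__":
    for handle, error in failed_restores(SAMPLE_LOG).items():
        print(handle, "->", error)
```

A large number of distinct handles with the same "property not found: kawasaki.host-interface" error matches the symptom described in this article.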
The exact root cause is currently unknown. The most likely contributor is the processes stuck in D state (uninterruptible sleep, usually waiting on I/O): a process in this state does not respond to signals, so it cannot be killed or restarted until the kernel call it is blocked on returns.
The only solution is to reboot the VM from the IaaS side. If the reboot fails to stop the VM, the customer will need to delete the VM via the IaaS. If the resurrector is enabled, it will recreate the VM within a few minutes.
Note: bosh stop <instance name> <instance no.> will not resolve the issue.