Diego Cell Unresponsive Due to a Process Stuck in "D" State

Article ID: 297849

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

This article provides guidance for troubleshooting a Diego cell that is reported as failing because a process is hung. In our example the process was in D state (uninterruptible sleep, usually I/O), but it could be hung in a different state.

The bosh vms command reports a diego_cell as failing. However, after running bosh ssh to the Diego cell, all services appear to be up and running. Stopping and starting all services (monit stop all, then monit start all) does not help: BOSH still reports the diego_cell as failing.

Note: The diego_cell can also be reported as running with this symptom, but deployment changes can hang while trying to update the instance.

diego_cell/5 (7a9e0058-33a8-409f-ad50-2604c92b2b22) | failing
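
To pick out the failing instances from a larger deployment, the bosh vms output can be filtered. The following is a minimal sketch, not an official procedure. It assumes the output is piped (not printed to a terminal) and that the first whitespace-separated column is the instance and the second is the process state. The helper name failing_instances and the deployment placeholder are hypothetical.

```shell
# Print instances whose process state is "failing".
# Assumption: stdin is piped `bosh -d <deployment> vms` output, where
# column 1 is the instance and column 2 is the process state.
# failing_instances is a hypothetical helper name.
failing_instances() {
  awk '$2 == "failing" { print $1 }'
}

# Example (the deployment name is a placeholder):
#   bosh -d <deployment> vms | failing_instances
```

If the column layout differs in your CLI version, adjust the field numbers accordingly.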

Requests queue up behind the processes stuck in D state (the state is shown in the first column of the ps output):

diego_cell/b051d335-3fab-4150-8726-6ef1f0461015:~# ps -efly | grep ^D
D root 4030014 2 0 80 0 0 0 msleep Jul05 ? 00:02:18 [kworker/u16:1]
D root 4043789 4043780 0 70 -10 188 2224 copy_n Jul05 ? 00:00:00 /proc/self/exe init
D root 4043812 4043803 0 70 -10 188 2224 copy_n Jul05 ? 00:00:00 /proc/self/exe init

diego_cell/b051d335-3fab-4150-8726-6ef1f0461015:~# ps -ef | grep 4043789
root 2112360 2110853 0 14:19 pts/1 00:00:00 grep --color=auto 4043789
root 4043789 4043780 0 Jul05 ? 00:00:00 /proc/self/exe init
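
The check above can be wrapped in a small helper that also prints the kernel wait channel for each process. This is a sketch under stated assumptions: it relies on the procps version of ps found on Linux stemcells, and the function names filter_d_state and list_d_state are hypothetical.

```shell
# Keep the header row plus any row whose STAT column starts with "D".
# Assumption: stdin has the columns PID PPID STAT WCHAN ELAPSED COMMAND,
# as produced by the ps invocation in list_d_state below.
filter_d_state() {
  awk 'NR == 1 || $3 ~ /^D/'
}

# List all processes currently in uninterruptible sleep.
list_d_state() {
  ps -eo pid,ppid,stat,wchan:32,etime,args | filter_d_state
}

# Example, run as root on the Diego cell:
#   list_d_state
```

Where the kernel exposes it, reading /proc/<pid>/stack as root for one of the listed PIDs may give more detail on what the process is waiting for.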

Garden logs report the following:

{"timestamp":"1499693976.045541048","source":"guardian","message":"guardian.start.looking-for-properties.failed-restoring-container","log_level":2,"data":{"error":"loading 39f44873-db7a-49c9-42f7-be2f734870f8: property not found: kawasaki.host-interface","handle":"39f44873-db7a-49c9-42f7-be2f734870f8","session":"7.3"}}
{"timestamp":"1499693976.045624018","source":"guardian","message":"guardian.start.looking-for-properties.failed-restoring-container","log_level":2,"data":{"error":"loading 63d9ae5d-1dca-4821-61f7-0ffa44797213: property not found: kawasaki.host-interface","handle":"63d9ae5d-1dca-4821-61f7-0ffa44797213","session":"7.4"}}
{"timestamp":"1499693976.045707941","source":"guardian","message":"guardian.start.looking-for-properties.failed-restoring-container","log_level":2,"data":{"error":"loading 69043dd0-bc30-4b55-5463-b26ed81e5fba: property not found: kawasaki.host-interface","handle":"69043dd0-bc30-4b55-5463-b26ed81e5fba","session":"7.5"}}
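
To see which containers Garden failed to restore, the handles can be extracted from log lines like the ones above. This is a minimal sketch that uses only grep, sed and sort. The helper name failed_restore_handles is hypothetical, and the log path in the example is an assumption that may differ between versions.

```shell
# Print the unique container handles from
# "failed-restoring-container" log lines read on stdin.
# failed_restore_handles is a hypothetical helper name.
failed_restore_handles() {
  grep 'failed-restoring-container' |
    sed -n 's/.*"handle":"\([^"]*\)".*/\1/p' |
    sort -u
}

# Example (the log path is an assumption):
#   failed_restore_handles < /var/vcap/sys/log/garden/garden.stdout.log
```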


Cause

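The exact cause is currently unknown. The affected processes are in state D (uninterruptible sleep, usually waiting on I/O). A process in this state does not respond to signals, including SIGKILL, until the kernel operation it is waiting on completes, which is consistent with restarting the services not clearing the condition.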
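
A process can pass through D state briefly during normal I/O, so it can help to confirm that a process stays in D state over several samples before concluding that it is stuck. The following is a sketch only. The function names, the sample count and the interval are hypothetical choices, and the optional last argument exists only so that the helper can be pointed at a directory other than /proc.

```shell
# Print the one-letter state from <proc_root>/<pid>/status.
# Usage: proc_state <pid> [proc_root]   (proc_root defaults to /proc)
proc_state() {
  awk '/^State:/ { print $2 }' "${2:-/proc}/$1/status" 2>/dev/null
}

# Return 0 only if every sample reads "D".
# Usage: stuck_in_d <pid> [samples] [interval_seconds] [proc_root]
stuck_in_d() {
  pid=$1; samples=${2:-5}; interval=${3:-1}; root=${4:-/proc}
  i=0
  while [ "$i" -lt "$samples" ]; do
    [ "$(proc_state "$pid" "$root")" = "D" ] || return 1
    i=$((i + 1))
    [ "$i" -lt "$samples" ] && sleep "$interval"
  done
  return 0
}

# Example: sample PID 4043789 ten times, two seconds apart.
#   stuck_in_d 4043789 10 2 && echo "stuck in D state"
```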

Resolution

The only solution is to reboot the VM from the IaaS side. If the reboot fails to stop the VM, delete the VM via the IaaS. If the BOSH resurrector is enabled, it will recreate the VM within a few minutes.

Note: bosh stop <instance name> <instance no.> will not resolve the issue.