BOSH VM recreation fails with "Stopping services errored"

Article ID: 438310

Products

VMware Tanzu Application Platform

Issue/Introduction

The BOSH VM is in a stopped state.
Running a `bosh recreate` command against it fails with an error similar to:

06:54:10 | Updating instance diego_cell/xxx-yyy-zzz : Stopping instance (00:00:39)
                         L Error: Action Failed get_task: Task xxx-yyy-zzz result: Stopping Monitored Services: Stopping services '[garden]' errored

The VM's logs show errors similar to:

Failed to stop garden.service: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
See system logs and 'systemctl status garden.service' for details.

level=warning msg="unable to get oom kill count" error="openat2 /sys/fs/cgroup/memory/system.slice/runc-bpm-rep.scope/memory.oom_control: no such file or directory"

level=error msg="runc run failed: unable to start container process: unable to apply cgroup configuration: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)"

Running `time systemctl is-system-running` on the VM confirms that DBus requests to PID 1 time out:

Failed to query system state: Failed to activate service 'org.freedesktop.systemd1':
timed out (service_start_timeout=25000ms)
real    0m1.910s

The dmesg output shows a message similar to:

systemd-journald: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
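
The responsiveness check above can be wrapped in a small helper. This is a sketch under assumptions: the function name `probe_systemd` and the `PROBE_TIMEOUT` variable are hypothetical, and the coreutils `timeout` command is assumed to be available on the VM. It treats either a timeout or the DBus activation error shown above as a sign that PID 1 is not answering.

```shell
# Hypothetical helper (not part of BOSH or systemd): run a probe command
# with a time limit and report whether systemd (PID 1) answered.
probe_systemd() {
  limit="${PROBE_TIMEOUT:-30}"   # seconds; assumed default, above the 25 s DBus timeout
  if out="$(timeout "$limit" "$@" 2>&1)"; then
    rc=0
  else
    rc=$?
  fi
  # Exit status 124 means `timeout` stopped the probe before it returned.
  if [ "$rc" -eq 124 ]; then
    echo "unresponsive"
    return 1
  fi
  # The DBus activation failure seen in the logs also counts as no answer.
  case "$out" in
    *"Failed to activate service 'org.freedesktop.systemd1'"*)
      echo "unresponsive"
      return 1
      ;;
  esac
  echo "responsive"
}
```

On the affected VM it could be called as `probe_systemd systemctl is-system-running`. A state such as "degraded" still counts as responsive here, because systemd did answer the query.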

Environment

Foundation Core 3.x

Elastic Application Runtime 6.x, 10.x

Cause

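The cause is an OS-level failure of the systemd control plane (PID 1). systemd becomes internally blocked (deadlocked or stalled) and stops responding to DBus requests, even though the process is still running. Because stopping garden.service goes through systemd, the stop step of the recreate fails.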

Resolution

Identify the VM CID: Run the command below to find the cloud ID (CID) of the affected BOSH VM.

bosh -d deployment-name instances --details
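
To avoid copying the CID by hand, the lookup can be scripted. The sketch below assumes the BOSH CLI's global `--column` option is used to reduce the output to two whitespace-separated columns, instance and VM CID. The function name `cid_for_instance` is hypothetical, and the column names and layout should be checked against the output in your environment.

```shell
# Hypothetical helper: read two-column "instance  vm_cid" rows on stdin and
# print the CID of the named instance. Exits non-zero if no row matches.
cid_for_instance() {
  awk -v inst="$1" '
    $1 == inst { print $2; found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Assumed invocation (column names are an assumption, verify locally):
#   bosh -d deployment-name instances --details \
#     --column=instance --column=vm_cid | cid_for_instance vm_name/id
```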

Force-delete the VM: Instead of running a standard recreate, manually delete the problematic VM by its CID:

bosh -d deployment-name delete-vm <VM-CID>

Once the old VM is deleted, run the recreate command again. BOSH detects that the VM is missing and provisions a new one:

bosh -d deployment-name recreate vm_name/id --fix --no-converge
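
The delete and recreate commands above can be chained so that the recreate runs only if the delete succeeds. The following is a sketch, not an official procedure: `recover_vm`, `run_cmd`, and the `DRY_RUN` variable are hypothetical names, and the script defaults to printing the commands rather than running them.

```shell
# Print the command when DRY_RUN=1 (the default here); otherwise execute it.
run_cmd() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Usage: recover_vm <deployment> <vm_name/id> <vm-cid>
recover_vm() {
  if [ "$#" -ne 3 ]; then
    echo "usage: recover_vm <deployment> <vm_name/id> <vm-cid>" >&2
    return 2
  fi
  # Delete the unresponsive VM by CID; stop here if that fails.
  run_cmd bosh -d "$1" delete-vm "$3" || return 1
  # Recreate only this instance, with the same flags as the step above.
  run_cmd bosh -d "$1" recreate "$2" --fix --no-converge
}
```

Reviewing the printed commands first and then rerunning with `DRY_RUN=0` keeps a mistyped CID from reaching `delete-vm`.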

Additional Information

If the `delete-vm` command also hangs, manually power off and delete the VM from your IaaS console (for example, vCenter), then run the `recreate --fix` command. If the VM has a persistent disk, do not delete that disk.
