Bosh VM is in stopped state.
Running a bosh recreate command fails with an error similar to:
06:54:10 | Updating instance diego_cell/xxx-yyy-zzz : Stopping instance (00:00:39)
L Error: Action Failed get_task: Task xxx-yyy-zzz result: Stopping Monitored Services: Stopping services '[garden]' errored
Inspecting the VM's logs shows errors similar to:
Failed to stop garden.service: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
See system logs and 'systemctl status garden.service' for details.
level=warning msg="unable to get oom kill count" error="openat2 /sys/fs/cgroup/memory/system.slice/runc-bpm-rep.scope/memory.oom_control: no such file or directory"
level=error msg="runc run failed: unable to start container process: unable to apply cgroup configuration: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)"
Running the command "time systemctl is-system-running" confirms a DBus timeout when contacting PID 1:
Failed to query system state: Failed to activate service 'org.freedesktop.systemd1':
timed out (service_start_timeout=25000ms)
real 0m1.910s
The dmesg log shows an entry similar to:
systemd-journald: Failed to send WATCHDOG=1 notification message: Transport endpoint is not connected
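The DBus check above can be scripted so the result is easier to record in a ticket. The following is a minimal sketch, not a supported tool: the helper names classify_systemd_output and check_pid1 are hypothetical, and the matching relies only on the error strings quoted in this article.

```shell
#!/bin/sh
# Sketch: classify the output of `systemctl is-system-running`.
# classify_systemd_output and check_pid1 are hypothetical helper names.

classify_systemd_output() {
  # $1: combined stdout/stderr captured from systemctl
  case "$1" in
    *"Failed to activate service 'org.freedesktop.systemd1'"*)
      # PID 1 did not answer the DBus request in time
      echo "dbus-timeout" ;;
    *"Failed to query system state"*)
      # The query failed for some other reason
      echo "query-failed" ;;
    *)
      # systemd answered (for example "running" or "degraded")
      echo "responsive" ;;
  esac
}

# Intended to be run on the affected VM.
check_pid1() {
  output="$(systemctl is-system-running 2>&1)"
  classify_systemd_output "$output"
}
```

Calling check_pid1 on a VM that returns the error shown above would print dbus-timeout. A result of responsive only means systemd answered the query; it does not rule out other problems on the VM.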
Foundation Core 3.x
Elastic Application Runtime 6.x, 10.x
The cause is an OS-level failure of the systemd control plane (PID 1). systemd becomes internally blocked (deadlocked or stalled) and stops responding to DBus requests, even though the process is still running.
Identify the VM CID: Run the command below to find the Cloud ID (CID) for the BOSH VM.
bosh -d deployment-name instances --details
Force delete the VM: Instead of a standard recreate, manually delete the problematic VM using the CID:
bosh -d deployment-name delete-vm <VM-CID>
Once the old VM is deleted, run the recreate command again. BOSH will see that the VM is missing and provision a brand new one:
bosh -d deployment-name recreate vm_name/id --fix --no-converge
If the `delete-vm` command also hangs, manually power off and delete the VM from your IaaS console (e.g., vCenter), then run the `recreate --fix` command. Do not delete the persistent disk if the VM has one.
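The resolution steps can be prepared as a reviewable plan before anything is run. The sketch below is illustrative only: find_instance_row and print_recovery_plan are hypothetical helper names, and the script prints the bosh commands from this article rather than executing them, so the CID can be double-checked first.

```shell
#!/bin/sh
# Sketch only: prints the recovery commands for review instead of running them.
# find_instance_row and print_recovery_plan are hypothetical helper names.

# Reads `bosh instances --details` output on stdin and prints the rows
# that mention the given instance, so the VM CID can be read from them.
find_instance_row() {
  grep -F -- "$1"
}

# Prints the delete-vm and recreate commands in the order described above.
print_recovery_plan() {
  deployment="$1"
  instance="$2"
  vm_cid="$3"
  if [ -z "$deployment" ] || [ -z "$instance" ] || [ -z "$vm_cid" ]; then
    echo "usage: print_recovery_plan DEPLOYMENT INSTANCE VM_CID" >&2
    return 1
  fi
  printf 'bosh -d %s delete-vm %s\n' "$deployment" "$vm_cid"
  printf 'bosh -d %s recreate %s --fix --no-converge\n' "$deployment" "$instance"
}
```

A possible use, assuming the placeholder names from the steps above: pipe `bosh -d deployment-name instances --details` into `find_instance_row vm_name/id`, read the CID from the matching row, then call `print_recovery_plan deployment-name vm_name/id <VM-CID>` and run the printed commands one at a time.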