When upgrading between versions of the windows rootfs (windows1803fs or windows2019fs) in VMware Tanzu Application Service for VMs (TAS) that have a shared Microsoft base layer, there may be failures to create containers, causing the upgrade to fail.
When upgrading between versions of the windows rootfs (windows1803fs or windows2019fs) in TASW that have a shared Microsoft base layer, there may be failures to create containers, causing the deployment to fail. If there are stemcell changes or if the Microsoft base layer changes, this error likely does not occur. When the stemcell changes, the Windows Diego Cells are recreated during the upgrade, so any corruption of the Microsoft base layer is irrelevant. When the Microsoft base layer changes, any corruption of the old base layer does not impact the upgrade because the new layer replaces the old one instead of being shared.
This error has been seen with Windows Server 1803 and Windows Server 2019. This error has been seen on GCP, Azure, vSphere, and AWS.
In order to investigate this issue, you will need to be able to SSH to the Ops Manager VM and run commands with the BOSH CLI.
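For example, once you have an SSH session on the Ops Manager VM, you can target the BOSH CLI at the BOSH Director before running the commands below. This is a minimal sketch; the client, secret, and certificate values shown are placeholders, and the real values come from the Bosh Command Line Credentials entry in the Ops Manager UI:

export BOSH_ENVIRONMENT=<director-ip>
export BOSH_CLIENT=ops_manager
export BOSH_CLIENT_SECRET=<secret-from-ops-manager>
export BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate
bosh env   # confirms the CLI can reach the BOSH Director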
The windowsfs job fails, which causes the TASW upgrade to fail. This error occurs when custom CA certificates need to be injected: when the rootfs job starts up, its pre-start script tries to create a container in order to add a new layer containing the certificates.
Task 308031 | 13:47:04 | Preparing deployment: Preparing deployment (00:00:03)
Task 308031 | 13:47:11 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 308031 | 13:47:21 | Updating instance windows_diego_cell: windows_diego_cell/44c5841f-7580-4e9c-9856-89fcbe08ab0d (2) (canary) (00:00:35)
L Error: Action Failed get_task: Task 59ba76d1-14c5-4d7b-681c-08b9ec4bd64d result: 1 of 10 pre-start scripts failed. Failed Jobs: windows1803fs. Successful Jobs: set_kms_host, groot, loggregator_agent_windows, bosh-dns-windows, rep_windows, winc-network-1803, set_password, enable_ssh, enable_rdp.
Task 308031 | 13:47:56 | Error: Action Failed get_task: Task 59ba76d1-14c5-4d7b-681c-08b9ec4bd64d result: 1 of 10 pre-start scripts failed. Failed Jobs: windows1803fs. Successful Jobs: set_kms_host, groot, loggregator_agent_windows, bosh-dns-windows, rep_windows, winc-network-1803, set_password, enable_ssh, enable_rdp.
Run bosh deployments and look for the pas-windows-GUID deployment name.
Run export BOSH_DEPLOYMENT=<pas-windows-GUID>.
Run bosh vms and look for the failing Windows Diego cell name.
Run bosh logs windows_diego_cell/GUID with the GUID of the failing Windows Diego cell (an example of the full sequence is shown below).
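Putting those steps together, the sequence looks roughly like this; the deployment name is a made-up example and the instance GUID is the one from the task output above, so both will differ in your environment:

bosh deployments                                  # note the pas-windows-GUID deployment name
export BOSH_DEPLOYMENT=pas-windows-1a2b3c4d5e6f   # example name only
bosh vms                                          # find the failing windows_diego_cell instance
bosh logs windows_diego_cell/44c5841f-7580-4e9c-9856-89fcbe08ab0d   # downloads a logs tarball locally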
In the windows1803fs or windows2019fs pre-start.stderr.log, search for an error that will look like this:
container layer-1557177919 encountered an error during Start: failure in a Windows system call: The compute system exited unexpectedly. (0xc0370106)
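bosh logs downloads a compressed tarball of the cell's job logs. One way to search the extracted logs for this error (the tarball name below is illustrative; the BOSH CLI prints the actual file name it writes):

mkdir cell-logs
tar -xzf windows_diego_cell-44c5841f-7580-4e9c-9856-89fcbe08ab0d-20190729.tgz -C cell-logs
grep -ri "0xc0370106" cell-logs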
This error will occur if there is a problem with the base image and there are no custom certificates to inject. If there are no custom certificates to inject, the rootfs job (i.e. windows1803fs) will not try to create a container and will start up successfully. After all of the jobs have started, the post-start scripts run. The rep_windows post-start script is then the first to try to create a container, so it is the one that fails.
Task 8192 | 21:12:30 | Updating instance windows2019-cell: windows2019-cell/bd6d70b9-ed1f-412f-9d49-8045627f4ab3 (0) (canary) (00:17:24)
L Error: Action Failed get_task: Task a9555020-1a3b-40c7-677c-d6fc392ce135 result: 1 of 3 post-start scripts failed. Failed Jobs: rep_windows. Successful Jobs: route_emitter_windows, bosh-dns-windows.
Task 8192 | 21:29:55 | Error: Action Failed get_task: Task a9555020-1a3b-40c7-677c-d6fc392ce135 result: 1 of 3 post-start scripts failed. Failed Jobs: rep_windows. Successful Jobs: route_emitter_windows, bosh-dns-windows.
Run bosh deployments and look for the pas-windows-GUID deployment name.
Run export BOSH_DEPLOYMENT=<pas-windows-GUID>.
Run bosh vms and look for the failing Windows Diego cell name.
Run bosh logs windows_diego_cell/GUID with the GUID of the failing Windows Diego cell (see the example below).
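These are the same commands as in the pre-start case; only the deployment name and cell GUID differ. For example, using the instance group and GUID from the failing task output above:

bosh logs windows2019-cell/bd6d70b9-ed1f-412f-9d49-8045627f4ab3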
In the rep_windows directory, in the rep_windows subdirectory, search the job-service-wrapper.out.log for an error that will look like this:
{"timestamp":"2019-07-29T14:59:56.406860500Z","level":"error","source":"rep","message":"rep.exited-with-failure","data":{"error":"Exit trace for group:\ngarden_health_checker exited with error: runc run: exit status 1: container check-c403ce0e-7875-4c04-517a-b1a790e2d323 encountered an error during Start: failure in a Windows system call: The virtual machine or container exited unexpectedly. (0xc0370106)\ncontainer-metrics-reporter exited with nil\nhub-closer exited with nil\nmetrics-reporter exited with nil\nvolman-driver-syncer exited with nil\ndebug-server exited with nil\n"}}
The following are the current known workarounds to this issue.
Recreate all of the Windows Diego Cells, then restart the upgrade. This workaround recreates the Windows Diego Cells on upgrade so that the Microsoft base image becomes usable again.
In the Ops Manager UI, select the Recreate All VMs checkbox in the BOSH Director tile's Director Config pane, and save.
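As an aside (not part of the documented workaround), an operator comfortable with the BOSH CLI could recreate the Windows Diego Cells directly; this is only a sketch, and the deployment name is a placeholder:

bosh -d pas-windows-GUID recreate windows_diego_cell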
Delete the size file in the shared base image on each Windows Diego Cell, then restart the upgrade in the Ops Manager UI.
This workaround requires Enable BOSH-native SSH support on all VMs (beta) to be enabled in the PASW > VM Options configuration. By deleting the size file, our BOSH jobs will replace this layer even if it is shared between the old rootfs and the rootfs being upgraded to, which fixes the issue. This will increase the upgrade time by a couple of minutes, as each VM will need to re-extract the Microsoft base layer, which is rather large. After you have deleted the size file in the base layer directory on each VM, you can proceed with the upgrade in the Ops Manager UI.
Run bosh deployments and look for the pas-windows-GUID deployment name.
Run export BOSH_DEPLOYMENT=<pas-windows-GUID>.
Run bosh vms and note the GUID of each Windows Diego Cell.
Run bosh ssh windows_diego_cell/GUID to SSH onto a cell.
Run powershell.
Run stop-service rep_windows.
Run cd C:\var\vcap\data\groot\layers.
Run the following one-liner, which finds the layer directory with the largest size file (the shared Microsoft base layer) and deletes its size file (a commented, multi-line equivalent is shown after these steps):
$largest=[long]"0"; $largestSha=""; Get-ChildItem | ForEach-Object { If( [long](Get-Content $_\size) -Gt $largest){ $largest=[long](Get-Content $_\size); $largestSha=$_.Name } }; cd $largestSha; rm size
Repeat these steps on each Windows Diego Cell.
Power cycle each Windows Diego Cell instead of deleting the size file from the base layer directory. This workaround is dependent on your IaaS. It involves turning on RDP for each cell and creating an Administrator user whose password is known, calling bosh stop on each cell, logging in to your IaaS console, turning off each VM (each Windows Diego Cell), and then turning them back on. Once each VM has restarted, you will need to RDP onto the cell and manually start the bosh-agent service by running Start-Service bosh-agent in PowerShell. SSH cannot be used in this step because the bosh agent is stopped and the job that allows SSH is stopped; therefore, RDP is necessary. With this workaround, you do not require access to the Ops Manager VM or BOSH Director. This workaround appears to work because it powers off all the Windows services and then turns them back on, after which the shared Microsoft base layer is usable again. This is different from bosh restart, which simply stops and starts only our BOSH jobs. After you have power cycled each VM and turned the bosh-agent service back on, you can proceed with the upgrade in the Ops Manager UI.
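A minimal sketch of the sequence for one cell, assuming placeholder deployment and instance names:

bosh -d pas-windows-GUID stop windows_diego_cell/GUID   # soft stop: stops the BOSH jobs but keeps the VM
# Power the VM off and back on in your IaaS console, then RDP onto the cell and run, in PowerShell:
Start-Service bosh-agent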