SDDC Manager update to the 4.5 release fails when volume group(s) have unused space or contain multiple physical volumes (PVs)

Article ID: 336663


Products

VMware Cloud Foundation

Issue/Introduction

This KB provides instructions to work around update failures caused by:

  • Unused free space in a volume group
  • Multiple physical volumes (disks) in a volume group


Symptoms:

SDDC Manager update to VCF 4.5.0.0 fails at the "VMware Cloud Foundation Service and Platform Upgrades" step. The following error is reported in the SDDC Manager UI:


[Screenshot: upgrade error shown in the SDDC Manager UI]


Checking /var/log/vmware/capengine/cap-update/workflow.log shows "Task validate failed" due to unexpected free space in a volume group.
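For example, the failure can be located quickly with a grep for the message shown in the log excerpt below (the one-liner is a suggestion, not part of the official procedure):

    grep "Task validate failed" /var/log/vmware/capengine/cap-update/workflow.log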

(OR)

Check the following two log files

  • /var/log/vmware/capengine/cap-update-cleanup/workflow.log
  • /var/log/vmware/capengine/cap-update/workflow.log

for errors in reclaiming snapshot disks (example error messages below; a sample search command follows the list):

  • Failed to reclaim snapshot disk <device_name> from VG <volume_group>. Error : exit status 5
  • Failed to reclaim snapshot disk <device_name> from VG <volume_group>. Error : exit status 126
  • Failed to reclaim snapshot disk <device_name> from VG <volume_group>. Error : exit status 127
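As a quick check (a suggested grep invocation, not part of the official procedure), both logs can be searched in one command:

    grep "Failed to reclaim snapshot disk" /var/log/vmware/capengine/cap-update-cleanup/workflow.log /var/log/vmware/capengine/cap-update/workflow.log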
 

Task Failed Error

2022/10/31 09:19:49.463490 validate.go:99: Debug: vgname:[data_vg] actualVFreeSize: [24996] vFreeSize:[26214] toleranceAllowed:[3932]
2022/10/31 09:19:49.527247 validate.go:99: Debug: vgname:[lcmmount_vg] actualVFreeSize: [124568] vFreeSize:[104857] toleranceAllowed:[15728]
2022/10/31 09:19:49.527298 progress.go:11: Validate failed. VFree size of the volume group lcmmount_vg mismtaches the expectation. Actual: [124568] Expected: [104857].
2022/10/31 09:19:49.527490 task_progress.go:24: Validate failed. VFree size of the volume group lcmmount_vg mismtaches the expectation. Actual: [124568] Expected: [104857].
2022/10/31 09:19:49.556785 workflow_manager.go:198: Task validate failed. Error: Validate failed. VFree size of the volume group lcmmount_vg mismtaches the expectation. Actual: [124568] Expected: [104857].
2022/10/31 09:19:49.556950 workflow_manager.go:138: Stopping workflow execution as task validate failed

reclaim-vfree error 1
2022/11/03 21:12:26.914537 reclaimvfree.go:242: Executing command: vgreduce data_vg /dev/sdg1
2022/11/03 21:12:27.014444 reclaimvfree.go:253: Executing command: pvremove -y -ff /dev/sdg1
2022/11/03 21:12:27.126447 reclaimvfree.go:264: Executing command: parted -s -a opt /dev/sdg rm 1
2022/11/03 21:12:27.167333 progress.go:11: Reclaimed snapshot /dev/sdg1
2022/11/03 21:12:27.167401 reclaimvfree.go:242: Executing command: vgreduce lcmmount_vg /dev/sdg2
2022/11/03 21:12:27.167730 task_progress.go:24: Reclaimed snapshot /dev/sdg1
2022/11/03 21:12:27.286985 reclaimvfree.go:253: Executing command: pvremove -y -ff /dev/sdg2
2022/11/03 21:12:27.374610 reclaimvfree.go:264: Executing command: parted -s -a opt /dev/sdg rm 2
2022/11/03 21:12:27.400884 progress.go:11: Reclaimed snapshot /dev/sdg2
2022/11/03 21:12:27.401049 reclaimvfree.go:242: Executing command: vgreduce lcmmount_vg /dev/sdg2
2022/11/03 21:12:27.401154 task_progress.go:24: Reclaimed snapshot /dev/sdg2
2022/11/03 21:12:27.478621 progress.go:11: Failed to reclaim snapshot disk /dev/sdg2 from VG lcmmount_vg. Error :  exit status 5
 
2022/11/03 21:12:27.478859 task_progress.go:24: Failed to reclaim snapshot disk /dev/sdg2 from VG lcmmount_vg. Error :  exit status 5
 
2022/11/03 21:12:27.491478 workflow_manager.go:198: Task reclaim-vfree failed. Error: Failed to reclaim snapshot disk /dev/sdg2 from VG lcmmount_vg. Error :  exit status 5
 
2022/11/03 21:12:27.491630 workflow_manager.go:138: Stopping workflow execution as task reclaim-vfree failed

reclaim-vfree error 2
2022/11/03 20:40:06.100186 reclaimvfree.go:242: Executing command: vgreduce data_vg /dev/sdg1
2022/11/03 20:40:06.292377 reclaimvfree.go:253: Executing command: pvremove -y -ff /dev/sdg1
2022/11/03 20:40:06.444020 reclaimvfree.go:264: Executing command: parted -s -a opt /dev/sdg rm 1
2022/11/03 20:40:06.538938 progress.go:11: Reclaimed snapshot /dev/sdg1
2022/11/03 20:40:06.539027 reclaimvfree.go:242: Executing command: vgreduce lcmmount_vg /dev/sde 
  /dev/sdg2
2022/11/03 20:40:06.539239 task_progress.go:24: Reclaimed snapshot /dev/sdg1
2022/11/03 20:40:06.772812 progress.go:11: Failed to reclaim snapshot disk /dev/sde 
  /dev/sdg2 from VG lcmmount_vg. Error :  exit status 126
 
2022/11/03 20:40:06.773629 task_progress.go:24: Failed to reclaim snapshot disk /dev/sde 
  /dev/sdg2 from VG lcmmount_vg. Error :  exit status 126
 
2022/11/03 20:40:06.819900 workflow_manager.go:198: Task reclaim-vfree failed. Error: Failed to reclaim snapshot disk /dev/sde 
  /dev/sdg2 from VG lcmmount_vg. Error :  exit status 126
 
2022/11/03 20:40:06.819970 workflow_manager.go:138: Stopping workflow execution as task reclaim-vfree failed

reclaim-vfree error 3
2022/11/07 09:35:18.875054 reclaimvfree.go:242: Executing command: vgreduce lcmmount_vg /dev/sdc
  /dev/sdg2
2022/11/07 09:35:18.875229 task_progress.go:24: Reclaimed snapshot /dev/sdg2
2022/11/07 09:35:18.941316 progress.go:11: Failed to reclaim snapshot disk /dev/sdc
  /dev/sdg2 from VG lcmmount_vg. Error :  exit status 127
 
2022/11/07 09:35:18.941490 task_progress.go:24: Failed to reclaim snapshot disk /dev/sdc
  /dev/sdg2 from VG lcmmount_vg. Error :  exit status 127
 
2022/11/07 09:35:18.959857 workflow_manager.go:198: Task reclaim-vfree failed. Error: Failed to reclaim snapshot disk /dev/sdc
  /dev/sdg2 from VG lcmmount_vg. Error :  exit status 127
 
2022/11/07 09:35:18.959911 workflow_manager.go:138: Stopping workflow execution as task reclaim-vfree failed


Environment

VMware Cloud Foundation 4.5

Cause

Unused free space in a volume group, or the presence of multiple PVs in a volume group, causes this failure. To verify this:

  • Log in to SDDC Manager as the root user
  • Run the "vgs" command to check the free space available in the volume group(s)

[Screenshot: "vgs" output showing free space in the volume groups]
 

  • Run "vgs" command to check if there are multiple PVs in a Volume group (AND) run "lsblk" to check if /storage/lvm_snapshot mount point is mounted or not available.
[Screenshot: "vgs" and "lsblk" output]
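For illustration, the following invocations surface the same information (the column selection is optional; vg_name, pv_count, and vg_free are standard LVM report fields, and plain "vgs" shows them as the VG, #PV, and VFree columns):

    vgs -o vg_name,pv_count,vg_free
    lsblk | grep lvm_snapshot

A pv_count greater than 1, or no device mounted at /storage/lvm_snapshot, matches the condition described above.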

Resolution

Currently there is no resolution; a fix is being worked on.


Workaround:

Pre-requisite:

  • Take a snapshot of the SDDC Manager VM before proceeding with the workaround.

Procedure:

  • Download the attached script (update_failure_workaround.sh) from the KB attachments and copy it to the SDDC Manager at /home/vcf
  • Log in to SDDC Manager as the vcf user and switch to the root user
  • Assign execute permission to the script using the following commands

    cd /home/vcf
    chmod +x update_failure_workaround.sh
  • Run the following command to identify the Snapshot Device Name

    grep "Configured" /var/log/vmware/capengine/cap-required-hardware-addition/workflow.log | grep "/storage/lvm_snapshot"
     
    example output:

    Configured disk "/dev/sdg" in the appliance and mounted on /storage/lvm_snapshot
  • Perform the cleanup using the following command

      ./update_failure_workaround.sh <Snapshot Device>

example usage:

./update_failure_workaround.sh /dev/sdg


example output (check for "Success" at the end):

INFO Remove Snapshots if present
.
.
.
.
INFO Mount all filesystems mentioned in fstab
INFO lvm_snapshot is mounted successfully
INFO Cleanup Done
.
INFO altered cap update workflows
INFO Success
  • Retry the upgrade from the SDDC Manager UI
  • Once the update finishes, remove the workaround script by running the following command

    rm /home/vcf/update_failure_workaround.sh


Attachments

update_failure_workaround.sh