"Exception occurred in postInstallHook for B2B-patching" error when updating vCenter Server 8.0U3 to 8.0U3e

Products

VMware vCenter Server 8.0

Issue/Introduction

vCenter Server update from 8.0U3 to 8.0U3e version fails with error:
"Exception occurred in postInstallHook for B2B-patching. Please check the logs for more details. Take corrective action and thenresume".
The vsan-health service will be in stopped state and unable to start on newly deployed VCSA appliance.
In /var/log/vmware/applmgmt/PatchRunner.log

YYYY-MM-DDTHH:MM:SSZ INFO service_manager The service vmware-vapi-endpoint has been already started
YYYY-MM-DDTHH:MM:SSZ INFO service_manager Getting starttype of service vmware-vsan-health
YYYY-MM-DDTHH:MM:SSZ INFO service_manager Executing command '[['/usr/sbin/vmon-cli', '--status', 'vsan-health']]'
YYYY-MM-DDTHH:MM:SSZ INFO service_manager Command '[['/usr/sbin/vmon-cli', '--status', 'vsan-health']]' has exit-code='0' and stdout: Name: vsan-health
Starttype: AUTOMATIC
RunState: STOPPED
RunAsUser: vsan-health
CurrentRunStateDuration(ms): 223510
HealthState: UNHEALTHY
FailStop: FALSE
MainProcessId: N/A

AND

YYYY-MM-DDTHH:MM:SSZ ERROR vmware_b2b.patching.phases.patcher Patch hook Patch got unhandled exception.
Traceback (most recent call last):
File "/storage/seat/software-updatefpci6cm7/stage/scripts/patches/py/vmware_b2b/patching/phases/patcher.py", line 208, in patch
_patchComponents(ctx, userData, statusAggregator.reportingQueue)
File "/storage/seat/software-updatefpci6cm7/stage/scripts/patches/py/vmware_b2b/patching/phases/patcher.py", line 89, in _patchComponents
_startDependentServices(c)
File "/storage/seat/software-updatefpci6cm7/stage/scripts/patches/py/vmware_b2b/patching/phases/patcher.py", line 56, in _startDependentServices
serviceManager.start(depService)
File "/storage/seat/software-updatefpci6cm7/stage/scripts/patches/libs/sdk/service_manager.py", line 909, in wrapper
return getattr(controller, attr)(*args, **kwargs)
File "/storage/seat/software-updatefpci6cm7/stage/scripts/patches/libs/sdk/service_manager.py", line 799, in start
super(VMwareServiceController, self).start(serviceName)
File "/storage/seat/software-updatefpci6cm7/stage/scripts/patches/libs/sdk/service_manager.py", line 665, in start
raise IllegalServiceOperation(errorText)
service_manager.IllegalServiceOperation: Service cannot be started. Error: Error executing start on service vsan-health. Details {
"detail": [
{
"id": "install.ciscommon.service.failstart",
"translatable": "An error occurred while starting service '%(0)s'",
"args": [
"vsan-health"
],
"localized": "An error occurred while starting service 'vsan-health'"
}
],
"componentKey": null,
"problemId": null,
"resolution": null
}
Service-control failed. Error: {
"detail": [
{
"id": "install.ciscommon.service.failstart",
"translatable": "An error occurred while starting service '%(0)s'",
"args": [
"vsan-health"
],
"localized": "An error occurred while starting service 'vsan-health'"
}
],
"componentKey": null,
"problemId": null,
"resolution": null
}

The failure can also generate a crash dump under /var/core/ directory with name: core.vsanvcmgmtd-wor.####
In /var/log/vmware/vmon/vmon.log

YYYY-MM-DDTHH:MM:SSZ In(05) host-#### Received start request for vsan-health
YYYY-MM-DDTHH:MM:SSZ In(05) host-#### <vsan-health-prestart> Constructed command: /usr/bin/python /usr/lib/vmware-vsan/scripts/vsanvcmgmtd-prestart.pyc
YYYY-MM-DDTHH:MM:SSZ In(05) host-#### <vsan-health> Service pre-start command completed successfully.
YYYY-MM-DDTHH:MM:SSZ In(05) host-#### <vsan-health> Constructed command: /usr/sbin/vsanvcmgmtd -s -c /usr/lib/vmware-vsan/VsanVcMgmtConfig.xml -u /etc/vmware-vsan-health/VsanMgmtCustomizedConfig.xml
YYYY-MM-DDTHH:MM:SSZ In(05) host-#### <vsan-health> Running the API Health command as user vsan-health
YYYY-MM-DDTHH:MM:SSZ In(05) host-#### <vsan-health-healthcmd> Constructed command: /usr/bin/python /usr/lib/vmware-vpx/vsan-health/vsanhealth-vmon-apihealth.py
YYYY-MM-DDTHH:MM:SSZ Wa(03) host-#### <vsan-health> Service api-health command's stderr: ERROR:root:Got URL error HTTP Error 503: Service Unavailable

YYYY-MM-DDTHH:MM:SSZ In(05) host-#### <vsan-health> Re-check service health since it is still initializing.
YYYY-MM-DDTHH:MM:SSZ In(05) host-#### Client info Uid=0,Gid=0,Pid=174544,Comm=(vmon-coredumper),PPid=2,Comm=(kthreadd),PPid=0
YYYY-MM-DDTHH:MM:SSZ In(05) host-#### <vsan-health> Service is dumping core. Coredump count 0. CurrReq: 1
YYYY-MM-DDTHH:MM:SSZ Wa(03) host-#### [ReadSvcSubStartupData] No startup information from vsan-health.
YYYY-MM-DDTHH:MM:SSZ Wa(03) host-#### <vsan-health> Service exited. Exit code 1

The cns.vpx_storage_volume_update table has volume_id that begins with'file:'
- Connect to VCDB:
  > /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
- Check for volume_id that begins with 'file:':
  
  > SELECT * FROM cns.vpx_storage_volume_update WHERE volume_id LIKE 'file:%';
  
  volume_id | datastore | vclock | modified | deleted | corrupted
  -------------------------------------------+------------------------------------------------------------+--------+----------+---------+-----------
  file:######-####-####-#### | ds:///vmfs/volumes/vsan:######-####-####-####/ | 16##54 | f | t | f
  file:######-####-####-#### | ds:///vmfs/volumes/vsan:######-####-####-####/ | 16##54 | f | t | f
  file:######-####-####-#### | ds:///vmfs/volumes/vsan:######-####-####-####/ | 16##54 | f | t | f
  file:######-####-####-#### | ds:///vmfs/volumes/vsan:######-####-####-####/ | 16##54 | f | t | f
  (4 rows)

Environment

vCenter Server 8.0U3

Cause

At boot up time, cns tries to set in-memory cache based on VCDB contents.
It runs into a null pointer when trying to set some block volume specific data structure, but is getting file: volume instead. This table, "cns.vpx_storage_volume_update", is never meant to hold file volumes.
These entries will not cause problem until next vsan-health reboot, hence possibly during the upgrade. Causing upgrade failure.

Resolution

This issue is fixed in vCenter server 8.0U3e.

Workaround:

NOTE: Ensure to take a valid vCenter backup and a snapshot before proceeding with the changes. Ensure to take offline snapshots for all the vCenter Virtual Machines in case of ELM.

Dump the CNS tables:
> /opt/vmware/vpostgres/current/bin/pg_dump -U postgres -d VCDB -n cns -Fc > cnstables.dmp
Connect to VCDB:
> /opt/vmware/vpostgres/current/bin/psql -d VCDB -U postgres
Query the table to check if any vol ID with 'file:' exists:
> SELECT * FROM cns.vpx_storage_volume_update WHERE volume_id LIKE 'file:%';
Delete all rows with 'file:' in volume_id field of cns.vpx_storage_volume_update table:
> DELETE FROM cns.vpx_storage_volume_update WHERE volume_id LIKE 'file:%';
Validate whether rows with 'file:' in volume_id field are deleted from cns.vpx_storage_volume_update table using Step-3.
Upgrade the vCenter server.

Note: Upgrading to vCenter server 8.0U3e will not prevent this issue from occurring. The above workaround needs to be applied for upgrades from a lower version to vCenter server 8.0U3e. Future upgrades from vCenter server 8.0U3e will prevent this issue from occurring.