WCP service fails to start during vCenter upgrade

Article ID: 340837

Products

VMware Cloud Foundation
VMware vCenter Server

Issue/Introduction

This article describes a workaround that edits the certificates directly in VCDB so that wcpsvc falls back to reissuing them. This returns the certificates to a clean slate and eliminates the problematic SANs.

  • vCenter upgrade fails with the error "Exception occurred in postInstallHook" and a traceback similar to the following:
YYYY-MM-DDTHH:MM:SS.978Z ERROR vmware_b2b.patching.phases.patcher Patch hook Patch got unhandled exception.
Traceback (most recent call last):
  File "/storage/core/software-packages/scripts/patches/py/vmware_b2b/patching/phases/patcher.py", line 203, in patch
    _patchComponents(ctx, userData, statusAggregator.reportingQueue)
  File "/storage/core/software-packages/scripts/patches/py/vmware_b2b/patching/phases/patcher.py", line 84, in _patchComponents
    _startDependentServices(c)
  File "/storage/core/software-packages/scripts/patches/py/vmware_b2b/patching/phases/patcher.py", line 53, in _startDependentServices
    serviceManager.start(depService)
  File "/storage/core/software-packages/scripts/patches/libs/sdk/service_manager.py", line 901, in wrapper
    return getattr(controller, attr)(*args, **kwargs)
  File "/storage/core/software-packages/scripts/patches/libs/sdk/service_manager.py", line 794, in start
    super(VMwareServiceController, self).start(serviceName)
  File "/storage/core/software-packages/scripts/patches/libs/sdk/service_manager.py", line 665, in start
    raise IllegalServiceOperation(errorText)
service_manager.IllegalServiceOperation: Service cannot be started. Error: Error executing start on service wcp. Details {
    "detail": [
        {
            "id": "install.ciscommon.service.failstart",
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "wcp"
            ],
            "localized": "An error occurred while starting service 'wcp'"
        }
    ],
    "componentKey": null,
    "problemId": null,
    "resolution": null
}
Service-control failed. Error: {
    "detail": [
        {
            "id": "install.ciscommon.service.failstart",
            "translatable": "An error occurred while starting service '%(0)s'",
            "args": [
                "wcp"
            ],
            "localized": "An error occurred while starting service 'wcp'"
        }
    ],
    "componentKey": null,
    "problemId": null,
    "resolution": null
}  
YYYY-MM-DDTHH:MM:SS.980Z WARNING root stopping status aggregation...
YYYY-MM-DDTHH:MM:SS.981Z ERROR __main__ Patch vCSA failed
  • After reverting to the snapshot, the WCP service is in a stopped state, and starting the service results in core dumps under /var/core
  • WCP logs indicate certificate issues:
YYYY-MM-DDTHH:MM:SS.806Z error wcp [crypto/cryptography.go:195] [opID=domain-c10] Unable to decrypt input of length: 0 
YYYY-MM-DDTHH:MM:SS.806Z error wcp [crypto/cryptography.go:147] [opID=domain-c10] AES decryption GCM Mode failed due to: <nil>, Decryption failed. Retrying CBC
YYYY-MM-DDTHH:MM:SS.806Z error wcp [crypto/cryptography.go:150] [opID=domain-c10] AES decryption CBC Mode failed due to: encrypted text is incorrect
YYYY-MM-DDTHH:MM:SS.806Z error wcp [crypto/cryptography.go:229] [opID=domain-c10] Cannot decrypt input: encrypted text is incorrect
YYYY-MM-DDTHH:MM:SS.806Z error wcp [kubelifecycle/kube_instance_db.go:507] [opID=domain-c10] Failed to decrypt Netoperator service account password: encrypted text is incorrect
YYYY-MM-DDTHH:MM:SS.806Z error wcp [instanceconfig/configuration.go:285] [opID=63c5292e] Failed to marshal Cluster domain-c10 enableSpec to json. Err x509: SAN rfc822Name is malformed
 
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x230 pc=0x3db07f1]

goroutine 1 [running]:
panic(0x48a4c80, 0x6d29420)
        /build/mts/release/bora-20336139/compcache/cayman_go/ob-19366505/linux64/src/runtime/panic.go:1065 +0x565 fp=0xc00018f388 sp=0xc00018f2c0 pc=0x1739a05

The logs above indicate that wcpsvc fails to decrypt the existing configuration in the database and therefore fails to initialize the cluster map.
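
For quick triage, the signatures above can be searched for in a saved copy of the wcpsvc log. The following is a minimal Python sketch, not a supported tool. Only the message strings are taken from the log excerpt in this article; the function name, the return shape, and the idea of feeding it log lines are assumptions made for illustration.

```python
# Illustrative sketch: flag log lines that match the failure signatures
# quoted in this article. The signature strings come from the log excerpt;
# the function name and return shape are assumptions for this example.

SIGNATURES = (
    "x509: SAN rfc822Name is malformed",
    "Failed to decrypt Netoperator service account password",
    "invalid memory address or nil pointer dereference",
)


def find_signatures(lines):
    """Return a dict mapping each signature to the log lines that contain it."""
    hits = {sig: [] for sig in SIGNATURES}
    for line in lines:
        for sig in SIGNATURES:
            if sig in line:
                hits[sig].append(line.rstrip("\n"))
    return hits


if __name__ == "__main__":
    # Sample lines copied from the excerpt above (timestamps are placeholders).
    sample = [
        "YYYY-MM-DDTHH:MM:SS.806Z error wcp [instanceconfig/configuration.go:285] "
        "[opID=63c5292e] Failed to marshal Cluster domain-c10 enableSpec to json. "
        "Err x509: SAN rfc822Name is malformed",
        "panic: runtime error: invalid memory address or nil pointer dereference",
    ]
    for sig, matched in find_signatures(sample).items():
        print(f"{len(matched)} match(es): {sig}")
```

In practice the function can be given any iterable of lines, such as an open file handle. The log file location is not stated in this article, so it is left to the reader.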

Environment

VMware Cloud Foundation 4.0.x
VMware vCenter Server 7.0.x

Cause

wcpsvc appears to have 'bad' certificates in VCDB for both the VirtualIPTLSCertificate and the NCP ingress certificate. These certificates cause the following chain of failures:

  • wcpsvc fails to load supervisor configs from the DB because it cannot parse the certificate stored in the desired_config column of the cluster_db_configs table.
    • Initial testing points to the VIP and L7 certificates having a 'blank' email SAN. The x509 library in Go 1.16 appears to perform extra validation: the certificate works with Go 1.15 but fails from Go 1.16 onwards.
    • The specific problem appears to be a non-ASCII (Unicode) character in the email SAN, whereas Go requires strictly ASCII characters.
  • It silently fails to load the supervisor ID into its in-memory cache
  • When populating workloads (WCP namespaces), it attempts to reference the above cache and dereferences the supervisor ID that was never loaded, which is consistent with the nil pointer panic shown in the logs
  • WCP crashes during startup and cannot get far enough to update the certificate
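
The ASCII restriction described above can be illustrated with a short Python sketch. This is not Go's x509 code; it only mimics the kind of check (every character must be 7-bit ASCII) that this section attributes to Go 1.16 and later. The function names and sample addresses are hypothetical.

```python
# Illustrative sketch of an ASCII-only check on an email SAN value.
# This is NOT the Go x509 implementation; names and samples are hypothetical.


def is_ascii_only(value):
    """Return True if every character in value is 7-bit ASCII."""
    return all(ord(ch) < 128 for ch in value)


def non_ascii_chars(value):
    """List (index, character, code point) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(value)
        if ord(ch) >= 128
    ]


if __name__ == "__main__":
    for san in ("admin@example.com", "admin@ex\u00e4mple.com"):
        print(san, is_ascii_only(san), non_ascii_chars(san))
```

A value that fails this kind of check would be a candidate for the "SAN rfc822Name is malformed" error seen in the WCP log.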

Resolution

Currently there is no resolution.

NOTE: Before applying the workaround, make sure all vCenters in Enhanced Linked Mode (ELM) are shut down, then take a snapshot of all nodes as a backup. For standalone vCenters, a powered-on snapshot is sufficient. Please check the below article for snapshot best practices:
 

Workaround:
  • SSH into vCenter
  • Stop wcpsvc

    vmon-cli --stop wcp
  • Back up the table

    PGPASSFILE=/etc/vmware/wcp/.pgpass /opt/vmware/vpostgres/13/bin/pg_dump --host localhost --port 5432 --username wcpuser --format plain --verbose --file "desired_config_backup" --table wcp.cluster_db_configs VCDB
  • Connect to VCDB

    PGPASSFILE=/etc/vmware/wcp/.pgpass psql -U wcpuser -h localhost -d VCDB
  • Note VIP

    VCDB=> select master_cluster_ip from cluster_db_configs ;
  • Note cert/key for VIP and NCP

    VCDB=> select desired_config->'VirtualIPTLSCertificate' from cluster_db_configs;
    VCDB=> select desired_config->'VirtualIPTLSPrivateKey' from cluster_db_configs;
    VCDB=> select desired_config->'NcpDefaultLBCert' from cluster_db_configs ;
    VCDB=> select desired_config->'NcpDefaultLBPrivateKey' from cluster_db_configs ;
  • Unset VIP cert/key in VCDB

    VCDB=> update cluster_db_configs set desired_config=jsonb_set(desired_config, '{VirtualIPTLSPrivateKey}', '""');
    UPDATE 1
    VCDB=> update cluster_db_configs set desired_config=jsonb_set(desired_config, '{VirtualIPTLSCertificate}', '""');
    UPDATE 1
  • Unset NCP cert/key in VCDB

    update cluster_db_configs set desired_config=jsonb_set(desired_config, '{NcpDefaultLBCert}', '""');
    update cluster_db_configs set desired_config=jsonb_set(desired_config, '{NcpDefaultLBPrivateKey}', '""');
  • Unset VIP

    VCDB=> update cluster_db_configs set master_cluster_ip='';
  • Restart wcpsvc and verify that it comes up and that the certificates are replaced

    vmon-cli --start wcp
  • If anything goes wrong, write the previous values back to the corresponding keys in desired_config (this step has not been tested by engineering)

    update cluster_db_configs set desired_config=jsonb_set(desired_config, '{VirtualIPTLSPrivateKey}', '"<old value>"');
    update cluster_db_configs set desired_config=jsonb_set(desired_config, '{VirtualIPTLSCertificate}', '"<old value>"');

    update cluster_db_configs set desired_config=jsonb_set(desired_config, '{NcpDefaultLBCert}', '"<old value>"');
    update cluster_db_configs set desired_config=jsonb_set(desired_config, '{NcpDefaultLBPrivateKey}', '"<old value>"');
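
When writing previous values back, each certificate or key must be embedded in the SQL statement as a valid JSON string literal. The sketch below is an untested convenience, not part of the engineering procedure. The helper name build_restore_sql is hypothetical. It assumes the saved value is the decoded string (for example, PEM text with real newlines); a value copied from the psql output of the desired_config-> queries above is already a JSON literal and should not be encoded again. It also assumes PostgreSQL's default standard_conforming_strings setting, so that backslashes inside the SQL literal pass through unchanged.

```python
import json

# JSON keys taken from the workaround steps above.
CONFIG_KEYS = (
    "VirtualIPTLSPrivateKey",
    "VirtualIPTLSCertificate",
    "NcpDefaultLBCert",
    "NcpDefaultLBPrivateKey",
)


def build_restore_sql(key, old_value):
    """Build one UPDATE statement that writes old_value back under key.

    Hypothetical helper: old_value is the decoded string, not a JSON literal.
    """
    if key not in CONFIG_KEYS:
        raise ValueError(f"unexpected key: {key}")
    # Encode as a JSON string literal (adds quotes, escapes newlines).
    json_literal = json.dumps(old_value)
    # Double any single quotes so the value is safe inside a SQL literal.
    sql_literal = json_literal.replace("'", "''")
    return (
        "update cluster_db_configs set desired_config="
        f"jsonb_set(desired_config, '{{{key}}}', '{sql_literal}');"
    )


if __name__ == "__main__":
    # "<old value>" stands in for the certificate body noted earlier.
    saved = {
        "VirtualIPTLSCertificate":
            "-----BEGIN CERTIFICATE-----\n<old value>\n-----END CERTIFICATE-----\n",
    }
    for key, value in saved.items():
        print(build_restore_sql(key, value))
```

Generating the statements this way avoids hand-escaping newlines and quotes in multi-line PEM values. Review each generated statement before running it in psql.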


Additional Information

Impact/Risks:
Moderate: This procedure involves database changes, so both a snapshot and a backup of vCenter are mandatory before proceeding. See: File-Based Backup and Restore of vCenter Server