VMware Cloud Foundation (VCF) Health Check Fails Due to Stale Certificates for Non-Existent Workload Domains

Article ID: 419499


Updated On:

Products

VMware SDDC Manager VMware Cloud Foundation

Issue/Introduction

  • A VMware Cloud Foundation (VCF) health check fails during the Pre-validation stage, specifically during the Verify Inventory sub-check.
  • The VCF UI shows the following error message:
    Description Pre-validation [Validate Source, Validate Disk Space, Verify Inventory]
    Progress Messages Error occurred during verify inventory pre-validation check. Invalid Domain Name: deleted-WLD. Please ensure the domain name is valid and exists.
    Pre-validation check has failed. Please ensure the required pre-requisites are met for more details refer subtasks details and refer sos.log
    Error
    
    Message: Error occurred during verify inventory pre-validation check. Invalid Domain Name: deleted-WLD. Please ensure the domain name is valid and exists.

     

  • On SDDC Manager, /var/log/vmware/vcf/sddc-support/vcf-sos.log reports the following messages:
    INFO [vcf_sos] [healthsummary.py::post::134::MainThread] [Health-Summary-API] Post Request: {'healthChecks': {'certificateHealth': True, 'computeHealth': False, 'connectivityHealth': True, 'dnsHealth': True, 'generalHealth': False, 'hardwareCompatibilityHealth': False, 'ntpHealth': True, 'passwordHealth': True, 'servicesHealth': False, 'storageHealth': False}, 'options': {'config': {'skipKnownHostCheck': False, 'force': False}, 'include': {'summaryReport': False}}, 'scope': {'domains': [{'clusterNames': [], 'domainName': 'deleted-WLD'}], 'includeAllDomains': False, 'includeFreeHosts': False}}

     

    DEBUG [vcf_sos] [db_api.py::db_to_json::596::MainThread] Json got from db is {'id': '31af####-####-####-####-########2bc8', 'creationTimestamp': 'YYYY-MM-DDTHH:MM:SS.113Z', 'status': 'Completed_with_failure', 'completionTimestamp': 'YYYY-MM-DDTHH:MM:SS.748Z', 'description': 'Health-Check operation for SDDC', 'bundleAvailable': 'Yes', 'subTasks': [{'status': 'Failed', 'stages': [{'name': 'Validate Source', 'status': 'COMPLETED', 'description': 'Validate Source'}, {'name': 'Validate Disk Space', 'status': 'COMPLETED', 'description': 'Validate Disk Space'}, {'name': 'Verify Inventory', 'status': 'FAILED', 'description': 'Verify Inventory'}], 'creationTimestamp': 'YYYY-MM-DDTHH:MM:SS.150Z', 'task_id': '31af####-####-####-####-########2bc8', 'name': 'Pre-Validation', 'description': 'Pre-validation [Validate Source,Validate Disk Space,Verify Inventory]', 'errors': [{'message': 'Error occurred during verify inventory pre-validation check. Invalid Domain Name: deleted-WLD. Please ensure the domain name is valid and exists.', 'remediationMessage': None, 'errorCode': 'Errorcode'}, {'message': 'Pre-validation check has failed. Please ensure therequired pre-requisites are met for more details refer subtasks details and refer sos.log', 'remediationMessage': None, 'errorCode': 'Errorcode'}], 'completionTimestamp': 'YYYY-MM-DDTHH:MM:SS.703Z'}]}

     

  • The log entries show that the health check is attempting to validate a Workload Domain (WLD) that has been deleted or no longer exists (e.g., deleted-WLD).
  • The database records show stale certificate entries associated with the deleted WLDs in the operationsmanager database:
    psql -h localhost -U postgres -d operationsmanager -c "select id,resource_id,resource_fqdn,domain_name from certificatemanagement.certificate_expiry"
    
    id     resource_id                             resource_fqdn                domain_name

    -35    8db0####-####-####-####-########be55    deletedvc.example.com        deleted-WLD
    -34    8c79####-####-####-####-########0fa8    deletednsx01.example.com     deleted-WLD
    -33    c7c1####-####-####-####-########6abb    deletednsx02.example.com     deleted-WLD
    -32    0180####-####-####-####-########d951    deletednsx03.example.com     deleted-WLD
    -31    e677####-####-####-####-########55ae    deletednsxvip.example.com    deleted-WLD
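
When the log is long, the rejected domain name can be pulled out of it directly. The following is a minimal sketch, assuming the error text matches the wording shown above; extract_invalid_domains is a hypothetical helper name, not part of SDDC Manager.

```shell
# Hypothetical helper: print each distinct domain name that the health
# check rejected. Reads log text on stdin and assumes the message wording
# shown in this article ("Invalid Domain Name: <name>. Please ensure ...").
extract_invalid_domains() {
    sed -n 's/.*Invalid Domain Name: \(.*\)\. Please ensure the domain name.*/\1/p' | sort -u
}
```

For example: extract_invalid_domains < /var/log/vmware/vcf/sddc-support/vcf-sos.log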

Environment

VMware Cloud Foundation

Cause

The VCF health check scope includes one or more Workload Domains (WLDs) that have been deleted or no longer exist, such as deleted-WLD. This typically happens when stale certificate entries for the deleted WLD's components (vCenter and NSX) remain in the SDDC Manager operationsmanager database, specifically in the certificatemanagement.certificate_expiry and certificatemanagement.certificate_chain_expiry tables.

When the health check runs, it attempts to verify the inventory for these domains, fails to find them, and reports the Invalid Domain Name error.
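
One way to spot which domain names still have certificate rows is a read-only count per domain_name. The sketch below only prints the SQL text; certificate_rows_by_domain_sql is a hypothetical helper name, and the query uses only the table and column names shown in this article.

```shell
# Hypothetical helper: print a read-only query that counts certificate rows
# per domain name. It does not connect to the database or change any data.
certificate_rows_by_domain_sql() {
    printf '%s\n' "select domain_name, count(*) from certificatemanagement.certificate_expiry group by domain_name order by domain_name"
}
```

It can be passed to the same psql invocation used elsewhere in this article, for example: psql -h localhost -U postgres -d operationsmanager -c "$(certificate_rows_by_domain_sql)". Any domain_name in the result that SDDC Manager no longer lists as a Workload Domain is a candidate for the stale entries described above.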

Resolution

The issue is resolved by manually deleting the stale certificate records associated with the non-existent WLDs from the SDDC Manager's operationsmanager database.

 

Steps to follow:

  1. Take a snapshot of the SDDC Manager VM.
  2. SSH to the SDDC Manager VM as the vcf user and elevate to root using su.
  3. Identify the stale certificate IDs for the non-existent WLD:
    psql -h localhost -U postgres -d operationsmanager -c "select id,resource_id,resource_fqdn,domain_name from certificatemanagement.certificate_expiry"


    Sample output

    id     resource_id                             resource_fqdn                domain_name

    -35    8db0####-####-####-####-########be55    deletedvc.example.com        deleted-WLD
    -34    8c79####-####-####-####-########0fa8    deletednsx01.example.com     deleted-WLD
    -33    c7c1####-####-####-####-########6abb    deletednsx02.example.com     deleted-WLD
    -32    0180####-####-####-####-########d951    deletednsx03.example.com     deleted-WLD
    -31    e677####-####-####-####-########55ae    deletednsxvip.example.com    deleted-WLD

     

  4. Delete the stale records from the certificatemanagement.certificate_expiry table:
    psql -h localhost -U postgres -d operationsmanager -c "DELETE FROM certificatemanagement.certificate_expiry where domain_name='deleted-WLD'"

     

    • Expected Error (if foreign key constraint exists):
      ERROR:  update or delete on table "certificate_expiry" violates foreign key constraint "certificate_chain_cache_fk" on table "certificate_chain_expiry"
      DETAIL:  Key (id)=(-32) is still referenced from table "certificate_chain_expiry".

      Note the id value reported in the DETAIL line (-32 in this example). This is the server_cert_id needed for the next step.

       

    • Delete the dependent entries from certificatemanagement.certificate_chain_expiry. If the delete in Step 4 failed due to a foreign key constraint, first delete the referencing entries in the certificate_chain_expiry table using the id noted from the error.
       
      1. Verify the entry (optional but recommended):
        psql -h localhost -U postgres -d operationsmanager -c "\x" -c "select * from certificatemanagement.certificate_chain_expiry where server_cert_id='<ID from foreign key error>'"

        Sample output

        id             | 27
        server_cert_id | -32
        issued_to      | deletednsx03.example.com
        issued_by      | OU=#########, O=###########, ST=##########, C=#####, DC=#####, DC=#####, CN=####
        expiry_date    | YYYY-MM-DD hh:mm:ss
        chain_order    | 0
        creation_date  | YYYY-MM-DD hh:mm:ss

         

        Confirm that the output relates to components of the deleted WLD (e.g., the deleted vCenter or NSX).

      2. Delete the dependent entry:
        psql -h localhost -U postgres -d operationsmanager -c "DELETE FROM certificatemanagement.certificate_chain_expiry where server_cert_id='<ID from foreign key error>'"


        Note: If multiple IDs caused the constraint violation, repeat this verification and deletion process for all relevant IDs.



      3. Re-run the delete against certificatemanagement.certificate_expiry:
        psql -h localhost -U postgres -d operationsmanager -c "delete from certificatemanagement.certificate_expiry where domain_name='deleted-WLD'"

         

  5. Re-run the VCF Health Check. The health check should now complete successfully as the stale references to the deleted WLDs have been removed.
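
When several IDs trigger the foreign key error, copying them by hand is error-prone. The following is a minimal sketch, assuming the DETAIL line has the format shown in Step 4. Both function names are hypothetical. The second function only prints the verify and delete statements from the steps above so they can be reviewed before being run with psql; it does not execute anything.

```shell
# Hypothetical helper: extract the id from each foreign key DETAIL line on
# stdin, e.g. 'DETAIL:  Key (id)=(-32) is still referenced from table ...'.
fk_ids_from_error() {
    sed -n 's/.*Key (id)=(\(-\{0,1\}[0-9][0-9]*\)) is still referenced.*/\1/p' | sort -u
}

# Hypothetical helper: for each id on stdin, print the verification query
# and the delete statement for certificate_chain_expiry. Nothing is executed.
chain_cleanup_sql() {
    while IFS= read -r id; do
        printf "select * from certificatemanagement.certificate_chain_expiry where server_cert_id='%s';\n" "$id"
        printf "delete from certificatemanagement.certificate_chain_expiry where server_cert_id='%s';\n" "$id"
    done
}
```

For example, saving the output of the failed delete to a file (here a hypothetical delete_error.txt) and running fk_ids_from_error < delete_error.txt | chain_cleanup_sql prints the statements for review. PostgreSQL reports one violating key per failed statement, so the delete in Step 4 may need to be retried to surface each remaining ID.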