Unable to upgrade TKGI cluster: The pks-nsx-t-prepare-master-vm pre-start scripts failed to create the cluster Principal Identity


Article ID: 392407



Products

VMware Tanzu Kubernetes Grid Integrated (TKGi)
VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated Edition (Core)
VMware Tanzu Kubernetes Grid Integrated Edition Starter Pack (Core)

Issue/Introduction

  • This issue may occur when rotating the NSX-T TLS Certificate "tls_nsx_t" (also detailed in the Rotate VMware NSX Certificates for Kubernetes Clusters documentation).

  • This may also occur if users unintentionally run the maestro regenerate leaf --all or maestro regenerate --all commands, which are explicitly discouraged in the TKGI certificate rotation documentation.

  • You may see the same failure when creating a new TKGI cluster.
  • The BOSH task output shows messages similar to the following:

    # bosh task <task-id>

    {"time":1742508521,"stage":"Updating instance","tags":["master"],"total":3,"task":"master/3ec5cd8b-xxxx-xxxx-xxxx-ed6119b4c8e1 (0) (canary)","index":1,"state":"failed","progress":100,"data":{"error":"Action Failed get_task: Task bd055f81-4253-493d-7fbd-9406aa30d45d result: 1 of 8 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm. Successful Jobs: kube-apiserver, etcd, bpm,

  • Binding the new NSX-T superuser certificate to the Superuser Principal Identity (also referred to as PI) using "Step 8" from the KB How to renew the nsx-t-superuser-certificate used by Principal Identity user fails with an error similar to:

    # curl -X POST -u 'admin' -k 'https://<NSX_MGR_FQDN>/api/v1/trust-management/principal-identities?action=update_certificate' -H "Content-Type: application/json" -H "X-Allow-Overwrite: true" -d @bind.json
    Enter host password for user 'admin':
    {
      "httpStatus" : "NOT_FOUND",
      "error_code" : 600,
      "module_name" : "common-services",
      "error_message" : "The requested object : Certificate/f9fd8b6d-####-####-####-989cecb532f5 could not be found. Object identifiers are case sensitive."
    }

 

Note: The preceding log excerpts are only examples. Date, time, and environmental variables may vary depending on your environment.

Environment

This has been observed on TKGI 1.18 and 1.19, but is not limited to these versions.

Cause

  • The user deleted the Superuser Principal Identity certificate "f9fd8b6d-####-####-####-989cecb532f5" by running an older version of the CARR script detailed in the Using Certificate Analyzer Results and Recovery script KB.
    • This older version of the script marked the NSX-T Superuser certificate as stale, released it, and then deleted it.
    • Older versions of the CARR script did not completely clean up the NSX-T Corfu DB.
    • Users can still see the certificate when they run curl commands from the cluster master node that authenticate with the certificate instead of the NSX-T user name and password.

  • The pks-nsx-t-prepare-master-vm pre-start script can be edited to show verbose logging and re-run to gather more explicit messaging if needed:
    • Edit the script from the master node under /var/vcap/jobs/pks-nsx-t-prepare-master-vm/bin/pre-start
    • On the third line, change:

      set -e

      to

      set -ex

    • Save the script and run it to see what it reports:

      # /var/vcap/jobs/pks-nsx-t-prepare-master-vm/bin/pre-start 


  • If the CARR script log is still present, the following messages in carr.log help identify the certificate deletion event:

    2025-03-04 23:53:21,413 - carr.validations.ver32.stale_certs_validator - MainThread - INFO - stale_certs_validator.py:173 - Found stale Appliance Certificate with id : f9fd8b6d-####-####-####-989cecb532f5

    2025-03-05 00:05:22,599 - carr.interface.cli.cert_hidden_cmd_intf - MainThread - INFO - cert_hidden_cmd_intf.py:45 - Running curl command : curl -k -s -S -X POST -H "Content-Type:application/json" -H "X-NSX-Username:admin" -d '{ "node_id":"{name: '\''f9fd8b6d-####-####-####-989cecb532f5'\'',node_id: '\''a2988b1b-####-####-####-e92bc8dab67e'\'',certificate_id: '\''f9fd8b6d-####-####-####-989cecb532f5'\''}","service_type":"CLIENT_AUTH" }' http://127.0.0.1:7440/nsxapi/api/v1/trust-management/certificates/f9fd8b6d-####-####-####-989cecb532f5?action=release

    2025-03-05 00:05:22,676 - carr.interface.rest.base_api - MainThread - INFO - base_api.py:92 - path being executed is: DELETE https://<NSX_MGR_FQDN>:443/api/v1/trust-management/certificates/f9fd8b6d-####-####-####-989cecb532f5
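
The three log entries above (stale detection, release, and DELETE) can be pulled out of carr.log with a small filter. The following is a minimal sketch, not part of the CARR tooling: the function name find_cert_events is hypothetical, and the match patterns are assumptions based only on the example log lines shown above, so other CARR versions may word these entries differently.

```shell
#!/bin/sh
# Hypothetical helper (not part of CARR): print carr.log lines that show a
# certificate being flagged as stale, released, or deleted.
# Usage: find_cert_events <certificate-id> [path-to-carr.log]
find_cert_events() {
    cert_id=$1
    log_file=${2:-carr.log}
    # -F treats the certificate ID as a literal string rather than a regex.
    # The second grep keeps only the stale-detection, release, and DELETE
    # entries; these patterns are assumed from the sample log lines.
    grep -F -- "$cert_id" "$log_file" |
        grep -E 'Found stale|action=release|DELETE https?://'
}
```

For example, find_cert_events '<CERTIFICATE_ID>' /path/to/carr.log prints only the matching entries for that certificate, in the order they were logged.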

Resolution

These steps can be run from an SSH session to the TKGI master node, or the NSX Manager.

 

  1. Identify how many Principal Identity users had their cert deleted by the NSX-T CARR script.

    a. Run the following command to list all the PI users that were created by the admin user.

    # curl -X GET -u 'admin:<PASSWORD>' -k https://<NSX_MGR_FQDN>/api/v1/trust-management/principal-identities | jq -r '.results[]| select(._create_user =="admin")' | grep -E '"name"|"id"|"certificate_id"'

    Example using fake NSX manager (nsx-manager.domain.com) and admin password (AdminPassword123):

    # curl -X GET -u 'admin:AdminPassword123' -k https://nsx-manager.domain.com/api/v1/trust-management/principal-identities | jq -r '.results[]| select(._create_user =="admin")' | grep -E '"name"|"id"|"certificate_id"'

      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 53422    0 53422    0     0  1028k      0 --:--:-- --:--:-- --:--:-- 1043k
     
    "name": "new-lab-superuser",
      "certificate_id": "91abd838-####-####-####-66c2c49a548d",
      "id": "d5cf6f11-####-####-####-fa6cd69dbb50",

    "name": "da862e78-####-####-####-48ce93c12648",
      "certificate_id": "da862e78-####-####-####-48ce93c12648",
      "id": "bdca6884-####-####-####-a0b3004a43e0",

    "name": "9949f21d-####-####-####-5623f7fa3b46",
      "certificate_id": "9949f21d-####-####-####-5623f7fa3b46",
      "id": "36024c92-####-####-####-b7d0af7db033",



    b. Run the following command to create a file named certificate.out that contains all of the certificates stored in NSX-T:
      
    # curl -X GET -u 'admin:<PASSWORD>' -k https://<NSX_MGR_FQDN>/api/v1/trust-management/certificates > certificate.out



    c. Search the certificate.out file for the "certificate_id" of each PI user returned in step 1a to check whether the certificate still exists. Make a note of the PI users whose certificate is missing.

    Example: 

    "name": "da862e78-####-####-####-48ce93c12648",
      "certificate_id": "da862e78-####-####-####-48ce93c12648",
      "id": "bdca6884-####-####-####-a0b3004a43e0",
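
The search in step 1c can be repeated for several certificate IDs at once. The following is a minimal sketch under stated assumptions: the function name check_cert_ids is hypothetical, and it performs a plain text search of certificate.out, so an ID that appears in any field of the file is reported as found. Treat the result as a first pass and confirm any doubtful entry by inspecting the file.

```shell
#!/bin/sh
# Hypothetical helper: report whether each certificate ID appears anywhere in
# a saved certificate dump such as the certificate.out file from step 1b.
# Usage: check_cert_ids <certificate.out> <certificate-id> [<certificate-id> ...]
check_cert_ids() {
    cert_file=$1
    shift
    for cert_id in "$@"; do
        # -F matches the ID literally; -q suppresses output and only sets
        # the exit status that the if statement tests.
        if grep -q -F -- "$cert_id" "$cert_file"; then
            printf 'FOUND   %s\n' "$cert_id"
        else
            printf 'MISSING %s\n' "$cert_id"
        fi
    done
}
```

Calling check_cert_ids certificate.out followed by the "certificate_id" values from step 1a prints one FOUND or MISSING line per ID. The PI users whose certificate is reported as MISSING are the ones to note.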



  2. Run the following command to identify the Principal Identity user for the TKGI deployment (the NSX-T Superuser PI):

    # curl -X GET -u 'admin:<PASSWORD>' -k https://<NSX_MGR_FQDN>/api/v1/logical-switches | jq -r '.results[]| select(.display_name == "pks-<CLUSTER_UUID>")' | grep -E 'display_name|_create_user'

    Example using fake NSX manager (nsx-manager.domain.com), admin password (AdminPassword123), and cluster (service-instance_7c87b2d4-####-####-####-d8b1c9202801):

    # curl -X GET -u 'admin:AdminPassword123' -k https://nsx-manager.domain.com/api/v1/logical-switches | jq -r '.results[]| select(.display_name == "pks-7c87b2d4-####-####-####-d8b1c9202801")' | grep -E 'display_name|_create_user'

    "display_name": "pks-7c87b2d4-####-####-####-d8b1c9202801",
      "_create_user": "new-lab-superuser",


    Note: The output gives the "name" of the NSX-T Superuser PI, not its "id".
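
Because step 2 returns only the PI name, the matching "id" and "certificate_id" have to come from the principal-identities listing used in step 1a. The following is a minimal sketch of that lookup: the function name pi_lookup is hypothetical, and it assumes the response has a top-level "results" array whose entries carry the "name", "id", and "certificate_id" fields, as in the example output of step 1a.

```shell
#!/bin/sh
# Hypothetical helper: read a principal-identities JSON response on stdin and
# print the id and certificate_id of the PI whose name matches the argument.
# Usage: pi_lookup <pi-name> < principal-identities.json
pi_lookup() {
    # --arg passes the name to jq as the string variable $name, which avoids
    # splicing shell text into the jq filter.
    jq -r --arg name "$1" \
        '.results[] | select(.name == $name) | "id=\(.id) certificate_id=\(.certificate_id)"'
}
```

The output of the curl command from step 1a, without the jq and grep filters, can be piped directly into pi_lookup with the PI name from step 2 (new-lab-superuser in the example above).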



  3. Corrective action on the NSX-T Managers requires database edits. To ensure safe updates, open a case with the NSX-T team for assistance with editing the NSX-T database using corfu_tool_runner.py. A high-level summary of the required steps:

    • Gather the Principal Identity "name", "node_id", and "cert-id" that were previously in use.
    • Use corfu_tool_runner.py to create a file named pi.out that lists all Principal Identities. This provides a "right" and "left" value from the DB.
    • Use corfu_tool_runner.py to delete the old PI using the "right" and "left" values above (this may take up to 5 minutes).
    • Use the PI and certificate IDs to search the .client_truststore in NSX Manager.
    • Once found, export the certificate.
    • Using the exported certificate, add the PI back via the NSX web client.


Additional Information

If, after resolving the issue with the Superuser PI certificate, tkgi upgrade-cluster or bosh deploy still fails on the pks-nsx-t-prepare-master-vm pre-start script and the pks-nsx-t-prepare-master-vm logs show the following error:

 

WARN[2025-03-07T01:37:19Z] NSX-T communication config: client tls files not set
WARN[2025-03-07T01:37:19Z] NSX-T communication config: server tls authentication is disabled
Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}
Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}
Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}
Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}

 

This 409 conflict occurs when the tls-nsx-t cluster certificate was rotated and uploaded to NSX-T, but the PI user for the cluster was not created because of the Superuser PI certificate issue introduced by the CARR script. Note: The cluster PI user is different from the Global Superuser PI user addressed in the earlier steps. Each cluster has its own PI user for cluster-specific operations. These cluster PI users are created by the Global Superuser PI user when the pks-nsx-t-prepare-master-vm pre-start script runs.

 

To resolve this conflict, use this KB