Error "HttpCode 409" returned by pks-nsx-t-prepare-master-vm script during TKGI cluster deploy operation

Article ID: 401024


Products

VMware Tanzu Kubernetes Grid Integrated Edition
VMware Tanzu Kubernetes Grid Integrated Edition (Core)

Issue/Introduction

  • During TKGI deploy operations (such as upgrade-cluster or master node recreation), the bosh deploy fails in the pks-nsx-t-prepare-master-vm pre-start script.
  • The pks-nsx-t-prepare-master-vm log shows the following error:

    WARN[2025-03-07T01:37:19Z] NSX-T communication config: client tls files not set
    WARN[2025-03-07T01:37:19Z] NSX-T communication config: server tls authentication is disabled
    Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}
    Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}
    Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}
    Submit error, HttpCode 409, retry &{0xc00003edc0 import 30000000000 <nil> <nil>}

 

Cause

NSX Managers return this 409 (Conflict) error when the tls-nsx-t cluster cert was rotated and the new tls-nsx-t cert was uploaded to NSX-T, but the Principal Identity (PI) user for the cluster was not created due to the Superuser PI cert issue introduced by the CARR script.

Resolution

Please note: The cluster PI user is different from the Global Superuser PI. Each cluster has its own PI user for cluster-specific operations. These cluster PI users are created by the Global Superuser PI user when the pks-nsx-t-prepare-master-vm pre-start script runs.
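As a sketch of how to see this relationship (assuming the standard NSX-T trust-management API and an admin account; <NSX_MGR_FQDN> and <PASSWORD> are placeholders), you can list every PI user known to NSX-T — the per-cluster users appear as pks-<CLUSTER_UUID> alongside the Global Superuser PI:

```shell
# List all Principal Identity user names known to NSX-T.
# Cluster PI users are named pks-<CLUSTER_UUID>; the Global
# Superuser PI appears under its own name.
curl -s -X GET -u 'admin:<PASSWORD>' -k \
  "https://<NSX_MGR_FQDN>/api/v1/trust-management/principal-identities" \
  | jq -r '.results[].name'
```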

 

  • Run the following command to confirm that the PI user for the cluster that uses the tls-nsx-t cert was deleted. If the user was deleted as part of the tls-nsx-t cert rotation, the command returns no output:

curl -X GET -u 'admin:<PASSWORD>' -k https://<NSX_MGR_FQDN>/api/v1/trust-management/principal-identities | jq -r '.results[] | select(.name == "pks-<CLUSTER_UUID>")'


Example using a fake NSX Manager (nsx-manager.domain.com), admin password (AdminPassword123), and cluster (service-instance_7c87b2d4-####-####-####-d8b1c9202801):

curl -X GET -u 'admin:AdminPassword123' -k https://nsx-manager.domain.com/api/v1/trust-management/principal-identities | jq -r '.results[] | select(.name == "pks-7c87b2d4-####-####-####-d8b1c9202801")'


  • Confirm that the new tls-nsx-t cert was uploaded to the master node that is failing in the pre-start script.
  • If the master node contains the new cert, then:
    • Confirm whether the cluster tls-nsx-t cert exists in NSX-T. (It should, as this is what is causing the 409 "Conflict" error.)
      • The cert can be found in the NSX web client by searching for the cluster ID and then showing Certificates. The tls-nsx-t certificate is the one labeled pks-<CLUSTER_UUID>.

    • Delete the Principal Identity user for the cluster (this is the owner of the tls-nsx-t cert) if it still exists:
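A sketch of that deletion, assuming the standard NSX-T trust-management API (<NSX_MGR_FQDN>, <PASSWORD>, and <CLUSTER_UUID> are placeholders): look up the PI user's id first, then delete it. The X-Allow-Overwrite header is needed because the PI object is protected (_protection: REQUIRE_OVERRIDE):

```shell
# 1. Find the id of the cluster PI user.
PI_ID=$(curl -s -X GET -u 'admin:<PASSWORD>' -k \
  "https://<NSX_MGR_FQDN>/api/v1/trust-management/principal-identities" \
  | jq -r '.results[] | select(.name == "pks-<CLUSTER_UUID>") | .id')

# 2. Delete the PI user; the protected object requires the
#    X-Allow-Overwrite header.
curl -s -X DELETE -u 'admin:<PASSWORD>' -k \
  "https://<NSX_MGR_FQDN>/api/v1/trust-management/principal-identities/${PI_ID}" \
  --header "X-Allow-Overwrite: true"
```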

 

  • Delete the new tls-nsx-t cert using the following curl command from the NSX Manager or the master node:

# curl -X DELETE -sku 'admin' "https://<NSX_MGR_FQDN>/api/v1/trust-management/certificates/2332836c-####-####-####-859213a8cc17" --header "X-Allow-Overwrite: true"
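The certificate UUID used in the DELETE above can also be looked up via the API instead of the web client; a sketch, assuming the standard NSX-T certificates endpoint (<NSX_MGR_FQDN>, <PASSWORD>, and <CLUSTER_UUID> are placeholders):

```shell
# Print the id of the certificate labeled pks-<CLUSTER_UUID>,
# which is the cluster's tls-nsx-t cert in NSX-T.
curl -s -X GET -u 'admin:<PASSWORD>' -k \
  "https://<NSX_MGR_FQDN>/api/v1/trust-management/certificates" \
  | jq -r '.results[] | select(.display_name == "pks-<CLUSTER_UUID>") | .id'
```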


  • Then create the PI user using the following command. This command creates the user, imports the new tls-nsx-t cert into NSX-T, and binds them together:

# pksnsxcli create principal --instance-id 7c87b2d4-####-####-####-d8b1c9202801 --nsx-manager-host <NSX_MGR_IP/FQDN> --username admin --password '<PASSWORD>' --insecure -C /var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_client.crt


  • Confirm that the cluster PI user was created and the tls-nsx-t cert was uploaded to NSX-T using the following command:

    # curl -X GET -u 'admin:<PASSWORD>' -k https://<NSX_MGR_FQDN>/api/v1/trust-management/principal-identities | jq -r '.results[] | select(.name == "pks-7c87b2d4-####-####-####-d8b1c9202801")'

     % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100 55320    0 55320    0     0  1378k      0 --:--:-- --:--:-- --:--:-- 1385k
    {
      "name": "pks-7c87b2d4-####-####-####-d8b1c9202801",
      "node_id": "7c87b2d4-####-####-####-d8b1c9202801",
      "role": "enterprise_admin",
      "certificate_id": "0d99911f-####-####-####-e86ce073d8a6",
      "roles_for_paths": [
        {
          "path": "/",
          "roles": [
            {
              "role": "enterprise_admin"
            }
          ],
          "delete_path": false
        }
      ],
      "is_protected": true,
      "resource_type": "PrincipalIdentity",
      "id": "ff6308e7-####-####-####-2cf752454f7a",
      "display_name": "pks-7c87b2d4-####-####-####-d8b1c9202801",
      "tags": [
        {
          "scope": "pks/cluster",
          "tag": "7c87b2d4-####-####-####-d8b1c9202801"
        }
      ],
      "_create_time": 1742603019296,
      "_create_user": "new-lab-superuser",
      "_last_modified_time": 1742603019296,
      "_last_modified_user": "new-lab-superuser",
      "_system_owned": false,
      "_protection": "REQUIRE_OVERRIDE",
      "_revision": 0
    }
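Optionally, the certificate_id in the output above can be cross-checked against NSX-T to confirm the new tls-nsx-t cert was imported and bound to the PI user; a sketch, assuming the standard certificates endpoint (<NSX_MGR_FQDN>, <PASSWORD>, and <CERTIFICATE_ID> are placeholders):

```shell
# Fetch the certificate bound to the PI user and show its label;
# it should be pks-<CLUSTER_UUID>.
curl -s -X GET -u 'admin:<PASSWORD>' -k \
  "https://<NSX_MGR_FQDN>/api/v1/trust-management/certificates/<CERTIFICATE_ID>" \
  | jq -r '.display_name'
```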


  • Upgrade the cluster using tkgi upgrade-cluster <CLUSTER_NAME>, or issue commands similar to the following to recreate the cluster from the BOSH manifest. This registers the new certificate with NSX-T and pushes it to the Kubernetes VMs:

    # bosh manifest -d service-instance_<CLUSTER_UUID> > service-instance_<CLUSTER_UUID>.yml

    # bosh deploy -d service-instance_<CLUSTER_UUID> service-instance_<CLUSTER_UUID>.yml
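If the deploy fails again on the same script, the pre-start log on the failing master can be inspected over BOSH SSH. This is a sketch; the log path follows the standard BOSH job log layout and the instance name (master/0) may differ per deployment:

```shell
# Tail the pre-start log of the pks-nsx-t-prepare-master-vm job
# on the first master instance of the cluster deployment.
bosh -d service-instance_<CLUSTER_UUID> ssh master/0 \
  -c 'sudo tail -n 50 /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stderr.log'
```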