TKGi cluster upgrade fails with pks-nsx-t-prepare-master-vm failed job due to duplicate SNAT IP on NSX

Article ID: 398229

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

  • TKGi cluster upgrade fails with error:

    Last Action Description:  Instance update failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: <cluster-guid>, broker-request-id: <request-id>, task-id: <bosh-task-id>, operation: update, error-message: Action Failed get_task: Task <task-id> result: 1 of 9 pre-start scripts failed. Failed Jobs: pks-nsx-t-prepare-master-vm. Successful Jobs: etcd, kube-apiserver, config_scanner, bpm, bosh-dns, syslog_forwarder, ncp, pks-nsx-t-ncp.

  • OpsMan GUI > Certificates does not show any expired certificates, which rules out certificate expiration issues such as an expired tls-nsx-t certificate (see KBs in Additional Information below).

  • pks-nsx-t-prepare-master-vm pre-start.stdout.log shows the following errors:
    • Log into the failing master VM:
      # bosh -d service-instance_<cluster-guid> ssh <master-instance-name>
      # sudo -i

    • Check pks-nsx-t-prepare-master-vm pre-start.stdout.log:
      # cat /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log

      Current cluster NSX API mode: Manager
      Registering client certificate
      <client-certificate-id>
      Registration of client certificate is successful
      Checking if client certificate is ready to be used
      2025-05-09T15:49:37Z 1: checking client certificate...
      querying NSXAPI get error: "context deadline exceeded"
      querying NSXAPI get error: "context deadline exceeded"
      timeout: client certificate is not working after 60 seconds

The "context deadline exceeded" errors indicate that requests from the master VM to NSX Manager are timing out, which points to a connection problem between the two.
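
To confirm the same signature on other master VMs, the log can be summarized with a short shell function. This is a minimal sketch, not part of TKGi: summarize_prestart_log is a hypothetical helper name, and the two search strings are taken from the log excerpt above.

```shell
# Hypothetical helper (not shipped with TKGi): count the timeout signatures
# in a pre-start log. The default path is the one used in this article.
summarize_prestart_log() {
  local log="${1:-/var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stdout.log}"
  local deadline timeouts

  if [ ! -r "$log" ]; then
    echo "cannot read $log" >&2
    return 1
  fi

  # grep -c prints the number of matching lines. It exits non-zero when the
  # count is 0, so the exit status is ignored with "|| true".
  deadline=$(grep -c 'context deadline exceeded' "$log" || true)
  timeouts=$(grep -c 'client certificate is not working' "$log" || true)

  echo "deadline_exceeded=${deadline} cert_timeouts=${timeouts}"
}
```

Run as root on the master VM, `summarize_prestart_log` with no argument reads the default log path. Non-zero counts match the failure described in this article.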

Environment

TKGi with NSX

Cause

The Translated IP address of the cluster's SNAT rule is also assigned to another NSX object, which causes intermittent misrouting of traffic.

Resolution

Checks:

  1. From the failing master VM, curl the NSX Manager several times. With this issue, some requests succeed and others time out.
    # curl -kv -X GET --cert /var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.crt --key /var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.key https://<NSX-MG-FQDN>/api/v1/node

  2. From the failing master VM, try running the pks-nsx-prepare-master-vm script commands several times.
    Both "create principal" and "check" requests intermittently succeed and fail.
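    Note: the commands below can be found in the pks-nsx-prepare-master-vm script in the /var/vcap/jobs/pks-nsx-t-prepare-master-vm/bin/ directory. Because they end with "|| exit $?", a failed request closes an interactive shell session; omit that suffix or run the commands in a subshell if you want the session to stay open.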

    # /var/vcap/packages/pks-nsx-t-cli/bin/pksnsxcli create principal \
      --api-type="Manager" \
      --instance-id="<cluster-guid>" \
      -c "/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.crt" \
      -k "/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.key" \
      -C "/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_client.crt" \
      --nsx-ca-cert-path="/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_ca.crt" \
      --insecure='false' \
      --nsx-manager-host='<NSX-MG-FQDN>' || exit $?

    # for i in $(seq 24); do
      /var/vcap/packages/pks-nsx-t-cli/bin/pksnsxcli check \
        --api-type="Manager" \
        -c "/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_client.crt" \
        -k "/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_client.key" \
        --nsx-ca-cert-path="/var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_ca.crt" \
        --insecure='false' \
        --nsx-manager-host='<NSX-MG-FQDN>' || exit $?
      echo "$(date +%Y-%m-%dT%H:%M:%SZ) ${i}: checking client certificate..."
    done

  3. On NSX Manager GUI (Manager API mode):

    1. In the search bar, search for the master VM's IP address without its last octet. For example, if the master VM's IP is 172.29.42.27, search for 172.29.42 (without .27).

    2. Go to NAT Rules and identify the rule associated with the failing TKGi cluster. Expand it and note the Translated IP address.

    3. Search for the Translated IP in the search bar.
      Only one match is expected: the cluster's SNAT rule.
      More than one match means that the cluster's SNAT Translated IP is also assigned to other NSX objects, which causes traffic to be misrouted.
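
The repeated curl test in check 1 can be scripted so that the ratio of successes to timeouts is explicit. The following is a minimal sketch, not part of TKGi: tally_attempts is a hypothetical helper name, and the attempt count of 20 and the 10-second --max-time in the usage comment are assumptions to adjust for your environment.

```shell
# Hypothetical helper: run a command N times and count how often it
# succeeds (exit status 0) and how often it fails (any other status).
tally_attempts() {
  local attempts="$1"
  shift
  local ok=0 fail=0 i=1

  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      fail=$((fail + 1))
    fi
    i=$((i + 1))
  done

  echo "ok=${ok} fail=${fail}"
}

# Usage on the failing master VM, reusing the curl request from check 1.
# -s silences progress output, -f makes HTTP errors count as failures, and
# --max-time bounds each attempt (10 seconds is an assumed value):
#
#   tally_attempts 20 curl -ksf --max-time 10 \
#     --cert /var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.crt \
#     --key /var/vcap/jobs/pks-nsx-t-prepare-master-vm/config/nsx_t_superuser.key \
#     "https://<NSX-MG-FQDN>/api/v1/node"
```

A result with both ok and fail above zero matches the intermittent behavior described in this article. All failures or all successes point to a different cause.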

 

Resolution:

If the checks above show several NSX objects associated with the same Translated IP address, inspect the additional objects (e.g. another cluster's SNAT rule, Virtual Servers, etc.) and work with your network team to eliminate the duplicates.
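
When many objects need to be compared, the duplicate search can also be done offline on an exported list. The sketch below assumes a hypothetical input format of one "<object-name> <ip>" pair per line, with no whitespace inside the object name. How that list is produced (for example, copied from the NSX search results) is left open, and find_duplicate_ips is an illustrative name, not an NSX or TKGi tool.

```shell
# Hypothetical helper: read "<object-name> <ip>" pairs on stdin and print
# every IP that is used by more than one object, followed by its owners.
find_duplicate_ips() {
  awk '
    NF >= 2 {
      count[$2]++
      # Append the object name to the comma-separated owner list for this IP.
      owners[$2] = (owners[$2] == "" ? $1 : owners[$2] "," $1)
    }
    END {
      for (ip in count)
        if (count[ip] > 1)
          print ip, owners[ip]
    }
  ' | sort
}

# Example with made-up object names and documentation-range addresses:
#   printf '%s\n' \
#     'snat-cluster-a 192.0.2.10' \
#     'vs-ingress 192.0.2.10' \
#     'snat-cluster-b 192.0.2.11' | find_duplicate_ips
```

Each output line lists one Translated IP and the objects that share it, which gives the network team a concrete list of conflicts to resolve.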

Additional Information