A TKGi cluster upgrade fails with one or more master node VMs in a failing or unknown ( - ) state.
The failed jobs are pks-nsx-t-ncp and pks-nsx-t-prepare-master-vm.
The BOSH CPI and TKGi networking (NCP) are configured for the NSX-T Manager API.
The pks-nsx-t-prepare-master-vm log shows:
Current cluster NSX API mode: Policy
Registering client certificate
[GET /trust-management/principal-identities][500] getPrincipalIdentitiesInternalServerError &{RelatedAPIError:{Details:Client certificate not found in trust store ErrorCode:99 ErrorData:<nil> ErrorMessage:Internal server error has occurred. ModuleName:common-services} RelatedErrors:[]}
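To confirm that a collected log matches this symptom, the error text can be searched for directly. The following is a minimal sketch; the function name is hypothetical and the log file path is passed in as an argument, because the location of the job log on the master VM is not stated here.

```shell
# Hypothetical helper: returns 0 if the given log file contains the
# trust-store error quoted above, non-zero otherwise.
has_trust_store_error() {
  # -F treats the pattern as a fixed string, -q suppresses output.
  grep -q -F 'Client certificate not found in trust store' "$1"
}
```

Example use, assuming the job log has been copied to the current directory: has_trust_store_error ./pks-nsx-t-prepare-master-vm.log && echo "symptom matches"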
TKGi 1.21.x
TKGi 1.22.x
The cause of this problem is still under investigation. Before applying the workaround, collect the cluster logs and open a support case.
As a workaround, redeploy the cluster using its manifest.
Download the cluster manifest:
bosh -d ServiceInstance-UID manifest > SI.yml
Verify that policy-api is set to false in the downloaded manifest (SI.yml).
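One way to do this check from the command line is sketched below. The function name is hypothetical, and the exact spelling of the key in the manifest is an assumption: the pattern accepts both policy-api and policy_api. Any matching line whose value is not false is treated as a failure.

```shell
# Hypothetical helper: inspects every policy-api line in a manifest file.
# Returns 0 if all matching lines are set to false,
#         1 if any matching line has another value,
#         2 if no matching line is found or the file cannot be read.
check_policy_api() {
  manifest="$1"
  matches=$(grep -i -E 'policy[-_]api' "$manifest") || return 2
  # Print the matching lines so they can be reviewed by eye as well.
  printf '%s\n' "$matches"
  # grep -v selects lines that do NOT end in ": false"; any such line fails the check.
  if printf '%s\n' "$matches" | grep -v -i -q -E ':[[:space:]]*"?false"?[[:space:]]*$'; then
    return 1
  fi
  return 0
}
```

Example use: check_policy_api SI.yml && echo "policy-api is false". If the function returns 1 or 2, review the manifest manually before redeploying.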
Redeploy the cluster:
bosh -d ServiceInstance-UID deploy SI.yml
Finish the upgrade:
tkgi upgrade-cluster XXYY
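The placeholders in the commands above have to be replaced by hand, which is easy to get wrong between steps. As a convenience, the sketch below only prints the three commands with the values filled in, so they can be reviewed and then run one at a time; it does not call bosh or tkgi itself. The function name and its arguments are hypothetical.

```shell
# Hypothetical helper: prints the workaround commands for review.
#   $1 = BOSH deployment name of the cluster (the service instance)
#   $2 = TKGi cluster name
#   $3 = manifest file name (optional, defaults to SI.yml)
print_workaround_commands() {
  deployment="$1"
  cluster="$2"
  manifest="${3:-SI.yml}"
  echo "bosh -d $deployment manifest > $manifest"
  echo "bosh -d $deployment deploy $manifest"
  echo "tkgi upgrade-cluster $cluster"
}
```

Example use: print_workaround_commands ServiceInstance-UID XXYY. Check the manifest for policy-api between the first and second command, as described in the steps above.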