Debugging TKGI cluster creation issue: failed to create Tier-1 router

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Symptoms:

When you create a cluster on Pivotal Container Service (PKS)/TKGI, cluster creation fails with the following error:

./tkgi cluster cluster11 --json

  "last_action_description": "Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: 6675c28d-09f4-467d-8583-75e9b9d8a448, broker-request-id: c07642f0-7643-47b6-ad4e-5af6ee2607a9, task-id: 8069, operation: create, error-message: Action Failed get_task: Task 6aa706be-5a92-4ae2-61a3-1ac138450657 result: 1 of 8 pre-start scripts failed. Failed Jobs:...",

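The service-instance-guid in this message identifies the BOSH deployment for the cluster: the deployment name used later with bosh -d is service-instance_ followed by that GUID. As a minimal sketch, the deployment name can be derived from the error text with sed. The function name deployment_from_error is hypothetical and not part of the tkgi or bosh CLIs.

```shell
# Hypothetical helper (not part of tkgi or bosh): reads an error message on
# stdin and prints "service-instance_<guid>" if a service-instance-guid is found.
deployment_from_error() {
  sed -n 's/.*service-instance-guid: \([0-9a-fA-F-]\{36\}\).*/service-instance_\1/p'
}

# Example using a shortened copy of the message shown above.
msg='service: p.pks, service-instance-guid: 6675c28d-09f4-467d-8583-75e9b9d8a448, broker-request-id: c07642f0-7643-47b6-ad4e-5af6ee2607a9'
printf '%s\n' "$msg" | deployment_from_error
```

For the message above, this prints service-instance_6675c28d-09f4-467d-8583-75e9b9d8a448.
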
Environment

Cause

This issue occurs when an edge cluster node in NSX-T fails to function. NSX-T reports the message "Insufficient resources to allocate in edge cluster". The underlying transport nodes become degraded, and as a result the Tier-1 (T1) router cannot be created during PKS cluster creation.

Resolution

To debug a cluster creation failure, use BOSH to inspect the cluster's virtual machines (VMs) for failures. In particular, inspect the master and worker nodes.

1. Use the bosh vms command to list all of the PKS cluster VMs and check if any of the VMs are in a failing state.

ubuntu@opsmgr-customer0-io:~$ BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=eh-########-QvCNwE####g6EWXU BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.###.###.11 bosh vms

Note: In this case, the master VM is in a failing state.
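With many VMs, the failing ones can be easier to spot by filtering the bosh vms output. The following is a minimal sketch, assuming the first whitespace-separated column is the instance name (for example master/<id>) and the second is the process state. The function name non_running_vms is hypothetical.

```shell
# Hypothetical filter: reads "bosh vms" output on stdin and prints the
# instance name and process state of every VM that is not "running".
# Assumption: column 1 is the instance (contains "/"), column 2 is the state.
non_running_vms() {
  awk 'index($1, "/") > 0 && $2 != "running" { print $1, $2 }'
}

# Usage (requires the BOSH_* variables from the command above):
#   bosh vms | non_running_vms
```
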

2. SSH into the Master VM (which is failing in this case) using the bosh ssh command.

BOSH_CLIENT=ops_manager BOSH_CLIENT_SECRET=eh-########-QvCNwE####g6EWXU  BOSH_CA_CERT=/var/tempest/workspaces/default/root_ca_certificate BOSH_ENVIRONMENT=192.###.###.11 bosh -d service-instance_6675c28d-09f4-467d-8583-75e9b9d8a448 ssh  master/206b453e-1869-4961-972a-27b53a3f763a

3. Change to the /var/vcap/sys/log/pks-nsx-t-prepare-master-vm directory and inspect pre-start.stderr.log. Look for the following messages:

time="2019-01-15T02:54:21Z" level=error msg="Failed to createT1Router: &{ManagedResource:{RevisionedResource:{Resource:{Links:[] Schema: Self:<nil>} Revision:<nil>} CreateTime:0 CreateUser: LastModifiedTime:0 LastModifiedUser: SystemOwned:<nil> Description: DisplayName:lb-pks-6675c28d-09f4-467d-8583-75e9b9d8a448-cluster-router ID: ResourceType: Tags:[0xc4203aa6e0]} AdvancedConfig:<nil> AllocationProfile:0xc42007a168 EdgeClusterID:c3c8dc3a-8794-4bee-a700-f35d0faa3adc EdgeClusterMemberIndices:[] FailoverMode: FirewallSections:[] HighAvailabilityMode:ACTIVE_STANDBY PreferredEdgeClusterMemberIndex:<nil> RouterType:0xc42040c940}" pks-networking=networkManager

Error: [POST /logical-routers][400] createLogicalRouterBadRequest  &{RelatedAPIError:{Details: ErrorCode:10087 ErrorData:<nil> ErrorMessage:[Routing] Insufficient resources to allocate in edge cluster EdgeCluster/c3c8dc3a-8794-4bee-a700-f35d0faa3adc and pool FORWARDING_POOL for context LogicalRouter/7ed59486-9e4e-4724-8214-ffd8092b6165. ModuleName:ROUTING} RelatedErrors:[]}

Note: As seen above, cluster creation fails to create the T1 router and produces the message "insufficient resources to allocate in edge cluster".
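Rather than reading the whole log, the two messages shown above can be searched for directly. This is a minimal sketch that uses grep with the strings taken from the log excerpt. The function name find_t1_router_errors is hypothetical.

```shell
# Hypothetical helper: prints, with line numbers, the log lines that report
# the T1 router failure or the edge cluster resource error.
find_t1_router_errors() {
  grep -n -E 'Failed to createT1Router|Insufficient resources to allocate in edge cluster' "$1"
}

# Usage on the master VM:
#   find_t1_router_errors /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stderr.log
```

If the command prints nothing, the pre-start failure likely has a different cause than the one described in this article.
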

To resolve this issue, replace one of the failing edge cluster nodes with a new edge cluster node that has greater capacity. In this case, a large edge cluster node was deployed to add capacity.
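To confirm which edge cluster needs the additional capacity, the edge cluster ID can be read from the error message, where it appears as EdgeCluster/<id>. The following is a minimal sketch, assuming the ID is a 36-character UUID as in the log excerpt. The function name edge_cluster_ids is hypothetical.

```shell
# Hypothetical helper: reads log text on stdin and prints each distinct
# edge cluster ID referenced as "EdgeCluster/<uuid>".
edge_cluster_ids() {
  grep -o -E 'EdgeCluster/[0-9a-fA-F-]{36}' | sed 's|^EdgeCluster/||' | sort -u
}

# Usage on the master VM:
#   edge_cluster_ids < /var/vcap/sys/log/pks-nsx-t-prepare-master-vm/pre-start.stderr.log
```

For the log shown in this article, the ID is c3c8dc3a-8794-4bee-a700-f35d0faa3adc, which matches the EdgeClusterID field in the createT1Router message.
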