How to triage an issue with a PKS environment

Article ID: 298727


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

This article provides guidance on how to triage an issue with a PKS environment.

Environment

Product Version: 1.3+
OS: Ubuntu

Resolution

Checklist:

The first step in troubleshooting is triage. This article aims to broadly lay out the Pivotal Container Service (PKS) components, where a problem might occur in each component, and what information can be collected to help Pivotal's Support team triage PKS issues faster.

This article should not be treated as a guide for deep-dive troubleshooting, but rather as a checklist for information collection.
Note: If you are sharing this information in a public community post and not in a Pivotal support ticket, ensure that you redact any sensitive information.
Issues within PKS can fall into any one of the following categories:

  • Pivotal Operations Manager (Ops Manager)
  • BOSH control plane
  • Pivotal Container Service control plane (PKS deployment)
  • Kubernetes Clusters (Service instance deployments)
  • Workloads running on Kubernetes clusters



This checklist is divided by component, detailing when to collect logs for the component and what to collect for that particular component.

Note: The scenarios described under "when to collect" are not an exhaustive list, but aim to provide examples that help identify the affected component more easily.

Product information

  • Ops Manager Version

  • PKS Version

  • NSX-T Version

  • IaaS

Pivotal Operations Manager (Ops Manager)

When to collect Ops Manager logs?
Some of the issues that warrant Ops Manager log collections are:
  • Ops Manager UI login failures
  • Failure during tile import
  • Tile field validation errors
  • Tile configuration errors
  • Validation failure during Apply changes

What to collect?
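
The original collection procedure for this component is not reproduced here. As a hedged sketch only (assuming SSH access to the Ops Manager VM as the ubuntu user; the hostname and archive path are placeholders, not values from this article), one common approach is to bundle the logs from the Ops Manager VM and copy them off:

```shell
# Placeholder address; replace with your Ops Manager VM's FQDN or IP.
OPSMAN_HOST="opsman.example.com"

# Commands are built as strings and printed as a dry run; run them
# manually (or remove the echoes) against a live environment.
BUNDLE_CMD="ssh ubuntu@${OPSMAN_HOST} 'sudo tar czf /tmp/opsman-logs.tgz /var/log'"
FETCH_CMD="scp ubuntu@${OPSMAN_HOST}:/tmp/opsman-logs.tgz ."

echo "${BUNDLE_CMD}"
echo "${FETCH_CMD}"
```
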

BOSH control plane

When to collect BOSH control plane logs?

Any symptoms that indicate that something is wrong with one of the components on the BOSH director VM. Some of the scenarios where these logs are helpful are:

  • VMs frequently going into unresponsive state
  • BOSH director not responding
  • Scan and fix tasks stuck in bosh task queue
  • BOSH tasks timing out
  • Upload to bosh blobstore failing

What to collect?

In any of the scenarios above or whenever asked for BOSH director logs, please refer to the detailed procedure on How to collect logs from BOSH director.
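Before pulling full director logs, a quick health sweep with the standard BOSH CLI can confirm which of the symptoms above you are seeing. This is a sketch; the environment alias and deployment name are placeholders:

```shell
# Placeholder environment alias; replace with your BOSH environment name.
BOSH_ENV="my-env"

# Commands are built as strings and printed as a dry run; run them
# manually against a live director.
CHECK_VMS="bosh -e ${BOSH_ENV} vms --vitals"        # per-VM health and resource vitals
CHECK_TASKS="bosh -e ${BOSH_ENV} tasks --recent=30" # spot stuck or queued tasks
CHECK_CCK="bosh -e ${BOSH_ENV} -d <deployment> cloud-check --report"  # report unresponsive VMs

echo "${CHECK_VMS}"
echo "${CHECK_TASKS}"
echo "${CHECK_CCK}"
```
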

 

Pivotal Container Service control plane (PKS deployment)

When to collect PKS deployment logs?

Problems with PKS deployment or the PKS control plane may fall into the following categories:

  • Unable to target or reach the UAA used for authentication with PKS
  • Issues with creating or updating PKS users
  • Trouble logging in to the PKS API
  • Unable to get a kubeconfig to access the clusters
  • Kubernetes cluster creation failures
  • Kubernetes cluster upgrade failures
  • Kubernetes cluster deletion failures
  • Smoke test failure

What to collect?

Log collection for this component is fairly straightforward. After logging in to BOSH, collect the PKS deployment logs using bosh logs -d pivotal-container-service-<UUID>. These should always be collected, along with a few additional artifacts that may be needed depending on the operation that failed.

  • Cluster creation or deletion fails
    • Debug logs for the failed bosh task using the command, bosh task <task_id> --debug
    • Service instance logs, covered under Kubernetes Clusters (Service instance deployments)
  • Cluster upgrade fails
    • Change-log of the failed "Apply Changes" attempt from Ops Manager
    • Debug logs for the failed bosh task using the command, bosh task <task_id> --debug
    • Service instance logs, covered under Kubernetes Clusters (Service instance deployments)
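The collection steps above can be sketched as a single pass. The deployment suffix and task ID are placeholders; substitute the UUID from `bosh deployments` and the failed task ID from `bosh tasks --recent`:

```shell
# Placeholder names, not values from a real environment.
PKS_DEPLOYMENT="pivotal-container-service-0123abcd"
FAILED_TASK_ID="42"

# Commands are built as strings and printed as a dry run.
COLLECT_LOGS="bosh logs -d ${PKS_DEPLOYMENT}"       # always collect for control plane issues
COLLECT_TASK="bosh task ${FAILED_TASK_ID} --debug"  # on create/upgrade/delete failures

echo "${COLLECT_LOGS}"
echo "${COLLECT_TASK}"
```
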

Kubernetes Clusters (Service instance deployments)

When to collect service instance deployment logs?

Any issues with the Kubernetes clusters and components warrant the collection of these logs:

  • Authentication failures when using kubectl
  • Unable to reach Kubernetes API
  • Kubernetes cluster creation failures
  • Kubernetes cluster upgrade failures
  • Smoke test failure
  • Unable to create new workloads
  • Not able to reach services or workloads deployed on the Kubernetes cluster
  • Kubernetes cluster deletion failures

What to collect?

To analyze the impacted cluster, the logs for that particular service deployment are needed. These can be collected after logging into BOSH and collecting the following:
  • Service instance deployment logs
    • bosh logs -d service-instance_<UUID>
  • If the issue has been isolated to the Kubernetes master VMs only
    • bosh logs -d service-instance_<UUID> master
    • The above command also works in a multi-master setup
  • If the issue has been isolated to the Kubernetes worker VMs
    • bosh logs -d service-instance_<UUID> worker
  • Cluster creation or deletion fails
    • PKS deployment logs
    • Debug logs for the failed bosh task using bosh task <task_id> --debug
    • bosh logs -d service-instance_<UUID>
  • Cluster upgrade fails
    • PKS deployment logs
    • Change-log of the failed "Apply Changes" attempt from Ops Manager
    • Debug logs for the failed bosh task using bosh task <task_id> --debug
    • bosh logs -d service-instance_<UUID>
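The per-cluster collection above can be sketched as follows. The service-instance UUID is a placeholder; find the real one via `bosh deployments`:

```shell
# Placeholder deployment name, not a value from a real environment.
SI_DEPLOYMENT="service-instance_0123abcd"

# Commands are built as strings and printed as a dry run.
LOGS_ALL="bosh logs -d ${SI_DEPLOYMENT}"            # whole cluster
LOGS_MASTER="bosh logs -d ${SI_DEPLOYMENT} master"  # master VMs only (also works multi-master)
LOGS_WORKER="bosh logs -d ${SI_DEPLOYMENT} worker"  # worker VMs only

echo "${LOGS_ALL}"
echo "${LOGS_MASTER}"
echo "${LOGS_WORKER}"
```
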

Workloads running on Kubernetes Clusters

Although supporting a specific user application and its corresponding workloads is outside the scope of Pivotal Support, there are a few scenarios where applications are not able to run successfully on the underlying platform (PKS). The aim here is to collect artifacts that help isolate a platform issue from an application issue.

When do you want to collect logs for workloads or applications running on Kubernetes clusters?

  • Application deployment fails
  • Pods stuck in ContainerCreating state
  • Pods stuck in ImagePullBackOff state
  • Kubernetes services not reachable
  • Kubernetes services missing endpoints
  • Not able to create persistent volumes and volume mount failures
  • Workloads hung in drain state

What to collect?

To isolate an application issue from a platform issue, the following information is needed:

  • Service deployment logs
    • Collection described in the steps above
  • kubectl describe on failing Kubernetes objects
  • kubectl logs for failing pods
  • Sample application deployment manifest to help reproduce the issue
  • When using a Helm chart to deploy apps, the Helm chart name and release version (if publicly available)
  • For ingress and service accessibility issues
    • kubectl describe of ingress or service resource
    • traceroute on the IP
    • In the case of NSX-T, use the search bar to check whether the same IP is pointing to multiple NSX-T objects
    • Debugging services
  • Debugging pods
  • Application debugging
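
The kubectl artifacts listed above can be gathered with standard kubectl commands. The namespace, pod, and service names below are placeholders, not values from this article:

```shell
# Placeholder object names; substitute the failing objects in your cluster.
NAMESPACE="default"
POD="my-failing-pod"
SERVICE="my-service"

# Commands are built as strings and printed as a dry run.
DESCRIBE_POD="kubectl -n ${NAMESPACE} describe pod ${POD}"
POD_LOGS="kubectl -n ${NAMESPACE} logs ${POD} --previous"   # include logs from a prior crashed container
DESCRIBE_SVC="kubectl -n ${NAMESPACE} describe service ${SERVICE}"
EVENTS="kubectl -n ${NAMESPACE} get events --sort-by=.metadata.creationTimestamp"

echo "${DESCRIBE_POD}"
echo "${POD_LOGS}"
echo "${DESCRIBE_SVC}"
echo "${EVENTS}"
```
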