How to triage an issue with a PKS environment

Article ID: 298727


Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

This article provides guidance on how to triage an issue with a PKS environment.

Environment

Product Version: 1.3+
OS: Ubuntu

Resolution

Checklist:

The first step in troubleshooting is triage. This article aims to broadly lay out the Pivotal Container Service (PKS) components, where a problem might occur in each component, and what information can be collected to help Pivotal's Support team triage PKS issues faster.

This article should not be treated as a guide for deep-dive troubleshooting, but rather as a checklist for information collection.
Note: If you are sharing this information in a public community post and not in a Pivotal support ticket, ensure that you redact any sensitive information.
Issues within PKS can fall into any one of the following categories:

  • Pivotal Operations Manager (Ops Manager)
  • BOSH control plane
  • Pivotal Container Service control plane (PKS deployment)
  • Kubernetes Clusters (Service instance deployments)
  • Workloads running on Kubernetes clusters



This checklist is divided by component, detailing when to collect logs for the component and what to collect for that particular component.

Note: The scenarios described under "when to collect" are not an exhaustive list, but aim to provide examples that help identify the affected component more easily.

Product information

  • Ops Manager Version

  • PKS Version

  • NSX-T Version

  • IaaS

Pivotal Operations Manager (Ops Manager)

When to collect Ops Manager logs?
Some of the issues that warrant Ops Manager log collections are:
  • Ops Manager UI login failures
  • Failure during tile import
  • Tile field validation errors
  • Tile configuration errors
  • Validation failure during Apply changes

What to collect?
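
The original collection procedure for this component is not reproduced here. As a hedged sketch only (assuming SSH access to the Ops Manager VM as the ubuntu user; the hostname and archive path are placeholders, not values from this article), one common approach is to bundle the logs from the Ops Manager VM and copy them off:

```shell
# Placeholder address; replace with your Ops Manager VM's FQDN or IP.
OPSMAN_HOST="opsman.example.com"

# Commands are built as strings and printed as a dry run; run them
# manually (or remove the echoes) against a live environment.
BUNDLE_CMD="ssh ubuntu@${OPSMAN_HOST} 'sudo tar czf /tmp/opsman-logs.tgz /var/log'"
FETCH_CMD="scp ubuntu@${OPSMAN_HOST}:/tmp/opsman-logs.tgz ."

echo "${BUNDLE_CMD}"
echo "${FETCH_CMD}"
```
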

BOSH control plane

When to collect BOSH control plane logs?

Any symptoms that indicate that something is wrong with one of the components on the BOSH director VM. Some of the scenarios where these logs are helpful are:

  • VMs frequently going into unresponsive state
  • BOSH director not responding
  • Scan and fix tasks stuck in bosh task queue
  • BOSH tasks timing out
  • Upload to bosh blobstore failing

What to collect?

In any of the scenarios above or whenever asked for BOSH director logs, please refer to the detailed procedure on How to collect logs from BOSH director.
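Before pulling full director logs, a quick health sweep with the standard BOSH CLI can confirm which of the symptoms above you are seeing. This is a sketch; the environment alias and deployment name are placeholders:

```shell
# Placeholder environment alias; replace with your BOSH environment name.
BOSH_ENV="my-env"

# Commands are built as strings and printed as a dry run; run them
# manually against a live director.
CHECK_VMS="bosh -e ${BOSH_ENV} vms --vitals"        # per-VM health and resource vitals
CHECK_TASKS="bosh -e ${BOSH_ENV} tasks --recent=30" # spot stuck or queued tasks
CHECK_CCK="bosh -e ${BOSH_ENV} -d <deployment> cloud-check --report"  # report unresponsive VMs

echo "${CHECK_VMS}"
echo "${CHECK_TASKS}"
echo "${CHECK_CCK}"
```
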

 

Pivotal Container Service control plane (PKS deployment)

When to collect PKS deployment logs?

Problems with PKS deployment or the PKS control plane may fall into the following categories:

  • Unable to target or reach the UAA used for authentication with PKS
  • Issues with creating or updating PKS users
  • Trouble logging in to the PKS API
  • Unable to get a kubeconfig to access the clusters
  • Kubernetes cluster creation failures
  • Kubernetes cluster upgrade failures
  • Kubernetes cluster deletion failures
  • Smoke test failure

What to collect?

Log collection for this component is fairly straightforward. After logging in to BOSH, collect the PKS deployment logs using bosh logs -d pivotal-container-service-<UUID>. These should always be collected, along with a few additional artifacts that may be needed depending on the operation that failed.

  • Cluster creation or deletion fails
    • Debug logs for the failed bosh task using the command, bosh task <task_id> --debug
    • Service instance logs, covered under Kubernetes Clusters (Service instance deployments)
  • Cluster upgrade fails
    • Change-log of the failed "Apply Changes" attempt from Ops Manager
    • Debug logs for the failed bosh task using the command, bosh task <task_id> --debug
    • Service instance logs, covered under Kubernetes Clusters (Service instance deployments)
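The collection steps above can be sketched as a single pass. The deployment suffix and task ID are placeholders; substitute the UUID from `bosh deployments` and the failed task ID from `bosh tasks --recent`:

```shell
# Placeholder names, not values from a real environment.
PKS_DEPLOYMENT="pivotal-container-service-0123abcd"
FAILED_TASK_ID="42"

# Commands are built as strings and printed as a dry run.
COLLECT_LOGS="bosh logs -d ${PKS_DEPLOYMENT}"       # always collect for control plane issues
COLLECT_TASK="bosh task ${FAILED_TASK_ID} --debug"  # on create/upgrade/delete failures

echo "${COLLECT_LOGS}"
echo "${COLLECT_TASK}"
```
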

Kubernetes Clusters (Service instance deployments)

When to collect service instance deployment logs?

Any issues with the Kubernetes clusters and components warrant the collection of these logs:

  • Authentication failures when using kubectl
  • Unable to reach Kubernetes API
  • Kubernetes cluster creation failures
  • Kubernetes cluster upgrade failures
  • Smoke test failure
  • Unable to create new workloads
  • Not able to reach services or workloads deployed on the Kubernetes cluster
  • Kubernetes cluster deletion failures

What to collect?

To analyze the impacted cluster, the logs for that particular service deployment are needed. These can be collected after logging into BOSH and collecting the following:
  • Service instance deployment logs
    • bosh logs -d service-instance_<UUID>
  • If the issue has been isolated to the Kubernetes master VMs only
    • bosh logs -d service-instance_<UUID> master
    • The above command also works in a multi-master setup
  • If the issue has been isolated to the Kubernetes worker VMs
    • bosh logs -d service-instance_<UUID> worker
  • Cluster creation or deletion fails
    • PKS deployment logs
    • Debug logs for the failed bosh task using bosh task <task_id> --debug
    • bosh logs -d service-instance_<UUID>
  • Cluster upgrade fails
    • PKS deployment logs
    • Change-log of the failed "Apply Changes" attempt from Ops Manager
    • Debug logs for the failed bosh task using bosh task <task_id> --debug
    • bosh logs -d service-instance_<UUID>
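The per-cluster collection above can be sketched as follows. The service-instance UUID is a placeholder; find the real one via `bosh deployments`:

```shell
# Placeholder deployment name, not a value from a real environment.
SI_DEPLOYMENT="service-instance_0123abcd"

# Commands are built as strings and printed as a dry run.
LOGS_ALL="bosh logs -d ${SI_DEPLOYMENT}"            # whole cluster
LOGS_MASTER="bosh logs -d ${SI_DEPLOYMENT} master"  # master VMs only (also works multi-master)
LOGS_WORKER="bosh logs -d ${SI_DEPLOYMENT} worker"  # worker VMs only

echo "${LOGS_ALL}"
echo "${LOGS_MASTER}"
echo "${LOGS_WORKER}"
```
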

Workloads running on Kubernetes Clusters

Although supporting a specific user application and its corresponding workloads is outside the scope of Pivotal Support, there are a few scenarios where applications are not able to run successfully on the underlying platform (PKS). The aim here is to collect artifacts that help isolate a platform issue from an application issue.

When do you want to collect logs for workloads or applications running on Kubernetes clusters?

  • Application deployment fails
  • Pods stuck in ContainerCreating state
  • Pods stuck in ImagePullBackOff state
  • Kubernetes services not reachable
  • Kubernetes services missing endpoints
  • Not able to create persistent volumes and volume mount failures
  • Workloads hung in drain state

What to collect?

To isolate an application issue from a platform issue, the following information is needed:

  • Service deployment logs
    • Collection described in the steps above
  • kubectl describe on failing Kubernetes objects
  • kubectl logs for failing pods
  • Sample application deployment manifest to help reproduce the issue
  • When using a Helm chart to deploy apps, the Helm chart name and release version (if publicly available)
  • For ingress and service accessibility issues
    • kubectl describe of ingress or service resource
    • traceroute on the IP
    • In the case of NSX-T, use the search bar to check whether the same IP is pointing to multiple NSX-T objects
    • Debugging services
  • Debugging pods
  • Application debugging
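
The kubectl artifacts listed above can be gathered with standard kubectl commands. The namespace, pod, and service names below are placeholders, not values from this article:

```shell
# Placeholder object names; substitute the failing objects in your cluster.
NAMESPACE="default"
POD="my-failing-pod"
SERVICE="my-service"

# Commands are built as strings and printed as a dry run.
DESCRIBE_POD="kubectl -n ${NAMESPACE} describe pod ${POD}"
POD_LOGS="kubectl -n ${NAMESPACE} logs ${POD} --previous"   # include logs from a prior crashed container
DESCRIBE_SVC="kubectl -n ${NAMESPACE} describe service ${SERVICE}"
EVENTS="kubectl -n ${NAMESPACE} get events --sort-by=.metadata.creationTimestamp"

echo "${DESCRIBE_POD}"
echo "${POD_LOGS}"
echo "${DESCRIBE_SVC}"
echo "${EVENTS}"
```
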