How to troubleshoot VMware Tanzu Kubernetes Grid clusters with custom Crashd args and diagnostics.crsh files

Products

Tanzu Kubernetes Grid

Issue/Introduction

This article provides custom args and diagnostics.crsh files to use with the VMware Tanzu Kubernetes Grid (TKG) Crash Diagnostic (Crashd) utility and shows how to use them to gather data from TKG clusters in specific scenarios or use cases.

The attached custom files can:

Provide a quick method to install additional system utilities (netstat, etc.) and obtain further diagnostics in a single execution.
Reduce the amount of time needed to diagnose certain scenarios.
Eliminate unwanted data during analysis.
Be further customized by the operator or engineer as the scenario or use case presents itself.
Target only Kubernetes data, limited to specific API resource groups, such as those used by clusters attached to Tanzu Mission Control (TMC).

Procedure Overview

Note: The scripts referenced in this article have been tested on Ubuntu-based TKG clusters.

Point your Crashd binary to any of the attached pairs of args and diagnostics.crsh files, based on your TKG diagnostic requirements.

The following is an overview of the procedure. The details of the procedure differ based on your scenario:

1. Select a pair of args and diagnostics.crsh files.

2. Download or copy the referenced args and diagnostics.crsh files to the system Crashd will run from.

3. Set the mentioned parameters in the args file.

Optional: Customize the following in the diagnostics.crsh file:

System commands to run on nodes, using capture() to write the output to local files or run() to display the results on local stdout.
Kubernetes objects or Kubernetes logs to retrieve from the Kubernetes API.

4. Run crashd. The following is an example:

crashd run --args-file <args-file> --debug <diagnostics.crsh-file>
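
The command above can be assembled by a small wrapper that first confirms both files exist. The following is a minimal sketch, not part of the attached files: the function name build_crashd_cmd and the placeholder file names are hypothetical, and the function only prints the command instead of running it.

```shell
#!/bin/sh
# Hypothetical helper: print the crashd command for a given pair of files.
# It only prints the command, so it is safe to run without crashd installed.
build_crashd_cmd() {
    args_file=$1
    diag_file=$2
    for f in "$args_file" "$diag_file"; do
        if [ ! -f "$f" ]; then
            echo "missing file: $f" >&2
            return 1
        fi
    done
    printf 'crashd run --args-file %s --debug %s\n' "$args_file" "$diag_file"
}

# Demonstration with empty placeholder files in a temporary directory.
demo_dir=$(mktemp -d)
: > "$demo_dir/args"
: > "$demo_dir/diagnostics.crsh"
build_crashd_cmd "$demo_dir/args" "$demo_dir/diagnostics.crsh"
```

Once the printed command looks correct, it can be copied and run as is.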

Prerequisites

1. Install Crashd (Crash Diagnostics). For more information, see the following TKG documentation: Troubleshooting Tanzu Kubernetes Clusters with Crash Diagnostic

2. Confirm that crashd is working in your environment before using the custom files in the Resolution section.

Note: Crashd should first run successfully with the default args and diagnostics.crsh files that ship with the crashd download, configured as described in the TKG documentation.
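
Before running the default files, a quick check that the required binaries are on the PATH can save a failed first attempt. The following is a minimal sketch; the function name have_tool is hypothetical, and the check only confirms that the binaries can be found, not that they work against your cluster.

```shell
#!/bin/sh
# Report whether a command is available on PATH.
have_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH" >&2
        return 1
    fi
}

# crashd and kubectl are the tools this article relies on.
# A missing tool is reported but does not abort the check.
for tool in crashd kubectl; do
    have_tool "$tool" || true
done
```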

Environment

VMware Tanzu Kubernetes Grid 1.x

Resolution

Pick and run the crashd custom files that fit your scenario or apply to your project.

Scenario 1:

You need additional system diagnostics from your cluster nodes.

Details:

The following args and diagnostics.crsh files capture system-only diagnostics (no Kubernetes objects) from all cluster nodes based on the current kubeconfig context.

These files also include an example of installing additional system utilities (netstat, etc.).

Files:

args_sysdiag_all_hosts.txt

diagnostics.crsh_sysdiag_all_hosts.txt

This diagnostics.crsh file:

Sources the TKG args file. Requires only the ssh_pk_file and workdir values.
Uses current kubeconfig context to identify cluster nodes.
Connects to each node.
Executes all capture and/or run commands placed in the diagnostics.crsh file.
Creates a tar file of outputs.
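
The attached file is the authoritative version. For orientation only, a script with this shape could look like the sketch below, which a shell heredoc writes to a scratch file. It is an assumption based on the functions documented in the open source Crashd reference (crshd_config, ssh_config, kube_config, kube_nodes_provider, resources, capture, run, archive). The args names ssh_user and cluster_config, the two commands, and the output file name are placeholders, and the sketch has not been validated against a TKG cluster.

```shell
#!/bin/sh
# Write an illustrative diagnostics.crsh-style script to a scratch file.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
# Sketch only; not the attached file.
conf = crshd_config(workdir=args.workdir)
ssh_conf = ssh_config(username=args.ssh_user, private_key_path=args.ssh_pk_file)
kube_conf = kube_config(path=args.cluster_config)

# Resolve the cluster nodes through the Kubernetes API.
nodes = resources(provider=kube_nodes_provider(kube_config=kube_conf, ssh_config=ssh_conf))

## COMMANDS
# capture() writes each command's output to a file under workdir.
capture(cmd="sudo df -i", resources=nodes)
# run() returns the results to the script, which prints them here.
results = run(cmd="uptime", resources=nodes)
print(results)

# Bundle everything collected under workdir into one archive.
archive(output_file="sysdiag.tar.gz", source_paths=[conf.workdir])
EOF
echo "wrote sketch to $sketch"
```

Compare the sketch with the attached file before reusing any part of it; the attached file may resolve nodes or name its outputs differently.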

Steps:

1. Download args_sysdiag_all_hosts.txt and diagnostics.crsh_sysdiag_all_hosts.txt files.

2. Set the ssh_pk_file, workdir, and cluster_config values in the args file.

3. Edit the diagnostics.crsh file and add your own commands as needed.

4. Add the commands to the "## COMMANDS" section of the file.

5. Set the kubeconfig context to the cluster of your choice, or keep the context that is already set in the default /home/ubuntu/.kube/config:

kubectl config use-context <tkg-cluster-context>
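
A missing value in the args file is an easy mistake to make in step 2. The following is a minimal sketch of a pre-flight check, assuming the args file uses key=value lines with # comments. The function name require_keys and every path shown are hypothetical placeholders; substitute the values for your environment.

```shell
#!/bin/sh
# Fail if any of the named keys has no non-empty key=value line in the file.
require_keys() {
    file=$1
    shift
    missing=0
    for key in "$@"; do
        if ! grep -Eq "^[[:space:]]*${key}[[:space:]]*=[[:space:]]*[^[:space:]]" "$file"; then
            echo "missing or empty: $key" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Placeholder args file; every value here is hypothetical.
args_file=$(mktemp)
cat > "$args_file" <<'EOF'
# SSH private key used to reach the nodes
ssh_pk_file=/home/ubuntu/.ssh/id_rsa
# Directory where crashd stores collected output
workdir=./workdir
# kubeconfig of the cluster to diagnose
cluster_config=/home/ubuntu/.kube/config
EOF

require_keys "$args_file" ssh_pk_file workdir cluster_config && echo "args file looks complete"
```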

Execute:

Run the following crashd command:

crashd run --args-file <args-file> --debug <diagnostics.crsh-file>
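
After the run completes, listing the tar file is a quick way to confirm that output was collected from each node. The following is a minimal sketch; the archive name and directory layout are hypothetical, and the script builds a stand-in archive so that it runs without a cluster. It relies on tar detecting compression automatically when reading, which GNU tar and bsdtar do.

```shell
#!/bin/sh
# Print the entries of an output archive without extracting it.
list_archive() {
    tar -tf "$1"
}

# Stand-in archive so the sketch runs anywhere; the layout is hypothetical.
work=$(mktemp -d)
mkdir -p "$work/workdir/192.0.2.11"
echo "sample output" > "$work/workdir/192.0.2.11/df.txt"
tar -czf "$work/diagnostics.tar.gz" -C "$work" workdir

list_archive "$work/diagnostics.tar.gz"
```

Point list_archive at the tar file that your diagnostics.crsh file creates to inspect real results.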

Scenario 2:

You want system diagnostics from specific cluster nodes.

Details:

The following args and diagnostics.crsh files capture system-only diagnostics from a list of cluster nodes that you provide.

Files:

args_sysdiag_custom_hosts.txt

diagnostics.crsh_sysdiag_custom_hosts.txt

This diagnostics.crsh file:

Sources the TKG args file.
Requires ssh_pk_file, workdir, hosts (a list of host IP addresses).
Connects to each node.
Executes all "capture" commands listed in the diagnostics.crsh file.
Creates a tar file of outputs.

Steps:

1. Download args_sysdiag_custom_hosts.txt and diagnostics.crsh_sysdiag_custom_hosts.txt.

2. Set the ssh_pk_file, ssh_user, and workdir values in the args file. Set only_target_hosts=yes.

3. Set hosts to a comma-delimited list of node IP addresses.

4. Edit the diagnostics.crsh file and add your own commands as needed.
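
The comma-delimited hosts value from step 3 can be generated instead of typed by hand. The following is a minimal sketch; the function name join_hosts is hypothetical, and the addresses are documentation-range placeholders, not real node IPs.

```shell
#!/bin/sh
# Join all arguments with commas, the format the hosts value expects.
join_hosts() {
    (IFS=,; printf '%s\n' "$*")
}

# Documentation-range addresses used as placeholders.
hosts=$(join_hosts 192.0.2.11 192.0.2.12 192.0.2.13)

# Lines to place in the args file for this scenario.
printf 'only_target_hosts=yes\n'
printf 'hosts=%s\n' "$hosts"
```

The two printed lines can be appended to the args file in place of any existing only_target_hosts and hosts entries.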

Execute:

Run the following crashd command:

crashd run --args-file <args-file> --debug <diagnostics.crsh-file>

Scenario 3:

You want only Kubernetes objects from your clusters, and you want to customize which objects are captured.

Details:

The following args and diagnostics.crsh files capture only Kubernetes objects from your management cluster or a list of workload clusters. No system diagnostics are captured.

Set the specific Kubernetes object data to capture, for example for TMC-attached clusters. The files include examples for additional Kubernetes API resource groups.

Files:

args-custom_kube_capture.txt

diagnostics.crsh-custom_kube_capture.txt

This diagnostics.crsh file:

Sources the TKG args file.
Uses the same arguments discussed in Troubleshooting Tanzu Kubernetes Clusters with Crash Diagnostics.
References a list of namespaces by type of cluster (bootstrap, workload, management). This list is updated by the operator.
Captures Kubernetes objects (api-resources) provided by the operator.
Collects only Kubernetes object data (no ssh commands) from the Kubernetes API.
Creates a tar file of outputs.

Steps:

1. Download args-custom_kube_capture.txt and diagnostics.crsh-custom_kube_capture.txt files.

2. Edit the args file. Set the values as explained in the Configure Crashd documentation.

3. Set cluster target type.

4. Edit the diagnostics.crsh file and:

Update the Kubernetes namespaces to diagnose.
Update the Kubernetes objects to capture.
Refer to examples included in this file.
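
The cluster target type set in step 3 can be checked before the run. The following is a minimal sketch that assumes the args file stores the type under a key named target with one of the values bootstrap, mgmt, or workload, as in the default TKG args file. Confirm the key name and accepted values against the attached args file; the function names get_arg and valid_target are hypothetical.

```shell
#!/bin/sh
# Read the value of one key from a key=value args file (last match wins).
get_arg() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" | tail -n 1
}

# Accept only the cluster types this sketch assumes are valid.
valid_target() {
    case $1 in
        bootstrap|mgmt|workload) return 0 ;;
        *) echo "unexpected target: '$1'" >&2; return 1 ;;
    esac
}

# Placeholder args file; the key name and value are assumptions.
args_file=$(mktemp)
printf 'target=workload\n' > "$args_file"

target=$(get_arg "$args_file" target)
valid_target "$target" && echo "target=$target"
```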

Execute:

Run the following crashd command:

crashd run --args-file <args-file> --debug <diagnostics.crsh-file>

Additional Information

For more information on creating additional crash diagnostics configuration files, refer to the following:

The open source Crash Diagnostics (Crashd) repo
The Crash Diagnostics Reference manual
The Google GitHub repo for the Starlark language. Starlark is the language Crashd script files (diagnostics.crsh) are written in.
The Crash Diagnostics (re)Design document
Additional examples of diagnostics.crsh files

Attachments

args-custom_kube_capture

args_sysdiag_custom_hosts

diagnostics.crsh-custom_kube_capture

diagnostics.crsh_sysdiag_custom_hosts

diagnostics.crsh_sysdiag_all_hosts

args_sysdiag_all_hosts