Troubleshooting Kubernetes clusters using audit logs

Article ID: 368875

Products

  • Tanzu Kubernetes Grid
  • VMware Tanzu Kubernetes Grid
  • VMware Tanzu Kubernetes Grid 1.x
  • VMware Tanzu Kubernetes Grid Plus
  • VMware Tanzu Kubernetes Grid Plus 1.x
  • VMware Tanzu Kubernetes Grid Service (TKGs)
  • VMware Tanzu Toolkit for Kubernetes
  • VMware Tanzu Toolkit for Kubernetes 1.x

Issue/Introduction

  • Learn how to use Kubernetes audit logs to troubleshoot Kubernetes clusters.
  • Learn how to find the root cause of performance issues in a Kubernetes environment.

Environment

Kubernetes clusters with audit logging enabled

Resolution

Verify if audit logging is enabled in the cluster

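Check the kube-apiserver manifest on a control plane node to confirm that Kubernetes audit logging is enabled. The output should include at least the --audit-log-path and --audit-policy-file flags: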

grep audit /etc/kubernetes/manifests/kube-apiserver.yaml
    - --audit-log-maxage=30
    - --audit-log-maxbackup=10
    - --audit-log-maxsize=100
    - --audit-log-path=/var/log/kubernetes/audit.log
    - --audit-policy-file=/etc/kubernetes/audit-policy.yaml
      name: audit-logs
    - mountPath: /etc/kubernetes/audit-policy.yaml
      name: audit-policy
    name: audit-logs
      path: /etc/kubernetes/audit-policy.yaml
    name: audit-policy
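
The manual check above can also be scripted. The following is a minimal sketch, assuming the flags are written in the `--flag=value` form shown in the grep output; `audit_logging_enabled` is a hypothetical helper name, not a Kubernetes tool.

```shell
#!/usr/bin/env bash
# Sketch: report whether a kube-apiserver manifest enables the audit log backend.
# audit_logging_enabled is a hypothetical helper name used for illustration.
audit_logging_enabled() {
  local manifest="${1:-/etc/kubernetes/manifests/kube-apiserver.yaml}"
  local flag missing=0
  # Both flags are needed: one names the audit policy, the other the log file.
  for flag in --audit-policy-file --audit-log-path; do
    # -e keeps grep from treating the leading dashes as an option.
    if ! grep -q -e "${flag}=" "$manifest"; then
      echo "missing flag: ${flag}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

Calling `audit_logging_enabled` without arguments inspects the default manifest path; the function returns a non-zero status and names the missing flag if either one is absent.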

Understanding the Log Format

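The kube-apiserver writes audit events in a structured JSON format, one event per line. Below is an example of a single event, pretty-printed for readability, with the bodies of requestObject and responseObject elided: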

{
  "kind": "Event",
  "apiVersion": "audit.k8s.io/v1",
  "level": "RequestResponse",
  "requestReceivedTimestamp": "2021-09-01T10:53:08Z",
  "auditID": "77f9b6d1-7d3d-4408-875e-0f7ab1b1e7a4",
  "stage": "ResponseComplete",
  "requestURI": "/api/v1/namespaces/default/pods/example-pod",
  "verb": "get",
  "user": {
    "username": "user1",
    "groups": ["system:masters", "system:authenticated"]
  },
  "sourceIPs": ["192.168.0.1"],
  "responseStatus": {
    "metadata": {},
    "code": 200
  },
  "requestObject": {
    // request spec
  },
  "responseObject": {
    // response spec
  }
}

Understanding Event Structure

The most important fields for interpreting an event are:

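  • auditID: a unique ID for the request; all events generated for the same request share it.
  • stage: the stage of the request handling pipeline at which this event was generated (RequestReceived, ResponseStarted, ResponseComplete, or Panic).
  • requestURI: the URI of the request, as sent by the client.
  • verb: the Kubernetes verb that was called (e.g., get, list, watch, create, update, patch, delete).
  • user: information about the authenticated user who made the request, including the username and groups.
  • sourceIPs: the IP addresses the request originated from, including any intermediate proxies.
  • responseStatus: the status of the response, including the HTTP status code.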
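
To see these fields side by side, each event can be reduced to a single line. This is a minimal sketch; `summarize_events` is a hypothetical helper name, and the choice and order of columns is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: print one tab-separated summary line per audit event.
# summarize_events is a hypothetical helper; the column choice is illustrative.
summarize_events() {
  local log="${1:-/var/log/kubernetes/audit.log}"
  # Columns: auditID, stage, verb, username, source IPs, status code, URI.
  # Missing source IPs become an empty column; a missing status code becomes "-".
  jq -r '[.auditID, .stage, .verb, .user.username,
          (.sourceIPs // [] | join(",")),
          (.responseStatus.code // "-" | tostring),
          .requestURI] | @tsv' "$log"
}
```

Piping the output through `column -t` or `less -S` can make it easier to scan on a terminal.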

Extracting Information from audit logs for root cause analysis

The following commands extract frequently needed information from Kubernetes audit logs. Depending on the scenario, they can help pinpoint the source of a performance issue, such as a single client that floods the API server with requests. The commands use jq for JSON processing, so install jq before analyzing audit logs.
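
Before running the queries, it can help to confirm that jq is installed and that the log parses cleanly. The sketch below assumes the default log path from the manifest shown earlier; `check_audit_log` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: confirm jq is installed and that every line of the audit log parses as JSON.
# check_audit_log is a hypothetical helper name, not part of Kubernetes or jq.
check_audit_log() {
  local log="${1:-/var/log/kubernetes/audit.log}"
  local count
  command -v jq >/dev/null 2>&1 || { echo "jq is not installed" >&2; return 1; }
  [ -r "$log" ] || { echo "cannot read ${log}" >&2; return 1; }
  # Stream the file and count events without loading it all into memory;
  # jq exits non-zero if it meets malformed input.
  count=$(jq -n 'reduce inputs as $event (0; . + 1)' "$log" 2>/dev/null) || {
    echo "${log} contains malformed JSON" >&2
    return 1
  }
  echo "$count"
}
```

On success the function prints the number of events in the file, which also gives a rough idea of how long the queries will take.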

Examples of extracting the top 10 objects

  • Top 10 users who made requests
    • cat /var/log/kubernetes/audit.log | jq -r .user.username | sort | uniq -c | sort -nr | head -n 10
  • Top 10 IPs from which requests were made
    • cat /var/log/kubernetes/audit.log | jq -r '.sourceIPs[]' | sort | uniq -c | sort -nr | head -n 10
  • Top 10 most accessed URIs
    • cat /var/log/kubernetes/audit.log | jq -r .requestURI | sort | uniq -c | sort -nr | head -n 10
  • Top 10 request verbs used
    • cat /var/log/kubernetes/audit.log | jq -r .verb | sort | uniq -c | sort -nr | head -n 10
  • Top 10 response status codes
    • cat /var/log/kubernetes/audit.log | jq -r .responseStatus.code | sort | uniq -c | sort -nr | head -n 10
  • Top 10 namespaces with the most requests
    • cat /var/log/kubernetes/audit.log | jq -r '.requestURI | match("/namespaces/(.*?)/").captures[0].string' | sort | uniq -c | sort -nr | head -n 10
  • Top 10 most frequently created resources
    • cat /var/log/kubernetes/audit.log | jq -r 'select(.verb == "create") | .requestURI' | sort | uniq -c | sort -nr | head -n 10
  • Top 10 most frequently deleted resources
    • cat /var/log/kubernetes/audit.log | jq -r 'select(.verb == "delete") | .requestURI' | sort | uniq -c | sort -nr | head -n 10
  • Top 10 most frequently updated resources
    • cat /var/log/kubernetes/audit.log | jq -r 'select(.verb == "update") | .requestURI' | sort | uniq -c | sort -nr | head -n 10
  • Top 10 most requested pod URIs (including list and watch requests on pod collections)
    • cat /var/log/kubernetes/audit.log | jq -r 'select(.requestURI | contains("/pods")) | .requestURI' | sort | uniq -c | sort -nr | head -n 10
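
The pipelines above all share the same shape, so they can be folded into one reusable function. This is a minimal sketch rather than a definitive tool; `top_n` is a hypothetical helper name, and its argument order (jq filter, count, log path) is an assumption.

```shell
#!/usr/bin/env bash
# Sketch: generalize the "top 10" pipelines into one function.
# top_n is a hypothetical helper. Arguments: jq filter, count, log path.
top_n() {
  local filter="$1" n="${2:-10}" log="${3:-/var/log/kubernetes/audit.log}"
  # "// empty" drops events where the filter yields null or false,
  # so absent fields are not counted under the label "null".
  jq -r "(${filter}) // empty" "$log" | sort | uniq -c | sort -nr | head -n "$n"
}
```

For example, `top_n '.verb' 5` lists the five most used verbs. If the audit policy does not omit any stages, each request is logged once per stage, so a filter such as `top_n 'select(.stage == "ResponseComplete") | .user.username'` counts each request only once. The `objectRef.namespace` field of the audit event can serve as an alternative to the regular expression for namespaces: `top_n '.objectRef.namespace'`.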

Examples of extracting general information about Kubernetes objects

  • List all audit events
    • cat /var/log/kubernetes/audit.log | jq .
  • Get all unique users
    • cat /var/log/kubernetes/audit.log | jq .user.username | sort | uniq
  • Get all requests from a specific user
    • cat /var/log/kubernetes/audit.log | jq 'select(.user.username == "user1")'
  • Find all requests to a specific URI
    • cat /var/log/kubernetes/audit.log | jq 'select(.requestURI == "/api/v1/namespaces/default/pods/example-pod")'
  • Find all pod eviction events
    • cat /var/log/kubernetes/audit.log | jq 'select(.verb == "create" and .requestObject.kind == "Eviction")'
  • List all unique verbs
    • cat /var/log/kubernetes/audit.log | jq .verb | sort | uniq
  • Find all get requests
    • cat /var/log/kubernetes/audit.log | jq 'select(.verb == "get")'
  • Find all delete requests
    • cat /var/log/kubernetes/audit.log | jq 'select(.verb == "delete")'
  • List all unique source IPs
    • cat /var/log/kubernetes/audit.log | jq '.sourceIPs[]' | sort | uniq
  • Find all requests from a specific IP
    • cat /var/log/kubernetes/audit.log | jq 'select(any(.sourceIPs[]?; . == "192.168.0.1"))'
  • List all response statuses
    • cat /var/log/kubernetes/audit.log | jq .responseStatus.code | sort | uniq
  • Find all failed requests (HTTP status 400 or higher)
    • cat /var/log/kubernetes/audit.log | jq 'select(.responseStatus.code >= 400)'
  • Find all successful requests (HTTP status 2xx)
    • cat /var/log/kubernetes/audit.log | jq 'select(.responseStatus.code >= 200 and .responseStatus.code < 300)'
  • Find all requests in a specific time range
    • cat /var/log/kubernetes/audit.log | jq 'select(.requestReceivedTimestamp >= "2021-09-01T00:00:00" and .requestReceivedTimestamp < "2021-09-02T00:00:00")'
  • Find all requests made in the 'default' namespace
    • cat /var/log/kubernetes/audit.log | jq 'select(.requestURI | startswith("/api/v1/namespaces/default"))'
  • Find all pod creation requests
    • cat /var/log/kubernetes/audit.log | jq 'select((.requestURI | contains("/pods")) and .verb == "create")'
  • Find all events related to a specific audit ID
    • cat /var/log/kubernetes/audit.log | jq 'select(.auditID == "77f9b6d1-7d3d-4408-875e-0f7ab1b1e7a4")'
  • Find all requests where a specific object was updated
    • cat /var/log/kubernetes/audit.log | jq 'select(.requestObject.kind == "Pod" and .verb == "update")'
  • Find all events in the 'ResponseComplete' stage
    • cat /var/log/kubernetes/audit.log | jq 'select(.stage == "ResponseComplete")'
  • Find all requests made by the 'system:masters' group
    • cat /var/log/kubernetes/audit.log | jq 'select(any(.user.groups[]?; . == "system:masters"))'
  • Find all requests where the response code was not 200 or 201
    • cat /var/log/kubernetes/audit.log | jq 'select(.responseStatus.code != 200 and .responseStatus.code != 201)'
  • Find all events with a specific kind of request object
    • cat /var/log/kubernetes/audit.log | jq 'select(.requestObject.kind == "Pod")'
  • Find all events where the response object had a certain status
    • cat /var/log/kubernetes/audit.log | jq 'select(.responseObject.status == "Failure")'
  • Find all requests that accessed the '/healthz' endpoint
    • cat /var/log/kubernetes/audit.log | jq 'select(.requestURI == "/healthz")'
  • Find all events where the response object was a certain kind
    • cat /var/log/kubernetes/audit.log | jq 'select(.responseObject.kind == "Status")'
  • Find all 'watch' events
    • cat /var/log/kubernetes/audit.log | jq 'select(.verb == "watch")'
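
Individual filters like the ones above are often combined during root cause analysis. As one illustration, the sketch below counts failed requests per user, verb, status code, and URI. `failed_requests_summary` is a hypothetical helper name, and treating HTTP status 400 or higher as a failure is an assumption that may need adjusting for a given investigation.

```shell
#!/usr/bin/env bash
# Sketch: count failed requests per user, verb, status code and URI.
# failed_requests_summary is a hypothetical helper. Arguments: log path, row limit.
failed_requests_summary() {
  local log="${1:-/var/log/kubernetes/audit.log}" limit="${2:-20}"
  # Only the ResponseComplete stage is considered so that each request is
  # counted once; events without a status code compare as less than 400
  # and are skipped.
  jq -r 'select(.stage == "ResponseComplete" and .responseStatus.code >= 400)
         | [.user.username, .verb, (.responseStatus.code | tostring), .requestURI]
         | @tsv' "$log" | sort | uniq -c | sort -nr | head -n "$limit"
}
```

A user or service account that dominates this list, for example with repeated 403 or 404 responses against the same URI, is usually a good starting point for further investigation.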

Additional Information

Learn more about Kubernetes audit logging in the upstream Kubernetes documentation on auditing ("Auditing in K8s").