MetricSink vs ClusterMetricSink: how it works

Article ID: 298624


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

This article provides more clarity on the metric sink types and their usage, with examples.

In order to collect metrics from a TKGi cluster, there are two built-in functionalities that can be used: MetricSink and ClusterMetricSink. The two methods use different plugins to collect the metrics.

MetricSink is enabled per namespace and collects application-provided metrics, such as NGINX stats or MySQL stats, exposed by the application running within the pod, using the Prometheus input plugin.

ClusterMetricSink is enabled for the whole cluster and runs as a daemonset, ensuring all workers are monitored. It uses the kubernetes input plugin to collect the metrics.

Additional details for both configurations are provided below.
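Both sink types are custom resources, so the sinks already defined on a cluster can be listed with kubectl (the lowercase plural resource names below are assumed from the kinds shown in the examples that follow):

kubectl get metricsinks --all-namespaces
kubectl get clustermetricsinks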

Environment

Product Version: 1.7

Resolution

Let's start with MetricSink.

Here is a sample MetricSink set in the default namespace that sends metrics to Splunk:
apiVersion: pksapi.io/v1beta1
kind: MetricSink
metadata:
  name: my-metric-sink
  namespace: default
spec:
  inputs:
  outputs:
  - data_format: splunkmetric
    headers:
      Authorization: Splunk c797b318-...-78f2f4a3fb94
      Content-Type: application/json
    insecure_skip_verify: true
    method: POST
    splunkmetric_hec_routing: true
    type: http
    url: https://SPLUNKFQDN:8088/services/collector
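
Assuming the manifest above is saved as my-metric-sink.yaml (the file name is arbitrary), it can be applied with:

kubectl apply -f my-metric-sink.yaml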

Once this is applied, a new telegraf-my-metric-sink deployment is started in the specified namespace, together with a configmap containing the inputs and outputs defined in the MetricSink (see the verification commands below). At this point the sink is ready to collect pod application metrics. To instruct telegraf to collect them, annotations have to be applied to the pods/deployment, specifying the path and port from which the metrics are to be scraped.
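The objects created for the sink can be verified with kubectl, for example (exact names may differ slightly between TKGi versions):

kubectl get deployments,configmaps -n default | grep -i telegraf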

The example below contains annotations defined for an nginx deployment which exposes application metrics on port 9913:
 
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: default
  name: nginx-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: nginx-server
  template:
    metadata:
      annotations:
        prometheus.io/path: "/metrics"
        prometheus.io/scrape: "true"
        prometheus.io/port: "9913"
      labels:
        app: nginx-server
    spec:
      containers:
      - name: nginx-demo
        image: nginx-vts-exporter
        imagePullPolicy: Always
        resources:
          limits:
            cpu: 250m
          requests:
            cpu: 20m
        ports:
        - containerPort: 80
          name: http
        - containerPort: 9913
          name: metrics
 

Telegraf will discover the defined annotations and automatically start collecting metrics from the specified path and port.
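Before relying on the annotations, it can help to confirm that the application really serves metrics on the annotated path and port, for example by running curl inside the pod (this assumes curl is available in the container image; the pod name is a placeholder):

kubectl get pods -n default -l app=nginx-server
kubectl exec -n default <nginx-pod-name> -- curl -s http://localhost:9913/metrics | head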

To troubleshoot MetricSink:

1. Follow the logs of the telegraf pod (replace telegraf-ID with the actual pod name):
kubectl logs -n default telegraf-ID -f
and monitor for any issues such as failed-to-connect or unauthorized messages.

2. If you do not see errors in the logs, you can further confirm that telegraf is sending data by logging into the worker running the pod and verifying with a packet capture (see the example after this list for locating the worker).

SSH to the worker, then monitor traffic towards SPLUNKFQDN (or its IP address) and/or port 8088:
tcpdump -n port 8088 and host SPLUNKFQDN
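
To identify the worker node hosting the telegraf pod before running the capture, list the pods with the wide output, for example:

kubectl get pods -n default -o wide | grep telegraf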

Next is ClusterMetricSink. The definition is similar; note that there is no namespace defined.
apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: my-cluster-metric-sink
spec:
  inputs:
  outputs:
  - data_format: splunkmetric
    headers:
      Authorization: Splunk c797b318-63f4-4dda-a928-78f2f4a3fb94
      Content-Type: application/json
    insecure_skip_verify: true
    method: POST
    splunkmetric_hec_routing: true
    type: http
    url: https://SPLUNKFQDN:8088/services/collector
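
As with the MetricSink, the resource can be applied with kubectl (assuming the manifest is saved as my-cluster-metric-sink.yaml):

kubectl apply -f my-cluster-metric-sink.yaml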


Once this is applied, the configmap for telegraf in the pks-system namespace is updated with the provided inputs and outputs. In our case, as we use the default inputs, telegraf will be configured to retrieve metrics from the kubernetes input plugin.
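The updated telegraf configuration can be inspected in the pks-system namespace; since the exact configmap name may vary between TKGi versions, list first and then describe, for example:

kubectl get configmaps -n pks-system | grep -i telegraf
kubectl describe configmap <telegraf-configmap-name> -n pks-system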

ClusterMetricSink uses a pre-provisioned daemonset in the pks-system namespace to scrape kubernetes statistics; these stats cover a wide range, from single-pod to cluster-wide utilization.
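The daemonset and its pods can be checked with, for example (the exact daemonset name may vary by TKGi version):

kubectl get daemonsets -n pks-system
kubectl get pods -n pks-system -o wide | grep -i telegraf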

For troubleshooting purposes, the logs of the telegraf pods can also be checked, and a tcpdump capture can be analyzed to confirm that the metrics are being sent, in the same way as for MetricSink.
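For example, list the telegraf pods in pks-system and follow their logs (the pod name is a placeholder):

kubectl get pods -n pks-system | grep -i telegraf
kubectl logs -n pks-system <telegraf-pod-name> -f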