Logs are not reaching the log destination when TLS is enabled in the sink resource and using custom CA certificates in Tanzu Kubernetes Grid Integrated Edition

Article ID: 298694


Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Logs are not reaching the log destination when TLS is enabled in the sink resource and using custom CA certificates in Tanzu Kubernetes Grid Integrated Edition (TKGI).

After a sink resource is created with TLS enabled and a custom CA, the SSL connection between the log forwarder and the destination cannot be established because the server certificate cannot be validated. This happens because the custom CA is not trusted in the fluent-bit container. It appears that custom CAs specified in the BOSH Director settings, if any, are not propagated to the pods.

For more information on creating a sink resource, refer to Creating and Managing Sink Resource.

When this problem occurs, the log destination shows SSL handshake errors, but the fluent-bit pod logs do not show any SSL errors. Because the handshake never completes, the logs do not reach the log destination.
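Before building the workaround, it can help to confirm that the custom CA actually validates the destination's server certificate. The following is a minimal sketch, assuming `openssl` is installed on a machine that can reach the log destination. The function name `check_tls_with_ca`, the host `logs.example.com`, and port `6514` are placeholders that do not come from this article.

```shell
#!/bin/bash
# Hypothetical helper: attempt a TLS handshake to HOST:PORT and verify the
# server certificate against the CA bundle in CAFILE.
check_tls_with_ca() {
  local host=$1 port=$2 cafile=$3
  # -verify_return_error makes s_client abort when verification fails,
  # so the exit status reflects whether the chain could be validated.
  if openssl s_client -connect "${host}:${port}" -CAfile "$cafile" \
       -verify_return_error </dev/null >/dev/null 2>&1; then
    echo "OK: certificate chain verified with ${cafile}"
  else
    echo "FAIL: handshake or verification failed for ${host}:${port}"
    return 1
  fi
}

# Example (placeholder host and port):
#   check_tls_with_ca logs.example.com 6514 cert.pem
```

If the check fails with the custom CA file, mounting that file into fluent-bit is unlikely to fix the handshake either, so the CA data should be rechecked first.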

As a workaround, create a configmap containing the CA PEM data and mount it into the fluent-bit container as a certificate file in the /etc/ssl/certs directory. This workaround can be persisted through upgrades by using Plan Add-ons with a custom container.

Environment

Product Version: 1.11

Resolution

Note: The following workaround persists through TKGI tile and cluster upgrades. However, during normal operations, if the observability-manager pod is recreated or restarted for any reason (for example, a worker VM reboot), the fluent-bit daemonset is recreated and reverts to its default spec. A permanent fix is planned for TKGI v1.12. Upgrading to v1.12 once it is available is recommended.

The workaround entails creating a container image that contains the custom CA, the fluent-bit patch, and a shell script that deploys the patch. It also requires a registry, such as a local Harbor registry, to host the custom container image. The container image runs in the pks-system namespace of the cluster, which ensures that the patch persists through upgrades.

1. On a host where Docker is running, create a directory for this workaround and 'cd' into it. All files created in the following steps are stored in this directory.
$ mkdir fluent-bit-workaround; cd fluent-bit-workaround

2. Save the CA PEM data into a file named cert.pem. The PEM data below is only an example; replace it with the PEM data of your own CA.
$ cat > cert.pem <<EOF
-----BEGIN CERTIFICATE-----
MIIGPjCCBCagAwIBAgIJAKMduaqpCYfYMA0GCSqGSIb3DQEBCwUAMIGrMQswCQYD
VQQGEwJVUzETMBEGA1UECAwKQ2FsaWZvcm5pYTESMBAGA1UEBwwJUGFsbyBBbHRv
MQ8wDQYDVQQKDAZWTXdhcmUxFjAUBgNVBAsMDU1BUEJVIFN1cHBvcnQxHTAbBgNV
BAMMFFN1cHBvcnQgTGFicyBSb290IENBMSswKQYJKoZIhvcNAQkBFhxwdnRsLXN1
TX4MyEfPygH8R/eh#################################5wUzJp0a+6SIG90
1WzZFzzhnb9891F+9BNSKJy7R/8uIYhFVsP565+IesUSpW+nAyBfteLJQKFbWJyx
T/EjLE36+yNcvXpox9DjA0D1EHCCYfR8aEwB2EbiOgDSF1WjP0VNMJt+Bg016gCe
mfKpuEZdVaJWJFl/9AmQn0ChLDoQ6GAtpziFuKtHdXoqgc6WJZWuUiwMWVaU08I0
34tgDtJ/PGwEsybbnX7dkIEAhT/oUB55stiQ8SB8/FwgkAUe8JGnxJITL82CO26/
J3S9Zf4F50HbrhncESiTXyXW
-----END CERTIFICATE-----
EOF
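
A truncated or mis-pasted cert.pem only shows up later as the same silent TLS failure, so a quick sanity check at this point may save a rebuild. The following is a rough sketch. The function name `pem_markers_ok` is hypothetical, and it only checks that the BEGIN/END markers are present and balanced, not that the certificate itself is valid.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if FILE is readable and contains at least one
# certificate with matching BEGIN/END markers.
pem_markers_ok() {
  local file=$1 begins ends
  [ -r "$file" ] || return 1
  # grep -c prints the number of matching lines (0 when there are none).
  begins=$(grep -c -- '-----BEGIN CERTIFICATE-----' "$file")
  ends=$(grep -c -- '-----END CERTIFICATE-----' "$file")
  [ "$begins" -gt 0 ] && [ "$begins" -eq "$ends" ]
}

# Example:
#   pem_markers_ok cert.pem && echo "cert.pem looks structurally complete"
```

Where `openssl` is available, `openssl x509 -in cert.pem -noout -subject -enddate` gives a stronger check, because it fails if the encoded data cannot be parsed.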

3. Create a file named Dockerfile with the following contents:
ARG BASE_IMAGE=ubuntu:bionic
FROM $BASE_IMAGE as builder

RUN apt-get update && \
    apt-get install --no-install-recommends -y wget zip unzip ca-certificates && \
    update-ca-certificates && \
    apt-get clean

# Install kubectl
ARG KUBECTL_SOURCE=https://storage.googleapis.com/kubernetes-release/release/v1.15.11/bin/linux/amd64/kubectl
RUN wget $KUBECTL_SOURCE && mv kubectl /usr/local/bin/kubectl && \
    chmod +x /usr/local/bin/kubectl

COPY cert.pem /cert.pem
COPY fluent-bit-patch.json /fluent-bit-patch.json
COPY test.sh /test.sh
CMD ["/bin/bash", "test.sh"]

4. Create a file named fluent-bit-patch.json with the following contents.
{
   "spec":{
      "template":{
         "spec":{
            "containers":[
               {
                  "name":"fluent-bit",
                  "volumeMounts":[
                     {
                        "mountPath":"/etc/ssl/certs/cert.pem",
                        "subPath":"cert.pem",
                        "name":"ca-pemstore"
                     }
                  ]
               }
            ],
            "volumes":[
               {
                  "name":"ca-pemstore",
                  "configMap":{
                     "name":"ca-pemstore",
                     "defaultMode":420
                  }
               }
            ]
         }
      }
   }
}

5. Create a file named test.sh with the following contents.
#!/bin/bash
set -x

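# Wait for the observability-manager deployment to finish rolling out.
# Retry up to 60 times, one second apart, while the command fails
# (for example, because the deployment does not exist yet).
timeout=60
while [ "$timeout" -ne 0 ]
do
  kubectl rollout status deployment observability-manager -n pks-system --watch=true
  if [ $? -ne 0 ]
  then
    ((timeout--))
    sleep 1
  else
    break
  fi
done

if [ "$timeout" -eq 0 ]
then
  echo "observability-manager did not start after 60 attempts; deploying fluent-bit-patch failed!"
  sleep infinity
fi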

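# Wait for the fluent-bit daemonset in the same way. TIMEOUT is set by the
# add-on Deployment in step 9; the fallback of 600 is an assumption that
# mirrors that value and only applies if the variable is unset.
timeout=${TIMEOUT:-600}
while [ "$timeout" -ne 0 ]
do
  kubectl rollout status daemonset fluent-bit -n pks-system --watch=true
  if [ $? -ne 0 ]
  then
    ((timeout--))
    sleep 1
  else
    break
  fi
done

if [ "$timeout" -eq 0 ]
then
  echo "fluent-bit daemonset did not start within the timeout; deploying fluent-bit-patch failed!"
  sleep infinity
fi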

# Snapshot the existing configmap (if any) so that a CA change can be detected.
kubectl -n pks-system describe configmap ca-pemstore > ca.bk
if [ $? -ne 0 ]
then
  # No configmap yet: create it and patch the daemonset to mount it.
  kubectl -n pks-system create configmap ca-pemstore --from-file=cert.pem
  kubectl -n pks-system patch ds fluent-bit --patch "$(cat fluent-bit-patch.json)"
else
  # Configmap exists: re-apply the patch and recreate the configmap from cert.pem.
  kubectl -n pks-system patch ds fluent-bit --patch "$(cat fluent-bit-patch.json)"
  kubectl -n pks-system delete configmap ca-pemstore --wait=true
  kubectl -n pks-system create configmap ca-pemstore --from-file=cert.pem
  kubectl -n pks-system describe configmap ca-pemstore > ca.now
  # Restart fluent-bit only if the CA content changed.
  bk_md5sum=$(md5sum ca.bk | awk '{print $1}')
  now_md5sum=$(md5sum ca.now | awk '{print $1}')
  if [ "$bk_md5sum" != "$now_md5sum" ]
  then
    kubectl rollout restart daemonset/fluent-bit -n pks-system
  fi
fi
kubectl rollout status daemonset fluent-bit -n pks-system --watch=true

# The patch is applied, so remove this helper deployment.
#sleep infinity
kubectl -n pks-system delete deployment persist-fluent-bit-patch
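
The script above uses the same wait-and-retry pattern twice, once for the observability-manager deployment and once for the fluent-bit daemonset. The following is a sketch of how that pattern could be factored into one function. The name `retry` and the `RETRY_DELAY` variable are illustrative and are not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: run a command until it succeeds, at most ATTEMPTS times.
# RETRY_DELAY (seconds between attempts) defaults to 1, matching the
# "sleep 1" used in the script above.
retry() {
  local attempts=$1
  shift
  local n=0
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$attempts" ]; then
      return 1   # give up: the command never succeeded
    fi
    sleep "${RETRY_DELAY:-1}"
  done
  return 0
}

# Example, equivalent in spirit to the first loop of the script:
#   retry 60 kubectl rollout status deployment observability-manager -n pks-system \
#     || { echo "observability-manager did not start"; sleep infinity; }
```

Keeping the retry logic in one place makes it harder for the attempt count and the error message to drift apart.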

6. List all the files in the current directory and make sure you have the following files.
$ ls -l
-rw-rw-r-- 1 ubuntu ubuntu 2224 Jul 20 14:51 cert.pem
-rw-rw-r-- 1 ubuntu ubuntu  565 Jul 20 15:20 Dockerfile
-rw-rw-r-- 1 ubuntu ubuntu  693 Jul 20 15:20 fluent-bit-patch.json
-rw-rw-r-- 1 ubuntu ubuntu 1591 Jul 20 15:20 test.sh

7. Build the docker image using the 'docker build' command. Make sure to change the value given to the '-t' flag. In this example, 'harbor.lab-xx.vmware.com/library/persist-fluent-bit-patch' is the repository path of the image that will be created and 'dev' is a tag.
$ docker build -f Dockerfile . -t harbor.lab-xx.vmware.com/library/persist-fluent-bit-patch:dev

8. Push the image to the registry using the 'docker push' command. The argument provided to this command is the same as the <path>:<tag> value given to the '-t' flag in the 'docker build' command in step 7.
$ docker push harbor.lab-xx.vmware.com/library/persist-fluent-bit-patch:dev

9. Prepare the add-on definition YAML that will be set in the TKGI Plan(s). Paste the following into a text editor and change the 'image' value. Make sure that the value is the same as the <path>:<tag> value given to the '-t' flag in the 'docker build' command in step 7.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: persist-fluent-bit-patch
  namespace: pks-system
  labels:
    app: persist-fluent-bit-patch
spec:
  replicas: 1
  selector:
    matchLabels:
      app: persist-fluent-bit-patch
  template:
    metadata:
      labels:
        app: persist-fluent-bit-patch
    spec:
      serviceAccountName: observability-manager
      containers:
      - name: persist-fluent-bit-patch
        image: harbor.lab-xx.vmware.com/library/persist-fluent-bit-patch:dev
        imagePullPolicy: Always
        env:
        - name: TIMEOUT
          value: "600"

10. Depending on whether you are using EPMC or Ops Manager, follow the instructions below:

EPMC

a. In the TKGI Configuration, complete any necessary settings in the Wizard page and then click "GENERATE CONFIGURATION".

b. The TKGI Configuration YAML Editor page will appear and you can add the YAML data.  Using the Editor, scroll down to the "plans" section.

c. Within the plans section, go to the plan that you want to have this workaround.  Within the specific plan, change the "addons-spec" value to contain the YAML that was prepared in step 9. 

d. Make sure that a "|" (pipe) character and a new line precede the YAML data, and that 4 space characters precede each line of the YAML data.

See the following screenshot as an example.  This will need to be done in every plan where you want to persist this workaround.
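
Indenting the add-on YAML by hand is error-prone, so the 4-space prefix described in step d can be generated instead. The following is a small sketch that assumes the YAML from step 9 was saved to a file. The file name `persist-fluent-bit-patch.yaml` and the function name `indent_addon_yaml` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print FILE with 4 spaces in front of every non-empty
# line, ready to paste under the "|" that follows the "addons-spec" key.
indent_addon_yaml() {
  sed '/^$/!s/^/    /' "$1"
}

# Example:
#   indent_addon_yaml persist-fluent-bit-patch.yaml
```

The "|" character and the line break after it still need to be typed after the "addons-spec" key in the editor; the helper only produces the indented body.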

Ops Manager

a. In the TKGI tile settings, go to the tab of the Plan where you want to persist this workaround. This is the plan used by the clusters that need the workaround.

b. In the field named "(Optional) Add-ons - Use with caution", paste the YAML from step 9. Save the Plan settings. This will need to be done in every plan where you want to persist this workaround.



11. Click "Apply Configuration" (in EPMC) or "Apply Changes" (in Ops Manager), and upgrade all clusters that are using the changed plan(s).

Afterwards, the fluent-bit pods in the updated clusters should have the necessary CA, and the workaround should persist through subsequent upgrades.
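
To confirm that the patch is in place, or to detect that the daemonset has reverted to its default spec after an observability-manager restart, the daemonset's volumes can be inspected. The following is a minimal sketch, assuming `kubectl` is pointed at the affected cluster. The function name `fluent_bit_has_ca_mount` is hypothetical, and the volume name `ca-pemstore` is the one defined in fluent-bit-patch.json in step 4.

```shell
#!/bin/bash
# Hypothetical helper: return 0 if the fluent-bit daemonset declares a volume
# named "ca-pemstore", which is the volume added by the patch.
fluent_bit_has_ca_mount() {
  # The jsonpath expression prints the volume names separated by spaces.
  kubectl -n pks-system get daemonset fluent-bit \
    -o jsonpath='{.spec.template.spec.volumes[*].name}' \
    | tr ' ' '\n' | grep -qx 'ca-pemstore'
}

# Example:
#   if fluent_bit_has_ca_mount; then
#     echo "CA volume present"
#   else
#     echo "CA volume missing; the patch has not been applied or was reverted"
#   fi
```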