Workaround for the blackbox failure in the master node of a PKS cluster

Article ID: 298525

Updated On:

Products

VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Symptoms:
Symptoms or use cases of this issue include the following (example commands to check them appear after the list):
  • The master node VM shows as failing in bosh vms output
  • The blackbox job shows as anything other than 'running' in monit summary output on the master node VM
  • The blackbox stderr log file (/var/vcap/sys/log/syslog_forwarder/blackbox/blackbox.stderr.log) shows the below error:
    2018/11/14 19:00:27 lines flushed; exiting tailer
    2018/11/14 19:00:27 FATAL -- failed to create Watcher
    goroutine 4089 [running]:
    runtime/debug.Stack(0xc420012ba0, 0x21, 0x0)
            /var/vcap/data/packages/golang/xxxxx6ed3ac3b89eb235e73ed653ec3a635xxxxx/src/runtime/debug/stack.go:24 +0xa7
    github.com/hpcloud/tail/util.Fatal(0x6498cf, 0x18, 0x0, 0x0, 0x0)
            /var/vcap/packages/blackbox/src/github.com/hpcloud/tail/util/util.go:22 +0xc7
    github.com/hpcloud/tail/watch.(*InotifyTracker).run(0xc421c26600)
            /var/vcap/packages/blackbox/src/github.com/hpcloud/tail/watch/inotify_tracker.go:231 +0x4c8
    created by github.com/hpcloud/tail/watch.glob..func1
            /var/vcap/packages/blackbox/src/github.com/hpcloud/tail/watch/inotify_tracker.go:54 +0x1dd
  • There are a large number (possibly hundreds) of empty audit-*.log files in the /var/vcap/sys/log/kube-apiserver directory.
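
For reference, the following commands can be used to check each of the symptoms above. The deployment name and instance group name are placeholders and will differ per environment; treat this as a sketch rather than exact output.

    # From the machine running the BOSH CLI: check whether the master node reports as failing
    $ bosh -d <deployment-name> vms

    # From inside the master node VM (e.g. bosh -d <deployment-name> ssh master/0), as root:
    # Check the state of the blackbox job (anything other than 'running' matches the symptom)
    $ monit summary | grep blackbox

    # Look for the 'failed to create Watcher' error in the blackbox stderr log
    $ tail -n 50 /var/vcap/sys/log/syslog_forwarder/blackbox/blackbox.stderr.log

    # Count the empty audit log files under the kube-apiserver log directory
    $ find /var/vcap/sys/log/kube-apiserver -name 'audit-*.log' -size 0 | wc -l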

Environment


Cause

The blackbox process is failing because it has opened too many log files to monitor and is hitting a system limit on the number of files it can open. Most of these files are the empty audit log files inside the /var/vcap/sys/log/kube-apiserver directory, which accumulate because the BOSH log rotation conflicts with the kube-apiserver's own audit log rotation.
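
One way to gauge how close the blackbox process is to the limit is to compare the number of file descriptors it currently holds against the number of audit log files under the kube-apiserver log directory. This is a rough sketch; the process match on the blackbox package path is an assumption and may need adjusting.

    # Count file descriptors currently held by the blackbox process (process match is an assumption)
    $ sudo ls /proc/$(pgrep -f /var/vcap/packages/blackbox | head -1)/fd | wc -l

    # Count the audit log files (empty or not) that blackbox would try to tail
    $ ls /var/vcap/sys/log/kube-apiserver/ | grep -c '^audit'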

Resolution

The recommended workaround is to set audit-log-maxsize to a value large enough that the kube-apiserver never rotates the audit logs itself, leaving log rotation solely to BOSH logrotate. Note that by default the maxsize is 0 in the config file, which causes the kube-apiserver to apply its internal default of 100 MB before rotating audit.log. For this workaround, the recommendation is therefore to set the maxsize to a very large value, such as 10 TB.
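
Before making the change, the current value of the flag can be inspected in the job's bpm.yml. The exact layout of the file can vary between versions, so treat this as a sketch:

    # Show how the audit-log-maxsize flag is currently rendered for kube-apiserver
    $ grep -n -A1 -- 'audit-log-maxsize' /var/vcap/jobs/kube-apiserver/config/bpm.yml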

Follow the steps below to apply the workaround (a consolidated command sketch follows the steps):

1. Log in to the affected master node VM (via BOSH SSH).
2. Edit /var/vcap/jobs/kube-apiserver/config/bpm.yml to set '- --audit-log-maxsize' to '10000000', and save it. Note that maxsize is specified in MB, so this value is roughly 10 TB.
3. Restart the kube-apiserver job by running 'monit restart kube-apiserver'.
4. Remove the empty audit-*.log files inside the /var/vcap/sys/log/kube-apiserver directory, e.g.:
        $ cd /var/vcap/sys/log/kube-apiserver; find . -name 'audit-*.log' -size 0 -delete
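
As a consolidated sketch of steps 1-4 (the deployment name and instance group are placeholders, and the bpm.yml edit itself is done manually in an editor):

    # 1. Log in to the affected master node VM
    $ bosh -d <deployment-name> ssh master/0
    $ sudo -i

    # 2. Edit bpm.yml manually and set the value of '--audit-log-maxsize' to 10000000 (MB)
    $ vi /var/vcap/jobs/kube-apiserver/config/bpm.yml

    # 3. Restart the kube-apiserver job so the new flag takes effect
    $ monit restart kube-apiserver

    # 4. Remove the empty rotated audit log files
    $ cd /var/vcap/sys/log/kube-apiserver
    $ find . -name 'audit-*.log' -size 0 -delete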

Once the above steps are completed, the blackbox job should recover shortly, as monit will restart it automatically. Run 'monit summary' to verify that all jobs are now running.
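
For example, the recovery can be confirmed with a quick check of the relevant jobs; both should report 'running' within a few minutes:

    # Re-check job status until blackbox and kube-apiserver both report 'running'
    $ monit summary | grep -E 'blackbox|kube-apiserver'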