There are a large number (often hundreds) of empty audit.*.log files in the /var/vcap/sys/log/kube-apiserver directory.
Environment
Cause
The blackbox process is failing because it has too many log files open for monitoring and is hitting the system limit on the number of files a process can open. Most of this volume comes from the empty audit log files inside the /var/vcap/sys/log/kube-apiserver directory. The underlying cause is a conflict between BOSH log rotation and the kube-apiserver's own audit log rotation.
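To confirm this on an affected master node, the following generic checks can be run (a sketch only; the process name matched by pgrep and the use of /proc are assumptions about the stemcell environment, not product tooling):

# Identify the blackbox process
$ BB_PID=$(pgrep -f blackbox | head -1)

# Compare the number of open file descriptors against the per-process limit
$ ls /proc/$BB_PID/fd | wc -l
$ grep 'open files' /proc/$BB_PID/limits

# Count the empty audit log files that account for most of the monitored files
$ find /var/vcap/sys/log/kube-apiserver -name 'audit*.log' -size 0 | wc -l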
Resolution
The recommended workaround is to set the audit log maxsize to a value large enough that the kube-apiserver process never rotates the logs itself, leaving log rotation solely to BOSH logrotate. Note that by default the maxsize is 0 in the config file, which causes the kube-apiserver to apply its own default of 100 MB before rotating audit.log. For this workaround, the recommendation is to set the maxsize to a very large value, such as 10 TB.
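Before making the change, it can help to confirm how the flag is currently set. A minimal check, assuming standard grep availability on the VM (the pattern is illustrative, not part of the product tooling):

# Show the current --audit-log-maxsize setting in the BPM config for kube-apiserver
# (a value of 0 means the kube-apiserver default of 100 MB applies)
$ grep -n 'audit-log-maxsize' /var/vcap/jobs/kube-apiserver/config/bpm.yml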
Follow the steps below to apply the workaround:
1. Log in to the affected master node VM (via BOSH SSH).
2. Edit /var/vcap/jobs/kube-apiserver/config/bpm.yml to set '- --audit-log-maxsize' to '10000000' and save the file. Note that the maxsize is in MB, so this value is roughly 10 TB.
3. Restart the kube-apiserver job by running: monit restart kube-apiserver (an optional verification sketch follows this list).
4. Remove the empty audit log files inside the /var/vcap/sys/log/kube-apiserver directory, e.g.:
   $ cd /var/vcap/sys/log/kube-apiserver; find . -name 'audit-*.log' -size 0 -delete
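As an optional check between steps 3 and 4, the following commands (a sketch; the ps/grep filtering is a generic convenience, not product tooling) can confirm that the restarted kube-apiserver picked up the new maxsize and show how many empty files step 4 would remove before actually deleting them:

# Confirm the restarted kube-apiserver is running with the new flag value
$ ps -ef | grep '[k]ube-apiserver' | tr ' ' '\n' | grep 'audit-log-maxsize'

# Dry run: count the empty rotated audit logs before deleting them in step 4
$ find /var/vcap/sys/log/kube-apiserver -name 'audit-*.log' -size 0 | wc -l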
Once the above steps are completed, the blackbox job should recover shortly, since monit restarts it automatically. Run 'monit summary' to verify that all jobs are now running correctly.
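For example (standard monit usage; the grep filter is just a convenience for narrowing the output):

# Check overall job status; repeat until blackbox shows as running again
$ monit summary

# Or focus on the two jobs involved in this issue
$ monit summary | grep -E 'blackbox|kube-apiserver'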