Monit CLI: Reverse Execution Order and Pre-start Hook Blocking in "start all" Operations

Products

Operations Manager
VMware Tanzu Kubernetes Grid Integrated Edition

Issue/Introduction

Using `monit start all` on BOSH-managed nodes often triggers jobs in reverse-alphabetical order. High-level services (e.g., kube-apiserver) run pre_start hooks that block the single-threaded Monit daemon. This prevents foundational dependencies (e.g., bosh-dns) from starting until the first round times out, causing significant delays.

For example, when you run `monit start all` on a TKGI cluster master node with the job list below,

master/####:/var/vcap/monit/job# ls -l
total 36
-rw-r--r-- 1 root root   0 Feb 24 11:32 0000_bosh-dns-aliases.monitrc
-rw-r--r-- 1 root root 373 Feb 24 11:32 0001_loggr-system-metrics-agent.monitrc
-rw-r--r-- 1 root root 738 Feb 24 11:32 0002_bosh-dns.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0003_kubo-dns-aliases.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0004_telemetry-agent.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0005_deploy-antrea.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0006_deploy-proxy-agent.monitrc
-rw-r--r-- 1 root root 233 Feb 24 11:32 0007_proxy-server.monitrc
-rw-r--r-- 1 root root 256 Feb 24 11:32 0008_syslog_forwarder.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0009_cloud-provider.monitrc
-rw-r--r-- 1 root root 384 Feb 24 11:32 0010_vsphere-cloud-controller-manager.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0011_bbr-kube-apiserver.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0012_bbr-etcd.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0013_pks-master-aliases.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0014_smoke-tests.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0015_kubernetes-roles.monitrc
-rw-r--r-- 1 root root 268 Feb 24 11:32 0016_kube-scheduler.monitrc
-rw-r--r-- 1 root root 313 Feb 24 11:32 0017_kube-controller-manager.monitrc
-rw-r--r-- 1 root root 243 Feb 24 11:32 0018_kube-apiserver.monitrc
-rw-r--r-- 1 root root 193 Feb 24 11:32 0019_etcd.monitrc
-rw-r--r-- 1 root root   0 Feb 24 11:32 0020_bpm.monitrc

you would observe a significant delay before all jobs start successfully; in particular, the etcd and kube-### jobs all fail on their initial start attempts.

Environment

All products that manage jobs with the Monit daemon.

Cause

While Monit parses configuration files in the order they are found by the OS (usually alphabetical), the `monit start all` command effectively processes the internal service list in the reverse of the order in which the services were loaded or defined.

In the example above:

The list: Monit reads 0000_... through 0020_bpm, in that order.
The execution: When `monit start all` is triggered, Monit iterates through its internal stack. Because 0020_bpm was likely the last entry loaded into the configuration, it is often the first one Monit attempts to start.
The blocking behavior: Monit's control loop is generally single-threaded for these operations. If 0018_kube-apiserver has a pre-start script that blocks (for example, while waiting for etcd), Monit waits for that script to exit before moving on to the next job in its queue.
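The reverse traversal described above can be approximated from the job directory itself. The following is a minimal sketch, not Monit's actual implementation: the helper name `reverse_job_order` is hypothetical, and it assumes that file-name order matches the order in which Monit loaded the files, that each job is named after its file, and that empty `.monitrc` files define no services.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Monit or BOSH): print job names in
# reverse file-name order, as an approximation of the order in which
# `monit start all` is described to process them.
reverse_job_order() {
  local dir="${1:-/var/vcap/monit/job}"
  local f
  for f in "$dir"/*.monitrc; do
    # Skip empty files (assumed to define no services) and unmatched globs.
    [ -s "$f" ] || continue
    basename "$f" .monitrc
  done | LC_ALL=C sort -r | sed 's/^[0-9][0-9]*_//'   # reverse, then drop the numeric prefix
}
```

Running `reverse_job_order /var/vcap/monit/job` on the node prints one job name per line, so you can see which jobs are likely to be attempted before bosh-dns.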

The problem does not occur when jobs are started through `bosh start` or during an instance update or upgrade, because BOSH manages the sequence and starts the jobs in the expected order.

Resolution

On a TKGI cluster master node in particular, because of job dependencies and pre_start hooks, avoid using `monit start all`. Instead, start the jobs manually in the correct order.

The BOSH Agent: After a VM restart, the bosh-agent is responsible for starting Monit jobs in the correct order.
Troubleshooting: When starting jobs manually, it is best practice to start the core infrastructure first, as shown below, which avoids the blocking problem described above.
```
monit start bosh-dns
# Wait until `monit summary` reports bosh-dns as running
monit start etcd
# Wait until `monit summary` reports etcd as running
monit start all
```
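The "wait for success" steps can be scripted rather than checked by hand. The sketch below is one possible approach, not an official tool: the function names, the `MONIT` variable, the 300-second timeout, and the 5-second poll interval are all assumptions, and it assumes `monit summary` prints lines of the form `Process 'bosh-dns'    running`.

```shell
#!/usr/bin/env bash
# Sketch: start a Monit job, then poll `monit summary` until it is running.
# MONIT can be overridden (for example, to point at a different binary path).
MONIT="${MONIT:-monit}"

# wait_for_running JOB [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
wait_for_running() {
  local job="$1" timeout="${2:-300}" interval="${3:-5}" waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # Assumed summary line format: Process '<job>'   running
    if "$MONIT" summary 2>/dev/null |
        grep -Eq "^Process '${job}'[[:space:]]+running[[:space:]]*$"; then
      return 0
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Timed out after ${timeout}s waiting for ${job}" >&2
  return 1
}

# start_and_wait JOB [TIMEOUT_SECONDS] [INTERVAL_SECONDS]
start_and_wait() {
  "$MONIT" start "$1" && wait_for_running "$1" "${2:-300}" "${3:-5}"
}
```

With these functions loaded, the manual sequence becomes `start_and_wait bosh-dns && start_and_wait etcd && monit start all`, which stops at the first job that does not reach the running state within the timeout.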