On the Healthwatch "VM Health" dashboard, Diego cells are intermittently reported as unhealthy for very short periods. On an affected Diego cell, the monit logs show that the rpcbind and statd jobs failed their health checks and were restarted:
[UTC Jan 11 00:06:30] error : 'rpcbind' failed, cannot open a connection to INET[localhost:111] via TCP
[UTC Jan 11 00:06:30] info : 'rpcbind' trying to restart
[UTC Jan 11 00:06:30] info : 'rpcbind' stop: /var/vcap/jobs/nfsv3driver/bin/rpcbind_ctl
[UTC Jan 11 00:06:31] info : 'rpcbind' start: /var/vcap/jobs/nfsv3driver/bin/rpcbind_ctl
[UTC Jan 11 00:06:42] info : 'rpcbind' connection succeeded to INET[localhost:111] via TCP
[UTC Jan 11 04:28:08] error : 'statd' failed, cannot open a connection to INET[localhost:41793] via TCP
[UTC Jan 11 04:28:08] info : 'statd' trying to restart
[UTC Jan 11 04:28:08] info : 'statd' stop: /var/vcap/jobs/nfsv3driver/bin/statd_ctl
[UTC Jan 11 04:28:09] info : 'statd' start: /var/vcap/jobs/nfsv3driver/bin/statd_ctl
[UTC Jan 11 04:28:20] info : 'statd' connection succeeded to INET[localhost:41793] via TCP
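To judge how brief these interruptions are, the failure and recovery lines can be paired per job. The following is a minimal sketch, assuming the monit log lines look exactly like the excerpt above; the function name `outage_windows` and the regular expression are illustrative and not part of monit or BOSH.

```python
import re
from datetime import datetime

# Matches monit log lines in the shape shown in the excerpt above (an assumption).
LINE_RE = re.compile(
    r"\[UTC (?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2})\] "
    r"(?P<level>\w+) +: '(?P<job>[^']+)' (?P<msg>.*)"
)


def outage_windows(lines):
    """Return (job, seconds) pairs measured from 'failed' to 'connection succeeded'."""
    failed_at = {}
    windows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        # The log omits the year, so a placeholder year is prepended;
        # only the difference between two timestamps is used.
        ts = datetime.strptime("2000 " + m["ts"], "%Y %b %d %H:%M:%S")
        job, msg = m["job"], m["msg"]
        if msg.startswith("failed"):
            failed_at[job] = ts
        elif msg.startswith("connection succeeded") and job in failed_at:
            seconds = int((ts - failed_at.pop(job)).total_seconds())
            windows.append((job, seconds))
    return windows
```

Applied to the excerpt above, each job shows a gap of about 12 seconds between the failed check and the next successful connection.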
Tanzu Platform for Cloud Foundry
The statd and rpcbind jobs on a Diego cell are part of the NFS volume service. monit checks their health by opening a TCP connection to localhost:41793 (statd) and localhost:111 (rpcbind). However, resolution of the name "localhost" can be briefly disrupted while bosh-agent is updating /etc/hosts, which causes the health check to fail. As a result, monit restarts the two jobs even though they are up and running.
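The distinction between a name resolution failure and a genuinely unreachable port can be illustrated with a small probe. This is a hedged sketch and not how monit implements its check; the function name `tcp_check` and its return values are hypothetical.

```python
import socket


def tcp_check(host, port, timeout=2.0):
    """Return 'ok', 'resolve_failed', or 'connect_failed' for a TCP probe."""
    try:
        # Name resolution happens here; for "localhost" this typically consults /etc/hosts.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "resolve_failed"
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(addr)
                return "ok"
        except OSError:
            # Try the next resolved address, if any.
            continue
    return "connect_failed"
```

For example, comparing `tcp_check("localhost", 111)` with `tcp_check("127.0.0.1", 111)` on a cell would show whether a failure comes from resolving the name or from the rpcbind port itself. A `resolve_failed` result while the process is running matches the cause described above.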
The issue is fixed in TPCF 4.0.30, 6.0.10, and 10.0.0. Until you upgrade to one of the fixed releases, the restarts can be ignored because their impact is very limited.