Diego cells are intermittently reported unhealthy on the Healthwatch dashboard because statd and rpcbind are restarted

Article ID: 389818

Updated On:

Products

VMware Tanzu Application Service

Issue/Introduction

On the Healthwatch "VM Health" dashboard, Diego cells are intermittently reported as unhealthy for very short periods. On a Diego cell reported as unhealthy, the monit logs show that the rpcbind and statd jobs failed their health checks and were restarted:

[UTC Jan 11 00:06:30] error    : 'rpcbind' failed, cannot open a connection to INET[localhost:111] via TCP
[UTC Jan 11 00:06:30] info     : 'rpcbind' trying to restart
[UTC Jan 11 00:06:30] info     : 'rpcbind' stop: /var/vcap/jobs/nfsv3driver/bin/rpcbind_ctl
[UTC Jan 11 00:06:31] info     : 'rpcbind' start: /var/vcap/jobs/nfsv3driver/bin/rpcbind_ctl
[UTC Jan 11 00:06:42] info     : 'rpcbind' connection succeeded to INET[localhost:111] via TCP
[UTC Jan 11 04:28:08] error    : 'statd' failed, cannot open a connection to INET[localhost:41793] via TCP
[UTC Jan 11 04:28:08] info     : 'statd' trying to restart
[UTC Jan 11 04:28:08] info     : 'statd' stop: /var/vcap/jobs/nfsv3driver/bin/statd_ctl
[UTC Jan 11 04:28:09] info     : 'statd' start: /var/vcap/jobs/nfsv3driver/bin/statd_ctl
[UTC Jan 11 04:28:20] info     : 'statd' connection succeeded to INET[localhost:41793] via TCP
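
To see how short each restart window is, the failure and recovery entries can be paired per job. The following is a minimal sketch, assuming the monit log lines have exactly the format quoted above; the function names and the placeholder year are illustrative and not part of any Tanzu or monit tooling.

```python
import re
from datetime import datetime

# Matches lines such as:
# [UTC Jan 11 00:06:30] error    : 'rpcbind' failed, cannot open a connection ...
LINE_RE = re.compile(
    r"^\[UTC (?P<ts>\w{3} +\d{1,2} \d{2}:\d{2}:\d{2})\]\s+"
    r"(?P<level>\w+)\s*:\s+'(?P<job>[^']+)' (?P<msg>.*)$"
)

def parse_ts(text, year=2000):
    # The log omits the year; 'year' is an assumed placeholder value.
    return datetime.strptime(f"{year} {text}", "%Y %b %d %H:%M:%S")

def restart_windows(lines):
    """Return (job, failed_at, recovered_at, seconds) per failure/recovery pair."""
    pending = {}   # job name -> timestamp of the failure not yet recovered
    windows = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not in the expected format
        ts, job, msg = parse_ts(m["ts"]), m["job"], m["msg"]
        if m["level"] == "error" and msg.startswith("failed"):
            pending.setdefault(job, ts)
        elif msg.startswith("connection succeeded") and job in pending:
            start = pending.pop(job)
            windows.append((job, start, ts, (ts - start).total_seconds()))
    return windows
```

Applied to the excerpt above, this reports a 12-second interval between the failed check and the next successful check for both rpcbind and statd.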

Environment

Tanzu Platform for Cloud Foundry

  • 4.0.0 through 4.0.29
  • 6.0.0 through 6.0.9

Cause

The statd and rpcbind jobs on a Diego cell are part of the nfs-volume service. monit checks their health by opening a TCP connection to localhost:41793 (statd) and localhost:111 (rpcbind). Resolution of the name "localhost" can be briefly disrupted while bosh-agent updates /etc/hosts, which makes the health check fail. As a result, monit restarts the two jobs even though they are up and running.
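
The difference between "the name did not resolve" and "the service did not answer" can be illustrated with a small probe. This is only a sketch of the idea, not monit's actual check implementation; the `probe` function and its `resolve` parameter are hypothetical names introduced here for illustration.

```python
import socket

def probe(host, port, timeout=2.0, resolve=socket.getaddrinfo):
    """Classify a TCP check as 'ok', 'resolution_failed', or 'connect_failed'."""
    try:
        infos = resolve(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name could not be resolved; the service was never contacted.
        return "resolution_failed"
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(addr)
                return "ok"
        except OSError:
            continue  # try the next resolved address, if any
    return "connect_failed"

if __name__ == "__main__":
    # Ports taken from the monit log in this article.
    for name, port in (("rpcbind", 111), ("statd", 41793)):
        print(name, "by name:", probe("localhost", port),
              "| by IP:", probe("127.0.0.1", port))
```

If a probe by IP literal succeeds while the probe by name reports `resolution_failed`, that would point to name resolution rather than to the daemon itself, which matches the cause described above.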

Resolution

The issue is fixed in TPCF 4.0.30, 6.0.10, and 10.0.0. Until you upgrade to a fixed release, the intermittent unhealthy reports can be ignored because their impact is very limited.
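
As a quick aid for checking a foundation against the versions listed in this article, the comparison can be sketched as below. The helper names are hypothetical, and the sketch assumes a plain "major.minor.patch" version string; versions outside the listed 4.0.x and 6.0.x ranges are simply reported as not affected because the article does not cover them.

```python
# Affected ranges as listed in the Environment section of this article.
AFFECTED = [((4, 0, 0), (4, 0, 29)), ((6, 0, 0), (6, 0, 9))]

def parse_version(text):
    # Assumes "major.minor.patch"; build suffixes are not handled.
    return tuple(int(part) for part in text.strip().split("."))

def is_affected(version):
    """Return True if the version falls inside a listed affected range."""
    v = parse_version(version)
    return any(low <= v <= high for low, high in AFFECTED)

if __name__ == "__main__":
    for candidate in ("4.0.29", "4.0.30", "6.0.9", "6.0.10", "10.0.0"):
        print(candidate, "affected" if is_affected(candidate) else "not affected")
```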