VMs in a cluster enter a failing state, with several BOSH jobs failing.
The failing jobs report errors while resolving the master nodes' etcd FQDNs. The job logs in /var/vcap/sys/log/ contain entries such as:
SubChannel #101 grpc: addrConn.createTransport failed to connect to {Addr: "master-0.etcd.<>:2379", ServerName: "master-0.etcd.<>", }. Err: connection error: desc = "transport: Error while dialing: dial tcp: lookup master-0.etcd.<> on <DNS-IP>:53: no such host"
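To confirm that several jobs are failing on the same lookup, the log text can be scanned for this error pattern. The following is a minimal sketch, not part of any BOSH tooling: the function name `find_lookup_failures` and the sample host and DNS server address are hypothetical, and the regular expression assumes the error ends in the `lookup <host> on <server>:<port>: no such host` form shown above.

```python
import re

# Matches the resolver failure at the end of the gRPC error, for example:
#   lookup <host> on <server>:53: no such host
LOOKUP_FAILURE = re.compile(
    r"lookup (?P<host>\S+) on (?P<server>\S+):(?P<port>\d+): no such host"
)


def find_lookup_failures(log_text):
    """Return sorted, unique (host, dns_server) pairs for failed lookups in log_text."""
    return sorted(
        {(m.group("host"), m.group("server")) for m in LOOKUP_FAILURE.finditer(log_text)}
    )


# Hypothetical sample line; the host name and DNS server IP are placeholders.
sample = (
    'Err: connection error: desc = "transport: Error while dialing: dial tcp: '
    'lookup master-0.etcd.example.internal on 10.0.0.2:53: no such host"'
)
print(find_lookup_failures(sample))
```

If every failure points at the same DNS server address, that server is the next thing to check.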
nslookup shows connection errors to the bosh-dns server configured in /etc/resolv.conf.
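The same check can be sketched in Python when nslookup is not convenient. This is an illustration under stated assumptions: `parse_nameservers` and `can_resolve` are hypothetical helper names, the sample resolv.conf content is a placeholder, and `socket.getaddrinfo` goes through the system resolver (which may also consult /etc/hosts), so it approximates nslookup rather than reproducing it.

```python
import socket


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        # Only 'nameserver <address>' lines are of interest; comments and
        # other directives (search, options) are skipped.
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def can_resolve(fqdn):
    """Return True if the system resolver can resolve fqdn, False otherwise."""
    try:
        socket.getaddrinfo(fqdn, None)
        return True
    except socket.gaierror:
        return False


# Hypothetical resolv.conf content; the addresses are placeholders.
sample_resolv_conf = "# managed file\nnameserver 10.0.0.2\nsearch example.internal\n"
print(parse_nameservers(sample_resolv_conf))
```

On an affected VM, passing the contents of /etc/resolv.conf to `parse_nameservers` lists the configured DNS servers, and calling `can_resolve` with one of the failing etcd FQDNs from the logs would be expected to return False while the problem persists.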
This is likely caused by expired bosh-dns leaf certificates in the cluster.
Follow the KB article "Rotate bosh-dns leaf certificates using maestro" to rotate the expired bosh-dns certificates.