After upgrading TAS from 6.0.20 to 10.2.5 and the stemcell from 1.943 to 1.954, operators may see the cloud_controller instance stuck in a starting state:
cloud_controller/<GUID> starting AZ1 <IP-ADDR> cloud-controller_<CF-Deployment> ... bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.954
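The stuck instance can be confirmed from the BOSH director. A minimal sketch, assuming the BOSH CLI is targeted at the environment and the deployment name is known (names in angle brackets are placeholders):

# List instances and per-process state for the cf deployment
bosh -e <ENV-ALIAS> -d <CF-DEPLOYMENT> instances --ps | grep cloud_controller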
Tanzu Application Service 10.2.5 (*not exclusive)
Dell EMC Elastic Cloud Storage (*not exclusive)
The cloud controller references *ECS via DNS hostnames, not IP addresses
TAS is deployed with BOSH DNS enabled
DNS is not responding to queries.
Check the health of the cloud_controller via monit summary:
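A minimal sketch of that check, assuming SSH access to the instance through the BOSH CLI (the GUID and names are placeholders):

# SSH onto the cloud_controller instance
bosh -e <ENV-ALIAS> -d <CF-DEPLOYMENT> ssh cloud_controller/<GUID>

# On the instance, check per-process health (monit requires root)
sudo /var/vcap/bosh/bin/monit summary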
If the cloud_controller VM is healthy, the issue could be with the DNS server(s). Check the cloud_controller VM's logs (cloud_controller_ng and bosh-dns) for DNS failure errors such as:
[ForwardHandler] <…> ERROR - error recursing for <DOMAIN>.com. to "<IP_ADDR>:53": read udp <IP_ADDR>:37494-><IP_ADDR>:53: i/o timeout
[FailoverRecursor] <...> ERROR - write error response to client after retry count reached [0/0] with rcode=2 - read udp <IP_ADDR>:37494-><IP_ADDR>:53: i/o timeout
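To locate these entries and test the upstream recursor named in them, something like the following can be run on the cloud_controller VM. This is a sketch that assumes the default BOSH log layout under /var/vcap/sys/log/; the upstream IP and ECS hostname are placeholders, and nslookup can be substituted if dig is not installed:

# Search recent logs for recursion failures
sudo grep -R "error recursing" /var/vcap/sys/log/ | tail

# Query the upstream recursor from the error directly, bypassing BOSH DNS
dig @<UPSTREAM-DNS-IP> <ECS-ENDPOINT-HOSTNAME>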
Explanation
The Cloud Controller must verify domains and routes reliably at startup and runtime.
DNS resolution failures can block key operations: connecting to the UAA service and the Cloud Controller database, resolving internal service addresses used by dependent Cloud Controller jobs, and resolving routing domains.
When DNS lookups fail repeatedly, the Cloud Controller keeps retrying without ever completing its initialization, leaving the instance in a perpetual "starting" state.
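One way to distinguish a failing upstream recursor from a broader BOSH DNS problem is to compare an internal name (answered locally by BOSH DNS) with an external name (forwarded to the upstream recursors). A sketch, assuming a standard cf-deployment where BOSH DNS listens on 169.254.0.2 and serves the uaa.service.cf.internal alias; the ECS hostname is a placeholder:

# Internal alias: answered by BOSH DNS itself, should succeed even when upstream recursors are down
dig @169.254.0.2 uaa.service.cf.internal

# External hostname: forwarded to the upstream recursors, times out if they are unreachable
dig @169.254.0.2 <ECS-ENDPOINT-HOSTNAME>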
DNS changes on the *ECS side that may impact DNS queries from the cloud controller include the following (a verification sketch follows this list):
- Namespace renamed
- Endpoint domain changed
- Migration from one *ECS system to another
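If any of these changes occurred, confirm that the endpoint hostname the Cloud Controller is actually configured with still resolves. A sketch, assuming the default cloud_controller_ng job layout (the config path and key names can vary between TAS versions; the hostname is a placeholder):

# Find the blobstore endpoint(s) configured for the Cloud Controller
sudo grep -i "endpoint" /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml

# Verify the configured hostname resolves from this VM
dig @169.254.0.2 <CONFIGURED-ENDPOINT-HOSTNAME>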
Operators should check with their infrastructure team for any recent changes to DNS.
*This issue may present itself in other versions of TAS and with other storage systems, and it does not require an upgrade or other platform event to be exposed. The scenario described here is one specific series of events that may or may not reflect your own environment.