After upgrading TAS from 6.0.20 to 10.2.5 and the stemcell from 1.943 to 1.954, operators may see the cloud_controller instance stuck in a starting state:
cloud_controller/<GUID> starting AZ1 <IP-ADDR> cloud-controller_<CF-Deployment> ... bosh-vsphere-esxi-ubuntu-jammy-go_agent/1.954
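The stuck instance can be confirmed from the BOSH director. A minimal sketch, assuming the BOSH CLI is targeted at the environment and the deployment name is known (names in angle brackets are placeholders):

# List instances and per-process state for the cf deployment
bosh -e <ENV-ALIAS> -d <CF-DEPLOYMENT> instances --ps | grep cloud_controller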
Tanzu Application Service 10.2.5 (*not exclusive)
Dell EMC Elastic Cloud Storage (*not exclusive)
The cloud controller references *ECS via DNS hostnames, not IP addresses
TAS is deployed with BOSH DNS enabled
DNS is not responding to queries.
Check the health of the cloud_controller via monit summary:
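A minimal sketch of that check, assuming SSH access to the instance through the BOSH CLI (the GUID and names are placeholders):

# SSH onto the cloud_controller instance
bosh -e <ENV-ALIAS> -d <CF-DEPLOYMENT> ssh cloud_controller/<GUID>

# On the instance, check per-process health (monit requires root)
sudo /var/vcap/bosh/bin/monit summary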
If the cloud_controller VM is healthy, the issue could be with the DNS server(s). Check the cloud_controller VM's logs (cloud_controller_ng and bosh-dns) for DNS failure errors such as:
[ForwardHandler] <…> ERROR - error recursing for <DOMAIN>.com. to "<IP_ADDR>:53": read udp <IP_ADDR>:37494-><IP_ADDR>:53: i/o timeout
[FailoverRecursor] <...> ERROR - write error response to client after retry count reached [0/0] with rcode=2 - read udp <IP_ADDR>:37494-><IP_ADDR>:53: i/o timeout
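To locate these entries and test the upstream recursor named in them, something like the following can be run on the cloud_controller VM. This is a sketch that assumes the default BOSH log layout under /var/vcap/sys/log/; the upstream IP and ECS hostname are placeholders, and nslookup can be substituted if dig is not installed:

# Search recent logs for recursion failures
sudo grep -R "error recursing" /var/vcap/sys/log/ | tail

# Query the upstream recursor from the error directly, bypassing BOSH DNS
dig @<UPSTREAM-DNS-IP> <ECS-ENDPOINT-HOSTNAME>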
Explanation
The Cloud Controller must verify domains and routes reliably at startup and runtime.
DNS resolution failures can block key operations: connecting to the UAA service and the Cloud Controller database, resolving internal service addresses used by dependent Cloud Controller jobs, and resolving routing domains.
When DNS lookups fail repeatedly, the Cloud Controller keeps retrying without ever completing its initialization, leaving the instance in a perpetual "starting" state.
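One way to distinguish a failing upstream recursor from a broader BOSH DNS problem is to compare an internal name (answered locally by BOSH DNS) with an external name (forwarded to the upstream recursors). A sketch, assuming a standard cf-deployment where BOSH DNS listens on 169.254.0.2 and serves the uaa.service.cf.internal alias; the ECS hostname is a placeholder:

# Internal alias: answered by BOSH DNS itself, should succeed even when upstream recursors are down
dig @169.254.0.2 uaa.service.cf.internal

# External hostname: forwarded to the upstream recursors, times out if they are unreachable
dig @169.254.0.2 <ECS-ENDPOINT-HOSTNAME>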
DNS changes on the *ECS side that may impact DNS queries from the cloud controller include the following (a verification sketch follows this list):
- Namespace renamed
- Endpoint domain changed
- Migration from one *ECS system to another
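If any of these changes occurred, confirm that the endpoint hostname the Cloud Controller is actually configured with still resolves. A sketch, assuming the default cloud_controller_ng job layout (the config path and key names can vary between TAS versions; the hostname is a placeholder):

# Find the blobstore endpoint(s) configured for the Cloud Controller
sudo grep -i "endpoint" /var/vcap/jobs/cloud_controller_ng/config/cloud_controller_ng.yml

# Verify the configured hostname resolves from this VM
dig @169.254.0.2 <CONFIGURED-ENDPOINT-HOSTNAME>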
Operators should check with their infrastructure team for any recent changes to DNS.
*This issue may present itself in other versions of TAS and with other storage systems, and it does not require an upgrade or other platform event to be exposed. The scenario described here is one specific series of events that may or may not reflect your own environment.