PostgreSQL provisioning stuck in InProgress state due to CoreDNS i/o timeout

Products

VMware Data Services Manager for VCF

Issue/Introduction

When attempting to provision a PostgreSQL database through automation, the deployment may remain indefinitely stuck in an "InProgress" state. During this time, the backing Kubernetes cluster fails to complete its package bootstrap process. Administrators will observe that the standard-packages repository fails to fetch the necessary bundles, and the AKO package installation fails. This is typically a symptom of a network egress failure where DNS packets successfully leave the Antrea overlay network, but response packets are dropped, resulting in egress timeouts to corporate DNS servers and preventing the cluster from retrieving package metadata.

Database deployment remains stuck in an InProgress state.
The following errors are observed in the cluster logs:
- CoreDNS: read udp #### -> ####:53: i/o timeout
- PackageRepository: Get "https://projects.packages.broadcom.com/v2/": dial tcp: lookup projects.packages.broadcom.com on ####:53: server misbehaving
- AKO: Expected to find at least one version, but did not
Note: The "Package not found" or missing version error is a downstream symptom of the CoreDNS timeout, not a package scoping issue.

Environment

VMware Data Services Manager for VCF 9.1

Cause

This issue is caused by upstream network routing failures or stateful tracking limitations preventing DNS response packets from returning to the CoreDNS pod:

Stateful Tracking Failure: In greenfield deployments, the NSX Gateway Firewall is often disabled by default. Without it, the gateway translates outbound DNS requests but fails to track the session state, resulting in dropped return packets from the corporate DNS.
Missing Return Routing: If traffic exits the VPC without being SNAT-ed to a Node IP, the corporate router may lack a return route for the Pod CIDR pointing back to the VPC Gateway.
Port 53 Blockage: Upstream physical firewalls may be blocking UDP/TCP port 53 egress for the newly assigned Pod CIDR range.
Network Overlap: The default Pod CIDR may conflict with corporate infrastructure, preventing successful DNS resolution entirely.

Resolution

To resolve this issue, apply the following steps to correct the network routing, enable stateful tracking, and verify DNS egress:

Enable NSX Edge Gateway Firewall: Navigate to NSX Manager > Networking > Tier-1 Gateways (or the respective VPC Gateway) and verify the Firewall status. If disabled, enable the Firewall to restore stateful inspection for UDP/DNS traffic.
Update the Workload CIDR (If Overlapping): Identify a unique, non-overlapping CIDR block (e.g., 172.20.0.0/21) and update the workloadNetworkCidr parameter in your Data Services Manager configuration. Delete the stuck PostgreSQL instance and deploy a new one to apply the updated CIDR.
Validate Return Routing: Confirm with your network team that a return route exists on the corporate router for the Pod CIDR (e.g., 172.20.0.0/21), pointing back to the VPC Gateway.
Confirm Port 53 Egress: Ensure that UDP/TCP port 53 is explicitly allowed for the Pod CIDR range at all upstream physical firewalls.
Test Upstream DNS (Isolation Test): To isolate CoreDNS, deploy a temporary pod in the vmware-system-tkg namespace to test public resolver reachability. If this succeeds while your corporate DNS fails, the issue is strictly the network path between the VPC and the corporate DNS server.
- kubectl run -i --tty --rm debug --image=busybox --namespace=vmware-system-tkg -- nslookup projects.packages.broadcom.com 8.8.8.8

Additional Information

Review Connectivity to Supervisor VIP Fails on NSX with VCF 9.0.1 (KB 424466) for details on disabled Gateway Firewalls causing connection timeouts.
Review Pod to External IP traffic flow when there is no Antrea Egress (KB 404589) for expected SNAT and hop behavior.
Review Creating a postgres database using DSM it is stuck at InProgress state (KB 424301).
Review DSM fails to deploy Postgres/MySQL workloads in HA topologies (KB 428629).