PCF Windows - DNS Resolvers Issues on Windows Cells

Products

VMware Tanzu Application Service for VMs

Issue/Introduction

Symptoms:

Windows Diego cells can intermittently become unresponsive or enter a failing state, so the state of the Windows Diego cells is inconsistent.

This issue can manifest intermittently and on only a subset of the Windows cells in a Runtime for Windows tile deployment, but it has the potential to disrupt every Windows cell in the deployment. When that happens, applications can no longer route traffic during and after a PCF upgrade (or other causal event).

Environment

Cause

Applications hosted on Windows cells become unresponsive and do not recover during PCF upgrades, or other events that cause a loss of network connectivity, because 127.0.0.1 (Consul) gets dropped from the DNS resolvers list on the Windows hosts. When Cloud Foundry jobs cannot contact the Consul DNS, they cannot resolve cf.internal.* hostnames.

When the BOSH Director becomes unavailable or loses its connection to Windows VMs (stemcell 1200.6 or earlier), such as during a PCF upgrade, the BOSH Agent on BOSH-deployed Windows VMs (including Windows Diego cells) restarts. This is expected behavior, and the restarts continue for as long as the agent cannot contact a Director. To minimize CPU load on the cell, the BOSH Agent also backs off its restart timing exponentially, up to a maximum interval of 5 minutes between restarts.
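As a rough illustration of a capped exponential backoff of this kind, the following Python sketch generates restart intervals that double until they reach the 5-minute ceiling. Only the 5-minute maximum comes from this article; the 5-second starting interval and the doubling factor are assumptions chosen for the example, and this is not the BOSH Agent's actual code.

```python
from itertools import islice


def restart_intervals(base_seconds=5.0, factor=2.0, cap_seconds=300.0):
    """Yield successive wait times that grow by `factor` until capped.

    base_seconds and factor are hypothetical values for illustration;
    cap_seconds reflects the 5-minute maximum described in the article.
    """
    interval = base_seconds
    while True:
        yield min(interval, cap_seconds)
        # Grow the interval, but never beyond the cap.
        interval = min(interval * factor, cap_seconds)


if __name__ == "__main__":
    # Show the first few waits; the sequence settles at 300 seconds.
    print(list(islice(restart_intervals(), 9)))
```

With these assumed values the wait reaches the 300-second ceiling on the seventh restart and stays there, which matches the described goal of limiting CPU load during a long Director outage.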

During this repeated-restart scenario, the BOSH Agent erroneously overwrites all DNS resolver entries in the OS with the list of cloud config resolvers, which removes the necessary 127.0.0.1 entry that Consul inserted during the Consul job’s pre-start process. The BOSH Agent does not run this pre-start process again when it restarts, but the core issue lies in how the BOSH Agent overwrites the DNS resolver entries.
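The difference between overwriting the resolver list and preserving the local entry can be sketched in Python as follows. This is an illustrative model only, not the BOSH Agent's implementation and not necessarily how the fixed stemcell behaves; the function names and the 10.0.0.x addresses are hypothetical.

```python
LOCAL_RESOLVER = "127.0.0.1"


def overwrite_resolvers(current, cloud_config):
    """Model of the problematic behavior: the OS list is replaced wholesale,
    so any entry not in the cloud config (such as 127.0.0.1) is lost."""
    return list(cloud_config)


def preserve_local_resolver(current, cloud_config):
    """Hypothetical alternative: apply the cloud config resolvers but keep
    127.0.0.1 in first position if it was already configured."""
    merged = [r for r in cloud_config if r != LOCAL_RESOLVER]
    if LOCAL_RESOLVER in current:
        merged.insert(0, LOCAL_RESOLVER)
    return merged


if __name__ == "__main__":
    current = ["127.0.0.1", "10.0.0.2"]    # state after the Consul pre-start
    cloud_config = ["10.0.0.2", "10.0.0.3"]  # example cloud config resolvers
    print(overwrite_resolvers(current, cloud_config))
    print(preserve_local_resolver(current, cloud_config))
```

In the first case the resulting list no longer contains 127.0.0.1, which is the condition that prevents cf.internal.* hostnames from resolving.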

Because any loss of connectivity can trigger this issue, it can be caused not only by PCF upgrades but also by network events (such as router replacements), Director failures, increased ESX load, and possibly other events.

Resolution

The permanent and suggested fix is to upgrade to Runtime for Windows stemcell 1200.7 or above. For Azure/GCP/AWS, see Stemcell 1200.7 for PCF (Windows).

If you're unable to do that at this time, you can perform the following steps as a temporary workaround.

For stemcell versions 1200.6 and below, the IP address 127.0.0.1 can be added manually to the DNS resolvers to fix the issue. However, the next BOSH Agent restart will remove this address again, causing the same issue described above.

1. Connect to each Windows cell either via your IaaS virtual console or via RDP.

2. Edit the DNS configuration by navigating to Control Panel > Network and Internet > Network and Sharing Center.

3. Under network connections, choose Ethernet.

4. Choose Properties > TCP/IPv4 > Properties.

Note that 127.0.0.1 does not appear in the DNS list under the section “Use the following DNS server addresses”. This is the cause of the issue.

5. Click on Advanced.

6. In the Advanced TCP/IP Settings pane, click the DNS tab, then click Add.

7. Type 127.0.0.1 in the dialog box and click Add.

8. Use the up/down arrows to move 127.0.0.1 into the first position.

9. Click OK to close Advanced TCP/IP Settings.

10. Click OK to close TCP/IPv4 Properties.

11. Close Ethernet Properties to persist changes.

The fix has now been applied. The BOSH jobs should eventually become healthy, after which apps can serve traffic and new apps can be pushed to the cells with cf push.
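If many cells are affected, the manual steps above could in principle be scripted. The following Python sketch builds the `netsh interface ipv4 add dnsservers` command that inserts 127.0.0.1 at index 1 and, by default, only prints it. It assumes the interface is named Ethernet (as in step 3) with no spaces in the name, that Python is available on the cell, and that the script runs from an elevated prompt; none of this is part of the original procedure, so verify it on a single cell first.

```python
import platform
import subprocess


def build_add_dns_command(interface="Ethernet", address="127.0.0.1", index=1):
    """Return the netsh argument list that adds `address` to the DNS
    server list of `interface` at the given position (1 = first)."""
    return [
        "netsh", "interface", "ipv4", "add", "dnsservers",
        f"name={interface}", f"address={address}", f"index={index}",
    ]


def apply_workaround(interface="Ethernet", dry_run=True):
    """Build the command and run it only on Windows when dry_run is False.

    Returns the command so callers can log or inspect it.
    """
    cmd = build_add_dns_command(interface)
    if dry_run or platform.system() != "Windows":
        return cmd
    # Requires administrator rights; raises CalledProcessError on failure.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: print the command without changing any settings.
    print(" ".join(apply_workaround(dry_run=True)))
```

As with the manual procedure, this is only a temporary measure on stemcell 1200.6 and below, because a later BOSH Agent restart will remove the entry again.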