Windows Diego cells can intermittently become unresponsive or enter a failing state, so the state of the Windows Diego cells in a deployment is inconsistent.
This issue can manifest intermittently and on subsets of Windows cells in Runtime for Windows tile deployments, but it has the potential to disrupt all the Windows cells in a deployment, meaning applications can no longer route traffic during and after a PCF upgrade (or another triggering event).
Applications hosted on Windows cells become unresponsive and do not recover during PCF upgrades or other loss of network connectivity events because 127.0.0.1 (Consul) gets dropped from the DNS resolvers list on the Windows hosts. When Cloud Foundry jobs cannot contact the Consul DNS, they cannot resolve cf.internal.* hostnames.
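As a minimal sketch of how this symptom could be checked from a cell, the following helper reports whether a hostname resolves. The function name and the example hostname are illustrative assumptions, not part of the product; substitute a cf.internal hostname that exists in your deployment.

```python
import socket


def can_resolve(hostname, resolver=socket.gethostbyname):
    """Return True if the hostname resolves, False if the lookup fails.

    `resolver` is injectable so the check can be exercised without a network.
    """
    try:
        resolver(hostname)
        return True
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


if __name__ == "__main__":
    # Hypothetical hostname, used here only for illustration.
    print(can_resolve("bbs.service.cf.internal"))
```

If this prints False on an affected cell while ordinary external hostnames still resolve, that is consistent with the Consul resolver (127.0.0.1) being missing from the DNS list.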
When the BOSH Director becomes unavailable or loses its connection with Windows VMs (stemcell 1200.6 or earlier), such as during a PCF upgrade, the BOSH Agent on BOSH-deployed Windows VMs (including Windows Diego cells) restarts. This is expected behavior, and it continues for as long as the agent cannot contact a Director. The BOSH Agent also exponentially backs off its restart timing, to a maximum interval of 5 minutes between restarts, to minimize CPU load on the cell.
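The restart timing described above can be illustrated with a short sketch of a capped exponential backoff. Only the 5-minute (300-second) cap comes from this article; the 1-second initial delay and the doubling factor are assumptions chosen for illustration and may not match the agent's actual values.

```python
def restart_delays(initial_seconds=1.0, factor=2.0, cap_seconds=300.0, attempts=12):
    """Return the wait before each restart attempt, in seconds.

    initial_seconds and factor are assumed values; cap_seconds reflects
    the 5-minute maximum interval between restarts.
    """
    delays = []
    delay = initial_seconds
    for _ in range(attempts):
        delays.append(min(delay, cap_seconds))  # never wait longer than the cap
        delay *= factor
    return delays


if __name__ == "__main__":
    print(restart_delays())
```

Under these assumed values the interval reaches the cap by the tenth restart and stays there, so a long Director outage produces a steady stream of restarts roughly every 5 minutes.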
During this multiple-restart scenario, the BOSH Agent erroneously overwrote all DNS resolver entries in the OS with the list of cloud config resolvers, thus removing the necessary 127.0.0.1 value that Consul inserts during the Consul job’s pre-start process. The BOSH Agent does not run this pre-start process again when it restarts, but the core issue lies in how the BOSH Agent overwrites the DNS resolver entries.
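The effect of the overwrite can be modeled in a few lines. This is a simplified illustration, not the BOSH Agent's actual code; the function names and the 10.0.0.x resolver addresses are made up for the example.

```python
CONSUL_RESOLVER = "127.0.0.1"


def overwrite_with_cloud_config(current_resolvers, cloud_config_resolvers):
    """Model of the faulty behavior: the OS list is replaced wholesale,
    so anything not in the cloud config (such as 127.0.0.1) is lost."""
    return list(cloud_config_resolvers)


def consul_resolver_present(resolvers):
    """Return True if the Consul resolver is still in the list."""
    return CONSUL_RESOLVER in resolvers


# State after the Consul pre-start process has inserted 127.0.0.1.
after_prestart = ["127.0.0.1", "10.0.0.2", "10.0.0.3"]

# State after an agent restart rewrites the list from cloud config.
after_agent_restart = overwrite_with_cloud_config(after_prestart, ["10.0.0.2", "10.0.0.3"])

if __name__ == "__main__":
    print(consul_resolver_present(after_prestart))
    print(consul_resolver_present(after_agent_restart))
```

Because pre-start is not re-run after the agent restarts, nothing in this model puts 127.0.0.1 back once it has been dropped.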
Because any loss of connectivity can trigger this issue, it is not limited to PCF upgrades: network events (such as router replacements), Director failures, increased ESX load, and possibly other events can also cause it.
The permanent and suggested fix is to upgrade to Runtime for Windows stemcell 1200.7 or above. For Azure/GCP/AWS, see Stemcell 1200.7 for PCF (Windows).
If you're unable to do that at this time, you can perform the following steps as a temporary workaround.
For stemcell versions 1200.6 and below, the IP address 127.0.0.1 can be added manually to the DNS resolvers to fix the issue. However, a subsequent BOSH Agent restart will remove this IP again, causing the same issue described above.
Note that 127.0.0.1 doesn’t appear in the DNS list under the section “Use the following DNS server addresses”. This is the cause of the issue.
Enter 127.0.0.1 in the dialog box and click Add.
Move 127.0.0.1 into the first position.
The fix has been applied. The BOSH jobs should eventually become healthy, after which apps can serve traffic and new apps can be cf pushed to the cells.
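The intended end state of the manual workaround can be expressed as a small list transformation: 127.0.0.1 is present exactly once and sits in the first position, with the other resolvers kept in their original order. This is a sketch of the desired result only. It does not change any Windows settings, and the function name and sample addresses are hypothetical.

```python
def ensure_consul_first(resolvers, consul_ip="127.0.0.1"):
    """Return a new resolver list with consul_ip in the first position.

    Any existing occurrence of consul_ip is removed first, so the result
    contains it exactly once and the call is safe to repeat.
    """
    others = [ip for ip in resolvers if ip != consul_ip]
    return [consul_ip] + others


if __name__ == "__main__":
    print(ensure_consul_first(["10.0.0.2", "10.0.0.3"]))
```

Comparing the resolver list on a cell against this expected ordering is one way to confirm whether the workaround is still in place after a later agent restart.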