The DNS Forwarder service fails to start when the cache size is set to 0

Article ID: 415632

Updated On:

Products

VMware NSX

Issue/Introduction

  • The NSX Docker daemon fails to restart the DNS Forwarder.
    • The DNS Forwarder service remains in an ERROR state, and a DNS Forwarder DOWN alarm is raised in the NSX UI.
    • The DNS Forwarder datapath becomes non-functional, and subsequent updates do not recover the service.
    • On the edge, `docker ps -a | grep service_dns` shows the DNS Forwarder backend datapath container in "Exited" status.
  • Customer impact
    • DNS queries sent to the DNS Forwarder service fail.
    • The DNS Forwarder service status remains in the ERROR state, and the DOWN alarm persists.

  • Logs similar to the following may be present:
    • NSX Manager /var/log/proton/nsxapi.log
      • 2025-09-22T09:13:27.084Z  INFO providerTaskExecutor-1-7 DNSForwarderProviderNsxT 77770 POLICY [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] set tenancy context in updateDnsForwarder() with policyDnsForwarder DnsForwarder [listenerIp=10.0.0.10, logLevel=INFO, cacheSize=0, defaultForwarderZonePath=/infra/dns-forwarder-zones/#######-####-####-####-############, conditionalForwarderZonePath=[], enabled=true, getForwardRelationShips()=[RelationshipInfo{targetPath=/infra/dns-forwarder-zones/#######-####-####-####-############, relationshipType=DEFAULT_DNS_FORWARDER_ZONE_RELATIONSHIP}]][policyPath=/infra/tier-0s/Provider-LR/dns-forwarder, markedForDelete=false] for dnsForwarderModel DnsForwarder [logicalRouter=LogicalRouter/#######-####-####-####-############, srClusterId=null, cacheSize=0, listenerIp=10.0.0.10, defaultZone=DnsForwarderZone [sourceIp=null, domainNames=[], upstreamServers=[10.0.0.11]], conditionalZones=null, logLevel=INFO, enabled=true, msgTimestamp=0, serviceGroupId=null, isStandbySite=false]
    • NSX Edge /var/log/dns/prestart.log
      • 2025-04-28 10:46:28,353 14 dns.dns_utils ERROR Failed to run cmd /opt/vmware/nsx-edge/bin/dns/dnsconf_gen.py #######-####-####-####-############ with error Traceback (most recent call last):
        2025-04-28 10:46:28,355 14 dns.dns_fdr_prestart ERROR Failed to generate dnsmasq/iptable config file with cmd: /opt/vmware/nsx-edge/bin/dns/dnsconf_gen.py #######-####-####-####-############
        2025-09-22 03:04:26,528 13 dns.dns_utils ERROR Failed to run cmd /opt/vmware/nsx-edge/bin/dns/dnsconf_gen.py #######-####-####-####-############ with error Traceback (most recent call last):
        2025-09-22 03:06:23,224 13 dns.dns_utils ERROR Failed to run cmd /opt/vmware/nsx-edge/bin/dns/dnsconf_gen.py #######-####-####-####-############ with error Traceback (most recent call last):
        KeyError: 'cache_size'
  • The DNS Forwarder may also be found down after an upgrade.
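
The log signatures listed above can be checked with a short script. The following is a minimal sketch, not a supported tool: the function names are hypothetical, and the match strings are taken only from the excerpts in this article, so other NSX versions may log different text.

```python
import re

# Signatures copied from the log excerpts in this article.
CACHE_SIZE_ZERO = re.compile(r"\bcacheSize=0\b")
KEY_ERROR = "KeyError: 'cache_size'"
GEN_FAILURE = "Failed to run cmd /opt/vmware/nsx-edge/bin/dns/dnsconf_gen.py"


def scan_manager_log(lines):
    """Return nsxapi.log lines that mention a DnsForwarder with cacheSize=0."""
    return [ln for ln in lines
            if "DnsForwarder" in ln and CACHE_SIZE_ZERO.search(ln)]


def scan_edge_log(lines):
    """Return prestart.log lines that show the config generation failure."""
    return [ln for ln in lines if KEY_ERROR in ln or GEN_FAILURE in ln]


# Shortened sample lines based on the excerpts above.
manager_sample = [
    "INFO DNSForwarderProviderNsxT DnsForwarder [listenerIp=10.0.0.10, cacheSize=0, enabled=true]",
    "INFO MpDnsForwarderServiceImpl DnsForwarder [listenerIp=10.0.0.11, cacheSize=1024, enabled=true]",
]
edge_sample = [
    "dns.dns_utils ERROR Failed to run cmd /opt/vmware/nsx-edge/bin/dns/dnsconf_gen.py <id> with error Traceback",
    "KeyError: 'cache_size'",
    "dns.dns_fdr_prestart INFO unrelated line",
]

manager_hits = scan_manager_log(manager_sample)
edge_hits = scan_edge_log(edge_sample)
```

To scan real files, pass an open file object for `/var/log/proton/nsxapi.log` (NSX Manager) or `/var/log/dns/prestart.log` (edge) instead of the sample lists.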

Environment

VMware NSX (version 9.0 and earlier)

 

Cause

  • The Policy service received a DNS Forwarder CREATE API request that specified a cache size of 0. This value is invalid, and the API validation logic should have rejected it.
  • The DNS Forwarder backend does not expect to receive an invalid cache size from Policy, so the datapath crashes.
  • The issue can also be triggered by changing the cache size of an existing DNS Forwarder to 0.
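
The missing validation described above can be illustrated with a small check. This is a hypothetical sketch, not NSX code: the function name and the error type are assumptions, and the article does not state the valid range, so the sketch rejects only non-integer values and values below 1 (treating negative values as invalid is also an assumption).

```python
def validate_cache_size(cache_size):
    """Return cache_size if it is usable, otherwise raise ValueError.

    Hypothetical check: the article only states that 0 is invalid.
    """
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(cache_size, bool) or not isinstance(cache_size, int):
        raise ValueError("cache size must be an integer")
    if cache_size < 1:
        raise ValueError("cache size must be non-zero")
    return cache_size


accepted = validate_cache_size(1024)

try:
    validate_cache_size(0)
    zero_rejected = False
except ValueError:
    zero_rejected = True
```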

Resolution

Delete the existing DNS Forwarder service and recreate it with a non-zero cache size.

  • Once the cache size is non-zero, logs should resemble:
    • nsxapi.log:
      • 2025-09-23T21:44:52.060Z  INFO providerTaskExecutor-1-20 MpDnsForwarderServiceImpl 77770 DNS [nsx@6876 comp="nsx-manager" level="INFO" subcomp="manager"] Updated old DnsForwarder DnsForwarder [logicalRouter=LogicalRouter/#######-####-####-####-############, srClusterId=LogicalRouter/#######-####-####-####-############, cacheSize=0, listenerIp=10.0.0.50, defaultZone=DnsForwarderZone [sourceIp=null, upstreamServers=[10.0.0.10]], conditionalZones=null, logLevel=INFO, enabled=true, msgTimestamp=1234567890123, serviceGroupId=null, isStandbySite=false] with new DnsForwarder DnsForwarder [logicalRouter=LogicalRouter/#######-####-####-####-############, srClusterId=LogicalSrClusterConfig/#######-####-####-####-############, cacheSize=1024, listenerIp=10.0.0.11, defaultZone=DnsForwarderZone [sourceIp=null, domainNames=[], upstreamServers=[10.0.0.10]], conditionalZones=null, logLevel=INFO, enabled=false, msgTimestamp=1234567890123, serviceGroupId=null, isStandbySite=false]
  • Change the cache size to any value other than 0 (e.g. 1024).
    • Reboot the edge.
  • This issue is resolved in VCF versions later than 9.0.
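
The cache size change can be sketched as a Policy API request. This is a minimal sketch under stated assumptions: it assumes the DNS Forwarder is reachable under `/policy/api/v1` at the policy path shown in the nsxapi.log excerpt (`/infra/tier-0s/Provider-LR/dns-forwarder`) and that the field is named `cache_size`, as in the edge `KeyError`. Verify both against the NSX API reference for your version. The manager host name and the helper function are hypothetical, and the function only builds the request without sending it.

```python
import json


def build_cache_size_patch(manager_host, forwarder_policy_path, cache_size=1024):
    """Build (method, url, body) for a cache size update; nothing is sent.

    Assumption: the endpoint and the cache_size field name match the
    NSX Policy API for the version in use.
    """
    if cache_size == 0:
        raise ValueError("cache_size must be non-zero")
    url = f"https://{manager_host}/policy/api/v1{forwarder_policy_path}"
    body = json.dumps({"cache_size": cache_size})
    return "PATCH", url, body


# Hypothetical manager host; the policy path is taken from the log excerpt.
method, url, body = build_cache_size_patch(
    "nsx-manager.example.com",
    "/infra/tier-0s/Provider-LR/dns-forwarder",
)
```

Because the article states that later updates do not recover a forwarder that is already in the ERROR state, a request like this does not replace the delete-and-recreate step or the edge reboot.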

Additional Information

NOTE: Perform the resolution steps during a maintenance window.