Aria Operations for Logs Upgrade Fails on Worker Node due to DNS Resolution Timeout

Article ID: 432387

Products

VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

  • Upgrade hangs or fails on worker nodes.

  • The cassandra.log on the failed worker node shows errors indicating that the node is unable to gossip with the other nodes in the cluster.

    ERROR [main] YYYY-MM-DDT08:18:44,164 CassandraDaemon.java:900 - Exception encountered during startup
    java.lang.RuntimeException: Unable to gossip with any peers

  • Cassandra service stops immediately after attempting to start.

  • Packet captures on ports 7000, 7001, and 9042 on the impacted worker node reveal numerous TCP retransmissions after the upgrade, whereas none were observed in the healthy state.

  • Running resolvectl query <Primary_FQDN> manually on the impacted worker node hangs indefinitely instead of succeeding or failing.
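
Because the manual lookup can hang indefinitely, it may help to wrap it in a timeout so the symptom is reported instead of blocking the shell. The following is a minimal sketch, not a supported tool: the function name probe_resolution, the 10-second default, and the result labels are illustrative assumptions, and it assumes resolvectl exits with a nonzero status when a lookup fails.

```python
import subprocess


def probe_resolution(fqdn, timeout_s=10.0, command=("resolvectl", "query")):
    """Run a DNS lookup command with a hard timeout and classify the outcome.

    Returns one of: "resolved", "failed", "timed_out", "tool_missing".
    The command prefix is a parameter so another resolver CLI can be substituted.
    """
    try:
        result = subprocess.run(
            [*command, fqdn],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed once this many seconds elapse
        )
    except subprocess.TimeoutExpired:
        # Matches the symptom described above: the query neither succeeds nor fails.
        return "timed_out"
    except FileNotFoundError:
        # The lookup tool is not installed or not on PATH.
        return "tool_missing"
    # Assumption: exit status 0 means the name resolved; anything else is a failure.
    return "resolved" if result.returncode == 0 else "failed"
```

For example, probe_resolution("primary.example.local") is expected to return "timed_out" on a node showing this issue, where primary.example.local is a placeholder for the Primary node's FQDN.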

Environment

Aria Operations for Logs 8.x

Cause

DNS misconfiguration on the impacted worker node.

The local DNS configuration (/etc/systemd/network/10-eth0.network) does not match the vApp properties or the environment's valid DNS servers. This causes the Cassandra service to hang during the startup phase while attempting to resolve the Primary node's FQDN. When the 120-second service startup timeout is reached, the process is killed, preventing the upgrade from proceeding.
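
The mismatch described above can be checked by reading the DNS= entries from the [Network] section of the file and comparing them with the servers you expect. This is a rough sketch under stated assumptions: the helper names parse_dns_servers and dns_mismatch are hypothetical, the expected server list must be supplied by you (for example, copied from the vApp properties), and addresses are compared as plain strings, so entries carrying a port or interface suffix would need normalizing first.

```python
def parse_dns_servers(network_file_text):
    """Collect DNS= values from the [Network] section of a systemd .network file.

    DNS= may appear more than once, and a single line may list several
    whitespace-separated addresses, so both forms are handled.
    """
    servers = []
    section = None
    for raw_line in network_file_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section == "Network" and "=" in line:
            key, _, value = line.partition("=")
            if key.strip() == "DNS":
                servers.extend(value.split())
    return servers


def dns_mismatch(configured, expected):
    """Report servers present only on the node and servers missing from it."""
    configured_set, expected_set = set(configured), set(expected)
    return {
        "unexpected": sorted(configured_set - expected_set),
        "missing": sorted(expected_set - configured_set),
    }


# Usage sketch (the expected addresses below are placeholders):
# with open("/etc/systemd/network/10-eth0.network") as handle:
#     configured = parse_dns_servers(handle.read())
# print(dns_mismatch(configured, ["192.0.2.10", "192.0.2.11"]))
```

If both lists in the result are empty, the file agrees with the expected servers and the cause likely lies elsewhere.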

Resolution

Correct the DNS configuration on the affected node so that it matches the settings defined in the vApp properties. Refer to KB 315960: https://knowledge.broadcom.com/external/article?articleNumber=315960