ASM Impact of Recursive Broadcom DNS issue on Feb 2nd

Article ID: 390987

Products

CA App Synthetic Monitor

Issue/Introduction

On February 2, 2025, between approximately 18:30 CET and 21:00 CET, users attempting to access resources hosted on, or services dependent on, the .broadcom.com domain experienced intermittent accessibility problems.

This document describes the issue, its impact on ASM, and the actions taken to prevent it from happening again.

Environment

DX ASM SaaS

Cause

Broadcom DNS issue

Resolution

For ASM, this DNS issue caused the following problems:

  • Users could not log in to the ASM dashboard.
  • Monitor checks could not be submitted to monitoring stations, causing a disruption of monitoring.
  • On-premise monitoring stations could not connect to the Tunnel Server.

Because there was less incoming traffic and less CPU load during the DNS issue, the ASM Kubernetes cluster was automatically scaled down by one node, causing some pods to be rescheduled onto other nodes. The cluster was not scaled back up after the DNS issue was resolved. Once DNS service was restored, the ASM issues listed above were resolved as well.
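The scale-down itself was expected autoscaler behavior: when reduced traffic lets pod replica counts and their resource requests shrink, the Kubernetes cluster-autoscaler eventually removes an under-utilized node. As a rough illustration only (a simplified model using the autoscaler's default thresholds, not ASM's actual configuration):

    # Simplified model of the Kubernetes cluster-autoscaler scale-down
    # decision; illustrative only, not ASM's actual configuration.

    SCALE_DOWN_UTILIZATION_THRESHOLD = 0.5  # autoscaler default
    SCALE_DOWN_UNNEEDED_SECONDS = 600       # autoscaler default (10 minutes)

    def node_is_removable(cpu_requested: float, cpu_allocatable: float,
                          underutilized_for_seconds: float) -> bool:
        """A node becomes a scale-down candidate once the CPU requested by
        its pods (not the actual load) stays below the threshold for a
        sustained period."""
        utilization = cpu_requested / cpu_allocatable
        return (utilization < SCALE_DOWN_UTILIZATION_THRESHOLD
                and underutilized_for_seconds >= SCALE_DOWN_UNNEEDED_SECONDS)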

An hour later, at 22:00 CET on February 2, the ASM Kubernetes cluster reported errors in communication with one of the cluster nodes, and as a result the node was temporarily detached from the cluster. Some pods, including one of the three RabbitMQ nodes, were terminated and rescheduled onto the remaining nodes, which made the connection between RabbitMQ and the other components temporarily unavailable. Most components recovered, except for the scheduler pod, which stopped submitting check-request messages to the other components in the cluster. As a result, no monitor checks were being run. Multiple alerts were triggered and incidents were reported to SaaS Ops.
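Most components likely recovered because their AMQP clients reconnect automatically when a broker node disappears. The sketch below shows that pattern with the Python pika client; the host, queue name, and handle() function are hypothetical, and this is not ASM's actual code:

    import time
    import pika

    def connect_with_retry(host: str = "rabbitmq", attempts: int = 10,
                           delay: float = 5.0) -> pika.BlockingConnection:
        """Keep retrying until the broker (e.g. a rescheduled RabbitMQ pod)
        is reachable again."""
        for attempt in range(1, attempts + 1):
            try:
                return pika.BlockingConnection(pika.ConnectionParameters(host=host))
            except pika.exceptions.AMQPConnectionError:
                if attempt == attempts:
                    raise
                time.sleep(delay)

    def handle(body: bytes) -> None:
        """Hypothetical stand-in for processing one check-request message."""
        print("processing check request:", body)

    def consume_forever() -> None:
        while True:
            channel = connect_with_retry().channel()
            channel.queue_declare(queue="check-requests", durable=True)
            try:
                for method, _properties, body in channel.consume("check-requests"):
                    handle(body)
                    channel.basic_ack(method.delivery_tag)
            except pika.exceptions.AMQPConnectionError:
                continue  # connection lost mid-stream: reconnect and resume

A component without this kind of recovery loop, like the scheduler pod in this incident, silently stops doing work until it is restarted.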

The problem was resolved by a manual restart of the affected scheduler pod. Preventive measures have been identified and implemented. Full service was restored at 07:37 CET on February 3.
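The article does not detail the preventive measures. One common pattern for this failure mode (a pod that is alive but has stopped doing useful work) is a liveness check that fails when no check requests have been submitted recently, so Kubernetes restarts the pod automatically instead of waiting for a manual restart. A minimal sketch; the port, endpoint, and 120-second stall threshold are hypothetical:

    import time
    from http.server import BaseHTTPRequestHandler, HTTPServer

    # The scheduler's main loop would refresh this timestamp each time it
    # submits a check-request message.
    last_submit = time.monotonic()
    STALL_THRESHOLD_SECONDS = 120  # hypothetical

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            # Return 503 once submissions stall, so a Kubernetes
            # livenessProbe pointed at this endpoint restarts the pod.
            stalled = time.monotonic() - last_submit > STALL_THRESHOLD_SECONDS
            self.send_response(503 if stalled else 200)
            self.end_headers()

    if __name__ == "__main__":
        HTTPServer(("", 8080), HealthHandler).serve_forever()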

At 19:41 CET on February 3, the ASM Kubernetes cluster again reported errors in communication with the same cluster node, and as a result the node was temporarily detached from the cluster. This caused an outage of the Redis Sentinel service, which recovered automatically. There was a two-minute gap in monitor checks.
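Redis Sentinel is built to self-heal from exactly this kind of node loss: the surviving Sentinels elect a new master, and clients ask Sentinel for the current master whenever they reconnect, which matches the short, automatically recovered gap described above. A minimal client-side sketch using the redis-py library; the hostnames and the master name asm-master are illustrative:

    from redis.sentinel import Sentinel

    # Hypothetical Sentinel endpoints.
    sentinel = Sentinel(
        [("sentinel-0", 26379), ("sentinel-1", 26379), ("sentinel-2", 26379)],
        socket_timeout=0.5,
    )

    # master_for() queries the Sentinels for the current master on each new
    # connection, so after an automatic failover the client follows the
    # newly promoted master instead of failing permanently.
    master = sentinel.master_for("asm-master", socket_timeout=0.5)
    master.set("last-heartbeat", "ok")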

Additional Information