Slow cluster startup or runtime performance issues due to huge number of VM guestfilesystem metrics



Article ID: 428715


Updated On:

Products

VCF Operations
VCF Operations/Automation (formerly VMware Aria Suite)

Issue/Introduction

Common Symptoms:

  • Cluster performance becomes slower over time when collecting data from Kubernetes worker VMs.

  • Cluster appears to hang at "Going Online" after clicking Bring Cluster Online (but may finish startup after many hours in some cases).

  • The /storage/vcops/log/analytics-<uuid>.log file on the replica or data node shows that the analytics service is waiting for primary node persistence startup to complete.

  • The analytics log on the primary node shows that it has started to initialize the object cache:

    INFO  analytics 24596 [ops@# threadId="#" threadName="Analytics Main Thread"]  [com.vmware.statsplatform.persistence.cache.ResourceCache.initCacheFromDB] - load resources from db and populate cache
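
The log symptom above can also be checked from a shell on the node. The following is a minimal sketch, assuming the analytics log path given above. The helper name count_cache_init is hypothetical, and it only counts lines that contain the class and method name shown in the sample entry; it does not tell you whether initialization has finished.

```shell
# Hypothetical helper: count log lines that mention the object cache
# initialization method shown in the sample log entry.
count_cache_init() {
    # Arguments: one or more analytics log files.
    # -h suppresses file names, -F matches the string literally.
    grep -hF 'ResourceCache.initCacheFromDB' "$@" 2>/dev/null | wc -l | tr -d ' '
}

# Example call on the primary node (path taken from the symptom above):
#   count_cache_init /storage/vcops/log/analytics-*.log
```

A non-zero count on the primary node, combined with replica or data nodes that are still waiting for primary node persistence startup, matches the symptoms described in this section.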

Environment

  • VMware Aria Operations 8.17.x
  • VMware Aria Operations 8.18.x
  • VMware Cloud Foundation Operations 9.0.x

Cause

  • Operations clusters that collect from vCenter endpoints containing Kubernetes worker VMs may collect a huge number of ephemeral file system metrics from those VMs over a short period of time.

  • Default data retention settings for Virtual Machine objects keep ephemeral guestfilesystem metrics for years, even though the metrics themselves may exist for only minutes in some scenarios.

  • Millions of guestfilesystem metric keys are created over time, resulting in the symptoms listed in the Issue/Introduction section above, and:

    • Slow File System Database (FSDB) load/access/save due to huge FSDB dat files being loaded into cache

    • More than 95% of all metric keys in the internal database referencing ephemeral guest file systems that provide little to no monitoring value

Resolution

To determine if your Operations cluster is affected by this issue:

  1. Log in to the primary node as root via SSH or vSphere Console

  2. Check metric key counts (total and per prefix):

    su - postgres -c "/opt/vmware/vpostgres/current/bin/psql -d vcopsdb -p 5433" <<- SQLEOM
    	SELECT LEFT(metric_key, POSITION(':' in metric_key) - 1) AS metric_key_prefix, count(*) AS cnt FROM metric_key
    	GROUP BY metric_key_prefix HAVING COUNT(*) > 1
    	UNION ALL SELECT 'TOTAL_ROWS', COUNT(*) FROM metric_key
    	ORDER BY cnt DESC LIMIT 3;
    SQLEOM

     

    Example Output:

          metric_key_prefix      |   cnt
    --------------------------------------
     TOTAL_ROWS                  | 3726595
     guestfilesystem             | 3646979
     BGPNeighborInstancedMetric  |   44355
    (3 rows)

     

  3. If the second or third row in the output is guestfilesystem, contact Broadcom VCF Support for assistance, reference this KB number, and provide the output of the command from step 2.
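
The check in step 3 can also be done by parsing the step 2 output instead of reading the table by eye. The following is a minimal sketch, assuming the output was saved to a text file in the same layout as the example output above. The function name guestfs_share and the file name are hypothetical and are not part of the product.

```shell
# Hypothetical helper: read saved step 2 output and print the row position
# of guestfilesystem and its share of TOTAL_ROWS.
guestfs_share() {
    awk -F'|' '
        # Data rows have exactly two fields and a purely numeric count;
        # this skips the header, the separator line and the "(3 rows)" footer.
        NF == 2 && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            row++
            key = $1; gsub(/[ \t]/, "", key)
            cnt = $2 + 0
            if (key == "TOTAL_ROWS")      total = cnt
            if (key == "guestfilesystem") { gfs = cnt; pos = row }
        }
        END {
            if (pos == 0 || total == 0) { print "guestfilesystem not in top rows"; exit }
            printf "row %d, %.2f%% of all metric keys\n", pos, 100 * gfs / total
        }
    ' "$1"
}

# Example call, assuming the output was saved as step2_output.txt:
#   guestfs_share step2_output.txt
```

For the example output in step 2, this prints "row 2, 97.86% of all metric keys", which is consistent with the Cause section. The script only summarizes the output; still provide the unmodified step 2 output to support.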