"High Disk Utilization" Check Fails with Precheck of upgrade failure for Aria Operations for Networks

Products

VCF Operations for Networks

Issue/Introduction

If an upgrade is started immediately after taking snapshots of an Aria Operations for Networks deployment, the error shown below is expected.

The error comes from an upgrade precheck that is built into Aria Operations for Networks.

This is expected behavior: after snapshots are taken and the Aria Operations for Networks appliances are powered on, high I/O is seen, and as a result the FoundationDB replication health shows Healthy (Repartitioning) with some moving data.

1. Aria Operations for Networks fails to upgrade through Aria Suite Lifecycle manager (vRSLCM) because the Disk Utilization Check fails.

The error in the Aria Suite Lifecycle manager (vRSLCM) GUI is shown below:

com.vmware.vrealize.lcm.plugin.core.vrni.common.exception.VRNIUpgradeCheckStatusException: Error occurred while checking upgrade pre-check status with IP ##.##.###.## Pre-check message : {
"msg" : "High disk utilization",
"type" : "INFO",
"status" : "FAIL",
"id" : "DiskUtilizationCheckTask",
"title" : "Disk Utilization Check",
"consentMsg" : null
}
at com.vmware.vrealize.lcm.plugin.core.vrni.task.upgrade.UpgradePrecheckStatusTask.execute(UpgradePrecheckStatusTask.java:161)
at com.vmware.vrealize.lcm.automata.core.TaskThread.run(TaskThread.java:62)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)

2. In the Aria Operations for Networks GUI, the High disk utilization check is reported as Failed, with a prompt to contact Support.

3. The Aria Operations for Networks database shows the replication health as Healthy (Repartitioning) with some moving data. See the output below:

ubuntu@platform1:~$ fdbcli
Using cluster file `/etc/foundationdb/fdb.cluster'.

The database is available.

Welcome to the fdbcli. For help, type `help'.
fdb> status details

Using cluster file `/etc/foundationdb/fdb.cluster'.

Configuration:
  Redundancy mode        - double
  Storage engine         - ssd-2
  Coordinators           - 3
  Desired Proxies        - 2
  Desired Logs           - 2
  Usable Regions         - 1

Cluster:
  FoundationDB processes - 20
  Zones                  - 10
  Machines               - 10
  Memory availability    - 11.0 GB per process on machine with least available
  Retransmissions rate   - 93 Hz
  Fault Tolerance        - 1 machine
  Server time            - 09/17/24 18:32:05

Data:
  Replication health     - Healthy (Repartitioning)
  Moving data            - 149 GB
  Sum of key-value sizes - 2.048 TB
  Disk space used        - 5.230 TB

Operating space:
  Storage server         - 1603.0 GB free on most full server
  Log server             - 1650.0 GB free on most full server

Workload:
  Read rate              - 9295 Hz
  Write rate             - 3475 Hz
  Transactions started   - 27044 Hz
  Transactions committed - 648 Hz
  Conflict rate          - 0 Hz

Backup and DR:
  Running backups        - 0
  Running DRs            - 0

Environment

Aria Operations for Networks 6.12.0
Aria Operations for Networks 6.13.0
Aria Operations for Networks 6.14.0
Aria Operations for Networks 6.14.1

Cause

After snapshots are taken through vCenter or Aria Suite Lifecycle and the Aria Operations for Networks appliances are powered on, high I/O is seen. As a result, the FoundationDB database shows the replication health as Healthy (Repartitioning) with some moving data.

Resolution

This is expected behavior. After taking snapshots, it is recommended to check the services status and the database health from the CLI.

If the snapshots have been kept for more than 24 to 36 hours, delete them to help the moving data settle down to 0 GB.

Take a new set of snapshots, following Best practices to shutdown Aria Operations for Networks Clustered deployments.

The database should show Healthy with 0 GB of moving data.

Expect to wait a few minutes for the high I/O on the database to settle down before triggering the upgrade.
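The readiness condition described above (replication health of Healthy with 0 GB of moving data) can be checked against saved `status details` output. The following is a minimal sketch, not a supported tool: the function names `parse_fdb_status` and `ready_for_upgrade` are hypothetical, and it assumes the output uses the same line layout as the sample in this article and that sizes are reported in decimal units.

```python
import re

# Assumption: fdbcli reports sizes in decimal units; convert everything to GB.
_UNIT_TO_GB = {"B": 1e-9, "KB": 1e-6, "MB": 1e-3, "GB": 1.0, "TB": 1e3}


def parse_fdb_status(text):
    """Return (replication_health, moving_data_gb) from `status details` text.

    moving_data_gb is None when no "Moving data" line is present.
    """
    health = re.search(r"Replication health\s+-\s+(.+)", text)
    if health is None:
        raise ValueError("Replication health line not found")
    moving = re.search(r"Moving data\s+-\s+([\d.]+)\s*([KMGT]?B)", text)
    moving_gb = None
    if moving is not None:
        moving_gb = float(moving.group(1)) * _UNIT_TO_GB[moving.group(2)]
    return health.group(1).strip(), moving_gb


def ready_for_upgrade(text):
    """True only when health is exactly Healthy and moving data is 0 GB."""
    health, moving_gb = parse_fdb_status(text)
    return health == "Healthy" and moving_gb == 0


if __name__ == "__main__":
    # Excerpt of the Data section shown earlier in this article.
    sample = (
        "Data:\n"
        "  Replication health     - Healthy (Repartitioning)\n"
        "  Moving data            - 149 GB\n"
    )
    print(parse_fdb_status(sample), ready_for_upgrade(sample))
```

A missing "Moving data" line is deliberately treated as not ready, so an unexpected output format never reports the database as safe to upgrade.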

FDB replication usually takes 48 to 72 hours to become healthy with 0 GB of moving data. (This is an estimate, and it can take longer.)

On 3- to 15-node platform cluster deployments, this time also varies because it depends on the size of the moving data.

If the moving data is very high (more than 700 GB, or more than 1000 GB), the IOPS value seen in the GUI needs to be validated.
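The guidance above can be summarized as a small triage helper. This is an illustrative sketch only: the function name `triage_moving_data` and the returned labels are hypothetical, and the 700 GB threshold is simply the lower figure quoted in this article, not a product setting.

```python
# Lower threshold quoted in this article for "quite high" moving data.
IOPS_REVIEW_THRESHOLD_GB = 700


def triage_moving_data(moving_gb):
    """Map a moving-data size in GB to the next step suggested in this article."""
    if moving_gb < 0:
        raise ValueError("moving data cannot be negative")
    if moving_gb == 0:
        return "ready"          # database has settled; the upgrade can be triggered
    if moving_gb > IOPS_REVIEW_THRESHOLD_GB:
        return "validate-iops"  # check IOPS in the GUI and involve support
    return "wait"               # let the moving data settle towards 0 GB


if __name__ == "__main__":
    for size_gb in (0, 149, 1000):
        print(size_gb, triage_moving_data(size_gb))
```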

The GS support team will have to review and evaluate this further and make changes to the Aria Operations for Networks database.

Open a Broadcom support case and refer to this Knowledge Base article.

See Creating and managing Broadcom support cases.