"HRegion Service running but not healthy" in a Clustered Deployment (More than one Platform node) of VCF Operations for Networks

Products

VCF Operations for Networks

Issue/Introduction

NOTE: This KB article is appropriate only for clustered deployments of VCF Operations for Networks.

Clustered deployments of VCF Operations for Networks are deployments where there is more than one Platform node, regardless of how many Collector node(s) have been deployed.

If a simple deployment (i.e. one, and only one Platform node, regardless of the number of Collector nodes) is in use, refer to KB 433022 - "HRegion Service running but not healthy" in a Simple Deployment (one Platform node) of VCF Operations for Networks

OBSERVATIONS:

While logged into the VCF Operations for Networks GUI, and selecting Settings --> Infrastructure and Support --> Infrastructure and Updates, one or more "Problem(s)" alerts are seen.
- The principal alert of concern regarding this KB is "HRegionServer is running but not healthy."
- There may be other alerts that appear as well, including examples like:
  - Data Retention (Metric Store Maintenance) service is unhealthy.
  - TSDB Server failed to flush data to HBase.
Upgrade pre-checks may also report that HBase or HDFS is unhealthy.

NOTE: VCF Operations for Networks was formerly named Aria Operations for Networks (AON), and prior to that was named vRealize Network Insight (vRNI).

Environment

VCF Operations for Networks in a clustered deployment

Cause

The precise root cause is indeterminate; however, the symptoms indicate the presence of an HBase/HDFS database inconsistency.

This condition can be caused by one or a combination of the following scenarios:

- An improper scale-up operation to brick sizes that are larger than originally deployed.
- Manual shutdown or reboot of one or more Platform Node(s) in a clustered deployment.
  - In a clustered environment, Platform Nodes should never be manually shut down or rebooted. Instead, follow the procedure documented in Best practices to shutdown VCF Operations for Networks Clustered deployments.
- Shutdown of one or more Platform Node(s) when Lifecycle Manager is used to take snapshots of a clustered VCF Operations for Networks deployment.

NOTE: A reboot that is triggered after significant changes made with the "change-network-settings" CLI command, such as the changes described in the KB "Add/Modify the IP Address, Gateway, Netmask and DNS server/IP after VMware Aria Operations for Networks appliances are deployed", will NOT cause the symptoms described in this KB.

Resolution

If this issue is observed, DO NOT perform any manual shutdown or reboot procedure of Platform Node(s).

Open a support case with Broadcom Support using the directions at KB 142884 - Creating and managing Broadcom cases to review the VCF Operations for Networks deployment.
In the VCF Operations for Networks GUI, navigate to Settings --> Infrastructure and Support --> Infrastructure and Updates. From there, take enough screenshots to capture the entire page.
- Additionally, if there are any Problem(s) displayed, click on each problem and for each problem, capture sufficient screenshots to illustrate the detail of the alert.
Open an SSH/PuTTY session to the VCF Operations for Networks Platform Node using the support user.
Start logging for the SSH session by selecting "Printable Output" and directing the logging to a file with a name like "<Date>Case_#######_Putty_Log_Platform.log" (where ####### is the Broadcom Support Case number).
Execute the following commands:
1. ub
2. cd /home/ubuntu/
3. ./run_all.sh uptime
4. ./run_all.sh df -h
5. ./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d
6. sudo -u hbase hbase hbck
7. sudo cat /home/ubuntu/build-target/deployment/patch.txt
8. sudo cat /home/ubuntu/build-target/deployment/appliance.status
9. sudo grep id: /etc/vnera/deployment/deployment.def
In the VCF Operations for Networks GUI, navigate to Settings --> Infrastructure and Support --> Support, select the Platform nodes and any Collector node(s), and click "Create Support Bundle".
Attach the collected information to the Support Case using the instructions at KB 140731 - Uploading files to cases on the Broadcom Support Portal.
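
The diagnostic commands listed above (steps 2 through 9) can also be run in one pass so that each command's output appears under a clear header in the PuTTY log. The following is a minimal sketch only, not a supported tool: the names diag_commands, run_diag, and DRY_RUN are hypothetical and do not exist on the appliance. It assumes that step 1 (ub) has already been run interactively and that the script is started from the resulting shell. By default it only prints the commands; nothing is executed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only. diag_commands, run_diag and DRY_RUN are hypothetical names
# introduced for illustration; they are not part of the product.
# Assumption: step 1 ("ub") was already run interactively in this session.

# Print the diagnostic commands from steps 3-9, one per line.
diag_commands() {
  cat <<'EOF'
./run_all.sh uptime
./run_all.sh df -h
./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d
sudo -u hbase hbase hbck
sudo cat /home/ubuntu/build-target/deployment/patch.txt
sudo cat /home/ubuntu/build-target/deployment/appliance.status
sudo grep id: /etc/vnera/deployment/deployment.def
EOF
}

# Print a header for each command; execute it only when DRY_RUN=0.
run_diag() {
  diag_commands | while IFS= read -r cmd; do
    echo "===== ${cmd} ====="
    if [ "${DRY_RUN:-1}" = "0" ]; then
      # Step 2 (cd /home/ubuntu/) is applied before every command.
      # stdin is redirected so a command cannot consume the command list.
      (cd /home/ubuntu/ && sh -c "${cmd}") </dev/null 2>&1 \
        || echo "(exit status $?)"
    fi
  done
}

# Dry run by default: only the headers are printed.
run_diag
```

To execute the commands rather than list them, the function would be called as `DRY_RUN=0 run_diag` with PuTTY session logging already enabled. A failing command is reported with its exit status and does not stop the remaining commands, so the log still contains the full set of outputs for Broadcom Support.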

Additional Information

In a clustered environment, Platform Nodes should never be manually shut down or rebooted, even when a shutdown of the Platform node cluster is needed (for example, to take powered-off VM snapshots).

Instead, follow the procedure documented in Best practices to shutdown VCF Operations for Networks Clustered deployments

If you have a simple deployment (i.e. one, and only one Platform node, regardless of the number of Collector nodes), please refer to KB 433022 - "HRegion Service running but not healthy" in a Simple Deployment (one Platform node) of VCF Operations for Networks

Attachments

hbase_repair_script.sh.txt