"HRegion Service running but not healthy" and possible other issues observed in VCF Operations for Networks

Products

VCF Operations for Networks

Issue/Introduction

After logging in to a VCF Operations for Networks Platform node, one or more of the following service errors may be shown under Settings --> Infrastructure and Support:

HRegionServer is running but not healthy. 
Data Retention (Metric Store Maintenance) service is unhealthy.
One or more essential services are not healthy.
TSDB Server failed to flush data to HBase.

NOTE: VCF Operations for Networks was formerly named Aria Operations for Networks (AON), and prior to that was named vRealize Network Insight (vRNI).

Environment

VCF Operations for Networks

Cause

One or a combination of the below events has been identified as the cause of this issue:
1. An unexpected shutdown or reboot of the Platform Node, causing HBase/HDFS database inconsistencies that leave services either "running but not healthy" or "not running".
2. Manual shutdown or reboot of one or more Platform Node(s) in a clustered deployment.
  - NOTES TO THE ABOVE POINT:
    - A reboot triggered by significant changes made with the "change-network-settings" CLI command, such as the changes described in the KB "Add/Modify the IP Address, Gateway, Netmask and DNS server/IP after VMware Aria Operations for Networks appliances are deployed", will NOT cause the symptoms described in this KB.
    - In a clustered environment, Platform Nodes should never be shut down or rebooted manually. Instead, follow the procedure documented in "Best practices to shutdown VCF Operations for Networks Clustered deployments".
3. Shutdown of one or more Platform Node(s) in a clustered deployment when Lifecycle Manager is used to take snapshots of VCF Operations for Networks.

Resolution

If you have encountered this issue, ensure you DO NOT perform any manual shutdown or reboot procedure of Platform Node(s).

Open a support case with Broadcom Support to review your Aria Operations for Networks deployment. For more information, see Creating and managing Broadcom support cases.
Capture the details below:

In the Aria Operations for Networks GUI, navigate to Settings > Infrastructure and Support > Infrastructure and Updates and take 1-2 screenshots covering the entire page. If any Problems are shown, click on them and capture another screenshot showing all the problems.
If the Platform nodes are in a clustered deployment, open an SSH/PuTTY session to VMware Aria Operations for Networks Platform Node1 and log in with the username support.

Execute the commands below:
```
ub
./run_all.sh uptime
./run_all.sh df -h
./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d
sudo -u hbase hbase hbck
sudo cat /home/ubuntu/build-target/deployment/patch.txt
sudo cat /home/ubuntu/build-target/deployment/appliance.status
sudo grep id: /etc/vnera/deployment/deployment.def
```
Note: The output of the above commands is expected to be long, so copy and paste it into a text file (for example in Notepad), save it, and upload it or send it as an email attachment to this case.
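As an alternative to copying from the terminal, the output can be written straight into a file. The following is a minimal sketch and is not part of the appliance: the `capture` function and the log path shown in the comments are hypothetical names chosen for illustration, and it assumes a writable location on the node.

```shell
# Hypothetical helper: run a command and append its output (stdout and
# stderr), preceded by a header naming the command, to a log file.
capture() {
    local log_file="$1"
    shift
    {
        printf '===== %s =====\n' "$*"
        # Record a non-zero exit status instead of stopping the collection.
        "$@" 2>&1 || printf '[exit status %s]\n' "$?"
        printf '\n'
    } >> "$log_file"
}

# Example use after running `ub` (the log path is an assumed name):
#   capture /tmp/support-output.txt ./run_all.sh uptime
#   capture /tmp/support-output.txt ./run_all.sh df -h
#   capture /tmp/support-output.txt sudo -u hbase hbase hbck
```

Each command's output lands under its own header, so the resulting file can be attached to the case as-is.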

If there is only one Platform node, open an SSH/PuTTY session to VMware Aria Operations for Networks Platform Node1 and log in with the username support.

Execute the commands below:

```
ub
uptime
df -h
./check-service-health.sh -p -d
sudo -u hbase hbase hbck
sudo cat /home/ubuntu/build-target/deployment/patch.txt
sudo cat /home/ubuntu/build-target/deployment/appliance.status
sudo grep id: /etc/vnera/deployment/deployment.def
```

Note: The output of the above commands is expected to be long, so copy and paste it into a text file (for example in Notepad), save it, and upload it or send it as an email attachment to this case.
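Because the `hbase hbck` output can run to many lines, it may help to quote its summary in the case description. The sketch below assumes the hbck output has been saved to a text file and that it ends with the usual "N inconsistencies detected." and "Status: OK" / "Status: INCONSISTENT" lines; the exact wording may differ between HBase versions. The `hbck_status` function is a hypothetical name, and it only reads the file; it does not attempt any repair.

```shell
# Hypothetical helper: print the summary lines from a saved `hbase hbck`
# output file and return 0 (OK), 1 (INCONSISTENT) or 2 (no status found).
hbck_status() {
    local out_file="$1"
    # Show the summary lines so they can be quoted in the case notes.
    grep -E 'inconsistencies detected|^Status:' "$out_file" || true
    if grep -q '^Status: OK' "$out_file"; then
        return 0
    elif grep -q '^Status: INCONSISTENT' "$out_file"; then
        return 1
    fi
    return 2
}

# Example use (the file name is an assumed name):
#   hbck_status hbck-output.txt
```

Whatever the result, the full output should still be attached to the support case.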

Additional Information

Any time a shutdown of a platform node cluster is needed, for example to take cold snapshots, it is recommended to follow Best practices to shutdown Aria Operations for Networks Clustered deployments to avoid this issue.

Attachments

hbase_repair_script.sh.txt