"HRegion Service running but not healthy" and possible other issues observed in VCF Operations for Networks

Article ID: 324414

Updated On:

Products

VCF Operations for Networks

Issue/Introduction


VCF Operations for Networks platform nodes may report one or more of the following service errors. They are shown on the Platform node under Settings > Infrastructure and Support:

HRegionServer is running but not healthy. 
Data Retention (Metric Store Maintenance) service is unhealthy.
One or more essential services are not healthy.
TSDB Server failed to flush data to HBase.

NOTE:  VCF Operations for Networks was formerly named Aria Operations for Networks (AON), and prior to that was named vRealize Network Insight (vRNI).

 

Environment

Aria Operations for Networks 6.13.0
Aria Operations for Networks 6.14.0
Aria Operations for Networks 6.14.1

Cause

One or a combination of the below events has been identified as the cause of this issue:

  1. Unexpected or unwanted shutdown or reboot of the Platform Node, causing HBase/HDFS database inconsistencies that leave services either "running but not healthy" or "not running".

  2. Manual shutdown or reboot of one or more Platform Node(s) in a clustered deployment.

  3. Shutdown of one or more Platform Node(s) in a clustered deployment when Lifecycle Manager is used to take snapshots of VCF Operations for Networks.

Resolution

If you have encountered this issue, ensure you DO NOT perform any manual shutdown or reboot procedure of Platform Node(s).

  1. Open a support case with Broadcom Support to review your Aria Operations for Networks deployment. For more information, see Creating and managing Broadcom support cases. 
     
  2. Capture the details below:
  1. In the Aria Operations for Networks GUI, navigate to Settings > Infrastructure and Support > Infrastructure and Updates and take 1-2 screenshots covering the entire page. If any Problems are listed, click them and capture additional screenshots showing all of the problems.
  2. If the Platform nodes are in a clustered deployment, open an SSH session (for example, with PuTTY) to VMware Aria Operations for Networks Platform Node1 and log in with the username support.

    Execute the commands below:

    ub
    ./run_all.sh uptime
    ./run_all.sh df -h
    ./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d
    sudo -u hbase hbase hbck
    sudo cat /home/ubuntu/build-target/deployment/patch.txt
    sudo cat /home/ubuntu/build-target/deployment/appliance.status
    sudo grep id: /etc/vnera/deployment/deployment.def

    Note: The output of the above commands is expected to be long. Copy and paste it into a text file (for example, in Notepad), save the file, and upload it or send it as an email attachment to this case.

  3. If there is only one Platform node, open an SSH session (for example, with PuTTY) to VMware Aria Operations for Networks Platform Node1 and log in with the username support.

    Execute the commands below:

    ub
    uptime
    df -h
    ./check-service-health.sh -p -d
    sudo -u hbase hbase hbck
    sudo cat /home/ubuntu/build-target/deployment/patch.txt
    sudo cat /home/ubuntu/build-target/deployment/appliance.status
    sudo grep id: /etc/vnera/deployment/deployment.def

    Note: The output of the above commands is expected to be long. Copy and paste it into a text file (for example, in Notepad), save the file, and upload it or send it as an email attachment to this case.
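
As an alternative to copying terminal output by hand, the single-node commands above could be captured into one file with a small wrapper. The following is only a minimal sketch, not a supported tool: the capture and collect_single_node functions, the OUTFILE variable, the --single-node argument, and the /tmp output location are all assumptions made for illustration. It assumes Bash, and that it is started after ub from the same shell and directory in which the manual commands would be run.

```shell
#!/usr/bin/env bash
# Sketch: append the output of each diagnostic command to one text file.
# OUTFILE, capture and collect_single_node are hypothetical names, not product tooling.

OUTFILE="${OUTFILE:-/tmp/platform-diagnostics-$(hostname)-$(date +%Y%m%d-%H%M%S).txt}"

# capture CMD [ARGS...]
# Writes a header naming the command, then its stdout and stderr.
# A non-zero exit status is recorded instead of stopping the collection.
capture() {
    {
        printf '===== %s =====\n' "$*"
        "$@" 2>&1 || printf '(exit status %s)\n' "$?"
        printf '\n'
    } >> "$OUTFILE"
}

# The single-node command list from this article, unchanged.
# Assumption: `ub` has already been run manually, so it is not repeated here.
collect_single_node() {
    capture uptime
    capture df -h
    capture ./check-service-health.sh -p -d
    capture sudo -u hbase hbase hbck
    capture sudo cat /home/ubuntu/build-target/deployment/patch.txt
    capture sudo cat /home/ubuntu/build-target/deployment/appliance.status
    capture sudo grep id: /etc/vnera/deployment/deployment.def
}

# Collect only when explicitly requested.
if [ "${1:-}" = "--single-node" ]; then
    collect_single_node
    echo "Output saved to $OUTFILE"
fi
```

Saved under a name such as collect.sh (hypothetical), this would be run as bash collect.sh --single-node, and the resulting file attached to the case. For a clustered deployment, the same capture function could wrap the ./run_all.sh commands listed earlier.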

Additional Information

Any time a platform node cluster needs to be shut down, for example to take cold snapshots, it is recommended to follow Best practices to shutdown Aria Operations for Networks Clustered deployments to avoid this issue.

Attachments

hbase_repair_script.sh.txt