"HRegion Service running but not healthy" in a Clustered Deployment (More than one Platform node) of VCF Operations for Networks


Article ID: 324414


Products

VCF Operations for Networks

Issue/Introduction

NOTE:  This KB article is appropriate only for clustered deployments of VCF Operations for Networks.  

  • Clustered deployments of VCF Operations for Networks are deployments where there is more than one Platform node, regardless of how many Collector node(s) have been deployed.

If you have a simple deployment (i.e. one, and only one Platform node, regardless of the number of Collector nodes), please refer to KB 433022 - "HRegion Service running but not healthy" in a Simple Deployment (one Platform node) of VCF Operations for Networks

 

OBSERVATIONS:

While logged in to the VCF Operations for Networks GUI, with Settings --> Infrastructure and Support --> Infrastructure and Updates selected, you observe one or more "Problem(s)" alerts.

The principal alert addressed by this KB is "HRegionServer is running but not healthy."

Other alerts may appear as well, for example:

  • Data Retention (Metric Store Maintenance) service is unhealthy.

  • TSDB Server failed to flush data to HBase.


NOTE:  VCF Operations for Networks was formerly named Aria Operations for Networks (AON), and prior to that was named vRealize Network Insight (vRNI).

 

Environment

VCF Operations for Networks 

Cause

The precise root cause is indeterminate; however, the symptoms indicate the presence of an HBase/HDFS database inconsistency.

This condition can be caused by one or a combination of the following scenarios:

  • An improper scale-up operation to brick sizes that are larger than originally deployed

  • Manual shutdown or reboot of one or more Platform Node(s) in a clustered deployment

  • Shutdown of one or more Platform Node(s) in a clustered deployment when using Lifecycle Manager to take snapshots of VCF Operations for Networks

 

NOTE:

A reboot triggered by significant changes made with the "change-network-settings" CLI command, such as the changes described in the KB "Add/Modify the IP Address, Gateway, Netmask and DNS server/IP after VMware Aria Operations for Networks appliances are deployed", will NOT cause the symptoms described in this KB.

 

Resolution

If you have encountered this issue, ensure you DO NOT perform any manual shutdown or reboot procedure of Platform Node(s).

  1. Open a support case with Broadcom Support, using the directions in KB 142884 - Creating and managing Broadcom cases, so that your VCF Operations for Networks deployment can be reviewed.

  2. In the VCF Operations for Networks GUI, navigate to Settings --> Infrastructure and Support --> Infrastructure and Updates, and take enough screenshots to capture the entire page.

    • Additionally, if any Problem(s) are displayed, click on each problem and capture enough screenshots to show the full detail of the alert.

  3. Open an SSH/PuTTY session to the VCF Operations for Networks Platform Node using the support user.

    • Start logging the SSH session, selecting "Printable Output" and directing the log to a file with a name like "<Date>Case_#######_Putty_Log_Platform.log" (where ####### is the Broadcom Support Case number)

    • Execute the following commands:

      • ub
      • cd /home/ubuntu/
      • ./run_all.sh uptime
      • ./run_all.sh df -h
      • ./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d
      • sudo -u hbase hbase hbck
      • sudo cat /home/ubuntu/build-target/deployment/patch.txt
      • sudo cat /home/ubuntu/build-target/deployment/appliance.status
      • sudo grep id: /etc/vnera/deployment/deployment.def

  4. In the VCF Operations for Networks GUI, navigate to Settings --> Infrastructure and Support --> Support, select the Platform nodes and any Collector node(s), and click "Create Support Bundle".

  5. Attach the materials collected in steps 2 through 4 (the screenshots, the PuTTY session log, and the support bundle) to the Support Case using the instructions in KB 140731 - Uploading files to cases on the Broadcom Support Portal
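
For reference, the diagnostic commands in step 3 can also be replayed from a small wrapper that labels each command's output, which may make the captured log easier to read. This is only a sketch, not a Broadcom-provided tool: it assumes it is run from the shell reached after the "ub" command in step 3, and the names DRY_RUN, LOG_FILE, and collect_diagnostics are hypothetical, introduced purely for illustration. By default it only prints the commands and executes nothing.

```shell
#!/usr/bin/env bash
# Sketch only: replays the step 3 diagnostic commands with a header per command.
# DRY_RUN and LOG_FILE are hypothetical names introduced for this example.
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command headers only, 0 = execute them
LOG_FILE="${LOG_FILE:-/tmp/platform_diagnostics.log}"

# Commands copied verbatim from step 3 (those that follow "ub" and "cd /home/ubuntu/").
DIAG_COMMANDS=(
  "./run_all.sh uptime"
  "./run_all.sh df -h"
  "./run_all.sh sudo /home/ubuntu/check-service-health.sh -p -d"
  "sudo -u hbase hbase hbck"
  "sudo cat /home/ubuntu/build-target/deployment/patch.txt"
  "sudo cat /home/ubuntu/build-target/deployment/appliance.status"
  "sudo grep id: /etc/vnera/deployment/deployment.def"
)

collect_diagnostics() {
  local cmd
  for cmd in "${DIAG_COMMANDS[@]}"; do
    echo "### ${cmd}"
    if [ "${DRY_RUN}" = "0" ]; then
      # Run from /home/ubuntu so the relative ./run_all.sh path resolves.
      (cd /home/ubuntu && bash -c "${cmd}") 2>&1
    fi
  done
}

# Show the output on screen and keep a copy in LOG_FILE.
collect_diagnostics | tee "${LOG_FILE}"
```

Setting DRY_RUN=0 executes the commands. The PuTTY session log described in step 3 remains the artifact to attach to the Support Case.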

 

Additional Information

In a clustered environment, Platform Nodes should never be manually shut down or rebooted, even when a shutdown of the Platform node cluster is needed (for example, to take powered-off VM snapshots).

If you have a simple deployment (i.e. one, and only one Platform node, regardless of the number of Collector nodes), please refer to KB 433022 - "HRegion Service running but not healthy" in a Simple Deployment (one Platform node) of VCF Operations for Networks

 

Attachments

hbase_repair_script.sh.txt