A region is a subset of a table's data: a contiguous, sorted range of rows that are stored together. Regions are the basic element of availability and distribution for tables.
HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer runs on a DataNode. Each RegionServer is responsible for serving a set of regions, and one region (i.e. a range of rows) can be served by only one RegionServer.
HMaster is the implementation of the Master server. The Master is responsible for monitoring all RegionServer instances in the cluster, and it is the interface for all metadata changes. In a distributed cluster, the Master typically runs on the NameNode. A cluster may have multiple Masters; all of them compete to run the cluster. If the active Master loses its lease in ZooKeeper (or the Master shuts down), the remaining Masters jostle to take over the Master role.
Initially, there is only one region per table. As more rows are added and a region becomes too large, it is split into two at the middle key, creating two roughly equal halves.
You can create a table with many regions by supplying the split points at table creation time. Pre-splitting is recommended but has a caveat -- poorly chosen split points can end up with uneven load distribution, which degrades cluster performance.
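As a rough sketch, here is how a pre-split table might be created with the HBase 1.x-era Java client API; the table name mytable, the column family cf, and the split points below are invented for illustration.
***********************************************************************************
Example:
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitSketch {
  // Create a table that starts life with four regions instead of one.
  static void createPreSplitTable(Connection connection) throws Exception {
    try (Admin admin = connection.getAdmin()) {
      HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("mytable"));
      desc.addFamily(new HColumnDescriptor("cf"));
      // Invented split points; good points spread the expected row keys evenly.
      byte[][] splitKeys = {Bytes.toBytes("g"), Bytes.toBytes("m"), Bytes.toBytes("t")};
      // Resulting regions: (-inf, g), [g, m), [m, t), [t, +inf)
      admin.createTable(desc, splitKeys);
    }
  }
}
***********************************************************************************
The HBase shell offers the same thing through the SPLITS option of the create command.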
The default split policy splits a region when the total data size of one of the stores in the region grows bigger than the configured hbase.hregion.max.filesize, which has a default value of 10GB.
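As a sketch (again with an invented table name), this threshold can also be overridden for a single table by setting the maximum file size on its table descriptor, which takes precedence over the cluster-wide hbase.hregion.max.filesize for that table.
***********************************************************************************
Example:
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class MaxFileSizeSketch {
  // Raise the split threshold for one table to 20GB without changing hbase-site.xml.
  static void setTableMaxFileSize(Connection connection) throws Exception {
    try (Admin admin = connection.getAdmin()) {
      TableName name = TableName.valueOf("mytable");
      HTableDescriptor desc = admin.getTableDescriptor(name);
      desc.setMaxFileSize(20L * 1024 * 1024 * 1024); // value is in bytes
      admin.modifyTable(name, desc);
    }
  }
}
***********************************************************************************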
Regions that are unevenly loaded can also be split manually from the HBase shell.
***********************************************************************************
Example:
hbase(main):024:0> split 'b07d0034cbe72cb040ae9cf66300a10c', 'b'
0 row(s) in 0.1620 seconds
***********************************************************************************
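The same manual split can be requested programmatically; a minimal sketch with the Java Admin API (table name and split key invented):
***********************************************************************************
Example:
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.util.Bytes;

public class ManualSplitSketch {
  // Ask the Master to split the region of 'mytable' that contains the key "b".
  static void splitAt(Connection connection) throws Exception {
    try (Admin admin = connection.getAdmin()) {
      admin.split(TableName.valueOf("mytable"), Bytes.toBytes("b"));
    }
  }
}
***********************************************************************************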
When HBase starts, regions are assigned as follows (short version):
1. The Master invokes the AssignmentManager upon startup.
2. The AssignmentManager looks at the existing region assignments in .META..
3. If a region assignment is still valid (i.e., if the RegionServer is still online), the assignment is kept.
4. If the assignment is invalid, the LoadBalancerFactory is invoked to assign the region; the load balancer assigns the region to a RegionServer.
5. .META. is updated with the RegionServer assignment (if needed) and the RegionServer start code (start time of the RegionServer process) upon region opening by the RegionServer.
Periodically, and when there are no regions in transition, the load balancer runs on the Master and moves regions around to balance the cluster's load. The balancing interval is configured via hbase.balancer.period and defaults to 300000 ms (5 minutes).
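Besides the periodic run, a balancing pass can be requested on demand; a minimal sketch using the Java Admin API:
***********************************************************************************
Example:
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;

public class BalancerSketch {
  // Request an immediate balancing pass; returns true if the balancer ran.
  static boolean runBalancerNow(Connection connection) throws Exception {
    try (Admin admin = connection.getAdmin()) {
      return admin.balancer();
    }
  }
}
***********************************************************************************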
HBase tables are made of rows and columns. All columns in HBase belong to a particular column family. Table cells -- the intersection of row and column coordinates -- are versioned. A cell's content is an uninterpreted array of bytes.
Rows in HBase tables are sorted by row key. The sort is byte-ordered. All table accesses are via the table row key -- its primary key.
A column name is made of its column family prefix and a qualifier. For example, the column contents:html is a member of the column family contents. The colon character (:) delimits the column family from the column qualifier.
Following is an example HBase table -- webtable:
| Row Key | Time Stamp | ColumnFamily contents | ColumnFamily anchor |
|---|---|---|---|
| "com.cnn.www" | t9 | | anchor:cnnsi.com="CNN" |
| "com.cnn.www" | t8 | | anchor:my.look.ca="CNN.com" |
| "com.cnn.www" | t6 | contents:html="<html>..." | |
| "com.cnn.www" | t5 | contents:html="<html>..." | |
| "com.cnn.www" | t3 | contents:html="<html>..." | |
A request for the value of the contents:html column at time stamp t8 would return no value. Similarly, a request for an anchor:my.look.ca value at time stamp t9 would return no value. However, if no timestamp is supplied, the most recent value for a particular column would be returned and would also be the first one found since timestamps are stored in descending order. Thus a request for the values of all columns in the row com.cnn.www if no timestamp is specified would be: the value of contents:html from time stamp t6, the value of anchor:cnnsi.com from time stamp t9, the value of anchor:my.look.ca from time stamp t8.
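A small sketch of this behavior with the Java client, run against the webtable example above; the numeric timestamp for t8 is passed in as a placeholder:
***********************************************************************************
Example:
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedGetSketch {
  static void readWebtable(Connection connection, long t8) throws Exception {
    try (Table table = connection.getTable(TableName.valueOf("webtable"))) {
      byte[] row = Bytes.toBytes("com.cnn.www");

      // contents:html pinned to timestamp t8 returns no value, because that
      // column was only written at t3, t5 and t6.
      Get pinned = new Get(row);
      pinned.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"));
      pinned.setTimeStamp(t8);
      System.out.println("empty at t8? " + table.get(pinned).isEmpty());

      // With no timestamp, the newest version of each column is returned,
      // e.g. contents:html from t6 and anchor:cnnsi.com from t9.
      Result latest = table.get(new Get(row));
      System.out.println(Bytes.toString(
          latest.getValue(Bytes.toBytes("contents"), Bytes.toBytes("html"))));
    }
  }
}
***********************************************************************************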
Physically, all column family members are stored together on the filesystem. Because tunings and storage specifications are done at the column family level, it is advised that all column family members have the same general access pattern and size characteristics.
A (row, column, version) tuple exactly specifies a cell in HBase. Cell content is uninterpreted bytes.
It's possible to have an unbounded number of cells where the row and column are the same but the cell address differs only in its version dimension.
The version is specified using a long integer. Typically this long contains time instances such as those returned by java.util.Date.getTime() or System.currentTimeMillis(), that is: “the difference, measured in milliseconds, between the current time and midnight, January 1, 1970 UTC”.
get: Returns attributes for a specified row. If no explicit version is specified, the cell whose version has the largest value is returned.
put: Either adds new rows to a table (if the key is new) or can update existing rows (if the key already exists). Doing a put always creates a new version of a cell, at a certain timestamp. By default the system uses the server's currentTimeMillis, but you can specify the version (= the long integer) yourself, on a per-column level. This means you could assign a time in the past or the future, or use the long value for non-time purposes.
scan: Iteration over multiple rows for specified attributes. Same behavior as get when working with versions.
delete: Removes data from a table. There are three types of internal delete markers: Delete, for a specific version of a column; Delete column, for all versions of a column; and Delete family, for all columns of a particular column family. (A short Java sketch of the four client operations appears after the next paragraph.)
HBase does not modify data in place, and so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up on major compactions.
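A compact sketch of the four client operations with the Java API, reusing the webtable names from above; the explicit timestamp on the put only illustrates the per-column version override:
***********************************************************************************
Example:
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class ClientOpsSketch {
  static void crud(Connection connection) throws Exception {
    byte[] row = Bytes.toBytes("com.cnn.www");
    byte[] contents = Bytes.toBytes("contents");
    byte[] html = Bytes.toBytes("html");

    try (Table table = connection.getTable(TableName.valueOf("webtable"))) {
      // put: writes a new version; the timestamp defaults to the server's
      // currentTimeMillis but can be supplied explicitly per column.
      Put put = new Put(row);
      put.addColumn(contents, html, System.currentTimeMillis(), Bytes.toBytes("<html>..."));
      table.put(put);

      // get: with no explicit version, the cell with the largest version wins.
      Result result = table.get(new Get(row));
      System.out.println(Bytes.toString(result.getValue(contents, html)));

      // scan: iterate over rows starting at this row key.
      Scan scan = new Scan();
      scan.setStartRow(row);
      scan.addFamily(contents);
      try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
          System.out.println(Bytes.toString(r.getRow()));
        }
      }

      // delete: writes tombstone markers rather than removing data in place.
      Delete delete = new Delete(row);
      delete.addColumns(contents, html); // all versions of contents:html
      table.delete(delete);
    }
  }
}
***********************************************************************************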
Catalog tables are used by HBase to keep track of the regions of user tables. There are two catalog tables: -ROOT- and .META.
-ROOT- keeps track of where the .META. table is. The -ROOT- table structure is as follows:
Key: the .META. region key (.META.,,1)
Values: info:regioninfo (a serialized HRegionInfo instance of .META.), info:server (the server:port of the RegionServer holding .META.), and info:serverstartcode (the start time of the RegionServer process holding .META.)
The .META. table keeps a list of all regions in the system. The .META. table structure is as follows:
Key: the region key, of the format ([table],[region start key],[region id])
Values: info:regioninfo (a serialized HRegionInfo instance for this region), info:server (the server:port of the RegionServer containing this region), and info:serverstartcode (the start time of the RegionServer process containing this region)
Note: When you start HBase, the HMaster is responsible for assigning the regions to each HRegionServer. This also includes the "special" -ROOT- and .META. tables.
The WAL is the lifeline that is needed when disaster strikes. Similar to the binary log (binlog) in MySQL, it records all changes to the data. This is important in case something happens to the primary storage. If the server crashes, HBase can effectively replay that log to get everything up to where the server should have been just before the crash. It also means that if writing a record to the WAL fails, the whole operation must be considered a failure.
The WAL is a standard Hadoop SequenceFile and it stores HLogKey instances. These keys contain a sequence number as well as the actual data and are used to replay not-yet-persisted data after a server crash.
Each RegionServer adds updates (Puts, Deletes) to its write-ahead log (WAL) first, and then to the MemStore of the affected Store. This ensures that HBase has durable writes. Without the WAL, there is the possibility of data loss if a RegionServer fails before each MemStore is flushed and new StoreFiles are written. HLog is the HBase WAL implementation, and there is one HLog instance per RegionServer. (MemStore, Store, and how data is stored in HBase are discussed below.)
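Writing to the WAL first is the default for every mutation; as a sketch, a client can relax it per operation (trading durability for speed) with the Durability setting:
***********************************************************************************
Example:
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Durability;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DurabilitySketch {
  static void putWithoutWal(Connection connection) throws Exception {
    try (Table table = connection.getTable(TableName.valueOf("webtable"))) {
      Put put = new Put(Bytes.toBytes("com.cnn.www"));
      put.addColumn(Bytes.toBytes("contents"), Bytes.toBytes("html"), Bytes.toBytes("<html>..."));
      // Skip the write-ahead log for this mutation only. If the RegionServer
      // dies before the MemStore is flushed, this edit is lost.
      put.setDurability(Durability.SKIP_WAL);
      table.put(put);
    }
  }
}
***********************************************************************************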
The class that implements the WAL is called HLog. When an HRegion is instantiated, the single HLog instance is passed as a parameter to its constructor.
HLog has the following important features:
The ZooKeeper cluster acts as a coordination service for the entire HBase cluster, handling master election, root region server lookup, node registration, and so on. RegionServers and the Master discover each other via ZooKeeper.
All participating nodes and clients need to be able to access the running ZooKeeper ensemble. Apache HBase by default manages a ZooKeeper "cluster" for you. It will start and stop the ZooKeeper ensemble as part of the HBase start/stop process. You can also manage the ZooKeeper ensemble independent of HBase and just point HBase at the cluster it should use. To toggle HBase management of ZooKeeper, use the HBASE_MANAGES_ZK variable in conf/hbase-env.sh. This variable, which defaults to true, tells HBase whether to start/stop the ZooKeeper ensemble servers as part of HBase start/stop.
When HBase manages the ZooKeeper ensemble, you can specify ZooKeeper configuration using its native zoo.cfg file, or, the easier option is to just specify ZooKeeper options directly in conf/hbase-site.xml. A ZooKeeper configuration option can be set as a property in the HBase hbase-site.xml XML configuration file by prefacing the ZooKeeper option name with hbase.zookeeper.property. For example, the clientPort setting in ZooKeeper can be changed by setting the hbase.zookeeper.property.clientPort property.
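For example, to move ZooKeeper off its default client port, the property would look something like this in conf/hbase-site.xml (the port number 2281 is only an illustration):
***********************************************************************************
Example (conf/hbase-site.xml):
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2281</value>
</property>
***********************************************************************************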