Apache HBase

Apache HBase is an open-source, distributed, non-relational database that relies on HDFS to provide random, real-time, and consistent access to large amounts of data. It is a highly scalable, column-oriented database inspired by Google's Bigtable.

Relying on HDFS makes HBase fault-tolerant: data is replicated across nodes, and automatic failover ensures high availability. In addition, table sharding and load balancing are automated.

When to use HBase

The Apache Hadoop core components - HDFS, YARN and MapReduce - make it possible to store, process, and manage large amounts of data efficiently. However, they only perform batch processing and are not designed for real-time analytics. HBase overcomes these limitations by offering random, real-time read and write operations.

HBase is well suited when:

  • You have large volumes of data (hundreds of terabytes);
  • You have sparse data;
  • You need a flexible schema;
  • You need real-time read and write access;
  • You don’t need ACID transactions.

Note: although HBase does not natively support a structured query language like SQL, it can be combined with Apache Phoenix to run SQL-like queries against HBase tables.
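
For example, a Phoenix query can be issued through its JDBC driver. This is a minimal sketch, assuming the Phoenix client is on the classpath and that a hypothetical users table exists; the ZooKeeper host name is also an assumption:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuery {
    public static void main(String[] args) throws Exception {
        // Phoenix connects through the cluster's ZooKeeper quorum
        // (host name assumed here for illustration).
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zookeeper-host:2181");
             Statement stmt = conn.createStatement();
             // Hypothetical Phoenix table mapped over an HBase table.
             ResultSet rs = stmt.executeQuery("SELECT name, city FROM users LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " - " + rs.getString("city"));
            }
        }
    }
}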

HBase Data Model

The HBase data model is composed of tables with rows and columns. However, the similarities with relational models end there. HBase is a column-oriented database that can be seen as a key-value store. It is organized as follows:

  • Table: A collection of rows, divided into regions that are distributed across RegionServers according to row keys.
  • Row: A collection of one or more columns. Rows are identified and lexicographically sorted by a row key, built by the application from a chosen pattern.
    E.g.: a row key can be [timestamp][hostname][log-event]
  • Column: A column is composed of a column family and a column qualifier separated by a colon (:).
    • Column family: A column family defines options such as time to live, number of versions to keep, compression, etc. Column families are defined at table creation, which means that every row in a table has the same column families (whose cells may be left empty). The data of a column family is physically colocated in stores on the same servers.
    • Column qualifier: One or more column qualifiers can be added to a column family. They provide a label for the values. Column qualifiers are not defined at table creation and may vary between rows.
      E.g.: given the column family content, column qualifiers can be html or pdf. We would have content:html and content:pdf columns.
  • Cell: A versioned key-value pair. The key is made of coordinates (a row key and a column) and a timestamp that defines the version of the value.

HBase table

An HBase table can be represented as a JSON object:

{
  "0001": {                      # Row using 0001 as rowkey
    "Personal data": {           # Column Family
      "Name": {                  # Column Qualifier
        "1656678513": "Jules",   # Cell value with timestamp 
        "1625142513": "Henri"
      },
      "City": {
        "1625142513": "Paris"
      }
    },
    "Professional data": {
      "Designation": {
        "1625142513": "Manager"
      },
      "Salary": {
        "1625142513": "50,000"
      }
    }
  }
  ...
}
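
This structure maps directly onto the HBase Java client, where each value is addressed by row key, column family, column qualifier, and timestamp. A minimal sketch writing and reading the Name cell above (the table name Employee is an assumption, and the table is expected to already exist with the two column families shown):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseDataModel {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("Employee"))) {

            // Write two versions of the "Personal data:Name" cell for row key 0001.
            Put put = new Put(Bytes.toBytes("0001"));
            put.addColumn(Bytes.toBytes("Personal data"), Bytes.toBytes("Name"),
                          1625142513L, Bytes.toBytes("Henri"));
            put.addColumn(Bytes.toBytes("Personal data"), Bytes.toBytes("Name"),
                          1656678513L, Bytes.toBytes("Jules"));
            table.put(put);

            // Read the cell back: HBase returns the most recent version by default.
            Get get = new Get(Bytes.toBytes("0001"));
            get.addColumn(Bytes.toBytes("Personal data"), Bytes.toBytes("Name"));
            Result result = table.get(get);
            System.out.println(Bytes.toString(
                result.getValue(Bytes.toBytes("Personal data"), Bytes.toBytes("Name")))); // prints "Jules"
        }
    }
}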

HBase Storage

As mentioned previously, a region is a subset of a table, gathering a contiguous range of row keys inside a RegionServer. Each RegionServer contains a Write Ahead Log (WAL) in which every transaction is recorded for recovery purposes.

Regions are divided into stores corresponding to their column families, with one store per column family. A store contains an in-memory store, the MemStore, and zero or more HFiles (also called StoreFiles) stored on HDFS.

When data is written to HBase, it is recorded in the WAL and saved to the MemStore. Once the MemStore reaches a predefined size, its content is flushed to immutable HFiles. If a RegionServer becomes unavailable before the MemStore is flushed, the WAL is replayed to recover the state of the MemStore. HBase compacts the data regularly to keep the number of HFiles under control and the cluster balanced.
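
Flushes and compactions normally happen automatically, but the Admin API also exposes them, which makes the mechanism easy to observe. A minimal sketch, assuming the same hypothetical Employee table:

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class StorageOperations {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = connection.getAdmin()) {
            TableName table = TableName.valueOf("Employee"); // assumed table name
            admin.flush(table);        // force the table's MemStores to be flushed into new HFiles
            admin.majorCompact(table); // asynchronously rewrite all HFiles of each store into one
        }
    }
}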

HBase Architecture

HBase’s architecture consists of 3 components:

  • HMaster, the master service of HBase. It manages cluster operations (such as region assignment, load balancing, and splitting) and is in charge of metadata management (such as schema changes and table or column family creation). It monitors all RegionServers in the cluster and is responsible for assigning regions to them. One HMaster is required for the whole cluster.
  • RegionServers, the worker services of HBase. They manage and serve data for the regions they are assigned. One or more RegionServers are required.
  • Client, the interface used to access HBase. It allows users to read and write data by locating the requested region(s) through ZooKeeper and contacting the associated RegionServer directly. It caches region locations to speed up future requests (as sketched in the example below).

HBase architecture
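
In code, this lookup path is mostly hidden: the client only needs the ZooKeeper quorum, and the HBase library resolves region locations behind the scenes. A minimal sketch of locating the RegionServer serving a given row (the quorum hosts and table name are assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRegion {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // The client bootstraps through ZooKeeper; it never needs to know the HMaster.
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // assumed quorum hosts
        try (Connection connection = ConnectionFactory.createConnection(conf);
             RegionLocator locator = connection.getRegionLocator(TableName.valueOf("Employee"))) {
            HRegionLocation location = locator.getRegionLocation(Bytes.toBytes("0001"));
            System.out.println("Row 0001 is served by " + location.getServerName());
        }
    }
}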

HBase in TDP

The components described above are distributed as follows on the TDP getting started cluster:

  • HMaster runs on a master node, along with the HDFS NameNode.
  • RegionServers run on the worker nodes, along with the HDFS DataNodes.
  • Client runs on the edge node along with other clients.

TDP also ships an HBase REST component that exposes a REST API to interact with HBase services. Authentication is handled by Apache Knox.
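
For example, a row could be fetched over HTTP through the Knox gateway. This is a minimal sketch using only the JDK; the gateway host, topology name, credentials, and table are all assumptions, while the /table/rowkey path follows the standard HBase REST layout:

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class RestGet {
    public static void main(String[] args) throws Exception {
        // Knox proxies the HBase REST API; gateway host and topology name are assumed.
        URL url = new URL("https://knox.example.com:8443/gateway/tdp/hbase/Employee/0001");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        String credentials = Base64.getEncoder().encodeToString("user:password".getBytes());
        conn.setRequestProperty("Authorization", "Basic " + credentials);
        conn.setRequestProperty("Accept", "application/json");
        try (InputStream in = conn.getInputStream()) {
            System.out.println(new String(in.readAllBytes()));
        }
    }
}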

Like all other components, HBase is configured in high availability mode: to avoid having a single point of failure, the HMaster is deployed twice on different nodes in an active/passive configuration.

HBase relies heavily on ZooKeeper, which stores the location of the hbase:meta table, a metadata table listing all HBase regions and their associated RegionServers. The HBase client rarely interacts with the HMaster: it asks ZooKeeper where hbase:meta is hosted, then reads that table to find which RegionServer to query.

What to learn next?

Now that you’re familiar with HBase, the HBase tutorial teaches how to import and manage NoSQL data.

Additionally, refer to our Hive presentation to learn how to manage structured data in TDP.