Tutorial: managing NoSQL data with HBase

Apache HBase is an open-source column-oriented database that brings random, real-time read/write access to data stored in HDFS.

This page provides a description of the HBase data model, gives instructions for performing the most common actions on it, and presents a use case of a file import from HDFS.

Refer to our HBase Basics documentation to learn more about HBase architecture and use cases.

Requirements

This tutorial assumes you are running a cluster based on TDP getting started, an easy-to-launch TDP environment for testing purposes. This deployment provides you with:

  • tdp_user, a user with the ability to kinit for authentication.
  • An edge node accessible by SSH
  • HDFS directories:
    • /user/tdp_user
  • Ranger policies:
    • HBase table testTable

Note: If you are using your own TDP deployment, you need to adapt the previous configuration accordingly.

Before beginning the tutorial, connect to the cluster and kinit using the following commands:

# Connect to edge-01.tdp
vagrant ssh edge-01
# Switch user to tdp_user
sudo su tdp_user
# Authenticate the user with its Kerberos principal and keytab
kinit -kt ~/tdp_user.keytab tdp_user@REALM.TDP

Data model

HBase is a column-oriented database organized as:

  • Table: A collection of rows.
  • Row: A collection of one or more columns. Rows are identified by a row key, generated by the application from a given pattern.
    E.g.: a row key can be [timestamp][hostname][log-event]
  • Column: A column is composed of a column family and a column qualifier separated by a colon (:).
    • Column family: A column family defines options such as time to live, number of versions to keep, compression, etc. Column families are defined at table creation, which means that every row in a table has the same column families (which can be empty).
    • Column qualifier: One or more column qualifiers can be added to column families to provide a label for the value. They are defined by the application and are mutable.
      E.g.: given the column family content, column qualifiers can be html or pdf. We would have content:html and content:pdf columns.
  • Cell: A versioned key-value pair. The key is made of coordinates (a row key and a column) and a timestamp used as the version.
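
The hierarchy above can be pictured as nested maps. The following is a minimal plain-Python sketch of that idea, not HBase client code: the `table` structure and the `put` and `get_latest` helpers are hypothetical names introduced only for illustration.

```python
from collections import defaultdict

# Hypothetical in-memory model of one table:
# row key -> column ("family:qualifier") -> {timestamp: value}
table = defaultdict(lambda: defaultdict(dict))


def put(row_key, column, value, timestamp):
    """Store one cell version at the given coordinates."""
    table[row_key][column][timestamp] = value


def get_latest(row_key, column):
    """Return the value with the highest timestamp, or None if absent."""
    versions = table.get(row_key, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]


# Two versions of content:html and one version of content:pdf in the same row.
put("row1", "content:html", "<html>v1</html>", timestamp=1)
put("row1", "content:html", "<html>v2</html>", timestamp=2)
put("row1", "content:pdf", "%PDF", timestamp=1)

print(get_latest("row1", "content:html"))
print(sorted(table["row1"]))
```

Each cell is reached through its coordinates (row key, then column), and the timestamp selects the version.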

Read HBase Basics to learn more about how data is stored in HBase.

Data access optimization

HBase is a key-value store in which rows are sorted lexicographically by row key. Thus, designing a good row key allows storing related rows close to each other. This is one of the core concepts of HBase: providing scan optimization by never doing full table scans.

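An example of a good row key convention is to start with the date, written from the most significant unit (year) to the least significant (day), followed by the reversed domain name. Rows from the same day and the same domain are then stored next to each other, so scans are more efficient when the row key is 2022-03-17_io.alliage.www_log.connexion instead of connexion.log_www.alliage.io_17-03-2022.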
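
To see why the ordering of the key parts matters, here is a small plain-Python sketch that sorts keys lexicographically, as HBase does, and reads a contiguous range by prefix. Only the first key of each list comes from the example above; the other keys and the `prefix_scan` helper are made up for illustration.

```python
from bisect import bisect_left


def prefix_scan(sorted_keys, prefix):
    """Return the contiguous slice of keys that start with prefix."""
    start = bisect_left(sorted_keys, prefix)
    end = start
    while end < len(sorted_keys) and sorted_keys[end].startswith(prefix):
        end += 1
    return sorted_keys[start:end]


# Date first (year, month, day), then reversed domain.
good_keys = sorted([
    "2022-03-17_io.alliage.www_log.connexion",
    "2022-03-18_io.alliage.www_log.connexion",
    "2022-03-17_io.alliage.api_log.connexion",
    "2021-12-01_io.alliage.www_log.connexion",
])

# Event first, date last and written day-first.
bad_keys = sorted([
    "connexion.log_www.alliage.io_17-03-2022",
    "connexion.log_www.alliage.io_18-03-2022",
    "connexion.log_api.alliage.io_17-03-2022",
    "connexion.log_www.alliage.io_01-12-2021",
])

# All rows of 2022-03-17 form one contiguous range.
print(prefix_scan(good_keys, "2022-03-17_"))

# With the second convention, rows of the same day are scattered.
print([i for i, key in enumerate(bad_keys) if key.endswith("17-03-2022")])
```

With the first convention, one day of data is a single range of adjacent rows. With the second, the rows of a given day are separated by rows from other days, so no prefix can select them.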

Note: Row keys are immutable. A row needs to be deleted and created again to change its key.

The HBase shell

The easiest way to interact with HBase is using the HBase shell. It is a REPL (read–eval–print loop), an environment that takes user input (read), evaluates it (eval), and returns the result to the user (print).

The HBase shell allows you to manage tables and data from the terminal.

Basic HBase commands

Once logged in and authenticated on the edge node of your TDP cluster, open the HBase shell:

hbase shell

This command opens a shell provided by HBase. The most used commands are:

  • exit: logs out and exits the shell.
  • help: gives help with the main commands of the shell.
  • status: lists the number of HBase servers used.
  • version: gives the HBase version used.
  • table_help: provides help for table-reference commands.
  • whoami: provides information about the current user.

Create a table

Tables are created using the create command with the table name and a column family:

hbase(main):001:0> create 'testTable', 'column_family'

Display HBase tables

List all the tables created in HBase using the list command:

hbase(main):001:0> list

Print the description of your table using the describe command:

hbase(main):001:0> describe 'testTable'

Insert data in a table

To add data to a table, use the put command to insert rows. Populate the empty table:

hbase(main):001:0> put 'testTable', 'rowkey', 'column_family:column_qualifier', 'my_value'

The above command creates a cell with the value my_value, accessible with its coordinates (the row key and the column). If the cell already exists, a new version of the value is created.

Read data

Retrieve a single row

The get command is used to retrieve a single row of a table:

hbase(main):001:0> get 'testTable', 'rowkey'

It is also possible to use the get command to read a specific column:

hbase(main):001:0> get 'testTable', 'rowkey', {COLUMN => 'column_family:column_qualifier'}

Read the whole table

Additionally, the scan command is used to get all the data of a table:

hbase(main):001:0> scan 'testTable'

Dealing with versions

Define the number of allowed versions for a column family using the alter command:

hbase(main):001:0> alter 'testTable', NAME => 'column_family', VERSIONS => 2

Note: the number of versions of a column family can also be defined at table creation (e.g. create 'testTable', {NAME => 'cf', VERSIONS => 10}).

Add a new entry at the same coordinate as the first one:

hbase(main):001:0> put 'testTable', 'rowkey', 'column_family:column_qualifier', 'my_new_value'

See the previous versions by adding the VERSIONS parameter to the get command:

hbase(main):001:0> get 'testTable', 'rowkey', {COLUMN => 'column_family:column_qualifier', VERSIONS => 2}

Note: retrieving versions also works with the scan command (e.g. scan 'testTable', {VERSIONS => 4}).

To find the value of a specific version, use the TIMESTAMP parameter:

hbase(main):001:0> get 'testTable', 'rowkey', {COLUMN => 'column_family:column_qualifier', TIMESTAMP => 1234567890123}
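
The behaviour of versions can be summarised with a small plain-Python model. This is a sketch under simplifying assumptions, not HBase code: the `VersionedCell` class is hypothetical, and it drops old versions immediately on write for simplicity, which is not necessarily when a real cluster physically removes them.

```python
class VersionedCell:
    """Hypothetical model of one cell that keeps at most max_versions values."""

    def __init__(self, max_versions=1):
        self.max_versions = max_versions
        self.versions = {}  # timestamp -> value

    def put(self, value, timestamp):
        """Add a version, then drop the oldest ones beyond the limit."""
        self.versions[timestamp] = value
        for ts in sorted(self.versions, reverse=True)[self.max_versions:]:
            del self.versions[ts]

    def get(self, versions=1):
        """Return up to `versions` (timestamp, value) pairs, newest first."""
        return sorted(self.versions.items(), reverse=True)[:versions]

    def get_at(self, timestamp):
        """Return the value stored at exactly this timestamp, or None."""
        return self.versions.get(timestamp)


cell = VersionedCell(max_versions=2)   # like VERSIONS => 2 on the family
cell.put("my_value", timestamp=100)
cell.put("my_new_value", timestamp=200)
cell.put("my_third_value", timestamp=300)

print(cell.get())             # newest version only
print(cell.get(versions=2))   # the two retained versions
print(cell.get_at(100))       # oldest version is no longer available
```

The timestamps 100, 200 and 300 are arbitrary; in the shell, the timestamp of each version is displayed next to the value returned by get.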

Delete data in a table

Use the delete command to remove a specific cell from a table. The last argument is the timestamp of the cell, given as a number:

hbase(main):001:0> delete 'testTable', 'rowkey', 'column_family:column_qualifier', 1234567890123

To delete all the cells in a row, use the deleteall command:

hbase(main):001:0> deleteall 'testTable', 'rowkey'

Delete a table

A table must first be disabled before attempting to delete it. Use the disable command to disable the testTable:

hbase(main):001:0> disable 'testTable'

Check if a table is enabled using the is_enabled command:

hbase(main):001:0> is_enabled 'testTable'

Note: Attempting to delete enabled tables prints the error: ERROR: Table testTable is enabled. Disable it first.

Delete it with the drop command:

hbase(main):001:0> drop 'testTable'

Check if the table is deleted using the exists command:

hbase(main):001:0> exists 'testTable'

Namespace management

Namespaces group tables logically inside HBase. They are used for resource management, security, and isolation. For example, a namespace can be created to group tables and hand out specific permissions to users (e.g. allow a user to only read the data inside its tables).

Note: all users have access to the namespace and table names, whatever their access level. Permissions only apply to the data and not to the model.

Note: tdp_user can’t execute the commands in this section. You need a user with administrator rights to use namespace commands.

Create a namespace

Using the create_namespace command, create a new namespace:

hbase(main):001:0> create_namespace 'namespace_name'

Create an HBase table in that namespace:

hbase(main):001:0> create 'namespace_name:table_name','column_family'

If no namespace is specified, the table is created in the default namespace.
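
The naming rule can be expressed in a few lines of plain Python. The `split_table_name` helper is a hypothetical illustration of the convention described above, not part of HBase.

```python
def split_table_name(name):
    """Split 'namespace:table' into (namespace, table).

    A name without a namespace prefix belongs to the 'default' namespace.
    """
    namespace, separator, table = name.partition(":")
    if not separator:
        return "default", name
    return namespace, table


print(split_table_name("namespace_name:table_name"))
print(split_table_name("testTable"))
```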

List tables created under a namespace

Display all available namespaces with the list_namespace command:

hbase(main):001:0> list_namespace

The list_namespace_tables command displays all the tables created in a given namespace:

hbase(main):001:0> list_namespace_tables 'namespace_name'

Delete namespace

The drop_namespace command deletes a namespace:

hbase(main):001:0> drop_namespace 'namespace_name'

Note: Only empty namespaces can be dropped. To drop a namespace, you first need to drop all the tables created inside it.

Use case of a data import and filter applications

This section provides instructions to get started with HBase by importing a dataset from HDFS into HBase and then applying filters on the HBase table. We use the IMDb title.basics table as an example.

Import data from HDFS to HBase

Step 1: Downloading the dataset

The dataset needs to be available in HDFS at /user/tdp_user/data/imdb/. For the sake of simplicity, we provide a script to do so on the IMDb page.

Step 2: HBase table creation

Launch the HBase shell:

hbase shell

Create a new HBase table for the dataset:

hbase(main):001:0> create 'testTable', 'im'

Check if the table has been created:

hbase(main):001:0> list

Step 3: Import data into HBase

Leave the HBase shell by typing exit. Load data into HBase using the following:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator="," \
  -Dimporttsv.columns="HBASE_ROW_KEY, im:tconst, im:titleType, im:primaryTitle, im:originalTitle, im:isAdult, im:startYear, im:endYear, im:runtimeMinutes, im:genres" \
  testTable /user/tdp_user/data/imdb/title.basics.tsv

  • org.apache.hadoop.hbase.mapreduce.ImportTsv: imports a TSV file into HBase using a MapReduce job. It takes options, the table name, and the file path as parameters.
  • -Dimporttsv.separator: specifies the field delimiter of the file (, here).
  • -Dimporttsv.columns: maps each field of the file, in order, to a column. HBASE_ROW_KEY marks the field used as the row key.
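
The mapping between a line of the file and the resulting cells can be sketched in plain Python. This only illustrates the idea and is not how ImportTsv is implemented: the `map_line` helper, the shortened column specification and the sample line are assumptions made for the example, and the sketch raises an error on a malformed line.

```python
def map_line(line, columns_spec, separator):
    """Turn one delimited line into (row_key, {column: value})."""
    columns = [name.strip() for name in columns_spec.split(",")]
    fields = line.rstrip("\n").split(separator)
    if len(fields) != len(columns):
        raise ValueError("expected %d fields, got %d" % (len(columns), len(fields)))

    row_key = None
    cells = {}
    for column, field in zip(columns, fields):
        if column == "HBASE_ROW_KEY":
            row_key = field          # this field identifies the row
        else:
            cells[column] = field    # every other field becomes a cell
    if row_key is None:
        raise ValueError("columns_spec must contain HBASE_ROW_KEY")
    return row_key, cells


# Shortened, hypothetical specification and sample line.
spec = "HBASE_ROW_KEY, im:titleType, im:primaryTitle"
row_key, cells = map_line("tt0000001,short,Carmencita", spec, separator=",")

print(row_key)
print(cells)
```

The number of entries in the column specification has to match the number of fields on each line, since fields are assigned to columns by position.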

Print the inserted data in the HBase table using scan 'testTable'. Don’t forget to reopen the HBase shell first (hbase shell command).

Apply filters

By default, scan reads the entire table from start to end. It’s possible to limit the scan by using a filter. HBase includes several filter types and allows you to create your own. Some examples of basic filters are:

  • PrefixFilter: keeps the rows whose row key starts with a given prefix.
  • TimestampsFilter: keeps the cells with specific timestamps.
  • FamilyFilter or QualifierFilter: keeps the cells of a specific column family or column qualifier.
  • SingleColumnValueFilter: keeps the rows according to the value of a specific column (combination of column family and column qualifier).

Full table scans aren’t viable when dealing with big data. Using HBase filters along with the scan command avoids them. Filters provide push-down optimization: they are applied directly where the data is stored, so only the matching data is returned instead of everything being fetched and filtered locally. Queries must be bound to a range, with boundaries fixed by the user.

Filter example

Using the HBase shell, scan the table while keeping the im:endYear column only:

scan 'testTable', {COLUMNS => 'im:endYear', LIMIT => 10}

Note: HBase is made for big data. Never do a full scan of a table. Always define a limit (using the LIMIT parameter) or a range (using the TimestampsFilter for example).

Most of the returned values are \N, the marker used by IMDb for a missing field, as im:endYear is not defined for many rows. We want to display im:runtimeMinutes alongside im:endYear while keeping only the rows where im:endYear is defined:

import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.util.Bytes
# The tested column (im:endYear) must be part of the scanned columns,
# otherwise the filter treats it as missing and keeps every row.
scan 'testTable',
  { COLUMNS => ['im:runtimeMinutes', 'im:endYear'],
    LIMIT => 10,
    FILTER => SingleColumnValueFilter.new(
      Bytes.toBytes('im'), Bytes.toBytes('endYear'),
      CompareFilter::CompareOp.valueOf('NOT_EQUAL'),
      BinaryComparator.new(Bytes.toBytes('\N')))
  }

Note: Some classes are required to use a filter. The HBase shell supports Java imports.

  • SingleColumnValueFilter specifies the column, the operator, and the operand.
  • CompareFilter::CompareOp.valueOf specifies the operator (e.g. EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, LESS…).
  • BinaryComparator specifies the operand against which the value of the column is compared.
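
The logic of a single-column value filter can be sketched in plain Python. This is a simplified model rather than the HBase implementation: the `single_column_value_filter` function and the three sample rows are hypothetical, and the `filter_if_missing` parameter mirrors the option of the same name on the HBase filter, which decides what happens to rows that do not contain the tested column.

```python
def single_column_value_filter(rows, column, predicate, filter_if_missing=False):
    """Keep the rows whose `column` value satisfies `predicate`.

    Rows without the column are kept unless filter_if_missing is True.
    """
    kept = {}
    for row_key, cells in rows.items():
        if column not in cells:
            if not filter_if_missing:
                kept[row_key] = cells
            continue
        if predicate(cells[column]):
            kept[row_key] = cells
    return kept


# Hypothetical rows: r"\N" is the two-character marker for a missing value.
rows = {
    "tt1": {"im:endYear": r"\N", "im:runtimeMinutes": "1"},
    "tt2": {"im:endYear": "1990", "im:runtimeMinutes": "45"},
    "tt3": {"im:runtimeMinutes": "30"},  # column absent from this row
}


def not_null(value):
    """NOT_EQUAL comparison against the missing-value marker."""
    return value != r"\N"


lenient = single_column_value_filter(rows, "im:endYear", not_null)
strict = single_column_value_filter(rows, "im:endYear", not_null,
                                    filter_if_missing=True)

print(sorted(lenient))
print(sorted(strict))
```

The difference between the two results shows why a row that lacks the tested column can still appear in the output of a filtered scan.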

Further reading

As with the HBase shell, you can achieve these results with your favorite programming language. Native clients are available for Java, C++, Scala and Python. Other languages may use the HBase Thrift and REST interfaces, like the HBase client for Node.js written by Adaltas.

Refer to HBase Basics to learn about HBase architecture and use cases. For further reading on how to store and use other data structures, refer to HDFS Basics or Hive Basics.