Tutorial: managing NoSQL data with HBase
Apache HBase is an open-source, column-oriented database that brings random, real-time read/write access to data stored in HDFS.
This page provides a description of the HBase data model, gives instructions for performing the most common actions on it, and presents a use case of a file import from HDFS.
Refer to our HBase Basics documentation to learn more about HBase architecture and use cases.
Requirements
This tutorial assumes you are running a cluster based on TDP getting started, an easy-to-launch TDP environment for testing purposes. This deployment provides you with:
- tdp_user, a user with the ability to kinit for authentication.
- An edge node accessible by SSH.
- HDFS directories: /user/tdp_user
- Ranger policies for the HBase table testTable
Note: If you are using your own TDP deployment, you need to adapt the previous configuration accordingly.
Before beginning the tutorial, connect to the cluster and kinit
using the following commands:
# Connect to edge-01.tdp
vagrant ssh edge-01
# Switch user to tdp_user
sudo su tdp_user
# Authenticate the user with their Kerberos principal and keytab
kinit -kt ~/tdp_user.keytab tdp_user@REALM.TDP
Data model
HBase is a column-oriented database organized as:
- Table: A collection of rows.
- Row: A collection of one or more columns. Rows are identified by a row key, generated by the application from a given pattern. E.g.: a row key can be [timestamp][hostname][log-event].
- Column: A column is composed of a column family and a column qualifier separated by a colon (:).
- Column family: A column family defines options such as time to live, number of versions to keep, compression, etc. Column families are defined at table creation, which means that every row in a table has the same set of column families (whose columns can be empty).
- Column qualifier: One or more column qualifiers can be added to a column family to provide a label for the value. They are defined by the application and are mutable. E.g.: given the column family content, column qualifiers can be html or pdf. We would then have content:html and content:pdf columns.
- Cell: A versioned key-value pair. The key is made of coordinates (a row key and a column) and a timestamp used as the version.
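Conceptually, this model behaves like a sorted, nested map: table, then row key, then column, then timestamp. The following is a minimal Python sketch of that logical structure only; it is not an HBase client, and the put/get helpers are hypothetical names introduced here for illustration.

```python
# Illustrative sketch of HBase's logical model as a nested map:
# table -> row key -> "family:qualifier" -> timestamp -> value
table = {}

def put(table, row_key, column, value, ts):
    """Insert a versioned cell at (row_key, column, ts)."""
    table.setdefault(row_key, {}).setdefault(column, {})[ts] = value

def get(table, row_key, column):
    """Return the most recent version of a cell."""
    versions = table[row_key][column]
    return versions[max(versions)]  # highest timestamp wins

put(table, "row1", "content:html", "<html>v1</html>", 100)
put(table, "row1", "content:html", "<html>v2</html>", 200)
print(get(table, "row1", "content:html"))  # latest version: <html>v2</html>
```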
Read HBase Basics to learn more about how data is stored in HBase.
Data access optimization
HBase is a key-value store in which rows are sorted lexicographically by row key. Thus, designing a good row key allows storing related rows close to each other. This is one of the core concepts of HBase: providing scan optimization by never doing full table scans.
An example of a good row key convention is to lead with the timestamp followed by the inverted domain. Scans are more efficient when the row key is 2022-03-17_io.alliage.www_log.connexion instead of connexion.log_www.alliage.io_17-03-2022, because all rows for a given day are then stored contiguously.
Note: Row keys are immutable. A row needs to be deleted and created again to change its key.
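The effect of this ordering can be checked outside HBase. Below is a small Python illustration using hypothetical row keys in both conventions; it only demonstrates lexicographic sorting, which is what HBase applies to row keys.

```python
# Rows are stored sorted lexicographically by row key.
# Leading with the date groups all rows of the same day together.
good = [
    "2022-03-17_io.alliage.www_log.connexion",
    "2022-03-17_io.alliage.www_log.error",
    "2022-03-18_io.alliage.www_log.connexion",
]
bad = [
    "connexion.log_www.alliage.io_17-03-2022",
    "error.log_www.alliage.io_17-03-2022",
    "connexion.log_www.alliage.io_18-03-2022",
]

# With the "good" convention, a scan over the prefix "2022-03-17"
# reads one contiguous range of rows:
day_rows = [k for k in sorted(good) if k.startswith("2022-03-17")]
print(day_rows)

# With the "bad" convention, the same day's rows are scattered
# across the sorted order (a row for the 18th sits between them):
print(sorted(bad))
```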
The HBase shell
The easiest way to interact with HBase is using the HBase shell. It is a REPL (read-eval-print loop), an environment that takes user input (read), evaluates it (eval), and returns the result to the user (print).
The HBase shell allows you to manage tables and data from the terminal.
Basic HBase commands
Once logged in and authenticated on the edge node of your TDP cluster, access the HBase shell:
hbase shell
This command opens a shell provided by HBase. The most used commands are:
- exit: logs out and exits the shell.
- help: gives help on the main commands of the shell.
- status: lists the number of HBase servers used.
- version: gives the HBase version used.
- table_help: provides help for table-reference commands.
- whoami: provides information about the current user.
Create a table
Tables are created using the create
command with the table name and a column family:
hbase(main):001:0> create 'testTable', 'column_family'
Display HBase tables
List all the tables created in HBase using the list
command:
hbase(main):001:0> list
Print the description of your table using the describe
command:
hbase(main):001:0> describe 'testTable'
Insert data in a table
To add data in a table, use the put
command to insert rows. Populate our empty table:
hbase(main):001:0> put 'testTable', 'rowkey', 'column_family:column_qualifier', 'my_value'
The above command creates a cell with the value my_value, accessible through its coordinates (the row key and the column). If the cell already exists, a new version of the value is created.
Read data
Retrieve a single row
The get
command is used to retrieve a single row of a table:
hbase(main):001:0> get 'testTable', 'rowkey'
It is also possible to use the get
command to read a specific column:
hbase(main):001:0> get 'testTable', 'rowkey', {COLUMN => 'column_family:column_qualifier'}
Read the whole table
Additionally, the scan
command is used to get all the data of a table:
hbase(main):001:0> scan 'testTable'
Dealing with versions
Define the number of allowed versions for a column family using the alter
command:
hbase(main):001:0> alter 'testTable', NAME => 'column_family', VERSIONS => 2
Note: the number of versions of a column family can also be defined at table creation (e.g. create 'testTable', {NAME => 'cf', VERSIONS => 10}).
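The retention behavior can be pictured as a map of timestamps to values that is trimmed on write. The Python sketch below is a simplification under that assumption: real HBase evicts excess versions lazily, at flush and compaction time, not synchronously on each put.

```python
# Illustrative sketch: a cell keeping at most VERSIONS timestamped
# values, like a column family altered with VERSIONS => 2.
VERSIONS = 2
cell = {}  # timestamp -> value

def put(cell, ts, value):
    cell[ts] = value
    # Evict the oldest versions beyond the retention limit
    # (HBase does this lazily; here we do it eagerly for clarity).
    while len(cell) > VERSIONS:
        del cell[min(cell)]

put(cell, 100, "my_value")
put(cell, 200, "my_new_value")
put(cell, 300, "my_newer_value")
print(sorted(cell))     # only the 2 most recent timestamps remain
print(cell[max(cell)])  # a plain get returns the latest value
```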
Add a new entry at the same coordinates as the first one:
hbase(main):001:0> put 'testTable', 'rowkey', 'column_family:column_qualifier', 'my_new_value'
See the previous versions by adding the VERSIONS
parameter to the get
command:
hbase(main):001:0> get 'testTable', 'rowkey', {COLUMN => 'column_family:column_qualifier', VERSIONS => 2}
Note: version retrieval also works with the scan command (e.g. scan 'testTable', {VERSIONS => 4}).
To find the value of a specific version, use the TIMESTAMP
parameter:
hbase(main):001:0> get 'testTable', 'rowkey', {COLUMN => 'column_family:column_qualifier', TIMESTAMP => 1234567890123}
Delete data in a table
Use the delete command to remove a specific cell from a table. Note that the timestamp is a number, not a quoted string:
hbase(main):001:0> delete 'testTable', 'rowkey', 'column_family:column_qualifier', 1234567890123
To delete all the cells in a row, use the deleteall
command:
hbase(main):001:0> deleteall 'testTable', 'rowkey'
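The difference between the two commands can be sketched on the nested-map view of a row. This is illustrative Python, not an HBase API: delete drops one cell, deleteall drops the whole row (real HBase implements both with tombstone markers rather than immediate removal).

```python
# Illustrative: delete removes one cell, deleteall removes the row.
table = {
    "rowkey": {
        "column_family:a": {100: "v1"},
        "column_family:b": {100: "v2"},
    },
}

def delete(table, row_key, column):
    """Remove a single cell from a row."""
    table[row_key].pop(column, None)

def deleteall(table, row_key):
    """Remove every cell of a row."""
    table.pop(row_key, None)

delete(table, "rowkey", "column_family:a")
print(list(table["rowkey"]))  # only column_family:b remains
deleteall(table, "rowkey")
print("rowkey" in table)      # False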
Delete a table
A table must first be disabled before attempting to delete it. Use the disable
command to disable the testTable:
hbase(main):001:0> disable 'testTable'
Check if a table is enabled using the is_enabled
command:
hbase(main):001:0> is_enabled 'testTable'
Note: Attempting to delete enabled tables prints the error: ERROR: Table testTable is enabled. Disable it first.
Delete it with the drop
command:
hbase(main):001:0> drop 'testTable'
Check if the table is deleted using the exists
command:
hbase(main):001:0> exists 'testTable'
Namespace management
Namespaces are used to group tables logically within HBase. They serve resource management, security, and isolation. For example, a namespace can be created to group tables in order to hand out specific permissions to users (e.g. allow a user to only read the data inside the tables).
Note: all users have access to the namespace and table names, whatever their access level. Permissions only apply to the data, not to the model.
Note: tdp_user can’t execute the commands in this section. You need a user with administrator rights to use namespaces commands.
Create a namespace
Using the create_namespace
command, create a new namespace:
create_namespace 'namespace_name'
Create an HBase table in that namespace:
hbase(main):001:0> create 'namespace_name:table_name','column_family'
If no namespace is specified, the table is created in the default namespace.
List tables created under a namespace
Display all available namespaces with the list_namespace
command:
hbase(main):001:0> list_namespace
The list_namespace_tables
command displays all the tables created in a given namespace:
hbase(main):001:0> list_namespace_tables 'namespace_name'
Delete namespace
The drop_namespace
command deletes a namespace:
hbase(main):001:0> drop_namespace 'namespace_name'
Note: Only empty namespaces can be dropped. Thus, if you want to drop any namespace, you first need to drop all the tables created inside of it.
Use case of a data import and filter applications
This section provides instructions to get started with HBase by importing a dataset from HDFS into HBase and then applying filters on the HBase table. We use the IMDb title.basics
table as an example.
Import data from HDFS to HBase
Step 1: Downloading the dataset
The dataset needs to be available in HDFS at /user/tdp_user/data/imdb/
. For the sake of simplicity, we provide a script to do so on the IMDb page.
Step 2: HBase table creation
Launch the HBase shell:
hbase shell
Create a new HBase table for the dataset:
hbase(main):001:0> create 'testTable', 'im'
Check if the table has been created:
hbase(main):001:0> list
Step 3: Import data into HBase
Leave the HBase shell by typing exit
. Load data into HBase using the following:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
-Dimporttsv.separator="," \
-Dimporttsv.columns="HBASE_ROW_KEY,im:tconst,im:titleType,im:primaryTitle,im:originalTitle,im:isAdult,im:startYear,im:endYear,im:runtimeMinutes,im:genres" \
testTable /user/tdp_user/data/imdb/title.basics.tsv
- org.apache.hadoop.hbase.mapreduce.ImportTsv: imports a delimited file into HBase using a MapReduce job. Takes options, a table name, and a file path as parameters.
- -Dimporttsv.separator: option to specify the delimiter of the file (a comma here).
- -Dimporttsv.columns: option to specify the column names.
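The column mapping ImportTsv applies can be mimicked line by line. Below is a hedged Python sketch of that mapping only; the shortened column list, the sample line, and the to_put helper are assumptions introduced for illustration, not part of ImportTsv or the actual dataset.

```python
import csv
import io

# The column spec, as passed to -Dimporttsv.columns: the field in the
# HBASE_ROW_KEY position becomes the row key, the others become columns.
columns = ["HBASE_ROW_KEY", "im:tconst", "im:titleType", "im:primaryTitle"]

def to_put(line, columns, sep=","):
    """Turn one delimited line into (row_key, {column: value})."""
    fields = next(csv.reader(io.StringIO(line), delimiter=sep))
    row_key = fields[columns.index("HBASE_ROW_KEY")]
    cells = {c: v for c, v in zip(columns, fields) if c != "HBASE_ROW_KEY"}
    return row_key, cells

row_key, cells = to_put("tt0000001,tt0000001,short,Carmencita", columns)
print(row_key)                 # tt0000001
print(cells["im:titleType"])   # short
```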
Print the inserted data in the HBase table using scan 'testTable'. Don't forget to log back into the HBase shell first (hbase shell command).
Apply filters
By default, scan
reads the entire table from start to end. It’s possible to limit the scan
by using a filter. HBase includes several filter types and allows you to create your own. Some examples of basic filters are:
- PrefixFilter: filters on a specific row key prefix.
- TimestampsFilter: filters on specific timestamps.
- FamilyFilter or QualifierFilter: filter on a specific column family or column qualifier.
- SingleColumnValueFilter: filters on a specific column (combination of column family and column qualifier).
Full table scans aren't viable when dealing with big data. Using HBase filters along with the scan command avoids them. Filters provide push-down optimization: they are applied directly where the data is stored, instead of fetching all the data and filtering it locally. Queries should be bounded to a range, with boundaries controlled by the user.
Filter example
Using the HBase shell, scan the table while keeping the im:isStart
column only:
scan 'testTable', {COLUMNS=>'im:isStart', LIMIT => 10}
Note: HBase is made for big data. Never do a full scan of a table. Always define a limit (using the LIMIT
parameter) or a range (using the TimestampsFilter
for example).
The result is empty as im:isStart
is not defined for many rows. We want to display the im:runtimeMinutes
while keeping only the rows where im:isStart
is defined:
import org.apache.hadoop.hbase.filter.CompareFilter
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter
import org.apache.hadoop.hbase.filter.BinaryComparator
import org.apache.hadoop.hbase.util.Bytes
scan 'testTable',
{ COLUMNS => 'im:runtimeMinutes',
LIMIT => 10,
FILTER => SingleColumnValueFilter.new(
Bytes.toBytes('im'), Bytes.toBytes('isStart'),
CompareFilter::CompareOp.valueOf('NOT_EQUAL'),
BinaryComparator.new(Bytes.toBytes('\N')))
}
Note: Some classes are required to use a filter. The HBase shell supports Java imports.
- SingleColumnValueFilter: specifies the column, the operator, and the operand.
- CompareFilter::CompareOp.valueOf: specifies the operator (e.g. EQUAL, NOT_EQUAL, GREATER_OR_EQUAL, LESS, ...).
- BinaryComparator: specifies the operand that the cell value is compared against.
Further reading
As with the HBase shell, you can achieve these results with your favorite programming language. Native clients are available for Java, C++, Scala, and Python. Other languages may use the HBase Thrift and REST interfaces, like the HBase client for Node.js written by Adaltas.
Refer to HBase Basics to learn about HBase architecture and use cases. For further reading on how to store and use other data structures, refer to HDFS Basics or Hive Basics.