Apache Hive

Hive is an Apache data warehousing project that allows querying datasets stored in HDFS, HBase, and other data stores. Because it builds on Hadoop, Hive inherits the fault tolerance and scalability of the underlying storage. A JDBC/ODBC driver provides the interface between applications and Hive, which compiles HiveQL queries (a SQL-like syntax) into MapReduce jobs and presents files in the Hadoop file system as tables. Hive is well suited to data warehousing operations because it works on structured data.
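
For example, Hive can expose an existing HDFS directory of delimited files as a table and query it with SQL-like syntax. Here is a minimal sketch, where the table name, columns, and /data/sales path are hypothetical, and $HIVE_JDBC_URL stands for a HiveServer2 connection string such as the one shown later in this article:

# Expose an HDFS directory as a table, then query it (names and path are hypothetical)
beeline -u "$HIVE_JDBC_URL" \
  -e "CREATE EXTERNAL TABLE sales (id INT, amount DOUBLE)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/data/sales'" \
  -e "SELECT COUNT(*) FROM sales"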

Why use Hive?

With Hadoop alone, developers had to write complex MapReduce jobs to query data. Hive brought new capabilities and performance improvements such as:

  • Easier access to data through SQL, enabling warehouse capabilities such as data analysis, reporting, and extract/transform/load (ETL) tasks.
  • Enforcing a structure on varying data formats (illustrated below).
  • Access to files in different data storage systems such as HDFS or HBase.
  • Query execution using different frameworks.

This gives developers and DBAs the best of both worlds: bulk processing of massive datasets combined with the ability to write simple queries in a familiar environment.
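
To illustrate the second point above, the same relational structure can be enforced over different file formats; only the storage clause changes. A sketch with hypothetical table names:

# One schema, two underlying file formats (table names are hypothetical)
beeline -u "$HIVE_JDBC_URL" \
  -e "CREATE TABLE events_text (ts STRING, msg STRING) STORED AS TEXTFILE" \
  -e "CREATE TABLE events_orc (ts STRING, msg STRING) STORED AS ORC"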

Hive Architecture

The following diagram shows a classic Hive architecture:

HIVE architecture

Each component is explored individually below. Note that certain components are not used in the TDP Getting Started cluster: for instance, Tez is the main execution engine, and the Thrift client is currently not available. For more information, we suggest reviewing each component:

Hive Clients

Hive has three different clients supporting applications, which can be written in various languages such as Python or C++:

  • JDBC Driver: Enables connections between Hive and Java applications (see the sketch after this list).
  • ODBC Driver: Enables connections to Hive for ODBC-supported applications.
  • Thrift Server: Serves requests from any programming language that supports Thrift.
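
As a sketch of the JDBC path: Beeline, covered below, connects through the JDBC driver. Assuming a HiveServer2 reachable on a hypothetical host at its default port 10000, and leaving out the security options used later in this article:

# Direct JDBC connection (hostname is hypothetical, 10000 is the default HiveServer2 port)
beeline -u "jdbc:hive2://hiveserver2-host:10000/default"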

Hive Services

Hive relies on different services, such as HiveServer2, the CLI through Beeline, and the Metastore, to execute queries. Let’s review these in detail.

Hive Server 2

HiveServer2 is the JDBC/ODBC entry point to Hive. It is a second-generation Hive server that allows remote clients to execute queries, with improved support for JDBC and ODBC. HiveServer2 has four layers of execution: Server, Transport, Protocol, and Processor. More can be read about these layers here.

Hive Metastore

The metastore is the central repository where metadata is stored to expose data storage as a relational model. This enables data abstraction, reducing the information users must provide, and allows for better data discovery, so users can find and query the relevant data in the warehouse.
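
The metadata held by the metastore can be inspected from any client. For example, both of the following statements are answered from metastore information rather than by scanning data (the sales table is the hypothetical one from earlier):

# List tables and show metastore details such as location and storage format
beeline -u "$HIVE_JDBC_URL" \
  -e "SHOW TABLES" \
  -e "DESCRIBE FORMATTED sales"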

Command-line Interface: Beeline

Beeline is a command-line interface to Hive that works in embedded mode, or by connecting to a HiveServer2 process using JDBC.
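
In embedded mode, Beeline runs Hive inside its own process instead of contacting a remote HiveServer2; an empty JDBC URL selects this mode:

# Embedded mode: no remote HiveServer2 involved
beeline -u "jdbc:hive2://"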

You can check the Hive version using Beeline from the edge-01 node:

vagrant ssh edge-01.tdp

To set this up, you have to switch to a valid user, kinit as that user, and export the password needed to access the truststore certificate.

In this case, we use tdp_user, as it is generated with the necessary rights in our quick start:

sudo su tdp_user
kinit -kt /home/tdp_user/tdp_user.keytab tdp_user@REALM.TDP
export hive_truststore_password=Truststore123!
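
You can verify that the Kerberos ticket was obtained before connecting:

# Display the Kerberos ticket cache for the current user
klist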

Now Hive can be accessed from the command line using Beeline in an SSH session:

beeline -u "jdbc:hive2://master-01.tdp:2181,master-02.tdp:2181,master-03.tdp:2181/;serviscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/etc/ssl/certs/truststore.jks;trustStorePassword=${hive_truststore_password}" --showDbInPrompt=true

It prompts you with the following information:

Beeline version 3.1.3-TDP-0.1.0-SNAPSHOT by Apache Hive>
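
From this prompt, HiveQL statements can be issued directly, for instance:

SHOW DATABASES;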

Note: If you want to do the same on your own TDP deployment, retrieve your cluster information as well as a proper user and keytab to access HDFS.

Other Services

  • Hive Driver: Receives queries from different sources (web UI, CLI, Thrift, and the JDBC/ODBC driver) and transfers them to the compiler.
  • Hive Compiler: Parses the query and performs semantic analysis on the different query blocks and expressions. It translates HiveQL statements into MapReduce jobs.
  • Optimizer: Performs transformation tasks on the execution plan, splitting it to improve efficiency and scalability.
  • Apache Ranger: Used for fine-grained access control and delegation to resources.

Hive Storage & Compute

Execution Engine

Hive can use the Apache Hadoop MapReduce or Apache Tez frameworks as its execution backend. The optimizer generates the logical plan in the form of a DAG of map-reduce tasks and HDFS tasks. Apache Tez reduces overhead compared with MapReduce.
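
The engine can be switched per session through the standard hive.execution.engine property; a sketch, reusing the hypothetical sales table from earlier:

# Run a query on Tez instead of MapReduce for this session
beeline -u "$HIVE_JDBC_URL" \
  -e "SET hive.execution.engine=tez" \
  -e "SELECT COUNT(*) FROM sales"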

Storage

Large dataset files are stored directly in either the Apache Hadoop Distributed File System (HDFS) or other data storage systems such as Apache HBase. In the TDP Getting Started cluster, both are available.
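
Managed table data ends up under the Hive warehouse directory on HDFS, whose location is set by the hive.metastore.warehouse.dir property. A sketch for inspecting it, assuming the classic default path:

# List table data files in the warehouse (path depends on hive.metastore.warehouse.dir)
hdfs dfs -ls /user/hive/warehouse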

What to learn next?

A Hive tutorial is provided to guide you through creating and using Hive tables.

Also, learn more about TDP components with the following articles: