Hive tutorial

Hive was originally released as a data warehousing solution, leveraging a tabular format on top of data stored in HDFS. It enables data to be queried using a SQL-like syntax called Hive-QL.

Note: Hive external tables

Hive differentiates between two types of tables: external tables and managed (also known as internal) tables. This page only covers external tables, which are the preferred way to manage data. We recommend using them over managed tables for reasons developed in a future article.

External tables enable Hive to query data stored in arbitrary HDFS locations. Only the schema and metadata are stored in the Hive metastore, which allows processes outside of Hive to access and manage the external table files. Thus, if a user drops an external table, only the metadata stored in the Hive metastore is deleted; the data remains in its folder on HDFS.
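
For example, using the taxi_trip_ext table created later in this tutorial, dropping the table removes only its definition:

-- Removes the table from the metastore only: the files under
-- /user/tdp_user/data/nyc_green_taxi_trip remain intact in HDFS,
-- and re-running the CREATE EXTERNAL TABLE statement restores access to them
DROP TABLE IF EXISTS tdp_user_db.taxi_trip_ext;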

Requirements

This tutorial assumes you are running a cluster based on TDP getting started, an easy-to-launch TDP environment for testing purposes. This deployment provides you with:

  • tdp_user, a user with the ability to kinit for authentication
  • An edge node accessible via SSH
  • HDFS directories:
    • /user/tdp_user
  • Ranger policies:
    • Hive database: tdp_user_db

Note: If you are using your own TDP deployment, you will need to adapt the configuration above accordingly.

Before beginning the tutorial, connect to the cluster and kinit using the following commands:

# Connect to edge-01.tdp
vagrant ssh edge-01
# Switch user to tdp_user
sudo su tdp_user
# Authenticate the user with their Kerberos principal and keytab
kinit -kt ~/tdp_user.keytab tdp_user@REALM.TDP
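
You can check that a ticket was granted using klist:

# List the Kerberos tickets of the current user
klist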

Downloading the dataset

This tutorial is based on the NYC Green Taxi Trip dataset which needs to be available in HDFS at /user/tdp_user/data/nyc_green_taxi_trip/.

For the sake of simplicity, we provide a script and instructions to do so on the dataset page.
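
Once the dataset is uploaded, you can verify that the files are present in HDFS:

# List the dataset files in HDFS
hdfs dfs -ls /user/tdp_user/data/nyc_green_taxi_trip/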

Creating a table using Beeline

Beeline is a CLI that provides an easy way to interact with tables stored in Hive. It is a JDBC client that submits SQL-like queries, written in Hive-QL, to HiveServer2. To learn more about the Hive architecture, please refer to the Hive page.

To start, open a Beeline session using the beeline command. Note: beeline is an alias for the command below, which connects to Hive through ZooKeeper service discovery:

# Truststore password used for the TLS connection to HiveServer2
export hive_truststore_password=Truststore123!
# Connect to HiveServer2 through ZooKeeper service discovery
/opt/tdp/hive/bin/hive --config /etc/hive/conf.s2 --service beeline -u "jdbc:hive2://master-01.tdp:2181,master-02.tdp:2181,master-03.tdp:2181/;serviceDiscoveryMode=zooKeeper;zooKeeperNamespace=hiveserver2;sslTrustStore=/etc/ssl/certs/truststore.jks;trustStorePassword=${hive_truststore_password}"
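
Note: you can leave the Beeline session at any time with the !quit command.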

You are now able to issue queries written in Hive-QL. Create the tdp_user_db database from the CLI with the following Data Definition Language (DDL) statements:

-- Start from a clean state: remove any previous version of the database
DROP DATABASE IF EXISTS tdp_user_db CASCADE;
-- Create the database under the external warehouse location
CREATE DATABASE IF NOT EXISTS tdp_user_db
LOCATION '/warehouse/tablespace/external/hive';
-- Check that the new database is listed
SHOW DATABASES;
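
Optionally, you can verify the database's location and owner by inspecting its metadata:

-- Display the database's comment, HDFS location and owner
DESCRIBE DATABASE tdp_user_db;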

Create an external table called taxi_trip_ext targeting the dataset folder. The files are stored in Parquet, a self-describing format, so no row delimiter needs to be declared:

CREATE EXTERNAL TABLE tdp_user_db.taxi_trip_ext (
  VendorID INT,
  lpep_pickup_datetime STRING,
  lpep_dropoff_datetime STRING,
  store_and_fwd_flag STRING,
  RatecodeID DOUBLE,
  PULocationID INT,
  DOLocationID INT,
  passenger_count DOUBLE,
  trip_distance DOUBLE,
  fare_amount DOUBLE,
  extra DOUBLE,
  mta_tax DOUBLE,
  tip_amount DOUBLE,
  tolls_amount DOUBLE,
  ehail_fee STRING,
  improvement_surcharge DOUBLE,
  total_amount DOUBLE,
  payment_type DOUBLE,
  trip_type DOUBLE,
  congestion_surcharge DOUBLE
)
STORED AS PARQUET
LOCATION '/user/tdp_user/data/nyc_green_taxi_trip';
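
To confirm that the table was registered as external and points at the dataset folder, inspect its metadata:

-- Table Type should read EXTERNAL_TABLE and Location should match the dataset folder
DESCRIBE FORMATTED tdp_user_db.taxi_trip_ext;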

Check that the table was created correctly by printing its first records:

SELECT * FROM tdp_user_db.taxi_trip_ext LIMIT 3;

Note: This query is a quick way to extract a sample from a dataset. It is executed as a simple fetch task: it is not compiled to MapReduce and doesn't generate any YARN jobs.
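
By contrast, a query that requires actual computation, such as the following aggregation, is compiled into a distributed job (MapReduce or Tez, depending on the configured execution engine) and shows up as a YARN application:

-- Count the trips per payment type; this launches a YARN application
SELECT payment_type, COUNT(*) AS trips
FROM tdp_user_db.taxi_trip_ext
GROUP BY payment_type;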

Further reading

External tables are the recommended way to manage data through Hive.

You now know how to manage data through Hive. To learn more about data management in TDP, refer to HDFS or HBase.