HDFS tutorial

Requirements

This tutorial assumes you are running a cluster based on TDP getting started, an easy-to-launch TDP environment for testing purposes. This deployment provides you with:

  • tdp_user, a user with the ability to kinit for authentication.
  • An edge node accessible via SSH.
  • HDFS directories:
    • /user/tdp_user

Note: If you are using your own TDP deployment, you need to adapt the previous configuration accordingly.

Before beginning the tutorial, connect to the cluster and kinit using the following commands:

# Connect to edge-01.tdp 
vagrant ssh edge-01
# Switch user to tdp_user
sudo su tdp_user
# Authenticate the user with their Kerberos principal and keytab
kinit -kt ~/tdp_user.keytab tdp_user@REALM.TDP
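
If the command succeeds, you can confirm that a ticket was granted by inspecting the Kerberos credentials cache, for example:

# List the Kerberos tickets held in the credentials cache
klist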

HDFS basic commands

The following commands are the most commonly used:

  • hdfs dfs -mkdir <dir_name> - create a directory in HDFS (relative paths are resolved against the user's HDFS home directory)
  • hdfs dfs -put <src> <dst> - upload object(s) to HDFS
  • hdfs dfs -cp <src> <dst> - copy a file
  • hdfs dfs -mv <src> <dst> - move a file (also used to rename a file)
  • hdfs dfs -rm <object> - delete a file (use -r to delete a folder)

The list of all available options can be retrieved by running hdfs dfs -help in the command line. You can also refer to this documentation.
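
For example, the detailed usage of a single command can be displayed by passing its name to -help (here -put, as an illustration):

# Show the detailed usage of the put command
hdfs dfs -help put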

Importing data to HDFS through an edge node

The following sections take place on the edge node. It is a pre-installed workspace, configured with the clients of the cluster services, from which commands can be issued.
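
As a quick sanity check, assuming the HDFS client is configured on the edge node as described above, you can list your HDFS home directory before importing anything (the output may be empty at this point):

# List the content of your HDFS home directory
hdfs dfs -ls /user/tdp_user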

The first step is to create a drivers.csv file with sample data in your home directory:

cat <<EOF > ~/drivers.csv
10,George Vetticaden,621011971,244-4532 Nulla Rd.,N,miles
11,Jamie Engesser,262112338,366-4125 Ac Street,N,miles
12,Paul Coddin,198041975,Ap #622-957 Risus. Street,Y,hours
13,Joe Niemiec,139907145,2071 Hendrerit. Ave,Y,hours
14,Adis Cesir,820812209,Ap #810-1228 In St.,Y,hours
15,Rohit Bakshi,239005227,648-5681 Dui- Rd.,Y,hours
EOF

To verify the content of this new file:

cat ~/drivers.csv

Then, create a dedicated subdirectory on HDFS and upload the file there:

# Create a subdirectory in your HDFS home directory
hdfs dfs -mkdir -p /user/tdp_user/datasets/drivers_data

# Upload the file into the subdirectory
hdfs dfs -put ~/drivers.csv /user/tdp_user/datasets/drivers_data

Verify that the file has been imported correctly by listing the content of the subdirectory:

hdfs dfs -ls /user/tdp_user/datasets/drivers_data

The output should be similar to this:

Found 1 items
-rw-r--r--    1 tdp_user tdp_user   329 2022-05-14 23:51 /user/tdp_user/datasets/drivers_data/drivers.csv
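
Optionally, you can also inspect the imported file directly from HDFS, for example by displaying its content and the space it uses (the paths below assume the subdirectory created earlier):

# Display the content of the file stored in HDFS
hdfs dfs -cat /user/tdp_user/datasets/drivers_data/drivers.csv
# Show the space used by the directory in a human-readable format
hdfs dfs -du -h /user/tdp_user/datasets/drivers_data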

Now that the file is stored in HDFS, there is no need to keep the original copy in your local home directory. Delete it by running:

rm ~/drivers.csv

Importing data directly into HDFS

In most cases, we want to avoid writing files to the edge node's local disk. Instead, data can be piped directly into HDFS.

This time, we use a Parquet file from the NYC Taxi trip records dataset, hosted online:

# Set env variables for yellow taxi records from May 2021
TRIPDATA_MONTH=05
TRIPDATA_YEAR=2021
TRIPDATA_TYPE=yellow_tripdata
# Create subdirectory
hdfs dfs -mkdir -p /user/tdp_user/datasets/nyc_data
# Download dataset and pipe it into HDFS
wget -q -O - "https://s3.amazonaws.com/nyc-tlc/trip+data/${TRIPDATA_TYPE}_${TRIPDATA_YEAR}-${TRIPDATA_MONTH}.parquet" | hdfs dfs -put - "/user/tdp_user/datasets/nyc_data/${TRIPDATA_TYPE}_${TRIPDATA_YEAR}-${TRIPDATA_MONTH}.parquet"
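
As before, you can check that the Parquet file has landed in HDFS, for example by listing the target directory with human-readable sizes:

# List the uploaded Parquet file with a human-readable size
hdfs dfs -ls -h /user/tdp_user/datasets/nyc_data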

Further reading

To read more about inserting data directly into HDFS: Download datasets into HDFS and Hive.

To learn how to store and use more complex data structures, refer to Hive or HBase.