Datasets used across the tutorials
Some TDP services, such as Hive or HDFS, need datasets in order to be tested. This page lists the datasets used in Alliage Academy tutorials.
Information on how to download and use these datasets is provided on their respective pages.
What kind of data?
TDP and its services handle many types of data. Alliage Academy tutorials and analyses focus on three types of datasets:
- Structured / Relational: data organized with predefined schemas and relationships. This is the base structure of a relational database managed by an RDBMS.
- Semi-structured: data that is not stored in tabular form but is self-describing, such as JSON or XML.
- Time series: a sequence of timestamped data points. It is mainly used for forecasting.
This variety is reflected in the selection of datasets used across our tutorials.
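To make the three categories above concrete, here is a minimal Python sketch that builds one tiny sample of each. All field names and values are made up for illustration and do not come from the actual tutorial datasets.

```python
import csv
import io
import json
from datetime import datetime

# Structured / relational: every row follows the same predefined columns.
# The column names below are hypothetical.
structured_csv = "title_id,title,year\nt1,Example Movie,1999\nt2,Another Movie,2005\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: no fixed table layout, but each record carries its own
# field names (self-describing) and may nest lists or objects.
semi_structured = json.loads(
    '{"title": "Example Movie", "cast": [{"name": "A. Actor"}], "tags": ["drama"]}'
)

# Time series: values indexed by a timestamp, kept in chronological order.
time_series = [
    (datetime(2024, 1, 1, 9, 0), 7),
    (datetime(2024, 1, 1, 8, 0), 12),
    (datetime(2024, 1, 1, 10, 0), 3),
]
time_series.sort(key=lambda point: point[0])

print(rows[0]["title"])                      # access by column name
print(semi_structured["cast"][0]["name"])    # access by nested key
print(time_series[0][0].isoformat())         # earliest timestamp
```

Note that the CSV reader returns every value as a string; in a relational store the schema would also fix each column's type.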
Sources of data
We provide scripts to easily download the following datasets to HDFS. See each dataset’s page for details about the data schema and how to get the data:
- IMDb dataset, ideal for testing relational features.
- NYC Green Taxi Trip dataset, ideal for testing time series features.
Some other datasets can be used to test TDP:
Note: Alliage does not provide scripts to install these additional datasets. However, the HDFS tutorial explains how to add them manually.
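As a rough illustration of what adding a dataset manually can look like, the Python sketch below only builds and prints the standard `hdfs dfs` commands for creating a target directory and uploading a local file. It is not the procedure from the HDFS tutorial: the helper name `hdfs_upload_commands` and both paths are hypothetical, and on a secured cluster you would typically need to authenticate before running the commands.

```python
import shlex
from typing import List


def hdfs_upload_commands(local_path: str, hdfs_dir: str) -> List[List[str]]:
    """Return the `hdfs dfs` commands that would upload one local file.

    Hypothetical helper for illustration; it runs nothing by itself.
    """
    return [
        # Create the destination directory (and its parents) if missing.
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        # Copy the local file into HDFS; -f overwrites an existing copy.
        ["hdfs", "dfs", "-put", "-f", local_path, hdfs_dir],
    ]


# Both paths are placeholders; replace them with your own.
commands = hdfs_upload_commands("/tmp/my_dataset.csv", "/user/tdp_user/datasets")
for command in commands:
    print(shlex.join(command))
```

The printed lines can be pasted into a shell on a node where the `hdfs` client is installed, or passed one by one to `subprocess.run(command, check=True)`.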