IMDb
The Internet Movie Database, better known as IMDb, provides an open dataset about movies and series.
This dataset is suitable for working with:
- Relational data with multiple tables
- Join strategy evaluation
Note: because of its relatively small size, it is not suitable for evaluating Big Data processing and tuning, such as running intensive Spark jobs on large clusters.
Description of data
The IMDb dataset is organized as seven tables:
- title.basics: General information about titles
- title.akas: Alternative versions of titles
- title.crew: Director and writer information for titles
- title.episode: Episode and season information for series
- title.principals: Cast information for titles
- title.ratings: IMDb rating and vote information for titles
- name.basics: General information of persons
The raw size of the dataset is about 5.5 GB. It is available for download from the IMDb data files website in compressed TSV format.
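As an illustration of the compressed TSV format, the following minimal Python sketch reads such a file row by row. It assumes gzip compression and that missing values are written as `\N`, as in the schema descriptions of this page. The function name `read_imdb_tsv` and the in-memory sample rows are hypothetical and only serve the example.

```python
import csv
import gzip
import io

NULL_MARKER = r"\N"  # placeholder used in the files for a missing value


def read_imdb_tsv(stream):
    """Yield one dict per row from a gzip-compressed TSV byte stream."""
    with gzip.open(stream, mode="rt", encoding="utf-8", newline="") as text:
        # QUOTE_NONE: double quotes inside a field are ordinary characters.
        reader = csv.DictReader(text, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield {k: (None if v == NULL_MARKER else v) for k, v in row.items()}


# Made-up sample standing in for a real file (assumption, not real data).
sample = gzip.compress(
    b"tconst\taverageRating\tnumVotes\n"
    b"tt9000001\t5.7\t2000\n"
    b"tt9000002\t\\N\t\\N\n"
)
rows = list(read_imdb_tsv(io.BytesIO(sample)))
print(rows)
```

All values are returned as strings or `None`; converting them to the types listed in the schemas is left to the caller.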
Usage
Alliage provides the imdb.sh script to download the IMDb dataset to a TDP getting started cluster.
Internally, the script downloads and decompresses the dataset, then pipes it to HDFS. The --help option prints additional information about the script usage.
While not recommended from a security standpoint, a quick way to install the dataset is to pipe the script directly into bash:
curl \
-s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/imdb.sh \
| bash
It is possible to pass options to the command. For example, to print the help:
curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/imdb.sh | bash -s -- --help
By default, the downloaded files are accessible by the tdp_user user, in the data/imdb directory of that user's HDFS home directory. For example:
hdfs dfs -ls "/user/tdp_user/data/imdb"
Note: On macOS, the script uses the getopt command, which is available by default but is not functional for this purpose. The GNU version of getopt can be installed with brew install gnu-getopt. macOS users of nix-darwin can also run nix-shell -p getopt.
Schemas
title.basics schema
Fields | Types | Descriptions |
---|---|---|
tconst | string | Alphanumeric unique identifier of the title. |
titleType | string | The type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc). |
primaryTitle | string | The more popular title / the title used by the filmmakers on promotional materials at the point of release. |
originalTitle | string | Original title, in the original language. |
isAdult | boolean | Is this title made for adults? 0 = non-adult title, 1 = adult title. |
startYear | integer | In YYYY format. Represents the release year of a title. In the case of TV Series, it is the series start year. |
endYear | integer | In YYYY format. TV Series end year. \N for all other title types. |
runtimeMinutes | integer | Primary runtime of the title, in minutes. |
genres | [string] | Includes up to three genres associated with the title. |
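To make the types in this table concrete, here is a minimal Python sketch that converts one raw line of title.basics into a typed record. It assumes the columns appear in the order listed above, that missing values are written as `\N`, and that `genres` is stored as a comma-separated string in the raw file. The names `TitleBasics` and `parse_title_basics` and the sample line are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

NULL_MARKER = r"\N"


def opt_int(value: str) -> Optional[int]:
    """Convert a raw field to int, mapping the null marker to None."""
    return None if value == NULL_MARKER else int(value)


@dataclass
class TitleBasics:
    tconst: str
    titleType: str
    primaryTitle: str
    originalTitle: str
    isAdult: bool
    startYear: Optional[int]
    endYear: Optional[int]
    runtimeMinutes: Optional[int]
    genres: List[str]


def parse_title_basics(line: str) -> TitleBasics:
    """Parse one tab-separated line (assumed column order as in the table)."""
    f = line.rstrip("\n").split("\t")
    return TitleBasics(
        tconst=f[0],
        titleType=f[1],
        primaryTitle=f[2],
        originalTitle=f[3],
        isAdult=(f[4] == "1"),
        startYear=opt_int(f[5]),
        endYear=opt_int(f[6]),
        runtimeMinutes=opt_int(f[7]),
        genres=[] if f[8] == NULL_MARKER else f[8].split(","),
    )


# Made-up sample line, not taken from the real dataset.
title = parse_title_basics(
    "tt9000001\tmovie\tExample Title\tTitre Exemple\t0\t1999\t\\N\t95\tDrama,Comedy\n"
)
print(title)
```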
title.akas schema
Fields | Types | Descriptions |
---|---|---|
titleId | string | A tconst, an alphanumeric unique identifier of the title. |
ordering | integer | A number to uniquely identify rows for a given titleId. |
title | string | The localized title. |
region | string | The region for this version of the title. |
language | string | The language of the title. |
types | [string] | Enumerated set of attributes for this alternative title. One or more of the following: alternative, dvd, festival, tv, video, working, original, imdbDisplay. New values may be added in the future without warning. |
attributes | [string] | Additional terms to describe this alternative title, not enumerated. |
isOriginalTitle | boolean | Is this alternative title the original one? 0 = not original title, 1 = original title. |
title.crew schema
Fields | Types | Descriptions |
---|---|---|
tconst | string | Alphanumeric unique identifier of the title. |
directors | [string] | nconst identifier(s) of the director(s) of the given title. |
writers | [string] | nconst identifier(s) of the writer(s) of the given title. |
title.episode schema
Fields | Types | Descriptions |
---|---|---|
tconst | string | Alphanumeric identifier of the episode. |
parentTconst | string | Alphanumeric identifier of the parent TV Series. |
seasonNumber | integer | Season number the episode belongs to. |
episodeNumber | integer | Episode number of the tconst in the TV series. |
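The parent/child relationship in this table can be illustrated with a short Python sketch that groups episode identifiers by series and season. It assumes rows are already parsed into dicts keyed by the field names above, with `seasonNumber` converted to an integer or `None`. The function name `episodes_by_season` and the sample rows are hypothetical.

```python
from collections import defaultdict


def episodes_by_season(episode_rows):
    """Map parentTconst -> seasonNumber -> list of episode tconst values."""
    grouped = defaultdict(lambda: defaultdict(list))
    for row in episode_rows:
        grouped[row["parentTconst"]][row["seasonNumber"]].append(row["tconst"])
    # Convert nested defaultdicts to plain dicts for predictable behavior.
    return {series: dict(seasons) for series, seasons in grouped.items()}


# Made-up sample rows, not taken from the real dataset.
episodes = [
    {"tconst": "tt9000011", "parentTconst": "tt9000010", "seasonNumber": 1, "episodeNumber": 1},
    {"tconst": "tt9000012", "parentTconst": "tt9000010", "seasonNumber": 1, "episodeNumber": 2},
    {"tconst": "tt9000013", "parentTconst": "tt9000010", "seasonNumber": 2, "episodeNumber": 1},
]
by_series = episodes_by_season(episodes)
print(by_series)
```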
title.principals schema
Fields | Types | Descriptions |
---|---|---|
tconst | string | Alphanumeric unique identifier of the title. |
ordering | integer | A number to uniquely identify rows for a given tconst. |
nconst | string | Alphanumeric unique identifier of the name/person. |
category | string | The category of job that person was in. |
job | string | The specific job title. \N if not applicable. |
characters | string | The name of the character played. \N if not applicable. |
title.ratings schema
Fields | Types | Descriptions |
---|---|---|
tconst | string | Alphanumeric unique identifier of the title. |
averageRating | float | Weighted average of all the individual user ratings. |
numVotes | integer | Number of votes the title has received. |
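Because title.ratings shares the `tconst` key with title.basics, the two tables are a natural pair for join experiments. The following Python sketch shows one possible strategy, an in-memory hash join, under the assumption that rows are already parsed into dicts. The function name `hash_join` and the sample rows are hypothetical, and this is not how Spark or any other engine implements its joins.

```python
def hash_join(build_rows, probe_rows, key):
    """Inner join two lists of dicts on `key` with a simple hash join."""
    # Build phase: index the (ideally smaller) table by the join key.
    index = {}
    for row in build_rows:
        index.setdefault(row[key], []).append(row)
    # Probe phase: scan the other table and look up matching rows.
    joined = []
    for row in probe_rows:
        for match in index.get(row[key], []):
            joined.append({**row, **match})
    return joined


# Made-up sample rows, not taken from the real dataset.
titles = [
    {"tconst": "tt9000001", "primaryTitle": "Example Title"},
    {"tconst": "tt9000002", "primaryTitle": "Unrated Title"},
]
ratings = [
    {"tconst": "tt9000001", "averageRating": 7.5, "numVotes": 1200},
]
rated_titles = hash_join(ratings, titles, "tconst")
print(rated_titles)
```

Titles without a rating are dropped here because the sketch performs an inner join; a left join would keep them with empty rating fields.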
name.basics schema
Fields | Types | Descriptions |
---|---|---|
nconst | string | Alphanumeric unique identifier of the name/person. |
primaryName | string | Name by which the person is most often credited. |
birthYear | integer | In YYYY format. |
deathYear | integer | In YYYY format. \N if not applicable. |
primaryProfession | [string] | The top-3 professions of the person. |
knownForTitles | [string] | tconst identifiers of the titles the person is known for. |
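Array-valued fields such as knownForTitles usually need to be flattened before they can be joined with the title tables. The Python sketch below turns one person row into (nconst, tconst) pairs, assuming the raw field is a comma-separated string and that a missing value appears as `\N` or `None`. The function name `explode_known_for` and the sample rows are hypothetical.

```python
NULL_MARKER = r"\N"


def explode_known_for(person):
    """Return one (nconst, tconst) pair per title the person is known for."""
    raw = person["knownForTitles"]
    if raw is None or raw == NULL_MARKER:
        return []
    return [(person["nconst"], tconst) for tconst in raw.split(",")]


# Made-up sample rows, not taken from the real dataset.
known = {"nconst": "nm9000001", "knownForTitles": "tt9000001,tt9000002"}
unknown = {"nconst": "nm9000002", "knownForTitles": NULL_MARKER}

pairs = explode_known_for(known) + explode_known_for(unknown)
print(pairs)
```

The resulting pairs can then be joined with title.basics on `tconst` like any other two-column relation.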