IMDb

The Internet Movie Database, better known as IMDb, provides an open dataset about movies and series.

This dataset is suitable for working with:

  • Relational data with multiple tables
  • Join strategy evaluation

Note: because of its relatively small size, it is not suitable for evaluating Big Data processing and tuning, such as launching intensive Spark jobs on large clusters.

Description of data

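The IMDb dataset is organized as seven tables:

  • title.basics
  • title.akas
  • title.crew
  • title.episode
  • title.principals
  • title.ratings
  • name.basics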

The raw size of the dataset is about 5.5 GB. It is available to download from the IMDb data files website in compressed TSV format.

Usage

Alliage provides the imdb.sh script to download the IMDb dataset to a TDP getting started cluster.

Internally, the command downloads, decompresses and pipes the dataset to HDFS. The --help option prints additional information about the script usage.

While not recommended from a security standpoint, a quick way to install the dataset is:

curl \
  -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/imdb.sh \
  | bash
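
A more cautious alternative is to download the script to a file, read it, and only then run it. The following is a minimal sketch, not part of the Alliage tooling: the fetch_imdb_script function name and the ./imdb.sh default destination are hypothetical, while the URL and the --help option are the ones mentioned in this section.

```shell
#!/usr/bin/env bash

# Hypothetical helper: save the installer locally instead of piping it to bash.
# The optional first argument is the destination path (assumed default: ./imdb.sh).
fetch_imdb_script() {
  local url="https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/imdb.sh"
  local dest="${1:-./imdb.sh}"
  # -f: fail on HTTP errors, -sS: quiet but still report errors, -L: follow redirects
  curl -fsSL "$url" -o "$dest" && echo "$dest"
}

# Possible usage (not executed here):
#   script=$(fetch_imdb_script)   # download only
#   less "$script"                # review the content before running it
#   bash "$script" --help         # then run it with the desired options
```

Defining a function rather than running the download inline keeps the review step explicit: nothing is executed until the script has been read.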

It is possible to pass options to the command. For example, to print the help:

curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/imdb.sh | bash -s -- --help

By default, the downloaded files are accessible to the tdp_user user, in the “data/imdb” directory of that user's HDFS home directory. For example:

hdfs dfs -ls "/user/tdp_user/data/imdb"

Note: For macOS users, the script uses the getopt command, which is available by default but not functional. The GNU version of getopt can be installed with brew install gnu-getopt. macOS users of nix-darwin can also run nix -p getopt.

Schemas

title.basics schema

| Field | Type | Description |
| --- | --- | --- |
| tconst | string | Alphanumeric unique identifier of the title. |
| titleType | string | The type/format of the title (e.g. movie, short, tvSeries, tvEpisode, video, etc). |
| primaryTitle | string | The more popular title / the title used by the filmmakers on promotional materials at the point of release. |
| originalTitle | string | Original title, in the original language. |
| isAdult | boolean | Is this title made for adults? 0 = non-adult title, 1 = adult title. |
| startYear | integer | In YYYY format. Represents the release year of a title. In the case of TV Series, it is the series start year. |
| endYear | integer | In YYYY format. TV Series end year. \N for all other title types. |
| runtimeMinutes | integer | Primary runtime of the title, in minutes. |
| genres | [string] | Includes up to three genres associated with the title. |
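
As a quick illustration of this schema, the sketch below filters title.basics rows with awk and treats \N as a missing value. It assumes the file is tab-separated with a header row and that genres are stored as a comma-separated list, as in the IMDb data files. The sample rows and the function names sample_basics and movies_with_year are made up for the example.

```shell
#!/usr/bin/env bash

# Illustrative rows only (made-up identifiers and titles), in title.basics column order:
# tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
sample_basics() {
  printf 'tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n'
  printf 'tt0000001\tmovie\tFirst Film\tFirst Film\t0\t1999\t\\N\t120\tDrama,Crime\n'
  printf 'tt0000002\ttvSeries\tSome Show\tSome Show\t0\t2005\t2009\t45\tComedy\n'
  printf 'tt0000003\tmovie\tUndated Film\tUndated Film\t0\t\\N\t\\N\t\\N\t\\N\n'
}

# Keep movies with a known startYear; print title, year and first genre.
movies_with_year() {
  awk -F'\t' '
    NR == 1 { next }                     # skip the header row
    $2 == "movie" && $6 != "\\N" {       # \N marks a missing value
      split($9, g, ",")                  # genres is a comma-separated list
      print $3 "\t" $6 "\t" g[1]
    }'
}

sample_basics | movies_with_year
# prints: First Film<TAB>1999<TAB>Drama
```

On a cluster, the same filter could read from hdfs dfs -cat instead of the sample function, with the actual file path adjusted to the local installation.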

title.akas schema

| Field | Type | Description |
| --- | --- | --- |
| titleId | string | A tconst, an alphanumeric unique identifier of the title. |
| ordering | integer | A number to uniquely identify rows for a given titleId. |
| title | string | The localized title. |
| region | string | The region for this version of the title. |
| language | string | The language of the title. |
| types | [string] | Enumerated set of attributes for this alternative title. One or more of the following: alternative, dvd, festival, tv, video, working, original, imdbDisplay. New values may be added in the future without warning. |
| attributes | [string] | Additional terms to describe this alternative title, not enumerated. |
| isOriginalTitle | boolean | Is this alternative title the original one? 0 = not original title, 1 = original title. |

title.crew schema

| Field | Type | Description |
| --- | --- | --- |
| tconst | string | Alphanumeric unique identifier of the title. |
| directors | [nconst] | Director(s) of the given title. |
| writers | [nconst] | Writer(s) of the given title. |
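
Array columns such as directors usually need to be flattened before they can be joined with name.basics on nconst. The sketch below shows one way to do that with awk, assuming the list is comma-separated and that \N marks an empty list, as in the IMDb data files. The sample rows and the function names sample_crew and explode_directors are hypothetical.

```shell
#!/usr/bin/env bash

# Illustrative rows only (made-up identifiers), in title.crew column order:
# tconst directors writers
sample_crew() {
  printf 'tconst\tdirectors\twriters\n'
  printf 'tt0000001\tnm0000001,nm0000002\tnm0000003\n'
  printf 'tt0000002\t\\N\t\\N\n'
}

# Emit one (tconst, director nconst) pair per line.
explode_directors() {
  awk -F'\t' '
    NR == 1 || $2 == "\\N" { next }      # skip the header and titles without directors
    {
      n = split($2, d, ",")              # directors is a comma-separated list
      for (i = 1; i <= n; i++) print $1 "\t" d[i]
    }'
}

sample_crew | explode_directors
# prints:
#   tt0000001<TAB>nm0000001
#   tt0000001<TAB>nm0000002
```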

title.episode schema

| Field | Type | Description |
| --- | --- | --- |
| tconst | string | Alphanumeric identifier of episode. |
| parentTconst | string | Alphanumeric identifier of the parent TV Series. |
| seasonNumber | integer | Season number the episode belongs to. |
| episodeNumber | integer | Episode number of the tconst in the TV series. |

title.principals schema

| Field | Type | Description |
| --- | --- | --- |
| tconst | string | Alphanumeric unique identifier of the title. |
| ordering | integer | A number to uniquely identify rows for a given titleId. |
| nconst | string | Alphanumeric unique identifier of the name/person. |
| category | string | The category of job that person was in. |
| job | string | The specific job title. \N if not applicable. |
| characters | string | The name of the character played. \N if not applicable. |

title.ratings schema

| Field | Type | Description |
| --- | --- | --- |
| tconst | string | Alphanumeric unique identifier of the title. |
| averageRating | float | Weighted average of all the individual user ratings. |
| numVotes | integer | Number of votes the title has received. |
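
Since the dataset is intended for join experiments, here is a small sketch of a hash join between title.ratings and title.basics on tconst, written with awk. It only illustrates the idea (build an in-memory table from the smaller input, then stream the larger one and probe it). It is not how Spark or the imdb.sh script work. The sample rows, the file names and the join_titles_ratings function name are hypothetical.

```shell
#!/usr/bin/env bash

# Inner hash join on tconst: $1 = ratings file (build side), $2 = basics file (probe side).
join_titles_ratings() {
  local ratings_file=$1 basics_file=$2
  awk -F'\t' '
    FNR == 1     { next }                                    # skip each header row
    NR == FNR    { rating[$1] = $2; votes[$1] = $3; next }   # first file: build the hash table
    ($1 in rating) { print $3 "\t" rating[$1] "\t" votes[$1] }  # second file: probe it
  ' "$ratings_file" "$basics_file"
}

# Illustrative data only (made-up identifiers, titles and ratings).
workdir=$(mktemp -d)
printf 'tconst\taverageRating\tnumVotes\ntt0000001\t7.5\t1200\ntt0000004\t6.1\t50\n' \
  > "$workdir/title.ratings.tsv"
printf 'tconst\ttitleType\tprimaryTitle\ntt0000001\tmovie\tFirst Film\ntt0000002\ttvSeries\tSome Show\n' \
  > "$workdir/title.basics.tsv"

join_titles_ratings "$workdir/title.ratings.tsv" "$workdir/title.basics.tsv"
# prints: First Film<TAB>7.5<TAB>1200
```

Only tt0000001 appears in both inputs, so the inner join returns a single row. Titles without ratings and ratings without a matching title are dropped.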

name.basics schema

| Field | Type | Description |
| --- | --- | --- |
| nconst | string | Alphanumeric unique identifier of the name/person. |
| primaryName | string | Name by which the person is most often credited. |
| birthYear | integer | In YYYY format. |
| deathYear | integer | In YYYY format. \N if not applicable. |
| primaryProfession | [string] | The top-3 professions of the person. |
| knownForTitles | [tconst] | Titles the person is known for. |