New York City Green Taxi Trip

The New York City Taxi & Limousine Commission provides a collection of public data about taxi rides in New York City.

Alliage tutorials use the Green Taxi Trip dataset. It is suitable for working with:

  • Time series
  • Key/value and column family storage
  • Geospatial queries

Description

The Green Taxi Trip dataset contains information related to pick-up and drop-off time and location, as well as driver income information.

The dataset is split by year and month, from 2013 to the present. It is about 1.2 GB and is published in Parquet format.

Requirements

Downloading and using the dataset requires a running cluster based on TDP getting started, an easy-to-launch TDP environment for testing purposes. This deployment provides you with:

  • tdp_user, a user with the ability to kinit for authentication.
  • An edge node accessible by SSH
  • HDFS directories:
    • /user/tdp_user

Note: If you are using your own TDP deployment, you need to adapt the previous configuration accordingly.

Before downloading the dataset, connect to the cluster and authenticate with kinit using the following commands:

# Connect to edge-01.tdp 
vagrant ssh edge-01
# Switch user to tdp_user
sudo su tdp_user
# Authenticate the user with its Kerberos principal and keytab
kinit -kt ~/tdp_user.keytab tdp_user@REALM.TDP

Usage

Alliage provides the nyc-green-taxi-trip.sh script to download the NYC Green Taxi Trip dataset to a TDP getting started cluster for the tdp_user user.

Internally, the script downloads the dataset and pipes it to HDFS. The --help option prints additional information about the script usage.
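The download-and-pipe step can be sketched as follows. This is a minimal illustration under stated assumptions, not the script's actual code: the helper names green_taxi_url and fetch_month_to_hdfs are hypothetical, and the TLC download URL pattern and the one-file-per-month layout in HDFS are assumptions that this page does not confirm.

```shell
#!/usr/bin/env bash
# Assumed base URL of the public TLC trip-record files (not confirmed by this page).
TLC_BASE_URL="https://d37ci6vzurychx.cloudfront.net/trip-data"
# Target HDFS directory used in this tutorial.
HDFS_DIR="/user/tdp_user/data/nyc_green_taxi_trip"

# Hypothetical helper: build the download URL for one month (arguments: year, month).
# The 10# prefix forces base 10 so that "08" and "09" are not parsed as octal.
green_taxi_url() {
  printf '%s/green_tripdata_%04d-%02d.parquet\n' "$TLC_BASE_URL" "$((10#$1))" "$((10#$2))"
}

# Hypothetical helper: stream one month into HDFS without keeping a local copy.
# "hdfs dfs -put - <dst>" reads the file content from standard input.
fetch_month_to_hdfs() {
  local url
  url="$(green_taxi_url "$1" "$2")"
  # -s: silent, -f: fail on HTTP errors, -L: follow redirects.
  curl -sfL "$url" | hdfs dfs -put - "$HDFS_DIR/${url##*/}"
}
```

On the edge node, after kinit, calling fetch_month_to_hdfs 2015 3 would stream the March 2015 file into the target directory. Streaming avoids using local disk space on the edge node, at the cost of having to restart the transfer if the download is interrupted.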

The dataset is partitioned by month and year. A quick way to install it, although not recommended from a security standpoint because it pipes a remote script directly into bash, is:

curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/nyc-green-taxi-trip.sh | bash

It is possible to pass options to the command. For example, to print the help:

curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/nyc-green-taxi-trip.sh | bash -s -- --help

The script provides --from and --to options to specify a subset of the dataset to download. For example, the following command downloads data between March 2015 and July 2020:

curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/nyc-green-taxi-trip.sh | bash -s -- --from 03-2015 --to 07-2020
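To make the meaning of a --from/--to range concrete, the sketch below expands two MM-YYYY bounds into the list of months they cover, both ends included. The months_between function is a hypothetical helper written for illustration; it is not taken from nyc-green-taxi-trip.sh, and the inclusive handling of both bounds is an assumption about how the script interprets the range.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list every month between two MM-YYYY bounds, inclusive.
# Output format is YYYY-MM, one month per line.
months_between() {
  local from="$1" to="$2"
  # Split MM-YYYY; the 10# prefix forces base 10 so "08" and "09" are not read as octal.
  local month=$((10#${from%-*})) year=$((10#${from#*-}))
  local end_month=$((10#${to%-*})) end_year=$((10#${to#*-}))
  while (( year < end_year || (year == end_year && month <= end_month) )); do
    printf '%04d-%02d\n' "$year" "$month"
    if (( month == 12 )); then
      # Roll over from December to January of the next year.
      month=1
      year=$((year + 1))
    else
      month=$((month + 1))
    fi
  done
}
```

For example, months_between 11-2019 02-2020 prints 2019-11, 2019-12, 2020-01 and 2020-02, one per line. Under the same assumption, the range 03-2015 to 07-2020 covers 65 monthly files.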

By default, the downloaded files are stored in the “data/nyc_green_taxi_trip” subdirectory of the tdp_user HDFS home directory and are accessible by that user. For example:

hdfs dfs -ls "/user/tdp_user/data/nyc_green_taxi_trip"

Note: On macOS, the script relies on the getopt command. The version available by default is not functional for this script. The GNU version of getopt can be installed with brew install gnu-getopt. macOS users of nix-darwin can also run nix-shell -p getopt.

Schema

The Green Taxi Trip dataset is composed of the following fields:

| Field | Type | Description |
|---|---|---|
| VendorID | Long | A code indicating the LPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc. |
| Lpep_pickup_datetime | DateTime | The date and time when the meter was engaged. |
| Lpep_dropoff_datetime | DateTime | The date and time when the meter was disengaged. |
| Store_and_fwd_flag | String | Indicates whether the trip record was held in vehicle memory before being sent to the vendor ("store and forward") because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip. |
| RatecodeID | Double | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride. |
| PULocationID | Long | TLC Taxi Zone in which the taximeter was engaged. |
| DOLocationID | Long | TLC Taxi Zone in which the taximeter was disengaged. |
| Passenger_count | Double | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip_distance | Double | The elapsed trip distance in miles reported by the taximeter. |
| Fare_amount | Double | The time-and-distance fare calculated by the meter. |
| Extra | Double | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| Mta_tax | Double | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Tip_amount | Double | Tip amount. This field is automatically populated for credit card tips; cash tips are not included. |
| Tolls_amount | Double | Total amount of all tolls paid on the trip. |
| Ehail_fee | Integer | Total fee charged for using electronic hailing (E-Hail). |
| Improvement_surcharge | Double | $0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Total_amount | Double | The total amount charged to passengers. Does not include cash tips. |
| Payment_type | Double | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip. |
| Trip_type | Double | A code indicating whether the trip was a street hail or a dispatch. It is automatically assigned based on the metered rate in use, but can be altered by the driver. 1 = Street-hail; 2 = Dispatch. |
| Congestion_surcharge | Double | Total amount of surcharges applied to the passenger for non-shared trips. |
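Several fields store numeric codes rather than labels. As an illustration, the sketch below translates a Payment_type value into the label listed in the schema. The payment_type_label function is a hypothetical helper, not part of the Alliage script; it accepts values such as "2.0" as well as "2" on the assumption that a Double column may be rendered with a trailing ".0".

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a Payment_type code into its label.
# Labels are copied from the schema of the Green Taxi Trip dataset.
payment_type_label() {
  local code="${1%.0}"   # drop a trailing ".0" if present
  case "$code" in
    1) echo "Credit card" ;;
    2) echo "Cash" ;;
    3) echo "No charge" ;;
    4) echo "Dispute" ;;
    5) echo "Unknown" ;;
    6) echo "Voided trip" ;;
    *) echo "Unrecognized code: $1" ;;
  esac
}
```

For example, payment_type_label 2.0 prints Cash. The same pattern applies to the RatecodeID and Trip_type codes.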

Clean up

Once the dataset is no longer needed, all downloaded Parquet files in the nyc_green_taxi_trip folder can be deleted using the following command:

hdfs dfs -rm "/user/tdp_user/data/nyc_green_taxi_trip/*"