New York City Green Taxi Trip
The New York City Taxi & Limousine Commission provides a collection of public data about taxi rides in New York City.
Alliage tutorials use the
Green Taxi Trip dataset. It is suitable for working with:
- Time series
- Key/value and column family storage
- Geospatial queries
The Green Taxi Trip dataset contains pick-up and drop-off times and locations, as well as driver income information.
The dataset is divided by year and month, from 2013 until now. It is about 1.2 GB and is published in Parquet format.
Downloading and using the dataset requires a running cluster based on TDP getting started, an easy-to-launch TDP environment for testing purposes. This deployment provides you with:
tdp_user, a user with the ability to
- An edge node accessible by SSH
- HDFS directories:
Note: If you are using your own TDP deployment, you need to adapt the previous configuration accordingly.
Before downloading the dataset, connect to the cluster and authenticate with
kinit using the following commands:
# Connect to edge-01.tdp
vagrant ssh edge-01
# Switch user to tdp_user
sudo su tdp_user
# Authenticate the user with its Kerberos principal and keytab
kinit -kt ~/tdp_user.keytab tdp_user@REALM.TDP
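When the session has been open for a while, the Kerberos ticket may have expired. The sketch below is one possible way to authenticate only when needed. It assumes the MIT Kerberos klist command, whose -s option exits with a non-zero status when no valid ticket is cached. The ensure_ticket function name is hypothetical; the keytab path and principal are the ones used in the commands above.

```shell
# Sketch: request a Kerberos ticket only when no valid one is cached.
# Assumes MIT Kerberos tools (klist, kinit) are available on the edge node.
ensure_ticket() {
  local keytab="${1:-$HOME/tdp_user.keytab}"
  local principal="${2:-tdp_user@REALM.TDP}"

  # "klist -s" prints nothing and fails when the credentials cache
  # is missing or expired.
  if klist -s; then
    echo "valid ticket found"
  else
    kinit -kt "$keytab" "$principal" && echo "ticket obtained for $principal"
  fi
}
```

Calling ensure_ticket without arguments uses the tutorial defaults.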
Alliage provides the
nyc-green-taxi-trip.sh script to download the NYC Green Taxi Trip dataset to a TDP getting started cluster for the
Internally, the command downloads the dataset and pipes it to HDFS. The
--help option prints additional information about the script usage.
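To illustrate what "downloads and pipes to HDFS" can look like, here is a minimal sketch for a single month. It is not the script's actual implementation. The URL pattern is an assumption based on the public TLC file naming, and the green_taxi_url and stream_month_to_hdfs function names are hypothetical.

```shell
# Assumption: monthly files follow this public TLC URL pattern.
green_taxi_url() {
  # $1 = year (YYYY), $2 = month (3 or 03)
  printf 'https://d37ci6vzurychx.cloudfront.net/trip-data/green_tripdata_%04d-%02d.parquet\n' \
    "$1" "${2#0}"
}

# Stream one monthly file into HDFS without writing it to the local disk.
stream_month_to_hdfs() {
  local year="$1" month="$2"
  local dest_dir="${3:-/user/tdp_user/data/nyc_green_taxi_trip}"
  local file
  file="$(printf 'green_tripdata_%04d-%02d.parquet' "$year" "${month#0}")"

  hdfs dfs -mkdir -p "$dest_dir"
  # "-put -" makes hdfs read the file content from stdin; "-f" overwrites
  # an existing file with the same name.
  curl -fsSL "$(green_taxi_url "$year" "$month")" \
    | hdfs dfs -put -f - "$dest_dir/$file"
}
```

For example, stream_month_to_hdfs 2015 03 would store green_tripdata_2015-03.parquet in the default directory. Because the download is piped, a failed curl is not detected by the hdfs command, so a real script would also need to check the download status.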
The dataset is partitioned by month and year. While not recommended from a security standpoint, a quick way to install the dataset is:
curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/nyc-green-taxi-trip.sh | bash
It is possible to pass options to the command. For example, to print the help:
curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/nyc-green-taxi-trip.sh | bash -s -- --help
The script provides the --from and
--to options to specify a subset of the dataset to download. For example, the following command downloads data between March 2015 and July 2020:
curl -s https://raw.githubusercontent.com/alliage-io/alliage-academy/main/datasets/nyc-green-taxi-trip.sh | bash -s -- --from 03-2015 --to 07-2020
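To make the effect of a --from/--to range concrete, the following sketch lists every month covered by two MM-YYYY bounds, inclusive. It only illustrates the idea; how the script itself parses and iterates over the range is not documented here, and the months_between function name is hypothetical.

```shell
# Print every month between two MM-YYYY bounds (inclusive),
# one YYYY-MM value per line.
months_between() {
  local from="$1" to="$2"
  local from_month="${from%-*}" from_year="${from#*-}"
  local to_month="${to%-*}" to_year="${to#*-}"

  # Strip a leading zero so that 08 and 09 are not read as invalid octal.
  from_month="${from_month#0}"
  to_month="${to_month#0}"

  local year="$from_year" month="$from_month"
  while [ "$year" -lt "$to_year" ] ||
        { [ "$year" -eq "$to_year" ] && [ "$month" -le "$to_month" ]; }; do
    printf '%04d-%02d\n' "$year" "$month"
    month=$((month + 1))
    if [ "$month" -gt 12 ]; then
      month=1
      year=$((year + 1))
    fi
  done
}
```

With the bounds from the command above, months_between 03-2015 07-2020 prints 65 lines, from 2015-03 to 2020-07.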
By default, the downloaded files are accessible by the
tdp_user user, in the “data/nyc_green_taxi_trip” directory of this user's HDFS home directory. For example:
hdfs dfs -ls "/user/tdp_user/data/nyc_green_taxi_trip"
Note: For macOS users, the script uses the
getopt command, which is available by default but not functional for this script. The GNU version of
getopt can be installed with
brew install gnu-getopt. macOS users of nix-darwin can also run
nix-shell -p getopt.
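A quick way to check which getopt is on the PATH is its test mode: the enhanced (GNU/util-linux) getopt exits with status 4 when called with -T, while the BSD version shipped with macOS does not. The sketch below relies on that behavior. The function names are hypothetical, and the Homebrew step assumes that gnu-getopt is not linked into the default PATH, which is how Homebrew usually installs it.

```shell
# Return 0 when the getopt found on PATH is the enhanced (GNU) version.
has_enhanced_getopt() {
  local status=0
  # Enhanced getopt exits with status 4 in test mode (-T).
  getopt -T >/dev/null 2>&1 || status=$?
  [ "$status" -eq 4 ]
}

# Assumption: Homebrew installed gnu-getopt without linking it,
# so its bin directory must be put first on PATH.
use_homebrew_getopt() {
  if ! has_enhanced_getopt && command -v brew >/dev/null 2>&1; then
    PATH="$(brew --prefix gnu-getopt)/bin:$PATH"
    export PATH
  fi
}
```

Running use_homebrew_getopt before the download script leaves the PATH untouched when an enhanced getopt is already available.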
The Green Taxi Trip dataset is composed of the following fields:
| Field | Type | Description |
|---|---|---|
| VendorID | Long | A code indicating the LPEP provider that provided the record. 1 = Creative Mobile Technologies, LLC; 2 = VeriFone Inc. |
| Lpep_pickup_datetime | DateTime | The date and time when the meter was engaged. |
| Lpep_dropoff_datetime | DateTime | The date and time when the meter was disengaged. |
| Store_and_fwd_flag | String | This flag indicates whether the trip record was held in vehicle memory before being sent to the vendor, also known as "store and forward", because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip. |
| RatecodeID | Double | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride. |
| PULocationID | Long | TLC Taxi Zone in which the taximeter was engaged. |
| DOLocationID | Long | TLC Taxi Zone in which the taximeter was disengaged. |
| Passenger_count | Double | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip_distance | Double | The elapsed trip distance in miles reported by the taximeter. |
| Fare_amount | Double | The time-and-distance fare calculated by the meter. |
| Extra | Double | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| Mta_tax | Double | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Tip_amount | Double | Tip amount. This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls_amount | Double | Total amount of all tolls paid on the trip. |
| Ehail_fee | Integer | Total amount of fees for using electronic hailing (E-Hail). |
| Improvement_surcharge | Double | $0.30 improvement surcharge assessed on hailed trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Total_amount | Double | The total amount charged to passengers. Does not include cash tips. |
| Payment_type | Double | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip. |
| Trip_type | Double | A code indicating whether the trip was a street hail or a dispatch. It is automatically assigned based on the metered rate in use but can be altered by the driver. |
| Congestion_surcharge | Double | Total amount of surcharges applied to the passenger for non-shared trips. |
Once the dataset is no longer needed, all downloaded Parquet files in the nyc_green_taxi_trip folder can be deleted using the following command:
hdfs dfs -rm "/user/tdp_user/data/nyc_green_taxi_trip/*"