Uploaded by lester.zp1030

CIND 719 Assignment 1-Hive

CIND 719 ASSIGNMENT #1
Dataset
The dataset provided for this assignment is from the second year of Bay Area Bike Share's
operation. The files contain data from 9/1/14 to 8/31/15. Both files in the dataset are comma-delimited
and do not contain header rows. Please examine the files and understand how they are related
before you attempt any of the questions.
• station_data.csv
-station id: station ID number
-name: name of station
-lat: latitude
-long: longitude
-dockcount: number of total docks at station
-landmark: city (San Francisco, Redwood City, Palo Alto, Mountain View, San Jose)
-installation: original date that station was installed.
• trip_data.csv
-Trip ID: numeric ID of bike trip
-Duration: time of trip in seconds
-Start Date: start date of trip with date and time, in PST
-Start Station: station name of start station (corresponds to 'name' in the station_data.csv dataset)
-Start Terminal: numeric reference for start station (corresponds to 'station id' in the
station_data.csv dataset)
-End Date: end date of trip with date and time, in PST
-End Station: station name for end station (corresponds to 'name' in the station_data.csv dataset)
-End Terminal: numeric reference for end station (corresponds to 'station id' in the station_data.csv
dataset)
-Bike #: numeric ID of bike used
-Subscription Type: 'Subscriber' = annual or 30-day member; 'Customer' = 24-hour or 3-day member
-Zip Code: home zip code of subscriber (customers can choose to enter a zip code manually at the
kiosk; however, this data is unreliable)
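The file layouts above could be mapped to Hive tables roughly as follows. This is a minimal sketch, not the required DDL: the table names, column names, and column types are assumptions chosen for illustration, and the load paths are hypothetical. Dates are kept as STRING because the exact date format is not stated here, and the zip code is kept as STRING because the field is described as unreliable. Plain delimited parsing also assumes that no field contains an embedded comma; if one does, a CSV SerDe such as org.apache.hadoop.hive.serde2.OpenCSVSerde may be a better fit.

```sql
-- Sketch only: all table and column names below are assumptions.
-- The files have no header rows, so no header-skipping property is needed.
CREATE TABLE IF NOT EXISTS station_data (
  station_id   INT,     -- station ID number
  name         STRING,  -- name of station
  lat          DOUBLE,  -- latitude
  lon          DOUBLE,  -- longitude ("long" in the file description)
  dockcount    INT,     -- number of total docks at station
  landmark     STRING,  -- city
  installation STRING   -- installation date, left unparsed
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

CREATE TABLE IF NOT EXISTS trip_data (
  trip_id           INT,     -- numeric ID of bike trip
  duration          INT,     -- time of trip in seconds
  start_date        STRING,  -- start date and time, left unparsed
  start_station     STRING,  -- matches station_data.name
  start_terminal    INT,     -- matches station_data.station_id
  end_date          STRING,  -- end date and time, left unparsed
  end_station       STRING,  -- matches station_data.name
  end_terminal      INT,     -- matches station_data.station_id
  bike_no           INT,     -- numeric ID of bike used
  subscription_type STRING,  -- 'Subscriber' or 'Customer'
  zip_code          STRING   -- unreliable for customers
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical local paths; adjust to wherever the files actually live.
LOAD DATA LOCAL INPATH 'station_data.csv' INTO TABLE station_data;
LOAD DATA LOCAL INPATH 'trip_data.csv' INTO TABLE trip_data;
```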
Assumptions and tips
• Base your work solely on the data provided. Do not rely on your geography knowledge.
• For the purposes of this exercise, assume that bike routes are one-way. That is, the
route from station A to B may not be the same as the route from station B to A.
• Exclude trips that start and end at the same station from your analysis and outputs
unless stated otherwise.
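The last two assumptions can be expressed directly in a query. The sketch below assumes a Hive table named trip_data with integer columns start_terminal and end_terminal; these names are hypothetical and depend on your own DDL. Grouping on the ordered pair (start, end) keeps A to B separate from B to A, and the WHERE clause drops trips that start and end at the same station.

```sql
-- Sketch only: trip_data, start_terminal, and end_terminal are assumed names.
SELECT
  start_terminal,
  end_terminal,
  COUNT(*) AS trips            -- trips on this one-way route
FROM trip_data
WHERE start_terminal <> end_terminal   -- exclude same-station trips
GROUP BY start_terminal, end_terminal; -- (A, B) and (B, A) stay separate rows
```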
QUESTIONS
1. Find the 'most popular' bike, i.e. the bike that has made the highest number of trips.
(1.5 pts)
2. Find the number of trips made by each subscription type. (1.5 pts)
3. Build a table that shows which stations are connected, and the minimum duration
between them. You can use either station id or station name.
Save this table as a comma separated text file in ‘/user/assignment1/stationlist’ in
HDFS. Include the directory listing of the output directory and first five lines of the
output file(s) in your submission. (3 pts)
4. Find the number of trips originating from each landmark. Your output should include
the landmark name and the number of trips originating from it. (3 pts)
5. Find the number of trips crossing landmarks, i.e. trips that originate in one landmark
and end in another. Your output should include the originating and ending landmark
names and the number of trips between them. (6 pts)
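For question 3, one way to write a query result to HDFS as comma-separated text is Hive's INSERT OVERWRITE DIRECTORY statement with a ROW FORMAT clause. The following is a sketch under stated assumptions rather than a reference answer: it assumes a table named trip_data with columns start_terminal, end_terminal, and duration (hypothetical names that depend on your own DDL), and it uses station ids rather than station names.

```sql
-- Sketch only: trip_data and its column names are assumptions.
-- Note that OVERWRITE replaces any existing contents of the target directory.
INSERT OVERWRITE DIRECTORY '/user/assignment1/stationlist'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
SELECT
  start_terminal,
  end_terminal,
  MIN(duration) AS min_duration        -- shortest observed trip, in seconds
FROM trip_data
WHERE start_terminal <> end_terminal   -- same-station trips are excluded
GROUP BY start_terminal, end_terminal;

-- Directory listing from inside the Hive CLI:
dfs -ls /user/assignment1/stationlist;

-- The first five lines can be viewed from a regular shell, for example:
--   hdfs dfs -cat /user/assignment1/stationlist/* | head -n 5
```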
Submit
• Submit the following to the Assignment 1 folder under Assessment/Assignments in
your course shell. You may upload as many files as necessary.
o Hive DDL commands for data prep.
o Your script for each question.
o A screenshot of the script output for each question.