2016 US Bike Share Activity Snapshot

Table of Contents

Introduction

Tip: Quoted sections like this will provide helpful instructions on how to navigate and use a Jupyter notebook.

Over the past decade, bicycle-sharing systems have been growing in number and popularity in cities across the world. Bicycle-sharing systems allow users to rent bicycles for short trips, typically 30 minutes or less. Thanks to the rise in information technologies, it is easy for a user of the system to access a dock within the system to unlock or return bicycles. These technologies also provide a wealth of data that can be used to explore how these bike-sharing systems are used.

In this project, you will perform an exploratory analysis on data provided by Motivate, a bike-share system provider for many major cities in the United States. You will compare the system usage between three large cities: New York City, Chicago, and Washington, DC. You will also see if there are any differences within each system for those users that are registered, regular users and those users that are short-term, casual users.

Posing Questions

Before looking at the bike sharing data, you should start by asking questions you might want to understand about the bike share data. Consider, for example, if you were working for Motivate. What kinds of information would you want to know about in order to make smarter business decisions? If you were a user of the bike-share service, what factors might influence how you would want to use the service?

Question 1: Write at least two questions related to bike sharing that you think could be answered by data.

Answer: Possible questions:

  1. Time patterns in bike share usage in each city based on
     a) days of the week,
     b) months of the year,
     c) trip frequency, and
     d) trip duration.
  2. The most popular starting locations and destination locations in each city.
  3. Demographics of bike share users (age, gender) and possible differences in bike share usage for specific demographic groups.

Tip: If you double click on this cell, you will see the text change so that all of the formatting is removed. This allows you to edit this block of text. This block of text is written using Markdown, which is a way to format text using headers, links, italics, and many other options using a plain-text syntax. You will also use Markdown later in the Nanodegree program. Use Shift + Enter or Shift + Return to run the cell and show its rendered form.

Data Collection and Wrangling

Now it's time to collect and explore our data. In this project, we will focus on the record of individual trips taken in 2016 from our selected cities: New York City, Chicago, and Washington, DC. Each of these cities has a page where we can freely download the trip data:

  • New York City (Citi Bike): Link
  • Chicago (Divvy): Link
  • Washington, DC (Capital Bikeshare): Link

If you visit these pages, you will notice that each city has a different way of delivering its data. Chicago updates with new data twice a year, Washington DC is quarterly, and New York City is monthly. However, you do not need to download the data yourself. The data has already been collected for you in the /data/ folder of the project files. While the original data for 2016 is spread among multiple files for each city, the files in the /data/ folder collect all of the trip data for the year into one file per city. Some data wrangling of inconsistencies in timestamp format within each city has already been performed for you. In addition, a random 2% sample of the original data is taken to make the exploration more manageable.
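
For context, here is a minimal sketch (not the project's actual preprocessing code) of how such a 2% random sample could be drawn while combining monthly files into one file per city; the monthly and output file names below are hypothetical placeholders.

import csv
import random

random.seed(2016)  # make the sample reproducible

monthly_files = ['NYC-CitiBike-2016-01.csv', 'NYC-CitiBike-2016-02.csv']  # hypothetical names
with open('NYC-CitiBike-2016-sample.csv', 'w') as f_out:
    writer = None
    for name in monthly_files:
        with open(name, 'r') as f_in:
            reader = csv.DictReader(f_in)
            if writer is None:
                # reuse the header of the first monthly file for the combined output
                writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
                writer.writeheader()
            for row in reader:
                if random.random() < 0.02:  # keep roughly 2% of the trips
                    writer.writerow(row)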

Question 2: However, there is still a lot of data for us to investigate, so it's a good idea to start off by looking at one entry from each of the cities we're going to analyze. Run the first code cell below to load some packages and functions that you'll be using in your analysis. Then, complete the second code cell to print out the first trip recorded from each of the cities (the second line of each data file).

Tip: You can run a code cell like you formatted Markdown cells above by clicking on the cell and using the keyboard shortcut Shift + Enter or Shift + Return. Alternatively, a code cell can be executed using the Play button in the toolbar after selecting it. While the cell is running, you will see an asterisk in the message to the left of the cell, i.e. In [*]:. The asterisk will change into a number to show that execution has completed, e.g. In [1]. If there is output, it will show up as Out [1]:, with an appropriate number to match the "In" number.

In [70]:
## import all necessary packages and functions.
import csv # read and write csv files
from datetime import datetime # operations to parse dates
from pprint import pprint # use to print data structures like dictionaries in
                          # a nicer way than the base print function.
In [71]:
def print_first_point(filename):
    """
    This function prints and returns the first data point (second row) from
    a csv file that includes a header row.
    """
    # print city name for reference
    city = filename.split('-')[0].split('/')[-1]
    print('\nCity: {}'.format(city))
    
    with open(filename, 'r') as f_in:
        ## TODO: Use the csv library to set up a DictReader object. ##
        ## see https://docs.python.org/3/library/csv.html           ##
        trip_reader = csv.DictReader(f_in)
        
        ## TODO: Use a function on the DictReader object to read the     ##
        ## first trip from the data file and store it in a variable.     ##
        ## see https://docs.python.org/3/library/csv.html#reader-objects ##
        first_trip = next(trip_reader)
        
        ## TODO: Use the pprint library to print the first trip. ##
        ## see https://docs.python.org/3/library/pprint.html     ##
        pprint(first_trip)
        
    # output city name and first trip for later testing
    return (city, first_trip)

# list of files for each city
data_files = ['./data/NYC-CitiBike-2016.csv',
              './data/Chicago-Divvy-2016.csv',
              './data/Washington-CapitalBikeshare-2016.csv',]

# print the first trip from each file, store in dictionary
example_trips = {}
for data_file in data_files:
    city, first_trip = print_first_point(data_file)
    example_trips[city] = first_trip
City: NYC
OrderedDict([('tripduration', '839'),
             ('starttime', '1/1/2016 00:09:55'),
             ('stoptime', '1/1/2016 00:23:54'),
             ('start station id', '532'),
             ('start station name', 'S 5 Pl & S 4 St'),
             ('start station latitude', '40.710451'),
             ('start station longitude', '-73.960876'),
             ('end station id', '401'),
             ('end station name', 'Allen St & Rivington St'),
             ('end station latitude', '40.72019576'),
             ('end station longitude', '-73.98997825'),
             ('bikeid', '17109'),
             ('usertype', 'Customer'),
             ('birth year', ''),
             ('gender', '0')])

City: Chicago
OrderedDict([('trip_id', '9080545'),
             ('starttime', '3/31/2016 23:30'),
             ('stoptime', '3/31/2016 23:46'),
             ('bikeid', '2295'),
             ('tripduration', '926'),
             ('from_station_id', '156'),
             ('from_station_name', 'Clark St & Wellington Ave'),
             ('to_station_id', '166'),
             ('to_station_name', 'Ashland Ave & Wrightwood Ave'),
             ('usertype', 'Subscriber'),
             ('gender', 'Male'),
             ('birthyear', '1990')])

City: Washington
OrderedDict([('Duration (ms)', '427387'),
             ('Start date', '3/31/2016 22:57'),
             ('End date', '3/31/2016 23:04'),
             ('Start station number', '31602'),
             ('Start station', 'Park Rd & Holmead Pl NW'),
             ('End station number', '31207'),
             ('End station', 'Georgia Ave and Fairmont St NW'),
             ('Bike number', 'W20842'),
             ('Member Type', 'Registered')])

If everything has been filled out correctly, you should see, below the printout of each city name (parsed from the data file name), the first trip rendered as a dictionary. When you set up a DictReader object, the first row of the data file is normally interpreted as column names. Every other row in the data file then uses those column names as keys, since a dictionary is generated for each row.

This will be useful since we can refer to quantities by an easily-understandable label instead of just a numeric index. For example, if we have a trip stored in the variable row, then we would rather get the trip duration from row['duration'] instead of row[0].
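
As a small illustration of that difference (a sketch using the NYC file loaded above, whose raw column for trip duration is 'tripduration'; the condensed files created later use the shorter key 'duration'):

import csv

with open('./data/NYC-CitiBike-2016.csv', 'r') as f_in:
    plain_reader = csv.reader(f_in)
    header = next(plain_reader)      # first row holds the column names
    first_row = next(plain_reader)   # second row is the first trip, as a plain list
    print(first_row[0])              # positional access: which column is 0 again?

with open('./data/NYC-CitiBike-2016.csv', 'r') as f_in:
    dict_reader = csv.DictReader(f_in)  # the header row automatically becomes the keys
    first_trip = next(dict_reader)
    print(first_trip['tripduration'])   # access by column name is self-documenting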

Condensing the Trip Data

It should also be observable from the above printout that each city provides different information. Even where the information is the same, the column names and formats are sometimes different. To make things as simple as possible when we get to the actual exploration, we should trim and clean the data. Cleaning the data makes sure that the data formats across the cities are consistent, while trimming focuses only on the parts of the data we are most interested in to make the exploration easier to work with.

You will generate new data files with five values of interest for each trip: trip duration, starting month, starting hour, day of the week, and user type. Each of these may require additional wrangling depending on the city:

  • Duration: This has been given to us in seconds (New York, Chicago) or milliseconds (Washington). Analysis will be more natural if all of the trip durations are given in minutes.
  • Month, Hour, Day of Week: Ridership volume is likely to change based on the season, time of day, and whether it is a weekday or weekend. Use the start time of the trip to obtain these values. The New York City data includes the seconds in their timestamps, while Washington and Chicago do not. The datetime package will be very useful here to make the needed conversions.
  • User Type: It is possible that users who are subscribed to a bike-share system will have different patterns of use compared to users who only have temporary passes. Washington divides its users into two types: 'Registered' for users with annual, monthly, and other longer-term subscriptions, and 'Casual', for users with 24-hour, 3-day, and other short-term passes. The New York and Chicago data uses 'Subscriber' and 'Customer' for these groups, respectively. For consistency, you will convert the Washington labels to match the other two.

Question 3a: Complete the helper functions in the code cells below to address each of the cleaning tasks described above.

In [72]:
def duration_in_mins(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the trip duration in units of minutes.
    
    Remember that Washington is in terms of milliseconds while Chicago and NYC
    are in terms of seconds. 
    
    HINT: The csv module reads in all of the data as strings, including numeric
    values. You will need a function to convert the strings into an appropriate
    numeric type when making your transformations.
    see https://docs.python.org/3/library/functions.html
    """
    
    # YOUR CODE HERE
    # {city: [variable_name, time_coefficient]}
    var_dict = {"NYC": ['tripduration', 60], "Chicago": ['tripduration', 60], "Washington": ['Duration (ms)', 60000]}
    duration = float(datum[var_dict[city][0]]) / var_dict[city][1]   
    return duration


# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': 13.9833,
         'Chicago': 15.4333,
         'Washington': 7.1231}

for city in tests:
    assert abs(duration_in_mins(example_trips[city], city) - tests[city]) < .001
In [73]:
def time_of_trip(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the month, hour, and day of the week in
    which the trip was made.
    
    Remember that NYC includes seconds, while Washington and Chicago do not.
    
    HINT: You should use the datetime module to parse the original date
    strings into a format that is useful for extracting the desired information.
    see https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
    """
    
    # YOUR CODE HERE
    #{city: [variable_name, format_string]}
    var_dict = {"NYC": ['starttime', '%m/%d/%Y %H:%M:%S'], "Chicago": ['starttime', '%m/%d/%Y %H:%M'], "Washington": ['Start date', '%m/%d/%Y %H:%M']}
    date_obj = datetime.strptime(datum[var_dict[city][0]], var_dict[city][1])
    month, hour, day_of_week = date_obj.month, date_obj.hour, date_obj.date().strftime('%A')
    return (month, hour, day_of_week)


# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': (1, 0, 'Friday'),
         'Chicago': (3, 23, 'Thursday'),
         'Washington': (3, 22, 'Thursday')}

for city in tests:
    assert time_of_trip(example_trips[city], city) == tests[city]
In [74]:
def type_of_user(datum, city):
    """
    Takes as input a dictionary containing info about a single trip (datum) and
    its origin city (city) and returns the type of system user that made the
    trip.
    
    Remember that Washington has different category names compared to Chicago
    and NYC. 
    """
    
    # YOUR CODE HERE
    var_dict = {"NYC": 'usertype', "Chicago": 'usertype', "Washington": 'Member Type'}
    user_type = datum[var_dict[city]]
    if city == 'Washington':
        wash_types = {'Registered': 'Subscriber', 'Casual': 'Customer'}
        user_type = wash_types[user_type]
    if user_type == '':
        user_type = 'Customer'
    
    return user_type


# Some tests to check that your code works. There should be no output if all of
# the assertions pass. The `example_trips` dictionary was obtained from when
# you printed the first trip from each of the original data files.
tests = {'NYC': 'Customer',
         'Chicago': 'Subscriber',
         'Washington': 'Subscriber'}

for city in tests:
    assert type_of_user(example_trips[city], city) == tests[city]

Question 3b: Now, use the helper functions you wrote above to create a condensed data file for each city consisting only of the data fields indicated above. In the /examples/ folder, you will see an example datafile from the Bay Area Bike Share before and after conversion. Make sure that your output is formatted to be consistent with the example file.

In [75]:
def condense_data(in_file, out_file, city):
    """
    This function takes full data from the specified input file
    and writes the condensed data to a specified output file. The city
    argument determines how the input file will be parsed.
    
    HINT: See the cell below to see how the arguments are structured!
    """
    
    with open(out_file, 'w') as f_out, open(in_file, 'r') as f_in:
        # set up csv DictWriter object - writer requires column names for the
        # first row as the "fieldnames" argument
        out_colnames = ['duration', 'month', 'hour', 'day_of_week', 'user_type']        
        trip_writer = csv.DictWriter(f_out, fieldnames = out_colnames)
        trip_writer.writeheader()
        
        ## TODO: set up csv DictReader object ##
        trip_reader = csv.DictReader(f_in)

        # collect data from and process each row
        for row in trip_reader:
            # set up a dictionary to hold the values for the cleaned and trimmed
            # data point
            new_point = {}

            ## TODO: use the helper functions to get the cleaned data from  ##
            ## the original data dictionaries.                              ##
            ## Note that the keys for the new_point dictionary should match ##
            ## the column names set in the DictWriter object above.         ##
            new_point['duration'] = duration_in_mins(row, city)
            new_point['month'], new_point['hour'], new_point['day_of_week'] = time_of_trip(row, city)
            new_point['user_type'] = type_of_user(row, city)

            ## TODO: write the processed information to the output file.     ##
            ## see https://docs.python.org/3/library/csv.html#writer-objects ##
            trip_writer.writerow(new_point)
            
In [76]:
# Run this cell to check your work
city_info = {'Washington': {'in_file': './data/Washington-CapitalBikeshare-2016.csv',
                            'out_file': './data/Washington-2016-Summary.csv'},
             'Chicago': {'in_file': './data/Chicago-Divvy-2016.csv',
                         'out_file': './data/Chicago-2016-Summary.csv'},
             'NYC': {'in_file': './data/NYC-CitiBike-2016.csv',
                     'out_file': './data/NYC-2016-Summary.csv'}}

for city, filenames in city_info.items():
    condense_data(filenames['in_file'], filenames['out_file'], city)
    print_first_point(filenames['out_file'])
City: Washington
OrderedDict([('duration', '7.123116666666666'),
             ('month', '3'),
             ('hour', '22'),
             ('day_of_week', 'Thursday'),
             ('user_type', 'Subscriber')])

City: Chicago
OrderedDict([('duration', '15.433333333333334'),
             ('month', '3'),
             ('hour', '23'),
             ('day_of_week', 'Thursday'),
             ('user_type', 'Subscriber')])

City: NYC
OrderedDict([('duration', '13.983333333333333'),
             ('month', '1'),
             ('hour', '0'),
             ('day_of_week', 'Friday'),
             ('user_type', 'Customer')])

Tip: If you save a Jupyter Notebook, the output from running code blocks will also be saved. However, the state of your workspace will be reset once a new session is started. Make sure that you run all of the necessary code blocks from your previous session to reestablish variables and functions before picking up where you last left off.

Exploratory Data Analysis

Now that you have the data collected and wrangled, you're ready to start exploring the data. In this section you will write some code to compute descriptive statistics from the data. You will also be introduced to the matplotlib library to create some basic histograms of the data.

Statistics

First, let's compute some basic counts. The first cell below contains a function that uses the csv module to iterate through a provided data file, returning the number of trips made by subscribers and customers. The second cell runs this function on the example Bay Area data in the /examples/ folder. Modify the cells to answer the question below.

Question 4a: Which city has the highest number of trips? Which city has the highest proportion of trips made by subscribers? Which city has the highest proportion of trips made by short-term customers?

Answer: New York City has the highest number of trips (276798), followed by Chicago with 72131 trips; Washington comes last with 66326 trips. NYC also has the highest proportion of trips made by subscribers (88.8%), while Chicago has the highest proportion of trips made by short-term customers (23.8%).

In [77]:
def number_of_trips(filename):
    """
    This function reads in a file with trip data and reports the number of
    trips made by subscribers, customers, and total overall.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        # initialize count variables
        n_subscribers = 0
        n_customers = 0
        
        # tally up ride types
        for row in reader:
            if row['user_type'] == 'Subscriber':
                n_subscribers += 1
            else:
                n_customers += 1
        
        # compute total number of rides
        n_total = n_subscribers + n_customers
        
        # return tallies as a tuple
        return(n_subscribers, n_customers, n_total)
In [78]:
## Modify this and the previous cell to answer Question 4a. Remember to run ##
## the function on the cleaned data files you created from Question 3.      ##
path = './data/'
file_list = ['Chicago-2016-Summary.csv', 'NYC-2016-Summary.csv', 'Washington-2016-Summary.csv']

#[(city, n_subscribers, n_customers, n_total)]
city_list = []
for file in file_list:
    city_list.append((file.split('-')[0],) + number_of_trips(path + file))

print('City,', 'Subscribers,', 'Customers,', 'Total')
pprint(city_list)
print('Which city has the highest number of trips?')
hnt = max(city_list, key=lambda x: x[3])
print(hnt[0], hnt[3])

print('Which city has the highest proportion of trips made by subscribers?') 
hpt_sub = max(city_list, key=lambda x: x[1]/x[3])
print(hpt_sub[0], hpt_sub[1]/hpt_sub[3])

print('Which city has the highest proportion of trips made by short-term customers?')
hpt_cust = max(city_list, key=lambda x: x[2]/x[3])
print(hpt_cust[0], hpt_cust[2]/hpt_cust[3])

#data_file = './examples/BayArea-Y3-Summary.csv'
#print(number_of_trips(data_file))
City, Subscribers, Customers, Total
[('Chicago', 54982, 17149, 72131),
 ('NYC', 245896, 30902, 276798),
 ('Washington', 51753, 14573, 66326)]
Which city has the highest number of trips?
NYC 276798
Which city has the highest proportion of trips made by subscribers?
NYC 0.8883590199351151
Which city has the highest proportion of trips made by short-term customers?
Chicago 0.23774798630269925

Tip: In order to add additional cells to a notebook, you can use the "Insert Cell Above" and "Insert Cell Below" options from the menu bar above. There is also an icon in the toolbar for adding new cells, with additional icons for moving the cells up and down the document. By default, new cells are of the code type; you can also specify the cell type (e.g. Code or Markdown) of selected cells from the Cell menu or the dropdown in the toolbar.

Now, you will write your own code to continue investigating properties of the data.

Question 4b: Bike-share systems are designed for riders to take short trips. Most of the time, users are allowed to take trips of 30 minutes or less with no additional charges, with overage charges made for trips of longer than that duration. What is the average trip length for each city? What proportion of rides made in each city are longer than 30 minutes?

Answer: In Chicago the average trip length is 16.6 minutes, and 8.3% of trips are longer than 30 minutes. In NYC the average trip length is 15.8 minutes, and 7.3% of trips are longer than 30 minutes. In Washington the average trip length is 18.9 minutes, and 10.8% of trips are longer than 30 minutes.

In [79]:
## Use this and additional cells to answer Question 4b.                 ##
##                                                                      ##
## HINT: The csv module reads in all of the data as strings, including  ##
## numeric values. You will need a function to convert the strings      ##
## into an appropriate numeric type before you aggregate data.          ##
## TIP: For the Bay Area example, the average trip length is 14 minutes ##
## and 3.5% of trips are longer than 30 minutes.                        ##

def trip_durations(filename):
    """
    This function reads in a file with trip data and reports the average trip
    length and the proportion of trips longer than 30 minutes.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        durations = []
        
        for row in reader:
            durations.append(float(row['duration'])) 
        
        avg_duration = sum(durations)/len(durations)
        long_trip_prop = sum(x > 30 for x in durations)/len(durations)
        
        # return results as a tuple
        return(avg_duration, long_trip_prop)
In [80]:
## TIP: For the Bay Area example, the average trip length is 14 minutes ##
## and 3.5% of trips are longer than 30 minutes.                        ##

#test
data_file = './examples/BayArea-Y3-Summary.csv'
print(trip_durations(data_file))
(14.038656929671422, 0.035243689474519765)
In [81]:
# for city files
for file in file_list:
    print(file.split('-')[0], trip_durations(path + file))
Chicago (16.563629368787335, 0.08332062497400562)
NYC (15.81259299802294, 0.07302437156337835)
Washington (18.93287355913721, 0.10838886711093688)

Question 4c: Dig deeper into the question of trip duration based on ridership. Choose one city. Within that city, which type of user takes longer rides on average: Subscribers or Customers?

Answer: On average, Subscribers take shorter rides than Customers in all three cities: their average trip durations are 12.1 minutes in Chicago, 12.5 minutes in Washington and 13.7 minutes in NYC. The average Customer trip is more than twice as long: 31 minutes in Chicago, 32.8 minutes in NYC and 41.7 minutes in Washington.

In [82]:
## Use this and additional cells to answer Question 4c. If you have    ##
## not done so yet, consider revising some of your previous code to    ##
## make use of functions for reusability.                              ##
##                                                                     ##
## TIP: For the Bay Area example data, you should find the average     ##
## Subscriber trip duration to be 9.5 minutes and the average Customer ##
## trip duration to be 54.6 minutes. Do the other cities have this     ##
## level of difference?                                                ##

def avg_trip_durations_by_users(filename):
    """
    This function reads in a file with trip data and reports the average trip
    length for each user type.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        users = {}
        for row in reader:
            if row['user_type'] not in ['Subscriber', 'Customer']:
                row['user_type'] = 'Customer'
            if row['user_type'] in users:
                users[row['user_type']].append(float(row['duration']))
            else:
                users[row['user_type']] = [float(row['duration'])]
        
        result_list = []
        for user_type in users:
            result_list.append((user_type, sum(users[user_type])/len(users[user_type])))
        
        #returns list of tuples
        return sorted(result_list, key=lambda x: x[0], reverse = True)
In [83]:
## TIP: For the Bay Area example data, you should find the average     ##
## Subscriber trip duration to be 9.5 minutes and the average Customer ##
## trip duration to be 54.6 minutes. Do the other cities have this     ##
## level of difference?                                                ##

#test
data_file = './examples/BayArea-Y3-Summary.csv'
print(avg_trip_durations_by_users(data_file))
[('Subscriber', 9.512633839275217), ('Customer', 54.55121116377032)]
In [84]:
for file in file_list:
    print(file.split('-')[0], avg_trip_durations_by_users(path + file))
Chicago [('Subscriber', 12.067201690250076), ('Customer', 30.979781133982506)]
NYC [('Subscriber', 13.680790523907177), ('Customer', 32.77595139473187)]
Washington [('Subscriber', 12.528120499294745), ('Customer', 41.67803139252976)]

Visualizations

The last set of values that you computed should have pulled up an interesting result. While the mean trip time for Subscribers is well under 30 minutes, the mean trip time for Customers is actually above 30 minutes! It will be interesting for us to look at how the trip times are distributed. In order to do this, a new library will be introduced here, matplotlib. Run the cell below to load the library and to generate an example plot.

In [85]:
# load library
import matplotlib.pyplot as plt

# this is a 'magic word' that allows for plots to be displayed
# inline with the notebook. If you want to know more, see:
# http://ipython.readthedocs.io/en/stable/interactive/magics.html
%matplotlib inline 

# example histogram, data taken from bay area sample
data = [ 7.65,  8.92,  7.42,  5.50, 16.17,  4.20,  8.98,  9.62, 11.48, 14.33,
        19.02, 21.53,  3.90,  7.97,  2.62,  2.67,  3.08, 14.40, 12.90,  7.83,
        25.12,  8.30,  4.93, 12.43, 10.60,  6.17, 10.88,  4.78, 15.15,  3.53,
         9.43, 13.32, 11.72,  9.85,  5.22, 15.10,  3.95,  3.17,  8.78,  1.88,
         4.55, 12.68, 12.38,  9.78,  7.63,  6.45, 17.38, 11.90, 11.52,  8.63,]
plt.hist(data)
plt.title('Distribution of Trip Durations')
plt.xlabel('Duration (m)')
plt.show()

In the above cell, we collected fifty trip times in a list, and passed this list as the first argument to the .hist() function. This function performs the computations and creates plotting objects for generating a histogram, but the plot is actually not rendered until the .show() function is executed. The .title() and .xlabel() functions provide some labeling for plot context.

You will now use these functions to create a histogram of the trip times for the city you selected in question 4c. Don't separate the Subscribers and Customers for now: just collect all of the trip times and plot them.

In [86]:
## Use this and additional cells to collect all of the trip times as a list ##
## and then use pyplot functions to generate a histogram of trip times.     ##
def get_trip_durations(filename):
    """
    This function reads in a file with trip data and returns the list of all
    trip durations in the file.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        durations = []
        
        for row in reader:
            durations.append(float(row['duration'])) 

        return durations

for file in file_list:
    data = get_trip_durations(path + file)
    plt.hist(data)
    plt.title('Distribution of Trip Durations in ' + file.split('-')[0])
    plt.xlabel('Duration (m)')
    plt.show()

If you followed the use of the .hist() and .show() functions exactly like in the example, you're probably looking at a plot that's completely unexpected. The plot consists of one extremely tall bar on the left, maybe a very short second bar, and a whole lot of empty space in the center and right. Take a look at the duration values on the x-axis. This suggests that there are some highly infrequent outliers in the data. Instead of reprocessing the data, you will use additional parameters with the .hist() function to limit the range of data that is plotted. Documentation for the function can be found [here].
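
As a minimal sketch of the relevant keyword arguments (the exact values are your choice; `data` here is assumed to be a list of trip durations like the one built in the previous cell):

plt.hist(data, range=(0, 75), bins=15)   # 15 bins over 0-75 minutes = 5-minute-wide bins
plt.title('Distribution of Trip Durations (under 75 minutes)')
plt.xlabel('Duration (m)')
plt.show()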

Question 5: Use the parameters of the .hist() function to plot the distribution of trip times for the Subscribers in your selected city. Do the same thing for only the Customers. Add limits to the plots so that only trips of duration less than 75 minutes are plotted. As a bonus, set the plots up so that bars are in five-minute wide intervals. For each group, where is the peak of each distribution? How would you describe the shape of each distribution?

Answer: In all three cities the modal range of Subscriber trip durations is from 5 to 10 minutes, followed by the 10 to 15 minute range. The distribution of Subscriber trip durations is positively skewed. For Customers in NYC and Chicago the modal range is from 20 to 25 minutes, followed by the 15 to 20 minute interval, while in Washington it is the other way around: the mode is the 15-20 minute interval and the second most frequent range is from 20 to 25 minutes. The Customer distribution is also positively skewed, but it has higher variability and substantially more observations beyond 30 minutes.

In [87]:
## Use this and additional cells to answer Question 5. ##
def get_trip_durations_by_users(filename):
    """
    This function reads in a file with trip data and returns a dictionary of user
    types with all trip durations for each type.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        users = {}
        for row in reader:
            if row['user_type'] not in ['Subscriber', 'Customer']:
                row['user_type'] = 'Customer'
            if row['user_type'] in users:
                users[row['user_type']].append(float(row['duration']))
            else:
                users[row['user_type']] = [float(row['duration'])]
        
        return users

for file in file_list:
    data = get_trip_durations_by_users(path + file)
    for key in data:
        plt.hist(data[key], range=(0,75), bins = 75//5, edgecolor='black', alpha=0.5, label=key)
    plt.title('Distribution of Trip Durations in ' + file.split('-')[0])
    plt.xlabel('Duration (m)')
    plt.legend(loc='upper right')
    plt.show()

Performing Your Own Analysis

So far, you've performed an initial exploration into the data available. You have compared the relative volume of trips made between three U.S. cities and the ratio of trips made by Subscribers and Customers. For one of these cities, you have investigated differences between Subscribers and Customers in terms of how long a typical trip lasts. Now it is your turn to continue the exploration in a direction that you choose. Here are a few suggestions for questions to explore:

  • How does ridership differ by month or season? Which month / season has the highest ridership? Does the ratio of Subscriber trips to Customer trips change depending on the month or season?
  • Is the pattern of ridership different on the weekends versus weekdays? On what days are Subscribers most likely to use the system? What about Customers? Does the average duration of rides change depending on the day of the week?
  • During what time of day is the system used the most? Is there a difference in usage patterns for Subscribers and Customers?

If any of the questions you posed in your answer to question 1 align with the bullet points above, this is a good opportunity to investigate one of them. As part of your investigation, you will need to create a visualization. If you want to create something other than a histogram, then you might want to consult the Pyplot documentation. In particular, if you are plotting values across a categorical variable (e.g. city, user type), a bar chart will be useful. The documentation page for .bar() includes links at the bottom of the page with examples for you to build off of for your own use.
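
For instance, here is a minimal sketch of .bar() applied to a categorical split, reusing the Chicago totals computed in Question 4a purely to show the call pattern:

user_types = ['Subscriber', 'Customer']
trip_counts = [54982, 17149]   # Chicago totals from Question 4a

plt.bar(range(len(user_types)), trip_counts)
plt.xticks(range(len(user_types)), user_types)   # label the categorical axis
plt.title('Number of Trips by User Type (Chicago)')
plt.ylabel('Number of trips')
plt.show()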

Question 6: Continue the investigation by exploring another question that could be answered by the data available. Document the question you want to explore below. Your investigation should involve at least two variables and should compare at least two groups. You should also use at least one visualization as part of your explorations.

Answer: To conduct the exploratory data analysis, the following steps were taken.

Step 1. The set of variables available for analysis was extended to include the start and end stations for each trip, as well as the age and gender of users. For the last two categories, the information is available only for two of the cities, NYC and Chicago, and mostly for subscribers.
Step 2. Helper functions were created to obtain counts and duration-related data for the categorical variables of the data set, and also for the visualizations.
Step 3. The data were explored for answers to the questions stated at the beginning of the project.

Time patterns in bike share usage in each city

Concerning time patterns in bike share usage, we can conclude that the three cities are similar in several ways:

  • trips on weekends (Saturdays and Sundays) tend to be longer than on weekdays, mostly because more short-term customers use the service on weekends and take longer trips on average, while fewer subscribers use the service at that time.
  • the total number of trips on weekends is usually smaller than on weekdays.
  • the distribution of the number of trips by hour is bimodal, with two pronounced peaks around 7 and 17. The shortest trips usually occur in the morning hours in all cities, though the hour of the minimum differs from city to city. Since the pattern follows the typical workday schedule, some differences may be found between weekday and weekend data.
  • the 'high season' for bike trips in NYC and Washington is from June to October, while in Chicago it is from June to September. The average trip duration by month in Chicago mirrors the distribution of the number of trips, while in NYC and Washington the average durations of spring trips are comparable to those of summer trips. This may be due to the geographical locations of the cities and their climate differences.

Bike share users

Concerning the demographic characteristics of bike share users, the distributions of trip durations for both male and female users in NYC and Chicago match the overall distribution for subscribers, and there are almost no gender data for customers. On average, female users' trips are about 2 minutes longer than male users' trips in both cities. The minimum age reported by users is 16, while the maximum is 117 for Chicago and 131 for NYC, which looks more like an input error than a real value. The age distribution is right-skewed, with the modal range between 25 and 35 for both cities and a noticeable 'tail' toward older ages, which is 'thicker' in NYC. In Chicago, bike sharing seems to be less popular than in NYC among age groups above 35.

In [88]:
## Use this and additional cells to continue to explore the dataset. ##
## Once you have performed your exploration, document your findings  ##
## in the Markdown cell above.                                       ##
In [89]:
def get_start_end_stations(datum, city):
    var_dict = {"NYC": ('start station name', 'end station name'), 
                "Chicago": ('from_station_name', 'to_station_name'),
                "Washington": ('Start station', 'End station')}

    # (start station, end station)
    return (datum[var_dict[city][0]], datum[var_dict[city][1]])

def get_age_gender(datum, city):
    var_dict = {"NYC": ('birth year', 'gender'), 
                "Chicago": ('birthyear', 'gender')}
    YEAR = 2016 # data year
    GENDER = {'0': 'NA', '1': 'Male', '2': 'Female'}
    
    age = 'NA'
    gender = 'NA'
    
    if city in var_dict:
        if datum[var_dict[city][0]]:
            age = YEAR - int(datum[var_dict[city][0]])
        
        if datum[var_dict[city][1]]:
            gender = datum[var_dict[city][1]]
            if gender in GENDER:
                gender = GENDER[datum[var_dict[city][1]]]
                    
    return (age, gender)
    

def condense_data_ext(in_file, out_file, city):
    """
    This function is the extended version of condense_data() function. 
    It takes full data from the specified input file
    and writes the condensed data to a specified output file. The city
    argument determines how the input file will be parsed.

    """
    
    with open(out_file, 'w') as f_out, open(in_file, 'r') as f_in:
        # set up csv DictWriter object - writer requires column names for the
        # first row as the "fieldnames" argument
        out_colnames = ['duration', 'month', 'hour', 'day_of_week', 'start_station',
                        'end_station', 'user_type', 'age', 'gender']        
        trip_writer = csv.DictWriter(f_out, fieldnames = out_colnames)
        trip_writer.writeheader()
        
        ## TODO: set up csv DictReader object ##
        trip_reader = csv.DictReader(f_in)

        # collect data from and process each row
        for row in trip_reader:
            # set up a dictionary to hold the values for the cleaned and trimmed
            # data point
            new_point = {}

            new_point['duration'] = duration_in_mins(row, city)
            new_point['month'], new_point['hour'], new_point['day_of_week'] = time_of_trip(row, city)
            new_point['user_type'] = type_of_user(row, city)
            new_point['start_station'], new_point['end_station'] = get_start_end_stations(row, city)
            new_point['age'], new_point['gender'] = get_age_gender(row, city)
            
            trip_writer.writerow(new_point)
In [90]:
## Writing extended data into separate csv files
## Printing first rows of extented data for each city
city_info = {'Washington': {'in_file': './data/Washington-CapitalBikeshare-2016.csv',
                            'out_file': './data/Washington-2016-Summary_Ext.csv'},
             'Chicago': {'in_file': './data/Chicago-Divvy-2016.csv',
                         'out_file': './data/Chicago-2016-Summary_Ext.csv'},
             'NYC': {'in_file': './data/NYC-CitiBike-2016.csv',
                     'out_file': './data/NYC-2016-Summary_Ext.csv'}}

for city, filenames in city_info.items():
    condense_data_ext(filenames['in_file'], filenames['out_file'], city)
    print_first_point(filenames['out_file'])
City: Washington
OrderedDict([('duration', '7.123116666666666'),
             ('month', '3'),
             ('hour', '22'),
             ('day_of_week', 'Thursday'),
             ('start_station', 'Park Rd & Holmead Pl NW'),
             ('end_station', 'Georgia Ave and Fairmont St NW'),
             ('user_type', 'Subscriber'),
             ('age', 'NA'),
             ('gender', 'NA')])

City: Chicago
OrderedDict([('duration', '15.433333333333334'),
             ('month', '3'),
             ('hour', '23'),
             ('day_of_week', 'Thursday'),
             ('start_station', 'Clark St & Wellington Ave'),
             ('end_station', 'Ashland Ave & Wrightwood Ave'),
             ('user_type', 'Subscriber'),
             ('age', '26'),
             ('gender', 'Male')])

City: NYC
OrderedDict([('duration', '13.983333333333333'),
             ('month', '1'),
             ('hour', '0'),
             ('day_of_week', 'Friday'),
             ('start_station', 'S 5 Pl & S 4 St'),
             ('end_station', 'Allen St & Rivington St'),
             ('user_type', 'Customer'),
             ('age', 'NA'),
             ('gender', 'NA')])
In [91]:
def get_avg_time_and_trip_count_for_key(filename, key):
    """
    Helper function. Returns average trip duration and number of trips 
    for any key (variable) in the city file
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        key_dict =  {}
        for row in reader:
            if row[key] in key_dict:
                key_dict[row[key]].append(float(row['duration']))
            else:
                key_dict[row[key]] = [float(row['duration'])]
            
        #[(subcategory, avg_trip_duration, number_of_trips)]
        result_list = []
        for item in key_dict:
            result_list.append((item, sum(key_dict[item])/len(key_dict[item]), len(key_dict[item])))
            
        #Highest number of trips first
        return sorted(result_list, key=lambda x: x[2], reverse = True)
In [103]:
ext_file_list = ['Chicago-2016-Summary_Ext.csv', 'NYC-2016-Summary_Ext.csv', 'Washington-2016-Summary_Ext.csv']
keys = ['day_of_week', 'user_type', 'gender']

for item in keys:
    print(item)
    for file in ext_file_list:
        print(file.split('-')[0])
        pprint(get_avg_time_and_trip_count_for_key(path + file, item))
    print()
day_of_week
Chicago
[('Monday', 16.115381889066008, 11286),
 ('Tuesday', 14.297109950203156, 10911),
 ('Friday', 15.576898488657195, 10741),
 ('Thursday', 13.933764654942733, 10008),
 ('Saturday', 20.630974446794895, 9927),
 ('Sunday', 21.381674262827143, 9654),
 ('Wednesday', 14.462123420796864, 9604)]
NYC
[('Wednesday', 14.636322047696872, 44629),
 ('Thursday', 14.552951725693545, 44330),
 ('Tuesday', 14.297031403529449, 42405),
 ('Friday', 16.023964096740567, 41389),
 ('Monday', 15.07595238095228, 39340),
 ('Saturday', 18.795957884848054, 33353),
 ('Sunday', 18.789433954239893, 31352)]
Washington
[('Wednesday', 16.294892746379023, 10103),
 ('Thursday', 16.685823671207284, 9984),
 ('Friday', 17.931890384486792, 9970),
 ('Tuesday', 16.69108387361512, 9748),
 ('Monday', 17.5637550635157, 9394),
 ('Saturday', 24.81150271161071, 8900),
 ('Sunday', 23.972442542846792, 8227)]

user_type
Chicago
[('Subscriber', 12.067201690250076, 54982),
 ('Customer', 30.979781133982506, 17149)]
NYC
[('Subscriber', 13.680790523907177, 245896),
 ('Customer', 32.77595139473187, 30902)]
Washington
[('Subscriber', 12.528120499294745, 51753),
 ('Customer', 41.67803139252976, 14573)]

gender
Chicago
[('Male', 11.620316874625653, 41194),
 ('NA', 30.97292954024325, 17154),
 ('Female', 13.404497085781982, 13783)]
NYC
[('Male', 13.14215250440846, 184681),
 ('Female', 15.383592582681091, 59788),
 ('NA', 31.860992194830324, 32329)]
Washington
[('NA', 18.93287355913721, 66326)]

In [93]:
#Plotting avt time for month and hour
import numpy as np

def plot_avg_time(city, data, key):
    """Plots a bar chart using the results of get_avg_time_and_trip_count_for_key() function
    """
    #sort data by key
    data = sorted(data, key=lambda x: int(x[0]))
    bars = [x[0] for x in data]
    heights = [x[1] for x in data]
    y_pos = np.arange(len(bars))
    
    plt.figure(figsize = (10,6))
    plt.bar(y_pos, heights)
    plt.title('Avg Trip Durations in ' + city + " by " + key)
    plt.xticks(y_pos, bars)
    
    plt.show()

for item in ['month', 'hour']:
    for file in ext_file_list:
        key = item
        city = file.split('-')[0]
        data = get_avg_time_and_trip_count_for_key(path + file, key)
        plot_avg_time(city, data, key)
In [94]:
#Plotting number of trips for month and hour
def plot_trip_count(city, data, key):
    """Plots a bar chart using the results of get_avg_time_and_trip_count_for_key() function
    """
    #sort data by key
    data = sorted(data, key=lambda x: int(x[0]))
    bars = [x[0] for x in data]
    heights = [x[2] for x in data]
    y_pos = np.arange(len(bars))
    
    plt.figure(figsize = (10,6))
    plt.bar(y_pos, heights)
    plt.title('Number of trips in ' + city + " by " + key)
    plt.xticks(y_pos, bars)
    
    plt.show()

for item in ['month', 'hour']:
    for file in ext_file_list:
        key = item
        city = file.split('-')[0]
        data = get_avg_time_and_trip_count_for_key(path + file, key)
        plot_trip_count(city, data, key)
In [95]:
for file in ext_file_list:
    print(file.split('-')[0], 'most popular Start Station:')
    print(max(get_avg_time_and_trip_count_for_key(path + file, 'start_station'), key=lambda x: x[2]))
    print(file.split('-')[0], 'most popular End Station:')
    print(max(get_avg_time_and_trip_count_for_key(path + file, 'end_station'), key=lambda x: x[2]))
Chicago most popular Start Station:
('Streeter Dr & Grand Ave', 27.938913082925026, 1837)
Chicago most popular End Station:
('Streeter Dr & Grand Ave', 28.421518151815178, 2020)
NYC most popular Start Station:
('Pershing Square North', 14.21704851752018, 2968)
NYC most popular End Station:
('Pershing Square North', 12.86237823312056, 2977)
Washington most popular Start Station:
('Columbus Circle / Union Station', 13.603536275216163, 1388)
Washington most popular End Station:
('Columbus Circle / Union Station', 13.296270974820976, 1443)
In [96]:
def pivot_key_by_another_key(filename, key_to_count, key_to_split):
    """
    Helper function. Pivot substitute. Returns number of entries 
    for any key (variable) in the city file, split by another key (variable)
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        result_dict =  {}
        for row in reader:
            top_cat = row[key_to_split]
            sub_cat = row[key_to_count]
            
            if top_cat in result_dict:
                if sub_cat in result_dict[top_cat]:
                    result_dict[top_cat][sub_cat] += 1
                else:
                    result_dict[top_cat][sub_cat] = 1
            else:
                result_dict[top_cat] = {}
                result_dict[top_cat][sub_cat] = 1           
           
        return result_dict
In [97]:
for file in ext_file_list:
    print(file.split('-')[0])
    pprint(pivot_key_by_another_key(path + file, 'user_type', 'month'))
Chicago
{'1': {'Customer': 62, 'Subscriber': 1839},
 '10': {'Customer': 1492, 'Subscriber': 5668},
 '11': {'Customer': 667, 'Subscriber': 4144},
 '12': {'Customer': 60, 'Subscriber': 1718},
 '2': {'Customer': 228, 'Subscriber': 2166},
 '3': {'Customer': 565, 'Subscriber': 3154},
 '4': {'Customer': 1017, 'Subscriber': 3550},
 '5': {'Customer': 2012, 'Subscriber': 5199},
 '6': {'Customer': 2612, 'Subscriber': 7182},
 '7': {'Customer': 3323, 'Subscriber': 6963},
 '8': {'Customer': 2757, 'Subscriber': 7053},
 '9': {'Customer': 2354, 'Subscriber': 6346}}
NYC
{'1': {'Customer': 488, 'Subscriber': 9692},
 '10': {'Customer': 3380, 'Subscriber': 28139},
 '11': {'Customer': 2039, 'Subscriber': 22109},
 '12': {'Customer': 789, 'Subscriber': 15397},
 '2': {'Customer': 569, 'Subscriber': 10601},
 '3': {'Customer': 1878, 'Subscriber': 16535},
 '4': {'Customer': 2632, 'Subscriber': 17528},
 '5': {'Customer': 3209, 'Subscriber': 21246},
 '6': {'Customer': 3136, 'Subscriber': 26106},
 '7': {'Customer': 3977, 'Subscriber': 23545},
 '8': {'Customer': 4412, 'Subscriber': 26692},
 '9': {'Customer': 4393, 'Subscriber': 28306}}
Washington
{'1': {'Customer': 222, 'Subscriber': 2212},
 '10': {'Customer': 1560, 'Subscriber': 5232},
 '11': {'Customer': 1075, 'Subscriber': 4139},
 '12': {'Customer': 432, 'Subscriber': 2922},
 '2': {'Customer': 283, 'Subscriber': 2571},
 '3': {'Customer': 1188, 'Subscriber': 4383},
 '4': {'Customer': 1192, 'Subscriber': 4410},
 '5': {'Customer': 1248, 'Subscriber': 4520},
 '6': {'Customer': 1707, 'Subscriber': 5613},
 '7': {'Customer': 2186, 'Subscriber': 5155},
 '8': {'Customer': 1806, 'Subscriber': 5392},
 '9': {'Customer': 1674, 'Subscriber': 5204}}
In [98]:
for file in ext_file_list:
    print(file.split('-')[0])
    pprint(pivot_key_by_another_key(path + file, 'user_type', 'day_of_week'))
Chicago
{'Friday': {'Customer': 2093, 'Subscriber': 8648},
 'Monday': {'Customer': 2446, 'Subscriber': 8840},
 'Saturday': {'Customer': 4251, 'Subscriber': 5676},
 'Sunday': {'Customer': 4282, 'Subscriber': 5372},
 'Thursday': {'Customer': 1365, 'Subscriber': 8643},
 'Tuesday': {'Customer': 1555, 'Subscriber': 9356},
 'Wednesday': {'Customer': 1157, 'Subscriber': 8447}}
NYC
{'Friday': {'Customer': 3783, 'Subscriber': 37606},
 'Monday': {'Customer': 3717, 'Subscriber': 35623},
 'Saturday': {'Customer': 7227, 'Subscriber': 26126},
 'Sunday': {'Customer': 6898, 'Subscriber': 24454},
 'Thursday': {'Customer': 3133, 'Subscriber': 41197},
 'Tuesday': {'Customer': 2918, 'Subscriber': 39487},
 'Wednesday': {'Customer': 3226, 'Subscriber': 41403}}
Washington
{'Friday': {'Customer': 2012, 'Subscriber': 7958},
 'Monday': {'Customer': 1736, 'Subscriber': 7658},
 'Saturday': {'Customer': 3311, 'Subscriber': 5589},
 'Sunday': {'Customer': 2975, 'Subscriber': 5252},
 'Thursday': {'Customer': 1530, 'Subscriber': 8454},
 'Tuesday': {'Customer': 1426, 'Subscriber': 8322},
 'Wednesday': {'Customer': 1583, 'Subscriber': 8520}}
In [99]:
# Plotting age distribution
def get_values_by_key_no_na(filename, key):
    """
    Helper function. This function reads in a file with trip data and returns 
    a list of values for a specific key, omitting NAs.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)
        
        value_list = []
        
        # collect values for the requested key, skipping missing entries
        for row in reader:
            if row[key] != 'NA':
                value_list.append(row[key])

        return value_list
    
for file in ext_file_list[0:2]:
    city = file.split('-')[0]
    data = [int(x) for x in get_values_by_key_no_na(path + file, 'age')]
    print(min(data), max(data))
    plt.figure(figsize = (10,6))
    plt.hist(data, bins = 20, edgecolor='black', range=(0,100))
    plt.title('Age distribution of Bike Share Users in ' + city)
    plt.show()
16 117
16 131
In [100]:
def get_duration_distribution(filename, key):
    """
    Helper function. Reads in a file with trip data and returns a dictionary
    mapping each value of the given key to the list of trip durations for it.
    """
    with open(filename, 'r') as f_in:
        # set up csv reader object
        reader = csv.DictReader(f_in)

        # {subcategory: [durations]}
        key_dict = {}
        for row in reader:
            if row[key] in key_dict:
                key_dict[row[key]].append(float(row['duration']))
            else:
                key_dict[row[key]] = [float(row['duration'])]

        return key_dict
In [101]:
for file in ext_file_list:
    city = file.split('-')[0]
    data = get_duration_distribution(path + file, 'gender')
    
    plt.figure(figsize = (10,6))
    for item in sorted(data.keys()):
        plt.hist(data[item], bins = 75//5, range = (0,75), alpha=0.5, label=item, edgecolor='black')
    plt.title('Trip Durations in ' + city)
    plt.legend(loc='upper right')
    plt.show()
In [ ]:
 

Conclusions

Congratulations on completing the project! This is only a sampling of the data analysis process: from generating questions, to wrangling the data, to exploring the data. Normally, at this point in the data analysis process, you might want to draw conclusions about the data by performing a statistical test or fitting the data to a model for making predictions. There are also a lot of potential analyses that could be performed on the data which are not possible with only the data provided. For example, detailed location data has not been investigated. Where are the most commonly used docks? What are the most common routes? As another example, weather has the potential to have a large impact on daily ridership. How much is ridership impacted when there is rain or snow? Are subscribers or customers affected more by changes in weather?
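
As one illustration of what such a test could look like, here is a minimal sketch of Welch's t-test comparing Subscriber and Customer trip durations in NYC; note that scipy is an assumption here and is not used elsewhere in this notebook.

from scipy import stats

# compare mean trip durations of Subscribers vs Customers using the condensed NYC file
durations = get_trip_durations_by_users('./data/NYC-2016-Summary.csv')
t_stat, p_value = stats.ttest_ind(durations['Subscriber'],
                                  durations['Customer'],
                                  equal_var=False)   # Welch's t-test (unequal variances)
print('t = {:.2f}, p = {:.3g}'.format(t_stat, p_value))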

Question 7: Putting the bike share data aside, think of a topic or field of interest where you would like to be able to apply the techniques of data science. What would you like to be able to learn from your chosen subject?

Answer: There are many topics providing data for researchers nowadays, but my personal and, to some extent, scientific interest is in different areas of human behavior, which can be as practical as the customer groups of a specific company or as general as international migration and demography. I'd like to learn how different characteristics of human groups correlate with measurable aspects of life, such as the economic situation when considering a broader level, or with characteristics of a product provided by a company and its marketing policies.

In [ ]: