avatarXinyu Chen (陈新宇)

Summary

The web content provides an analysis of millions of taxi trips in Chicago, detailing the process of matching trip data with community area boundaries and visualizing trip patterns, with a focus on comparing travel times and speeds between 2019 and 2022.

Abstract

The City of Chicago's open data portal offers extensive datasets on human mobility, including taxi trips, rideshare trips, bikeshare trips, and e-scooter trips. Researchers have utilized this data to summarize annual trips and visualize the boundaries of Chicago's 77 community areas using Python libraries such as geopandas and matplotlib. The analysis involves processing taxi trip data from 2022, matching trip locations with community area boundaries, and removing outliers to ensure data quality. The study reveals insights into the most popular pickup and dropoff locations, changes in average trip duration and distance, and the impact of these changes on travel times and speeds for specific pickup-dropoff pairs between 2019 and 2022. Visualizations show a significant reduction in average travel time and an increase in average speed for certain routes, suggesting less congestion in 2022 compared to 2019.

Opinions

  • The authors consider the analysis of taxi trips crucial for understanding human mobility patterns within Chicago.
  • There is an emphasis on the importance of data cleaning and outlier removal to maintain the integrity of the analysis.
  • The comparison between 2019 and 2022 data implies that there have been notable improvements in traffic conditions and mobility efficiency in Chicago.
  • The use of Python for data processing and visualization is endorsed as an effective approach for handling large datasets and generating insightful visual representations.
  • The detailed methodology provided suggests a preference for transparency and reproducibility in research.

Analyzing Millions of Taxi Trips in the City of Chicago

The City of Chicago’s open data portal provides a large amount of human mobility data, including taxi trips, TNP rideshare trips, Divvy bikeshare trips, and E-scooter trips. There is a brief summary of annual trips of these travel modes in the City of Chicago, depending on the data availability.

Visualizing Boundaries of Community Areas in Chicago

The data can be viewed on the Chicago Data Portal with a web browser, see 77 community areas in Chicago. To use these data, one can export and download the data in the Shapefile format. In this post, we rename four files of the Shapefile data as follows,

  • areas.dbf
  • areas.prj
  • areas.shp
  • areas.shx

and place these files at the folder Boundaries_Community_Areas.

Then it is not hard to use the geopandas and matplotlib packages in Python to visualize the boundaries of community areas.

import geopandas as gpd
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (14, 8))
ax = fig.subplots(1)
shape = gpd.read_file("Boundaries _Community_Areas/areas.shp")
shape.plot(cmap = 'RdYlBu_r', ax = ax)
plt.xticks([])
plt.yticks([])
for _, spine in ax.spines.items():
    spine.set_visible(False)
plt.show()
fig.savefig("boundaries_community_areas_chicago.png", bbox_inches = "tight")

Figure 1 shows the boundaries of 77 community areas in the City of Chicago. Note that we can set the cmap as RdYlBu_r or YlOrRd_r.

Figure 1. Boundaries of community areas in the City of Chicago, USA.

Matching Taxi Trips with Community Areas

There are three basic steps to follow for processing taxi trip data:

  • Download taxi trips in 2022 in the .csv format, e.g., Taxi_Trips_-_2022.csv.
  • Use the pandas package in Python to process the raw trip data.
  • Match trip pickup/dropoff locations with boundaries of the community area.
import pandas as pd

data = pd.read_csv('Taxi_Trips_-_2022.csv')
data.head()

For each taxi trip, one can select some important information:

  • Trip Start Timestamp: When the trip started, rounded to the nearest 15 minutes.
  • Trip Seconds: Time of the trip in seconds.
  • Trip Miles: Distance of the trip in miles.
  • Pickup Community Area: The Community Area where the trip began. This column will be blank for locations ourside Chicago.
  • Dropoff Community Area: The Community Area where the trip ended. This column will be blank for locations outside Chicago.
df = pd.DataFrame()
df['Trip Start Timestamp'] = data['Trip Start Timestamp']
df['Trip Seconds'] = data['Trip Seconds']
df['Trip Miles'] = data['Trip Miles']
df['Pickup Community Area'] = data['Pickup Community Area']
df['Dropoff Community Area'] = data['Dropoff Community Area']
del data
df

By doing so, there are 6,382,425 rows in this new dataframe. For the following analysis, one should remove the trips whose pickup/dropoff locations outside Chicago. In addition, one should clean the outliers that are with 0 (trip) seconds or 0 (trip) miles.

df = df.dropna() # Remove rows with NaN
df = df.drop(df[df['Trip Seconds'] == 0].index)
df = df.drop(df[df['Trip Miles'] == 0].index)
df = df.reset_index()
df = df.drop(['index'], axis = 1)
df.to_csv('taxi_trip_2022.csv', index = False)

import numpy as np

print(np.mean(df['Trip Seconds'].values))
print(np.mean(df['Trip Miles'].values))

By doing so, there are 4,763,961 remaining taxi trips in the dataframe. If you want to aggregate the trip counts of each pickup/dropoff community area, the simplest way to get row counts per pickup/dropoff community area is by calling .groupby().size().

pickup_df = df.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
dropoff_df = df.groupby(['Dropoff Community Area']).size().reset_index(name = 'dropoff_counts')

Visualizing Pickup and Dropoff Trips in 2022

It is not hard to first use the geopandas package to merge pickup/dropoff trip counts into the .shp data and then visualize the trip data with the matplotlib package.

import geopandas as gpd
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14

chicago = gpd.read_file("Boundaries _Community_Areas/areas.shp")
pickup_df = df.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
dropoff_df = df.groupby(['Dropoff Community Area']).size().reset_index(name = 'dropoff_counts')
pickup_df['area_numbe'] = pickup_df['Pickup Community Area']
dropoff_df['area_numbe'] = dropoff_df['Dropoff Community Area']
chicago['area_numbe'] = chicago.area_numbe.astype(float)

pickup = chicago.set_index('area_numbe').join(pickup_df.set_index('area_numbe')).reset_index()
dropoff = chicago.set_index('area_numbe').join(dropoff_df.set_index('area_numbe')).reset_index()

fig = plt.figure(figsize = (14, 8))
for i in [1, 2]:
    ax = fig.add_subplot(1, 2, i)
    if i == 1:
        pickup.plot('pickup_counts', cmap = 'YlOrRd', legend = True,
                    legend_kwds = {'shrink': 0.618, 'label': 'Pickup trip count'},
                    vmin = 0, vmax = 1.4e+6, ax = ax)
    elif i == 2:
        dropoff.plot('dropoff_counts', cmap = 'YlOrRd', legend = True, 
                     legend_kwds = {'shrink': 0.618, 'label': 'Dropoff trip count'},
                     vmin = 0, vmax = 1.4e+6, ax = ax)
    plt.xticks([])
    plt.yticks([])
plt.show()
fig.savefig("pickup_dropoff_trips_chicago_2022.png", bbox_inches = "tight")

Figure 2 shows taxi pickup and dropoff trips (2022) on 77 community areas in the City of Chicago. Note that the average trip duration is 1207.75 seconds and the average trip distance is 6.16 miles.

Figure 2. Taxi pickup and dropoff trips (2022) in the City of Chicago, USA. There are 4,763,961 remaining trips after the data processing.

For comparison, Figure 3 shows taxi pickup and dropoff trips (2019) on 77 community areas in the City of Chicago. Note that the average trip duration is 915.62 seconds and the average trip distance is 3.93 miles.

Figure 3. Taxi pickup and dropoff trips (2019) in the City of Chicago, USA. There are 12,484,572 remaining trips after the data processing. See the data processing codes.

In addition, one can analyze the trips of other travel modes in the City of Chicago. Figure 4 shows E-scooter pickup and dropoff trips on 77 community areas in the City of Chicago, see how to process and visualize E-scooter trips. Note that the average trip duration is 913.18 seconds and the average trip distance is 2448.60 meters.

Figure 4. E-scooter pickup and dropoff trips (2022) in the City of Chicago, USA. There are 1,476,028 remaining trips after the data processing.

Taxi Travel Time Changes of Popular Pickup-Dropoff Pairs between 2019 and 2022

As shown in Figure 2 and Figure 3, there are some most popular pickup community areas, e.g., see left panel of Figure 2:

  • Community area 8: 1,261,696 trips
  • Community area 32: 888,724 trips
  • Community area 76: 688,553 trips
  • Community area 28: 448,476 trips

The pickup trips in these four areas are about 69% of all trips in 2022. For comparison, the left panel of Figure 3 shows that these four areas are also the most popular pickup areas. Specifically, we have

  • Community area 8: 4,006,793 trips
  • Community area 32: 3,647,522 trips
  • Community area 28: 1,451,411 trips
  • Community area 76: 1,096,552 trips

It seems to be 81.72% of all trips in 2019.

import pandas as pd

data19 = pd.read_csv('taxi_trip_2019.csv')
data22 = pd.read_csv('taxi_trip_2022.csv')

# Extract the most popular pickup areas
df19 = data19.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
df19 = df19.sort_values(by = ['pickup_counts'], ascending = False)
df22 = data22.groupby(['Pickup Community Area']).size().reset_index(name = 'pickup_counts')
df22 = df22.sort_values(by = ['pickup_counts'], ascending = False)

In what follows, one can choose some pickup-dropoff pairs to analyze taxi travel times.

# Return hour from datetime column
data19['hour'] = pd.to_datetime(data19['Trip Start Timestamp'], 
                                errors = 'coerce').dt.hour
data22['hour'] = pd.to_datetime(data22['Trip Start Timestamp'], 
                                errors = 'coerce').dt.hour

From Area 8 to Area 76

When analyzing taxi travel times and movement speeds, one should remove some outliers (e.g., anomalies in trip seconds/miles).

# From Area 8 to Area 76 in 2019
df1 = data19[(data19['Pickup Community Area'] == 8) & (data19['Dropoff Community Area'] == 76)]
df1 = df1.drop(df1[df1['Trip Seconds'] < 600].index)
df1 = df1.drop(df1[df1['Trip Seconds'] > 7200].index)
df1 = df1.drop(df1[df1['Trip Miles'] < 10].index)
df1 = df1.drop(df1[df1['Trip Miles'] > 25].index)

# From Area 8 to Area 76 in 2022
df2 = data22[(data22['Pickup Community Area'] == 8) & (data22['Dropoff Community Area'] == 76)]
df2 = df2.drop(df2[df2['Trip Seconds'] < 600].index)
df2 = df2.drop(df2[df2['Trip Seconds'] > 7200].index)
df2 = df2.drop(df2[df2['Trip Miles'] < 10].index)
df2 = df2.drop(df2[df2['Trip Miles'] > 25].index)

In what follows, we visualize the average travel time and speed from area 8 (i.e., Downtown) to area 76 (i.e., Airport) in both 2019 and 2022. Figure 5 shows average travel time curves and movement speed curves. It is not hard to see the remarkable reduction of average travel time in 2022 when comparing to 2019. The results of average speed demonstrate that the paths/routes from area 8 to area 76 is less congested in 2022 when comparing to 2019.

Figure 5. Average travel time and speed from area 8 (i.e., Downtown) to area 76 (i.e., Airport) in both 2019 and 2022.
Figure 6. Average travel time and speed from area 32 (i.e., Downtown) to area 76 (i.e., Airport) in both 2019 and 2022.
import numpy as np
import matplotlib.pyplot as plt

fig = plt.figure(figsize = (8, 2.5))
ax = fig.add_subplot(1, 2, 1)
# Average travel time in 2019
m1 = df1.groupby(['hour'])['Trip Seconds'].mean().values / 30
s1 = df1.groupby(['hour'])['Trip Seconds'].std().values / 30
plt.plot(m1, color = 'blue', linewidth = 1.8, label = '2019')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)), 
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower), 
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'blue', alpha = 0.05)

# Average travel time in 2022
m1 = df2.groupby(['hour'])['Trip Seconds'].mean().values / 30
s1 = df2.groupby(['hour'])['Trip Seconds'].std().values / 30
plt.plot(m1, color = 'red', linewidth = 1.8, label = '2022')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)), 
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower), 
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'red', alpha = 0.05)

plt.xticks(np.arange(0, 24 + 1, 2))
plt.xlabel('Time (hour)')
plt.ylabel('Average travel time (min)')
plt.grid(axis = 'both', linestyle='dashed', linewidth = 0.1, color = 'gray')
ax.tick_params(direction = "in")
ax.set_xlim([-1, 24])
plt.legend()

ax = fig.add_subplot(1, 2, 2)
# Average speed in 2019
df1['speed'] = df1['Trip Miles'].values / (df1['Trip Seconds'] / 3600)
m1 = df1.groupby(['hour'])['speed'].mean().values
s1 = df1.groupby(['hour'])['speed'].std().values
plt.plot(m1, color = 'blue', linewidth = 1.8, label = '2019')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)), 
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower), 
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'blue', alpha = 0.05)

# Average speed in 2022
df2['speed'] = df2['Trip Miles'].values / (df2['Trip Seconds'] / 3600)
m1 = df2.groupby(['hour'])['speed'].mean().values
s1 = df2.groupby(['hour'])['speed'].std().values
plt.plot(m1, color = 'red', linewidth = 1.8, label = '2022')
upper = m1 + s1
lower = m1 - s1
x_bound = np.append(np.append(np.append(np.array([0, 0]), np.arange(0, 24)), 
                              np.array([24 - 1, 24 - 1])), np.arange(24 - 1, -1, -1))
y_bound = np.append(np.append(np.append(np.array([upper[0], lower[0]]), lower), 
                              np.array([lower[-1], upper[-1]])), np.flip(upper))
plt.fill(x_bound, y_bound, color = 'red', alpha = 0.05)

plt.xticks(np.arange(0, 24 + 1, 2))
plt.xlabel('Time (hour)')
plt.ylabel('Average speed (mph)')
plt.grid(axis = 'both', linestyle='dashed', linewidth = 0.1, color = 'gray')
ax.tick_params(direction = "in")
ax.set_xlim([-1, 24])
plt.legend()

plt.savefig("travel_time_plus_speed_pickup_8_dropoff_76.png", bbox_inches = "tight")
plt.show()
Data Science
Artificial Intelligence
Social Science
Mobility
Python
Recommended from ReadMedium