Primary Questions:
- How do trip durations differ by rider type?
- How do ride volumes vary by day and hour?
- Which start stations are most used by each segment?
This project was completed as part of the Google Data Analytics Capstone project to explore user behaviour and membership conversion strategies for Cyclistic Bike Share.
This case study analyzes Cyclistic’s bike-share usage data to understand differences between casual riders and annual members to then identify opportunities to convert casual users into long-term members. Using 2019 and 2020 Q1 trip data, this analysis employed R, tidyverse, lubridate, and geospatial tools to clean, process, and visualise user behaviour.
Findings reveal that casual riders take longer, leisure-oriented trips, often near tourist hotspots, whereas members ride more frequently during weekday commuting hours. Casual riders’ top stations were twice as close to Chicago’s main attractions, reinforcing their recreational nature.
Based on these insights, Cyclistic should implement seasonal or short-term membership bundles, introduce a ‘leisure tier’ membership, and dynamically position bikes by time of day and location to better match demand.
Understand how casual riders differ from members so Cyclistic can design bespoke strategies to convert casual riders into members.
Data sourced from public Divvy/Cyclistic trip data (no PII). For Posit’s memory limits, this project uses 2019 Q1 and 2020 Q1 CSVs. Tourism hotspots identified via https://www.touropia.com/tourist-attractions-in-chicago/ & coordinates determined via https://www.google.com/maps/@51.3752576,0.5321047,14z?entry=ttu&g_ep=EgoyMDI1MTAyNi4wIKXMDSoASAFQAw%3D%3D
library(tidyverse)
library(lubridate)
library(skimr)
library(here)
library(janitor)
library(scales)
raw19 <- read_csv(here("raw_data","Divvy_Trips_2019_Q1 - Divvy_Trips_2019_Q1.csv"))
raw20 <- read_csv(here("raw_data","Divvy_Trips_2020_Q1 - Divvy_Trips_2020_Q1.csv"))
skim(raw19)
| Name | raw19 |
| Number of rows | 365069 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| character | 6 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| start_time | 0 | 1.00 | 18 | 19 | 0 | 343022 | 0 |
| end_time | 0 | 1.00 | 18 | 19 | 0 | 338367 | 0 |
| from_station_name | 0 | 1.00 | 10 | 43 | 0 | 594 | 0 |
| to_station_name | 0 | 1.00 | 10 | 43 | 0 | 600 | 0 |
| usertype | 0 | 1.00 | 8 | 10 | 0 | 2 | 0 |
| gender | 19711 | 0.95 | 4 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| trip_id | 0 | 1.00 | 21960871.66 | 127175.00 | 21742443 | 21848765 | 21961829 | 22071823 | 22178528 | ▇▇▇▇▇ |
| bikeid | 0 | 1.00 | 3429.48 | 1923.32 | 1 | 1777 | 3489 | 5157 | 6471 | ▆▆▆▆▇ |
| tripduration | 0 | 1.00 | 1016.34 | 27913.51 | 61 | 326 | 524 | 866 | 10628400 | ▇▁▁▁▁ |
| from_station_id | 0 | 1.00 | 198.09 | 153.49 | 2 | 76 | 170 | 287 | 665 | ▇▅▃▁▁ |
| to_station_id | 0 | 1.00 | 198.58 | 154.47 | 2 | 76 | 168 | 287 | 665 | ▇▅▃▁▁ |
| birthyear | 18023 | 0.95 | 1981.67 | 11.25 | 1900 | 1975 | 1985 | 1990 | 2003 | ▁▁▁▅▇ |
skim(raw20)
| Name | raw20 |
| Number of rows | 426887 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| character | 7 |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 8 | 16 | 0 | 426887 | 0 |
| rideable_type | 0 | 1 | 11 | 11 | 0 | 1 | 0 |
| started_at | 0 | 1 | 18 | 19 | 0 | 399265 | 0 |
| ended_at | 0 | 1 | 18 | 19 | 0 | 399532 | 0 |
| start_station_name | 0 | 1 | 5 | 43 | 0 | 607 | 0 |
| end_station_name | 1 | 1 | 5 | 43 | 0 | 602 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_station_id | 0 | 1 | 209.80 | 163.22 | 2.00 | 77.00 | 176.00 | 298.00 | 675.00 | ▇▆▃▁▁ |
| end_station_id | 1 | 1 | 209.34 | 163.20 | 2.00 | 77.00 | 175.00 | 297.00 | 675.00 | ▇▅▃▁▁ |
| start_lat | 0 | 1 | 41.90 | 0.04 | 41.74 | 41.88 | 41.89 | 41.92 | 42.06 | ▁▁▇▂▁ |
| start_lng | 0 | 1 | -87.64 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.55 | ▁▁▇▇▁ |
| end_lat | 1 | 1 | 41.90 | 0.04 | 41.74 | 41.88 | 41.89 | 41.92 | 42.06 | ▁▁▇▂▁ |
| end_lng | 1 | 1 | -87.64 | 0.02 | -87.77 | -87.66 | -87.64 | -87.63 | -87.55 | ▁▁▇▇▁ |
p19 <- raw19 %>%
rename(
ride_id=trip_id, started_at=start_time, ended_at=end_time,
start_station_name=from_station_name, start_station_id=from_station_id,
end_station_name=to_station_name, end_station_id=to_station_id,
member_casual=usertype
) %>%
mutate(member_casual = recode(tolower(member_casual),
"subscriber"="member", "customer"="casual"))
p20 <- raw20 %>%
mutate(member_casual = recode(tolower(member_casual),
"subscriber"="member", "customer"="casual"))
p19 <- p19 %>%
mutate(ride_id = as.character(ride_id))
p20 <- p20 %>%
mutate(ride_id = as.character(ride_id))
p19 <- p19 %>%
mutate(start_lng = NA_real_, start_lat = NA_real_)
trips <-bind_rows(
p19 %>% select(ride_id, started_at, ended_at, start_station_id, end_station_id,
start_station_name, end_station_name, member_casual, start_lat, start_lng),
p20 %>% select(ride_id, started_at, ended_at, start_station_id, end_station_id,
start_station_name, end_station_name, member_casual, start_lat, start_lng)
)
# Joining Station Coordinates from p20 Dataset onto the Combined Dataset
## Determine Each Station's Coordinates:
station_lookup_id <- p20 %>%
filter(!is.na(start_station_id), !is.na(start_lat), !is.na(start_lng)) %>%
group_by(start_station_id) %>%
summarise(
start_lat_lookup = mean(as.numeric(start_lat), na.rm = TRUE),
start_lng_lookup = mean(as.numeric(start_lng), na.rm = TRUE),
.groups = "drop"
)
## Join Coordinates onto Combined Dataset:
trips <- trips %>%
left_join(station_lookup_id, by = "start_station_id") %>%
mutate(
start_lat = dplyr::coalesce(as.numeric(start_lat), start_lat_lookup),
start_lng = dplyr::coalesce(as.numeric(start_lng), start_lng_lookup)
) %>%
select(-start_lat_lookup, -start_lng_lookup)
trips_clean <- trips %>%
mutate(
started_at = ymd_hms(started_at, quiet = TRUE),
ended_at = ymd_hms(ended_at, quiet = TRUE),
ride_length_sec = as.numeric(difftime(ended_at, started_at, units = "secs")),
ride_length_mins = ride_length_sec/60,
day_of_week = wday(started_at, week_start = 1, label = TRUE),
month = month(started_at, label = TRUE),
hour = hour(started_at)
) %>%
filter(!is.na(started_at), !is.na(ended_at),
ride_length_mins > 1, ride_length_mins < 720) %>% # Removes Implausible Trips
drop_na(member_casual)
Trips shorter than 1 minute or longer than 12 hours were excluded as likely data errors. ### Post-Clean Check:
skim(trips_clean)
| Name | trips_clean |
| Number of rows | 783363 |
| Number of columns | 15 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| factor | 2 |
| numeric | 7 |
| POSIXct | 2 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| ride_id | 0 | 1 | 7 | 16 | 0 | 783363 | 0 |
| start_station_name | 0 | 1 | 5 | 43 | 0 | 636 | 0 |
| end_station_name | 0 | 1 | 5 | 43 | 0 | 636 | 0 |
| member_casual | 0 | 1 | 6 | 6 | 0 | 2 | 0 |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| day_of_week | 0 | 1 | TRUE | 7 | Tue: 134552, Thu: 131595, Wed: 128869, Fri: 122471 |
| month | 0 | 1 | TRUE | 3 | Mar: 304120, Jan: 245623, Feb: 233620, Apr: 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| start_station_id | 0 | 1 | 202.03 | 155.84 | 2.00 | 77.00 | 174.00 | 289.00 | 675.00 | ▇▅▃▁▁ |
| end_station_id | 0 | 1 | 202.00 | 156.23 | 2.00 | 77.00 | 173.00 | 289.00 | 675.00 | ▇▅▃▁▁ |
| start_lat | 40 | 1 | 41.90 | 0.04 | 41.74 | 41.88 | 41.89 | 41.91 | 42.06 | ▁▁▇▂▁ |
| start_lng | 40 | 1 | -87.64 | 0.02 | -87.77 | -87.65 | -87.64 | -87.63 | -87.55 | ▁▁▇▇▁ |
| ride_length_sec | 0 | 1 | 794.32 | 1165.86 | 61.00 | 333.00 | 541.00 | 913.00 | 43002.00 | ▇▁▁▁▁ |
| ride_length_mins | 0 | 1 | 13.24 | 19.43 | 1.02 | 5.55 | 9.02 | 15.22 | 716.70 | ▇▁▁▁▁ |
| hour | 0 | 1 | 13.29 | 4.65 | 0.00 | 9.00 | 14.00 | 17.00 | 23.00 | ▁▅▃▇▂ |
Variable type: POSIXct
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| started_at | 0 | 1 | 2019-01-01 00:04:37 | 2020-03-31 23:51:34 | 2020-01-06 19:26:01 | 734576 |
| ended_at | 0 | 1 | 2019-01-01 00:11:07 | 2020-04-01 01:00:13 | 2020-01-06 19:37:29 | 730156 |
This analysis explores differences between member and casual riders in trip duration and temporal/spatial riding patterns across Chicago. Each insight aims to reveal distinct usage motivations between the rider types (commuting/recreation) and hence guide potential membership conversion strategies.
# Mean & Max Trip Length by Customer Type
mean_max <- trips_clean %>%
group_by(member_casual) %>%
summarise(
mean_duration = round(mean(ride_length_mins, na.rm= TRUE), 2),
max_duration = max(ride_length_mins)
)
# Visualisation of Average Trip Duration by Customer Type
mean_max %>%
ggplot(aes(x=member_casual, y=mean_duration)) +
geom_col(width=0.85, colour = "black", fill = "steelblue") +
geom_text(data=mean_max, aes(label=paste0(mean_duration, " mins")),
vjust = -0.6, size=3, show.legend = FALSE) +
labs(
title = "Average Trip Duration by Member Type",
subtitle = "Casual members longer trips suggest recreational usage, Members appear to be commuters",
x = "Customer",
y = "Average Trip Duration (mins)"
) +
theme_minimal(base_size = 12)
Casual riders take significantly longer trips on average. This suggests differing use cases:
This is supported by weekday patterns: member-trip durations remain consistent across the working week but increase on weekends - indicating weekday commuting usage and some recreation on weekends.
# Determining Hourly Usage by Member Type:
hourly_usage <- trips_clean %>%
count(member_casual, hour, name = "hour_vol")
# Identifying Peak Usage Hours by Member Type:
peak <- hourly_usage %>%
group_by(member_casual) %>%
slice_max(hour_vol,n=1)
hour_breaks <- c(0,3,6,9,12,15,18,21)
hour_labels <- c("12am","3am","6am","9am","12pm","3pm","6pm","9pm")
hour_boundaries <- c(0,23,1)
# Visualisation Showing Usage Hours by Member Type
hourly_usage %>%
ggplot(aes(x=hour,y=hour_vol, colour = member_casual)) +
geom_line(linewidth = 1) +
facet_wrap(~member_casual, nrow=1, scales = "free_y") +
geom_vline(xintercept = c(8,17), linetype = "dashed", alpha = 0.4) +
geom_point(data = peak, size = 2) +
geom_text(data=peak, aes(label=paste0(hour, ":00")),
vjust = -0.6, size=3, show.legend = FALSE) +
scale_x_continuous(breaks = hour_breaks, labels = hour_labels, limits = c(0,23)) +
scale_y_continuous(labels=scales::comma) +
labs(
title = "Hourly Rider Demand: Members vs Casuals",
subtitle = "Typical Commute Hours Marked",
x = "Hour of Day (24h)",
y = "Number of Rides",
color = "Rider Type"
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "none",
axis.text.x = element_text(angle = 45, hjust = 1)
)
Analysing usage by hour reveals clear behavioural differences:
Higher weekend usage volume would support this.
day_counts <- trips_clean %>%
count(member_casual, day_of_week, name = "day_volume")
day_counts %>%
ggplot(aes(x = day_of_week, y = day_volume)) +
geom_col(width = 0.85, fill = "steelblue") +
facet_wrap(~ member_casual, nrow = 1, scales = "free_y") +
scale_y_continuous(labels = comma) +
labs(
title = "Ride Volume by Day of Week",
x = "Day of Week",
y = "Number of Rides",
) +
theme_minimal(base_size = 12) +
theme(
panel.grid.minor = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
strip.text = element_text(face = "bold")
)
Examining Usage Volumes Across the Week Strengthens this Contrast:
#Importing Necessary Packages:
library(leaflet.extras)
library(geosphere)
library(htmlwidgets)
library(leaflet)
# Determine Station Usage by Customer Type:
start_station_volumes <- trips_clean %>%
filter(!is.na(start_station_name), start_station_name != "") %>%
count(member_casual, start_station_name, start_lat, start_lng, name = "station_volumes")
# Identifying Most Popular Stations by Customer Type:
top_stations <- start_station_volumes %>%
group_by(member_casual) %>%
slice_max(station_volumes, n= 15, with_ties = FALSE) %>%
ungroup()
#Creating Colour Palette for the Visualisation
pal <- colorFactor(
palette = c("goldenrod", "steelblue"),
domain = c("casual", "member")
)
#Determining Most Popular Station by Customer Type
peaks <-start_station_volumes %>%
filter(!is.na(start_lat), !is.na(start_lng)) %>%
group_by(member_casual) %>%
slice_max(station_volumes, n=1)
# Adding Tourist Hotspots onto Map
hotspots_raw <- read_csv(here("raw_data","chicago_hotspots.csv")) %>%
clean_names()
hotspots <- hotspots_raw %>%
transmute(
name = chicago_tourist_hot_spot,
lat = as.numeric(lat),
lng = as.numeric(long))
### Calculating Nearest Tourist Hotpot to each Top Station
nearest_hotspot <- top_stations %>%
select(member_casual, start_station_name, start_lat, start_lng) %>%
tidyr::crossing(
hotspots %>% select(hotspot_name = name, lat, lng)
) %>%
mutate(
dist_m = geosphere::distHaversine(
cbind(start_lng, start_lat),
cbind(lng, lat)
)
) %>%
group_by(member_casual, start_station_name, start_lat, start_lng) %>%
slice_min(dist_m, n = 1, with_ties = FALSE) %>%
ungroup()
### Calculating % of Top Stations within Xm of A Tourist Hotspot & Summary Statistics
nearest_by_type <- nearest_hotspot %>%
group_by(member_casual) %>%
summarise(
n_stations = dplyr::n(),
mean_dist_m = mean(dist_m),
median_dist_m = median(dist_m),
pct_within_125m = mean(dist_m <= 125)* 100,
pct_within_250m = mean(dist_m <= 250) * 100,
pct_within_500m = mean(dist_m <= 500) * 100,
pct_within_1km = mean(dist_m <= 1000) * 100,
.groups = "drop"
)
### Making Info Pop-up for Graph Displaying Key Summary Statistics
cas_med <- nearest_by_type %>%
filter(member_casual=="casual") %>%
pull(median_dist_m) %>%
round(2)
mem_med <- nearest_by_type %>%
filter(member_casual=="member") %>%
pull(median_dist_m) %>%
round(2)
cas_250 <- nearest_by_type %>%
filter(member_casual=="casual") %>%
pull(pct_within_250m) %>%
round(2)
mem_250 <- nearest_by_type %>%
filter(member_casual=="member") %>%
pull(pct_within_250m) %>%
round(2)
###Building HTML for the on-map Info Panel:
stats_html <- htmltools::HTML(
sprintf(
"<div style='background:white; padding:8px 10px; border-radius: 6px;
box-shadow:0 1px 6px rgba(0,0,0,.18); font-size:13px;line-height:1.5'>
<div style='font-weight:700; margin-bottom:4px; color:#333;'>Tourist Hotspot Proximity:</div>
<div style='margin-bottom:6px;'>
<span style='color:#555; width:70px;font-weight:600;'>Casual:</span><br/>
<span style='color#000;'>Median Distance: <b>%s m</b></span><br/>
<span style='color:#555; '>Within 250 m: <b>%s%%</b></span>
</div>
<div style='border-top:1px solid #eee; padding-topL:6px;'>
<span style='color:#555; width:70px;font-weight:600;'>Members:</span><br/>
<span style='color#000;'>Median Distance: <b>%s m</b></span><br/>
<span style='color:#555; '>Within 250 m: <b>%s%%</b></span>
</div>
</div>",
round(cas_med,1), round(cas_250,1), round(mem_med,1), round(mem_250,1)
)
)
#Creating Interactive Map Visualisation of Spatial Usage by Customer Type with Tourist Hotspots Identified
map <- leaflet(top_stations) %>%
addProviderTiles(providers$CartoDB.Positron) %>%
addCircles(
data = peaks,
lng = ~start_lng, lat = ~start_lat,
radius = 70,
color = "black",
weight = 2,
fillOpacity = 0,
options = pathOptions(interactive = FALSE),
group = "Most Popular Station") %>%
addCircleMarkers(
lng = ~start_lng,
lat = ~start_lat,
radius = ~scales::rescale(station_volumes, to=c(5,20)),
color = ~pal(member_casual),
fillColor = ~pal(member_casual),
fillOpacity = 0.7,
stroke =TRUE, weight = 1,
popup = ~paste0(
"<b>", htmltools::htmlEscape(start_station_name), "</b><br/>",
"Customer type: <b>", member_casual, "</b><br/>",
"Rides: <b>", comma(station_volumes), "</b>"
),
group = "Top Stations"
) %>%
addHeatmap(
data = hotspots,
lng = ~lng, lat = ~lat,
radius = 45,
blur = 10,
max = 2,
group = "Tourist Hotspots"
) %>%
addLayersControl(
overlayGroups = c("Top Stations", "Most Popular Station", "Tourist Hotspots"),
options = layersControlOptions(collapsed=FALSE)
) %>%
addEasyButton(
easyButton(
icon = "fa-info", title = "Tourism Summary",
onClick = JS(
sprintf(
"function(btn,map){
var sz = map.getSize();
var pt = L.point(120, sz.y - 10);
var ll = map.containerPointToLatLng(pt)
var html = `%s`;
L.popup({
autoPan: false,
closeButton: true,
offset: L.point(10,-10),
maxWidth: 220,
className: 'corner-popup'
})
.setLatLng(ll)
.setContent(html)
.openOn(map);
}",
gsub("`","\\\\`", as.character(stats_html))
)
)
)
) %>%
hideGroup("Tourist Hotspots") %>%
addLegend(
"bottomright",
pal = pal, values = ~member_casual,
title = "Customer Type"
)
map
Mapping each customer type’s most popular stations highlights a strong spatial divide:
These Visual Patterns are Reinforced Statistically: Median distances from top Tourist Hotspots:
Share of Stations within 250m of a top Tourist Hotspot:
This indicates casual customers’ behaviour strongly aligns with tourist and leisure zones, while members’ activity reflects daily commuting networks.
Overall, the findings suggest a clear segmentation:
This distinction can inform direction for marketing and operations teams:
Further analysis could examine seasonal patterns or monthly demand shifts to better tailor campaigns across the year
This section outlines strategic next steps and initial ideas for bespoke marketing initiatives designed to convert casual users into members, while improving system efficiency.