This project was completed as part of the Google Data Analytics Capstone project to explore user behaviour and membership conversion strategies for Cyclistic Bike Share.

Executive Summary

This case study analyzes Cyclistic’s bike-share usage data to understand differences between casual riders and annual members to then identify opportunities to convert casual users into long-term members. Using 2019 and 2020 Q1 trip data, this analysis employed R, tidyverse, lubridate, and geospatial tools to clean, process, and visualise user behaviour.

Findings reveal that casual riders take longer, leisure-oriented trips, often near tourist hotspots, whereas members ride more frequently during weekday commuting hours. Casual riders’ top stations were twice as close to Chicago’s main attractions, reinforcing their recreational nature.

Based on these insights, Cyclistic should implement seasonal or short-term membership bundles, introduce a ‘leisure tier’ membership, and dynamically position bikes by time of day and location to better match demand.

ASK

Business Task:

Understand how casual riders differ from members so Cyclistic can design bespoke strategies to convert casual riders into members.

Primary Questions:

1. How do trip durations differ by rider type?
1. How do ride volumes vary by day and hour?
1. Which start stations are most used by each segment?

Stakeholders: Marketing, Operations, Executive Team

Prepare

Data Sources:

Data sourced from public Divvy/Cyclistic trip data (no PII). For Posit’s memory limits, this project uses 2019 Q1 and 2020 Q1 CSVs. Tourism hotspots identified via https://www.touropia.com/tourist-attractions-in-chicago/ & coordinates determined via https://www.google.com/maps/@51.3752576,0.5321047,14z?entry=ttu&g_ep=EgoyMDI1MTAyNi4wIKXMDSoASAFQAw%3D%3D

Tools: R, RStudio, tidyverse, lubridate, skimr, here. janitor, scales, leaflet, leaflet.extras, htmlwidgets, geosphere.

Importation of Data:

library(tidyverse)  
library(lubridate)
library(skimr)
library(here)
library(janitor)
library(scales)
raw19 <- read_csv(here("raw_data","Divvy_Trips_2019_Q1 - Divvy_Trips_2019_Q1.csv"))
raw20 <- read_csv(here("raw_data","Divvy_Trips_2020_Q1 - Divvy_Trips_2020_Q1.csv"))

Structure Check:

skim(raw19)

Data summary
Name	raw19
Number of rows	365069
Number of columns	12
_______________________
Column type frequency:
character	6
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
start_time	0	1.00	18	19	343022
end_time	0	1.00	18	19	338367
from_station_name	0	1.00	10	43	594
to_station_name	0	1.00	10	43	600
usertype	0	1.00	8	10	2
gender	19711	0.95	4	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
trip_id	0	1.00	21960871.66	127175.00	21742443	21848765	21961829	22071823	22178528	▇▇▇▇▇
bikeid	0	1.00	3429.48	1923.32	1	1777	3489	5157	6471	▆▆▆▆▇
tripduration	0	1.00	1016.34	27913.51	61	326	524	866	10628400	▇▁▁▁▁
from_station_id	0	1.00	198.09	153.49	2	76	170	287	665	▇▅▃▁▁
to_station_id	0	1.00	198.58	154.47	2	76	168	287	665	▇▅▃▁▁
birthyear	18023	0.95	1981.67	11.25	1900	1975	1985	1990	2003	▁▁▁▅▇

skim(raw20)

Data summary
Name	raw20
Number of rows	426887
Number of columns	13
_______________________
Column type frequency:
character	7
numeric	6
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
ride_id	0	1	8	16	426887
rideable_type	0	1	11	11	1
started_at	0	1	18	19	399265
ended_at	0	1	18	19	399532
start_station_name	0	1	5	43	607
end_station_name	1	1	5	43	602
member_casual	0	1	6	6	2

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
start_station_id	0	1	209.80	163.22	2.00	77.00	176.00	298.00	675.00	▇▆▃▁▁
end_station_id	1	1	209.34	163.20	2.00	77.00	175.00	297.00	675.00	▇▅▃▁▁
start_lat	0	1	41.90	0.04	41.74	41.88	41.89	41.92	42.06	▁▁▇▂▁
start_lng	0	1	-87.64	0.02	-87.77	-87.66	-87.64	-87.63	-87.55	▁▁▇▇▁
end_lat	1	1	41.90	0.04	41.74	41.88	41.89	41.92	42.06	▁▁▇▂▁
end_lng	1	1	-87.64	0.02	-87.77	-87.66	-87.64	-87.63	-87.55	▁▁▇▇▁

Process:

Cleaning Decisions and Standardisation:

Renamed 2019 columns to align with those of 2020.
Standardised rider labels: subscriber = member, customer = casual.
Ensured ride_id is formatted in character in both datasets.
Ensured both datasets included start station coordinates.
Resolved timestamps and then calculated duration of journeys (minutes).
Gave weekdays Mon-Sun labels.
Removed implausible trips (≤1 min or ≥12 hr) and critical N/A rows.
Combined 2019 & 2020 Tables as 1.

Standardise - Adjusting Column Names:

p19 <- raw19 %>% 
  rename(
    ride_id=trip_id, started_at=start_time, ended_at=end_time,
    start_station_name=from_station_name, start_station_id=from_station_id,
    end_station_name=to_station_name, end_station_id=to_station_id,
    member_casual=usertype
  ) %>% 
  mutate(member_casual = recode(tolower(member_casual),
                                "subscriber"="member", "customer"="casual"))

2020 Already Has Member/Casual but we Normalise Just in Case:

p20 <- raw20 %>% 
  mutate(member_casual = recode(tolower(member_casual),
                                "subscriber"="member", "customer"="casual"))

Formatting ride_id Correctly:

p19 <- p19 %>% 
  mutate(ride_id = as.character(ride_id))
p20 <- p20 %>% 
  mutate(ride_id = as.character(ride_id))

Combining Tables with Common Fields:

p19 <- p19 %>% 
  mutate(start_lng = NA_real_, start_lat = NA_real_)

trips <-bind_rows(
  p19 %>% select(ride_id, started_at, ended_at, start_station_id, end_station_id,
                 start_station_name, end_station_name, member_casual, start_lat, start_lng),
  p20 %>% select(ride_id, started_at, ended_at, start_station_id, end_station_id,
                 start_station_name, end_station_name, member_casual, start_lat, start_lng)
)

Filling in Missing Station Corrdinates from 2019 Dataset:

# Joining Station Coordinates from p20 Dataset onto the Combined Dataset
## Determine Each Station's Coordinates:
station_lookup_id <- p20 %>%
  filter(!is.na(start_station_id), !is.na(start_lat), !is.na(start_lng)) %>%
  group_by(start_station_id) %>%
  summarise(
    start_lat_lookup = mean(as.numeric(start_lat), na.rm = TRUE),
    start_lng_lookup = mean(as.numeric(start_lng), na.rm = TRUE),
    .groups = "drop"
  )
## Join Coordinates onto Combined Dataset:
trips <- trips %>% 
  left_join(station_lookup_id, by = "start_station_id") %>%
  mutate(
    start_lat = dplyr::coalesce(as.numeric(start_lat), start_lat_lookup),
    start_lng = dplyr::coalesce(as.numeric(start_lng), start_lng_lookup)
  ) %>% 
  select(-start_lat_lookup, -start_lng_lookup)

Feature Engineering (Timestamps, Monday-start Lables & Dropping NAs)

trips_clean <- trips %>% 
  mutate(
    started_at = ymd_hms(started_at, quiet = TRUE),
    ended_at = ymd_hms(ended_at, quiet = TRUE),
    ride_length_sec = as.numeric(difftime(ended_at, started_at, units = "secs")),
    ride_length_mins = ride_length_sec/60,
    day_of_week = wday(started_at, week_start = 1, label = TRUE),
    month = month(started_at, label = TRUE),
    hour = hour(started_at)
  ) %>% 
  filter(!is.na(started_at), !is.na(ended_at),
        ride_length_mins > 1, ride_length_mins < 720) %>% # Removes Implausible Trips
  drop_na(member_casual)

Trips shorter than 1 minute or longer than 12 hours were excluded as likely data errors. ### Post-Clean Check:

skim(trips_clean)

Data summary
Name	trips_clean
Number of rows	783363
Number of columns	15
_______________________
Column type frequency:
character	4
factor	2
numeric	7
POSIXct	2
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
ride_id	1	7	16	783363
start_station_name	1	5	43	636
end_station_name	1	5	43	636
member_casual	1	6	6	2

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
day_of_week	0	1	TRUE	7	Tue: 134552, Thu: 131595, Wed: 128869, Fri: 122471
month	0	1	TRUE	3	Mar: 304120, Jan: 245623, Feb: 233620, Apr: 0

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
start_station_id	0	1	202.03	155.84	2.00	77.00	174.00	289.00	675.00	▇▅▃▁▁
end_station_id	0	1	202.00	156.23	2.00	77.00	173.00	289.00	675.00	▇▅▃▁▁
start_lat	40	1	41.90	0.04	41.74	41.88	41.89	41.91	42.06	▁▁▇▂▁
start_lng	40	1	-87.64	0.02	-87.77	-87.65	-87.64	-87.63	-87.55	▁▁▇▇▁
ride_length_sec	0	1	794.32	1165.86	61.00	333.00	541.00	913.00	43002.00	▇▁▁▁▁
ride_length_mins	0	1	13.24	19.43	1.02	5.55	9.02	15.22	716.70	▇▁▁▁▁
hour	0	1	13.29	4.65	0.00	9.00	14.00	17.00	23.00	▁▅▃▇▂

Variable type: POSIXct

skim_variable	n_missing	complete_rate	min	max	median	n_unique
started_at	0	1	2019-01-01 00:04:37	2020-03-31 23:51:34	2020-01-06 19:26:01	734576
ended_at	0	1	2019-01-01 00:11:07	2020-04-01 01:00:13	2020-01-06 19:37:29	730156

Data Has Been Processed

Analyse

This analysis explores differences between member and casual riders in trip duration and temporal/spatial riding patterns across Chicago. Each insight aims to reveal distinct usage motivations between the rider types (commuting/recreation) and hence guide potential membership conversion strategies.

Trip Duration by Customer Type:

# Mean & Max Trip Length by Customer Type
mean_max <- trips_clean %>% 
  group_by(member_casual) %>%
  summarise(
    mean_duration = round(mean(ride_length_mins, na.rm= TRUE), 2),
    max_duration = max(ride_length_mins)
)
# Visualisation of Average Trip Duration by Customer Type
mean_max %>% 
  ggplot(aes(x=member_casual, y=mean_duration)) +
  geom_col(width=0.85, colour = "black", fill = "steelblue") +
  geom_text(data=mean_max, aes(label=paste0(mean_duration, " mins")),
            vjust = -0.6, size=3, show.legend = FALSE) +
  labs(
    title = "Average Trip Duration by Member Type",
    subtitle = "Casual members longer trips suggest recreational usage, Members appear to be commuters",
    x = "Customer",
    y = "Average Trip Duration (mins)"
  ) +
  theme_minimal(base_size = 12)

Figure 1 - Average Trip Duration by Customer Type

Casual riders take significantly longer trips on average. This suggests differing use cases:

Casual Riders are likely using the service for recreational or leisure, where there are few time-constraints.
Members’ shorter trips suggest use-cases revolve around purpose-driven commutes.

This is supported by weekday patterns: member-trip durations remain consistent across the working week but increase on weekends - indicating weekday commuting usage and some recreation on weekends.

# Determining Hourly Usage by Member Type:
hourly_usage <- trips_clean %>% 
  count(member_casual, hour, name = "hour_vol")
# Identifying Peak Usage Hours by Member Type:
peak <- hourly_usage %>% 
  group_by(member_casual) %>% 
  slice_max(hour_vol,n=1)

hour_breaks <- c(0,3,6,9,12,15,18,21)
hour_labels <- c("12am","3am","6am","9am","12pm","3pm","6pm","9pm")
hour_boundaries <- c(0,23,1)

# Visualisation Showing Usage Hours by Member Type
hourly_usage %>% 
  ggplot(aes(x=hour,y=hour_vol, colour = member_casual)) +
  geom_line(linewidth = 1) +
  facet_wrap(~member_casual, nrow=1, scales = "free_y") +
  geom_vline(xintercept = c(8,17), linetype = "dashed", alpha = 0.4) +
  geom_point(data = peak, size = 2) +
  geom_text(data=peak, aes(label=paste0(hour, ":00")),
            vjust = -0.6, size=3, show.legend = FALSE) +
  scale_x_continuous(breaks = hour_breaks, labels = hour_labels, limits = c(0,23)) +
  scale_y_continuous(labels=scales::comma) +
  labs(
    title = "Hourly Rider Demand: Members vs Casuals",
    subtitle = "Typical Commute Hours Marked",
    x = "Hour of Day (24h)",
    y = "Number of Rides",
    color = "Rider Type"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
      legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

Figure 2 - Usage Hours by Customer Type

Hourly Patterns by Customer Type

Analysing usage by hour reveals clear behavioural differences:

Casual customers peak at 3pm, indicating more flexible, recreational usage.

Higher weekend usage volume would support this.

Members show two peaks at 8am and 5pm which align with typical commuting hours.

Weekly Patterns by Customer Type

day_counts <- trips_clean %>% 
  count(member_casual, day_of_week, name = "day_volume")

day_counts %>% 
ggplot(aes(x = day_of_week, y = day_volume)) +
  geom_col(width = 0.85, fill = "steelblue") +
  facet_wrap(~ member_casual, nrow = 1, scales = "free_y") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Ride Volume by Day of Week",
    x = "Day of Week",
    y = "Number of Rides",
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    strip.text = element_text(face = "bold")
  )

Figure 3 - Daily Usage by Customer Type

Examining Usage Volumes Across the Week Strengthens this Contrast:

Members usage is concentrated on weekdays - consistent with commuting.
Casual customers primarily ride on weekends, reflecting leisure or tourist activity.

Spatial Behaviour: Station Hotspots and Tourism Proximity

#Importing Necessary Packages:
library(leaflet.extras)
library(geosphere)
library(htmlwidgets)
library(leaflet)

# Determine Station Usage by Customer Type:
start_station_volumes <- trips_clean %>% 
  filter(!is.na(start_station_name), start_station_name != "") %>% 
  count(member_casual, start_station_name, start_lat, start_lng, name = "station_volumes")

# Identifying Most Popular Stations by Customer Type:
top_stations <- start_station_volumes %>% 
  group_by(member_casual) %>% 
  slice_max(station_volumes, n= 15, with_ties = FALSE) %>% 
  ungroup()

#Creating Colour Palette for the Visualisation
pal <- colorFactor(
  palette = c("goldenrod", "steelblue"),
  domain  = c("casual", "member")
)

#Determining Most Popular Station by Customer Type
peaks <-start_station_volumes %>% 
  filter(!is.na(start_lat), !is.na(start_lng)) %>% 
  group_by(member_casual) %>% 
  slice_max(station_volumes, n=1)

# Adding Tourist Hotspots onto Map
hotspots_raw <- read_csv(here("raw_data","chicago_hotspots.csv")) %>%
  clean_names()

hotspots <- hotspots_raw %>% 
  transmute(
    name = chicago_tourist_hot_spot,
    lat = as.numeric(lat),
    lng = as.numeric(long))

### Calculating Nearest Tourist Hotpot to each Top Station
nearest_hotspot <- top_stations %>%
  select(member_casual, start_station_name, start_lat, start_lng) %>%
  tidyr::crossing(
    hotspots %>% select(hotspot_name = name, lat, lng)
  ) %>%
  mutate(
    dist_m = geosphere::distHaversine(
      cbind(start_lng, start_lat),
      cbind(lng, lat)
    )
  ) %>%
  group_by(member_casual, start_station_name, start_lat, start_lng) %>%
  slice_min(dist_m, n = 1, with_ties = FALSE) %>%
  ungroup()
### Calculating % of Top Stations within Xm of A Tourist Hotspot & Summary Statistics
nearest_by_type <- nearest_hotspot %>%
  group_by(member_casual) %>%
  summarise(
    n_stations      = dplyr::n(),
    mean_dist_m     = mean(dist_m),
    median_dist_m   = median(dist_m),
    pct_within_125m = mean(dist_m <= 125)* 100,
    pct_within_250m = mean(dist_m <= 250) * 100,
    pct_within_500m = mean(dist_m <= 500) * 100,
    pct_within_1km  = mean(dist_m <= 1000) * 100,
    .groups = "drop"
  )
### Making Info Pop-up for Graph Displaying Key Summary Statistics
cas_med <- nearest_by_type %>% 
  filter(member_casual=="casual") %>% 
  pull(median_dist_m) %>% 
  round(2)

mem_med <- nearest_by_type %>% 
  filter(member_casual=="member") %>% 
  pull(median_dist_m) %>% 
  round(2)

cas_250 <- nearest_by_type %>% 
  filter(member_casual=="casual") %>% 
  pull(pct_within_250m) %>% 
  round(2)

mem_250 <- nearest_by_type %>% 
  filter(member_casual=="member") %>% 
  pull(pct_within_250m) %>% 
  round(2)

###Building HTML for the on-map Info Panel:
stats_html <- htmltools::HTML(
  sprintf(
    "<div style='background:white; padding:8px 10px; border-radius: 6px;
    box-shadow:0 1px 6px rgba(0,0,0,.18); font-size:13px;line-height:1.5'>
    <div style='font-weight:700; margin-bottom:4px; color:#333;'>Tourist Hotspot Proximity:</div>
    <div style='margin-bottom:6px;'>
    <span style='color:#555; width:70px;font-weight:600;'>Casual:</span><br/>
    <span style='color#000;'>Median Distance: <b>%s m</b></span><br/>
    <span style='color:#555; '>Within 250 m: <b>%s%%</b></span>
    </div>
    <div style='border-top:1px solid #eee; padding-topL:6px;'>
    <span style='color:#555; width:70px;font-weight:600;'>Members:</span><br/>
    <span style='color#000;'>Median Distance: <b>%s m</b></span><br/>
    <span style='color:#555; '>Within 250 m: <b>%s%%</b></span>
    </div>
    </div>",
    round(cas_med,1), round(cas_250,1), round(mem_med,1), round(mem_250,1)
  )
)

#Creating Interactive Map Visualisation of Spatial Usage by Customer Type with Tourist Hotspots Identified
map <- leaflet(top_stations) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircles(
    data = peaks,
    lng = ~start_lng, lat = ~start_lat,
    radius = 70,
    color = "black",
    weight = 2,
    fillOpacity = 0,
    options = pathOptions(interactive = FALSE),
    group = "Most Popular Station") %>% 
  addCircleMarkers(
    lng = ~start_lng,
    lat = ~start_lat,
    radius = ~scales::rescale(station_volumes, to=c(5,20)),
    color = ~pal(member_casual),
    fillColor = ~pal(member_casual),
    fillOpacity = 0.7,
    stroke =TRUE, weight = 1,
    popup = ~paste0(
      "<b>", htmltools::htmlEscape(start_station_name), "</b><br/>",
      "Customer type: <b>", member_casual, "</b><br/>",
      "Rides: <b>", comma(station_volumes), "</b>"
    ),
    group = "Top Stations" 
    ) %>% 
  addHeatmap(
    data = hotspots,
    lng = ~lng, lat = ~lat,
    radius = 45,
    blur = 10,
    max = 2,
    group = "Tourist Hotspots"
  ) %>% 
  addLayersControl(
    overlayGroups = c("Top Stations", "Most Popular Station", "Tourist Hotspots"),
    options = layersControlOptions(collapsed=FALSE)
  ) %>% 
  addEasyButton(
    easyButton(
      icon = "fa-info", title = "Tourism Summary",
      onClick = JS(
        sprintf(
          "function(btn,map){
          var sz = map.getSize();
          var pt = L.point(120, sz.y - 10);
          var ll = map.containerPointToLatLng(pt)
          var html = `%s`;
          L.popup({
          autoPan: false,
          closeButton: true,
          offset: L.point(10,-10),
          maxWidth: 220,
          className: 'corner-popup'
          })
          .setLatLng(ll)
          .setContent(html)
          .openOn(map);
          }",
          gsub("`","\\\\`", as.character(stats_html))
        )
      )
    )
  ) %>% 
  hideGroup("Tourist Hotspots") %>% 
  addLegend(
    "bottomright",
    pal = pal, values = ~member_casual,
    title = "Customer Type"
  )

 map

Figure 4 - Map Displaying Most Used Stations by Customer Type with Tourism Hotpsots Identified

Mapping each customer type’s most popular stations highlights a strong spatial divide:

Casual customers’ top stations cluster around central Chicago landmarks.
Member’ top stations are more dispersed, towards business districts and residential areas,

These Visual Patterns are Reinforced Statistically: Median distances from top Tourist Hotspots:

Casual Customers: ~250m
Members: ~500m

Share of Stations within 250m of a top Tourist Hotspot:

Casual Customers: 47%
Members: 20%

This indicates casual customers’ behaviour strongly aligns with tourist and leisure zones, while members’ activity reflects daily commuting networks.

Summary and Next Steps:

Overall, the findings suggest a clear segmentation:

Casual Customers = tourists and leisure users
Members = commuters & frequent local users

This distinction can inform direction for marketing and operations teams:

For casuals: promote short-term membership bundles at tourist-heavy locations or offer discounts for weekend explorers.
For members: reinforce commute reliability and convenience, possibly introducing loyalty incentives for weekday use.

Further analysis could examine seasonal patterns or monthly demand shifts to better tailor campaigns across the year

Act:

This section outlines strategic next steps and initial ideas for bespoke marketing initiatives designed to convert casual users into members, while improving system efficiency.

Expand Network Access & Tiered Memberships:

introduce a tiered membership system with a dedicated ‘leisure tier’, granting access to Cyclistic bikes across all the cities of operation.
This would captures the tourist and travelling segment of our customer base, converting short-term uses into repeat, location-independent members.

Offer Temporal Membership Options Aligned with Seasonal Demand

Introduce 3-month or 6-month membership options, designed for seasonal riders.
This strategy should be validated by analysing monthly data trends (not available in this analysis’ Q1-only dataset) to confirm peak periods - likely spring & summer when tourism and leisure activity increase.

Dynamic Bike Distribution for Demand Optimisation

adjust bike placement by time-of-day:
- Morning: concentrate supply in business and residential districts to support commuters.
- Late Morning to Afternoon:: shift supply towards central Chicago landmarks to serve leisure users.
- Evening:: redistribute to business and housing districts to accommodate return commutes.
Weekends: maintain higher availability near tourist landmarks throughout the day to meet predominantly recreational demand.

Cyclistic Case Study

Luca Nasillo

2025-10-24

Executive Summary

ASK

Business Task:

Primary Questions:

Stakeholders: Marketing, Operations, Executive Team

Prepare

Data Sources:

Tools: R, RStudio, tidyverse, lubridate, skimr, here. janitor, scales, leaflet, leaflet.extras, htmlwidgets, geosphere.

Importation of Data:

Structure Check:

Process:

Cleaning Decisions and Standardisation:

Standardise - Adjusting Column Names:

2020 Already Has Member/Casual but we Normalise Just in Case:

Formatting ride_id Correctly:

Combining Tables with Common Fields:

Filling in Missing Station Corrdinates from 2019 Dataset:

Feature Engineering (Timestamps, Monday-start Lables & Dropping NAs)

Data Has Been Processed

Analyse

Trip Duration by Customer Type:

Hourly Patterns by Customer Type

Weekly Patterns by Customer Type

Spatial Behaviour: Station Hotspots and Tourism Proximity

Summary and Next Steps:

Act: