This project was completed as part of the Google Data Analytics Capstone project to explore user behaviour and membership conversion strategies for Cyclistic Bike Share.

Executive Summary

This case study analyzes Cyclistic’s bike-share usage data to understand differences between casual riders and annual members to then identify opportunities to convert casual users into long-term members. Using 2019 and 2020 Q1 trip data, this analysis employed R, tidyverse, lubridate, and geospatial tools to clean, process, and visualise user behaviour.

Findings reveal that casual riders take longer, leisure-oriented trips, often near tourist hotspots, whereas members ride more frequently during weekday commuting hours. Casual riders’ top stations were twice as close to Chicago’s main attractions, reinforcing their recreational nature.

Based on these insights, Cyclistic should implement seasonal or short-term membership bundles, introduce a ‘leisure tier’ membership, and dynamically position bikes by time of day and location to better match demand.

ASK

Business Task:

Understand how casual riders differ from members so Cyclistic can design bespoke strategies to convert casual riders into members.

Primary Questions:

    1. How do trip durations differ by rider type?
    1. How do ride volumes vary by day and hour?
    1. Which start stations are most used by each segment?

Stakeholders: Marketing, Operations, Executive Team

Prepare

Data Sources:

Data sourced from public Divvy/Cyclistic trip data (no PII). For Posit’s memory limits, this project uses 2019 Q1 and 2020 Q1 CSVs. Tourism hotspots identified via https://www.touropia.com/tourist-attractions-in-chicago/ & coordinates determined via https://www.google.com/maps/@51.3752576,0.5321047,14z?entry=ttu&g_ep=EgoyMDI1MTAyNi4wIKXMDSoASAFQAw%3D%3D

Tools: R, RStudio, tidyverse, lubridate, skimr, here. janitor, scales, leaflet, leaflet.extras, htmlwidgets, geosphere.

Importation of Data:

library(tidyverse)  
library(lubridate)
library(skimr)
library(here)
library(janitor)
library(scales)
raw19 <- read_csv(here("raw_data","Divvy_Trips_2019_Q1 - Divvy_Trips_2019_Q1.csv"))
raw20 <- read_csv(here("raw_data","Divvy_Trips_2020_Q1 - Divvy_Trips_2020_Q1.csv"))

Structure Check:

skim(raw19)
Data summary
Name raw19
Number of rows 365069
Number of columns 12
_______________________
Column type frequency:
character 6
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
start_time 0 1.00 18 19 0 343022 0
end_time 0 1.00 18 19 0 338367 0
from_station_name 0 1.00 10 43 0 594 0
to_station_name 0 1.00 10 43 0 600 0
usertype 0 1.00 8 10 0 2 0
gender 19711 0.95 4 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
trip_id 0 1.00 21960871.66 127175.00 21742443 21848765 21961829 22071823 22178528 ▇▇▇▇▇
bikeid 0 1.00 3429.48 1923.32 1 1777 3489 5157 6471 ▆▆▆▆▇
tripduration 0 1.00 1016.34 27913.51 61 326 524 866 10628400 ▇▁▁▁▁
from_station_id 0 1.00 198.09 153.49 2 76 170 287 665 ▇▅▃▁▁
to_station_id 0 1.00 198.58 154.47 2 76 168 287 665 ▇▅▃▁▁
birthyear 18023 0.95 1981.67 11.25 1900 1975 1985 1990 2003 ▁▁▁▅▇
skim(raw20)
Data summary
Name raw20
Number of rows 426887
Number of columns 13
_______________________
Column type frequency:
character 7
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 8 16 0 426887 0
rideable_type 0 1 11 11 0 1 0
started_at 0 1 18 19 0 399265 0
ended_at 0 1 18 19 0 399532 0
start_station_name 0 1 5 43 0 607 0
end_station_name 1 1 5 43 0 602 0
member_casual 0 1 6 6 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_station_id 0 1 209.80 163.22 2.00 77.00 176.00 298.00 675.00 ▇▆▃▁▁
end_station_id 1 1 209.34 163.20 2.00 77.00 175.00 297.00 675.00 ▇▅▃▁▁
start_lat 0 1 41.90 0.04 41.74 41.88 41.89 41.92 42.06 ▁▁▇▂▁
start_lng 0 1 -87.64 0.02 -87.77 -87.66 -87.64 -87.63 -87.55 ▁▁▇▇▁
end_lat 1 1 41.90 0.04 41.74 41.88 41.89 41.92 42.06 ▁▁▇▂▁
end_lng 1 1 -87.64 0.02 -87.77 -87.66 -87.64 -87.63 -87.55 ▁▁▇▇▁

Process:

Cleaning Decisions and Standardisation:

  • Renamed 2019 columns to align with those of 2020.
  • Standardised rider labels: subscriber = member, customer = casual.
  • Ensured ride_id is formatted in character in both datasets.
  • Ensured both datasets included start station coordinates.
  • Resolved timestamps and then calculated duration of journeys (minutes).
  • Gave weekdays Mon-Sun labels.
  • Removed implausible trips (≤1 min or ≥12 hr) and critical N/A rows.
  • Combined 2019 & 2020 Tables as 1.

Standardise - Adjusting Column Names:

p19 <- raw19 %>% 
  rename(
    ride_id=trip_id, started_at=start_time, ended_at=end_time,
    start_station_name=from_station_name, start_station_id=from_station_id,
    end_station_name=to_station_name, end_station_id=to_station_id,
    member_casual=usertype
  ) %>% 
  mutate(member_casual = recode(tolower(member_casual),
                                "subscriber"="member", "customer"="casual"))

2020 Already Has Member/Casual but we Normalise Just in Case:

p20 <- raw20 %>% 
  mutate(member_casual = recode(tolower(member_casual),
                                "subscriber"="member", "customer"="casual"))

Formatting ride_id Correctly:

p19 <- p19 %>% 
  mutate(ride_id = as.character(ride_id))
p20 <- p20 %>% 
  mutate(ride_id = as.character(ride_id))

Combining Tables with Common Fields:

p19 <- p19 %>% 
  mutate(start_lng = NA_real_, start_lat = NA_real_)

trips <-bind_rows(
  p19 %>% select(ride_id, started_at, ended_at, start_station_id, end_station_id,
                 start_station_name, end_station_name, member_casual, start_lat, start_lng),
  p20 %>% select(ride_id, started_at, ended_at, start_station_id, end_station_id,
                 start_station_name, end_station_name, member_casual, start_lat, start_lng)
)

Filling in Missing Station Corrdinates from 2019 Dataset:

# Joining Station Coordinates from p20 Dataset onto the Combined Dataset
## Determine Each Station's Coordinates:
station_lookup_id <- p20 %>%
  filter(!is.na(start_station_id), !is.na(start_lat), !is.na(start_lng)) %>%
  group_by(start_station_id) %>%
  summarise(
    start_lat_lookup = mean(as.numeric(start_lat), na.rm = TRUE),
    start_lng_lookup = mean(as.numeric(start_lng), na.rm = TRUE),
    .groups = "drop"
  )
## Join Coordinates onto Combined Dataset:
trips <- trips %>% 
  left_join(station_lookup_id, by = "start_station_id") %>%
  mutate(
    start_lat = dplyr::coalesce(as.numeric(start_lat), start_lat_lookup),
    start_lng = dplyr::coalesce(as.numeric(start_lng), start_lng_lookup)
  ) %>% 
  select(-start_lat_lookup, -start_lng_lookup)

Feature Engineering (Timestamps, Monday-start Lables & Dropping NAs)

trips_clean <- trips %>% 
  mutate(
    started_at = ymd_hms(started_at, quiet = TRUE),
    ended_at = ymd_hms(ended_at, quiet = TRUE),
    ride_length_sec = as.numeric(difftime(ended_at, started_at, units = "secs")),
    ride_length_mins = ride_length_sec/60,
    day_of_week = wday(started_at, week_start = 1, label = TRUE),
    month = month(started_at, label = TRUE),
    hour = hour(started_at)
  ) %>% 
  filter(!is.na(started_at), !is.na(ended_at),
        ride_length_mins > 1, ride_length_mins < 720) %>% # Removes Implausible Trips
  drop_na(member_casual)

Trips shorter than 1 minute or longer than 12 hours were excluded as likely data errors. ### Post-Clean Check:

skim(trips_clean)
Data summary
Name trips_clean
Number of rows 783363
Number of columns 15
_______________________
Column type frequency:
character 4
factor 2
numeric 7
POSIXct 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
ride_id 0 1 7 16 0 783363 0
start_station_name 0 1 5 43 0 636 0
end_station_name 0 1 5 43 0 636 0
member_casual 0 1 6 6 0 2 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
day_of_week 0 1 TRUE 7 Tue: 134552, Thu: 131595, Wed: 128869, Fri: 122471
month 0 1 TRUE 3 Mar: 304120, Jan: 245623, Feb: 233620, Apr: 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
start_station_id 0 1 202.03 155.84 2.00 77.00 174.00 289.00 675.00 ▇▅▃▁▁
end_station_id 0 1 202.00 156.23 2.00 77.00 173.00 289.00 675.00 ▇▅▃▁▁
start_lat 40 1 41.90 0.04 41.74 41.88 41.89 41.91 42.06 ▁▁▇▂▁
start_lng 40 1 -87.64 0.02 -87.77 -87.65 -87.64 -87.63 -87.55 ▁▁▇▇▁
ride_length_sec 0 1 794.32 1165.86 61.00 333.00 541.00 913.00 43002.00 ▇▁▁▁▁
ride_length_mins 0 1 13.24 19.43 1.02 5.55 9.02 15.22 716.70 ▇▁▁▁▁
hour 0 1 13.29 4.65 0.00 9.00 14.00 17.00 23.00 ▁▅▃▇▂

Variable type: POSIXct

skim_variable n_missing complete_rate min max median n_unique
started_at 0 1 2019-01-01 00:04:37 2020-03-31 23:51:34 2020-01-06 19:26:01 734576
ended_at 0 1 2019-01-01 00:11:07 2020-04-01 01:00:13 2020-01-06 19:37:29 730156

Data Has Been Processed

Analyse

This analysis explores differences between member and casual riders in trip duration and temporal/spatial riding patterns across Chicago. Each insight aims to reveal distinct usage motivations between the rider types (commuting/recreation) and hence guide potential membership conversion strategies.

Trip Duration by Customer Type:

# Mean & Max Trip Length by Customer Type
mean_max <- trips_clean %>% 
  group_by(member_casual) %>%
  summarise(
    mean_duration = round(mean(ride_length_mins, na.rm= TRUE), 2),
    max_duration = max(ride_length_mins)
)
# Visualisation of Average Trip Duration by Customer Type
mean_max %>% 
  ggplot(aes(x=member_casual, y=mean_duration)) +
  geom_col(width=0.85, colour = "black", fill = "steelblue") +
  geom_text(data=mean_max, aes(label=paste0(mean_duration, " mins")),
            vjust = -0.6, size=3, show.legend = FALSE) +
  labs(
    title = "Average Trip Duration by Member Type",
    subtitle = "Casual members longer trips suggest recreational usage, Members appear to be commuters",
    x = "Customer",
    y = "Average Trip Duration (mins)"
  ) +
  theme_minimal(base_size = 12)

  • Figure 1 - Average Trip Duration by Customer Type

Casual riders take significantly longer trips on average. This suggests differing use cases:

  • Casual Riders are likely using the service for recreational or leisure, where there are few time-constraints.
  • Members’ shorter trips suggest use-cases revolve around purpose-driven commutes.

This is supported by weekday patterns: member-trip durations remain consistent across the working week but increase on weekends - indicating weekday commuting usage and some recreation on weekends.

# Determining Hourly Usage by Member Type:
hourly_usage <- trips_clean %>% 
  count(member_casual, hour, name = "hour_vol")
# Identifying Peak Usage Hours by Member Type:
peak <- hourly_usage %>% 
  group_by(member_casual) %>% 
  slice_max(hour_vol,n=1)

hour_breaks <- c(0,3,6,9,12,15,18,21)
hour_labels <- c("12am","3am","6am","9am","12pm","3pm","6pm","9pm")
hour_boundaries <- c(0,23,1)

# Visualisation Showing Usage Hours by Member Type
hourly_usage %>% 
  ggplot(aes(x=hour,y=hour_vol, colour = member_casual)) +
  geom_line(linewidth = 1) +
  facet_wrap(~member_casual, nrow=1, scales = "free_y") +
  geom_vline(xintercept = c(8,17), linetype = "dashed", alpha = 0.4) +
  geom_point(data = peak, size = 2) +
  geom_text(data=peak, aes(label=paste0(hour, ":00")),
            vjust = -0.6, size=3, show.legend = FALSE) +
  scale_x_continuous(breaks = hour_breaks, labels = hour_labels, limits = c(0,23)) +
  scale_y_continuous(labels=scales::comma) +
  labs(
    title = "Hourly Rider Demand: Members vs Casuals",
    subtitle = "Typical Commute Hours Marked",
    x = "Hour of Day (24h)",
    y = "Number of Rides",
    color = "Rider Type"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    panel.grid.major.x = element_blank(),
      legend.position = "none",
    axis.text.x = element_text(angle = 45, hjust = 1)
  )

  • Figure 2 - Usage Hours by Customer Type

Hourly Patterns by Customer Type

Analysing usage by hour reveals clear behavioural differences:

  • Casual customers peak at 3pm, indicating more flexible, recreational usage.

Higher weekend usage volume would support this.

  • Members show two peaks at 8am and 5pm which align with typical commuting hours.

Weekly Patterns by Customer Type

day_counts <- trips_clean %>% 
  count(member_casual, day_of_week, name = "day_volume")

day_counts %>% 
ggplot(aes(x = day_of_week, y = day_volume)) +
  geom_col(width = 0.85, fill = "steelblue") +
  facet_wrap(~ member_casual, nrow = 1, scales = "free_y") +
  scale_y_continuous(labels = comma) +
  labs(
    title = "Ride Volume by Day of Week",
    x = "Day of Week",
    y = "Number of Rides",
  ) +
  theme_minimal(base_size = 12) +
  theme(
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    strip.text = element_text(face = "bold")
  )

  • Figure 3 - Daily Usage by Customer Type

Examining Usage Volumes Across the Week Strengthens this Contrast:

  • Members usage is concentrated on weekdays - consistent with commuting.
  • Casual customers primarily ride on weekends, reflecting leisure or tourist activity.

Spatial Behaviour: Station Hotspots and Tourism Proximity

#Importing Necessary Packages:
library(leaflet.extras)
library(geosphere)
library(htmlwidgets)
library(leaflet)

# Determine Station Usage by Customer Type:
start_station_volumes <- trips_clean %>% 
  filter(!is.na(start_station_name), start_station_name != "") %>% 
  count(member_casual, start_station_name, start_lat, start_lng, name = "station_volumes")

# Identifying Most Popular Stations by Customer Type:
top_stations <- start_station_volumes %>% 
  group_by(member_casual) %>% 
  slice_max(station_volumes, n= 15, with_ties = FALSE) %>% 
  ungroup()

#Creating Colour Palette for the Visualisation
pal <- colorFactor(
  palette = c("goldenrod", "steelblue"),
  domain  = c("casual", "member")
)

#Determining Most Popular Station by Customer Type
peaks <-start_station_volumes %>% 
  filter(!is.na(start_lat), !is.na(start_lng)) %>% 
  group_by(member_casual) %>% 
  slice_max(station_volumes, n=1)

# Adding Tourist Hotspots onto Map
hotspots_raw <- read_csv(here("raw_data","chicago_hotspots.csv")) %>%
  clean_names()

hotspots <- hotspots_raw %>% 
  transmute(
    name = chicago_tourist_hot_spot,
    lat = as.numeric(lat),
    lng = as.numeric(long))

### Calculating Nearest Tourist Hotpot to each Top Station
nearest_hotspot <- top_stations %>%
  select(member_casual, start_station_name, start_lat, start_lng) %>%
  tidyr::crossing(
    hotspots %>% select(hotspot_name = name, lat, lng)
  ) %>%
  mutate(
    dist_m = geosphere::distHaversine(
      cbind(start_lng, start_lat),
      cbind(lng, lat)
    )
  ) %>%
  group_by(member_casual, start_station_name, start_lat, start_lng) %>%
  slice_min(dist_m, n = 1, with_ties = FALSE) %>%
  ungroup()
### Calculating % of Top Stations within Xm of A Tourist Hotspot & Summary Statistics
nearest_by_type <- nearest_hotspot %>%
  group_by(member_casual) %>%
  summarise(
    n_stations      = dplyr::n(),
    mean_dist_m     = mean(dist_m),
    median_dist_m   = median(dist_m),
    pct_within_125m = mean(dist_m <= 125)* 100,
    pct_within_250m = mean(dist_m <= 250) * 100,
    pct_within_500m = mean(dist_m <= 500) * 100,
    pct_within_1km  = mean(dist_m <= 1000) * 100,
    .groups = "drop"
  )
### Making Info Pop-up for Graph Displaying Key Summary Statistics
cas_med <- nearest_by_type %>% 
  filter(member_casual=="casual") %>% 
  pull(median_dist_m) %>% 
  round(2)

mem_med <- nearest_by_type %>% 
  filter(member_casual=="member") %>% 
  pull(median_dist_m) %>% 
  round(2)

cas_250 <- nearest_by_type %>% 
  filter(member_casual=="casual") %>% 
  pull(pct_within_250m) %>% 
  round(2)

mem_250 <- nearest_by_type %>% 
  filter(member_casual=="member") %>% 
  pull(pct_within_250m) %>% 
  round(2)

###Building HTML for the on-map Info Panel:
stats_html <- htmltools::HTML(
  sprintf(
    "<div style='background:white; padding:8px 10px; border-radius: 6px;
    box-shadow:0 1px 6px rgba(0,0,0,.18); font-size:13px;line-height:1.5'>
    <div style='font-weight:700; margin-bottom:4px; color:#333;'>Tourist Hotspot Proximity:</div>
    <div style='margin-bottom:6px;'>
    <span style='color:#555; width:70px;font-weight:600;'>Casual:</span><br/>
    <span style='color#000;'>Median Distance: <b>%s m</b></span><br/>
    <span style='color:#555; '>Within 250 m: <b>%s%%</b></span>
    </div>
    <div style='border-top:1px solid #eee; padding-topL:6px;'>
    <span style='color:#555; width:70px;font-weight:600;'>Members:</span><br/>
    <span style='color#000;'>Median Distance: <b>%s m</b></span><br/>
    <span style='color:#555; '>Within 250 m: <b>%s%%</b></span>
    </div>
    </div>",
    round(cas_med,1), round(cas_250,1), round(mem_med,1), round(mem_250,1)
  )
)

#Creating Interactive Map Visualisation of Spatial Usage by Customer Type with Tourist Hotspots Identified
map <- leaflet(top_stations) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>% 
  addCircles(
    data = peaks,
    lng = ~start_lng, lat = ~start_lat,
    radius = 70,
    color = "black",
    weight = 2,
    fillOpacity = 0,
    options = pathOptions(interactive = FALSE),
    group = "Most Popular Station") %>% 
  addCircleMarkers(
    lng = ~start_lng,
    lat = ~start_lat,
    radius = ~scales::rescale(station_volumes, to=c(5,20)),
    color = ~pal(member_casual),
    fillColor = ~pal(member_casual),
    fillOpacity = 0.7,
    stroke =TRUE, weight = 1,
    popup = ~paste0(
      "<b>", htmltools::htmlEscape(start_station_name), "</b><br/>",
      "Customer type: <b>", member_casual, "</b><br/>",
      "Rides: <b>", comma(station_volumes), "</b>"
    ),
    group = "Top Stations" 
    ) %>% 
  addHeatmap(
    data = hotspots,
    lng = ~lng, lat = ~lat,
    radius = 45,
    blur = 10,
    max = 2,
    group = "Tourist Hotspots"
  ) %>% 
  addLayersControl(
    overlayGroups = c("Top Stations", "Most Popular Station", "Tourist Hotspots"),
    options = layersControlOptions(collapsed=FALSE)
  ) %>% 
  addEasyButton(
    easyButton(
      icon = "fa-info", title = "Tourism Summary",
      onClick = JS(
        sprintf(
          "function(btn,map){
          var sz = map.getSize();
          var pt = L.point(120, sz.y - 10);
          var ll = map.containerPointToLatLng(pt)
          var html = `%s`;
          L.popup({
          autoPan: false,
          closeButton: true,
          offset: L.point(10,-10),
          maxWidth: 220,
          className: 'corner-popup'
          })
          .setLatLng(ll)
          .setContent(html)
          .openOn(map);
          }",
          gsub("`","\\\\`", as.character(stats_html))
        )
      )
    )
  ) %>% 
  hideGroup("Tourist Hotspots") %>% 
  addLegend(
    "bottomright",
    pal = pal, values = ~member_casual,
    title = "Customer Type"
  )

 map
  • Figure 4 - Map Displaying Most Used Stations by Customer Type with Tourism Hotpsots Identified

Mapping each customer type’s most popular stations highlights a strong spatial divide:

  • Casual customers’ top stations cluster around central Chicago landmarks.
  • Member’ top stations are more dispersed, towards business districts and residential areas,

These Visual Patterns are Reinforced Statistically: Median distances from top Tourist Hotspots:

  • Casual Customers: ~250m
  • Members: ~500m

Share of Stations within 250m of a top Tourist Hotspot:

  • Casual Customers: 47%
  • Members: 20%

This indicates casual customers’ behaviour strongly aligns with tourist and leisure zones, while members’ activity reflects daily commuting networks.

Summary and Next Steps:

Overall, the findings suggest a clear segmentation:

  • Casual Customers = tourists and leisure users
  • Members = commuters & frequent local users

This distinction can inform direction for marketing and operations teams:

  • For casuals: promote short-term membership bundles at tourist-heavy locations or offer discounts for weekend explorers.
  • For members: reinforce commute reliability and convenience, possibly introducing loyalty incentives for weekday use.

Further analysis could examine seasonal patterns or monthly demand shifts to better tailor campaigns across the year

Act:

This section outlines strategic next steps and initial ideas for bespoke marketing initiatives designed to convert casual users into members, while improving system efficiency.

  1. Expand Network Access & Tiered Memberships:
  1. Offer Temporal Membership Options Aligned with Seasonal Demand
  1. Dynamic Bike Distribution for Demand Optimisation