March 2023
To complete the Capstone Course from the Google Data Analytics Professional Certificate program, I was tasked with selecting a case study to complete using the tools and knowledge I garnered from the program. Below is a detailed report of my journey through the 6 phases of the Data Analysis process (Ask, Prepare, Process, Analyze, Share, Act).
Cyclistic (Company): A bike-share program that features more than 5,800 bicycles and 600 docking stations. Cyclistic sets itself apart by also offering reclining bikes, hand tricycles, and cargo bikes, making bike-share more inclusive to people with disabilities and riders who can’t use a standard two-wheeled bike. The majority of riders opt for traditional bikes; about 8% of riders use the assistive options. Cyclistic users are more likely to ride for leisure, but about 30% use them to commute to work each day.
Lily Moreno: The director of marketing and your manager. Moreno is responsible for the development of campaigns and initiatives to promote the bike-share program. These may include email, social media, and other channels.
Cyclistic marketing analytics team: A team of data analysts who are responsible for collecting, analyzing, and reporting data that helps guide Cyclistic marketing strategy. You joined this team six months ago and have been busy learning about Cyclistic’s mission and business goals — as well as how you, as a junior data analyst, can help Cyclistic achieve them.
Cyclistic executive team: The notoriously detail-oriented executive team will decide whether to approve the recommended marketing program.
I'm a junior data analyst working with the marketing analyst team at Cyclistic, a fictitious bike-share company based in Chicago, IL. The director of marketing is confident the company's future success depends on maximizing the number of annual memberships. My team is seeking to understand how casual riders and annual members use Cyclistic bikes differently. From these insights, my team will design a new marketing strategy to convert casual riders into annual members. But first, Cyclistic executives must approve my recommendations, so they must be backed up with compelling data insights and professional data visualizations.
In 2016, Cyclistic launched a successful bike-share offering. Since then, the program has grown to a fleet of 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Until now, Cyclistic’s marketing strategy relied on building general awareness and appealing to broad consumer segments. One approach that helped make these things possible was the flexibility of its pricing plans: single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders. Although the pricing flexibility helps Cyclistic attract more customers, Moreno believes that maximizing the number of annual members will be key to future growth. Rather than creating a marketing campaign that targets all-new customers, Moreno believes there is a very good chance to convert casual riders into members. She notes that casual riders are already aware of the Cyclistic program and have chosen Cyclistic for their mobility needs.
Moreno has set a clear goal: Design marketing strategies aimed at converting casual riders into annual members. In order to do that, however, the marketing analyst team needs to better understand how annual members and casual riders differ, why casual riders would buy a membership, and how digital media could affect their marketing tactics. Moreno and her team are interested in analyzing the Cyclistic historical bike trip data to identify trends.
My team is looking to answer 3 questions:
How do annual members and casual riders use Cyclistic bikes differently?
Why would casual riders buy Cyclistic annual memberships?
How can Cyclistic use digital media to influence casual riders to become members?
Moreno has tasked me with the first question: How do annual members and casual riders use Cyclistic bikes differently?
My objective is to produce a report with the following deliverables:
A clear statement of the business task.
A description of all data sources used.
Documentation of any cleaning or manipulation of data.
A summary of your analysis.
Supporting visualizations & key findings.
Your top 3 recommendations based on your analysis.
Business Task: Use historical data to discover trends in how annual members and casual riders use Cyclistic bikes differently, to increase conversion rates for casual to membership riders. Provide compelling data stories and visualizations to support my top 3 recommendations for this marketing strategy.
I used Cyclistic's historical trip data from January 2022 - December 2022 to analyze and identify trends. Note: Cyclistic is a fictitious company, so for the purposes of the case study, the datasets were made available by Motivate International Inc., under this license. This is public data that was used to explore how different customer types are using Cyclistic bikes. We know this data is reliable, original, comprehensive, current, and cited, as it is provided directly by "Cyclistic" (Divvy).
Note: because data-privacy issues prohibited me from using riders' personally identifiable information, I was not able to connect pass purchases to credit card numbers to determine if casual riders live in the Cyclistic service area, or if they have purchased multiple single passes.
I also used some quick Google searches for general stats used to draw correlations in the data and better understand the ridership stories. Articles used are linked within the document.
I downloaded 12 months worth of data from Jan 2022 - Dec 2022, and stored the files locally on my computer.
The files were too large to open in Excel, but I spot checked them in Notepad++ to get an idea of the contents within.
I attempted to upload the files to BigQuery to run my initial clean up via SQL, but some of the csv files were too large to create tables from. I split the largest files (May - Oct 2022 files) into 2, and uploaded those final 12 files into BigQuery.
BigQuery - SQL (for full repository of SQL queries, see Index > Github):
I first confirmed the column schema matched between all tables, by running a column schema query. I was able to confirm quickly that all names matched, which made it easier to combine all tables into one data set.
Ran a SELECT DISTINCT query to compare the ride_id column for a few tables because upon first glance, the values seemed to be unique. I verified the number of results returned = total values in the table. This confirmed that ride_id is unique, so I decided to drop that column from my final table for clarity. Sample:
SELECT DISTINCT *
FROM 'capstone-project-379202.2022_ride_data.jan2022`
I combined the tables via UNION clauses, ensuring to only join the necessary columns. Sample:
SELECT
member_casual, rideable_type, started_at, ended_at, start_station_name, start_station_id, start_lat, start_lng, end_station_id, end_station_name, end_lat, end_lng
FROM `capstone-project-379202.2022_ride_data.dec2022`
UNION ALL
SELECT member_casual, rideable_type, started_at, ended_at, start_station_name, start_station_id, start_lat, start_lng, end_station_id, end_station_name, end_lat, end_lng
FROM `capstone-project-379202.2022_ride_data.jan2022`
UNION ALL
From there, I filtered by rideable_type to find only docked_bike values. Further information provided by the project roadmap confirmed that docked bikes were taken out of circulation by Cyclistic for maintenance, and therefore were not true member or casual rides. docked_bike values accounted for ~3% of all bike rides (177.5K out of 5.67M total rides). My final table used the following filter query to exclude docked_bike entries. Sample:
SELECT *
FROM `capstone-project-379202.2022_ride_data.combined-data`
WHERE rideable_type = "electric_bike" OR
rideable_type = "classic_bike"
After running the results for the above query, I exported my results to a csv file, so I could have an easier time exploring in R.
I spent far too long trying to upload my dataset to RStudio, only to have it crash every time I tried to run a read_csv() function. After realizing that my dataset was just over 1GB and the free version of RStudio does not support more than 1GB of memory per project, I decided to upgrade, vs. use any less data. Once I completed the upgrade, it was smooth sailing. To see the full RMarkdown of this project, see Index > Kaggle.
I first installed the necessary packages and loaded the required libraries:
install.packages("tidyverse")
install.packages("sp")
install.packages("hms")
library(tidyverse)
library(lubridate)
library(ggplot2)
library(sp)
library(hms)
library(readr)
library(deplyr)
Next, I created a data frame from the BigQuery results csv file.
Finally, prior to running a descriptive analysis and exploring the data further, I added a few columns (date, month, year, day_of_week, ride_length), and converted the ride_length column to numeric so it would be easier to work with:
cyclistic_data$date <- as.Date(cyclistic_data$started_at)
cyclistic_data$month <- format(as.Date(cyclistic_data$date), "%m")
cyclistic_data$day <- format(as.Date(cyclistic_data$date), "%d")
cyclistic_data$year <- format(as.Date(cyclistic_data$date), "%Y")
cyclistic_data$day_of_week <- format(as.Date(cyclistic_data$date), "%A")
cyclistic_data$ride_length <-
difftime(cyclistic_data$ended_at,cyclistic_data$started_at)
is.factor(cyclistic_data$ride_length)
cyclistic_data$ride_length <-
as.numeric(as.character(cyclistic_data$ride_length))
is.numeric(cyclistic_data$ride_length)
After using the View() command, I was ready to proceed with analyzing the data.
I began my descriptive analysis by finding the mean, median, max and min values for ride length:
mean(cyclistic_data$ride_length)
median(cyclistic_data$ride_length)
max(cyclistic_data$ride_length)
min(cyclistic_data$ride_length)
summary(cyclistic_data$ride_length)
After seeing that some ride times were negative in the results, I removed negative ride times from the data set, and created a new dataframe (data_v2):
data_v2 <- cyclistic_data[!(cyclistic_data$ride_length<0),]
I used the View() function to confirm the new df looked right, then re-ran the analysis for mean, median, max, min, and summ.
I next used the aggregate() function to average ride time by each day, re-ordered the days of the week to be in a more logical order, and ran the aggregate() function once more:
aggregate(data_v2$ride_length ~ data_v2$member_casual + data_v2$day_of_week, FUN = mean)
Finally, I reviewed ridership by type and weekday:
data_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),
average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday)
I ran 2 visualizations in R to make some initial observations, but ultimately decided to save my work, and continue visualizing the data in Tableau.
R Viz 1: Ridership by Week Day
data_v2 %>%
mutate(weekday = wday(started_at, label = TRUE)) %>%
group_by(member_casual, weekday) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, weekday) %>%
ggplot(aes(x=weekday,y=number_of_rides, fill=member_casual))+
geom_col(position="dodge")
R Viz 1: Ridership by Member Type & Weekday
R Viz 2: Ridership by Month
data_v2 %>%
group_by(member_casual, month) %>%
summarise(number_of_rides = n(),average_duration = mean(ride_length)) %>%
arrange(member_casual, month) %>%
ggplot(aes(x=month, y=number_of_rides, fill=member_casual))+
geom_col(position="dodge")
R Viz 2: Ridership by Member Type & Month
I chose to complete my visualizations in Tableau for sake of ease. The interface made it easy to draw different data points together to see what relationships, if any, the data presented. This data story can be found via Index > Tableau.
General Stats/Observations
I first noted observations by reviewing the following:
Total Rides by Rideable Type (total rides in 2022)
Total Rides by Member Type (total rides in 2022)
Total Rides by Member Type (by quarter)
Electric rides accounted for just over half (~53%) of bike rides
More member riders than casual riders at ~2/3 (61%) of rides
Higher ride totals in Q2 & Q3, for both member types
Significant differences in ridership types in Q1 and Q4 -
Casual riders made up 24% of rides in Q1, and 32% of rides in Q4
In Q2 and Q3, casual riders made up 41 & 44% of rides, respectively
Chicago tends to have inclement weather in winter, leading into spring, which may be one factor that correlates to lower ridership in Q1 and Q4 (Jan Q1, Feb Q1, Mar Q1 , & December Q4 ). Further analysis needed to confirm this trend YOY.
After looking at the above high-level stats, I made observations based on the below:
Average Ride Length by Member & Rideable Type (by quarter)
Average Ride Length by Member Type (by month)
Ride lengths for both casual and member riders are shortest in November and December.
Member riders average generally short ride times all year long, with only a 3 minute variance from the shortest average ride length (11mins) to the longest (14mins). This is likely due to member rider consistency in riding both classic and electric bikes.
Electric bike trips have a shorter duration
Electric bikes can typically go from 20-28mph, with minimal user assistance. Classic bikes can only go as fast as the user pedals.
Beginner and casual cyclists may average 10-15mph, while more advanced cyclists can average 22mph.
This may explain why member riders average similar ride lengths between bike types, but casual riders tend to average much longer ride lengths on classic bikes than on electric bikes.
This may also explain why casual riders tend to average longer ride lengths, overall (they are inconsistently riding).
Finally, I reviewed ridership data from the following perspectives:
Total Rides by Member Type (by month)
Rides Per Day (by member type)
Average Ride Length by Member type (by day of week)
Casual riders take less rides than members
All rides peak from April - October
Least number of rides in January for both member and casual riders
Highest number of rides in July for casual riders, and August for member riders
What drove these peaks? Weather? Events?
Casual riders take less rides than members on a daily basis.
Casual rides peak during weekends (Fri - Sun), while member rides peak during weekdays (Mon - Fri).
Casual rides take longer on average, every day of the week
Why do ride times dip Tue - Thurs?
Member rides average 12-14 minutes, regardless of the week day.
Where are members going on weekends?
What's the actual percentage of members who use the pass for commuting purposes?
The data tells us who, what, when, and how member and casual riders use Cyclistic bikes differently, but does not detail the why of our ridership story. Combining general knowledge about our Chicago rider base with the stats we have above, we can infer some of these details, and make smart decisions around our marketing. The following takeaways have guided the suggestions I have for our marketing strategy:
Ridership is lowest in Jan, Feb, Mar, and Dec months, with casual riders taking the least number of rides throughout the year.
Classic bike rides take longer than electric bike rides.
Member rides take less time than casual rides on a daily, weekly, monthly, and quarterly average.
Ridership is at its peak in summer months.
Annual membership push for Black Friday & Cyber Monday 2023 : 13-month membership offer (Dec 2023 - Dec 2024).
Use targeted marketing for Cyclistic electric bike usage in winter months - "Get to warm, fast."
Do we have enough electric bikes to meet the potential demand?
"Explore the city, explore your limits" campaign: focus around building consistency as a Cyclistic rider, and exploring the Chicago community. Members average shorter ride times likely due to physical endurance from consistently riding, but may also be better versed in navigation, short cuts, etc. from the same consistency (note, further research could be conducted to validate this theory).
Consider partnering with local businesses for a "Rider Passport," only available with annual membership. If riders visit locations in passport, earn stamps, which turn into rewards.
Exclusive Summer Fun marketing series: promote this series to casual riders, with push for membership ("going to X number of events is the prices of 1 annual pass"). Promote this series to member riders to encourage community building and continue to aid peak ridership frequency.
Member rides happen daily, including weekends, indicating commuters may frequent the locations they work near for leisure (note: further research could be done to validate this theory), or live very close to Cyclistic docking stations.
My data analysis only answered the question of "how" member and causal riders use Cyclistic bikes differently. There is additional research that can be done to better understand why casual riders are not converting to annual memberships at this time, and why member riders keep coming back (or don't!). While our team can start to experiment with new marketing strategies, I would also recommend our team take a Design Thinking approach to solve the greater questions of "why" by first running a Survey for customer empathy. We should try to discover things like:
Why do member riders continue to consistently use Cyclistic in cold months?
Where do casual riders vs. member riders go in winter months?
What percentage of member riders commute to work, and at what frequency? (is it possible our commuter membership has changed, post-COVID?)
Why do repeat casual riders (not explored in this case study due to privacy regulations), choose to not upgrade to annual membership?
Until we gain a little customer empathy, our ridership story is not fully complete, but we should have enough to move forward with some targeted marketing approaches in the interim.
Tableau Dashboard (visualizations)
GitHub: Cyclistic Case Study Repository (SQL queries)
Kaggle: Cyclistic Case Study Repository (R-Markdown & Dataset)
Google Slides: Internal presentation example. Speaker notes included; for best experience, View Slideshow.
Short Report: See the consolidated pdf of this report
Back to Data Projects Showcase