Customer Churn Prediction with PySpark on AWS

Project Overview

Predicting churn is a common and challenging problem in any customer-facing business. While bringing in new members is critical, retaining existing members is an equally important problem that data scientists and data analysts regularly encounter.

This analysis is the capstone project of Udacity's Data Science Nanodegree Program and aims to predict member churn using Sparkify subscriber activity data. Sparkify is a fictitious music streaming service, similar to Spotify or Apple Music, with both free-tier and paid subscribers.

The full dataset, stored on Amazon S3, is 12GB. I therefore explored the data patterns of a smaller subset (128MB) first, and then ran the full analysis, including data cleaning and modeling, on a Spark cluster in the cloud using AWS Elastic MapReduce (EMR).

File Description

Sparkify_Churn_Prediction.ipynb: This notebook contains the analysis of the smaller subset (128MB), done before deployment.

Sparkify_Churn_Prediction_Deploy.ipynb: This notebook contains the analysis of the full dataset (12GB), run on Spark clusters deployed with AWS EMR.

run_jupyter.sh: This file creates a Docker image which will have an isolated environment to run the Spark application.

images folder: All charts or screenshots in the analysis.

Analysis Steps

Step 1. Load and Clean Dataset

 a. The smaller subset (128MB)
 b. The full dataset (12GB)

Step 2. Exploratory Data Analysis

 a. Define Churn
 b. Use Spark SQL and PySpark to explore data patterns
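
As an illustration of step 2a, the sketch below labels a user as churned once a cancellation event appears in their activity log. This README does not state the exact churn definition, so the `Cancellation Confirmation` page name and the record fields (`userId`, `page`) are assumptions, and plain Python stands in for the Spark SQL / PySpark queries used in the notebooks.

```python
# Hypothetical event log; field names and page values are assumptions.
events = [
    {"userId": "1", "page": "NextSong"},
    {"userId": "1", "page": "Thumbs Up"},
    {"userId": "2", "page": "NextSong"},
    {"userId": "2", "page": "Cancellation Confirmation"},
    {"userId": "3", "page": "Roll Advert"},
]

CHURN_PAGE = "Cancellation Confirmation"  # assumed churn marker


def label_churn(events, churn_page=CHURN_PAGE):
    """Return {userId: 1 if the user ever reached churn_page, else 0}."""
    labels = {}
    for event in events:
        user = event["userId"]
        hit = int(event["page"] == churn_page)
        # Once a user is flagged as churned, the flag stays set.
        labels[user] = max(labels.get(user, 0), hit)
    return labels


churn_labels = label_churn(events)
print(churn_labels)
```

The label is computed per user rather than per event, because the prediction target in this project is the member, not the individual page action.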

Step 3. Feature Engineering

 a. Extract 18 Potential Features:
    - Female: Member gender, female (1) or male (0)
    - Main State Ratio: State with highest frequency / Count of total states
    - Distinct Locations: Count of distinct locations
    - Unique Page Actions: Count of distinct page actions
    - Page Action Ratios:
          - tbup_ratio: Count of Thumbs Up events / Count of total pages
          - tbdw_ratio: Count of Thumbs Down events / Count of total pages
          - addfriend_ratio: Count of Add Friend events / Count of total pages
          - ad_ratio: Count of Roll Advert events / Count of total pages
          - up_ratio: Count of Upgrade events / Count of total pages
          - help_ratio: Count of Help events / Count of total pages
          - error_ratio: Count of Error events / Count of total pages
    - Length of enrollment: max date - min date + 1
    - Active Days: Count of distinct dates
    - Active Ratio: Active days / Length of enrollment
    - Distinct Artists per Active Day: Count of distinct artists / active days  
    - Distinct Songs per Active Day: Count of distinct songs / active days  
    - Avg Songs per Active Day: Count of total songs / active days   
    - Avg Items per Session: Count of total items / Count of distinct sessions

 b. Features Visualization
 c. Drop features with only one unique value or with high correlations
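
To make a few of the feature formulas in step 3a concrete, here is a minimal sketch that computes them for one user in plain Python. The sample events are made up, and the helper `user_features` and the keys `active_days` and `active_ratio` are hypothetical names; only `upage_ct`, `enroll_days`, `tbup_ratio` and `ad_ratio` appear in this README. The notebooks compute these features with PySpark, so this is an illustration of the arithmetic, not the project's implementation.

```python
from collections import Counter
from datetime import date

# Hypothetical activity log for one user: (activity date, page) pairs.
user_events = [
    (date(2018, 10, 1), "NextSong"),
    (date(2018, 10, 1), "Thumbs Up"),
    (date(2018, 10, 3), "NextSong"),
    (date(2018, 10, 3), "Roll Advert"),
    (date(2018, 10, 10), "NextSong"),
]


def user_features(events):
    """Compute a handful of the listed features for a single user."""
    pages = Counter(page for _, page in events)
    total_pages = sum(pages.values())
    dates = {day for day, _ in events}

    enroll_days = (max(dates) - min(dates)).days + 1  # max date - min date + 1
    active_days = len(dates)                          # count of distinct dates

    return {
        "upage_ct": len(pages),                          # distinct page actions
        "tbup_ratio": pages["Thumbs Up"] / total_pages,  # Thumbs Up / total pages
        "ad_ratio": pages["Roll Advert"] / total_pages,  # Roll Advert / total pages
        "enroll_days": enroll_days,
        "active_days": active_days,
        "active_ratio": active_days / enroll_days,
    }


features = user_features(user_events)
print(features)
```

The `+ 1` in the enrollment length counts both the first and the last day, so a user seen on a single day has an enrollment of one day and the active ratio never divides by zero.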

Step 4. Modeling and Evaluation

 a. Split the full dataset into train (including validation) and test sets
 b. Build machine learning pipelines and evaluate the performance
    Supervised Learning Algorithms:
    - Logistic Regression
    - Random Forest
    - Gradient-Boosted Tree
    - Linear Support Vector Machine
    
    Performance Metrics:
    - F1 Score
    - Accuracy
    
 c. Implement three methods to address class imbalance
    - Over-sampling
    - Under-sampling
    - Balancing class weights
    
 d. Conduct 3-fold cross-validation and parameter tuning to improve model performance
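
The three imbalance-handling methods in step 4c can be sketched in plain Python as follows. The 20/80 label split, the helper names and the inverse-frequency weighting formula are assumptions chosen for illustration; the README does not say which weighting scheme or sampling ratios the notebooks use.

```python
import random

# Hypothetical labels: 1 = churned, 0 = retained (imbalanced on purpose).
labels = [1] * 20 + [0] * 80


def class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    n = len(labels)
    classes = sorted(set(labels))
    return {c: n / (len(classes) * labels.count(c)) for c in classes}


def split_by_size(labels):
    """Return (minority_indices, majority_indices) for binary labels."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return tuple(sorted([pos, neg], key=len))


def under_sample(labels, seed=42):
    """Randomly drop majority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = split_by_size(labels)
    return sorted(minority + rng.sample(majority, len(minority)))


def over_sample(labels, seed=42):
    """Randomly repeat minority rows until both classes have equal size."""
    rng = random.Random(seed)
    minority, majority = split_by_size(labels)
    extra = rng.choices(minority, k=len(majority) - len(minority))
    return sorted(minority + extra + majority)


weights = class_weights(labels)
under_idx = under_sample(labels)
over_idx = over_sample(labels)
print(weights, len(under_idx), len(over_idx))
```

Resampling is usually applied to the training split only, so that the test set keeps the original class ratio and the reported metrics reflect real-world conditions.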

Results

The Linear Support Vector Machine model has the best performance for churn prediction on the smaller subset (128MB), with an F1 score of 0.933 and an accuracy of 96.7%. On the full dataset (12GB), the tuned Gradient-Boosted Tree (GBT) model performs best, with an F1 score of 0.934 and an accuracy of 97.0%.

The GBT model also has high recall and precision. The model finds 91.2% of true churners (recall), and 95.7% of the users it predicts as churners are true churners (precision).
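
As a quick consistency check, the F1 score is the harmonic mean of precision and recall. The short calculation below plugs in the two GBT rates quoted above; it assumes the reported F1 score refers to the churn class, which the README does not state explicitly.

```python
precision = 0.957  # share of predicted churners who truly churned
recall = 0.912     # share of true churners the model found

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.934
```

Under that assumption, the result matches the F1 score of 0.934 reported for the tuned GBT model on the full dataset.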

Unique Page Actions (upage_ct) and Length of enrollment (enroll_days) are the two most important features.

The analysis and results are presented in more detail in the Medium post here.

Data Source, Acknowledgements

The simulated small subset (128MB) and the full dataset (12GB) used in this project are both provided by Udacity. Thanks to Udacity for providing user activity data that closely resembles real-world data from music streaming platforms.

Full dataset stored on Amazon S3: "s3n://udacity-dsnd/sparkify/sparkify_event_data.json"
