WGU D208 Task 1 (2025): Applying Predictive
Analytics for Decision-Making
College of Information Technology, Western Governors University
D208: Predictive Modeling
Dr. Keiona Middleton
Table of Contents
A: Research Question
1 – Question Identification
2 – Goals & Objectives
B: Method Justification
1 – Four Assumptions of Multiple Linear Regression Methods
2 – Benefits of Using Jupyter Notebook and Python for Analysis
3 – Multiple Linear Regression Analysis Justification
C: Data Preparation
1 – Data Preparation Goals
2 – Statistics Summary
3 – Univariate & Bivariate Statistics
4 – Data Transformation (Data Wrangling)
5 – Data Preparation File
D: Model Analysis
1 – Initial Model
2 – Model Method & Justification
3 – Reduced Model
E: Model Comparison
1 – Initial vs. Reduced Regression Models
2 – Output & Calculations
3 – Copy of Code
F: Data Summary & Implications
1 – Results
2 – Recommendations
G: Demonstration
Panopto Video Presentation
H: Third Party Web-References
I: References
A: Research Question
The “Telecommunications Churn” dataset was used to demonstrate my ability to practice
predictive modeling. Customers in the telecom sector can choose from a variety of
service providers and actively move between them. The percentage of customers who switch to a
different service provider within a specific time frame is called customer churn. According to
WGU, it is 10 times more expensive to acquire a new customer than it is to keep an existing one
(WGU, 2024). The purpose of this data analysis and predictive modeling exercise is to identify
indicators of customer churn, which in turn provide insight into how to
minimize it.
1 – Question Identification
Can linear regression models predict the future bandwidth usage per year of a customer?
2 – Goals & Objectives
To apply appropriate strategies to avoid or mitigate customer churn, the
telecommunications company must first understand its customers and, more specifically,
the implications of customer churn. With that understanding, the company can make
predictions from the customer data it already has. The
objective of this analysis is to provide the telecommunications company with a model
that predicts each customer's bandwidth usage per year from a set of independent variables.
This will help the company identify areas where it can improve the profitability of
its services by either limiting or expanding customers' bandwidth
usage.
B: Method Justification
1 – Four Assumptions of Multiple Linear Regression Methods
Multiple Linear Regression (MLR) will be used for this analysis. For MLR to be effective, four
assumptions must hold:
1. The residuals are normally distributed. Outliers can be removed, but the resulting
loss of information must be weighed before removing any data.
2. There is a linear relationship between the independent and dependent variables. This
can be assessed with a scatter plot.
3. There is no multicollinearity among the independent variables. This means that they
should not be highly correlated with one another. This can be checked with the Variance
Inflation Factor (VIF).
4. Homoscedasticity is present in the data. This means that the variance of the errors
is constant across all values of the independent variables.
All of these assumptions must be shown to hold for the model results to be
considered reliable.
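The collinearity check in assumption 3 can be sketched as follows. This is a minimal illustration that uses only NumPy and synthetic data, not the method or results of this analysis. The function name `vif` and the columns `a`, `b`, and `c` are hypothetical. Each column is regressed on the remaining columns, and its VIF is computed as 1 / (1 - R²).

```python
import numpy as np

def vif(X):
    """Return the VIF of each column of X (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    n_rows, n_cols = X.shape
    results = []
    for j in range(n_cols):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the other columns plus an intercept.
        design = np.column_stack([np.ones(n_rows), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r_squared = 1.0 - ss_res / ss_tot
        results.append(1.0 / (1.0 - r_squared))
    return results

# Synthetic data (assumption): c is almost a copy of a, b is unrelated.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.05 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print([round(v, 2) for v in vifs])
```

A common rule of thumb treats a VIF above roughly 10 as a sign of problematic collinearity. In this synthetic example the near-duplicate columns exceed that threshold, while the unrelated column stays close to 1.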
2 – Benefits of Using Jupyter Notebook and Python for Analysis
Jupyter Notebook, an interactive web-based computing platform, was used with the
Python programming language to identify duplicates, missing values, and outliers in the
“Telecommunications Churn” scenario dataset in D206. It was also used to explore the data in
D207. It is a popular platform for coding because a single notebook can combine
different tools and libraries to create visualizations, calculate statistics, and more. Python’s
simple syntax makes complex tasks easier to carry out. For this exercise, the same
tools will be used to answer the research question. The following packages and
libraries will be used:
• NumPy: provides arrays and mathematical functions
• Pandas: provides data structures for data processing and analysis
• Seaborn: high-level interface for creating statistical graphs
• Matplotlib: used to create static or interactive visualizations
• PyLab: procedural interface to Matplotlib
• SciPy: provides algorithms for scientific computing and statistics
• Sklearn (scikit-learn): library used for predictive data analysis
3 – Multiple Linear Regression Analysis Justification
Multiple Linear Regression (MLR) is an appropriate analysis technique because it measures the
effect of more than one independent variable on a single dependent variable. In this case,
bandwidth usage per year is a continuous variable that can be further examined using independent
variables such as age, tenure, monthly charge, streaming TV, and streaming movies. This will be
useful in determining how the bandwidth usage per year is related to the occurrence of customer
churn.
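The idea of relating one continuous variable to several independent variables can be sketched with an ordinary least squares fit. This is a minimal sketch on synthetic data, assuming NumPy only. The helper `fit_mlr`, the variable names, and the coefficients used to generate the data are invented for illustration and are not results from the churn dataset.

```python
import numpy as np

def fit_mlr(X, y):
    """Fit y = b0 + b1*x1 + ... by least squares; return (coefficients, R^2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    r_squared = 1.0 - (residuals @ residuals) / ((y - y.mean()) ** 2).sum()
    return coef, r_squared

# Synthetic stand-ins (assumption) for two of the independent variables.
rng = np.random.default_rng(42)
n = 1000
tenure = rng.uniform(1, 72, n)
monthly_charge = rng.uniform(80, 300, n)
# Made-up generating equation, used only so the fit has something to recover.
bandwidth = 150.0 + 80.0 * tenure + 2.0 * monthly_charge + rng.normal(0, 50, n)

coef, r_squared = fit_mlr(np.column_stack([tenure, monthly_charge]), bandwidth)
print("intercept, tenure, monthly_charge:", np.round(coef, 2))
print("R^2:", round(r_squared, 4))
```

Each fitted coefficient estimates the change in the dependent variable for a one-unit change in that independent variable while the others are held constant.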
C: Data Preparation
1 – Data Preparation Goals
The goal is to clean the provided dataset by identifying duplicates, missing values, and outliers
and mitigating them if needed. The data will also be wrangled to prepare the categorical
variables for linear regression. This will be done by changing the data types as necessary and
creating nominal gender and churn variables.
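The cleaning goals above can be sketched on a small toy table. This is a hedged illustration and not the code used in this analysis. The column names, the values, the helper `cleaning_report`, and the choice of the 1.5 × IQR rule for flagging outliers are all assumptions made for the example.

```python
import pandas as pd

# Toy data (assumption): one duplicated row, one missing value, one extreme value.
df = pd.DataFrame({
    "Age": [25, 32, 41, 38, 29, 29, 35, 47],
    "Bandwidth": [900.0, 1100.0, 1000.0, None, 950.0, 950.0, 1050.0, 9000.0],
    "Churn": ["No", "Yes", "No", "No", "Yes", "Yes", "No", "No"],
})

def cleaning_report(frame, column):
    """Count duplicate rows, missing values, and IQR outliers (hypothetical helper)."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Missing values compare as False on both sides, so they are not flagged.
    is_outlier = (frame[column] < lower) | (frame[column] > upper)
    return {
        "duplicates": int(frame.duplicated().sum()),
        "missing": int(frame[column].isnull().sum()),
        "outliers": int(is_outlier.sum()),
    }

report = cleaning_report(df, "Bandwidth")
print(report)

# Re-express the Yes/No churn column as 1/0 so it can enter a regression.
churn_numeric = df["Churn"].map({"Yes": 1, "No": 0})
print(churn_numeric.tolist())
```

Whether a flagged value is removed, imputed, or retained is a separate decision that depends on how much information would be lost.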
Step 1: Import Necessary Packages/Libraries
#Import necessary packages
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
#Import visualization packages
import seaborn as sns
import sklearn