Final Submission – Marketing Campaign Customer Segmentation | Unsupervised Learning (B)
Three parts:
Executive Summary
What are the key takeaways? What are the key next steps?
Problem and Solution Summary
What problem was being solved? What are the key points that describe the final proposed solution design? Why is this a ‘valid’ solution that is likely to solve the problem?
Recommendations for Implementation
What are some key recommendations to implement the solutions? What are the key actionables for stakeholders? What is the expected benefit and/or costs? What are the key risks and challenges? What further analysis needs to be done or what other associated problems need to be solved?
Milestone 1 – Marketing Campaign data
9/22/2022
Context
The marketing campaign data consists of various features describing the customers and the sales associated with the 2016 marketing campaign. There is a need to identify which features play a major role in sales with respect to the customers. By exploring the distributions in the data and examining the relationships between the various features, we can gain key insights into how customer-related factors can be improved to increase the sales of the stores.
The objectives
The intended goal is to explore and identify significant relationships between the various variables in the marketing campaign dataset. The data consists of numerical variables such as Income, Recency, and MntFishProducts, and categorical variables such as Education, Marital_Status, and Complain. The customers' age, income, and expenses are compared with family size, education, and marital status to understand which category/section of customers significantly affects sales.
The key questions
▪ Are the income and expenses of the customers related to each other?
▪ Does family size affect the income level of the customers?
▪ Does the distribution of customers vary between different marital status categories?
▪ How is the Amount Per Purchase distributed?
▪ What can be interpreted from the age distribution of customers?
The problem formulation
Data science offers various techniques for exploring and visualizing data. Using these techniques, we can build insights, understand the data more easily, and eventually use the insights gained to make informed future decisions. For the marketing campaign problem, we use data processing, univariate data exploration, bivariate data analysis, and data visualization to understand and interpret the customers and sales more effectively.
Data description
The marketing campaign dataset consists of 2240 observations and 27 features. It includes several important numeric variables representing income, the number of days since the last purchase, and the amount spent on different products, as well as key categorical variables representing the education level and marital status of customers. Of all the variables, only the Income variable had missing values, accounting for approximately 1.07% of observations.
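As a minimal sketch of how this missing-value check could be reproduced (the filename and loading step are illustrative assumptions, not part of the original analysis):

```python
import pandas as pd

# Load the marketing campaign data (filename is illustrative)
df = pd.read_csv("marketing_campaign.csv")

# Percentage of missing values per column; only Income should be non-zero
missing_pct = df.isnull().mean() * 100
print(missing_pct[missing_pct > 0])  # expected: Income ~ 1.07%
```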
Observations and insights
The summary statistics of the marketing campaign data reveal that the average income of a customer is 52,247.25 and that the average year of birth is 1969. The numbers of unique values in Education, Marital_Status, Kidhome, Teenhome, and Complain are 5, 8, 3, 3, and 2, respectively. Since some categories in Education and Marital_Status could be merged into existing categories, the appropriate data processing was performed. From the distribution of the Income variable, we can see that only a few data points lie on the extreme right of the distribution. If these outliers were absent, the distribution of the Income variable would be fairly symmetric.
From the upper-whisker calculation for the Income variable, 8 observations are identified as outliers, and the upper whisker lies beyond the 99.5th percentile of Income. After dropping these outliers, all numeric variables except Income still appear highly right-skewed.
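A sketch of the upper-whisker rule used here, assuming the conventional boxplot definition (Q3 + 1.5 × IQR) and the DataFrame df from the loading step:

```python
# Conventional boxplot upper whisker: Q3 + 1.5 * IQR
q1, q3 = df["Income"].quantile([0.25, 0.75])
upper_whisker = q3 + 1.5 * (q3 - q1)

outliers = df[df["Income"] > upper_whisker]
print(len(outliers))                 # 8 observations flagged as outliers
print(df["Income"].quantile(0.995))  # the whisker lies beyond this value

# Drop the outliers before further analysis
df = df[df["Income"] <= upper_whisker]
```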
The income and expenses of customers have a high positive correlation with each other, i.e., as the income of the customers increases, their expenses also tend to increase at a large rate.
The family size certainly affects the income level of the customers. From the bar plot of family size vs. income, we can clearly see that a family size of 1 has the highest income and a family size of 4 has the lowest income, i.e., as family size increases, customer income tends to decrease.
The highest percentage (38.6%) of customers are married, while only 21.8% of customers are single, with reference to the bar plot of marital status. This indicates that the distribution of customers varies between the different marital status categories.
The distribution of Amount Per Purchase is highly right-skewed. A large proportion of the Amount Per Purchase values lie between $0 and $125, indicating that customers mostly spend less than $125 on each order.
After removing the observations with age greater than 115, we can see that the distribution of the Age variable is symmetric. A large proportion of customers fall within the age bracket of 40 to 45 years.
Potential techniques
Identify different customer segments in the marketing campaign data, considering income, education, marital status, spending, etc., using the K-means clustering technique. The same clustering problem can also be modelled using the K-Medoids algorithm.
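A minimal sketch of the two candidate techniques, assuming a scaled feature matrix X_scaled (K-Medoids here comes from the scikit-learn-extra package; the choice of k is illustrative and would come from the elbow method):

```python
from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids  # from scikit-learn-extra

k = 3  # illustrative; to be selected via the elbow method

# K-means: centroids minimize the within-cluster sum of squared distances
kmeans_labels = KMeans(n_clusters=k, random_state=42).fit_predict(X_scaled)

# K-Medoids: centers are actual observations, making it more robust to outliers
kmedoids_labels = KMedoids(n_clusters=k, random_state=42).fit_predict(X_scaled)
```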
Overall solution design
The overall solution design would consist of visualizations of clusters in the key variables
of the dataset. The summary statistics of the various clusters produced by the algorithms can
be compared and interpreted for further insights.
Measure of success
The measure of success would be each algorithm's ability to form well-separated, interpretable clusters at the optimal number of clusters.
Milestone 2 – Customer Segmentation
9/30/2022
Customer segmentation
Customers often follow purchasing trends or exhibit particular biases when buying their goods. These trends are essential for recognizing the progress and direction an organization should follow. Customers may make these choices knowingly or unknowingly. Customer segmentation is used to create more effective and targeted marketing campaigns. It also helps enhance the effectiveness of an organization as well as the decisions of the customers themselves. Customer segmentation is essentially the grouping of customers according to their characteristics. In addition to customer segmentation, market basket analysis is also carried out. The parameters used to segment customers include recency, frequency, and monetary value.
Recency measures how recently the customer last purchased a product from the store, frequency represents how often the customer visits the store or makes a purchase, and monetary value captures how much the customer spends. After segmentation has been done, the marketing department can target specific customer segments with the right marketing incentives, ensuring effective marketing that leads to sales and profits for the organization. In addition to effective marketing, customer segmentation also assists an organization in product design by revealing what customers want, in carrying out promotions, and in ensuring that customers are satisfied.
In this project, the dataset used has 2,240 rows (data entries) and thirteen attributes (features). The attributes are of integer type, and the dataset has no missing values, making it suitable for the customer segmentation project. In addition, accurate results are obtained since all values are included and a detailed analysis can be conducted effectively. The next step is to find the attributes' correlations using the correlation matrix computed with Python's pandas library. A positive correlation shows that the variables are positively associated and change in the same direction. A negative correlation indicates that the attributes change in opposite directions. A zero or neutral correlation shows that there is no relationship between changes in the attributes. These relationships are essential in the analysis as they indicate the direction of certain decisions implemented by an organization.
I used a seaborn correlation heatmap to visualize the correlations. The results below were obtained, and they present a clear picture of the correlations.
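A sketch of the heatmap code, assuming the dataset is loaded in a pandas DataFrame df:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric attributes, rendered as a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(df.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heatmap of the marketing campaign attributes")
plt.show()
```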
A correlation of positive one shows a perfect positive relationship between the variables, while a correlation of negative one shows a perfect negative association. It is important to determine these correlations as they indicate customer behavior and the decisions the organization should implement. On close observation, it is noted that some attributes are skewed or have large values while others have small values, necessitating feature scaling so that they fall in the same range. Feature scaling is a statistical technique for bringing the features into a similar range. It is especially relevant in unsupervised machine learning, such as when applying the K-means clustering algorithm, which computes distance metrics that features on larger scales would bias if scaling were not done. Feature scaling enables the machine learning model to produce more accurate results by ensuring that no feature dominates the others in the dataset.
In this project, a standard scaler is used to bring the data to a normalized range. The standard scaler brings all the features down to a common scale while maintaining the differences in the range of the values; each feature is scaled and centered independently, so standardized attributes have a mean of zero and a variance of one. t-SNE is used to represent high-dimensional data in easily understandable low dimensions, such as two or three. A seaborn scatterplot is used to visualize the t-SNE output.
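A minimal sketch of the scaling and t-SNE steps described above, assuming the numeric attributes of df are used:

```python
import seaborn as sns
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Standardize each feature independently: mean 0, variance 1
X_scaled = StandardScaler().fit_transform(df.select_dtypes("number"))

# Project to two dimensions with t-SNE and visualize with a scatterplot
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X_scaled)
sns.scatterplot(x=X_2d[:, 0], y=X_2d[:, 1])
```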
Principal component analysis (PCA) is carried out to reduce the data dimensions in a linear way. This step is essential as it enhances the accuracy of the results obtained and of the final decision that should be adopted. Clustering methods are affected by multicollinearity: when the attributes are highly correlated, the results tend to be biased towards a few attributes, producing poor cluster profiles. PCA is used to reduce the multicollinearity among the variables.
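A sketch of the PCA step, assuming the scaled matrix X_scaled from above (the 95% explained-variance threshold is an illustrative choice):

```python
from sklearn.decomposition import PCA

# Keep enough linear components to explain 95% of the variance,
# reducing multicollinearity before clustering
pca = PCA(n_components=0.95)
X_pca = pca.fit_transform(X_scaled)
print(pca.n_components_, pca.explained_variance_ratio_.cumsum())
```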
I used the elbow method to determine the optimal number of clusters, k. The optimal number of clusters is three or five, as seen in the attached screenshot.
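A sketch of the elbow computation, assuming the PCA-transformed data X_pca:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Within-cluster sum of squares (inertia) over a range of k;
# the "elbow" of this curve suggests the optimal number of clusters
ks = range(1, 11)
inertias = [KMeans(n_clusters=k, random_state=42).fit(X_pca).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia")
plt.show()
```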
The cluster profiling results show that recency, the products purchased, and the money spent are the most relevant cluster insights for customer segmentation. The store can then use the clusters for marketing, as they show which customer segments are most likely to be converted into sales. Cluster profiling would ensure proper budgeting by the store for effective marketing and would help retain frequent customers by providing them with promotions and incentives for their loyalty. Several models can be proposed, such as the Gaussian mixture model, DBSCAN, and K-Medoids, to get the best results for the customer segmentation campaign. Although the Gaussian mixture model also gives insights into the customer segments, DBSCAN, compared to this and the other techniques, is more suitable for segmenting the marketing campaign data.
The DBSCAN model gives more insights into the customer segments than the other techniques, and is hence more suitable for the segmentation of marketing campaigns. I therefore propose density-based spatial clustering of applications with noise (DBSCAN) to get the best results for the customer segmentation campaign. The DBSCAN model operates in a more distinctive way, as it identifies outliers as noise rather than forcing them into clusters. Additionally, DBSCAN does not require a preset number of clusters; the model discovers the clusters present in the data itself. DBSCAN's ability to detect outliers makes it more functional and appropriate for customer segmentation. In the future, applying this model will yield more accurate and appropriate results for organizations making marketing decisions.
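A minimal sketch of the proposed DBSCAN approach (eps and min_samples are illustrative and would need tuning, for example via a k-distance plot):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# eps: neighborhood radius; min_samples: points required for a dense region
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_pca)

# DBSCAN labels outliers as -1 (noise) instead of forcing them into clusters,
# and the number of clusters is discovered rather than preset
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters: {n_clusters}, noise points: {np.sum(labels == -1)}")
```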
Appendices
Correlation heatmap (figure)
Milestone 1
Problem Definition, Data Exploration, Proposed Approach
Problem Definition
● Context – Why is this problem important to solve?
○ Brief introduction to the problem
○ Advantages of solving the problem
○ Good to add some facts and numbers to support your argument
● Objectives – What is the intended goal?
○ The goals you are trying to achieve
○ Example – Reducing the attrition rate, improving the lead conversion rate
○ There can be multiple goals
Problem Definition
● The key questions – What are the key questions that need to be answered?
○ Curating questions related to the problem that need to be answered
○ The burning questions or important insights you are planning to draw while solving the problem
● The problem formulation – What is it that we are trying to solve using data science?
○ Already explained the general form of the problem. Now, formulate the problem as a data scientist
○ How data science fits into the spectrum of solving the problem
○ The nature of the data science problem
Data Exploration
● Data Description
○ Background of the data and what is it about?
○ Information about the variables included in the data
● Observations & Insights
○ What are some key patterns observed in the data during EDA?
○ How do the key patterns affect/relate to the problem?
○ What are the data treatments or pre-processing steps required, if any?
Proposed Approach
● Potential techniques
○ What are the potential techniques/models that should be explored in the next step?
○ Why are the suggested techniques the best to explore for the data and problem at hand?
● Overall solution design
○ What is the potential solution design?
○ The steps (and substeps) that will be followed to solve the problem
● Measures of success
○ The key measures of success that will be used to compare potential techniques/models
○ Why is the chosen metric the best for the problem at hand?
Milestone 2
Refined Insights, Techniques’ Comparison, Final Solution Design
Refined Insights
● List the most meaningful insights from the data relevant to the problem
● A meaningful insight has three components:
○ Good interpretation of the output from the data
○ Potential reason for that output
○ What it means for the problem/business
● Not more than 1 page or slide
Comparison of Techniques and their Performances
● Try different techniques to solve the problem
● Compare the performance of different techniques based on the metric chosen for the problem
○ Which technique is performing relatively better?
○ Pros and cons of different techniques
○ Good to include a comparison table
● Is there scope to improve the performance further? If yes, how?
Proposal for the Final Solution Design
● What model do you propose to be adopted?
○ Based on the comparison, which is the best model for the problem?
○ Think of the tradeoff between model performance and model interpretability
● Why is this the best solution to adopt?
○ Reason for choosing the best model
○ How does it solve the problem?
Final Submission
Executive Summary, Problem and Solution Summary, Recommendations
Executive Summary
● What are the key takeaways?
○ Identify and focus on the big picture first and all of its components
○ These components are usually the driving force for the end goal
○ Summarize the most important findings and takeaways in the beginning
● What are the key next steps?
○ Steps that can be taken to improve the solution
○ How to make the best of the solution?
○ What are the steps to be followed by the stakeholders?
Problem and Solution Summary
● What problem was being solved?
○ Summary of the problem
● Final proposed solution design
○ What are the key points that describe the final proposed solution design?
● Why is this a ‘valid’ solution that is likely to solve the problem?
○ The reason for the proposed solution design
○ How would it affect the problem/business?
Recommendations for Implementation
● What are some key recommendations to implement the solution?
● What are the key actionables for stakeholders?
● What is the expected benefit and/or costs?
○ List the benefits of the solution
○ Take some rational assumptions to put forward some numbers on costs/benefits for stakeholders
● What are the key risks and challenges?
○ What are the potential risks or challenges of the proposed solution design?
● What further analysis needs to be done or what other associated problems need to be solved?
General Tips
Do’s and Don’ts for a Good Project Report
Do’s
✅ Focus must be on the business problem and solving the same by analyzing the data
✅ Follow the guidelines provided on LMS and by the Program Office
✅ Include only the important material in the main body. The Appendix can contain code and all less important tables, figures, etc.
✅ Add code and references in the Appendix
✅ Use easily readable tables, figures, and graphs; work on the axis labels and legends
✅ Present all numbers up to 2 places of decimals only, unless required otherwise
✅ Highlight the innovations of the project and why the methods suggested ought to be utilized by the industry

Don’ts
❌ Following this template word for word; this template is just to help you get started
❌ Presenting numbers and figures without the business interpretation and what they mean for the problem
❌ Using any non-standard abbreviations in your report
❌ Filling the main body of the report with code
❌ Screenshots of tables/charts from Python output
❌ Explaining the theory of the techniques in the project report
❌ Using very large fonts and/or adding unnecessary visuals
❌ Including too much content on a single slide
Project Report VS Live Presentation
Project Report
● Graded by evaluator based on files submitted
● Includes all the analysis
● Can be a bit elaborate
● Convey the methodology to the evaluator
● Follow the rubric
● To be created for each milestone

Live Presentation
● Graded by faculty based on live presentation
● Good structure and flow
● Crisp and neat slides
● Include only bullet points
● Take your audience through the logical steps of your full project work
● Refer here for guidelines on creating presentation
Capstone Project: Forecasting Carbon Emissions
A Sample Report
Executive Summary
This project proposes the Seasonal ARIMA (SARIMA) model for the prediction and
forecasting of CO2 emissions from natural gas (NG) based electricity generation in the United
States. It is suggested that the model should only consider data beginning in the mid-1990s,
which was a period of significant change in the US energy system that precipitated the
displacement of coal with NG in electricity generation. The suggested model boasts high
conformance with observed data and low error. However, it is subject to a number of limitations,
including a lack of consideration for regional policy changes, technological shifts, and other
pertinent factors such as the development of renewables. It is recommended that stakeholders
consider these variables in building improved long-term forecasting models, as well as include
the full range of economic and environmental implications of various energy sources in
developing future energy policy.
Problem Summary
Climate change is one of the most pressing challenges facing the planet today. Global
warming is primarily driven by CO2 emissions from various industrial activities and power
generation, which remain important for economic development and our day-to-day lives. As the
US continues to develop strategies to minimize carbon emissions while maintaining a healthy
economy, it is becoming increasingly important to identify the sources of CO2 and forecast future
emissions so as to reduce climate impact and move towards a cleaner energy sector. The key
objective of this project is to build a forecasting model for the CO2 emissions from natural
gas (NG) in the US in order to provide policy recommendations to reduce emissions. The time
series analysis presented here provides insight into the relative contributions of natural gas
emissions to the total electricity sector emissions in the US, demonstrates trends in emissions
over time, and projects future NG-based emissions. The discussed analysis and forecast models
will help to understand the projected changes in natural gas-based emissions and serve as a
basis for recommendations for future environmental policy.
Solution design
A number of time-series models and methods were explored as part of the solution design,
including seasonal naïve models, models using the full and partial datasets, and the more complex Prophet model. The final proposed solution is the Seasonal ARIMA (SARIMA) model using the data from 1995 onwards, which has been optimized with AutoARIMA parameter tuning. The 1995-onwards dataset was chosen because the US energy system
experienced a shift around this period (further discussed in the next section), resulting in a decline
in coal use and a move towards NG.
Figure 1 shows the best model, which yields very low RMSE (2.73 and 5.29 for training
and test data, respectively). This model was effective in capturing the trend and seasonality of the
data and was an overall very close match to the original dataset. SARIMA with coal emissions as
an exogenous variable was also tested (see Appendix 1), which yielded a slightly higher RMSE for
the test data due to additional errors from coal emissions forecasting. Overall, this model was
very similar to SARIMA without the exogenous variable and did not provide much additional
insight given the available dataset.
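A sketch of how the SARIMA fit could be reproduced with pmdarima's auto_arima, assuming a monthly emissions series co2 restricted to 1995 onwards (the 24-month holdout is illustrative):

```python
import numpy as np
import pmdarima as pm
from sklearn.metrics import mean_squared_error

# Hold out the last two years of the post-1995 monthly series (illustrative)
train, test = co2[:-24], co2[-24:]

# Search SARIMA orders automatically; m=12 captures annual seasonality
model = pm.auto_arima(train, seasonal=True, m=12,
                      stepwise=True, suppress_warnings=True)

forecast = model.predict(n_periods=len(test))
rmse = np.sqrt(mean_squared_error(test, forecast))
```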
Furthermore, as discussed in the following section, it is more likely that the increased
availability of NG precipitated the phase-out of coal, implying that NG uptake may be an effective
predictor of coal emissions, but not the other way round.
The Prophet model, illustrated in Appendix 2, also produced a very close match to the
observed dataset with a very low RMSE. While it appears to be a robust predictive model, it
works best with larger datasets and daily data. Because monthly data is not well-supported by
this module, it poses challenges for tasks such as cross-validation. Therefore, I would
recommend conventional modeling methods given the limitations of this dataset.
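For comparison, a sketch of the Prophet fit, assuming the same series reshaped into the ds/y columns that Prophet expects:

```python
from prophet import Prophet

# Prophet expects a DataFrame with columns 'ds' (dates) and 'y' (values)
df_p = co2.reset_index()
df_p.columns = ["ds", "y"]

m = Prophet()
m.fit(df_p)

# Forecast 24 months ahead; freq="MS" generates month-start timestamps
future = m.make_future_dataframe(periods=24, freq="MS")
forecast = m.predict(future)
```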
Analysis and Key Insights
To draw accurate insights from the NG predictive model, it is important to examine the
broader emission patterns in the electricity sector along with contextual information from
secondary sources. It is also important to distinguish between emissions data and the amount of
electricity generated from the given source. Different sources have varying carbon intensities
(emissions per kilowatt-hour of electricity). In general, coal tends to be one of the more
carbon-intensive fuel sources as it is associated with higher emissions than oil or NG for the
same amount of electricity generated. Figure 2 shows that the total electricity sector emissions
trended upwards throughout the 1970s and into the 2000s and fell thereafter. Coal-based
emissions historically accounted for the largest share of the total emissions, even during the
decline beginning in the 2000s. Figure 3 further illustrates that coal maintained the highest
proportion of electricity generation for the bulk of the examined time period. However, beginning
in the 1990s, the proportion of NG has progressively increased, eventually overtaking coal as the
main electricity source.
The 1990s shift in the US electricity sector can be partly explained by the initial rise in
imported and domestically produced NG, followed by a sharp rise in domestic NG production
throughout the 2000s (shown in Figure 4). The sharp rise in production was caused by the
application of hydraulic fracturing (“fracking”) and horizontal drilling technology to extract the
previously discovered shale gas deposits (USGS, 2019). Prior to this technological innovation,
gas was only extracted from conventional deposits, while shale deposits remained inaccessible.
The expedient and cost-efficient exploitation of shale gas effectively outcompeted the less
economical coal power generation (Katusa, 2012). In other words, the rise of fracking NG in the
United States played a key role in the displacement of coal and its associated emissions that
were observed in our data.
Limitations and Recommendations for Further Analysis
Assuming the continued exploitation of shale gas, our model, which predicts the continued
increase of NG-based carbon emissions, is likely to be accurate and useful in the short-medium
term (i.e., 2-10 years). However, changes in regional policy can quickly nullify the key assumption.
For instance, fracking has already been banned in several states due to environmental concerns,
with further restrictions in other states and at the federal level (Bermel and Kahn, 2021).
Furthermore, like all nonrenewable resources, shale gas deposits are limited, with experts divided
on how much gas is actually available (and extractable without significant environmental
damage).
For the purpose of NG emissions forecasting, stakeholders need to consider the social,
environmental, and economic trade-offs associated with the continued expansion of fracking
and/or imports of NG, as they will likely influence the future role of NG in the US energy system.
Another aspect worth noting is the emergence of biomass in the late 1980s, and the substantive growth of wind and solar as major electricity sources beginning in the 2000s (see Appendix 3).
While these sources were not reflected in our carbon emissions dataset, they could be
crucial to consider in long-term emissions forecasting as they may displace carbon-emitting
sources, especially if backed by favorable policies and investment.
Overall, a flat univariate time series model is not able to reflect the complex contextual factors
leading to the rise of NG in the US nor foresee major shifts (such as the one seen in the 1990s).
Radical shifts in energy systems are not only possible, but are made increasingly likely by rapid
innovation, such as the large-scale implementation of smart grids, or the development of new
energy sources (e.g., better nuclear power).
Therefore, for more accurate long-term projections, it would be necessary to include
variables pertaining to other energy sources and emerging technologies, as well as keep in
mind the physical limitations of non-renewable sources.
Recommendations for Policy
The above analysis clearly demonstrates a decline in overall emissions from the electricity
sector despite the rise in NG-based power generation. This is largely due to the phase-out of
the more carbon-intensive coal-based power plants and their displacement by the more efficient
NG power plants. Thus, in the short term, I would recommend a continued phase-out of coal
coupled with the continued development of NG. However, keeping in mind the carbon emissions
and other environmental implications of NG (especially from fracking), I would recommend
continued investment in cleaner energy sources, such as wind, solar, and geothermal, as well as
new technologies to reduce energy intensity/demand and increase efficiency.
It is also worth noting that emissions data from electricity generation do not fully
encompass the carbon emissions from the given energy source. Fracking, in particular, has been
associated with the leakage of methane, which is a highly potent greenhouse gas. Conventional
extraction of gas, as well as oil and coal, is also associated with emissions from mining
operations and leakages. Therefore, for better-informed policymaking, I would recommend better
record-keeping of emissions associated with extraction, processing, transportation, and other
activities pertaining to the energy sources in question.
Bibliography
● Bermel, C., and Kahn, D. (2021). "Newsom to announce a ban on new fracking projects." POLITICO.
● Katusa, M. (2012). "Shale Gas Takes On Coal To Power America's Electrical Plants." Forbes.
● National Petroleum Council. (2011). Prudent Development: Realizing the Potential of North America's Abundant Natural Gas and Oil Resources.
● U.S. Energy Information Administration (EIA). (2021). "Electricity Explained: Electricity in the United States."
● U.S. Geological Survey (USGS). (2019). "Hydraulic Fracturing."
Appendix
Appendix 1: SARIMA with Coal Emissions as an Exogenous Variable
Appendix 2: Prediction using the Prophet model
Appendix 3: US electricity generation from renewable energy sources, 1950-2021