Anjana S


Customer Segmentation using Cluster Analysis

What is Cluster Analysis?

A cluster is a collection of data objects that are similar to one another within the same cluster but dissimilar to the objects in other clusters. The process of grouping objects into classes of similar objects is known as clustering.

Image shows clustering based on percentage of US Arrests in different states

Types of Cluster Analysis

Cluster Methods

Hierarchical cluster analysis is an unsupervised clustering algorithm that builds clusters with a nested ordering from top to bottom. Example: the folders and files on a hard disk form such a hierarchy.

Agglomerative hierarchical clustering is the most common type of hierarchical clustering, used to group objects into clusters based on their similarity. It’s a “bottom-up” approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

  • Single linkage: The distance between two clusters is defined as the shortest distance between a point in one cluster and a point in the other.
  • Complete linkage: The distance between two clusters is defined as the maximum distance between a point in one cluster and a point in the other.
  • Average linkage: The distance between two clusters is defined as the average distance between each point in one cluster and every point in the other cluster.
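The three linkage definitions can be checked by hand on a tiny example. The sketch below is illustrative only: the two clusters `a` and `b` are hypothetical one-dimensional points, not taken from the mall data.

```r
# Two small, hypothetical clusters of 1-D points (illustrative values only)
a <- c(0, 1)
b <- c(4, 6)

# All pairwise distances between a point in `a` and a point in `b`
pairwise <- abs(outer(a, b, "-"))

single_link   <- min(pairwise)   # shortest cross-cluster distance
complete_link <- max(pairwise)   # largest cross-cluster distance
average_link  <- mean(pairwise)  # mean of all cross-cluster distances

c(single = single_link, complete = complete_link, average = average_link)
```

In practice these linkages are usually selected through the `method` argument of base R's `hclust()`, for example `hclust(dist(x), method = "single")`.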

Divisive hierarchical clustering is a “top-down” method: assign all of the observations to a single cluster, then partition that cluster into the two least similar clusters. We then proceed recursively on each cluster until there is one cluster for each observation. It is therefore the exact opposite of the agglomerative method.

divisive vs agglomerative

K-means is a centroid-based (distance-based) algorithm: we calculate distances to assign each point to a cluster. In K-means, each cluster is associated with a centroid. The main objective of the K-means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
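To make the objective concrete, here is a minimal sketch on hypothetical toy data (the matrix `x` is invented for illustration). It recomputes the quantity that R's `kmeans()` reports as `tot.withinss`, the total of squared distances from each point to its assigned centroid.

```r
# Hypothetical toy data: two well-separated groups of 2-D points
x <- rbind(c(1, 1), c(1, 2), c(2, 1),
           c(8, 8), c(8, 9), c(9, 8))

set.seed(1)  # k-means starts from random centroids
fit <- kmeans(x, centers = 2, nstart = 10)

# Recompute the objective by hand: squared distance of every point
# to the centroid of the cluster it was assigned to
assigned_centroids <- fit$centers[fit$cluster, , drop = FALSE]
manual_wss <- sum((x - assigned_centroids)^2)

# The two values should agree up to floating-point error
all.equal(manual_wss, fit$tot.withinss)
```

This is the same `tot.withinss` value that the elbow method later in the article plots against the number of clusters.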

How do we represent these clusters?

For hierarchical clustering we use dendrograms, which represent the hierarchical relationships between objects. The y-axis shows the distance between clusters and the x-axis shows the order of the clustered objects or groups.
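As a rough illustration of a dendrogram, the sketch below clusters the `USArrests` data set that ships with base R. The choice of average linkage and of cutting the tree into 4 groups is an assumption made for the example, not something taken from the article.

```r
# USArrests ships with base R (datasets package)
d  <- dist(scale(USArrests))          # Euclidean distances on standardised columns
hc <- hclust(d, method = "average")   # agglomerative clustering, average linkage

# Dendrogram: y-axis = height (distance) at which clusters merge,
# x-axis = the states in clustered order
plot(hc, cex = 0.6, main = "Dendrogram of USArrests", xlab = "", sub = "")

# Cut the tree into 4 groups (4 is an arbitrary choice for illustration)
groups <- cutree(hc, k = 4)
table(groups)
```

Cutting the tree at a different height (or a different `k`) yields a coarser or finer grouping from the same dendrogram.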

For K-means clustering we plot all the data points, grouped and marked (for example, by colour) with a unique cluster identity.

K-means

Customer Segmentation using R

  1. We need to load the data set
# read the data
cluster <- read.csv("Mall_customers.csv")
# look at the first rows to see what the data is about
head(cluster)
The data set (customers)
  • Customer-Id: Unique number for each customer
  • Gender: Customer gender as Male/Female
  • Age: Age of the customer
  • Annual Income (k): Annual income of the customer, in thousands
  • Spending Score (1–100): Score assigned by the mall based on customer behavior and spending nature.

2. Inspect Data

  • Check for missing data
colSums(is.na(cluster))
Output of colSums code

From this output we can conclude that there is no missing data.
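For readers who have not seen this idiom, the sketch below shows what the check reports when values are missing. The data frame `toy` and its column names are hypothetical and exist only for this example.

```r
# Hypothetical toy data frame with two missing values
toy <- data.frame(Age    = c(19, NA, 35),
                  Income = c(15, 16, NA),
                  Score  = c(39, 81, 6))

# is.na() marks each cell TRUE/FALSE; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Keep only the rows that have no missing values at all
complete_rows <- toy[complete.cases(toy), ]
```

A column of all zeros in the `colSums()` output, as in the mall data, means no rows need to be dropped or imputed.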

  • Summarise the data set
summary(cluster)
Output of summary

3. Visualization of data

  • Gender data visualization
gen <- table(cluster$Gender)
barplot(gen, main = "Gender data barplot", xlab = "Gender", ylab = "Number of customers", col = rainbow(2))
Gender Data Plot

Conclusion: More female customers than male customers visit the mall.

  • Age data visualization
boxplot(cluster$Age, col = "#ff0066", main = "Boxplot for age of customers coming to mall")

Conclusion: The typical age of customers is 30–35, with a minimum age of 18 and a maximum age of 70.

  • Annual Income of customers
hist(cluster$Annual.Income..k..,col="#660033",main="Histogram for annual income",xlab="Annual Income Class",ylab="frequency",labels=TRUE)

Conclusion: The minimum income is 15k and the maximum is 137k, with an average of around 60k. The histogram suggests that income is roughly normally distributed.

  • Spending score of customers
summary(cluster$Spending.Score..1.100.)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   34.75   50.00   50.20   73.00   99.00
boxplot(cluster$Spending.Score..1.100., horizontal = TRUE, col = "#990000", main = "BoxPlot for Descriptive Analysis of Spending Score")
Box plot for spending score

Conclusion: The minimum spending score is 1, the maximum is 99 and the mean is 50.20. The box plot places the median at about 50.

K-means Algorithm

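  1. Find the optimal number of clusters (using the elbow method)
set.seed(123)  # make the random starts reproducible
k.maximum <- 15
# standardise Age, Annual Income and Spending Score (columns 3 to 5)
d <- as.matrix(scale(cluster[, 3:5]))
# total within-cluster sum of squares for k = 1, ..., k.maximum
w <- sapply(1:k.maximum, function(k) {
  kmeans(d, k, nstart = 100, iter.max = 100)$tot.withinss
})

plot(1:k.maximum, w,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters",
     ylab = "Sum of squares")
Elbow method Graph

Conclusion: The elbow suggests that the optimal number of clusters is 4, although values up to 6 are also reasonable.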

2. Let’s fit K-means with 4, then 5 and 6 clusters.

Optimal clusters as 4
Optimal clusters as 5
Optimal clusters as 6
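The article shows the resulting plots but not the code behind them. One possible way to fit and plot the three candidate solutions is sketched below. So that the sketch runs on its own, `cluster` is replaced by a simulated, hypothetical stand-in with the same column layout; with the real file you would keep the data frame read from Mall_customers.csv instead. The `nstart = 50` setting and the base-graphics scatter plot are assumptions, not necessarily what the author used.

```r
# Hypothetical stand-in for the mall data so the sketch is self-contained;
# with the real data, use the `cluster` data frame from read.csv() instead.
set.seed(123)
cluster <- data.frame(
  CustomerID             = 1:200,
  Gender                 = sample(c("Male", "Female"), 200, replace = TRUE),
  Age                    = sample(18:70, 200, replace = TRUE),
  Annual.Income..k..     = sample(15:137, 200, replace = TRUE),
  Spending.Score..1.100. = sample(1:99, 200, replace = TRUE)
)

# Standardise Age, Annual Income and Spending Score (columns 3 to 5)
d <- as.matrix(scale(cluster[, 3:5]))

# Fit K-means once for each candidate number of clusters
fits <- lapply(4:6, function(k) kmeans(d, centers = k, nstart = 50, iter.max = 100))
names(fits) <- paste0("k", 4:6)

# Compare the total within-cluster sum of squares of the three fits
sapply(fits, function(f) f$tot.withinss)

# Income against spending score, coloured by the 6-cluster assignment
plot(cluster$Annual.Income..k.., cluster$Spending.Score..1.100.,
     col = fits$k6$cluster, pch = 19,
     xlab = "Annual income (k)", ylab = "Spending score (1-100)")
```

Because the stand-in data is random, its clusters carry no meaning; the point is only the shape of the workflow.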

OBSERVATION:

Cluster 1: customers with a high annual income and a high annual spend.

Cluster 2: customers with a high annual income and a low annual spend.

Cluster 3: customers with a low annual income and a low annual spend.

Clusters 4 and 6: customers with a medium annual income and a medium annual spend.

Cluster 5: customers with a low annual income but a high annual spend.
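Descriptions like these can be backed by a table of per-cluster averages. The sketch below shows one way to build such a table with `aggregate()`. The `customers` data frame is simulated and hypothetical, so its numbers will not reproduce the clusters described above.

```r
set.seed(123)
# Hypothetical stand-in for the income and spending-score columns
customers <- data.frame(
  income = sample(15:137, 200, replace = TRUE),
  score  = sample(1:99, 200, replace = TRUE)
)

# Cluster on standardised values, as in the elbow-method code
fit <- kmeans(scale(customers), centers = 6, nstart = 50)

# Mean income and spending score per cluster, on the original scale
profile <- aggregate(customers, by = list(cluster = fit$cluster), FUN = mean)
profile
```

Note that K-means cluster numbers are arbitrary labels: a different seed can assign the number 1 to a different group, so the profile table should be re-read after every fit.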

CONCLUSION:

In this data science project, we built a customer segmentation model. Specifically, we made use of a clustering algorithm called K-means clustering. We analyzed and visualized the data and then proceeded to implement our algorithm.

Hope you enjoyed the project!
