Customer Segmentation using Cluster Analysis

What is Cluster Analysis?

Cluster analysis, or clustering, is the process of grouping objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster but dissimilar to the objects in other clusters.

Image shows clustering based on percentage of US Arrests in different states

Types of Cluster Analysis

Hierarchical cluster analysis is an unsupervised clustering algorithm that builds clusters with a predetermined ordering from top to bottom. Example: the folders and files on a hard disk are organized in a hierarchy.

Agglomerative hierarchical clustering is the most common type of hierarchical clustering and groups objects into clusters based on their similarity. It is a "bottom-up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Single linkage: The distance between two clusters is defined as the shortest distance between a point in one cluster and a point in the other.
Complete linkage: The distance between two clusters is defined as the longest distance between a point in one cluster and a point in the other.
Average linkage: The distance between two clusters is defined as the average distance from each point in one cluster to every point in the other cluster.
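The three linkage rules are easy to compare on a tiny example. The sketch below is not part of the original project: the two toy clusters `a` and `b` and the name `linkage_distances` are made up for illustration, and only base R is used.

```r
# Two hypothetical one-dimensional clusters (illustrative values only)
a <- c(1, 2, 3)
b <- c(6, 8)

# All pairwise distances between a point in `a` and a point in `b`
pairwise <- abs(outer(a, b, "-"))

linkage_distances <- c(
  single   = min(pairwise),   # closest pair across the two clusters
  complete = max(pairwise),   # farthest pair across the two clusters
  average  = mean(pairwise)   # mean over all cross-cluster pairs
)
print(linkage_distances)
```

In practice we would not compute these by hand: `hclust(dist(x), method = "single")`, `"complete"` or `"average"` selects the rule for us.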

Divisive hierarchical clustering is a "top-down" clustering method: assign all of the observations to a single cluster and then partition that cluster into the two least similar clusters. We then proceed recursively on each cluster until there is one cluster for each observation. It is therefore the exact opposite of the agglomerative method.

K-means is a centroid-based, or distance-based, algorithm in which we calculate distances to assign each point to a cluster. In K-means, each cluster is associated with a centroid. The main objective of the K-means algorithm is to minimize the sum of squared distances between the points and their respective cluster centroids.
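R's `kmeans()` reports the quantity it minimizes as `tot.withinss`, the total of the squared distances between each point and the centroid of its cluster. As a rough check, the sketch below recomputes that number by hand. The toy data `x` is randomly generated for illustration and has nothing to do with the mall data.

```r
set.seed(1)
# Hypothetical toy data: two well-separated groups in two dimensions
x <- rbind(
  matrix(rnorm(40, mean = 0), ncol = 2),
  matrix(rnorm(40, mean = 5), ncol = 2)
)

fit <- kmeans(x, centers = 2, nstart = 10)

# Recompute the objective by hand: squared Euclidean distance from
# each point to the centroid of the cluster it was assigned to
assigned_centroids <- fit$centers[fit$cluster, , drop = FALSE]
wss_by_hand <- sum((x - assigned_centroids)^2)

print(c(by_hand = wss_by_hand, from_kmeans = fit$tot.withinss))
```

The two numbers should agree up to floating-point error.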

How do we represent these clusters?

For hierarchical clustering we use dendrograms. A dendrogram represents the hierarchical relationships between objects. The Y-axis is the distance at which clusters are merged and the X-axis is the order of the clustered objects/groups.
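To see a dendrogram in practice, here is a minimal sketch using the `USArrests` data set that ships with base R. That the figure above was built from this same data set is an assumption, and the choices of complete linkage and 4 groups are arbitrary and only for illustration.

```r
# USArrests ships with base R (datasets package)
scaled <- scale(USArrests)          # standardize so no variable dominates
hc <- hclust(dist(scaled), method = "complete")

# Dendrogram: merge height on the y-axis, states along the x-axis
plot(hc, cex = 0.6, main = "Dendrogram of USArrests (complete linkage)",
     xlab = "", sub = "")

# Cut the tree into 4 groups (4 is an arbitrary choice for illustration)
groups <- cutree(hc, k = 4)
print(table(groups))
```

`cutree()` turns the tree into flat cluster labels, one per state, which is handy when we want to compare the result with K-means.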

For K-means clustering we draw a plot of all the data points in which each point is marked (for example, colored) with the identity of the cluster it belongs to.

Customer Segmentation using R

1. Load the data set

#read the data
cluster=read.csv("Mall_customers.csv")

#to see what the data is really about
head(cluster)

Customer-Id: unique number for each customer
Gender: customer gender (Male/Female)
Age: age of the customer
Annual Income (k): annual income of the customer, in thousands
Spending Score (1–100): score assigned by the mall based on customer behavior and spending nature

2. Inspect data

Check for missing data:

colSums(is.na(cluster))

Every column count is 0, so we can conclude that there is no missing data.
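For comparison, here is what the same check looks like when values are missing. The `toy` data frame below is made up for illustration and is not part of the mall data.

```r
# Hypothetical toy data frame with two missing values in Age
toy <- data.frame(
  Age    = c(19, NA, 35, NA),
  Gender = c("Male", "Female", "Female", "Male")
)

# is.na() marks each missing cell; colSums() counts them per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)
```

A nonzero count for a column tells us that column needs attention (dropping rows or imputing values) before clustering.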

Summarize the data set:

summary(cluster)

3. Visualization of data

Gender data visualization

#count customers per gender; gender goes on the x-axis, counts on the y-axis
gen=table(cluster$Gender)
barplot(gen,main="gender data barplot",xlab="Gender",ylab="No of people",col=rainbow(2))

Conclusion: There are more female customers than male customers in the data.

Age data visualization

boxplot(cluster$Age,col="#ff0066",main="Boxplot for age of customers coming to mall")

Conclusion: The box plot puts the typical customer age at around 30–35, with a minimum age of 18 and a maximum age of 70.

Annual Income of customers

hist(cluster$Annual.Income..k..,col="#660033",main="Histogram for annual income",xlab="Annual Income Class",ylab="frequency",labels=TRUE)

Conclusion: The minimum income is 15k, the highest income is 137k and the average income is around 60k. The histogram also suggests that income roughly follows a normal distribution.
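Reading normality off a histogram is only a visual impression. One way to check it more carefully is a Q-Q plot together with a Shapiro-Wilk test. This check is not part of the original project, and the sketch runs on synthetic values (`income_demo`, drawn from a normal distribution with made-up parameters) rather than on the mall data.

```r
set.seed(42)
# Stand-in for an income column: 200 hypothetical values, not the mall data
income_demo <- rnorm(200, mean = 60, sd = 26)

# Q-Q plot: points close to the line suggest approximate normality
qqnorm(income_demo, main = "Normal Q-Q plot (synthetic income)")
qqline(income_demo)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(income_demo)
print(sw$p.value)
```

On the real data, the same calls would take `cluster$Annual.Income..k..` in place of `income_demo`.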

Spending score of customers

summary(cluster$Spending.Score..1.100.)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 50.00 50.20 73.00 99.00

boxplot(cluster$Spending.Score..1.100.,horizontal=TRUE,col="#990000",main="BoxPlot for Descriptive Analysis of Spending Score")

Conclusion: The minimum spending score is 1, the maximum is 99 and the mean is 50.20. The box plot agrees, with the median at about 50 and the middle half of the scores between roughly 35 and 73.

K-means Algorithm

1. Find the optimal number of clusters (using the elbow method)

set.seed(123)
k.maximum <- 15
d <- as.matrix(scale(cluster[,(3:5)]))

#total within-cluster sum of squares for each k from 1 to k.maximum
w <- sapply(1:k.maximum,function(k){kmeans(d,k,nstart=100,iter.max=100)$tot.withinss})

plot(1:k.maximum, w,
 type="b", pch = 19, frame = FALSE, 
 xlab="Number of clusters",
 ylab="Sum of squares")

Conclusion: The elbow in the plot suggests 4 as the optimal number of clusters, although values up to 6 are also reasonable.

2. Try K-means with 4, then 5 and 6 clusters.
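The code for this step is not shown above, so here is a minimal sketch of how it could look. It is an assumption, not the original implementation. To keep it runnable on its own, it uses a synthetic data frame `demo` with the same column names as the mall data (random values, so the clusters it finds are not meaningful), and the names `d_demo` and `fits` are made up.

```r
set.seed(123)
# Synthetic stand-in with the article's column names (NOT the real mall data)
demo <- data.frame(
  Age                    = round(runif(200, 18, 70)),
  Annual.Income..k..     = round(runif(200, 15, 137)),
  Spending.Score..1.100. = round(runif(200, 1, 99))
)
d_demo <- scale(demo)   # same scaling step as in the elbow code

# Fit K-means once for each candidate number of clusters
fits <- lapply(c(4, 5, 6), function(k) {
  kmeans(d_demo, centers = k, nstart = 50, iter.max = 100)
})
names(fits) <- c("k4", "k5", "k6")

# Plot income against spending score, one colour per cluster (k = 6 shown)
plot(demo$Annual.Income..k.., demo$Spending.Score..1.100.,
     col = fits$k6$cluster, pch = 19,
     xlab = "Annual income (k)", ylab = "Spending score (1-100)",
     main = "K-means clusters, k = 6 (synthetic data)")
```

With the real data we would replace `d_demo` by the scaled matrix `d` from the elbow step and plot the columns of `cluster` instead of `demo`.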

OBSERVATION:

With 6 clusters, the groups can be read as follows.

Cluster 1: customers with a high annual income and a high annual spend.

Cluster 2: customers with a high annual income but a low annual spend.

Cluster 3: customers with a low annual income and a low annual spend.

Clusters 4 and 6: customers with a medium annual income and a medium annual spend.

Cluster 5: customers with a low annual income but a high annual spend.
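Labels like "high income, low spend" can be backed up with per-cluster averages. The sketch below shows one possible way to compute them with `aggregate()`. It is not taken from the original project and runs on a synthetic stand-in data frame `demo` (random values with the article's column names), so its numbers will not reproduce the observations above. Note also that K-means numbers its clusters arbitrarily, so cluster 1 in one run need not be cluster 1 in another.

```r
set.seed(123)
# Synthetic stand-in with the article's column names (NOT the real mall data)
demo <- data.frame(
  Age                    = round(runif(200, 18, 70)),
  Annual.Income..k..     = round(runif(200, 15, 137)),
  Spending.Score..1.100. = round(runif(200, 1, 99))
)
fit <- kmeans(scale(demo), centers = 6, nstart = 50, iter.max = 100)

# Mean of every variable per cluster, on the original (unscaled) scale
profile <- aggregate(demo, by = list(cluster = fit$cluster), FUN = mean)
print(profile)
```

Each row of `profile` describes one cluster, which makes it straightforward to see which clusters combine, say, a high mean income with a low mean spending score.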

CONCLUSION:

In this data science project, we built a customer segmentation model. Specifically, we made use of a clustering algorithm called K-means clustering. We analyzed and visualized the data and then proceeded to implement our algorithm.

Hope you enjoyed the project!