Farhad Malik

How I Used Machine Learning To Organise Football Matches

Explaining how machine learning models can be used in day to day tasks

Machine learning is gaining popularity and many firms are adopting it in their decision-making process. But can we use it to help with our daily tasks? I put this question to the test, and machine learning turned out to be very helpful. Let me explain.

I will also take this opportunity to explain machine learning from the foundations up, because it is an area that every professional needs to be familiar with.

Photo by Tevarak Phanduang on Unsplash

Let’s start!

We had a great summer this year. It was the first of May and I decided to organise a football match with my friends. So I phoned a nearby sports center to ask about the cost of booking a football pitch for 1 hour on the first weekend of June. The sports center asked me to pay the full cost and book in advance due to high demand. “Pre-book?” I asked. “But I don’t know how many of my friends (if any) will play.” “Well, can you not, maybe, guess or predict?” they asked.

Hold on! Predict? Machine learning can do that! It can be great at forecasting.

So I asked myself: is there a way to predict how many of my friends would be interested in playing football? Can we use a machine learning algorithm to group my friends into two categories, “Interested” and “Not Interested”? That’s where machine learning comes to the rescue, as it can do just that! Let me explain.

Article Aim

In this article:

  1. I will briefly explain what machine learning is and its three basic techniques
  2. I will explain how we can gather data and feed it to a model, along with the common data science steps
  3. Along the way, I will present the Python code that I use to group my friends into the appropriate categories.

I will also explain the concepts so that all of us can understand why we do things the way we do in a machine learning project.

This article will provide you an end-to-end view of a machine learning project from start to its completion.

For those of us who are new to the buzzword “machine learning”, let me give a quick overview!

1. What Is A Machine Learning Model?

Machine learning is a field of artificial intelligence.

In a nutshell, it revolves around the concept of feeding lots of data to an algorithm (call it a model) so that the model can learn the patterns of the data and subsequently train itself. We can then use the trained model to forecast data that it has not seen before. So, machine learning is about getting machines to learn from the data.

Photo by Agence Olloweb on Unsplash

2. Data Preparation

For the problem that I am trying to solve, I need to gather some data. This is where data science comes in, and it is usually the first step in a machine learning project.

What sort of data would I need and how do I decide whether a friend of mine would be interested?

Well, I am trying to categorise my friends into two groups. The first task is to understand the factors that the category is dependent on. Let me elaborate!

  • I was planning to invite my friends via WhatsApp, so the first question to investigate is whether my friends are active on WhatsApp. The active friends are more likely to view my invitation.
  • Their age and the area they live in matter too. It’s also vital to record whether we have played a football match together before and whether they play football on a regular basis. The answers to these questions will also help me put my friends into the appropriate categories.
  • What else? Well, I can also record whether we have met each other in the last 3 months. The area they live in can also determine whether it would be feasible for them to come. As I was planning to organise the match on the first weekend of June, I should also estimate whether they are usually free on weekends and are available in June. Some of it was guesswork!

OK, that’s how we gather data. I then created a CSV file with the following columns:

  1. Name (this is just for me to identify them)
  2. Age
  3. Area
  4. Played Football Before With Me
  5. Plays Football Regularly
  6. Active On WhatsApp
  7. Met In Last 3 Months
  8. Free Weekends
  9. Available In June

Most of the labels are self-explanatory. My initial intuition is based on the assumption that these factors will influence the decision criteria on whether they will be interested in playing the football match.

For boolean fields (True/False), I used 1 to represent True and 0 to represent False.
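
To make the layout concrete, here is a minimal sketch of how such a file could look once it is loaded. The three rows are made up for illustration, and only `Name`, `Age` and `Area` are column names that appear in the code later in this article; the other column names are assumptions.

```python
import io
import pandas as pd

# Made-up rows for illustration only; not the real data behind this article.
# Column names other than Name, Age and Area are assumptions.
csv_text = """Name,Age,Area,PlayedBefore,PlaysRegularly,ActiveOnWhatsApp,MetInLast3Months,FreeWeekends,AvailableInJune
Friend A,25,North,1,1,1,1,1,1
Friend B,41,South,0,0,1,0,0,1
Friend C,33,North,1,0,,1,1,0
"""

# Read the text as if it were a CSV file on disk
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)               # (3, 9)
print(sample.isna().sum().sum())  # 1 (Friend C has no ActiveOnWhatsApp value)
```

Boolean answers are stored as 1 and 0, and an unknown answer is simply left empty so that it can be filled in later.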

3. Python Code — Stage 1

I used Python to implement the solution.

3.1 Preparing Environment

Importing Libraries

The first task is to import the libraries we need:

import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as pl

Loading Data Into DataFrame

The next task is to load the data:

#Read File
train = pd.read_csv(r"football\TrainFootballEvent.csv")

One of the important tasks in data science is to encode categorical values as numerical values.

3.2 Data Science Steps

Encoding Data

Not all of the data in our data set can be represented as numbers. For example, the area where each friend lives is stored as text. For textual fields, we can use a label encoder to convert text to numbers:

#Encoding Categorical To Numerical Values
labelEncoder = LabelEncoder()
labelEncoder.fit(train['Area'])
train['Area'] = labelEncoder.transform(train['Area'])
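
To see what the encoder actually produces, here is a small sketch on made-up area names (the names are assumptions, not values from my file). It uses `fit_transform`, which fits and transforms in one call.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical area names, for illustration only
areas = ["North", "South", "North", "East"]

encoder = LabelEncoder()
codes = encoder.fit_transform(areas)  # fit and transform in one step

print(encoder.classes_.tolist())  # ['East', 'North', 'South']
print(codes.tolist())             # [1, 2, 1, 0]
```

The codes are assigned in alphabetical order, so the numbers carry no real ordering of the areas. A distance-based algorithm will still treat them as ordinary numbers, which is worth keeping in mind.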

Fill Missing Data

At times, we get the issue of missing data in our data set. To fill missing data, we have multiple options, including:

  1. Use machine learning algorithms to generate missing data
  2. Provide default values
  3. Use historic values
  4. Calculate missing data such as by using mean, median, max, min to fill missing data

When I couldn’t figure out whether a friend was active on WhatsApp, I defaulted the empty value to “False” (0):

# Fill missing values with 0 (False)
train.fillna(0, inplace=True)
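
As a rough sketch of two of the options listed above, the snippet below fills one column with a default value and another with a calculated value. The data and the `ActiveOnWhatsApp` column name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration; the column names are assumptions
df = pd.DataFrame({
    "Age": [25.0, np.nan, 33.0],
    "ActiveOnWhatsApp": [1.0, np.nan, 0.0],
})

# Option 2 from the list: a default value (0 stands for False)
df["ActiveOnWhatsApp"] = df["ActiveOnWhatsApp"].fillna(0)

# Option 4 from the list: a calculated value, here the column median
df["Age"] = df["Age"].fillna(df["Age"].median())

print(df["Age"].tolist())               # [25.0, 29.0, 33.0]
print(df["ActiveOnWhatsApp"].tolist())  # [1.0, 0.0, 0.0]
```

Filling column by column lets each field get a default that makes sense for it, instead of one value for the whole table.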

Data manipulation

Finally, we often have to transform the data. For example, I am calculating the inverse of age because my assumption is that my younger friends are more likely to join me for a football match.

#lower age, more likely they will be interested to play
train['Age'] = train["Age"].apply(lambda x: (1/x))
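
A quick sketch of what this transformation does to a few made-up ages. It uses plain vectorised division, which should give the same values as the `apply` call above.

```python
import pandas as pd

ages = pd.Series([20, 25, 40, 50])  # made-up ages, for illustration only
inverse_age = 1 / ages              # vectorised equivalent of apply(lambda x: 1/x)

print(inverse_age.tolist())  # [0.05, 0.04, 0.025, 0.02]
```

Younger friends end up with larger values. Note that an age of 0 (for example, a missing age that was filled with 0) would not produce a usable number, so the Age column needs real values before this step.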

Drop the text columns: Name

#drop Name column from the data
X = train.drop(columns=['Name']).astype(float).values

Photo by Sharon Pittaway on Unsplash

3. Choosing A Machine Learning Algorithm

Now our data is ready. Our next task is to choose a machine learning algorithm. There are a large number of machine learning models available (potentially hundreds).

Can we group the machine learning models into different groups?

Yes. We can categorise the machine learning algorithms into three main groups:

  1. Supervised: These algorithms require us to feed them with the right answers along with the data so that they can understand the patterns of the data and work out how they need to calculate the correct answer themselves.
  2. Unsupervised: These algorithms do not expect us to feed them the correct answers for the data; they work out the structure themselves. These algorithms are good for grouping data into appropriate clusters. For example, we can use them to spot trends and patterns in complex data and group the data accordingly. That’s what I need.
  3. Reinforcement learning: These are feedback/interactions based learning algorithms. Essentially, these algorithms perform an action and assess whether the action helped them maximise their end goal.
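
The difference between the first two groups shows up directly in code. The sketch below is only an illustration on made-up data; `LogisticRegression` is my own choice of a supervised model here and is not used anywhere else in this article.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Made-up one-feature data, for illustration only
X_demo = np.array([[1.0], [1.2], [0.9], [5.0], [5.3], [4.8]])
y_demo = np.array([0, 0, 0, 1, 1, 1])  # the "right answers" (labels)

# Supervised: the model is given both the data and the answers
classifier = LogisticRegression().fit(X_demo, y_demo)

# Unsupervised: the model is given the data only and finds the groups itself
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)

print(classifier.predict([[1.1], [5.1]]).tolist())  # [0, 1]
print(len(set(clusterer.labels_.tolist())))          # 2
```

Notice that `fit` receives `y_demo` only in the supervised case. The clustering model never sees the answers.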

For more information on machine learning, read this article: Machine Learning In 8 Minutes

Going back to the task of organising the football match: I want to know how many of my friends will attend. I don’t have that information handy. Therefore, I want to cluster my friends into two clusters: Interested and Not Interested. That’s an unsupervised machine learning problem. So far so good!

4. Wait, What Is Clustering?

Clustering is used to group data into segments. Similar data items are clustered together using a distance or similarity measure such as Euclidean distance, Manhattan distance, cosine similarity or Pearson correlation. We have to use an unsupervised clustering algorithm as our data is not labeled.
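
As a rough illustration, the four measures mentioned above can be computed with NumPy for two made-up feature vectors:

```python
import numpy as np

# Two made-up feature vectors, for illustration only
a = np.array([1.0, 0.0, 1.0])
b = np.array([0.0, 1.0, 1.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(a - b))          # sum of absolute differences
cosine_sim = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))  # angle-based similarity
pearson = np.corrcoef(a, b)[0, 1]          # linear correlation

print(f"{euclidean:.3f}")   # 1.414
print(f"{manhattan:.3f}")   # 2.000
print(f"{cosine_sim:.3f}")  # 0.500
print(f"{pearson:.3f}")     # -0.500
```

The first two are distances (smaller means more similar), while the last two are similarities (larger means more similar).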

Photo by Greyson Joralemon on Unsplash

5. What Are The Two Main Types Of Clustering?

There are mainly two types of clustering algorithms:

  • Centroid-based clustering: When you know the number of clusters upfront. The number of clusters is fixed beforehand and the data is then split into that many groups. The center of each group is known as its centroid. Each data item is assigned to the group whose centroid is closest to it. Algorithms include K-Means.
  • Hierarchical clustering: When you want the machine to find the right number of clusters. Each data item starts as its own cluster, and the closest clusters are then merged step by step, based on their distance, until a suitable set of clusters is reached. Algorithms include agglomerative clustering.

There are also other types of clustering algorithms, such as distribution-based clustering, which uses the underlying probability distribution of the data to group it into clusters.
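
Here is a minimal sketch of the two main types side by side, using scikit-learn on made-up points. The `distance_threshold` value of 2.0 is an assumption that happens to suit this toy data.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

# Made-up 2-D points forming two well-separated groups
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Centroid-based: the number of clusters is given up front
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Hierarchical: merge the closest clusters step by step; distance_threshold
# stops the merging instead of fixing the number of clusters
agglo = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
agglo_labels = agglo.fit_predict(points)

print(agglo.n_clusters_)  # 2
```

With K-Means I have to say "two clusters" myself. With the hierarchical model the number of clusters falls out of the distance threshold.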

Photo by Marvin Meyer on Unsplash

6. What Are The Most Common Unsupervised Clustering Algorithms?

Now that we have narrowed down the search, let’s find the handful of algorithms and choose one.

There are many unsupervised machine learning algorithms including:

K-Means, GMM, PCA and LDA.

Detailed comparison of algorithms is outlined here: Machine Learning Algorithms Comparison

I decided to choose K-Means because, first, it is popular. Second, it will give us a sense of what we can achieve. Lastly, it is pretty much transparent! We can always optimise the data set and choose a different algorithm later. K-Means is my baseline algorithm.

7. Explaining the K-Means Algorithm

K-Means is an unsupervised clustering algorithm that is used to group data into k clusters. The algorithm is simple:

Repeat the two steps below until the clusters and their means are stable:

  1. Assign each data item to the nearest cluster center. The nearest center is found using a distance measure such as Euclidean distance. Initially, the cluster centers are picked from the data, for example at random.
  2. Recalculate the mean of each cluster from the data items assigned to it, and move the cluster center to that mean.

Once the clusters and their means are stable, every data item has been grouped into its relevant cluster.
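
The two steps can be sketched in a few lines of NumPy. This is a simplified illustration and not scikit-learn's implementation; the function name `simple_kmeans`, the random choice of starting centers and the toy points are all my own assumptions.

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-Means sketch: assign points to the nearest center, then update the means."""
    rng = np.random.default_rng(seed)
    # Initialise the centers with k distinct data points chosen at random
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 1: assign every point to its nearest center (Euclidean distance)
        distances = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each center to the mean of the points assigned to it
        # (an empty cluster keeps its previous center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):  # the means are stable: stop
            break
        centers = new_centers
    return labels, centers

# Made-up points forming two obvious groups
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])
labels, centers = simple_kmeans(points, k=2)
print(labels)
```

The first three points end up in one cluster and the last three in the other. Which of the two gets number 0 and which gets number 1 depends on the starting centers.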

There are, however, limitations to the K-Means algorithm:

  • K-Means algorithm does not work well with missing data.
  • It starts from randomly chosen cluster centers, which makes the results non-deterministic. We can, however, supply our own random seed if we want repeatable results.
  • It can get slower with larger data sets. I only have a few data items.
  • It does not work well with categorical (textual) data. All of my data is numerical now.
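
On the random seed point: in scikit-learn the seed is passed through the `random_state` parameter. A minimal sketch on made-up data (the seed value 7 is arbitrary):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
data = rng.normal(size=(30, 3))  # made-up numeric data, for illustration only

# Same seed, same starting centers, same result on every run
first = KMeans(n_clusters=2, n_init=10, random_state=7).fit(data)
second = KMeans(n_clusters=2, n_init=10, random_state=7).fit(data)

print(np.array_equal(first.labels_, second.labels_))  # True
```

Without `random_state`, two runs can number the clusters differently or even split the data differently.
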
Photo by Fotis Fotopoulos on Unsplash

8. Use Python K-Means Algorithm To Cluster Data

I used the KMeans algorithm from Python’s scikit-learn library to cluster the data into groups:

#KMeans Model
kmeans = KMeans(n_clusters=2)  # two groups: Interested and Not Interested
kmeans.fit(X)
labels = kmeans.labels_
#Cluster numbers are arbitrary; here cluster 1 is treated as Interested
#Plot the first two columns of X for each cluster
c1 = pl.scatter(X[labels == 1, 0], X[labels == 1, 1], c='g', marker='p')
c2 = pl.scatter(X[labels == 0, 0], X[labels == 0, 1], c='r', marker='*')
pl.legend([c1, c2], ['Interested', 'Not Interested'])
pl.title('K-Means Of Interested Vs Not Interested Friends')
pl.show()

As a result, the data items (friends) were clustered into two groups: Interested and Not Interested. This can be seen in the scatter plot below:

All of the green dots are the friends who are interested.

9. Testing

I went ahead and organised the football match. I then checked how many of my friends actually attended. The algorithm showed pretty good accuracy: more than 80%.

There must be other factors that I did not take into account, which could further improve the results.
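
For anyone who wants to run the same kind of check, here is a minimal sketch of the comparison. The two arrays are made up for illustration and are not my real results.

```python
import numpy as np

# Made-up example values: 1 = Interested / attended, 0 = Not Interested / absent
predicted = np.array([1, 1, 0, 1, 0, 0, 1, 0])  # cluster assigned to each friend
attended = np.array([1, 1, 0, 0, 0, 0, 1, 1])   # who actually turned up

# Share of friends for whom the cluster matched what really happened
accuracy = np.mean(predicted == attended)
print(f"{accuracy:.0%}")  # 75%
```

Because K-Means numbers its clusters arbitrarily, it is worth confirming which cluster stands for "Interested" before comparing.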

Full Code

#import libraries
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as pl
#Read File
train = pd.read_csv(r"TrainFootballEvent.csv")
#Encoding Categorical To Numerical Values
labelEncoder = LabelEncoder()
labelEncoder.fit(train['Area'])
train['Area'] = labelEncoder.transform(train['Area'])
#Fill Missing Data with 0 (False)
train.fillna(0, inplace=True)
#lower age, more likely they will be interested to play
train['Age'] = train["Age"].apply(lambda x: (1/x))
#drop Name column from the data
X = train.drop(columns=['Name']).astype(float).values
#KMeans Model
kmeans = KMeans(n_clusters=2)  # two groups: Interested and Not Interested
kmeans.fit(X)
labels = kmeans.labels_
#Cluster numbers are arbitrary; here cluster 1 is treated as Interested
#Plot the first two columns of X for each cluster
c1 = pl.scatter(X[labels == 1, 0], X[labels == 1, 1], c='g', marker='p')
c2 = pl.scatter(X[labels == 0, 0], X[labels == 0, 1], c='r', marker='*')
pl.legend([c1, c2], ['Interested', 'Not Interested'])
pl.title('K-Means Of Interested Vs Not Interested Friends')
pl.show()

10. Improvements

We can further improve the accuracy by:

  • Applying different clustering algorithms
  • Gathering better data to ensure we capture all the factors that can influence the results
  • Tweaking the weights to get better results
  • Introducing more clusters alongside Yes and No, such as MayBeYes and MayBeNo
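
One simple way to tweak weights, sketched below, is to multiply each column by a weight before clustering, so that the heavier columns count for more in the distance calculation. The feature matrix, its column meanings and the weights are all made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up feature matrix: columns are [inverse age, plays regularly, free weekends]
X_demo = np.array([[0.04, 1, 1], [0.05, 1, 0], [0.02, 0, 0],
                   [0.03, 0, 1], [0.04, 1, 1], [0.02, 0, 0]], dtype=float)

# Hypothetical weights: a larger value makes a feature count more in the distance
weights = np.array([1.0, 3.0, 1.0])
X_weighted = X_demo * weights

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_weighted)
print(labels)
```

With the "plays regularly" column weighted three times as heavily, the two clusters in this toy data split on that column.
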
Photo by Waldemar Brandt on Unsplash

11. Summary

Planning, managing and organising events is a big part of our lives. Some businesses are built on event management. Most social events require us to understand who might not attend and whether our target audience is right.

You can use a machine-learning algorithm to forecast and plan your events better. This example demonstrated how clustering works. Unsupervised clustering algorithms can help us identify groups within our data.

Additionally, clustering can be used in a number of fields to group large sets of data. These groups can then help us plan our events better and we can make calculated decisions. K-Means is a simple yet powerful algorithm. It has huge potential in finding anomalies and outliers in our data.

Please let me know if you have any feedback.
