Ribhu Nirek



K-Means Clustering Algorithm

Brief: K-means clustering is an unsupervised learning method. In this post, I introduce the idea of unsupervised learning and why it is useful. Then I talk about K-means clustering: the mathematical formulation of the problem, a Python implementation from scratch, and an implementation using a machine learning library.

Unsupervised Learning

Typically, machine learning models make predictions on data, learning previously unseen patterns to support important business decisions. When the data set consists of labels along with data points, the task is known as supervised learning; spam detection, speech recognition and handwriting recognition are some of its use cases. Learning methods that draw insights from data points without any ground truth or correct labels fall under the category of unsupervised learning.

Unsupervised learning is one of the basic techniques used in exploratory data analysis to make sense of the data before building more complex machine learning models for inference. Because it does not rely on human-labelled data, bias introduced by labelling is avoided. Also, as there are no labels, there are no correct answers to check against. From a probabilistic standpoint, the contrast between supervised and unsupervised learning is the following: supervised learning infers a conditional probability distribution p(x|y), conditioned on the label y, whereas unsupervised learning infers the prior probability distribution p(x) of the data itself.

K-Means Clustering Algorithm

The objective of clustering methods is to partition data points into a pre-determined number of clusters, maximizing inter-cluster distance and minimizing intra-cluster distance (that is, increasing similarity within each cluster).

K-means is one of the clustering techniques in unsupervised learning. Some other commonly used techniques are fuzzy clustering (soft k-means), hierarchical clustering and mixture models. Hard clustering, or hard k-means, assigns each data point to exactly one cluster (e.g. an email is either Spam or Not Spam), whereas soft k-means assigns each point a non-zero membership value for every cluster (e.g. Spam: 13%, Not Spam: 87%). I am covering hard clustering in this post.
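To make the hard versus soft distinction concrete, here is a small sketch. The function names and the `beta` stiffness parameter are my own for illustration, and the soft weights use a softmax over negative squared distances, which is one common choice rather than the only one (fuzzy c-means, for instance, uses a different membership formula).

```python
import math


def hard_assignment(distances):
    """Return the index of the single closest cluster."""
    return min(range(len(distances)), key=lambda j: distances[j])


def soft_assignment(distances, beta=1.0):
    """Return one membership weight per cluster; the weights sum to 1.

    Assumption of this sketch: weights are a softmax over negative
    squared distances, with `beta` controlling how sharp they are.
    """
    scores = [math.exp(-beta * d ** 2) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]


# Hypothetical distances from one data point to two centroids.
distances = [2.0, 1.0]
print(hard_assignment(distances))  # index of the closer centroid
print(soft_assignment(distances))  # a weight for every centroid
```

The hard version throws away everything except the winner, while the soft version keeps a (possibly tiny) weight for every cluster.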

How the K-means algorithm works:

1. Pick k centroids μ randomly (without replacement) from the data set X.

2. Compute the distance (L2, i.e. Euclidean distance) of each data point x from all the μ’s.

3. Assign each x the label of its closest centroid.

4. Update each centroid to the arithmetic mean of the points in its cluster, for each of the k clusters.

5. Repeat steps 2–4 until the centroids stop changing.

Mathematically, it can be reduced to finding an optimal partition S* of the dataset X.

Mathematical formulation of K-means
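In the usual notation (assumed here: S = {S_1, ..., S_k} for a partition of X into k clusters, and μ_i for the mean of the points in S_i), the standard k-means objective is the within-cluster sum of squared Euclidean distances:

```latex
S^{*} = \underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert S_i \rvert} \sum_{x \in S_i} x
```

Steps 2 and 3 of the algorithm lower this sum by reassigning points with the centroids held fixed, and step 4 lowers it by moving the centroids with the assignments held fixed.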

Code

First, I will write a basic implementation of k-means from scratch in Python.
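A minimal sketch of the five steps listed earlier, using NumPy. The function name `kmeans`, its parameters, and the choice to leave an empty cluster's centroid where it was are assumptions of this sketch.

```python
import numpy as np


def kmeans(X, k, max_iters=100, seed=0):
    """Hard k-means on an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)

    # Step 1: pick k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: Euclidean distance from every point to every centroid,
        # giving an (n_samples, k) matrix.
        diffs = X[:, None, :] - centroids[None, :, :]
        distances = np.linalg.norm(diffs, axis=2)

        # Step 3: label each point with the index of its closest centroid.
        labels = distances.argmin(axis=1)

        # Step 4: move each centroid to the mean of its assigned points.
        # An empty cluster keeps its previous centroid (sketch assumption).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])

        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels


# Tiny hand-made data set with two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans(X, k=2)
print(centroids)
print(labels)
```

The result depends on the random initial centroids, so on harder data it is worth running with several seeds and keeping the run with the lowest within-cluster sum of squares.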

Let’s generate some data and apply k-means to see how it works.
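One way to synthesize such data is to draw Gaussian noise around a few chosen centres. The centres, spread and sample counts below are hypothetical values picked only for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical cluster centres, chosen only for illustration.
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points_per_cluster = 100

# Draw Gaussian noise around each centre and stack into one array.
X = np.vstack([
    center + rng.normal(scale=0.8, size=(points_per_cluster, 2))
    for center in true_centers
])

# Shuffle the rows in place so the points are not ordered by cluster.
rng.shuffle(X)
print(X.shape)  # (300, 2)
```

Any k-means routine that accepts an (n_samples, n_features) array can then be run on `X` with k = 3.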

Synthesized data
Output: K-means from scratch

Not bad, huh? Building a model from scratch in 50 lines of code is cool :)

The same task can be done within a few lines by importing the scikit-learn library.
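A sketch of the scikit-learn version, assuming synthetic data from `make_blobs`; the sample count, number of centres and seeds are illustrative values.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic 2-D data with three blobs (illustrative parameters).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=42)

# n_init=10 reruns k-means from 10 random starts and keeps the best run.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(model.cluster_centers_)  # one row per learned centroid
print(np.bincount(labels))     # number of points in each cluster
```

Here `fit_predict` both fits the centroids and returns the cluster index of every point, so the whole from-scratch loop collapses into two lines.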

Output: K-means using sklearn

Sklearn gives pretty much the same output as the model we built from scratch on this dummy data set.

Once you have written a bare-bones implementation from scratch and are familiar with the nitty-gritty, implementing k-means or any other algorithm with specialized library functions is a walk in the park.

Conclusion

K-means is one of the simplest unsupervised learning methods. It can be used to draw insights during exploratory data analysis (EDA) before moving on to build a more sophisticated architecture to make decisions. This post is a good starting point to get some idea about unsupervised learning, clustering, k-means and its implementation.

Feel free to read, code and explore to learn more. Drop a note down below to share your experience. Thanks for reading :)
