y/operator.html">operators</a> in order to simplify some of the operations that will occur downstream, such as the computation of means, of distances or of equality between two points.</p>
<figure id="1c4e">
<div>
<div>
<iframe class="gist-iframe" src="/gist/linkerzx/5126f8b9589e093b8d100cdcc61dfa6b.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</iframe>
</div>
</div>
</figure><p id="f0bc">A cluster point is a point that can be assigned to a given cluster.</p>
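<p>As a rough illustration of the idea, a point class with a few operators and a cluster point built on top of it might look like the sketch below. The names <i>Point</i>, <i>ClusterPoint</i> and the <i>cluster</i> attribute are assumptions made for this example and are not taken from the gists.</p>

```python
class Point:
    """A 2D point with the operators needed for means and comparisons."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __add__(self, other):
        # Component-wise sum, so that sum(points, Point(0, 0)) works.
        return Point(self.x + other.x, self.y + other.y)

    def __truediv__(self, scalar):
        # Divide both coordinates by a number, e.g. to turn a sum into a mean.
        return Point(self.x / scalar, self.y / scalar)

    def __eq__(self, other):
        # Two points are equal when both coordinates match.
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __repr__(self):
        return f"Point({self.x}, {self.y})"


class ClusterPoint(Point):
    """A point that can additionally be assigned to a cluster."""

    def __init__(self, x, y, cluster=None):
        super().__init__(x, y)
        self.cluster = cluster  # index of the assigned cluster, or None


# With these operators, the mean of a list of points is a one-liner.
points = [Point(0, 0), Point(2, 4)]
mean = sum(points, Point(0, 0)) / len(points)
```

<p>Overloading the operators keeps the later steps short: a cluster mean becomes a sum followed by a division, and the convergence check becomes a plain equality test.</p>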
<figure id="7cbc">
<div>
<div>
<iframe class="gist-iframe" src="/gist/linkerzx/d33dae6cf983f2fc5fb711f7ae884ea0.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</iframe>
</div>
</div>
</figure><p id="3060">The k-means algorithm relies on a notion of distance to tie each point to its closest mean point. The distance function shown above calculates the Euclidean distance between two <i>2D</i> points, while the closest function returns the index of the first point in a list of points that is nearest to a given point <i>p</i>.</p><h2 id="161f">Fit Method</h2><p id="4590">Let’s first focus on what the <b>fit</b> method should do, given how the k-means algorithm is meant to behave:</p><ul><li><b>Assignment Step: </b>It picks a point at random for each of the n clusters defined for k-means and assigns it to that cluster as its initial center.</li><li><b>Update Step:</b> It assigns each point to its closest cluster and recomputes the cluster means, repeating until a stopping criterion is met, usually a maximum number of iterations or a convergence condition (the cluster means no longer change).</li></ul><p id="c539">Let’s look at how to implement the assignment step within the fit method:</p>
<figure id="0af2">
<div>
<div>
<iframe class="gist-iframe" src="/gist/linkerzx/76938e23d0c01ee32127101d50669f52.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</iframe>
</div>
</div>
</figure><p id="bd0a">Until the required number of initial cluster points has been chosen, the code picks one point at random to be the center of a cluster. These centers are pushed to the variable cluster_centers_ for further use. We then initialize every point by assigning it to its closest cluster point.</p>
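<p>The distance helpers and the random seeding described above can be sketched as follows. This is a minimal, self-contained version that assumes points are plain <i>(x, y)</i> tuples instead of the point class; <i>init_centers</i>, its <i>rng</i> parameter and the rule that two centers may not be identical are assumptions of this sketch, not necessarily what the gist does.</p>

```python
import math
import random


def distance(a, b):
    """Euclidean distance between two 2D points given as (x, y) tuples."""
    return math.sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def closest(p, points):
    """Index of the first point in `points` that is nearest to `p`."""
    distances = [distance(p, q) for q in points]
    return distances.index(min(distances))


def init_centers(points, n_clusters, rng=random):
    """Pick `n_clusters` distinct input points at random as initial centers."""
    if len(set(points)) < n_clusters:
        raise ValueError("not enough distinct points for the requested clusters")
    cluster_centers_ = []
    while len(cluster_centers_) < n_clusters:
        candidate = rng.choice(points)
        # Assumption of this sketch: skip points already used as a center.
        if candidate not in cluster_centers_:
            cluster_centers_.append(candidate)
    return cluster_centers_


# First pass: seed the centers, then tie every point to its closest one.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers = init_centers(points, 2, random.Random(0))
labels = [closest(p, centers) for p in points]
```

<p>Passing a seeded <i>random.Random</i> instance makes the initialization reproducible, which helps when comparing runs.</p>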
<figure id="4aec">
<div>
<div>
<iframe class="gist-iframe" src="/gist/linkerzx/99ba2921795264fb3336fa5c12d196c7.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</iframe>
</div>
</div>
</figure><p id="2628">For every point in the data set, a first-pass initialization assigns the point to a cluster. The loop then does the following:</p><ul><li><b>Assign each point to its closest cluster mean</b> (<i>lcp</i>)</li><li><b>Recalculate the mean of each cluster</b> (<i>cluster_centers_</i>). In the above function, the mean is calculated using <a href="https://docs.python.org/3.0/tutorial/datastructures.html#list-comprehensions">list comprehension</a> filtering and the <a href="https://docs.python.org/3.0/library/operator.html">operators</a> of the <i>point class</i>.</li><li><b>Break if the stop condition is met</b>, otherwise go for another round of the loop. The stop condition used here is either reaching 100 iterations or the means remaining constant after at least one iteration.</li></ul>
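<p>The three steps listed above can be sketched as a single loop. The version below is a simplified stand-in that works on <i>(x, y)</i> tuples; the function names, the <i>max_iter</i> parameter and the handling of empty clusters are assumptions of this sketch and do not come from the gist.</p>

```python
import math


def closest(p, centers):
    """Index of the center nearest to point `p` (first one on ties)."""
    return min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))


def mean(points):
    """Component-wise mean of a non-empty list of (x, y) tuples."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def iterate(points, centers, max_iter=100):
    """Alternate assignment and mean update until the stop condition is met."""
    labels = []
    for _ in range(max_iter):
        # 1. Assign each point to its closest cluster mean.
        labels = [closest(p, centers) for p in points]

        # 2. Recalculate the mean of each cluster with a filtering comprehension.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(mean(members) if members else old)

        # 3. Stop once the means no longer change; the loop bound caps the
        #    number of rounds at max_iter otherwise.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


final_centers, final_labels = iterate(
    [(0, 0), (0, 2), (10, 10), (10, 12)],
    [(0, 0), (10, 10)],
)
```

<p>Because the convergence test runs only after an assignment and an update have happened, the loop always performs at least one full iteration before it can stop.</p>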
<figure id="9993">
<div>
<div>
<iframe class="gist-iframe" src="/gist/linkerzx/ee36b8572045b6a3019d38d804987f52.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</iframe>
</div>
</div>
</figure><p id="74d9">Putting everything together gives the fit method above.</p><h2 id="8a21">Predict Method</h2>
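<p>In rough terms, a predict step only needs the fitted centers and the same nearest-center lookup that fit uses. The sketch below assumes points are <i>(x, y)</i> tuples and returns <i>(point, cluster index)</i> pairs; the function signature is an assumption of this example and may differ from the gist that follows.</p>

```python
import math


def predict(points, cluster_centers_):
    """Return (point, cluster_index) pairs using already fitted centers."""
    return [
        (
            p,
            # Index of the nearest center; ties go to the lowest index.
            min(
                range(len(cluster_centers_)),
                key=lambda i: math.dist(p, cluster_centers_[i]),
            ),
        )
        for p in points
    ]


assigned = predict([(1, 1), (9, 9)], [(0, 0), (10, 10)])
```

<p>Unlike fit, this step never moves the centers: it only reads them.</p>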
<figure id="9225">
<div>
<div>
<iframe class="gist-iframe" src="/gist/linkerzx/8467181cba143cd7ec14c2481695dc0a.js" allowfullscreen="" frameborder="0" height="undefined" width="undefined">
</iframe>
</div>
</div>
</figure><p id="7eb2">The predict function reuses part of the code that was used during the fit/training process. It converts each point in a list of points into a cluster point assigned to its nearest cluster mean.</p><h2 id="c227">Wrapping up</h2><figure id="e09e"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*bvb-3YhFBu6dVxwM5kXHnQ.png"><figcaption></figcaption></figure><p id="c004">Setting up K-Means as described above allows us to run the <b><i>fit</i></b> method on a list of points and compute the different cluster centers.</p><figure id="b8a2"><img src="https://cdn-images-1.readmedium.com/v2/resize:fit:800/1*kcEZujSMYfVsnQrRAZ8hzA.png"><figcaption></figcaption></figure><p id="92c6">When provided with a list of points, the predict API returns a list of cluster points, each assigned to its closest cluster mean.</p><p id="0e05">We were able to create an implementation of k-means in vanilla Python, modeled after the sklearn API. While it is unoptimized and may not match the performance of an implementation relying on numpy or pandas, it helps in understanding the different steps required to implement the K-Means algorithm.</p><p id="429c">More from me on <a href="https://medium.com/analytics-and-data">Hacking Analytics</a>:</p><ul><li><a href="https://readmedium.com/on-the-evolution-of-data-engineering-c5e56d273e37">On the evolution of Data Engineering</a></li><li><a href="https://readmedium.com/experimental-design-how-to-avoid-blowing-everything-up-c7ec93ad8cc8">Experimental Design and How to Avoid blowing everything up</a></li><li><a href="https://readmedium.com/hacking-up-a-reporting-pipeline-using-python-and-excel-2ffc8be044c">Hacking up a reporting pipeline using Python and Excel</a></li><li><a href="https://readmedium.com/what-is-a-data-layer-what-is-a-tag-management-system-f717dacb1216">What is a data layer? what is a tag management system?</a></li><li><a href="https://readmedium.com/cookies-tracking-and-pixels-where-does-your-web-data-comes-from-ff5d9b8bc8f7">Cookies, tracking and pixels, where does your web data comes from?</a></li></ul></article></body>