Esteban Thilliez



Data Science with Python — Introduction to Scikit-Learn

Photo by Markus Winkler on Unsplash

After writing about NumPy, Matplotlib, and Pandas, it’s time to write about Scikit-Learn, one of the most important libraries related to data science in Python.

I’ll introduce you to Scikit-learn and explain the basics.

What is Scikit-Learn?

Scikit-learn, or sklearn for short, is an open-source machine learning library that provides a wide range of tools and algorithms for data analysis and modeling. It is built on top of other popular Python libraries such as NumPy and SciPy, works well with pandas dataframes, and can be used for both supervised and unsupervised learning tasks.

One of the key advantages of scikit-learn is its consistent and easy-to-use interface. The library follows a simple and consistent structure, making it easy for users to quickly learn and apply new techniques. Additionally, scikit-learn provides built-in functions for model evaluation and selection, which can save you a lot of time when working on a data science project.

With scikit-learn, you don’t have to implement the algorithms yourself, since the library already provides them. For example, you can use a Perceptron or a random forest in a few lines.
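To make the shared interface concrete, here is a minimal sketch that trains the two models mentioned above on the same toy data. The dataset comes from `make_classification`, and the sample size, `n_estimators`, and `random_state` values are arbitrary choices for illustration, not something taken from this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Perceptron

# Synthetic toy data (assumption: 200 samples, 4 features, 2 classes)
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

models = {
    "Perceptron": Perceptron(random_state=0),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                       # same call for every estimator
    predictions[name] = model.predict(X)  # same call for every estimator
    print(name, "training accuracy:", model.score(X, y))
```

Both models are trained and queried with exactly the same method calls, which is what makes it cheap to try different algorithms on one problem.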

Supervised Learning with Scikit-Learn

Supervised learning is the process of using labeled data to train a model that can make predictions on new, unseen data.

One of the most commonly used supervised learning algorithms in scikit-learn is linear regression, which is used to predict a continuous target variable based on one or more input variables.

Linear regression can be used to model relationships between variables such as the relationship between temperature and ice cream sales or the relationship between years of experience and salary.

Let’s try to do this. We’ll start by creating a sample dataset:

import pandas as pd


data = {'Temperature': [25, 27, 28, 29, 30, 31, 32, 33, 35, 40],
        'Ice Cream Sales': [100, 120, 130, 140, 160, 170, 180, 190, 200, 210]
       }
df = pd.DataFrame(data)

Now we can implement our linear regression easily using scikit-learn:

# Importing the LinearRegression class
from sklearn.linear_model import LinearRegression

# Setting the input variable (Temperature) and the output variable (Ice Cream Sales)
X = df[['Temperature']] # input variable
y = df['Ice Cream Sales'] # output variable

# Creating the model
reg = LinearRegression()

# Fitting the model to the data
reg.fit(X, y)

# Predicting the output for a new input
# (a DataFrame with the same column name as the training data avoids a feature-name warning)
new_input = pd.DataFrame({'Temperature': [20]})
predicted_output = reg.predict(new_input)

print("Predicted output:", predicted_output)

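The output we get is about 72.26, which is consistent with the data: 20 is below every temperature in the dataset, and the predicted sales are below every recorded sales figure.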
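To see where that number comes from, note that a fitted `LinearRegression` exposes the learned slope and intercept through its `coef_` and `intercept_` attributes. The sketch below rebuilds the same dataset so that it runs on its own, then recomputes the prediction by hand; the variable names are only illustrative.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({
    'Temperature': [25, 27, 28, 29, 30, 31, 32, 33, 35, 40],
    'Ice Cream Sales': [100, 120, 130, 140, 160, 170, 180, 190, 200, 210],
})

reg = LinearRegression().fit(df[['Temperature']], df['Ice Cream Sales'])

slope = reg.coef_[0]        # extra sales per additional degree
intercept = reg.intercept_  # predicted sales at a temperature of 0

# A linear model predicts: sales = slope * temperature + intercept
manual_prediction = slope * 20 + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={manual_prediction:.2f}")
```

The hand-computed value matches what `reg.predict` returns for a temperature of 20, because prediction with a linear model is nothing more than this formula.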

As you can see, it’s easy to implement supervised machine learning models with scikit-learn:

  1. We get a dataset.
  2. We create our training variables.
  3. We create our model.
  4. We train it.
  5. We can then predict values.

Well, the example is extremely simplified. It’s never that easy, but as beginners, we can be pretty proud of our first model!

We could use other algorithms, such as decision trees, or logistic regression when the target is a category rather than a number.

If we want to use another algorithm, the only line to change is:

reg = LinearRegression()

And instead, we would write:

from sklearn.tree import DecisionTreeRegressor

reg = DecisionTreeRegressor()
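Because every regressor exposes the same `fit` and `predict` methods, trying several of them can be a simple loop. This is a rough sketch on the ice cream data from above; `random_state=0` and the test temperature of 30 are arbitrary choices made for the illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

df = pd.DataFrame({
    'Temperature': [25, 27, 28, 29, 30, 31, 32, 33, 35, 40],
    'Ice Cream Sales': [100, 120, 130, 140, 160, 170, 180, 190, 200, 210],
})
X = df[['Temperature']]
y = df['Ice Cream Sales']
new_input = pd.DataFrame({'Temperature': [30]})

# Fit each model on the same data and store its prediction by class name
results = {}
for reg in (LinearRegression(), DecisionTreeRegressor(random_state=0)):
    reg.fit(X, y)
    results[type(reg).__name__] = reg.predict(new_input)[0]

for name, value in results.items():
    print(f"{name}: {value:.2f}")
```

An unconstrained decision tree memorizes the training points, so for a temperature it has already seen it returns the recorded sales figure, while the linear model returns a value on the fitted line.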

Unsupervised Learning with Scikit-Learn

Unsupervised learning is a type of machine learning where the goal is to discover patterns or relationships in data without the use of labeled data. Scikit-learn provides a wide range of tools and algorithms for unsupervised learning, making it a powerful library for data science.

One of the most common unsupervised learning problems is clustering, which groups similar data points together. Clustering can be used for various applications such as customer segmentation, image segmentation, and anomaly detection. Two widely used clustering algorithms in scikit-learn are K-Means and DBSCAN.
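As a quick illustration of density-based clustering, here is a minimal DBSCAN sketch. The data comes from `make_moons`, and the `eps` and `min_samples` values are assumptions picked for this toy dataset; they usually need tuning on real data.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-circles (assumption: toy data, not from this article)
X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

# eps and min_samples are illustrative values
dbscan = DBSCAN(eps=0.3, min_samples=5)
labels = dbscan.fit_predict(X)

# DBSCAN marks noise points with the label -1
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Clusters found:", n_clusters)
print("Noise points:", int((labels == -1).sum()))
```

Unlike K-Means, DBSCAN does not need the number of clusters up front; it infers it from how densely the points are packed.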

Another popular unsupervised learning problem is dimensionality reduction, which reduces the number of features in a dataset while preserving as much information as possible. Dimensionality reduction can be used for various applications such as visualization, feature extraction, and noise reduction. Two widely used dimensionality reduction algorithms in scikit-learn are PCA and t-SNE. LDA (linear discriminant analysis) also reduces dimensionality, but it relies on class labels, so it is a supervised technique.
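Here is a small sketch of dimensionality reduction with PCA. It uses the iris dataset bundled with scikit-learn, and the choice of two components is an assumption made so the result could be drawn on a 2D plot.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 150 samples, 4 features

# Project the 4 original features onto 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print("Explained variance ratio:", pca.explained_variance_ratio_)
```

The `explained_variance_ratio_` attribute reports how much of the original variance each component keeps, which helps when deciding how many components to retain.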

Let’s try to group a company’s customers into segments based on their age, annual income, and spending score. First, we need a dataset:

import pandas as pd
import random


ages = [random.randint(18, 65) for i in range(100)]
genders = [random.choice(['m', 'f']) for i in range(100)]
annual_incomes = [random.randint(10000, 50000) if age < 30 else random.randint(30000, 100000) for age in ages]
spending_scores = [random.randint(1, 30) if annual_income < 30000 else random.randint(30, 100) for annual_income in annual_incomes]

data = {
    'age': ages,
    'gender': genders,
    'annual_income': annual_incomes,
    'spending_score': spending_scores
}

df = pd.DataFrame(data)

Now we can create clusters:

from sklearn.cluster import KMeans

# Creating the model
kmeans = KMeans(n_clusters=4)

# Fitting the model to the data
kmeans.fit(df[['age', 'annual_income', 'spending_score']])

# Adding a column to the dataframe to show the cluster each customer belongs to
df['cluster'] = kmeans.predict(df[['age', 'annual_income', 'spending_score']])

print(df)

First, we create an instance of the KMeans class, passing the number of clusters we want to use (4) as a parameter. We then fit the model to the data using the .fit() method, selecting the features ‘age’, ‘annual_income’, ‘spending_score’ from the dataframe.

Once the model is trained, we can use it to predict the cluster for each customer, and the result is added to the dataframe as a new column ‘cluster’. Finally, we print the dataframe to see the cluster each customer belongs to.
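One caveat with this example: K-Means relies on Euclidean distances, and `annual_income` is measured in tens of thousands while `age` and `spending_score` stay at or below 100, so income will largely decide the clusters. A common remedy is to standardize the features first. The sketch below assumes that approach, using `StandardScaler` in a pipeline; the fixed seed, `n_init=10`, and `random_state=0` are arbitrary additions for reproducibility, and the `gender` column is left out because it is not used for clustering.

```python
import random

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

random.seed(0)  # assumption: fixed seed so the sketch is repeatable

ages = [random.randint(18, 65) for _ in range(100)]
annual_incomes = [random.randint(10000, 50000) if age < 30 else random.randint(30000, 100000) for age in ages]
spending_scores = [random.randint(1, 30) if income < 30000 else random.randint(30, 100) for income in annual_incomes]
df = pd.DataFrame({'age': ages, 'annual_income': annual_incomes, 'spending_score': spending_scores})

features = ['age', 'annual_income', 'spending_score']

# Standardize each feature to mean 0 and unit variance, then cluster
pipeline = make_pipeline(StandardScaler(), KMeans(n_clusters=4, n_init=10, random_state=0))
df['cluster'] = pipeline.fit_predict(df[features])

print(df['cluster'].value_counts())
```

With scaling, all three features contribute on a comparable footing, so the resulting segments are less likely to be simple income bands.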

We can now describe the clusters:

for cluster in df['cluster'].unique():
    print(f"Cluster {cluster}")
    print(f"Age: {df[df['cluster'] == cluster]['age'].mean()}")
    print(f"Annual Income: {df[df['cluster'] == cluster]['annual_income'].mean()}")
    print(f"Spending Score: {df[df['cluster'] == cluster]['spending_score'].mean()}")
    print()


Cluster 1
Age: 27.96153846153846
Annual Income: 28124.53846153846
Spending Score: 48.19230769230769

Cluster 3
Age: 42.53846153846154
Annual Income: 48615.730769230766
Spending Score: 58.46153846153846

Cluster 0
Age: 45.65217391304348
Annual Income: 70754.95652173914
Spending Score: 66.04347826086956

Cluster 2
Age: 51.36
Annual Income: 91312.92
Spending Score: 73.4
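The same per-cluster averages can also be computed in one step with pandas `groupby`, which returns the clusters sorted by label. The tiny dataframe below is made-up data, only there to keep the sketch self-contained.

```python
import pandas as pd

# Made-up data standing in for the clustered customer dataframe
df = pd.DataFrame({
    'age': [25, 30, 50, 60],
    'annual_income': [20000, 30000, 80000, 90000],
    'spending_score': [20, 40, 70, 80],
    'cluster': [1, 1, 0, 0],
})

# One row per cluster, one column per averaged feature
summary = df.groupby('cluster')[['age', 'annual_income', 'spending_score']].mean()
print(summary)
```

This produces a single table instead of a series of print statements, which is easier to compare across clusters.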

Using Matplotlib and Seaborn, we can plot the clusters:

import matplotlib.pyplot as plt
import seaborn as sns

# Plotting the clusters
plt.figure(figsize=(10, 8))
sns.scatterplot(x='annual_income', y='spending_score', data=df, hue='cluster', palette='Set1')
plt.title('Clusters of Customers')

plt.show()

Final Note

Well, that’s a lot of new material, but don’t worry: I’ll come back to each topic in detail later!

In the next article, I’ll talk about model evaluation and selection. We’ll see how we can improve the performance of our models. Be sure to follow me if you don’t want to miss this article!

To explore more of my Python stories, see https://readmedium.com/tech-aa824bad0d67. You can also access all my content on this page: https://readmedium.com/about-me-d63607c8c341.

If you want to be notified every time I publish a new story, subscribe via email here: https://medium.com/subscribe/@estebanthi

If you’re not subscribed to Medium yet and wish to support me or get access to all my stories, you can use my link: https://medium.com/@estebanthi/membership
