Data Science with Python — Introduction to Scikit-Learn
After writing about NumPy, Matplotlib, and Pandas, it’s time to cover Scikit-Learn, one of the most important data science libraries in Python.
I’ll introduce you to Scikit-learn and explain the basics.
What is Scikit-Learn?
Scikit-learn, or sklearn for short, is an open-source machine learning library that provides a wide range of tools and algorithms for data analysis and modeling. It is built on top of other popular Python libraries such as NumPy and SciPy, and can be used for both supervised and unsupervised learning tasks.
One of the key advantages of scikit-learn is its consistent and easy-to-use interface. The library follows a simple and consistent structure, making it easy for users to quickly learn and apply new techniques. Additionally, scikit-learn provides built-in functions for model evaluation and selection, which can save you a lot of time when working on a data science project.
With Scikit-learn, you don’t have to implement the algorithms yourself, as they are already included in the library. You can easily use perceptrons or random forests, for example.
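Every estimator follows the same fit/predict pattern, so trying one of these models takes only a few lines. Here is a minimal sketch (the tiny dataset below is made up purely for illustration):

from sklearn.linear_model import Perceptron
from sklearn.ensemble import RandomForestClassifier

# A tiny made-up dataset: two features, binary labels
X = [[0, 0], [1, 1], [2, 2], [3, 3]]
y = [0, 0, 1, 1]

# Both models share the exact same fit/predict interface
for model in (Perceptron(), RandomForestClassifier()):
    model.fit(X, y)
    print(model.__class__.__name__, model.predict([[1.5, 1.5]]))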
Supervised Learning with Scikit-Learn
Supervised learning is the process of using labeled data to train a model that can make predictions on new, unseen data.
One of the most commonly used supervised learning algorithms in scikit-learn is linear regression, which is used to predict a continuous target variable based on one or more input variables.
Linear regression can be used to model relationships between variables such as the relationship between temperature and ice cream sales or the relationship between years of experience and salary.
Let’s try to do this. We’ll start by creating a sample dataset:
import pandas as pd
data = {
    'Temperature': [25, 27, 28, 29, 30, 31, 32, 33, 35, 40],
    'Ice Cream Sales': [100, 120, 130, 140, 160, 170, 180, 190, 200, 210]
}
df = pd.DataFrame(data)
Now we can implement our linear regression easily using scikit-learn:
# Importing the LinearRegression class
from sklearn.linear_model import LinearRegression
# Setting the input variable (Temperature) and the output variable (Ice Cream Sales)
X = df[['Temperature']] # input variable
y = df['Ice Cream Sales'] # output variable
# Creating the model
reg = LinearRegression()
# Fitting the model to the data
reg.fit(X, y)
# Predicting the output for a new input
# (a DataFrame keeps the 'Temperature' feature name the model was trained with)
new_input = pd.DataFrame({'Temperature': [20]})
predicted_output = reg.predict(new_input)
print("Predicted output:", predicted_output)
The output we get is about 72.26, which is plausible: 20 degrees is colder than anything in our training data, so the model predicts lower sales.
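We can also inspect what the model learned. For linear regression, the coef_ attribute holds the slope and intercept_ the offset; on our dataset they come out to roughly 7.98 and -87.26, so the fitted line is sales ≈ 7.98 * temperature - 87.26:

# The fitted line is: sales ≈ slope * temperature + intercept
print("Slope:", reg.coef_[0])        # about 7.98 extra sales per degree
print("Intercept:", reg.intercept_)  # about -87.26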
As you can see, it’s easy to implement supervised machine learning models with scikit-learn:
- We get a dataset.
- We define our input and output variables.
- We create our model.
- We train it.
- We can then predict values.
Well, the example is extremely simplified. It’s never that easy, but as beginners, we can be pretty proud of our first model!
We could use other algorithms, such as decision trees, or logistic regression for classification problems.
If we want to use another algorithm, the only line to change is:
reg = LinearRegression()
And instead, we would write:
from sklearn.tree import DecisionTreeRegressor
reg = DecisionTreeRegressor()
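Because every regressor in scikit-learn shares the same interface, everything else stays identical. A quick sketch with our ice cream data (the decision tree is left untuned, just to show the swap):

from sklearn.tree import DecisionTreeRegressor

# Same data, same workflow, different algorithm
tree_reg = DecisionTreeRegressor()
tree_reg.fit(X, y)
print(tree_reg.predict(pd.DataFrame({'Temperature': [20]})))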
Unsupervised Learning with Scikit-Learn
Unsupervised learning is a type of machine learning where the goal is to discover patterns or relationships in data without the use of labeled data. Scikit-learn provides a wide range of tools and algorithms for unsupervised learning, making it a powerful library for data science.
One of the most common unsupervised learning problems is clustering, which is used to group similar data points together. Clustering can be used for various applications such as customer segmentation, image segmentation, and anomaly detection. The most popular clustering algorithms in scikit-learn are K-Means and DBSCAN.
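We’ll walk through K-Means in a moment; DBSCAN follows the exact same pattern. A minimal sketch on made-up 2D points (the eps and min_samples values are chosen arbitrarily for illustration):

from sklearn.cluster import DBSCAN

# Two well-separated groups of made-up points
points = [[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]]

# Points within distance eps of each other get linked into a cluster
labels = DBSCAN(eps=2, min_samples=2).fit_predict(points)
print(labels)  # expected: [0 0 0 1 1 1]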
Another popular unsupervised learning problem is dimensionality reduction, which is used to reduce the number of features in a dataset while preserving as much information as possible. Dimensionality reduction can be used for various applications such as visualization, feature selection, and noise reduction. The most popular dimensionality reduction algorithms in scikit-learn are PCA and t-SNE (LDA is also common, although it needs labeled data).
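As a quick taste of dimensionality reduction, here is a minimal PCA sketch (the random data is made up just to demonstrate the API):

import numpy as np
from sklearn.decomposition import PCA

# 100 made-up samples with 5 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))

# Project down to 2 dimensions, keeping as much variance as possible
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_demo)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept per component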
Let’s try to segment a company’s customers based on their age, income, and spending score. First, we need a dataset:
import pandas as pd
import random
ages = [random.randint(18, 65) for i in range(100)]
genders = [random.choice(['m', 'f']) for i in range(100)]
annual_incomes = [random.randint(10000, 50000) if age < 30 else random.randint(30000, 100000) for age in ages]
spending_scores = [random.randint(1, 30) if annual_income < 30000 else random.randint(30, 100) for annual_income in annual_incomes]
data = {
    'age': ages,
    'gender': genders,
    'annual_income': annual_incomes,
    'spending_score': spending_scores
}
df = pd.DataFrame(data)
Now we can create clusters:
from sklearn.cluster import KMeans
# Creating the model
kmeans = KMeans(n_clusters=4)
# Fitting the model to the data
kmeans.fit(df[['age', 'annual_income', 'spending_score']])
# Adding a column to the dataframe to show the cluster each customer belongs to
df['cluster'] = kmeans.predict(df[['age', 'annual_income', 'spending_score']])
print(df)
First, we create an instance of the KMeans class, passing the number of clusters we want to use (4) as a parameter. We then fit the model to the data using the .fit() method, selecting the features ‘age’, ‘annual_income’, ‘spending_score’ from the dataframe.
Once the model is trained, we can use it to predict the cluster for each customer, and the result is added to the dataframe as a new column ‘cluster’. Finally, we print the dataframe to see the cluster each customer belongs to.
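One caveat worth knowing: our three features live on very different scales (annual income is in the tens of thousands, while age and spending score are not), so income ends up dominating the distances K-Means relies on. In practice you would usually standardize the features first; here is a minimal sketch (the random_state value is just our choice, to make the run repeatable):

from sklearn.preprocessing import StandardScaler

# Rescale each feature to zero mean and unit variance
scaled = StandardScaler().fit_transform(df[['age', 'annual_income', 'spending_score']])

# Cluster on the scaled features instead of the raw ones
kmeans_scaled = KMeans(n_clusters=4, n_init=10, random_state=42)
df['cluster_scaled'] = kmeans_scaled.fit_predict(scaled)

Below we keep working with the unscaled clusters, so the numbers match what we just computed.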
We can now describe the clusters (since the dataset is generated randomly, your exact numbers will differ from the ones below):
for cluster in df['cluster'].unique():
    print(f"Cluster {cluster}")
    print(f"Age: {df[df['cluster'] == cluster]['age'].mean()}")
    print(f"Annual Income: {df[df['cluster'] == cluster]['annual_income'].mean()}")
    print(f"Spending Score: {df[df['cluster'] == cluster]['spending_score'].mean()}")
    print()
Cluster 1
Age: 27.96153846153846
Annual Income: 28124.53846153846
Spending Score: 48.19230769230769
Cluster 3
Age: 42.53846153846154
Annual Income: 48615.730769230766
Spending Score: 58.46153846153846
Cluster 0
Age: 45.65217391304348
Annual Income: 70754.95652173914
Spending Score: 66.04347826086956
Cluster 2
Age: 51.36
Annual Income: 91312.92
Spending Score: 73.4
Using Matplotlib and Seaborn, we can plot the clusters:
import matplotlib.pyplot as plt
import seaborn as sns
# Plotting the clusters
plt.figure(figsize=(10, 8))
sns.scatterplot(x='annual_income', y='spending_score', data=df, hue='cluster', palette='Set1')
plt.title('Clusters of Customers')
plt.show()

Final Note
Well, that’s a lot of new material, but don’t worry, I’ll come back to each topic in detail later!
In the next article, I’ll talk about model evaluation and selection. We’ll see how we can improve the performance of our models. Be sure to follow me if you don’t want to miss this article!