ML Tutorial 2 — Understanding Supervised and Unsupervised Learning

Learn the difference between supervised and unsupervised learning methods.

Table of Contents

1. Introduction
2. What is Supervised Learning?
3. What is Unsupervised Learning?
4. How to Choose Between Supervised and Unsupervised Learning?
5. Examples of Supervised and Unsupervised Learning Applications
6. Conclusion

1. Introduction

In this blog, you will learn the difference between supervised and unsupervised learning methods, which are two of the most common types of machine learning techniques. You will also learn how to choose between them depending on your data and problem, and see some examples of their applications in real-world scenarios.

Machine learning is a branch of artificial intelligence that enables computers to learn from data and make predictions or decisions without being explicitly programmed. Machine learning can be used for various tasks, such as image recognition, natural language processing, recommendation systems, fraud detection, and more.

However, not all machine learning problems are the same, and different types of learning methods may be more suitable for different situations. In this blog, we will focus on two main types of learning methods: supervised and unsupervised learning.

Supervised learning is a type of machine learning where the computer learns from labeled data, that is, data that has a known outcome or target variable. For example, if you want to train a machine learning model to classify images of animals, you need to provide the model with a set of images that have labels, such as “cat”, “dog”, “bird”, etc. The model then learns to associate the features of the images with the labels, and can predict the label of a new image that it has not seen before.

Unsupervised learning is a type of machine learning where the computer learns from unlabeled data, that is, data that does not have a known outcome or target variable. For example, if you want to train a machine learning model to cluster customers based on their preferences, you do not need to provide the model with any labels, such as “loyal”, “casual”, “new”, etc. The model then learns to find patterns and similarities in the data, and can group the customers into different clusters based on their characteristics.

As you can see, supervised and unsupervised learning have different goals and approaches, and they can be used for different types of problems. In the next sections, we will explore these two types of learning methods in more detail, and see how to choose between them and apply them in practice.

2. What is Supervised Learning?

There are two main types of supervised learning problems: regression and classification. Regression problems involve predicting a continuous numerical value, such as the price of a house, the height of a person, or the temperature of a city. Classification problems involve predicting a discrete categorical value, such as the type of an animal, the sentiment of a text, or the genre of a movie.
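
The difference between the two problem types can be seen in a minimal sketch. The data below is made up for illustration (hours studied, an exam score, and a pass/fail label are all assumptions, not part of any real dataset); the same input feature is used once with a continuous target and once with a discrete one.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Hypothetical toy data: one feature (hours studied) for six students
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])

# Regression target: a continuous exam score
y_score = np.array([52.0, 58.0, 61.0, 70.0, 74.0, 81.0])
# Classification target: a discrete label, fail (0) or pass (1)
y_pass = np.array([0, 0, 0, 1, 1, 1])

regressor = LinearRegression().fit(X, y_score)
classifier = LogisticRegression().fit(X, y_pass)

new_student = np.array([[4.5]])
predicted_score = regressor.predict(new_student)[0]   # a real number
predicted_label = classifier.predict(new_student)[0]  # one of the classes 0 or 1

print(f"Predicted score: {predicted_score:.1f}")
print(f"Predicted class: {predicted_label}")
```

The regressor returns a number on a continuous scale, while the classifier returns one of the labels it saw during training.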

To solve a supervised learning problem, you need to follow these steps:

Collect and preprocess the data. This involves gathering the data from various sources, cleaning and formatting the data, and splitting the data into two sets: a training set and a test set. The training set is used to train the model, and the test set is used to evaluate the model.
Choose a suitable algorithm. This involves selecting a machine learning algorithm that can learn from the data and make predictions. There are many algorithms to choose from, such as linear regression, logistic regression, decision trees, k-nearest neighbors, support vector machines, neural networks, and more. Each algorithm has its own advantages and disadvantages, and you need to consider factors such as the size and complexity of the data, the type and number of features, the speed and accuracy of the algorithm, and the interpretability and generalizability of the model.
Train the model. This involves feeding the training data to the algorithm and letting it learn the patterns and relationships in the data. The algorithm uses a mathematical function, called a hypothesis, to map the input features to the output target. The algorithm also uses a cost function, which measures the difference between the predicted and actual values, and an optimization technique, which minimizes the cost function and improves the model’s performance.
Test and evaluate the model. This involves using the test data to measure how well the model can make predictions on new data that it has not seen before. The evaluation metrics depend on the type of problem. For regression problems, common metrics are mean absolute error, mean squared error, and root mean squared error. For classification problems, common metrics are accuracy, precision, recall, and F1 score.
Improve the model. This involves tuning the model’s hyperparameters to optimize its performance. Unlike the model’s parameters, hyperparameters are not learned by the algorithm; they are set by the user before training the model. Examples of hyperparameters are the learning rate, the number of iterations, the number of hidden layers, and the regularization strength. You can use grid search or random search, typically combined with cross-validation on the training data, to find a good combination of hyperparameters for your model. Keep the test set out of this tuning loop so that the final evaluation remains an honest estimate of performance on unseen data.
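
The steps above can be sketched end to end with scikit-learn. This is only an illustration under stated assumptions: the data is synthetic (generated with make_regression), ridge regression is an arbitrary choice of algorithm, and the grid of alpha values is picked for demonstration rather than recommended.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split

# Collect and preprocess: synthetic data stands in for a real dataset
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Choose, train, and improve: ridge regression, with the regularization
# strength (alpha) tuned by 5-fold cross-validation on the training set only
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)

# Test and evaluate: score the tuned model once on the held-out test set
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print("Best alpha:", search.best_params_["alpha"])
print(f"Test RMSE: {rmse:.2f}")
```

Tuning happens inside the cross-validation loop, so the test set is touched only once, at the very end.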

To illustrate the supervised learning process, let’s look at an example of a regression problem. Suppose you want to train a machine learning model to predict the salary of a person based on their years of experience. You have a dataset of 30 employees, with their years of experience and salary as the input features and output target, respectively. You can use the Python programming language and the scikit-learn library to solve this problem.

First, you need to import the necessary modules and load the data. You can use the pandas library to read the data from a CSV file and store it in a dataframe. You can also use the matplotlib library to plot the data and visualize the relationship between the features and the target.

    # Import the modules
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error

    # Load the data
    df = pd.read_csv("salary_data.csv")
    print(df.head())

    # Plot the data
    plt.scatter(df["YearsExperience"], df["Salary"])
    plt.xlabel("Years of Experience")
    plt.ylabel("Salary")
    plt.show()

The output of the code is as follows:

       YearsExperience   Salary
    0              1.1  39343.0
    1              1.3  46205.0
    2              1.5  37731.0
    3              2.0  43525.0
    4              2.2  39891.0

The plot of the data shows a positive, roughly linear relationship between the years of experience and the salary: as the years of experience increase, the salary tends to increase as well, although individual points deviate from the trend.
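
The snippet above imports LinearRegression and mean_squared_error but stops after plotting. A possible continuation is sketched below. Because salary_data.csv is not included here, the sketch builds a synthetic stand-in with the same column names; the generated values, the 80/20 split, and the random seeds are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for salary_data.csv (made-up values, not the real file)
rng = np.random.default_rng(42)
years = np.round(rng.uniform(1.0, 10.5, size=30), 1)
salary = 25000 + 9500 * years + rng.normal(0, 5000, size=30)
df = pd.DataFrame({"YearsExperience": years, "Salary": salary})

# Separate the input feature from the target and hold out a test set
X = df[["YearsExperience"]]
y = df["Salary"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Train a linear regression model on the training set
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test set
y_pred = model.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(f"Slope: {model.coef_[0]:.0f} per year, intercept: {model.intercept_:.0f}")
print(f"Test RMSE: {rmse:.0f}")
```

With the real file, the synthetic block would be replaced by the pd.read_csv call shown earlier; the rest of the code would stay the same.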

3. What is Unsupervised Learning?

There are two main types of unsupervised learning problems: clustering and dimensionality reduction. Clustering problems involve grouping the data points into different clusters based on their similarities or differences. Dimensionality reduction problems involve reducing the number of features or dimensions of the data, while preserving the most important information or variability in the data.
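
As a small illustration of dimensionality reduction, the sketch below applies principal component analysis to the iris dataset that ships with scikit-learn. The dataset and the choice of two components are assumptions made for the example; the labels are deliberately ignored, since the method is unsupervised.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load only the features (150 samples, 4 measurements); labels are not used
X = load_iris().data

# Standardize so that each feature contributes on a comparable scale
X_scaled = StandardScaler().fit_transform(X)

# Project the four features onto the two directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

explained = pca.explained_variance_ratio_.sum()
print("Reduced shape:", X_2d.shape)
print(f"Variance retained by 2 components: {explained:.1%}")
```

The explained variance ratio indicates how much of the original variability survives the reduction, which is one way to judge whether two dimensions are enough.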

To solve an unsupervised learning problem, you need to follow these steps:

Collect and preprocess the data. This involves gathering the data from various sources, cleaning and formatting the data, and scaling or normalizing the data if necessary. Scaling or normalizing the data means transforming the data to have a common range or distribution, which can improve the performance of some algorithms.
Choose a suitable algorithm. This involves selecting a machine learning algorithm that can find the structure and patterns in the data. There are many algorithms to choose from, such as k-means, hierarchical clustering, Gaussian mixture models, principal component analysis, singular value decomposition, autoencoders, and more. Each algorithm has its own advantages and disadvantages, and you need to consider factors such as the shape and distribution of the data, the number and size of clusters or dimensions, the speed and complexity of the algorithm, and the interpretability and quality of the results.
Train the model. This involves feeding the data to the algorithm and letting it learn the structure and patterns in the data. The algorithm uses a mathematical function, called a criterion, to measure the quality of the clustering or dimensionality reduction. The algorithm also uses an optimization technique, which maximizes or minimizes the criterion and improves the model’s performance.
Test and evaluate the model. This involves using the trained model to assign cluster labels or reduce dimensions for new data that it has not seen before. The evaluation metrics depend on the type of problem. For clustering problems, common metrics are silhouette score, Davies-Bouldin index, and Calinski-Harabasz index. For dimensionality reduction problems, common metrics are explained variance, reconstruction error, and the accuracy of a downstream classifier trained on the reduced features.
Improve the model. This involves tuning the model’s hyperparameters to optimize its performance. Unlike the model’s parameters, hyperparameters are not learned by the algorithm; they are set by the user before training the model. Examples of hyperparameters are the number of clusters, the distance metric, the initialization method, and the regularization term. Because there are no labels to validate against, a common approach is to try several settings, for example with grid search or random search, and compare them using an internal metric such as the silhouette score.
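
The steps above can be sketched for a clustering problem as follows. The data is synthetic (generated with make_blobs), and the range of candidate cluster counts is an arbitrary choice for illustration; the sketch compares settings with the silhouette score because no labels are available.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Collect and preprocess: synthetic points, then scale the features
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.8, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Train, evaluate, and improve: fit k-means for several values of k and
# record the silhouette score of each clustering (higher is better)
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("Best k by silhouette score:", best_k)
```

The silhouette score is only one internal metric; in practice it is worth checking that the chosen clusters also make sense for the problem at hand.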

To illustrate the unsupervised learning process, let’s look at an example of a clustering problem. Suppose you want to train a machine learning model to segment customers based on their spending habits. You have a dataset of 200 customers, with their annual income and spending score as the input features. You can use the Python programming language and the scikit-learn library to solve this problem.

    # Import the modules
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Load the data
    df = pd.read_csv("customer_data.csv")
    print(df.head())

    # Plot the data
    plt.scatter(df["Annual Income (k$)"], df["Spending Score (1-100)"])
    plt.xlabel("Annual Income (k$)")
    plt.ylabel("Spending Score (1-100)")
    plt.show()

The output of the code is as follows:

       CustomerID  Gender  Age  Annual Income (k$)  Spending Score (1-100)
    0           1    Male   19                  15                      39
    1           2    Male   21                  15                      81
    2           3  Female   20                  16                       6
    3           4  Female   23                  16                      77
    4           5  Female   31                  17                      40

The plot of the data shows that there are some clusters of customers with different spending habits, such as low income and low spending, high income and high spending, low income and high spending, and high income and low spending.
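
The snippet above imports KMeans and silhouette_score but stops after plotting. One way to continue is sketched below. Because customer_data.csv is not included here, the sketch generates a synthetic stand-in with the same two column names; the group centres, the spread, and the choice of five clusters are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for customer_data.csv: five made-up customer groups,
# each given as (annual income in k$, spending score)
rng = np.random.default_rng(0)
centres = [(25, 20), (25, 80), (55, 50), (90, 15), (90, 85)]
points = np.vstack([rng.normal(loc=c, scale=6.0, size=(40, 2)) for c in centres])
df = pd.DataFrame(points, columns=["Annual Income (k$)", "Spending Score (1-100)"])

# Cluster on the two features; both are on similar scales here,
# so scaling is skipped in this sketch
features = df[["Annual Income (k$)", "Spending Score (1-100)"]]
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
df["Cluster"] = kmeans.fit_predict(features)

# Summarize each segment and measure how well separated the clusters are
score = silhouette_score(features, df["Cluster"])
print(df.groupby("Cluster").mean().round(1))
print(f"Silhouette score: {score:.2f}")
```

The per-cluster averages give each segment an interpretable profile, such as high income with low spending.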

4. How to Choose Between Supervised and Unsupervised Learning?

Choosing between supervised and unsupervised learning depends on the type and availability of the data, and the goal and complexity of the problem. Here are some factors to consider when deciding which type of learning method to use:

The data: Supervised learning requires labeled data, that is, data that has a known outcome or target variable. Labels can come from human experts, existing databases, or other sources, but they can be costly, time-consuming, or impractical to obtain in some cases. Unsupervised learning does not require labels and can be applied to any dataset that has some underlying structure or patterns. It can also be used to preprocess or transform the data before applying supervised learning, for example by reducing the dimensionality or by deriving cluster assignments that serve as additional features.
The goal: Supervised learning has a clear and well-defined goal, which is to train a model that can make accurate predictions or decisions based on new data that it has not seen before. Supervised learning can be used for tasks such as classification, regression, or recommendation. Unsupervised learning has a more exploratory and open-ended goal, which is to discover the hidden structure and patterns in the data, and to find meaningful insights or representations of the data. Unsupervised learning can be used for tasks such as clustering, dimensionality reduction, or anomaly detection.
The complexity: Supervised learning is often more straightforward to implement and evaluate than unsupervised learning, because the labels provide a clear objective and a ground truth against which predictions can be scored. Unsupervised learning is often harder to evaluate. Algorithms such as k-means still optimize a well-defined criterion, but without labels there is no ground truth to confirm that the structure they find is meaningful, so the results usually need internal metrics and domain knowledge to interpret.

As a general rule of thumb, you can use supervised learning when you have labeled data and a specific prediction or decision problem, and you can use unsupervised learning when you have unlabeled data and a general exploration or discovery problem. However, there is no definitive answer to which type of learning method is better or worse, as it depends on the context and the trade-offs of each method. You can also combine both types of learning methods, such as using unsupervised learning to preprocess or transform the data before applying supervised learning, or using supervised learning to fine-tune or evaluate the results of unsupervised learning.
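
Combining the two approaches can be sketched with a scikit-learn pipeline. In the example below, principal component analysis (unsupervised) compresses the features before logistic regression (supervised) classifies them. The digits dataset bundled with scikit-learn, the 20 components, and the split are assumptions chosen for illustration.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 8x8 images of handwritten digits, flattened to 64 features each
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Unsupervised step (PCA) feeds a supervised step (logistic regression)
pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=20),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(X_train, y_train)

accuracy = pipeline.score(X_test, y_test)
print(f"Test accuracy with 20 components: {accuracy:.3f}")
```

Wrapping both steps in one pipeline ensures the PCA projection is learned from the training data only and then reused unchanged on the test data.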

5. Examples of Supervised and Unsupervised Learning Applications

In this section, we will look at some examples of supervised and unsupervised learning applications in real-world scenarios. These examples are not exhaustive, but they illustrate the potential and diversity of machine learning techniques.

Supervised Learning Applications

Spam Detection: Spam detection is the task of classifying email messages into spam or non-spam categories. Spam detection can be done using supervised learning, where the model is trained on a large dataset of labeled email messages, and then used to predict the label of a new email message. A common algorithm for spam detection is naive Bayes, which uses the probability of words and phrases to determine the likelihood of an email being spam or not.
Face Recognition: Face recognition is the task of identifying or verifying the identity of a person based on their face image. Face recognition can be done using supervised learning, where the model is trained on a large dataset of labeled face images, and then used to predict the identity of a new face image. A common algorithm for face recognition is the convolutional neural network, which uses multiple layers of filters and neurons to extract features and patterns from the face image.
Stock Price Prediction: Stock price prediction is the task of forecasting the future price of a stock based on historical data and other factors. It can be framed as supervised learning, where the model is trained on historical records in which each input (for example, past prices and trading volume) is paired with the price that followed, and is then used to forecast prices for dates it has not seen. A common starting point for stock price prediction is linear regression, which uses a linear function to model the relationship between the input features and the output target.
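
To make the spam detection example concrete, here is a minimal sketch of a naive Bayes text classifier. The six messages and their labels are invented for illustration, and a real filter would need far more data; the sketch only shows how word counts and labels fit together.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled messages ("ham" means not spam)
messages = [
    "win a free prize now",
    "free money claim your prize",
    "limited offer win cash now",
    "meeting moved to monday morning",
    "please review the attached report",
    "lunch on friday with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Convert each message to word counts, then fit a multinomial naive Bayes model
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(messages, labels)

# Classify two messages the model has not seen before
new_messages = ["claim your free cash prize", "team meeting on monday"]
predictions = spam_filter.predict(new_messages)
for text, label in zip(new_messages, predictions):
    print(f"{label}: {text}")
```

The first new message contains only words seen in spam examples and the second only words seen in non-spam examples, so the model labels them accordingly.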

Unsupervised Learning Applications

Customer Segmentation: Customer segmentation is the task of dividing customers into different groups based on their characteristics and behaviors. Customer segmentation can be done using unsupervised learning, where the model is trained on a large dataset of unlabeled customer data, and then used to assign cluster labels to each customer. A common algorithm for customer segmentation is k-means, which assigns each data point to the nearest cluster center and repeatedly updates the centers until the assignments stabilize. The number of clusters is not learned by k-means; it is a hyperparameter chosen by the user.
Topic Modeling: Topic modeling is the task of discovering the main topics or themes in a collection of text documents. Topic modeling can be done using unsupervised learning, where the model is trained on a large dataset of unlabeled text documents, and then used to describe each document as a mixture of topics. A common algorithm for topic modeling is latent Dirichlet allocation, which uses a probabilistic model to infer the hidden topics and their distributions in the documents.
Anomaly Detection: Anomaly detection is the task of identifying unusual or abnormal data points or events in a dataset. Anomaly detection can be done using unsupervised learning, where the model is trained on a large dataset of unlabeled data points, and then used to detect outliers or anomalies in the data. A common algorithm for anomaly detection is isolation forest, which builds random trees that repeatedly split the data and scores each point by how few splits are needed to isolate it. Points with short average path lengths are likely to be anomalies.
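
The anomaly detection example can be sketched with scikit-learn's IsolationForest. The data below is synthetic: 200 points drawn around the origin plus three hand-placed outliers. The contamination value is an assumption about the expected share of anomalies, not a recommended setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense cloud of normal points plus three distant outliers
rng = np.random.default_rng(0)
normal_points = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
X = np.vstack([normal_points, outliers])

# Fit the forest; contamination is the assumed fraction of anomalies
detector = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
flags = detector.fit_predict(X)      # +1 for inliers, -1 for anomalies
scores = detector.score_samples(X)   # lower scores mean more anomalous

print("Points flagged as anomalies:", int((flags == -1).sum()))
print("Scores of the three planted outliers:", scores[-3:].round(3))
print("Median score of all points:", round(float(np.median(scores)), 3))
```

No labels are passed to the detector; it relies only on how easily each point can be separated from the rest.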

6. Conclusion

In this blog, you have learned the difference between supervised and unsupervised learning methods, which are two of the most common types of machine learning techniques. You have also learned how to choose between them depending on your data and problem, and seen some examples of their applications in real-world scenarios.

Supervised learning is a type of machine learning where the computer learns from labeled data, that is, data that has a known outcome or target variable. The goal of supervised learning is to train a model that can make accurate predictions or decisions based on new data that it has not seen before. There are two main types of supervised learning problems: regression and classification. Regression problems involve predicting a continuous numerical value, such as the price of a house, the height of a person, or the temperature of a city. Classification problems involve predicting a discrete categorical value, such as the type of an animal, the sentiment of a text, or the genre of a movie.

Unsupervised learning is a type of machine learning where the computer learns from unlabeled data, that is, data that does not have a known outcome or target variable. The goal of unsupervised learning is to discover the hidden structure and patterns in the data, and to find meaningful insights or representations of the data. There are two main types of unsupervised learning problems: clustering and dimensionality reduction. Clustering problems involve grouping the data points into different clusters based on their similarities or differences. Dimensionality reduction problems involve reducing the number of features or dimensions of the data, while preserving the most important information or variability in the data.

Choosing between supervised and unsupervised learning depends on the type and availability of the data, and the goal and complexity of the problem. As a general rule of thumb, you can use supervised learning when you have labeled data and a specific prediction or decision problem, and you can use unsupervised learning when you have unlabeled data and a general exploration or discovery problem. However, there is no definitive answer to which type of learning method is better or worse, as it depends on the context and the trade-offs of each method. You can also combine both types of learning methods, such as using unsupervised learning to preprocess or transform the data before applying supervised learning, or using supervised learning to fine-tune or evaluate the results of unsupervised learning.

We hope that this blog has helped you understand the basics of supervised and unsupervised learning, and how to apply them in practice. If you want to learn more about machine learning, you can check out some of the resources below:

Machine Learning by Stanford University: A free online course that covers the theory and practice of machine learning, including supervised and unsupervised learning, linear regression, logistic regression, neural networks, support vector machines, k-means, principal component analysis, and more.
Python Machine Learning by Sebastian Raschka and Vahid Mirjalili: A book that teaches you how to use Python and scikit-learn to implement various machine learning techniques, such as data preprocessing, feature engineering, model evaluation, hyperparameter tuning, ensemble learning, deep learning, and more.
scikit-learn: A Python library that provides a range of tools and algorithms for machine learning, such as data loading, preprocessing, clustering, dimensionality reduction, regression, classification, model selection, and more.

Thank you for reading this blog, and happy learning!
