avatarEugenia Anello

Summary

This context provides a guide on manipulating PyTorch datasets, covering the use of Dataset and Dataloader classes, and demonstrating four common operations: filtering a class from a PyTorch dataset, concatenating PyTorch datasets, converting a Pandas dataframe into a PyTorch dataset, and performing random sampling from a PyTorch dataset.

Abstract

The article is a comprehensive guide on manipulating PyTorch datasets, aimed at making PyTorch more intuitive and accessible. It begins by explaining the differences between the Dataset and Dataloader classes and their importance in managing and accessing data. The Dataset class stores samples and their corresponding targets, while the Dataloader class splits the data into batches for easier access and faster training. The article then demonstrates four common operations to manipulate datasets: filtering a class from a PyTorch dataset using the Subset class, concatenating multiple datasets using the ConcatDataset class, converting a Pandas dataframe into a PyTorch dataset using the TensorDataset class, and performing random sampling from a PyTorch dataset with and without replacement. The guide provides code examples and explanations for each operation, making it a valuable resource for anyone working with PyTorch datasets.

Bullet points

  • PyTorch Datasets and Dataloaders are essential for managing and accessing data in PyTorch.
  • The Dataset class stores samples and their corresponding targets, while the Dataloader class splits the data into batches for faster training.
  • The Subset class can be used to filter a class from a PyTorch dataset.
  • The ConcatDataset class can be used to concatenate multiple datasets.
  • The TensorDataset class can be used to convert a Pandas dataframe into a PyTorch dataset.
  • Random sampling from a PyTorch dataset can be performed with and without replacement.

Manipulating Pytorch Datasets

How to work with Dataloaders and Datasets for Deep Learning

Photo by Valdemaras D. on Unsplash

The post is the second in a series of guides to build deep learning models with Pytorch. Below, there is the full series:

Part 1: Pytorch Tutorial for Beginners

Part 2: Manipulating Pytorch Datasets (this post)

Part 3: Understand Tensor Dimensions in DL models

Part 4: CNN & Feature visualizations

Part 5: Hyperparameter tuning with Optuna

Part 6: K Fold Cross Validation

Part 7: Convolutional Autoencoder

Part 8: Denoising Autoencoder

Part 9: Variational Autoencoder

The goal of the series is to make Pytorch more intuitive and accessible as possible through examples of implementations. There are many tutorials on the Internet to use Pytorch to build many types of challenging models, but it can also be confusing at the same time because there are always slight differences when you pass from one tutorial to another. In this series, I want to start from the simplest topics to the more advanced ones.

The purpose of this post is to help you in getting familiar with the Pytorch Datasets. Before going deeper with models based on neural networks, you need to understand the general concepts and ideas of two main classes that provide data: Dataset and Dataloader. You are wondering what is difference between the Dataset and the Dataloader. The Dataset is used to store the samples and the corresponding targets. If you already worked with Pandas, it’s very similar to a standard Pandas Dataframe as structure.

Why should we also need the Dataloader? The Dataloader is important to get easy access to the Dataset and split the data into batches, where the batches are the number of samples passed in one iteration to the Machine Learning model. The more the dataset is huge, the more time is needed to train the model. Dividing the dataset into batches provides a way to speed up the training speed of the model.

In this post, I provide an overview to work with these two classes. Later, I show how to perform four common operations to manipulate your dataset:

  1. Filter class from Pytorch Dataset
  2. Concatenate Pytorch Datasets
  3. Convert Pandas Dataframe into Pytorch Dataset
  4. Random Sampling from Pytorch Dataset

Import libraries and dataset

There are two important libraries you should keep attention at:

  • torch.utils.data contains two main classes to store the data: Dataset and Dataloader
  • torchvision provides pre-loaded datasets, like MNIST and Fashion MNIST
import torch
from torch.utils.data import Dataset,DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt 
import pandas as pd
import numpy as np

Let’s now import the Fashion MNIST [1], a popular dataset that contains Zalando’s article images. Each sample is constituted by an image and the corresponding label that belongs to one of the ten classes.

To check out how many samples the training and test datasets contain, we need the well-known function len:

print(len(train_dataset)) #60000
print(len(test_dataset)) #10000

We can also get access separately to the images and the labels:

train_dataset.data
# tensor([[[0, 0, 0,  ..., 0, 0, 0],...[0, 0, 0,  ..., 0, 0, 0]]], # dtype=torch.uint8)
train_dataset.targets
# tensor([9, 0, 0,  ..., 3, 0, 5])

This output means that the first image corresponds to class 9, the successive two images to class 0 and so on [2]. We can easily verify it by displaying the first image of the training dataset:

Illustration by Author

Now, we can create a new data loader, based on the training dataset, with a batch size equal 256:

train_loader = DataLoader(dataset=train_dataset, batch_size=256, shuffle=True)

We can iterate through the DataLoader using the iter and next functions:

train_features, train_labels = next(iter(train_loader))
print(len(train_labels)) #256

We selected a batch containing 256 images, each one belonging to different classes.

1. Filter class from Pytorch Dataset

Let’s suppose that we only want the images of T-shirts from the training dataset. We’ll need the following lines of code:

We have employed:

  • np.where to yield a NumPy array containing the indexes where there are only images of t-shirts in the training dataset.
  • Subset to select a subset of the dataset, based on the indices passed.

Let’s display some images contained in train_subset:

Illustration by Author

There is a rich variety of t-shirts, don’t you think?

2. Concatenate Pytorch datasets

Let’s merge the training dataset and test dataset into a unique dataset:

The class ConcatDataset takes in a list of multiple datasets and returns a concatenation of these ones. In this way, we obtain a dataset of 7000 samples and a dataloader with 274 batches.

3. Convert Pandas dataframe into pytorch dataset

To train a neural network implemented with Pytorch, we need a Pytorch dataset. The issue is that Pytorch doesn’t provide all the datasets of the world and we need to make some efforts to convert the dataframes into the Pytorch tensor. For example, let’s import the Boston housing prices dataset from sklearn [3]:

Illustration by Author

Let’s iterate again through the dataloader to visualize what we have obtained:

train_features, train_labels = next(iter(train_loader))
print(train_features)
print(train_labels)
Illustration by Author

We can notice that a batch containing two samples is printed. Unluckily TensorDataset can have some limitations when the dataset’s structure is more complex.

An alternative and more efficient method is to create a class, which inherits the attributes and the methods of the class Dataset. Especially, we can override the methods of the class Dataset, depending on the dataset you are working with:

  • __getitem__ method is used for accessing ith sample of the dataset.
  • __len__ method returns the dataset’s number of sample.

Then, we create a new object belonging to the class Boston_Dataset , which will be passed to the data_loader function:

b_data = Boston_Dataset(df=df)
b_loader = DataLoader(dataset=b_data,batch_size=4,shuffle=True)
for x,y in b_loader:
    print(y)

4. Random Sampling from the Dataset

Sometimes it can happen that you want to select random samples from the dataset. Theoretically, there is already the class RandomSampler that does it automatically, but it works only with replacement. For this reason, I will need the following trick to select the samples WITHOUT replacement.

If you instead prefer to draw the sample WITH replacement, you can directly use the class RandomSampler

Final thoughts:

That’s it! Now you should be able to work with Pytorch datasets with less effort. It can be confusing at first, but after some practice, you’ll manage all the difficulties. Thanks for reading. Have a nice day!

References:

[1]https://pytorch.org/vision/main/generated/torchvision.datasets.FashionMNIST.html

[2] https://pytorch.org/tutorials/beginner/basics/data_tutorial.html

[3] https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html

Did you like my article? Become a member and get unlimited access to new data science posts every day! It’s an indirect way of supporting me without any extra cost to you. If you are already a member, subscribe to get emails whenever I publish new data science and python guides!

Programming
Pytorch
Machine Learning
Data
Ml So Good
Recommended from ReadMedium