This context provides a guide on manipulating PyTorch datasets, covering the use of Dataset and Dataloader classes, and demonstrating four common operations: filtering a class from a PyTorch dataset, concatenating PyTorch datasets, converting a Pandas dataframe into a PyTorch dataset, and performing random sampling from a PyTorch dataset.
Abstract
The article is a comprehensive guide on manipulating PyTorch datasets, aimed at making PyTorch more intuitive and accessible. It begins by explaining the differences between the Dataset and Dataloader classes and their importance in managing and accessing data. The Dataset class stores samples and their corresponding targets, while the Dataloader class splits the data into batches for easier access and faster training. The article then demonstrates four common operations to manipulate datasets: filtering a class from a PyTorch dataset using the Subset class, concatenating multiple datasets using the ConcatDataset class, converting a Pandas dataframe into a PyTorch dataset using the TensorDataset class, and performing random sampling from a PyTorch dataset with and without replacement. The guide provides code examples and explanations for each operation, making it a valuable resource for anyone working with PyTorch datasets.
Bullet points
PyTorch Datasets and Dataloaders are essential for managing and accessing data in PyTorch.
The Dataset class stores samples and their corresponding targets, while the Dataloader class splits the data into batches for faster training.
The Subset class can be used to filter a class from a PyTorch dataset.
The ConcatDataset class can be used to concatenate multiple datasets.
The TensorDataset class can be used to convert a Pandas dataframe into a PyTorch dataset.
Random sampling from a PyTorch dataset can be performed with and without replacement.
Manipulating Pytorch Datasets
How to work with Dataloaders and Datasets for Deep Learning
The goal of the series is to make Pytorch more intuitive and accessible as possible through examples of implementations. There are many tutorials on the Internet to use Pytorch to build many types of challenging models, but it can also be confusing at the same time because there are always slight differences when you pass from one tutorial to another. In this series, I want to start from the simplest topics to the more advanced ones.
The purpose of this post is to help you in getting familiar with the Pytorch Datasets. Before going deeper with models based on neural networks, you need to understand the general concepts and ideas of two main classes that provide data: Dataset and Dataloader. You are wondering what is difference between the Dataset and the Dataloader. The Dataset is used to store the samples and the corresponding targets. If you already worked with Pandas, it’s very similar to a standard Pandas Dataframe as structure.
Why should we also need the Dataloader? The Dataloader is important to get easy access to the Dataset and split the data into batches, where the batches are the number of samples passed in one iteration to the Machine Learning model. The more the dataset is huge, the more time is needed to train the model. Dividing the dataset into batches provides a way to speed up the training speed of the model.
In this post, I provide an overview to work with these two classes. Later, I show how to perform four common operations to manipulate your dataset:
There are two important libraries you should keep attention at:
torch.utils.data contains two main classes to store the data: Dataset and Dataloader
torchvision provides pre-loaded datasets, like MNIST and Fashion MNIST
import torch
from torch.utils.data import Dataset,DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
Let’s now import the Fashion MNIST [1], a popular dataset that contains Zalando’s article images. Each sample is constituted by an image and the corresponding label that belongs to one of the ten classes.
To check out how many samples the training and test datasets contain, we need the well-known function len:
This output means that the first image corresponds to class 9, the successive two images to class 0 and so on [2]. We can easily verify it by displaying the first image of the training dataset:
Illustration by Author
Now, we can create a new data loader, based on the training dataset, with a batch size equal 256:
We selected a batch containing 256 images, each one belonging to different classes.
1. Filter class from Pytorch Dataset
Let’s suppose that we only want the images of T-shirts from the training dataset. We’ll need the following lines of code:
We have employed:
np.where to yield a NumPy array containing the indexes where there are only images of t-shirts in the training dataset.
Subset to select a subset of the dataset, based on the indices passed.
Let’s display some images contained in train_subset:
Illustration by Author
There is a rich variety of t-shirts, don’t you think?
2. Concatenate Pytorch datasets
Let’s merge the training dataset and test dataset into a unique dataset:
The class ConcatDataset takes in a list of multiple datasets and returns a concatenation of these ones. In this way, we obtain a dataset of 7000 samples and a dataloader with 274 batches.
3. Convert Pandas dataframe into pytorch dataset
To train a neural network implemented with Pytorch, we need a Pytorch dataset. The issue is that Pytorch doesn’t provide all the datasets of the world and we need to make some efforts to convert the dataframes into the Pytorch tensor. For example, let’s import the Boston housing prices dataset from sklearn [3]:
Illustration by Author
Let’s iterate again through the dataloader to visualize what we have obtained:
We can notice that a batch containing two samples is printed. Unluckily TensorDataset can have some limitations when the dataset’s structure is more complex.
An alternative and more efficient method is to create a class, which inherits the attributes and the methods of the class Dataset. Especially, we can override the methods of the class Dataset, depending on the dataset you are working with:
__getitem__ method is used for accessing ith sample of the dataset.
__len__ method returns the dataset’s number of sample.
Then, we create a new object belonging to the class Boston_Dataset , which will be passed to the data_loader function:
Sometimes it can happen that you want to select random samples from the dataset. Theoretically, there is already the class RandomSampler that does it automatically, but it works only with replacement. For this reason, I will need the following trick to select the samples WITHOUT replacement.
If you instead prefer to draw the sample WITH replacement, you can directly use the class RandomSampler
Final thoughts:
That’s it! Now you should be able to work with Pytorch datasets with less effort. It can be confusing at first, but after some practice, you’ll manage all the difficulties. Thanks for reading. Have a nice day!
Did you like my article? Become a member and get unlimited access to new data science posts every day! It’s an indirect way of supporting me without any extra cost to you. If you are already a member, subscribe to get emails whenever I publish new data science and python guides!