Divide and Conquer — How to Deal With Complex Datasets
Real-life data is messy, complex, and hard to understand! Let’s see how we can make it a little simpler!
One of my favourite approaches to dealing with large, complex problems is to break them down into smaller, more manageable sub-problems. This makes it easier to focus on the important bits without being overwhelmed by the details. At Skil.AI, we often deal with enormous, complicated datasets which can seem quite daunting to tackle! As with most real-life datasets, our data is often messy, noisy, and generally the opposite of clean. In such cases, it can be helpful to break down the dataset into smaller chunks so that it can be analyzed and understood easily.
Sometimes, there may be obvious logical ways to split the dataset, but that’s not always the case. Fortunately, clustering algorithms can be used to automatically figure out good ways to split a dataset. A clustering algorithm groups a set of objects such that the objects in the same group are more similar to each other than to the objects in other groups. Best of all, we don’t need the datasets to be labelled in any way as clustering is an unsupervised task.
While clustering can be performed on any kind of data, we’ll be focusing on image clustering in this article. The clustering pipeline that we use can be split into 3 main steps — feature extraction, clustering, and analysis.
Feature Extraction
A clustering algorithm generally expects an input of the shape [n_samples, n_features]. In other words, the input is a list of the same length as the number of samples in the dataset, where each item in the list is the features for that particular sample.
A naive way to build this input would be to simply use the pixel values for the image as the features for the image. However, this is less than ideal as the raw pixel values usually contain a lot of noise. A better way to generate the features is to use the output vector from models pre-trained on image data, e.g. ResNet. The features generated from pre-trained models like ResNet are typically more robust to noise and capable of representing the important features of a given image.
We can build a feature extractor based on ResNet with Pytorch as shown below.







