Samer Sallam

Summary

This webpage provides a guide for describing a dataset for computer vision classification problems in machine learning.

Abstract

The author, a data scientist who has worked on various machine learning and deep learning projects in the computer vision field, shares a table for describing a dataset of images for classification projects. This table includes general information such as the dataset name, link, and size, as well as specific details such as image dimensions, number of images, number of classes, number of images per class, number of images per extension, image file size, and notes. The author also provides a demonstration using a COVID-19 dataset from Kaggle.

Opinions

  • The author emphasizes the importance of an accurate and well-organized dataset description to choose the best dataset for a machine learning project.
  • The author highlights that the number of images per class is crucial to know whether the dataset is balanced or imbalanced, which affects the entire process of training and validating the machine learning or deep learning model.
  • The author suggests that the number of images should match the complexity of the problem: a very complex problem needs enough images to cover all possible cases.
  • The author mentions that the number of classes will help choose and set up a machine learning or deep learning algorithm.
  • The author notes that sometimes specific image extensions may be of interest, and the dataset description should include this information.
  • The author points out that the table provides an average value of image dimensions, which can give an intuition about the dimension value for most images.
  • The author suggests that recording the image file sizes gives an intuition about how file sizes are distributed across the dataset.

How To Describe a Dataset For A Computer Vision Classification Problem

As a data scientist, I have worked on several machine learning and deep learning projects in the computer vision field. In each project, I asked myself how to choose the best dataset, and I realized that an accurate and well-organized description would give me the right answer. In this article, I would like to share with you the following table (Table 1), which I developed to describe a dataset of images for classification projects in machine learning.

  • General information: Dataset name, link, and size.
  • Image dimensions: The range of both width and height gives you a better idea about the images and about the transformations (such as resizing or cropping) that you may need to apply. The average value gives you an intuition about the dimensions of most images.
  • Number of images: Depending on the problem you want to solve, there will be an acceptable number of images that you can work with. If the problem is very complex, this number needs to be large enough to cover all the possible cases.
  • Number of classes: The number of classes will help you choose and set up an ML/DL algorithm.
  • Number of images per class: It is very important to know whether the dataset is balanced or imbalanced, as this will affect the whole process of training and validating the ML/DL model.
  • Number of images per extension: Sometimes we are interested in a specific image extension. This information tells you the proportion of images per extension.
  • Image file size: The minimum, maximum, and average file size give you an intuition about how file sizes are distributed across the dataset.
  • Notes: This is useful if you want to add additional information about the dataset (such as permissions, ethics, etc.).
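
Most of the fields listed above can be collected automatically. The following is only a minimal sketch of how that might look in Python, not the exact procedure used to build the tables in this article. It assumes a folder layout with one sub-folder per class (for example `root/covid/...` and `root/normal/...`), which your dataset may not follow. The names `describe_dataset`, `read_dimensions`, and `IMAGE_EXTENSIONS` are hypothetical, and reading the dimensions assumes the Pillow library is installed.

```python
import os
from collections import Counter
from statistics import mean

# Assumption: these are the extensions we treat as images; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".gif"}


def read_dimensions(path):
    """Return (width, height) of an image using Pillow (assumed installed)."""
    from PIL import Image

    with Image.open(path) as img:
        return img.size


def describe_dataset(root, read_dims=read_dimensions):
    """Summarize a dataset laid out as root/<class_name>/<image files>."""
    per_class, per_ext = Counter(), Counter()
    widths, heights, sizes = [], [], []

    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # ignore loose files next to the class folders
        for file_name in os.listdir(class_dir):
            ext = os.path.splitext(file_name)[1].lower()
            if ext not in IMAGE_EXTENSIONS:
                continue  # skip non-image files
            path = os.path.join(class_dir, file_name)
            per_class[class_name] += 1
            per_ext[ext] += 1
            sizes.append(os.path.getsize(path))  # file size in bytes
            width, height = read_dims(path)
            widths.append(width)
            heights.append(height)

    if not sizes:
        raise ValueError("no images found under " + root)

    return {
        "number_of_images": len(sizes),
        "number_of_classes": len(per_class),
        "images_per_class": dict(per_class),
        "images_per_extension": dict(per_ext),
        "width_range": (min(widths), max(widths)),
        "height_range": (min(heights), max(heights)),
        "average_dimensions": (mean(widths), mean(heights)),
        "file_size_bytes": (min(sizes), mean(sizes), max(sizes)),
        # Largest class divided by smallest class; 1.0 means perfectly balanced.
        "imbalance_ratio": max(per_class.values()) / min(per_class.values()),
    }
```

Calling `describe_dataset("path/to/dataset")` returns a dictionary whose keys correspond to the rows of the table. The general information and the notes still have to be filled in by hand, and the imbalance ratio is just one simple way to express how balanced the classes are.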

To make the idea clearer, let me show you a quick demo. The following table (Table 2) describes a COVID-19 dataset from the Kaggle website.

That is all for this article. I hope you find it useful, and I would be glad to hear your ideas about the topic.
