The PyCoach



14 Datasets for Your Next Data Science Project

With tutorials on text/image classification, recommender systems, data visualization, and more

Photo by Vitaly Vlasov on Pexels

What cannot be omitted in a data science project? Data!

This is what this article is all about. I’ll share with you 14 datasets that you can use to do data analysis, create visualizations, classify text and images, build a recommender system, and more.

There are links to tutorials for every dataset listed. Check them out in case you need some guidance or inspiration. The tutorials are ordered by difficulty, so the beginner tutorials are at the beginning, while the advanced tutorials are at the end of the article.

Data Analysis

You can play with the datasets listed in this section using Pandas and NumPy.

Exam Scores

This dataset contains students’ marks in various subjects (math, reading, writing) along with other data about the students, such as their gender, ethnicity, and type of lunch. You can perform some analysis to get the average score per gender, find out whether a student passed or failed an exam, and more.
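The analysis described above could look roughly like this in Pandas. This is only a sketch: the rows are invented, the column names are assumptions rather than the dataset's real schema, and the passing mark of 60 is an arbitrary choice.

```python
import pandas as pd

# Invented sample rows; the column names are assumptions, not the real schema.
df = pd.DataFrame({
    "gender": ["female", "male", "female", "male"],
    "math score": [72, 47, 90, 76],
    "reading score": [72, 57, 95, 78],
})

# Average score per gender for each subject.
avg_by_gender = df.groupby("gender")[["math score", "reading score"]].mean()

# Flag pass/fail using an assumed passing mark of 60.
PASSING_MARK = 60
df["math passed"] = df["math score"] >= PASSING_MARK

print(avg_by_gender)
print(df[["math score", "math passed"]])
```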

Dataset: Link Tutorial: Link

Pokemon Dataset

This dataset has stats for 721 Pokémon. You’ll find their type, HP, attack, special attack, special defense, and speed. You can play with this data and do some analysis to, say, find the Pokémon with the highest attack and defense.

If you’re new to Pandas, I highly recommend you learn the basics with this dataset by watching the tutorial below.

Dataset: Link Tutorial: Pandas Tutorial

Netflix movies and TV shows

This dataset has all the movies and TV shows available on Netflix as of mid-2021. There you can find data such as the title, director, rating, release year, and duration. Some data is missing, and some columns need cleaning before you can work with them in a project.
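As a rough idea of what that cleaning might involve, here is a small Pandas sketch. The rows are invented and the column names are assumptions; the real file may be organized differently.

```python
import pandas as pd

# Invented rows that mimic the kind of problems described above;
# the column names are assumptions.
df = pd.DataFrame({
    "title": ["Show A", "Movie B", "Movie C"],
    "director": [None, "Jane Doe", None],
    "duration": ["2 Seasons", "90 min", "104 min"],
})

# Count missing values per column before deciding how to handle them.
missing_per_column = df.isna().sum()

# Replace missing directors with an explicit placeholder.
df["director"] = df["director"].fillna("Unknown")

# Pull the leading number out of the duration text (minutes or seasons).
df["duration_value"] = df["duration"].str.extract(r"(\d+)", expand=False).astype(int)

print(missing_per_column)
print(df)
```

Whether to fill, drop, or keep missing values depends on the question you want to answer, so treat the placeholder above as one option among several.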

Dataset: Link Tutorial: A Straightforward Guide to Cleaning and Preparing Data in Python

Data Visualization

You can use these datasets to create visualizations. To do so, you can use matplotlib, seaborn, and even pandas.

FIFA 22 player dataset

This dataset contains football player data for the video game FIFA. Data such as date of birth, height, weight, and overall rating can be found here.

The coolest thing is that the website has player data not only for 2022 but for every edition from 2016 to 2022, so you can compare how each player’s rating evolved using line plots and other visualizations.
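To compare ratings across years, the data first has to be arranged with one row per year and one column per player. A minimal sketch, assuming the yearly files have already been combined into one long table; the player names, column names, and ratings below are placeholders, not real data.

```python
import pandas as pd

# Invented long-format ratings; player names and values are placeholders.
ratings = pd.DataFrame({
    "player": ["Player A", "Player A", "Player A",
               "Player B", "Player B", "Player B"],
    "year": [2020, 2021, 2022, 2020, 2021, 2022],
    "overall": [85, 87, 88, 90, 89, 91],
})

# One row per year, one column per player: the shape line plots expect.
by_year = ratings.pivot(index="year", columns="player", values="overall")

# Rating change from the first to the last year for each player.
change = by_year.iloc[-1] - by_year.iloc[0]

print(by_year)
print(change)
```

Calling `by_year.plot()` would then draw one line per player, provided matplotlib is installed.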

Dataset: Link Tutorial: A Simple Guide to Beautiful Visualizations in Python

Population dataset (1955–2020)

This dataset contains the population every 5 years from 1955 to 2020 for most countries around the world. The dataset has 3 columns: country, year, and population. The data is good for making simple visualizations such as pie charts, bar plots, boxplots, and histograms.

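Dataset: https://drive.google.com/file/d/1QpCcE4U8NIhznbqf0kdeO2ITKPEs9OSm/view

Tutorial: The Easiest Way to Make Beautiful Interactive Visualizations With Pandas (https://towardsdatascience.com/the-easiest-way-to-make-beautiful-interactive-visualizations-with-pandas-cdf6d5e91757)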

The Simpsons / Avatar The Last Airbender

Why not have fun while learning how to make visualizations? There are free datasets of TV shows such as The Simpsons and Avatar The Last Airbender on GitHub and Kaggle. There you’ll find all the episodes and scripts, so you can show who has the highest number of lines and who speaks to whom, make a word cloud, and plot the results of a sentiment analysis.
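Counting who has the most lines is a good first step before plotting anything. Here is a small sketch using only the standard library; the (speaker, text) pairs are invented, and a real script file would need to be parsed into this shape first.

```python
from collections import Counter

# Invented script lines as (speaker, text) pairs; the texts are placeholders.
script_lines = [
    ("Homer", "first line"),
    ("Marge", "second line"),
    ("Homer", "third line"),
    ("Bart", "fourth line"),
    ("Homer", "fifth line"),
]

# Number of lines spoken by each character.
lines_per_character = Counter(speaker for speaker, _ in script_lines)

# Characters ordered from most to fewest lines.
ranking = lines_per_character.most_common()

print(ranking)
```

The resulting counts can be passed straight to a bar plot.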

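Dataset: The Simpsons (https://github.com/areevesman/the-simpsons/tree/master/data), Avatar (https://www.kaggle.com/datasets/ekrembayar/avatar-the-last-air-bender)

Tutorial: The Simpsons meets Data Visualization (https://readmedium.com/the-simpsons-meets-data-visualization-ef8ef0819d13), Data Visualization in Python with Avatar The Last Airbender (https://towardsdatascience.com/avatar-meets-data-visualization-60631f86ba7d)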

Automation

Instead of repeating tasks like creating Excel reports, you can automate them with Python.

Supermarket sales

Most of us have had to create an Excel report from a sales dataset at least once. Why not automate it? This dataset contains 3 months of historical sales for a supermarket company. You can use this data to create a pivot table and bar plot in Excel using Python behind the scenes.
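The pivot-table step could be sketched like this with Pandas. The rows and column names are invented for illustration, and the export to Excel is left as a comment because it needs an Excel engine such as openpyxl to be installed.

```python
import pandas as pd

# Invented sales rows; the column names are assumptions.
sales = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Product line": ["Food", "Food", "Sports", "Sports"],
    "Total": [100.0, 50.0, 30.0, 70.0],
})

# Total sales per gender (rows) and product line (columns).
report = sales.pivot_table(
    index="Gender", columns="Product line", values="Total", aggfunc="sum"
)

print(report)

# Writing the report to a workbook requires an engine such as openpyxl:
# report.to_excel("report.xlsx", sheet_name="Report")
```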

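Dataset: https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

Tutorial: A Simple Guide to Automate Your Excel Reporting with Python (https://towardsdatascience.com/a-simple-guide-to-automate-your-excel-reporting-with-python-9d35f143ef7)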

Regression Analysis

Boston House Prices

This is a popular dataset for practicing linear regression. The dataset contains information about houses in Boston such as the per capita crime rate by town, average number of rooms per dwelling, full-value property tax rate per $10,000, and more.

Dataset: You can get this dataset with the sklearn library.

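# Note: load_boston was removed in scikit-learn 1.2, so this only works on older versions.
from sklearn.datasets import load_boston
# Returns a Bunch object with data, target, and feature_names attributes.
boston_dataset = load_boston()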

Tutorial: A Simple Guide to Linear Regression using Python (https://towardsdatascience.com/a-simple-guide-to-linear-regression-using-python-7050e8c751c1)
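If you want to see the idea behind linear regression before opening the tutorial, here is a minimal sketch with NumPy. It does not use the Boston data: the numbers are synthetic and follow an exact straight line, and ordinary least squares via `np.linalg.lstsq` is just one way to fit it, not necessarily the approach the tutorial takes.

```python
import numpy as np

# Synthetic data following price = 5 + 3 * rooms exactly (no noise),
# so the fit should recover those two numbers.
rooms = np.array([4.0, 5.0, 6.0, 7.0, 8.0])
price = 5.0 + 3.0 * rooms

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(rooms), rooms])

# Ordinary least squares: minimize ||X @ coef - price||^2.
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
intercept, slope = coef

# Predict the price for a house with 6.5 rooms.
predicted = intercept + slope * 6.5

print(intercept, slope, predicted)
```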

Text Classification

If you’re into NLP (natural language processing), you’ll love these datasets. To work with them, you’ll use libraries such as sklearn, NLTK, gensim, and spaCy.

IMDB Dataset — Sentiment Analysis

This dataset contains 50k movie reviews with their sentiment (positive/negative). This data is great for building a model that classifies whether a text is positive or negative. This is known as binary text classification.
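A binary text classifier can be sketched in a few lines with scikit-learn. The four reviews below are invented stand-ins for the real data, and the bag-of-words plus logistic regression pipeline is one common choice, not necessarily the one used in the tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented reviews; a real project would load the 50k labeled reviews instead.
reviews = [
    "great movie, loved it",
    "wonderful acting and a great story",
    "terrible movie, hated it",
    "awful plot and terrible acting",
]
sentiments = ["positive", "positive", "negative", "negative"]

# Bag-of-words features followed by a linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(reviews, sentiments)

# Classify a new, unseen review.
prediction = model.predict(["a great and wonderful story"])[0]
print(prediction)
```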

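Dataset: https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Tutorial: A Simple Guide to Scikit-Learn — Building a Machine Learning Model in Python (https://towardsdatascience.com/a-beginners-guide-to-text-classification-with-scikit-learn-632357e16f3a)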

60k Stack Overflow Questions with Quality Rating

This dataset contains 60k Stack Overflow questions from 2016 to 2020. There are 3 types of questions: HQ (high-quality posts without a single edit), LQ_EDIT (low-quality posts with a negative score and multiple community edits), and LQ_CLOSE (low-quality posts that were closed by the community without a single edit).

You can use this dataset to predict tags for a question. This is more challenging than the previous project because a question can carry several tags at once rather than one of two labels. In this case, you have to use a multilabel classification approach.
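In a multilabel setup, each question gets a row of 0/1 indicators, one per tag, and a separate binary classifier is trained for each tag. A minimal sketch with scikit-learn, using invented question titles and tags; the one-vs-rest logistic regression here is an assumption for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

# Invented question titles with invented tag sets.
titles = [
    "how to merge two dataframes in pandas",
    "pandas groupby with numpy arrays",
    "numpy array broadcasting error",
    "plot a pandas dataframe with matplotlib",
]
tags = [["pandas"], ["numpy", "pandas"], ["numpy"], ["matplotlib", "pandas"]]

# Turn each tag list into a row of 0/1 indicators, one column per tag.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# Train one binary classifier per tag on bag-of-words features.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)
classifier = OneVsRestClassifier(LogisticRegression()).fit(X, Y)

# Predicted indicators converted back into tuples of tag names.
predicted_tags = binarizer.inverse_transform(classifier.predict(X))
print(predicted_tags)
```

With so little data the predictions are not meaningful; the point is only the shape of the pipeline.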

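Dataset: https://www.kaggle.com/datasets/imoore/60k-stack-overflow-questions-with-quality-rate

Tutorial: Predict tags on StackOverflow with linear models (https://github.com/hse-aml/natural-language-processing/blob/master/week1/week1-MultilabelClassification.ipynb). This one isn’t complete, but it has instructions on how to solve it.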

Image Classification

Unlike the other datasets listed in this article, these contain mostly images that you can use to build a classification model. To do so, you’ll use libraries such as TensorFlow and OpenCV.

Rock Paper Scissors

If you like the rock paper scissors game, you won’t get bored with this dataset. This dataset contains 2,892 images of different hands in rock/paper/scissors poses.

This is commonly used for image classification (as shown in the tutorial below), but you can use this dataset for other purposes.

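Dataset: https://www.kaggle.com/datasets/sanikamal/rock-paper-scissors-dataset

Tutorial: Rock-Paper-Scissors Image Classification Using CNN (https://readmedium.com/rock-paper-scissors-image-classification-using-cnn-eefe4569b415)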

Face Mask Detection

This dataset consists of 1,376 images. In 690 images people are wearing face masks, while in 686 images people are not wearing a mask.

You can use this dataset to build a model that detects whether a person is wearing a face mask. By the end of the project, you can pick up a face mask and use your computer’s camera to test it yourself.

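Dataset: https://github.com/prajnasb/observations/tree/master/experiements

Tutorial: Face Mask Detection using TensorFlow and OpenCV (https://towardsdatascience.com/covid-19-face-mask-detection-using-tensorflow-and-opencv-702dd833515b)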

Recommender System

Have you ever wondered how companies like Netflix and YouTube recommend movies and videos? You can use the dataset below to build your own recommender system and understand how this works.

MovieLens 20M Dataset — Movie Recommendation

This dataset contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users. Perfect for those who want to build their movie recommendation system from scratch.
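One simple way to start is item-based similarity: movies rated alike by the same users are treated as similar. The sketch below uses NumPy on an invented ratings matrix; treating unrated movies as 0 and using cosine similarity are simplifying assumptions, and a real system would need something more careful.

```python
import numpy as np

# Invented ratings matrix: rows are users, columns are movies, 0 = not rated.
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
    [0.0, 1.0, 4.0, 5.0],
])

# Cosine similarity between movie columns.
norms = np.linalg.norm(ratings, axis=0)
similarity = (ratings.T @ ratings) / np.outer(norms, norms)


def most_similar(movie_index: int) -> int:
    """Return the index of the movie most similar to the given one."""
    scores = similarity[movie_index].copy()
    scores[movie_index] = -np.inf  # ignore the movie itself
    return int(np.argmax(scores))


print(most_similar(0))
print(most_similar(2))
```

A user who liked one movie could then be shown the movies with the highest similarity to it.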

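Dataset: https://grouplens.org/datasets/movielens/

Tutorial: Recommender System in Python (https://www.datacamp.com/community/tutorials/recommender-systems-python)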

