All the Datasets You Need to Practice Data Science Skills and Make a Great Portfolio
A Great Collection of Different Kinds of Datasets
Every time I start a project, whether to learn a new topic or to build something for my portfolio, I spend a significant amount of time finding a suitable dataset. Over time I have collected quite a lot of datasets that helped me learn and do some cool projects for my portfolio. I am sharing those datasets in this article so that you have datasets to practice with and to build your own portfolio.
Olympic Dataset
This dataset has information on the Olympic results. Each row contains the data of a country. This dataset will give you a taste of data cleaning to start with.
I learned Python libraries like NumPy and pandas using this dataset.
Download this dataset from here
Housing Price dataset
This dataset is commonly used to teach and learn Regression Models. It can certainly be used for other things as well.
This dataset contains these columns: id, date, price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, view, condition, grade, sqft_above, sqft_basement, yr_built, yr_renovated, zipcode, lat, long, sqft_living15, sqft_lot15.
Here is the link.
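As a quick sketch of the regression use mentioned above, the snippet below fits a simple least squares line of price on sqft_living with NumPy. The rows are made up for illustration; only the column names come from the list above. With the real file you would load the data with pandas first.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; column names follow the list above.
houses = pd.DataFrame({
    "sqft_living": [1000, 1500, 2000, 2500, 3000],
    "bedrooms": [2, 3, 3, 4, 4],
    "price": [200000, 290000, 410000, 500000, 600000],
})

# Fit price = slope * sqft_living + intercept by ordinary least squares.
slope, intercept = np.polyfit(houses["sqft_living"], houses["price"], deg=1)

# Predict the price of a hypothetical 1,800 sqft house.
predicted = slope * 1800 + intercept

# R^2: the share of the variance in price explained by the fitted line.
fitted = slope * houses["sqft_living"] + intercept
ss_res = ((houses["price"] - fitted) ** 2).sum()
ss_tot = ((houses["price"] - houses["price"].mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.1f}, intercept={intercept:.1f}, r2={r_squared:.3f}")
```

A single feature keeps the sketch short. For a real project you would likely try several of the columns together and hold out part of the data for evaluation.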
Heart Disease Dataset
This dataset is from Kaggle. I used it in several articles for demonstration purposes.
These are two examples:
Kaggle also has some exploratory data analysis and details about the features.
Download this dataset from this link.
Mushrooms Dataset
I found this dataset in the Applied Data Science with Python Specialization on Coursera.
I used it for Classification problems. It can be used for other purposes as well.
It contains these columns: class, cap-shape, cap-surface, cap-color, bruises, odor, gill-attachment, gill-spacing, gill-size, gill-color, stalk-shape, stalk-root, stalk-surface-above-ring, stalk-surface-below-ring, stalk-color-above-ring, stalk-color-below-ring, veil-type, veil-color, ring-number, ring-type, spore-print-color, population, habitat.
Here is the link to this dataset
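Because every column in the list above is categorical, a classifier usually needs the features turned into numbers first. Here is a minimal sketch of one-hot encoding with pandas. The rows and the single-letter codes are made up placeholders, and treating "p" as the positive class is an assumption for illustration.

```python
import pandas as pd

# Made-up rows; the column names come from the list above, and the
# single-letter codes are placeholders, not values checked against the file.
mushrooms = pd.DataFrame({
    "class": ["p", "e", "e", "p"],
    "odor": ["f", "n", "a", "f"],
    "cap-color": ["n", "y", "w", "n"],
})

# Target: 1 where class is "p", else 0 (assumed labeling for this sketch).
y = (mushrooms["class"] == "p").astype(int)

# Features: one indicator column per (column, category) pair.
X = pd.get_dummies(mushrooms.drop(columns="class"))

print(X.columns.tolist())
print(X.shape, y.tolist())
```

The resulting X and y can then be passed to whichever classification model you want to practice.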
NHANES Dataset
This is a big dataset that includes a lot of continuous and categorical features, so you can use the whole dataset or part of it for many different purposes. The column names may not look very understandable at first. But once you get used to them, it can be a very useful dataset to practice Data Analysis, Visualization, Statistical Modeling, and Machine Learning models (both classification and regression).
In this article, I cut a piece of the dataset and used it for multiple linear regression:
Here I used it for some visualization demonstration:
Titanic Dataset
Another very popular dataset. I have used it a lot myself, and I have seen many experienced people use it to present a concept.
This dataset contains these columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked.
This dataset is good for Exploratory Data Analysis, Machine Learning Models (especially Classification Models), Statistical Analysis, and Data Visualization practice.
This is a tutorial where I used this dataset:
Here is a demonstration of some pandas functions using this dataset:
Here is the link to this dataset
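To give an idea of the kind of Exploratory Data Analysis this dataset is good for, here is a small pandas sketch. The passengers are made up for illustration; only the column names Survived, Sex, Age and Fare come from the list above.

```python
import pandas as pd

# Made-up passengers for illustration; column names follow the list above.
titanic = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, None, 35.0, 27.0, 54.0],
    "Fare": [7.25, 71.28, 7.92, 8.05, 13.00, 51.86],
})

# Count missing values per column, a common first step.
missing = titanic.isna().sum()

# Survival rate by sex: the mean of a 0/1 column is the share of 1s.
survival_by_sex = titanic.groupby("Sex")["Survived"].mean()

# Fill the missing ages with the median age.
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())

print(missing)
print(survival_by_sex)
```

Filling with the median is only one simple choice. Which strategy is appropriate depends on the analysis you are doing.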
Census Dataset
If you want to get a taste of how to explore a very big dataset, work with this one.
This one is great for Exploratory Data Analysis, Statistical Analysis & Modeling, and Data Visualization practice.
Here is some practice of data analysis with this dataset:
https://github.com/rashida048/Advanced-Pandas-Application/blob/master/advancedPandasPractice.ipynb
Download this dataset from here.
Credit Card Fraud Dataset
This dataset is different from the other datasets mentioned here because it has no feature names. Sometimes Data Scientists have to deal with datasets like that.
This dataset is about credit card fraud detection. It is very likely that a bank will not share its client information with a data scientist. So, the feature names won’t be available. This dataset gives a flavor of that. It has a binary column that indicates if a transaction is fraudulent or not. This dataset can be used for classification models.
Here is an example on this GitHub page:
This dataset can also be used for Exploratory Data Analysis and Visualization.
Download this dataset from this link.
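Fraud labels are usually heavily imbalanced, so it helps to check the class shares and keep them intact when splitting the data. Below is a minimal pandas sketch of that idea. The transactions are made up, and "V1", "V2" and "Class" are hypothetical column names standing in for the anonymized features and the binary fraud column.

```python
import pandas as pd

# Made-up transactions. "V1", "V2" and "Class" are hypothetical names for
# two anonymized features and the binary fraud label (1 = fraudulent).
transactions = pd.DataFrame({
    "V1": [0.1, -1.2, 0.3, 2.2, -0.5, 0.9, -0.7, 1.4, 0.0, -2.1],
    "V2": [1.0, 0.4, -0.3, 0.8, -1.1, 0.2, 0.6, -0.9, 0.5, 1.7],
    "Class": [0, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

# Share of each class in the data.
class_share = transactions["Class"].value_counts(normalize=True)

# Stratified hold-out: sample the same fraction from each class so the
# test set keeps the class balance of the full data.
test = transactions.groupby("Class").sample(frac=0.5, random_state=0)
train = transactions.drop(test.index)

print(class_share)
print(len(train), len(test))
```

With only a handful of fraud rows, plain accuracy can look good even for a useless model, so metrics such as precision and recall are worth looking at when you practice classification on data like this.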
Movie Dataset
This dataset contains features related to different movies. This is a good dataset for some Natural Language Processing projects.
These are the features:
index, budget, genres, homepage, id, keywords, original_language, original_title, overview, popularity, production_companies, production_countries, release_date, revenue, runtime, spoken_languages, status, tagline, title, vote_average, vote_count, cast, crew, director.
Here is a demonstration of a movie recommendation algorithm using this dataset:
Download this dataset from this link.
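One simple way to practice recommendation on text columns such as keywords and genres is to compare movies by the words they share. The sketch below uses cosine similarity on word counts with only the standard library. The titles and texts are made up, and this is an illustrative approach, not necessarily the one used in the demonstration mentioned above.

```python
import math
from collections import Counter

# Made-up movies for illustration: title -> combined keywords and genres text.
movies = {
    "Space Quest": "space alien future science fiction",
    "Star Voyage": "space ship alien adventure science fiction",
    "Love in Paris": "romance paris love drama",
}

def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two strings."""
    a, b = Counter(text_a.split()), Counter(text_b.split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar(title: str) -> str:
    """Return the other movie whose text is closest to the given one."""
    others = (t for t in movies if t != title)
    return max(others, key=lambda t: cosine_similarity(movies[title], movies[t]))

print(most_similar("Space Quest"))
```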
People Wiki Dataset
This dataset contains Wikipedia profiles of different kinds of people. It has three features: URI, name (name of the person), and text (the Wikipedia profile). As you may have already guessed, this is also a good dataset for Natural Language Processing.
Here is an example of a project with this dataset:
Here is the link to this dataset
Amazon Product Review Dataset
This dataset contains millions of reviews of products sold on Amazon.
It has three columns: name of the product, review, and rating. It is close to real-world data, which makes it very good for Natural Language Processing.
I have a sentiment analysis project and an article where I used this dataset. Please check it out here:
Download this dataset from this link.
BBC Text Dataset
Another wonderful dataset for Natural Language Processing.
This dataset contains information on different types of news from BBC archives. It’s a big text dataset.
It is popular for Multiclass Classification problems.
The dataset is big but it has only two columns: text and category.
Here is the link for this dataset
Digits dataset
This dataset contains the pixel values of images of the 10 digits (0 to 9). It is commonly used for image recognition problems.
I used this dataset for a few different types of Multiclass Classification problems.
This is a logistic regression algorithm:
Here is a demonstration of a neural network in Python:
Download this dataset from this link.
CIFAR Dataset
This is another dataset that contains the pixel values of different images. The difference from the digits dataset is that the pixel values of each image form a three-dimensional array.
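The difference in shape is easy to see with NumPy. In this sketch random numbers stand in for real images, and the 8x8 and 32x32 sizes are assumptions for illustration only.

```python
import numpy as np

# Random pixel values standing in for real images; the 8x8 and 32x32 sizes
# are assumptions for illustration, not taken from the dataset files.
rng = np.random.default_rng(0)
gray_image = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)         # height x width
color_image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)   # height x width x channels

# Scale pixel values from the 0..255 range to 0..1.
gray_scaled = gray_image / 255.0
color_scaled = color_image / 255.0

# Flatten each image into a single row of features.
gray_row = gray_scaled.reshape(-1)
color_row = color_scaled.reshape(-1)

print(gray_image.ndim, color_image.ndim, gray_row.shape, color_row.shape)
```

Flattened rows suit models like logistic regression, while convolutional networks usually take the images in their original multi-dimensional shape.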
Here is a project where I tried different neural network structures using TensorFlow and Keras with this dataset:
Cats vs Dogs
Very commonly used to practice Image Classification.
This dataset contains images of cats and dogs.
It is good for computer vision problems.
Malignant vs Benign
Another useful dataset for Computer Vision problems.
This dataset contains images of two types of skin cancer, malignant and benign.
It is good for Image Classification problems.
Download this dataset from here
Cars Dataset
This is a reasonably sized dataset that can be used to practice some Regression Models and Exploratory Data Analysis.
This dataset contains these columns: YEAR, Make, Model, Size, (kW), Unnamed: 5, TYPE, CITY (kWh/100 km), HWY (kWh/100 km), COMB (kWh/100 km), CITY (Le/100 km), HWY (Le/100 km), COMB (Le/100 km), (g/km), RATING, (km), TIME (h).
Here is the link for this dataset
Canada Immigration Dataset
This dataset provides information about how many immigrants came from which country by year.
It is a great dataset to practice Exploratory Data Analysis and Data Visualization.
Facebook Stock Data
It provides Facebook stock performance per day.
The columns in this dataset are Date, Open, High, Low, Close, Adj Close, Volume.
This one can be very useful in Time Series Analysis and Visualization or Time Series Related problems.
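As a small taste of time series work with these columns, the sketch below parses the dates, computes a 3-day moving average of the closing price, and computes the daily return. The prices are made up for illustration; only the Date and Close column names come from the list above.

```python
import pandas as pd

# Made-up prices for illustration; "Date" and "Close" follow the column list above.
stock = pd.DataFrame({
    "Date": ["2021-01-04", "2021-01-05", "2021-01-06", "2021-01-07", "2021-01-08"],
    "Close": [100.0, 102.0, 101.0, 105.0, 110.0],
})

# Parse the dates and use them as a sorted index.
stock["Date"] = pd.to_datetime(stock["Date"])
stock = stock.set_index("Date").sort_index()

# 3-day moving average of the closing price (empty until 3 values are available).
stock["Close_MA3"] = stock["Close"].rolling(window=3).mean()

# Daily return: today's close relative to the previous close.
stock["Return"] = stock["Close"] / stock["Close"].shift(1) - 1

print(stock)
```

A datetime index like this is also what pandas needs for resampling and for date-based slicing.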
Here are some time series analysis and visualization tutorials using this dataset:
Airbnb Dataset
I received this dataset as a part of an interview a while ago.
I was asked to do an Exploratory Data Analysis and develop a Machine Learning Model using this dataset.
This dataset has a lot of text data and numerical data. You can use this dataset to practice a lot of different types of projects.
You will see several datasets at this link, but I was asked to download the listings.csv file for my interview.
Florida Subsidence Incidents Report
I wanted to add one dataset that includes latitude and longitude data in case you are interested in working on some geospatial analysis. I used this dataset for some visualization practice:
Conclusion
These are all the datasets I wanted to share today. You should find enough datasets and some project ideas on this page to practice the necessary skills and build a portfolio. Hope this helps.
Please feel free to follow me on Twitter.






