14 Kaggle Competitions to Start your Data Science Journey

Infinite free resources await you

Kaggle is a website that hosts online machine learning competitions, with or without prizes. It allows users to share their code mainly as Jupyter notebooks. It has a plethora of datasets and even courses for machine learning and data science.

I am basically a self-taught Data Scientist and everything I learned in my first six months, I owe to Kaggle. What I did back then was enter a competition, use other people’s code, understand it, google anything I didn’t know, and have fun. After 4 years, I am not that active, but I always come back to it and check new competitions, notebooks, and cool datasets. There is always something new to learn.

My advice for aspiring data scientists is to take advantage of Kaggle as much as possible, as it has so much to offer. Its most important assets are the ready-to-use notebooks from other users. Just by reading and reproducing them, you can learn how other people approach and solve problems. You do not need to start from zero: you can copy a notebook and improve it.

Some competitions are beginner-friendly and are the best way to start your data science journey. I group them into four categories:

Classification
Regression
Computer Vision
Natural Language Processing

Let’s dive into the list:

Classification Problems:

1. Titanic 🛳️

The go-to intro competition on Kaggle. Predict which passengers survived the Titanic shipwreck. This is a classification problem with 891 training samples and 10 features.

Using this dataset, you will get familiar with the Kaggle platform and how things work in a competition: how to join a competition, create notebooks, use other people’s notebooks, make submissions, and see how you score on the leaderboard.
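To make the "make submissions" step a bit more concrete, here is a minimal sketch of writing a submission file. It assumes the Titanic format of a CSV with a `PassengerId` and a `Survived` column. The helper names, the two sample passengers, and the "predict that women survived" rule are all made up for illustration and are not a recommended model.

```python
import csv
import io


def predict_survived(passenger):
    """Toy baseline (an assumption for illustration): predict that women survived."""
    return 1 if passenger["Sex"] == "female" else 0


def write_submission(passengers, out_file):
    """Write a header plus one PassengerId,Survived row per passenger."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    for p in passengers:
        writer.writerow([p["PassengerId"], predict_survived(p)])


# Made-up rows standing in for the competition's test file.
sample_passengers = [
    {"PassengerId": 892, "Sex": "male"},
    {"PassengerId": 893, "Sex": "female"},
]

# Write to an in-memory buffer here; pass an open file to save to disk instead.
buffer = io.StringIO()
write_submission(sample_passengers, buffer)
submission_text = buffer.getvalue()
print(submission_text)
```

In a real notebook you would load the test file, replace `predict_survived` with your model's predictions, and upload the resulting CSV.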

2. Forest Cover Type Prediction 🌳

Another classification problem, where you need to predict the predominant kind of tree cover in 30 x 30-meter forest cells. 15,120 training samples and 54 features.

This is a 7-class multiclass classification problem. One step forward from the Titanic dataset. Now that you are already familiar with Kaggle, you can take on this challenge and score against 1,600+ other teams.

3. Don’t Overfit! II ⚖️

You are given a classification problem with only 250 training rows and 300 features, and you must predict 19,750 test rows. The challenge is to develop a successful model that doesn’t overfit.

I recommend this competition to sharpen your skills at avoiding overfitting, a very important concept in any machine or deep learning project. Be sure to view the available code from other users. You will learn many tips and tricks to reduce overfitting.
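To see why so few rows and so many features invite overfitting, here is a small sketch on pure random noise. It reuses the 250 rows and 300 features mentioned above, but the data, the 100 held-out rows, and the choice of a 1-nearest-neighbour model are my own assumptions for illustration, not anything from the competition.

```python
import math
import random


def nearest_label(x, train_x, train_y):
    """1-nearest-neighbour: return the label of the closest training row."""
    best = min(range(len(train_x)), key=lambda i: math.dist(x, train_x[i]))
    return train_y[best]


def accuracy(xs, ys, train_x, train_y):
    """Fraction of rows in xs whose predicted label matches ys."""
    hits = sum(nearest_label(x, train_x, train_y) == y for x, y in zip(xs, ys))
    return hits / len(xs)


rng = random.Random(0)
n_train, n_holdout, n_features = 250, 100, 300


def noise_rows(n):
    """Rows of Gaussian noise: the features carry no information about the label."""
    return [tuple(rng.gauss(0, 1) for _ in range(n_features)) for _ in range(n)]


train_x, holdout_x = noise_rows(n_train), noise_rows(n_holdout)
train_y = [rng.randint(0, 1) for _ in range(n_train)]
holdout_y = [rng.randint(0, 1) for _ in range(n_holdout)]

# The model memorises the training rows, so it looks perfect on them...
train_acc = accuracy(train_x, train_y, train_x, train_y)
# ...but on unseen rows it can only guess, because there is no signal to learn.
holdout_acc = accuracy(holdout_x, holdout_y, train_x, train_y)

print(f"train accuracy:   {train_acc:.2f}")
print(f"holdout accuracy: {holdout_acc:.2f}")
```

The gap between the two numbers is the thing to watch: a perfect training score says nothing here, which is why held-out data or cross-validation matters so much in this competition.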

4. CareerCon 2019 — Help Navigate Robots 🤖

Help robots recognize the floor surface they are standing on using a training dataset of 487,680 rows and 10 variables.

This competition also served as a job-finding opportunity: the top performers were given interview chances with selected companies. It is a great way to see where you stand among other candidates.

5. Categorical Feature Encoding Challenge 🐈

A competition that contains only categorical features. 300,000 rows and 23 features. This Playground competition will give you the opportunity to try different encoding schemes for different algorithms to compare how they perform.

It is a good tactic to join machine learning competitions that have something new to teach you. In your learning journey, don’t just compete in the ones you are more familiar with. Step out of your comfort zone and try new things. Here, you will learn how to tackle categorical features.
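As a taste of what "different encoding schemes" means, here is a minimal sketch of two common ones, ordinal and one-hot encoding, in plain Python. The column names and values are made up for illustration and are not taken from the competition data. In practice you would probably reach for a library such as pandas or scikit-learn rather than hand-rolled helpers like these.

```python
def ordinal_encode(values, order):
    """Map each category to its position in an explicit, meaningful order."""
    index = {category: i for i, category in enumerate(order)}
    return [index[v] for v in values]


def one_hot_encode(values):
    """Map each category to a 0/1 vector with one column per distinct category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


# Made-up columns for illustration.
temperature = ["Cold", "Hot", "Warm", "Cold"]   # has a natural order
colour = ["Red", "Blue", "Green", "Blue"]       # has no natural order

ordinal = ordinal_encode(temperature, order=["Cold", "Warm", "Hot"])
columns, one_hot = one_hot_encode(colour)

print(ordinal)   # [0, 2, 1, 0]
print(columns)   # ['Blue', 'Green', 'Red']
print(one_hot)   # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [1, 0, 0]]
```

Ordinal encoding keeps a single column but implies an order, while one-hot encoding makes no such claim at the cost of one column per category. Which one works better tends to depend on the algorithm, which is exactly what this competition lets you explore.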

Regression Problems:

6. House Prices 🏠

The de facto regression machine learning competition. With 79 explanatory variables describing every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

This is a really fun competition. There are so many features that I recommend it as a way to challenge your feature engineering skills!

7. TMDB Box Office Prediction 🍿

In this competition, you’re presented with metadata on over 7,000 past films from The Movie Database to try and predict their overall worldwide box office revenue.

Who doesn’t like watching movies? Now it’s your chance to play with a dataset of 7,000 films and explore their many available features. Learn pre-processing, feature engineering, data transformations, and much more while having fun!

8. Bike Sharing Demand 🚴‍♀️

In this competition, you are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C., using 10,886 training rows.

Using environmental data to predict everyday use cases is so cool. Use temperature, humidity, wind speed, and more to predict bike demand? I am in.

9. Predict Future Sales 📋

This challenge serves as the final project for the “How to win a data science competition” Coursera course. In this competition, you will work with a challenging time-series dataset consisting of daily sales data.

Again, our focus is to step out of our comfort zone and learn something new. Time series is one of the most common types of data you will encounter. So learning how to tackle these kinds of data will be of high value to you.
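One habit that time-series data forces on you is validating on the future instead of on a random sample. Here is a minimal sketch of a time-based split on made-up daily sales. The field names `date` and `units_sold`, the cutoff date, and the monthly aggregation helper are assumptions for illustration and not the competition's actual schema.

```python
from datetime import date, timedelta


def time_based_split(records, cutoff):
    """Train on everything before `cutoff`, validate on everything from it onwards."""
    train = [r for r in records if r["date"] < cutoff]
    valid = [r for r in records if r["date"] >= cutoff]
    return train, valid


def monthly_totals(records):
    """Aggregate daily sales into (year, month) totals."""
    totals = {}
    for r in records:
        key = (r["date"].year, r["date"].month)
        totals[key] = totals.get(key, 0) + r["units_sold"]
    return totals


# Made-up data: one unit sold per day across January and February 2015.
start = date(2015, 1, 1)
daily_sales = [
    {"date": start + timedelta(days=i), "units_sold": 1} for i in range(59)
]

train, valid = time_based_split(daily_sales, cutoff=date(2015, 2, 1))
totals = monthly_totals(daily_sales)

print(len(train), len(valid))
print(totals)
```

A random split would let the model peek at days that come after the ones it is asked to predict, so the validation score would look better than it deserves.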

Computer Vision Problems:

10. Digit Recognizer 🔢

If you are new to computer vision this is the perfect introduction to techniques like neural networks using a classic dataset including pre-extracted features.

This is the de facto “hello world” dataset of computer vision. The task is to correctly identify digits from a dataset of tens of thousands of handwritten images.
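If "pre-extracted features" sounds abstract, here is a small sketch of what it can look like. It assumes each image arrives as a flat row of 784 grey-scale pixel values between 0 and 255, that is, a 28 x 28 image unrolled into one line. The helper `to_grid` and the sample image are made up for illustration.

```python
def to_grid(flat_pixels, side=28):
    """Turn side*side flat pixel values into rows of `side` pixels, scaled to [0, 1]."""
    if len(flat_pixels) != side * side:
        raise ValueError(f"expected {side * side} pixels, got {len(flat_pixels)}")
    scaled = [p / 255 for p in flat_pixels]
    return [scaled[r * side:(r + 1) * side] for r in range(side)]


# Made-up image: black background with one white vertical stroke in column 14.
flat = [255 if i % 28 == 14 else 0 for i in range(28 * 28)]
grid = to_grid(flat)

# Print the first few rows as text to eyeball the result.
for row in grid[:3]:
    print("".join("#" if p > 0.5 else "." for p in row))
```

Scaling to the 0 to 1 range and restoring the 2D shape are typical first steps before feeding such images to a neural network.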

11. Dog Breed Identification 🐶

Predict the correct breed from 120 possible choices and a limited number of training images per class.

A step further in the computer vision domain. Predicting handwritten digits was really easy. Telling 120 dog breeds apart is more challenging but also more rewarding.

Natural Language Processing Problems:

12. Real or Not? NLP with Disaster Tweets ⛈️

This particular challenge is perfect for data scientists looking to get started with Natural Language Processing. You are challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified.

This dataset is also used in academia to extract valuable insights into how people tweet about natural disasters. The possibilities and learning opportunities with text data are infinite, so I recommend you start this unique journey with this dataset and competition.

13. Jigsaw Toxic Comment Classification Challenge 🗯️

In this competition, you’re challenged to build a multi-headed model that’s capable of detecting different types of toxicity like threats, obscenity, insults, and identity-based hate. You’ll be using a dataset of comments from Wikipedia’s talk page edits.

One important thing to focus on in this competition can be the various pre-processing techniques that you can apply to raw data in order to successfully fit your classification algorithm.
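As one possible starting point for such pre-processing, here is a minimal sketch that lowercases a comment, strips URLs and punctuation, and splits it into tokens. The exact cleaning rules and the sample comment are my own assumptions. Real solutions often go further, and some models do better with less cleaning.

```python
import re

URL_RE = re.compile(r"https?://\S+")
# Anything that is not a lowercase letter, digit, apostrophe, or whitespace.
NON_WORD_RE = re.compile(r"[^a-z0-9'\s]")


def preprocess(comment):
    """Lowercase, drop URLs and punctuation, then split on whitespace."""
    text = comment.lower()
    text = URL_RE.sub(" ", text)
    text = NON_WORD_RE.sub(" ", text)
    return text.split()


tokens = preprocess("Please STOP!!! See https://example.com for the rules...")
print(tokens)  # ['please', 'stop', 'see', 'for', 'the', 'rules']
```

Note that URLs are removed before punctuation, otherwise the punctuation step would break them into leftover fragments.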

14. Sentiment Analysis on Movie Reviews 🎬

This competition presents a chance to benchmark your sentiment analysis ideas on the Rotten Tomatoes dataset. You are asked to label phrases on a scale of five values: negative, somewhat negative, neutral, somewhat positive, and positive. Obstacles like sentence negation, sarcasm, terseness, language ambiguity, and many others make this task very challenging.

This competition actually seeded my university thesis in 2016, where I did emotion detection on movie reviews!

As a final note, I recommend you focus on one thing at a time in each competition: pre-processing, encoding, transformations, ML algorithms, feature engineering, selection, tuning, analysis, and so on. Create a GitHub profile and upload your work there. This serves a dual purpose: a) the future you can always find your previous work, and b) the whole world can see proof of what you are capable of.

You can start from the above competitions and as you learn and feel more confident, transition to more challenging and new ones. I wish you a great Data Science journey!

Not sure what to read next? Here is one pick:

5 Data Science Podcasts you Should be Listening Right Now

Keep up with the latest trends and stay at the top of your field.

towardsdatascience.com

Keep in touch

➥Follow me on Medium for more content like this. ➥Let’s connect on LinkedIn. ➥Check my GitHub.