This article provides a guide to learning data science using Kaggle, a platform for data science competitions, datasets, notebooks, and courses.
Abstract
The article begins by discussing the growth of Kaggle and its features, including competitions, datasets, notebooks, and courses. It then addresses the differences between Kaggle problems and real-world problems, emphasizing the importance of learning new things rather than focusing on the differences. The article provides a list of short courses on various data science-related topics, including Python, Pandas, data cleaning, data visualization, and an introduction to machine learning. It also highlights the importance of data exploration and provides links to some best notebooks for learning data exploration concepts. The article concludes by discussing the importance of learning about solving different data science problems and provides links to Kaggle notebooks for each category, including regression, classification, clustering, NLP, and computer vision.
Opinions
Kaggle is a valuable platform for learning data science, offering a variety of features and resources.
It is important to focus on learning new things rather than comparing Kaggle problems to real-world problems.
Data exploration is a crucial aspect of data science, and Kaggle offers many resources for learning data exploration concepts.
Learning about solving different data science problems is essential, and Kaggle provides many resources for this purpose.
When I started with Kaggle, it was very small. There used to be one or two active competitions at a time. Each competition had a discussion forum where people could talk about their solutions and doubts. It was very simple and easy. Kaggle has grown huge over the last few years. It has over 5 million registered users, and there are more than 10 active competitions at any point in time.
It is not just a platform for data science competitions. Kaggle has now become a platform for data scientists. They have a huge inventory of datasets. There are many notebooks related to various problems. There are short courses on many data science topics. They have discussion forums, research competitions, and much more.
I have seen people point out the differences between Kaggle problems and real-world problems. I agree with them. There are many differences between a real-world problem and a Kaggle problem. But that should not be a reason to avoid Kaggle. There is so much else to learn about data science on Kaggle. The focus should be on learning new things, not on the differences.
Kaggle is now so huge that it can be overwhelming. It is common to not know how to use it to learn data science. In this article, I am going to explain how to get started with Kaggle to learn data science.
Learn the basics
Kaggle has many short courses on various data science-related topics. These courses are free and you can earn certificates as well. If you do not have any experience in data science, there are some courses you need to start with. These courses will help you get familiar with the foundational concepts. They are explained in detail below.
Python
The first step for anyone who wants to become a data scientist is to learn to code. The programming skill is the basic skill needed to solve any data science problem. Python is the most popular programming language for data science. It has many libraries that are very helpful to work on various phases of a data science project.
As per a recent Kaggle survey, more than 80% of people use Python at their job. Its ease of use and learning makes Python one of the most sought-after programming languages. People coming from a non-technical background may find Python a bit difficult at the start. But once they get comfortable, everything falls into place.
There is a 5-hour course on Python in Kaggle. It consists of 8 modules covering all basic concepts to work on data science problems. Below is the link to the course,
After completing the above course, learn from Codecademy for more confidence. Also, refer to the Coursera course here. If at any time you feel that coding is not for you, watch the below video. It shows a way to learn to code for data science.
Pandas
Pandas is one of the most important Python libraries and one of the most commonly used in data science. It makes it easy to read and write data. It is used for data pre-processing and transformation, and for structuring the data so that it is easy to explore and analyze.
While working on a data science problem, it is important for the data to be in a tabular shape, with rows as observations and columns as attributes. This holds for both data analysis and building a predictive model. Pandas provides this structure through its data frame. Knowledge of this library helps in performing various data operations easily and efficiently.
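As a minimal sketch of what the data frame format looks like in practice, the snippet below loads a tiny CSV into a pandas DataFrame and runs a few typical operations. The column names and values are invented for illustration; a real project would read a file such as a Kaggle dataset instead of an in-memory string.

```python
import io

import pandas as pd

# Hypothetical data, invented for this example.
csv_text = """name,age,income
Asha,34,52000
Ben,45,61000
Chen,29,48000
Dana,52,75000
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())    # first rows of the table
print(df.dtypes)    # column types inferred by pandas

# Filter rows by a condition and keep only some columns.
over_30 = df.loc[df["age"] > 30, ["name", "income"]]

# Derive a new column from an existing one.
df["income_thousands"] = df["income"] / 1000

mean_income = df["income"].mean()
print(over_30)
print(mean_income)
```

Each column behaves like an array, so filtering, arithmetic, and aggregation apply to the whole column at once rather than row by row.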
The below course on Pandas covers the key functionalities of this library. It is a 4-hour course and it requires prior knowledge of Python.
Data cleaning
Kaggle datasets are generally clean and ready to use. In most real-life use cases the data is not in a usable format. It needs cleaning before it can be consumed by a data science project.
A real-world dataset would have a lot of issues like,
Having many missing data points
Attributes in different scales like age (in years) and income (in ‘000s). Many algorithms would need the attributes to be on the same scale.
Attributes having outliers which if not treated might produce inaccurate results
Data inconsistencies due to typographical errors or other issues like different date formats
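The four issues above can be sketched in plain Python using only the standard library, so that each step stays visible. The values are made up, and the choices here (median imputation, min-max scaling, the 1.5 × IQR rule, and the two date formats) are assumptions for illustration; other strategies are equally valid depending on the dataset.

```python
from datetime import datetime
from statistics import median, quantiles


def fill_missing(values):
    """Replace missing entries (None) with the median of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest is 0 and the largest is 1.

    Assumes the values are not all identical.
    """
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def iqr_outliers(values, k=1.5):
    """Flag values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    return [v < q1 - k * spread or v > q3 + k * spread for v in values]


def parse_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each expected format in turn and return a date object."""
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {text!r}")


# Hypothetical raw columns, one per issue.
raw_ages = [25, None, 40, 31, None, 58]        # missing data points
incomes = [30, 42, 38, 45, 41, 36, 44, 400]    # in '000s, with one extreme value
raw_dates = ["2021-03-05", "05/03/2021"]       # the same day in two formats

ages = fill_missing(raw_ages)
outlier_flags = iqr_outliers(incomes)
kept_incomes = [v for v, flagged in zip(incomes, outlier_flags) if not flagged]

# After scaling, age and income both lie between 0 and 1.
scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(kept_incomes)

dates = [parse_date(text) for text in raw_dates]
print(ages, outlier_flags, dates)
```

Dropping the flagged value is only one way to treat an outlier; capping it or keeping it with a robust model are common alternatives.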
The below course explores some of the common data issues and how to handle them. It is a short course that teaches important data cleaning strategies. It doesn’t cover all the data cleaning techniques, but it covers enough to get started.
Data visualization
It is easy for the human brain to extract patterns from visual data. Visualization is the key to communicating insights to business stakeholders. A data scientist needs to have good knowledge of visualization techniques: which charts suit which type of data analysis, and which techniques help in highlighting the trends or patterns in the data.
Finally, the below course will help to understand the pipeline in a data science project. It will cover the important steps and the reason for having those steps. On completing this course you would be able to implement a predictive model.
These courses might not make you an expert in those topics. They do not cover every single concept within a topic. But they will equip you with enough skills to start solving data science problems and to keep learning data science by yourself.
Do more data exploration
After completing these courses, it is time to make use of the platform to gain practical skills. It is true that Kaggle’s problems don’t represent real-life data science problems. Does that mean you can’t learn anything useful? Definitely not! There are so many things that one can learn about data science on Kaggle. One such important topic is data exploration.
In most projects, data scientists spend roughly 70–80% of their time on data analysis. There is no strict syllabus for learning exploratory data analysis. The approach depends on the dataset, the problem being solved, the inputs provided by the business, the insights found in the data, and much more. It is important to understand that there is no correct or wrong way. The objective should be to cover as many aspects as possible to uncover the key information.
There are some amazing notebooks on Kaggle for learning data exploration. They can help you understand the methods and techniques commonly used. No one can become an expert in data exploration just by completing a course. It is important to practice as much as possible. I will provide links to some of the best notebooks to help you learn the concepts better.
The below notebook is a comprehensive data analysis notebook on housing price data. Its approach is in line with the one used in real-world data analysis. It starts off with understanding the data better. Univariate analysis helps in understanding the dependent variable. Multivariate analysis helps in understanding the relationship between the dependent and independent variables. Finally, it includes examples of testing assumptions such as normality, homoscedasticity, and linearity.
The below notebook analyzes a dataset with different types of data: tabular data, text, and images. It will help you learn techniques useful for each of those data types.
The success of a data science project depends on how well the problem has been understood. To best understand the problem, the data needs to be properly analyzed. Performing data analysis is not a difficult task, but it is often time-consuming, which tempts people to move to the next step in haste. Moving on without understanding is a recipe for disaster.
One of the best ways to perform a complete data analysis is to apply first-principles thinking to data science problems. The below article shows how to incorporate first principles in solving a data science problem.
After learning the basics and data exploration, the focus can move to model building. I have seen many people try to start learning data science with algorithms. It is definitely not a good idea to start with algorithms without building a strong foundation.
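To make the univariate and multivariate steps concrete, here is a small sketch in plain Python. The housing values are invented and are not taken from the notebook, and comparing skewness before and after a log transform is only a rough stand-in for the formal normality tests a full analysis would use.

```python
from math import log, sqrt
from statistics import mean, median, pstdev

# Hypothetical values: living area (square metres) and sale price (in thousands).
area = [50, 60, 80, 100, 120, 150]
price = [110, 125, 170, 210, 260, 400]


def skewness(values):
    """Population skewness: 0 for symmetric data, positive for a long right tail."""
    mu, sigma = mean(values), pstdev(values)
    return mean([((v - mu) / sigma) ** 3 for v in values])


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(spread_x * spread_y)


# Univariate analysis: summarise the dependent variable on its own.
summary = {
    "mean": mean(price),
    "median": median(price),
    "std": pstdev(price),
    "skew": skewness(price),
}

# Multivariate analysis: how strongly does an independent variable move with the target?
r_area_price = pearson(area, price)

# Rough normality check: a log transform often reduces right skew in prices.
log_skew = skewness([log(p) for p in price])

print(summary)
print(r_area_price, log_skew)
```

A mean well above the median, together with positive skewness, suggests a long right tail in the target, which is one reason price columns are often log-transformed before fitting a linear model.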
If you are new to data science, it is highly recommended that you complete the below short course on machine learning. It will show you the steps involved in solving a data science problem. You can also create a template out of it that can be used to solve similar problems. After that, move on to real-world problems or Kaggle competitions.
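One way such a template might look is sketched below: split the data, fit a model, predict on held-out rows, and compare the error against a naive baseline. The model is a deliberately simple straight-line fit written in plain Python, not the method taught in the course, and the function names and synthetic data are assumptions made for this example.

```python
import random
from statistics import mean


def train_validation_split(rows, validation_fraction=0.25, seed=0):
    """Shuffle the rows reproducibly, then hold out a fraction for validation."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


def fit_line(rows):
    """Least-squares fit of y = slope * x + intercept on (x, y) pairs."""
    mean_x = mean([x for x, _ in rows])
    mean_y = mean([y for _, y in rows])
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    denominator = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = numerator / denominator
    return slope, mean_y - slope * mean_x


def mean_absolute_error(actual, predicted):
    """Average size of the prediction errors, in the units of the target."""
    return mean([abs(a - p) for a, p in zip(actual, predicted)])


def run_experiment(rows):
    """Template: split, fit, predict, then score against a naive baseline."""
    train, validation = train_validation_split(rows)
    slope, intercept = fit_line(train)
    actual = [y for _, y in validation]
    predicted = [slope * x + intercept for x, _ in validation]
    # Baseline: always predict the average target seen during training.
    train_mean = mean([y for _, y in train])
    return {
        "slope": slope,
        "intercept": intercept,
        "model_mae": mean_absolute_error(actual, predicted),
        "baseline_mae": mean_absolute_error(actual, [train_mean] * len(actual)),
    }


# Synthetic data: y is roughly 3x + 5 plus a little noise.
noise = random.Random(1)
data = [(x, 3 * x + 5 + noise.uniform(-1, 1)) for x in range(20)]

result = run_experiment(data)
print(result)
```

Scoring on rows the model never saw, and comparing with a baseline, is what makes the number meaningful. Swapping in a different model only requires replacing `fit_line` and the prediction line.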
Another aspect of model building is feature engineering. It helps in,
Identifying the right features
Transforming the features to make them compatible with the algorithm
Creating new features to help improve prediction
Reducing the features when there are a large number of features
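The transforming, creating, and reducing steps above can be sketched as follows in plain Python. Feature selection (the first item) is left out to keep the example short. The records, column names, and the zero-variance rule are all invented for illustration.

```python
from math import log
from statistics import pvariance

# Hypothetical housing records, invented for this example.
houses = [
    {"area": 50.0, "rooms": 2, "region": "north", "floors": 1, "lot_size": 200.0},
    {"area": 80.0, "rooms": 3, "region": "south", "floors": 1, "lot_size": 250.0},
    {"area": 120.0, "rooms": 4, "region": "north", "floors": 1, "lot_size": 300.0},
    {"area": 150.0, "rooms": 5, "region": "east", "floors": 1, "lot_size": 5000.0},
]


def add_engineered_features(row, regions):
    """Return a copy of the row with transformed, new, and encoded features."""
    out = {key: value for key, value in row.items() if key != "region"}
    # Transform: a log compresses a heavily skewed scale.
    out["log_lot_size"] = log(row["lot_size"])
    # Create: a ratio of two raw columns can carry more signal than either alone.
    out["area_per_room"] = row["area"] / row["rooms"]
    # Encode: one 0/1 column per category so numeric algorithms can use it.
    for region in regions:
        out[f"region_{region}"] = int(row["region"] == region)
    return out


def drop_constant_columns(rows, threshold=0.0):
    """Reduce: remove columns whose variance does not exceed the threshold."""
    keep = [
        column
        for column in rows[0]
        if pvariance([row[column] for row in rows]) > threshold
    ]
    return [{column: row[column] for column in keep} for row in rows]


regions = sorted({house["region"] for house in houses})
engineered = [add_engineered_features(house, regions) for house in houses]
reduced = drop_constant_columns(engineered)

print(sorted(reduced[0]))
```

Here `floors` is dropped because it has the same value in every row and therefore cannot help a model tell the houses apart.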
Feature engineering plays an important role in improving the performance of the model. Below is a good course to learn more about feature engineering concepts. Again, these courses are not comprehensive, but they are good enough to understand the core concepts.
Learn about solving different data science problems
Now, let me show you some Kaggle notebooks for learning about different kinds of data science problems. The common problems in data science are regression, classification, clustering, NLP, and computer vision. Most data science projects can be classified into one of these categories.
The best way to learn about algorithms and techniques is by working on projects. Below are some Kaggle notebooks that could help you understand the concepts better. The notebooks I have chosen are all well documented. They clearly show the standard approach to solving different data science problems. Here is one Kaggle notebook for each of the categories,
Going through the above Kaggle notebooks and executing each of them line by line will teach you a lot of practical skills. It will show you the approach to solving those problems. These notebooks are mostly written by experts, so they will also teach you coding standards.
If your goal is to get a job in data science, then by this stage you will have gained enough knowledge to apply for jobs.
What are the different competition categories?
There are many competitions on Kaggle. The competitions are classified into different categories. Here are some of the competition categories,
Featured competitions
The featured competitions are the most sought after. These competitions generally attract huge prize money. They also attract top talent from all over the world. It is the best place to learn from the experts. The discussion forums of these competitions are goldmines. They hold so much information that can be very helpful. A prerequisite to entering a featured competition is having a strong knowledge of the basics.
Getting started competitions
These are the most accessible competitions and the best ones for people just getting started with Kaggle. The competitions in this category do not have any prize money. Their advantage is that the solutions help you learn about interesting techniques and approaches.
Research competitions
These competitions are generally as difficult as a featured competition. There is generally no prize money attached to these competitions. The lack of clarity in these competitions often makes me compare them to real-world problems.
These are some of the standard competition categories on Kaggle. Most of these competitions are single-stage: the datasets and the required information are all provided, and the team with the best score wins. Some competitions have two stages. In these, the participants are first evaluated on the initial dataset. Those who successfully complete the first stage move to Stage 2. The team with the best score in Stage 2 is the winner.
Making best use of Kaggle
Kaggle competitions are often compared with real-world data science problems. But my question is, why do we need to compare Kaggle with real-world data science problems? Why can’t we focus on the benefits and get the best out of Kaggle, instead of waiting for a perfect real-world problem to work on and learn from?
I have personally gained so much from this platform. Based on my experience, I am going to share some tips and techniques that can help you make the best use of it to learn data science and gain tangible benefits.
Focus on learning
The first and most important thing on Kaggle is to make sure the focus is on learning, at least when you are getting started. Kaggle is a highly gamified platform. It is very easy to get stuck in the loop of chasing a better rank.
The focus should be on learning new things from this platform. There are some things to learn on Kaggle,
The algorithms that are best suited to a certain problem or dataset
Data transformation required for algorithms
Techniques to increase or decrease the number of features
Best visualization to capture the insights and trends clearly
Most commonly used libraries and packages, and why they are used
Learn from others
This platform hosts many leading data scientists among its more than 5 million registered users, which makes it possible to learn from the experts. I am not sure how many other industries offer this level of openness. There are notebooks and discussion forums that can be used for learning.
Learn from previous competitions and create templates
All the past competitions and the submitted solutions are available. In many cases, the winning solutions are also shared. These are very helpful for learning about the techniques and approaches that help a solution stand out. A solution that produced good results previously can be packaged into a template and reused on similar problems.
Follow discussion forums
The discussion forums are the best place to learn new things, especially the technical aspects of data science. Some of the things that generally get discussed are,
Errors and the possible solutions
Data quality issues and how to resolve them
Approaches and techniques to improve performance
What don’t you learn from Kaggle?
It is important to know what one can learn on Kaggle. It is equally important to understand what one can’t learn there, so that other means can be sought to learn and excel at those skills. For example,
In Kaggle problems, the datasets are in a structured format. In most real-world scenarios the data science team collects the data. The attributes required are generally identified by talking to several business stakeholders.
Kaggle datasets are generally clean. Hence there isn’t enough opportunity to learn about data cleaning techniques.
Accuracy on the leaderboard metric is the only parameter used to select the winning solution. In real-world problems, there are several other parameters and metrics to identify the best solution. The $1 million Netflix Prize is a very good example, although Netflix ran that competition itself rather than on Kaggle. The winning solution was more accurate than Netflix’s own algorithm. But still, Netflix chose not to use it, because the engineering cost involved in implementing it was much higher than the incremental benefit produced.
Real-world projects are the best way to gain experience in these areas. Freelancing is one way to get exposure, and working for non-profit organizations like DataKind is another.
If you want to take your data science skills to the next level, check the below video from my YouTube channel.
To stay connected
If you like this article and are interested in similar ones, follow me on Medium. Become a Medium member for access to thousands of articles related to career, money, and much more.