Akshay Ravindran

Summary

The article outlines a structured approach to learning Data Science in 2022, emphasizing the importance of practical application and continuous learning.

Abstract

The article "How would I learn Data Science (If I had to Start Over in 2022)" provides a comprehensive guide for beginners in the field. It suggests starting with programming in Python or R, understanding basic statistics, and mastering data retrieval and visualization. The author stresses the significance of data preprocessing to improve model efficiency and delves into various machine learning models, both supervised and unsupervised. Special attention is given to niches within machine learning, such as NLP and image classification, and the importance of cloud architecture for deploying models. The article also lists essential Python packages for data science and concludes by encouraging readers to embrace the challenging yet rewarding journey of becoming a data scientist.

Opinions

  • The author believes that a structured learning approach can significantly reduce the time required to master data science.
  • Python is preferred over R for those already familiar with Java or C++, while R is recommended for those with a background in MATLAB.
  • Practical project work is considered the fastest and most effective way to learn data science.
  • Understanding the quality of data is more crucial than the quantity for successful machine learning models.
  • The article suggests that the choice of algorithms and preprocessing steps should be tailored to specific use cases.
  • Cloud platforms like AWS and Google Cloud are recognized for simplifying the process of deploying machine learning models into production.
  • The author emphasizes the importance of continuous learning and staying updated with the latest algorithms and models in the ever-evolving field of data science.

How would I learn Data Science (If I had to Start Over in 2022)

Photo by Myriam Jessier on Unsplash

Introduction

First of all, Data Science is an ever-evolving field in which you have to keep updating yourself to stay relevant. New algorithms emerge, state-of-the-art vectorization models come up, and older algorithms that were once neglected suddenly rise again. This is also the fun part of being in this challenging field. The following is the structure I would recommend to myself if I were starting as a beginner.

I learned all of these topics in a jumbled order, which only made my learning curve steeper. With this structure, hopefully you can shave a few weeks off your learning time.

Topics that are going to be covered

  1. Programming
  2. Basic Statistics
  3. Data Retrieval
  4. Data Visualization
  5. Data Preprocessing
  6. Machine Learning
  7. Niches in ML
  8. Cloud Architecture
  9. Python Packages

1) Programming

Photo by Hitesh Choudhary on Unsplash

When it comes to which programming language to use for Data Science, Python and R are the prime candidates. I prefer Python over R because I was already familiar with Java and C++. If you are a software engineer who uses Java or C++, it is easier to pick up Python. But if you are, say, a mechatronics engineer or a biologist who is familiar with MATLAB, it is easier for you to pick up R.

The paramount thing is to choose one of the two languages (Python or R) and start working on projects. Projects give you practical knowledge, which is the easiest and fastest way to learn.

2) Basic Statistics

Photo by Naser Tamimi on Unsplash

Math and Data Science go hand in hand. A basic understanding of statistics and probability is required to better understand the data you have in hand, and to decide which kind of algorithm is suitable based on your findings.

Don’t panic: in the beginning, you don’t have to know all the complex mathematical equations that work behind the scenes. You can get started with a few foundational statistics topics: probability, conditional probability, measures of central tendency (mean, median, mode), range, and standard deviation. You should also learn what data skew means and the common distributions (Gaussian, Binomial, Bernoulli).
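To make these terms concrete, here is a minimal sketch using only Python's built-in statistics module. The numbers are made up for illustration, and the skew check (mean greater than median) is a rough rule of thumb rather than a formal test.

```python
import statistics

# Hypothetical sample: daily sales counts (made-up numbers for illustration).
sales = [12, 15, 15, 18, 20, 22, 95]

mean = statistics.mean(sales)
median = statistics.median(sales)
mode = statistics.mode(sales)
value_range = max(sales) - min(sales)
std_dev = statistics.stdev(sales)  # sample standard deviation

# A mean well above the median hints at right (positive) skew,
# here caused by the single large value 95.
is_right_skewed = mean > median

# Conditional probability: P(rain | cloudy) = P(rain and cloudy) / P(cloudy).
# The day counts below are hypothetical.
total_days = 100
cloudy_days = 40
rainy_and_cloudy_days = 10
p_rain_given_cloudy = (rainy_and_cloudy_days / total_days) / (cloudy_days / total_days)

print(f"mean={mean:.2f} median={median} mode={mode}")
print(f"range={value_range} std_dev={std_dev:.2f} right_skewed={is_right_skewed}")
print(f"P(rain | cloudy)={p_rain_given_cloudy:.2f}")
```

Notice how one extreme value pulls the mean far away from the median. That gap is often the first sign that a column is skewed.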

3) Data Retrieval

Once you have solid foundations, the essential next part of Data Science is retrieving the data that you want to analyze. Kaggle provides many datasets that you can download directly and start analyzing.
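Many downloadable datasets come as CSV files. As a minimal sketch, assuming a CSV with made-up columns (app, rating, review), loading one takes only the standard library. The inline text below stands in for a downloaded file, and the file name in the comment is hypothetical.

```python
import csv
import io

# Stand-in for a downloaded file. With a real file you would use something like
# open("reviews.csv", newline="") instead (hypothetical file name).
raw_csv = io.StringIO(
    "app,rating,review\n"
    "NotesApp,5,Great app\n"
    "NotesApp,2,Crashes often\n"
    "MapApp,4,Useful offline maps\n"
)

rows = list(csv.DictReader(raw_csv))

# csv reads every field as text, so convert numeric columns explicitly.
for row in rows:
    row["rating"] = int(row["rating"])

average_rating = sum(row["rating"] for row in rows) / len(rows)
print(len(rows), "rows, average rating", round(average_rating, 2))
```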

In the real world, we have to get the data on our own. We either have to scrape it from different websites, or use the API that some websites (such as Twitter) provide, which lets you download the data and customize what you retrieve.

With Python, there are also open-source libraries you can take advantage of to extract data (for example, a Google Play scraper, which can be used to get the reviews for a particular app or a list of apps).

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — -

A short segue into Jupyter Notebooks: Jupyter is a web-based interactive computing platform. I consider it the best tool for a data scientist to have, since you can interact with the data cell by cell. You can visualize the data and derive conclusions, all with the click of a button (or Shift + Enter).

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

4) Data Visualization

A picture is worth a thousand words. Following that adage, representing your data in visual forms such as graphs, charts, and maps is the finest way to understand it. Visualization can also aid the data cleaning process, where it helps you spot and remove unwanted data (noise). Depending on where you use it, it can help with feature selection, data cleaning, and data extraction.

Photo by Luke Chesser on Unsplash

Some visualization techniques you can learn are:

  • Distribution Plot
  • Line Plot
  • Bar Plot
  • Scatter Plot
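A distribution plot boils down to counting how many values fall into each bin. In practice you would draw it with a plotting library, but a text-only sketch using the standard library shows the idea. The ages and the bin width are made up for illustration.

```python
from collections import Counter

# Hypothetical ages, made up for illustration.
ages = [21, 22, 22, 25, 27, 31, 33, 34, 35, 38, 41, 45, 67]
bin_width = 10

# Assign each value to the lower edge of its bin, e.g. 27 -> 20.
bins = Counter((age // bin_width) * bin_width for age in ages)

# Print one row per non-empty bin, with one '#' per value in that bin.
for lower in sorted(bins):
    label = f"{lower}-{lower + bin_width - 1}"
    print(f"{label:>7} | {'#' * bins[lower]}")
```

Even in this crude form, the lone value in the 60-69 bin stands out as a possible outlier worth checking.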

5) Data Preprocessing

Usually, when you are new to the field you will hear:

“One can never have enough data”

In some sense it is true. But the quality of the dataset is significantly more important than the quantity of data that you have.

A good data scientist will be able to improve the efficiency of an ML model by improving the quality of the data before suggesting an increase in the amount of data required.

To improve the quality of a dataset, you must learn about different data cleaning and data preprocessing techniques. You have to understand:

  • Why you impute null data.
  • What happens when you are dealing with skewed data.
  • Why you need to remove stop words in NLP.
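The first two questions are easier to answer with a small experiment. Below is a minimal sketch, with a made-up income column, that fills missing values with the median. Choosing the median over the mean is one common option for skewed data, not the only valid one.

```python
import statistics

# Hypothetical income column with missing entries (None) and one extreme value.
incomes = [32000, 35000, None, 41000, 38000, None, 250000]

observed = [value for value in incomes if value is not None]
mean_value = statistics.mean(observed)
median_value = statistics.median(observed)

# The median is less sensitive to the outlier than the mean,
# so it is often a safer fill value when the data is skewed.
imputed = [median_value if value is None else value for value in incomes]

print("mean:", mean_value, "median:", median_value)
print(imputed)
```

Here the mean is dragged up by the single extreme income, so filling the gaps with it would invent two unusually high earners. The median stays close to the typical value.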

Common Techniques in Data Preprocessing:

Numerical dataset: summarizing numerical values by their mean, median, and standard deviation; imputing null values; removing null values.

Text dataset: removing URLs, stop words (the, of, a, are), punctuation, and emoticons.
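The text cleaning steps above can be sketched in a few lines of standard-library Python. The stop-word set here is a tiny illustrative list and the URL pattern is deliberately simple; real projects usually rely on a fuller stop-word list from an NLP library.

```python
import re
import string

# A tiny illustrative stop-word list; real projects use a fuller one.
STOP_WORDS = {"the", "of", "a", "are", "is", "and", "to", "at"}


def clean_text(text):
    """Lowercase the text, then drop URLs, punctuation, and stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return [word for word in text.split() if word not in STOP_WORDS]


tokens = clean_text("The reviews of the app are great! See https://example.com/reviews")
print(tokens)
```

The order matters: URLs are removed before punctuation, because stripping punctuation first would break the URL into pieces that no longer match the pattern.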

6) Machine Learning

Photo by Markus Winkler on Unsplash

By now you have spotted the anomalies and the correlations between variables in the previous stages, which completes the historical analysis. Next, you move on to predicting the future based on past trends. This is where Machine Learning comes into play.

Machine Learning has deeper subtopics of its own: feature extraction, feature selection, the algorithm itself (supervised vs. unsupervised), optimization of the model’s hyperparameters, and ensemble techniques. I will dive deep into the ML lifecycle in the next post.

As a start, go through the supervised and unsupervised ML models that can be implemented in a few lines of code.

A side exercise is to understand the logic behind the models, and then try to recreate a model you have learned using only basic Python elements.

Supervised ML models: Linear Regression, Random Forest Classifier, Naïve Bayes, Support Vector Machines, etc.
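As an example of recreating a model with basic Python, here is a minimal sketch of simple linear regression with one input variable, fitted by ordinary least squares. The function name fit_line and the data are my own for illustration; libraries such as scikit-learn do this, and much more, for you.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept  # predict y for an unseen x = 6
print(slope, intercept, prediction)
```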

Unsupervised ML models: K-means Clustering, Hierarchical Clustering, Principal Component Analysis, etc.
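On the unsupervised side, K-means is a good candidate for the same from-scratch exercise. The sketch below is a bare-bones version for 2D points, assuming a fixed number of iterations and random starting centroids. The function name k_means, its parameters, and the points are made up for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two 2D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def k_means(points, k, iterations=10, seed=0):
    """Plain k-means on 2D points; returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            distances = [squared_distance(point, c) for c in centroids]
            clusters[distances.index(min(distances))].append(point)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl)) if cl else c
            for cl, c in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical points forming two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, k=2)
print(sorted(centroids))
```

No labels are given to the algorithm; it discovers the two groups purely from the distances between points, which is what makes it unsupervised.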

7) Niches in ML

Having understood the ML models, you can now delve deeper into the different subgenres of ML. This is where it gets interesting. Across different use cases, the algorithms or preprocessing steps you have learned so far might be useful or might become totally irrelevant. It depends entirely on the use case.

For example, summarizing numerical data by its mean, sum, min, and max can yield a better understanding of the data.

If you apply the same approach to a text-based dataset, it might not work, since a single word can change the context.
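For the numerical case, I read this as grouping rows by a key and summarizing each group, which is one interpretation. A minimal sketch with made-up order data:

```python
from collections import defaultdict

# Hypothetical (region, order amount) rows, made up for illustration.
orders = [("north", 120), ("south", 80), ("north", 200), ("south", 60), ("north", 100)]

# Collect the amounts for each region.
amounts_by_region = defaultdict(list)
for region, amount in orders:
    amounts_by_region[region].append(amount)

# Summarize each group with the statistics mentioned above.
summary = {
    region: {
        "mean": sum(values) / len(values),
        "sum": sum(values),
        "min": min(values),
        "max": max(values),
    }
    for region, values in amounts_by_region.items()
}

for region, stats in summary.items():
    print(region, stats)
```

There is no sensible "mean" of the words in a review, which is why text needs its own preprocessing and representation.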

Generally, NLP (Natural Language Processing) and image classification are the predominant niches beyond numerical data.

For NLP, you can learn how to vectorize text, using embeddings such as GloVe, fastText, and ELMo. Vectorizing text into numerical data enables faster calculations and reduces storage space.
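GloVe, fastText, and ELMo are learned embeddings and need pretrained models, so they do not fit in a short snippet. To show what "turning text into numbers" means at its simplest, here is a sketch of a bag-of-words vectorizer instead, a much more basic technique that only counts words. The function names and documents are made up for illustration.

```python
def build_vocabulary(documents):
    """Map each distinct word to a column index, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            vocabulary.setdefault(word, len(vocabulary))
    return vocabulary


def vectorize(document, vocabulary):
    """Count how often each vocabulary word occurs in the document."""
    vector = [0] * len(vocabulary)
    for word in document.lower().split():
        if word in vocabulary:  # words outside the vocabulary are ignored
            vector[vocabulary[word]] += 1
    return vector


# Hypothetical mini-corpus of app reviews.
documents = ["good app", "bad app", "good good update"]
vocabulary = build_vocabulary(documents)
vectors = [vectorize(document, vocabulary) for document in documents]

print(vocabulary)
print(vectors)
```

Unlike this word count, learned embeddings place words with similar meaning close to each other, which is why they are worth studying next.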

Image classification introduces you to CNNs, RNNs, the OpenCV package, and so on.

8) Cloud Architecture

With the rise of ML, AWS and Google Cloud have come up with their own architectures, which make it easier to train a model and deploy it into production without much hassle. Google Cloud’s AutoML and Amazon SageMaker offer state-of-the-art approaches for doing this.

In the earlier days, data scientists had to keep track of the metadata of the models that were pushed into production. This was troublesome: once you have pushed multiple versions of the same model into production, rolling back means you suddenly have to keep track of:

  • the version number
  • the hyperparameters
  • what data the model was trained on
  • what its metrics were (accuracy, precision, recall), etc.

This can be overcome with MLflow, which keeps track of everything mentioned above in the form of artifacts. You can simply select the artifact you previously deployed and get the model back.
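To see what such a tracking tool has to remember, here is a toy, in-memory stand-in written in plain Python. It is not MLflow's API: the ModelRun and RunRegistry classes, their fields, and the file names are all hypothetical, and a real tracker also stores the model files themselves.

```python
from dataclasses import dataclass


@dataclass
class ModelRun:
    """Metadata for one trained model version (hypothetical structure)."""
    version: int
    hyperparameters: dict
    training_data: str
    metrics: dict


class RunRegistry:
    """Toy in-memory registry; a real tracker persists this to disk or a server."""

    def __init__(self):
        self._runs = {}

    def log(self, run):
        self._runs[run.version] = run

    def get(self, version):
        return self._runs[version]

    def best(self, metric):
        # Return the run with the highest value for the given metric.
        return max(self._runs.values(), key=lambda run: run.metrics[metric])


registry = RunRegistry()
registry.log(ModelRun(1, {"max_depth": 5}, "reviews_2021.csv", {"accuracy": 0.91}))
registry.log(ModelRun(2, {"max_depth": 12}, "reviews_2022.csv", {"accuracy": 0.87}))

# Version 2 performed worse, so look up what version 1 was trained with.
rollback_target = registry.best("accuracy")
print(rollback_target.version, rollback_target.hyperparameters, rollback_target.training_data)
```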

Extras

Python packages that are a must-know if you are going to be a data scientist:

Conclusion

Data Science is one of the most interesting fields in Computer Science. It is challenging, and it is worthwhile to stay in this field. You now know the subtopics that make up a data scientist’s skill set. I hope this article gave you an understanding of how to structure your learning for your journey into Data Science in 2022. Feel free to share your thoughts and feedback. Thanks.
