
Data Science, Education
Learning Data Science Has Never Been Easier
In the modern age of information technology, an enormous number of free resources are available for data science self-study
I. Introduction
In this article, I will discuss several resources that can help you master the foundations of data science. In the modern age of information technology, an enormous number of free resources are available for data science self-study. In fact, you can design your own data science curriculum from the many resources available.
II. Resources for Data Science Self-Study
1. Massive Open Online Courses (MOOCs)
The rising demand for data science practitioners has led to a proliferation of massive open online courses (MOOCs). The most popular MOOC providers include the following:
a) edX: https://www.edx.org/
b) Coursera: https://www.coursera.org/
c) DataCamp: https://www.datacamp.com/
d) Udemy: https://www.udemy.com/
e) Udacity: https://www.udacity.com/
f) Lynda: https://www.lynda.com/
If you plan to take one of these courses, keep in mind that some MOOCs are 100% free, while others require a fee (anywhere from $50 to $200 per course or more, depending on the platform). Gaining expertise in any discipline requires an enormous amount of time and energy, so do not rush. If you decide to enroll in a course, be ready to complete the entire course, including all assignments and homework. Some of the quizzes and homework assignments will be quite challenging. However, if you don’t challenge yourself, you won’t grow in your knowledge and skills.
Having completed many data science MOOCs myself, I list below 3 of my favorite data science specializations.
(i) Professional Certificate in Data Science (HarvardX, through edX)
Includes the following courses, all taught using R (you can audit courses for free or purchase a verified certificate):
- Data Science: R Basics;
- Data Science: Visualization;
- Data Science: Probability;
- Data Science: Inference and Modeling;
- Data Science: Productivity Tools;
- Data Science: Wrangling;
- Data Science: Linear Regression;
- Data Science: Machine Learning;
- Data Science: Capstone
(ii) Analytics: Essential Tools and Methods (Georgia Tech, through edX)
Includes the following courses, all taught using R, Python, and SQL (you can audit for free or purchase a verified certificate):
- Introduction to Analytics Modeling;
- Introduction to Computing for Data Analysis;
- Data Analytics for Business.
(iii) Applied Data Science with Python Specialization (the University of Michigan, through Coursera)
Includes the following courses, all taught using Python (you can audit most courses for free; some require the purchase of a verified certificate):
- Introduction to Data Science in Python;
- Applied Plotting, Charting & Data Representation in Python;
- Applied Machine Learning in Python;
- Applied Text Mining in Python;
- Applied Social Network Analysis in Python.
2. Learning from a Textbook
Learning from a textbook provides more refined and in-depth knowledge than you get from online courses. This book provides a great introduction to data science and machine learning, with code included: “Python Machine Learning”, by Sebastian Raschka. https://github.com/rasbt/python-machine-learning-book-3rd-edition

The author explains fundamental concepts in machine learning in a way that is very easy to follow. The code is included, so you can use it to practice and build your own models. I have personally found this book to be very useful in my journey as a data scientist, and I would recommend it to any data science aspirant. All you need to understand the book is basic linear algebra and programming skills.
There are lots of other excellent data science textbooks out there such as “Python for Data Analysis” by Wes McKinney, “Applied Predictive Modeling” by Kuhn & Johnson, “Data Mining: Practical Machine Learning Tools and Techniques” by Ian H. Witten, Eibe Frank & Mark A. Hall, and so on.
3. Medium
Medium is now considered one of the fastest-growing platforms for learning about data science. If you are interested in using this platform for data science self-study, the first step is to create a Medium account. You can create a free account or a member account. With a free account, there are limits on the number of member articles that you can access per month. A member account requires a subscription fee of $5 per month or $50 per year. Find out more about becoming a Medium member here: https://medium.com/membership. With a member account, you have unlimited access to Medium articles and publications.
The 2 top data science publications on Medium are Towards Data Science and Towards AI. Every day, new articles are published on Medium covering topics such as data science, machine learning, data visualization, programming, and artificial intelligence. Using the search tool on the Medium website, you can find articles and tutorials covering a wide variety of data science topics, from basic to advanced concepts.
4. KDnuggets Website
KDnuggets is a leading site on AI, Analytics, Big Data, Data Mining, Data Science, and Machine Learning. On the website, you can find important educational tools and resources in data science as well as tools for professional development:
- Blog/News
- Opinions
- Tutorials
- Top stories
- Companies
- Courses
- Datasets
- Education
- Events (online)
- Jobs
- Software
- Webinars
5. GitHub
GitHub contains several tutorials and projects on data science and machine learning. Besides being an excellent resource for data science education, GitHub is also an excellent platform for portfolio building. For more information on creating a data science portfolio on GitHub, please see the following article: “A Data Science Portfolio is More Valuable than a Resume”.
6. LinkedIn
Because data science is an ever-evolving field, driven by technological innovation and the development of new algorithms, one way to stay current is to join a network of data science professionals. LinkedIn is an excellent platform for networking. There are several data science groups and organizations on LinkedIn that you can join, such as Towards AI, DataScienceHub, Towards Data Science, and KDnuggets. You can also follow top leaders in the field on this platform.
7. YouTube
YouTube contains several educational videos and tutorials that can teach you the essential math and programming skills required in data science, as well as several data science tutorials for beginners. A simple search would generate several video tutorials and lectures.
8. Khan Academy
Khan Academy is also a great website for learning the basic math, statistics, calculus, and linear algebra skills required in data science.
III. Sample Curriculum for Introductory Data Science Self-Study
Now that we have discussed several resources for data science education, it is only natural that you ask the following questions if you are considering data science:
Where to begin your journey?
What courses to take and in what order?
The answer to these questions varies from individual to individual. Generally, individuals with a quantitative background such as physics, mathematics, engineering, computer science, or accounting have an advantage because they already have the math skills required in data science.
If you are new to data science, a recommended curriculum for self-study is provided below. These are the essential topics that you need to complete to become competent in data science.
1. Math Basics
(I) Multivariable Calculus
Most machine learning models are built with a dataset having several features or predictors. Hence familiarity with multivariable calculus is extremely important for building a machine learning model. Here are the topics you need to be familiar with:
- Functions of several variables
- Derivatives and gradients
- Step function, Sigmoid function, Logit function, ReLU (Rectified Linear Unit) function
- Cost function
- Plotting of functions
- Minimum and Maximum values of a function
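As a rough illustration of the topics above, the following sketch defines the step, sigmoid, logit, and ReLU functions and approximates the gradient of a function of two variables. It uses only the Python standard library; the example function `f` and the helper `numerical_gradient` are hypothetical names chosen for this illustration, not part of any library.

```python
import math

def step(z):
    """Step function: 1 for non-negative inputs, 0 otherwise."""
    return 1.0 if z >= 0 else 0.0

def sigmoid(z):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse of the sigmoid: the log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

def relu(z):
    """Rectified Linear Unit: passes positive values, clips negatives to 0."""
    return max(0.0, z)

def f(x, y):
    """Hypothetical function of two variables with its minimum at (1, -2)."""
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

def numerical_gradient(func, x, y, h=1e-6):
    """Central-difference approximation of the gradient (df/dx, df/dy)."""
    dfdx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    dfdy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return dfdx, dfdy

# The analytic gradient of f is (2(x - 1), 2(y + 2)), i.e. (-2, 4) at the origin.
grad_at_origin = numerical_gradient(f, 0.0, 0.0)
# At the minimum (1, -2) the gradient vanishes.
grad_at_minimum = numerical_gradient(f, 1.0, -2.0)
```

Comparing `grad_at_origin` with the analytic gradient is a quick way to check a derivative worked out by hand.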
(II) Linear Algebra
Linear algebra is the most important math skill in machine learning. A dataset is represented as a matrix. Linear algebra is used in data preprocessing, data transformation, and model evaluation. Here are the topics you need to be familiar with:
- Vectors
- Matrices
- Transpose of a matrix
- The inverse of a matrix
- The determinant of a matrix
- Dot product
- Eigenvalues
- Eigenvectors
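To make these topics concrete, here is a minimal sketch of the listed operations on a small 2x2 matrix, written in plain Python so that each step is visible. In practice you would probably use NumPy instead; the helper names below (`transpose`, `det2`, `inverse2`, and so on) are assumptions made for this illustration only.

```python
import math

def transpose(m):
    """Swap rows and columns of a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]

def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))

def matmul(a, b):
    """Matrix product: entry (i, j) is the dot product of row i and column j."""
    bt = transpose(b)
    return [[dot(row, col) for col in bt] for row in a]

def det2(m):
    """Determinant of a 2x2 matrix."""
    (a, b), (c, d) = m
    return a * d - b * c

def inverse2(m):
    """Inverse of a 2x2 matrix; fails if the determinant is zero."""
    (a, b), (c, d) = m
    det = det2(m)
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]

def eigenvalues2_symmetric(m):
    """Eigenvalues of a symmetric 2x2 matrix from its characteristic polynomial."""
    (a, b), (_, d) = m
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return mean - radius, mean + radius

A = [[2.0, 1.0], [1.0, 2.0]]
identity_check = matmul(A, inverse2(A))        # approximately the identity matrix
small, large = eigenvalues2_symmetric(A)       # eigenvalues of A
# [1, 1] is an eigenvector of A: multiplying by A scales it by the larger eigenvalue.
scaled = [dot(row, [1.0, 1.0]) for row in A]
```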
(III) Optimization Methods
Most machine learning algorithms perform predictive modeling by minimizing an objective function, thereby learning the weights that must be applied to the testing data in order to obtain the predicted labels. Here are the topics you need to be familiar with:
- Cost function/Objective function
- Likelihood function
- Error function
- Gradient Descent Algorithm and its variants (e.g., Stochastic Gradient Descent Algorithm)
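The idea of learning weights by minimizing a cost function can be sketched with batch gradient descent on a mean squared error cost for a one-feature linear model. This is only an illustration under simple assumptions: the toy data, the learning rate, and the number of steps are made up, and `gradient_descent` is a hypothetical helper rather than a library function.

```python
def mse(w, b, xs, ys):
    """Cost function: mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_descent(xs, ys, learning_rate=0.05, n_steps=2000):
    """Minimize the MSE cost by repeatedly stepping against its gradient."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))  # d(cost)/dw
        grad_b = 2.0 / n * sum(errors)                              # d(cost)/db
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (an assumption made for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
final_cost = mse(w, b, xs, ys)
```

Stochastic gradient descent follows the same update rule but estimates the gradient from one sample (or a small batch) at a time instead of the full dataset.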
Resources: YouTube; Khan Academy
2. Programming Basics
Python and R are considered the top programming languages for data science. Python is widely adopted by industries and academic training programs. As a beginner, it is recommended that you focus on one language only.
Here are some Python and R basics topics to master:
- Basic R syntax
- Foundational R programming concepts such as data types, vector arithmetic, indexing, and data frames
- How to perform operations in R including sorting, data wrangling using dplyr, and data visualization with ggplot2
- RStudio
- Object-oriented programming aspects of Python
- Jupyter notebooks
- Python libraries such as NumPy, pylab, seaborn, matplotlib, pandas, scikit-learn, TensorFlow, PyTorch
Resources: Stack Overflow, Codecademy, Medium, YouTube
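As a small taste of the object-oriented side of Python, the sketch below defines a class that stores state learned from data and then applies it. The class name `SimpleStandardScaler` and its `fit`/`transform` methods are hypothetical; they loosely mirror a convention common in Python data libraries but are not any library's actual implementation.

```python
import statistics

class SimpleStandardScaler:
    """Hypothetical teaching class: rescales values to zero mean and unit variance."""

    def __init__(self):
        # Attributes are filled in by fit(); None means "not fitted yet".
        self.mean_ = None
        self.std_ = None

    def fit(self, values):
        """Learn the mean and population standard deviation from the data."""
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self  # returning self allows chaining: Scaler().fit(data)

    def transform(self, values):
        """Apply the learned scaling; assumes the fitted data was not constant."""
        if self.mean_ is None:
            raise RuntimeError("call fit() before transform()")
        return [(v - self.mean_) / self.std_ for v in values]

data = [2.0, 4.0, 6.0, 8.0]
scaler = SimpleStandardScaler().fit(data)
scaled = scaler.transform(data)
```

The same object can later be reused to transform new values with the mean and standard deviation learned from the original data.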
3. Data Basics
Learn how to manipulate data in various formats, for example, CSV files, PDF files, and text files. Learn how to clean data, impute data, scale data, import and export data, and scrape data from the internet. Some packages of interest are pandas, NumPy, pdftools, and stringr. Additionally, R and Python contain several built-in datasets that can be used for practice. Learn data transformation and dimensionality reduction techniques such as the covariance matrix plot, principal component analysis (PCA), and linear discriminant analysis (LDA).
Resources: DataCamp, edX, Coursera
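A minimal sketch of importing, imputing, and exporting CSV data is shown below, using only the Python standard library. The tiny dataset and the helper names (`load_rows`, `impute_median`, `to_csv`) are invented for this illustration; in practice a library such as pandas would usually do this work.

```python
import csv
import io
import statistics

# Made-up CSV content with two missing values (empty fields).
raw_csv = """name,age,income
Ann,34,52000
Bob,,61000
Cy,29,
Di,41,58000
"""

def load_rows(text):
    """Import: parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))

def impute_median(rows, column):
    """Impute: replace empty entries in a numeric column with the column median."""
    observed = [float(r[column]) for r in rows if r[column] != ""]
    median = statistics.median(observed)
    for r in rows:
        r[column] = float(r[column]) if r[column] != "" else median
    return rows

def to_csv(rows):
    """Export: write the cleaned rows back to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

rows = load_rows(raw_csv)
for col in ("age", "income"):
    impute_median(rows, col)
cleaned_csv = to_csv(rows)
```

Median imputation is only one of several possible choices; the mean or a model-based estimate may suit other datasets better.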
4. Probability and Statistics Basics
Statistics and probability are used for visualization of features, data preprocessing, feature transformation, data imputation, dimensionality reduction, feature engineering, model evaluation, etc. Here are the topics you need to be familiar with:
- Mean
- Median
- Mode
- Standard deviation/variance
- Correlation coefficient and the covariance matrix
- Probability distributions (Binomial, Poisson, Normal)
- p-value
- Bayes’ Theorem (Precision, Recall, Positive Predictive Value, Negative Predictive Value, Confusion Matrix, ROC Curve)
- A/B Testing
- Monte Carlo Simulation
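Two of the topics above, Bayes’ theorem and Monte Carlo simulation, can be combined in one short sketch: compute the positive predictive value of a diagnostic test exactly, then check it by simulation. The prevalence, sensitivity, and specificity values are made-up numbers, and both function names are hypothetical.

```python
import random

def posterior(prior, sensitivity, specificity):
    """Bayes' theorem: P(condition | positive test), the positive predictive value."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)

def simulate_posterior(prior, sensitivity, specificity, n_trials=200_000, seed=0):
    """Monte Carlo estimate of the same quantity from simulated test results."""
    rng = random.Random(seed)
    positives = 0
    true_positives = 0
    for _ in range(n_trials):
        has_condition = rng.random() < prior
        p_positive = sensitivity if has_condition else 1.0 - specificity
        if rng.random() < p_positive:
            positives += 1
            true_positives += has_condition  # True counts as 1
    return true_positives / positives

# Assumed values: 1% prevalence, 95% sensitivity, 90% specificity.
exact = posterior(prior=0.01, sensitivity=0.95, specificity=0.90)
estimate = simulate_posterior(0.01, 0.95, 0.90)
```

With these assumed numbers, the exact value is below 10% even though the test is fairly accurate, because the condition is rare; the simulated estimate should land close to the exact one.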
Resources: YouTube, Khan Academy, edX, Coursera, DataCamp
5. Data Visualization Basics
Learn the essential components of a good data visualization. A good data visualization is made up of several components that have to be pieced together to produce the end product:
a) Data Component: An important first step in deciding how to visualize data is to know what type of data it is, e.g., categorical data, discrete data, continuous data, time-series data, etc.
b) Geometric Component: Here is where you decide what kind of visualization is suitable for your data, e.g., scatter plot, line graphs, bar plots, histograms, Q-Q plots, smooth densities, boxplots, pair plots, heatmaps, etc.
c) Mapping Component: Here, you need to decide what variable to use as your x-variable and what to use as your y-variable. This is important, especially when your dataset is multi-dimensional with several features.
d) Scale Component: Here, you decide what kind of scales to use, e.g., linear scale, log scale, etc.
e) Labels Component: This includes things like axes labels, titles, legends, font size to use, etc.
f) Ethical Component: Here, you want to make sure your visualization tells the true story. You need to be aware of your actions when cleaning, summarizing, manipulating, and producing a data visualization and ensure you aren’t using your visualization to mislead or manipulate your audience.
Important data visualization tools include Python’s matplotlib and seaborn packages, and R’s ggplot2 package.
Resources: edX, Coursera, DataCamp, Medium
6. Linear Regression Basics
Learn the fundamentals of simple and multiple linear regression analysis. Linear regression is used for supervised learning with continuous outcomes. Some tools for performing linear regression are given below:
Python: NumPy, pylab, scikit-learn
R: caret package
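Before reaching for these packages, it can help to see simple linear regression written out by hand. The sketch below fits a line by ordinary least squares using the closed-form formulas for the slope and intercept, then reports R-squared. The data points are invented for illustration and the function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one predictor)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Fraction of the variance in y explained by the fitted line."""
    y_mean = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Made-up data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = fit_simple_linear_regression(xs, ys)
r2 = r_squared(xs, ys, slope, intercept)
```

Multiple linear regression extends the same idea to several predictors, where the closed-form solution is usually written with matrices.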
Resources: edX, Coursera, DataCamp, Medium
7. Machine Learning Basics
a) Supervised Learning (Continuous Variable Prediction)
- Basic regression
- Multiple regression analysis
- Regularized regression
b) Supervised Learning (Discrete Variable Prediction)
- Logistic Regression Classifier
- Support Vector Machine (SVM) Classifier
- K-nearest neighbor (KNN) Classifier
- Decision Tree Classifier
- Random Forest Classifier
- Naive Bayes
c) Unsupervised Learning
- K-means clustering algorithm
Python tools for machine learning: scikit-learn, PyTorch, TensorFlow.
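To show what one of the classifiers listed above does internally, here is a minimal sketch of a K-nearest neighbor (KNN) classifier in plain Python. It is a teaching illustration, not how scikit-learn implements KNN; the function name `knn_predict` and the toy points and labels are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Two made-up clusters of 2-D points with class labels "a" and "b".
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
                (6.0, 6.5), (7.0, 6.0), (6.5, 7.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction_low = knn_predict(train_points, train_labels, (1.8, 1.2))
prediction_high = knn_predict(train_points, train_labels, (6.4, 6.2))
```

The choice of `k` is a hyperparameter: small values follow the training data closely, while larger values give smoother decision boundaries.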
Resources: DataCamp, edX, Coursera, Medium
8. Time Series Analysis Basics
Time series analysis is used for predictive modeling in cases where the outcome is time-dependent, e.g., predicting stock prices. There are 3 basic methods for analyzing time-series data:
- Exponential Smoothing
- ARIMA (Auto-Regressive Integrated Moving Average), which is a generalization of exponential smoothing
- GARCH (Generalized Auto Regressive Conditional Heteroskedasticity), which is an ARIMA-like model for analyzing variance.
These 3 techniques can be implemented in Python and R.
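As an example of the simplest of the three, the sketch below implements simple exponential smoothing in plain Python. The series values and the smoothing factor `alpha` are made up, and initializing the smoothed series with the first observation is one common convention among several.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: each smoothed value blends the new
    observation with the previous smoothed value, weighted by alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: start from the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1.0 - alpha) * smoothed[-1])
    return smoothed

series = [10.0, 12.0, 11.0, 15.0]
smoothed = exponential_smoothing(series, alpha=0.5)
# The one-step-ahead forecast is the last smoothed value.
forecast = smoothed[-1]
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller `alpha` gives more weight to history.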
Resources: edX, Coursera, Medium
9. Productivity Tools Basics
Knowing how to use basic productivity tools such as RStudio, Jupyter Notebook, and GitHub is essential. For Python, Anaconda is the best productivity tool to install. Advanced tools such as the AWS and Azure cloud platforms are also important to learn.
10. Data Science Project Planning Basics
Learn the basics of how to plan a project. Before building any machine learning model, it is important to sit down and carefully plan what you want your model to accomplish. Before you start writing code, make sure you understand the problem to be solved, the nature of the dataset, the type of model to build, and how the model will be trained, tested, and evaluated. Project planning and project organization are essential for increasing productivity when working on a data science project. Some resources for project planning and organization are provided below.
IV. Summary and Conclusion
In summary, we have discussed several resources for data science self-study. We have also provided a recommended curriculum that can serve as a guide when deciding which resources to use in your educational journey. In the modern age of information technology, an enormous number of free resources are available for data science self-study. With a little bit of effort and dedication, anyone can master the fundamentals of data science.