Data Science 101 | Towards AI

Data Science 101 — A Short Course on Medium Platform with R and Python Code Included

Data Science 101 is intended for individuals who have some prior exposure to or knowledge of data science concepts and are interested in practical applications beyond what is offered in most introductory-level data science courses on platforms such as DataCamp, Coursera, Udemy, or edX.

This course will provide you with the fundamental knowledge that you need in data science using real-world examples. The course contains several examples with code included in both R and Python, which are widely regarded as the two most popular programming languages in data science organizations and industries.

Why You Should Take This Course

All course materials are included as links to my Medium data science articles, so you don’t need to leave this platform to access course materials. Supplementary course materials such as datasets, Jupyter notebooks, R scripts, and sample output files are included as links to my GitHub repositories.
Code is provided for all examples using R or Python. You can download the code and datasets for each example, and then modify it accordingly for learning purposes or modify the code to solve an entirely different problem.
The course can also serve as a quick refresher for those preparing for a data scientist job interview, as some of the course materials are based on typical take-home challenge projects from data scientist interviews.
The author has 2 years of experience in data science education, is a top contributor to the online data science publication Towards AI, and keeps learning new data science concepts every day. Please feel free to leave feedback, comments, or questions for further clarification or discussion.

What You Will Learn:

Fundamental programming skills in R and Python
Learn how to process raw data into formats necessary for analysis
Learn techniques for transforming data such as principal component analysis (PCA) and linear discriminant analysis (LDA)
Learn basic data visualization principles and how to apply them using R’s ggplot2, and Python’s matplotlib and seaborn packages
Introduction to linear regression including simple and multiple regression problems
Learn the machine learning process
Implement machine learning algorithms
In-depth knowledge of fundamental data science concepts through motivating real-world case studies
Hands-on experiential learning

Prerequisites

This course assumes a basic understanding of programming concepts in R and Python. The course also assumes familiarity with essential math skills. Please see the article Essential Math Skills for Machine Learning for more information about the essential math skills required for practicing data scientists.

MODULE 1: Data Wrangling

The process of data wrangling is a critical step for any data scientist. Very rarely is data easily accessible in a data science project for analysis. It’s more likely for the data to be in a file, a database, or extracted from documents such as web pages, tweets, or PDFs. Knowing how to wrangle and clean data will enable you to derive critical insights from your data that would otherwise be hidden.

This module will demonstrate the data wrangling process. You’ll learn the following:

Reading a CSV file from the internet using the file’s URL and converting it directly into a data frame for analysis
Importation of unstructured data
Cleaning and organizing unstructured data using string processing techniques
Converting unstructured data into structured data
Performing analysis of structured data
Extracting data from a PDF using tools in R and Python
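
As a rough illustration of the cleaning steps listed above, the sketch below turns a small block of raw text into structured records using Python's re module. The sample text, its "[edit]" and bracket markers, and the function name parse_towns are assumptions made up for this example; they are not taken from the linked tutorials.

```python
import re

# Hypothetical raw text: region headers end in "[edit]" and town lines
# carry a parenthesised institution plus an optional footnote marker.
RAW = """\
Alabama[edit]
Auburn (Auburn University)[1]
Tuscaloosa (University of Alabama)

Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
"""

def parse_towns(text):
    """Convert the raw lines into a list of {'state', 'town'} records."""
    records = []
    state = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.endswith("[edit]"):
            # A header line: remember the current state name
            state = line[: -len("[edit]")].strip()
            continue
        # Keep only the text before the first "(" or "["
        town = re.split(r"[\(\[]", line, maxsplit=1)[0].strip()
        records.append({"state": state, "town": town})
    return records

records = parse_towns(RAW)
for row in records:
    print(row["state"], "|", row["town"])
```

A list of dictionaries like this can be passed straight to a data frame constructor for the analysis step.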

Module Links:

Using sapply() function in R to generate a table

Download a file from the internet using the R functions download.file() and read.csv()

Tutorial on Data Wrangling: College Towns Dataset

Extracting Data from PDF File Using Python and R

MODULE 2: Data Visualization Basics

This module will teach basic data visualization principles and how to apply them using R’s ggplot2 and Python’s matplotlib packages. You will learn the following:

Scatter plot
Barplot
Histogram
Probability density plot
Line plot
Pairplot
Heatmap

Module Links:

Tutorial on Barplots using R’s ggplot Package

Tutorial on Data Visualization: Weather Data

Bad and Good Regression Analysis

Building a Machine Learning Recommendation Model from Scratch

MODULE 3: Techniques of Dimensionality Reduction

A machine learning algorithm (such as classification, clustering, or regression) uses a training dataset to determine weight factors that can be applied to unseen data for predictive purposes. Before implementing a machine learning algorithm, it is necessary to select only the relevant features in the training dataset. The process of transforming a dataset so that only the features relevant for training remain is called dimensionality reduction. Dimensionality reduction is important for three main reasons:

Prevents Overfitting: A high-dimensional dataset with too many features can sometimes lead to overfitting (the model captures both real and random effects).
Simplicity: An over-complex model with too many features can be hard to interpret, especially when features are correlated with each other.
Computational Efficiency: A model trained on a lower-dimensional dataset is computationally efficient (the algorithm requires less computational time to execute).

Dimensionality reduction, therefore, plays a crucial role in data preprocessing. In this module, you’ll learn two important techniques for dimensionality reduction:

Principal Component Analysis (PCA)
Linear Discriminant Analysis (LDA)
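
To make the idea of PCA concrete, here is a minimal sketch that computes principal components with NumPy's singular value decomposition. It is one common way to compute PCA and is not necessarily how the linked articles implement it; the function name pca and the synthetic data are assumptions for illustration. LDA, which also uses class labels, is not shown.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples x n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Variance explained by each direction, as a fraction of the total
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ components.T, ratio[:n_components]

rng = np.random.default_rng(0)
# Synthetic data (assumption): the second feature is nearly a copy of the first
x = rng.normal(size=200)
X = np.column_stack([x, x + 0.05 * rng.normal(size=200), rng.normal(size=200)])

Z, ratio = pca(X, n_components=2)
print(Z.shape)
print(ratio)
```

Because two of the three features are almost identical, two components should retain nearly all of the variance, which is the situation in which dropping a dimension costs little.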

Module Links:

Machine Learning: Dimensionality Reduction via Linear Discriminant Analysis

Machine Learning: Dimensionality Reduction via Principal Component Analysis

MODULE 4: Linear Regression

Learn how to use Python’s Pylab and Sklearn tools to implement linear regression, one of the most common statistical modeling approaches in data science. You’ll learn about the following:

Building a simple linear regressor using Python
Gradient-descent algorithm for minimizing the cost function
Hyperparameter tuning
Bias-variance tradeoff
Multiple regression analysis
Model Evaluation
R-Square value
Residual and Mean Square Error (MSE)
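
The topics above can be sketched in a few lines of NumPy: a simple linear regressor fitted by batch gradient descent on the mean square error cost, followed by the MSE and R-square of the fit. This is a minimal sketch under stated assumptions; the function names, learning rate, iteration count, and synthetic data are chosen for illustration and are not taken from the linked articles.

```python
import numpy as np

def fit_linear_gd(x, y, learning_rate=0.1, n_iter=2000):
    """Fit y ~ w * x + b by batch gradient descent on the MSE cost."""
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(n_iter):
        residual = (w * x + b) - y
        # Gradients of (1/n) * sum(residual**2) with respect to w and b
        grad_w = (2.0 / n) * np.dot(residual, x)
        grad_b = (2.0 / n) * residual.sum()
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def r_squared(y, y_pred):
    """Fraction of the variance in y explained by the predictions."""
    ss_res = np.sum((y - y_pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
# Synthetic data (assumption): true slope 3, true intercept 2, small noise
x = rng.uniform(0, 1, size=100)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=100)

w, b = fit_linear_gd(x, y)
y_pred = w * x + b
mse = np.mean((y - y_pred) ** 2)
r2 = r_squared(y, y_pred)
print(w, b, mse, r2)
```

The learning rate and the number of iterations are the hyperparameters here: too large a rate makes the updates diverge, while too small a rate needs many more iterations to converge.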

Module Links:

Machine Learning: Python Linear Regression Estimator Using Gradient Descent

Bad and Good Regression Analysis

Linear Regression Analysis in Materials Sciences

Bias-Variance Tradeoff Illustration Using Pylab

Building a Machine Learning Recommendation Model from Scratch

MODULE 5: Machine Learning

Learn how machine learning can be used for building a recommendation system and for forecasting loan status using Monte Carlo simulation.

You’ll learn the following:

Covariance matrix
Variable selection
Feature standardization
Data partitioning into train, test, and validation sets
Model building
Model evaluation
Hyperparameter tuning
Cross-validation
PCA, LDA, and Lasso Regression
Sklearn’s pipeline tool
Monte Carlo simulation
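
Two of the steps listed above, data partitioning and feature standardization, are sketched below with plain NumPy. The split fractions, function names, and synthetic data are assumptions for illustration. The sketch computes the mean and standard deviation on the training set only and reuses them for the validation and test sets, so that no information from held-out data leaks into training.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the rows and split them into train, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]
    return ((X[train_idx], y[train_idx]),
            (X[val_idx], y[val_idx]),
            (X[test_idx], y[test_idx]))

def standardize(train, *others):
    """Scale all arrays using the mean and std of the training array only."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)
    std[std == 0] = 1.0  # avoid division by zero for constant features
    return tuple((a - mean) / std for a in (train, *others))

rng = np.random.default_rng(1)
# Synthetic data (assumption): 100 samples, 3 features on an arbitrary scale
X = rng.normal(loc=5.0, scale=3.0, size=(100, 3))
y = rng.normal(size=100)

(X_train, y_train), (X_val, y_val), (X_test, y_test) = train_val_test_split(X, y)
X_train_s, X_val_s, X_test_s = standardize(X_train, X_val, X_test)
print(X_train_s.shape, X_val_s.shape, X_test_s.shape)
```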

Module Links:

The Machine Learning Process

Building a Machine Learning Recommendation Model from Scratch

Machine Learning Model for Stochastic Processes

References and Additional Resources

Essential Math Skills for Machine Learning: https://readmedium.com/4-math-skills-for-machine-learning-12bfbc959c92.
Best Data Science MOOC Specializations: https://readmedium.com/3-best-data-science-mooc-specializations-d58da382f628.
5 Steps to Become a Data Scientist: https://readmedium.com/five-steps-to-becoming-a-data-scientist-239bbc60a6e3.
Data Scientist Interview Process — A Personal Experience: https://readmedium.com/data-scientist-interview-process-a-personal-experience-33295495b4a0.