Top 9 Algorithms for a Machine Learning Beginner
Machines with brains are the future.
The use of Machine Learning and its prowess have grown exponentially over the last few years. Things our grandparents thought could only be done by an intelligent human are now being done by machines with minimal human intervention. Such is the power of Machine Learning.
When Arthur Samuel of IBM coined the phrase “Machine Learning” in 1959 while building his checkers-playing program, he could hardly have imagined that it would open up a whole new spectrum of applications, from helping people with disabilities to facilitating businesses with decision-making and dynamic pricing.
Machine learning is a method of data analysis that automates analytical model building. It is a branch of technology that allows systems to learn from data, identify patterns and make decisions with minimal human intervention.
Machine Learning has made it possible to build software that can understand images, sounds, and language, and it lets us learn more about technology each day.
When I first came to know about Machine Learning from an article in The Verge in 2013, I could not understand how machines could possibly be trained. It was not until I started my undergrad in 2015 that I began to learn Machine Learning. From training data, test data, and supervised vs. unsupervised learning, through decision trees and random forests, to deep learning and neural networks, it was a heavy cloud of knowledge.
One after the other, I learned the different Machine Learning algorithms and did projects with them. The more I read and worked with them, the more I understood that there are certain algorithms you start off with, and those are the ones that make you familiar with the whole ML and AI ecosystem.
Before we get into the list, it is extremely important to set the base right on supervised and unsupervised learning.
Supervised Learning
Supervised learning is when you have input variables, say x, and an output variable, say y, and you use an algorithm to learn the mapping function from the input to the output: y = f(x).
The goal of supervised learning is to approximate the mapping function so well that when you have new input data x, you can predict the output variable y for that data.
It is called supervised learning because the process of an algorithm learning from a training dataset can be thought of as a teacher supervising the learning process of some students.
Unsupervised Learning
Unsupervised learning is when you only have input data x and no corresponding output variables.
The goal of unsupervised learning is to model the underlying structure or distribution of the data in order to learn more about it.
It is called unsupervised learning because, unlike supervised learning, there are no correct answers and no teacher to supervise. The algorithms are left to their own devices to discover and present the interesting structure in the data in the best possible way.
With that base set, let's get into the Top 9 Machine Learning Algorithms, the ones we have all heard about a hundred times, but this time with clarity about their applications and powers, in no particular order of importance.
1. Linear Regression
Linear Regression is a machine learning algorithm based on supervised learning. It models the relationship between a dependent variable and one or more independent variables, and is mainly used when working with a scalar dependent variable and explanatory (independent) variables.
Linear Regression is applied to determine the extent to which there is a linear relationship between the dependent variable (scalar) and one or more independent variables (explanatory). In simple linear regression, a single independent variable is used to predict the value of the dependent variable.
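To make this concrete, here is a minimal sketch of simple linear regression using scikit-learn. The numbers (years of experience vs. salary) are made up purely for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data: years of experience (explanatory) vs. salary in $1000s (dependent)
X = np.array([[1], [2], [3], [4], [5]])   # one independent variable
y = np.array([35, 42, 50, 58, 66])        # dependent (scalar) variable

model = LinearRegression().fit(X, y)      # learn y ≈ slope * x + intercept
print(model.coef_, model.intercept_)      # fitted slope and intercept
print(model.predict([[6]]))               # prediction for an unseen input
```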
Real Life Applications of Linear Regression
- Risk Management in financial services or insurance domain
- Predictive Analytics
- Econometrics
- Epidemiology
- Weather data analysis
- Customer survey results analysis
2. Logistic Regression
Logistic Regression is used when the dependent variable is binary. It is a go-to method for binary classification problems in statistics. First, it is essential to understand when to use linear regression and when to use logistic regression.
What is the difference between Linear and Logistic Regression?
Linear regression is used when the dependent variable is continuous and the nature of the regression line is linear.
Logistic regression is used when the dependent variable is binary in nature.
When to use Logistic regression?
Logistic regression is used when the target variable is categorical in nature. It is a close relative of linear regression that uses the log of odds as its dependent variable.
The sigmoid function, also called the logistic function, gives an ‘S’ shaped curve that can take any real-valued number and map it into a value between 0 and 1.
- As the input approaches positive infinity, the predicted y approaches 1
- As the input approaches negative infinity, the predicted y approaches 0
- If the output of the sigmoid function is more than 0.5, we can classify the outcome as 1 or YES, and if it is less than 0.5, we can classify it as 0 or NO
- If the output is 0.75, we can say, in terms of probability, that there is a 75 percent chance that the patient will suffer from cancer (in a cancer-detection setting, for example)
Thus, Logistic Regression predicts the probability of occurrence of a binary event utilizing a sigmoid function.
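As a rough illustration, here is a minimal logistic regression sketch with scikit-learn on a made-up two-class dataset; predict_proba exposes the sigmoid output described above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy binary data: tumor size (feature) -> benign (0) or malignant (1)
X = np.array([[1.2], [2.1], [2.8], [3.5], [4.4], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.9]]))   # sigmoid output: P(class 0), P(class 1)
print(clf.predict([[3.9]]))         # 1 if P(class 1) > 0.5, else 0
```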
Real life applications of Logistic Regression
- Cancer Detection
- Trauma and Injury Severity Score
- Image Segmentation and Categorization
- Geographic Image Processing
- Handwriting recognition
- Predicting whether a person is depressed based on a bag of words from a text corpus
3. Support Vector Machine
Machine learning largely involves predicting and classifying data. To do so, we have a set of machine learning algorithms to implement, depending on the dataset. One of these ML algorithms is SVM. The idea is simple: create a line or a hyperplane that separates the data into classes.
Support Vector Machine (SVM) is a supervised machine learning algorithm that can be used for both classification and regression challenges. However, it is mostly used in classification problems. SVM transforms your data and, based on these transformations, finds an optimal boundary between the possible outputs.
Support Vector Machine performs classification by finding the hyperplane that maximizes the margin between the two classes.
The data points that lie closest to the hyperplane and define its position are called the support vectors.
The SVM Algorithm
- Define an optimal hyperplane with a maximized margin
- Map data to a high dimensional space where it is easier to classify with linear decision surfaces
- Reformulate problem so that data is mapped implicitly into this space
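A minimal sketch of these steps using scikit-learn's SVC; the dataset (Iris) and the RBF kernel are just placeholder choices for illustration:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Load a small labeled dataset and split it into train and test sets
X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The RBF kernel maps the data implicitly into a higher-dimensional space
clf = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
print(clf.score(X_test, y_test))     # classification accuracy
print(len(clf.support_vectors_))     # number of support vectors defining the boundary
```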
Real Life Applications of SVM
- Face detection — classify between face and non-face areas on images
- Text and hypertext categorization
- Classification of images
- Bioinformatics — protein, genes, biological or cancer classification.
- Handwriting recognition
- Drug Discovery for Therapy
In recent times, SVM has played a very important role in cancer detection and its therapy with its application in classification.
4. Decision Trees
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including event outcomes, resource costs, and the utility of decisions. A decision tree resembles a flowchart that contains only conditional control statements.
A decision tree is drawn upside down, with the root node at the top. Each decision tree has three key parts: a root node, leaf nodes, and branches.
In a decision tree, each internal node represents a test or an event, say heads or tails in a coin flip. Each branch represents the outcome of the test, and each leaf node represents a class label, the decision taken after computing all attributes. The paths from the root to the leaf nodes represent the classification rules.
Decision trees can be a powerful machine learning algorithm for both classification and regression. A classification tree works on a categorical target, for example classifying whether a coin flip came up heads or tails. Regression trees are built in a similar manner, but they predict continuous values like house prices in a neighborhood.
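As a quick sketch of that distinction, here are a classification tree and a regression tree in scikit-learn; the tiny datasets are invented just to show the workflow:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification: predict a class label from two attributes
X_cls = [[0, 0], [1, 1], [1, 0], [0, 1]]
y_cls = ["no", "yes", "yes", "no"]
clf = DecisionTreeClassifier(max_depth=3).fit(X_cls, y_cls)
print(clf.predict([[1, 1]]))

# Regression: predict a continuous value, e.g. a house price
X_reg = [[1200], [1500], [1800], [2200]]   # square footage
y_reg = [200.0, 240.0, 290.0, 360.0]       # price in $1000s
reg = DecisionTreeRegressor(max_depth=2).fit(X_reg, y_reg)
print(reg.predict([[1600]]))
```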
The best part about decision trees:
- Handle both numerical and categorical data
- Handle multi-output problems
- Decision trees require relatively less effort in data preparation
- Nonlinear relationships between parameters do not affect tree performance
Real life applications of Decision Trees
- Selecting a flight to travel
- Predicting high occupancy dates for hotels
- Finding that a feature such as the number of drug stores nearby is a particularly effective predictor for a client X
- Cancerous vs. non-cancerous cell classification, where cancerous cells are rare (say 1%)
- Suggesting which car a customer should buy
5. Random Forests
Random Forest is an ensemble learning technique for classification, regression, and other tasks that works by building a multitude of decision trees at training time. Random forests are fast, flexible, represent a robust approach to mining high-dimensional data, and are an extension of the classification and regression decision trees we talked about above.
Ensemble learning, in general, can be defined as a model that makes predictions by combining individual models. The ensemble model tends to be more flexible, with less bias and less variance. Ensemble learning has two popular methods:
- Bagging: each individual tree is trained on a random subset of the data, sampled independently of the other trees, resulting in different trees
- Boosting: each individual tree/model learns from the mistakes made by the previous model and improves on them
Random forest run times are quite fast, and they are pretty efficient at dealing with missing and incorrect data. On the negative side, they cannot predict beyond the range defined by the training data, and they may over-fit datasets that are particularly noisy.
A random forest typically uses somewhere between 64 and 128 trees.
Difference between Random Forest and a Decision Tree
Random Forest is essentially a collection of Decision Trees. A decision tree is built on an entire dataset, using all the features/variables of interest, whereas a random forest randomly selects observations/rows and specific features/variables to build multiple decision trees from and then averages the results.
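The bagging idea can be seen directly in scikit-learn's RandomForestClassifier, where each tree is trained on a bootstrap sample of rows and a random subset of features. The synthetic dataset and parameter values below are just illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic labeled data stands in for a real problem (e.g. fraud detection)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees, in the 64-128 ballpark mentioned above
    max_features="sqrt",   # each split considers only a random subset of features
    bootstrap=True,        # each tree trains on a random sample of rows (bagging)
    random_state=0,
).fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy of the averaged (voted) prediction
```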
Real life applications of Random Forests
- Fraud detection for bank accounts, credit card
- Detect and predict the drug sensitivity of a medicine
- Identify a patient’s disease by analyzing their medical records
- Predict estimated loss or profit while purchasing a particular stock
6. K-nearest neighbors
K- nearest neighbor (kNN) is a simple supervised machine learning algorithm that can be used to solve both classification and regression problems.
kNN stores the available inputs and classifies new inputs based on a similarity measure, i.e., a distance function. kNN has found its major applications in statistical estimation and pattern recognition.
How does kNN work?
KNN works by finding the distances between a query and all inputs in the data. Next, it selects a specified number of inputs, say K, closest to the query. And then it votes for the most frequent label (in the case of classification) or averages the labels (in the case of regression).
The kNN Algorithm:
- Load the data
- Initialize k to a chosen number of neighbors in the data
- For each example in the data, calculate the distance between the query example and the current example
- Add that distance and the index of the example to an ordered collection
- Sort the ordered collection of distances and indices in ascending order by distance
- Pick the first K entries from the sorted collection
- Get the labels of the selected K entries
- If regression, return the mean of the K labels; If classification, return the mode of the K labels
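Here is a small from-scratch sketch of those steps in Python; the Euclidean distance function and the toy data are my own assumptions for illustration:

```python
import math
from collections import Counter

def euclidean(a, b):
    # straight-line distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(data, query, k, mode="classification"):
    # 1. compute the distance from the query to every example and sort by distance
    distances = sorted((euclidean(features, query), label) for features, label in data)
    # 2. keep the labels of the k nearest examples
    k_labels = [label for _, label in distances[:k]]
    # 3. mode of the labels for classification, mean for regression
    if mode == "classification":
        return Counter(k_labels).most_common(1)[0][0]
    return sum(k_labels) / k

# Toy labeled data: (features, label)
data = [([1.0, 1.0], "A"), ([1.2, 0.8], "A"), ([4.0, 4.2], "B"), ([4.5, 4.0], "B")]
print(knn_predict(data, [1.1, 1.0], k=3))   # -> "A"
```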
Real world applications of kNN
- Fingerprint detection
- Forecasting stock market
- Currency exchange rate
- Bank bankruptcies
- Credit rating
- Loan management
- Money laundering analyses
- Estimate the amount of glucose in the blood of a diabetic person from the IR absorption spectrum of that person’s blood.
- Identifying the risk factors for cancer based on clinical and demographic variables
7. K-means clustering
K-means clustering is one of the simplest and a very popular unsupervised machine learning algorithms.
Did we not talk about something so similar above?
Difference between k-nearest neighbors and k-means clustering
The crucial difference is that kNN is a supervised algorithm that works with labeled data, while k-means clustering is an unsupervised algorithm that works with unlabeled data. The k-means algorithm identifies k centroids and then allocates every data point to the nearest centroid, while keeping the clusters as compact as possible. The 'means' in k-means refers to averaging the data, that is, finding the centroid of each cluster.
The k-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It halts creating and optimizing clusters when either the centroids have stabilized or the defined number of iterations has been reached.
The K-means clustering algorithm:
- Specify the number of clusters K.
- Initialize centroids by first shuffling the dataset and then randomly selecting K data points for the centroids without replacement
- Keep iterating until the centroids are stabilized
- Compute the sum of the squared distance between data points and all centroids
- Assign each data point to the closest cluster (centroid)
- Compute the centroids for the clusters by taking the average of the data points that belong to each cluster.
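A minimal sketch of this loop using scikit-learn's KMeans; the two synthetic blobs of points exist only to make the clustering visible:

```python
import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points: two loose groups by construction
X = np.vstack([
    np.random.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    np.random.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# K = 2 clusters; the algorithm iterates until the centroids stabilize
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # final centroid positions
print(kmeans.labels_[:10])       # cluster assigned to each point
```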
Real World applications of K-means Clustering
- Identifying fake news
- Spam detection and filtering
- Classify books or movies by genre
- Popular transport routes while town planning
8. Naive Bayes
Naive Bayes is my favorite, super-effective, commonly used machine learning classifier. Naive Bayes is in itself not a single algorithm but a whole family of algorithms.
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms where all of them share a common principle, i.e. every pair of features being classified is independent of each other.
In order to understand Naive Bayes, let us recall Bayes rule:
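P(A | B) = P(B | A) * P(A) / P(B)
In words: the probability of class A given evidence B equals the probability of the evidence given the class, multiplied by the prior probability of the class, and divided by the overall probability of the evidence.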
What is so “naive” in Naive Bayes?
Naive Bayes (NB) is naive because it makes the assumption that the attributes of a measurement are independent of each other. We can simply treat each attribute as an independent quantity and determine the proportion of previous measurements in a class that have the same value for this attribute alone.
Naive Bayes is used primarily to predict the probability of different classes based on multiple attributes. It is mostly used in text classification and text mining. If you look at the applications of Naive Bayes, the projects you always wanted to do can often be done best with this family of algorithms.
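A minimal text-classification sketch using scikit-learn's MultinomialNB; the headlines and labels below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus: headlines labeled by topic
texts = [
    "new phone released with faster chip",
    "team wins the championship final",
    "parliament passes new budget bill",
    "startup launches AI-powered laptop",
]
labels = ["technology", "sports", "politics", "technology"]

# Bag-of-words features feeding a Naive Bayes classifier
model = make_pipeline(CountVectorizer(), MultinomialNB()).fit(texts, labels)
print(model.predict(["new chip powers faster laptop"]))   # likely "technology"
```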
Real world applications of Naive Bayes
- Classify a news article about technology, politics, or sports
- Sentiment analysis on social media
- Facial recognition software
- Recommendation Systems as in Netflix, Amazon
- Spam filtering
9. Principal Component Analysis (PCA)
Now, this one, principal component analysis might not be the best candidate in the algorithm category, but it definitely is super-useful as a machine learning technique.
Principal Component Analysis (PCA) is an unsupervised, statistical technique primarily used for dimensionality reduction by feature extraction in machine learning.
When we talk about high dimensionality, it means that the dataset has a large number of features, which requires a large amount of memory and computational power.
PCA uses an orthogonal transformation to convert a set of correlated variables into a set of uncorrelated variables. It is used to explain the variance-covariance structure of a set of variables through linear combinations. It is also one of the most widely used tools in exploratory data analysis and predictive modeling.
The idea behind PCA is simply to find a low-dimensional set of axes that summarizes the data. Say, for example, we have a dataset composed of a set of car properties: size, color, number of seats, number of doors, size of trunk, circularity, compactness, radius, and so on. Many of these features will indicate the same thing and are therefore redundant. We as smart technologists should try to remove these redundancies and describe each car with fewer properties, making the computation simpler. This is exactly what PCA aims to do.
PCA does not take class label information into account. It concerns itself with the variance of each attribute, because high variance tends to indicate a good separation between classes, and that is how the dimensionality is reduced. PCA never simply keeps some attributes and discards others; it takes all the attributes into account statistically, combining them into new components.
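A minimal sketch with scikit-learn's PCA; the synthetic, deliberately redundant features and the choice of three components are assumptions made only for this example:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data: 100 samples with 10 partly redundant (correlated) features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 3))
X = np.hstack([base, base @ rng.normal(size=(3, 7))])   # 10 columns, mostly redundant

X_scaled = StandardScaler().fit_transform(X)   # PCA is variance-based, so scale first
pca = PCA(n_components=3).fit(X_scaled)

print(pca.explained_variance_ratio_)   # variance captured by each principal component
X_reduced = pca.transform(X_scaled)    # 100 x 3 instead of 100 x 10
print(X_reduced.shape)
```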
Real world applications of PCA
- Optimize power allocation in multiple communication channels
- Image Processing
- Movie recommendation system
Thank you for reading! I hope you enjoyed the article. Do let me know what skills you are looking forward to learning or exploring in your Machine Learning journey.
Happy Data Tenting!
Disclaimer: The views expressed in this article are my own and do not represent a strict outlook.
Know your author
Rashi is a graduate student at the University of Illinois, Chicago. She loves to visualize data and create insightful stories. She is a User Experience Analyst and Consultant, a Tech Speaker, and a Blogger.