How to detect outliers | Data Science Interview Questions and Answers
Detect outliers using supervised, unsupervised, time series, and deep learning models
In this tutorial, we will talk about how to answer the data science interview question about outlier detection. The tutorial covers the general strategies of answering the question, and provides example questions and answers.
Resources for this post:
- Video tutorial for this post on YouTube
- More video tutorials on anomaly detection and data science interview
- More blog posts on anomaly detection and data science interview
Strategies
The strategy for the outlier detection question is divide and conquer. We divide the answer into four parts:
- Outlier detection using the statistical definition of outliers
- Outlier detection using a supervised model
- Outlier detection using an unsupervised model
- Time series outlier detection
There is a bonus at the end to bring your answer to the next level.
The answer for each part includes three components. They are:
- When to use the method?
- How to implement the method?
- How to set the threshold for outliers?
When answering the question, make it a conversation instead of a presentation. First, give a 1-minute summary of the answer, then wait for the interviewer to ask follow-up questions. Limit each follow-up answer to 1 to 2 minutes if possible.
Question: How to detect outliers?
There are four ways of detecting outliers depending on what kind of data we are working with.
- For a single variable, we can use the statistical definition of outliers and label values that fall more than 1.5 times the Interquartile Range (IQR) below the first quartile or above the third quartile as outliers.
- For a dataset with features and labeled outliers, we can build a supervised binary classification model to predict outliers.
- For a dataset without labeled outliers, we can build an unsupervised anomaly detection model to predict outliers.
- For a time-series dataset, we can build a time series model to predict outliers.
Follow-up Question 1: How to use IQR to detect outliers?
To use IQR for outlier detection, we can follow three steps:
- The first step is to calculate the IQR (Interquartile Range) as quartile 3 (Q3) minus quartile 1 (Q1).
- The second step is to multiply IQR by 1.5.
- The third step is to calculate the thresholds. The lower threshold is Q1 minus 1.5 times IQR, and the upper threshold is Q3 plus 1.5 times IQR. Any data point below the lower threshold or above the upper threshold is an outlier.
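The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a definitive implementation: the function name `iqr_outliers`, the parameter `k`, and the sample values are made up for this example.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    # Step 1: IQR is Q3 minus Q1.
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    # Steps 2 and 3: scale the IQR by k and build the two thresholds.
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return (values < lower) | (values > upper)

# Illustrative data: seven similar values and one extreme value.
data = [10, 12, 11, 13, 12, 11, 10, 95]
mask = iqr_outliers(data)
print([v for v, is_out in zip(data, mask) if is_out])
```

Keeping the multiplier as a parameter makes it easy to discuss stricter cutoffs (for example 3 times IQR) if the interviewer asks about the choice of 1.5.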
Follow-up Question 2: How to use a supervised model to detect outliers?
To use a supervised model for outlier detection, we can follow three steps:
- The first step is to label outliers as 1 and non-outliers as 0.
- The second step is to handle label imbalance. An outlier detection dataset usually has a highly imbalanced target, which makes building a high-performance model challenging, so we need to use techniques such as oversampling, under-sampling, or balanced class weights.
- The third step is to train a binary classification model such as XGBoost, random forest, or a neural network, make predictions, and identify the data points with a predicted probability greater than the threshold (0.5 by default) as outliers.
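As a rough sketch of these three steps, the snippet below uses scikit-learn. The synthetic dataset from `make_classification`, the 5% outlier rate, and the choice of a random forest with `class_weight="balanced"` are assumptions made for illustration; any of the imbalance techniques and classifiers mentioned above could be swapped in.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Step 1: a labeled dataset where 1 = outlier and 0 = non-outlier.
# Synthetic data with roughly 5% outliers stands in for real labels.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 2: handle the imbalance, here with balanced class weights.
model = RandomForestClassifier(
    n_estimators=100, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# Step 3: predicted probability of the outlier class, then threshold it.
outlier_proba = model.predict_proba(X_test)[:, 1]
threshold = 0.5
is_outlier = outlier_proba > threshold
print(f"Flagged {is_outlier.sum()} of {len(is_outlier)} test points")
```

Keeping `threshold` as a separate variable makes it easy to explain how the cutoff could be moved to trade precision against recall.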
To learn how to handle an imbalanced classification dataset using Python, please refer to my tutorials:
- Four Oversampling and Under-Sampling Methods for Imbalanced Classification Using Python
- Neural Network Model Balanced Weight For Imbalanced Classification In Keras
- Balanced Weights For Imbalanced Classification
- Ensemble Oversampling And Under-Sampling For Imbalanced Classification Using Python
Follow-up Question 3: How to use an unsupervised model to detect outliers?
There are several machine learning algorithms for unsupervised anomaly detection, such as one-class Support Vector Machine (SVM), Local Outlier Factor (LOF), and Isolation Forest. I will explain the process using the one-class SVM.
- In the first step, we need to specify the percentage of anomalies. This is usually based on business knowledge or historical data.
- In the second step, train the one-class SVM model using the specified percentage.
- In the third step, make predictions on the new data to identify outliers. By default, the model's decision boundary reflects the pre-specified percentage, but we can also apply a customized percentage to the anomaly scores.
To learn how to do anomaly detection using Python, please refer to my tutorials:
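The steps above might look like this with scikit-learn's `OneClassSVM`. This is a sketch under assumptions: the synthetic 2-D data, the 2% and 5% fractions, and the variable names are illustrative, and `nu` only approximately plays the role of the expected outlier percentage (it is an upper bound on the fraction of training points treated as errors).

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Illustrative data: training points around the origin, plus new data
# that contains one point far away from everything else.
X_train = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
X_new = np.vstack([rng.normal(size=(199, 2)), [[50.0, 50.0]]])

# Step 1: expected share of anomalies (assumed from business knowledge).
expected_outlier_fraction = 0.02

# Step 2: train the one-class SVM with that percentage.
model = OneClassSVM(kernel="rbf", gamma="scale", nu=expected_outlier_fraction)
model.fit(X_train)

# Step 3a: default predictions, where -1 means outlier and 1 means inlier.
default_flags = model.predict(X_new) == -1

# Step 3b: customized percentage, flagging the lowest-scoring 5% instead.
scores = model.score_samples(X_new)
custom_fraction = 0.05
custom_threshold = np.percentile(scores, custom_fraction * 100)
custom_flags = scores < custom_threshold

print(default_flags.sum(), custom_flags.sum())
```

Lower scores mean more anomalous points, which is why the custom rule flags values below the chosen percentile.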
- One-Class SVM For Anomaly Detection
- Local Outlier Factor (LOF) For Anomaly Detection
- Isolation Forest For Anomaly Detection
Follow-up Question 4: How to use a time series model to detect outliers?
Outlier detection for time series needs to account for trend and seasonality. We can use the Prophet time series model to identify outliers in three steps:
- In the first step, build a time series forecasting model using Prophet, and specify the desired uncertainty interval.
- In the second step, make predictions on historical data using the time series forecasting model.
- In the third step, compare the actual values with the prediction intervals. Outliers are defined as the data points with actual values outside of the prediction intervals.
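The comparison in the third step can be sketched with pandas alone. The snippet assumes a forecast frame with the `ds`, `yhat_lower`, and `yhat_upper` columns that Prophet's `predict` returns; to keep the example runnable without fitting a model, a small hand-made frame with made-up numbers stands in for the real forecast. The helper name `flag_interval_outliers` is hypothetical.

```python
import pandas as pd

def flag_interval_outliers(actual, forecast):
    """Mark rows whose actual value y lies outside the prediction interval."""
    merged = actual.merge(
        forecast[["ds", "yhat_lower", "yhat_upper"]], on="ds", how="inner"
    )
    merged["is_outlier"] = (merged["y"] < merged["yhat_lower"]) | (
        merged["y"] > merged["yhat_upper"]
    )
    return merged

# With Prophet, `forecast` would come from steps 1 and 2, for example:
#   m = Prophet(interval_width=0.99); m.fit(actual); forecast = m.predict(actual)
# Here an illustrative constant interval is used instead.
dates = pd.date_range("2023-01-01", periods=5, freq="D")
actual = pd.DataFrame({"ds": dates, "y": [100.0, 102.0, 150.0, 99.0, 101.0]})
forecast = pd.DataFrame(
    {"ds": dates, "yhat_lower": [90.0] * 5, "yhat_upper": [110.0] * 5}
)

result = flag_interval_outliers(actual, forecast)
print(result[result["is_outlier"]])
```

A wider uncertainty interval in step 1 flags fewer points, so the interval width acts as the outlier threshold for this method.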
To learn how to do time-series outlier detection using Python, please refer to my tutorial Time Series Anomaly Detection Using Prophet in Python.
Bonus Answer: Use Autoencoder for Anomaly Detection
To show your knowledge of deep learning, you can explain how to use an autoencoder for outlier detection. As described here, the method requires labels for the outliers, because the autoencoder is trained only on data known to contain no outliers.
- In the first step, build an autoencoder model using the data without outliers.
- In the second step, make predictions on a new dataset that includes outliers.
- In the third step, calculate the reconstruction error (the difference between the autoencoder's reconstructed values and the actual values) and set a threshold on it.
- In the fourth step, identify the data points with a difference higher than the threshold as outliers.
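The four steps can be illustrated without a deep learning framework. The sketch below swaps in scikit-learn's `MLPRegressor` with a two-unit bottleneck, trained to reproduce its own input, as a small stand-in for a Keras autoencoder. The synthetic data, the 99th-percentile threshold, and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

def make_normal(n):
    """Illustrative normal data: 5 features driven by 2 latent factors."""
    latent = rng.normal(size=(n, 2))
    mixing = np.array([[1.0, 0.5, -0.5, 0.2, 0.8],
                       [0.3, -1.0, 0.7, 0.9, -0.4]])
    return latent @ mixing + rng.normal(scale=0.05, size=(n, 5))

# Step 1: train the autoencoder stand-in on data without outliers.
X_train = make_normal(1000)
autoencoder = MLPRegressor(
    hidden_layer_sizes=(2,), activation="tanh", max_iter=2000, random_state=0
)
autoencoder.fit(X_train, X_train)  # input and target are the same

def reconstruction_error(model, X):
    """Mean squared difference between input and reconstruction, per row."""
    return np.mean((X - model.predict(X)) ** 2, axis=1)

# Step 2: score a new dataset whose last 10 rows are outliers.
n_outliers = 10
X_new = np.vstack([make_normal(200), rng.normal(scale=5.0, size=(n_outliers, 5))])
errors_new = reconstruction_error(autoencoder, X_new)

# Step 3: threshold from the training reconstruction errors.
threshold = np.percentile(reconstruction_error(autoencoder, X_train), 99)

# Step 4: points with error above the threshold are flagged as outliers.
is_outlier = errors_new > threshold
print(is_outlier.sum(), "points flagged")
```

The percentile used for the threshold is a tuning choice: a higher percentile flags fewer normal points at the cost of missing milder outliers.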
To learn how to implement it in Python, please refer to my tutorial Autoencoder For Anomaly Detection Using Tensorflow Keras.