What does “Garbage in, garbage out” mean in solving real business problems?
and how to avoid it with a practical workflow
In today's business landscape, relying on accurate data is more important than ever. The phrase "garbage in, garbage out" perfectly captures the importance of data quality in achieving successful data-driven solutions. While choosing the right model for forecasting or classification is crucial, it is impossible to achieve good results without reliable input data. With well-engineered features built from trustworthy data sources, even a simple linear regression can yield highly accurate results. In this blog post, I will discuss the importance of data in solving real-world business problems and outline the steps to build a strong data evaluation pipeline that ensures input data quality for accurate modeling and sound decision-making.
The Reality in Applied Data Science
After working as a data scientist for over two years, one of my most surprising observations is how much time my colleagues and I spend on data cleaning. In school, our attention is usually directed toward understanding the fundamental algorithms, the mathematical principles underlying the models, the overall process of constructing a forecasting pipeline, and so on. We often work with clean datasets that are deliberately curated so we can focus on EDA, model evaluation, and fine-tuning, which leads us to underestimate the significance of data cleaning until we encounter real-world business data in industry. Real business data is messy. The messiness comes from, but is not limited to, the following:
- Data Source Diversity: Businesses accumulate data from a variety of sources. For example, an e-commerce company can collect data from customers' purchases, sales planning, manufacturing processes, marketing campaigns, etc. Each data source comes with its own formats, structures, and quality levels. These inconsistencies become a major challenge when merging the sources for downstream analysis.
- Human Error: Collecting data requires human involvement, which increases the likelihood of mistakes. Typos, incorrect formatting, duplicates, and other human errors introduce inaccuracies that affect later analysis.
- Missing Data: Not all data points are collected consistently or comprehensively in business settings. Missing data can be due to various reasons. Some common problems that can arise include customers failing to provide all the necessary information, technical difficulties during data collection, or inconsistent requirements for data collection over time.
- Inconsistency across Time: Business processes change due to market trends, technological advancements, internal decisions, etc. These changes affect what data is collected and how. You may see the same variable under different names across time, or variables with the same name that no longer mean the same thing.
- Ambiguous Categorical Names: Subjectively or inconsistently labeled categories are common in business data. The classification of products, services, or customer attributes can change over time. For example, an e-commerce company may move a product to a different parent category because of business reorganization, product lifecycle changes, etc. This creates challenges when aggregating or analyzing data by category over time.
- Outliers: Outliers are found in business data all the time. They may be true observations arising from rare events, extraordinary customer behaviors, or simply data input mistakes. If not effectively managed, these outliers can potentially distort analysis and modeling outcomes.
- Scaling Issues: Data collected from different sources might have different scales, units, or magnitudes. We need to rescale them to the same magnitude before analysis. Otherwise, this can impact the effectiveness of certain algorithms and analysis methods.
Addressing these challenges before starting the modeling process is crucial to avoid "garbage in, garbage out." The aim is to ensure that the input data is accurate, reliable, and preprocessed appropriately for analysis, which helps businesses gain valuable insights and make well-informed decisions. In the next section, we will discuss how to establish a data evaluation pipeline to achieve this goal.
Ensure Trustworthy Data Sources
The foundation of any data-driven solution lies in the data sources. Evaluating the credibility of data sources is crucial. Before starting a project, list all the required data and where to get them. Diving into the data’s origins helps avoid pitfalls and biases. Specifically, check:
- Data Collection Process: Understanding how the data was collected is vital. Try to determine whether it is gathered consistently across time and sources. Inconsistencies in data collection introduce errors that propagate through the analysis. If multiple datasets contain overlapping information, prefer the more authoritative one. For example, data from official sales records can be trusted more than data from informal employee logs.
- Variables’ Definitions: Do not assume a variable’s definition based only on its name. For example, a variable called US GDP can vary across sources. Is it nominal or real GDP? Is it seasonally adjusted, and if so, how? We need to refer to the data description to understand the details of the variables we are interested in.
- Consistency in Categorical Names: Messy category names can confuse and hinder meaningful analysis. For example, in sales data, one product’s name can differ across years, or the same product can be categorized into different higher hierarchies due to different administrative considerations. In this case, we must create a map to align the different names across sources.
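As a minimal illustration of such a mapping, here is a small pandas sketch that aligns inconsistent product names with an explicit mapping table. The column names and values are hypothetical, and in practice the mapping itself would be agreed on with the data owners.

```python
import pandas as pd

# Hypothetical sales records where the same product appears under
# different names across years.
sales = pd.DataFrame({
    "product": ["Choc Bar 50g", "Chocolate Bar 50g", "Choc Bar 50g"],
    "year": [2021, 2022, 2023],
    "revenue": [1200, 1350, 1100],
})

# One explicit mapping table that aligns legacy names to a canonical name.
name_map = {
    "Choc Bar 50g": "Chocolate Bar 50g",
    "Chocolate Bar 50g": "Chocolate Bar 50g",
}

sales["product_canonical"] = sales["product"].map(name_map)

# Aggregating by the canonical name now gives a consistent view across years.
print(sales.groupby("product_canonical")["revenue"].sum())
```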
Understanding data origins matters both when using a dataset directly and when joining it with other sources or extending its time horizon. Where should you get this information? For a public or paid dataset, consult the data description or the README file. For private corporate data, it is always useful to consult the people who maintain the dataset before using it. Before diving into the analysis, schedule a sync-up meeting with the data owners to clarify the data sources, the collection process, variable definitions, and so on.
Data Cleaning and Preprocessing Workflow
Once we have a good understanding of the data, we can start the cleaning and preprocessing process. These are the steps to follow:
Check Duplicates
Duplicates can distort analyses by giving extra weight to certain data points. Identifying and resolving duplicates ensures each data point contributes fairly to the analysis. However, where the duplicates come from is a more interesting and important question. Are the duplicates coming from the data collection process or errors from merging multiple datasets? Is it a one-time human input error or something more fundamental embedded in the collection process that will produce the same error consistently?
Answering these questions will help us find the best way to deal with duplicates. For example, if an e-commerce platform mistakenly records the same transaction twice, skewing revenue calculations and inventory management, we can simply delete the duplicated transactions. However, if the platform has recorded every transaction twice since a specific date, there is probably a technical issue in the platform that needs troubleshooting. Or, if only specific products’ transactions are duplicated, the cause might be duplicated definitions in the product hierarchy: if a product belongs to both the “Food” and “Entertaining” categories, its transactions are double-counted when aggregating to the higher hierarchy. In that case, depending on the business context, we either fix the product hierarchy or leave it as it is. That’s where you build business acumen in your role.
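Here is a small pandas sketch of this kind of duplicate check on a hypothetical transaction table; the idea is to inspect where duplicates concentrate before dropping anything.

```python
import pandas as pd

# Hypothetical transaction log with an accidental double entry.
transactions = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "product": ["A", "B", "B", "C"],
    "amount": [20.0, 35.0, 35.0, 15.0],
})

# Flag fully identical rows before deciding what to do with them.
dupes = transactions[transactions.duplicated(keep=False)]
print(f"{len(dupes)} rows are involved in exact duplicates")

# Inspect where the duplicates concentrate (by product, date, source, ...)
# before dropping anything; a systematic pattern hints at a pipeline bug.
print(dupes.groupby("product").size())

# Only drop once the root cause is understood.
deduped = transactions.drop_duplicates()
```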
Handling Missing Values
Missing data is inevitable, and how you deal with it matters. Again, knowing where the missing values come from guides how to handle them. We also need to evaluate the percentage of missing values, since the easiest option, excluding them, is only reasonable when that percentage is small. Generally, these are the steps to follow:
1. Identify the Missing Pattern: Understand the nature of missing data in your dataset. Is it missing completely at random, or is there a systematic pattern, such as a flaw in the data collection process? If missing data is systematic due to data collection issues, consider improving data collection methods to minimize missing values in the future.
2. Evaluate the Impact: What is the missing data percentage? What is the estimated impact of removing the affected rows? We can evaluate this by running the feature transformation or the model prediction with and without the missing values and comparing the performance metrics. If the missing data percentage is low and removal does not significantly affect the overall dataset, consider dropping rows with missing values. However, this approach should be used cautiously, as it may discard valuable information. Only when the impact is significant do we need to think about the best imputation logic.
3. Imputation: Imputation involves replacing missing values with estimated values. There are various imputation methods available (a minimal sketch follows this list):
- Mean, Median, or Mode Imputation: Replace missing values with the mean, median, or mode of the non-missing values in the same column. This method is simple; mean and median apply to numerical data, while mode also works for categorical data. However, it reduces the feature’s variance and ignores correlations between this feature and others.
- Interpolation: For time-series data, we can fill missing values with forward or backward fill, or build simple linear or time-series models to interpolate the missing timestamps if necessary. For cross-sectional data, we can use clustering algorithms: group similar individuals together and impute missing values within a group. For example, to impute product A’s sales, find its most similar products and use either the most similar product’s sales or the group’s mean/median/mode. Similarly, find the most similar products from the past to extend product A’s sales history, which is very useful for understanding product lifecycles and new product launches.
- Predictive Imputation: This method takes an extra step in the imputation by including more features. We can choose different algorithms to predict missing values based on reasonable and available features in the dataset. This approach is more sophisticated and can yield accurate results if the predictive model is well-constructed.
4. Use Missingness as an Informative Feature: If you believe the values are missing because of a systematic bias, for example, customers with less extreme sentiments are less likely to leave a comment for a product, then we can build a feature that captures this insight. Create a binary indicator variable (for example, 1 if the value is missing, 0 if not). This can help models account for the potential impact of missing data.
5. Domain Knowledge: Leveraging domain expertise is very important for solving data issues in business settings. Work with business owners to decide whether to drop the variable or how to fill the missing values.
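The sketch below illustrates a few of these options on a hypothetical daily sales series with pandas: quantifying missingness, keeping a missingness indicator, and filling with the median, time-based interpolation, or forward fill. It is a minimal example under made-up data, not a full imputation pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with gaps.
sales = pd.DataFrame(
    {"units": [12.0, np.nan, 15.0, np.nan, 20.0]},
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

# 1) Quantify the problem first.
missing_share = sales["units"].isna().mean()
print(f"{missing_share:.0%} of the series is missing")

# 2) Keep the missingness itself as a candidate feature.
sales["units_was_missing"] = sales["units"].isna().astype(int)

# 3) Simple options: column median, time-aware interpolation, forward fill.
sales["units_median"] = sales["units"].fillna(sales["units"].median())
sales["units_interp"] = sales["units"].interpolate(method="time")
sales["units_ffill"] = sales["units"].ffill()

print(sales)
```

Which column to keep depends on the missing pattern identified in step 1 and on the downstream model; the indicator column is only worth keeping if the missingness is plausibly informative.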
Normalizing Data Scale
Data often comes in various units and scales. Normalization transforms data to a consistent scale, which helps algorithms that are sensitive to input magnitudes. The standard approaches are subtracting the feature’s mean and dividing by its standard deviation (standardization), or rescaling the feature by its range (min-max scaling). Data normalization is useful in at least three circumstances:
1. Algorithms that rely on Euclidean distance, such as KMeans and KNN: different scales distort the distance calculation.
2. Algorithms that optimize with gradient descent: features on very different scales make gradient descent harder to converge.
3. Dimensionality reduction algorithms such as PCA, which find the combinations of features with the most variance: unscaled features with large magnitudes dominate the resulting components.
Not all models require normalization. Additionally, if you normalized the target variable, don’t forget to scale it back for meaningful interpretation; the sketch below shows both the forward and the inverse transforms.
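Here is a brief scikit-learn sketch of both approaches on a hypothetical two-feature array, including inverting the transform afterwards; it assumes standardization and min-max scaling are the variants you need.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical features on very different scales:
# column 0 is revenue in dollars, column 1 is a rating from 1 to 5.
X = np.array([[120000.0, 4.2],
              [85000.0, 3.8],
              [230000.0, 4.9]])

# Z-score standardization: subtract the mean, divide by the standard deviation.
z_scaler = StandardScaler()
X_standardized = z_scaler.fit_transform(X)

# Min-max scaling: rescale each feature to the [0, 1] range.
mm_scaler = MinMaxScaler()
X_minmax = mm_scaler.fit_transform(X)

# If the target was scaled too, invert the transform before interpreting
# predictions in the original business units.
X_back = z_scaler.inverse_transform(X_standardized)
```

Tree-based models such as random forests split on thresholds and are largely insensitive to feature scale, which is one reason normalization is not always required.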
Address Outliers
Outliers are data points that deviate significantly from the majority of the data in a dataset. Without appropriate handling, they can distort model performance and analysis. Detecting outliers and deciding whether to mitigate or retain them is critical: outliers may represent genuine rare events or errors in data collection, and identifying their source helps determine the appropriate action. First, to detect outliers:
1. Understand the Context: Understanding the business context before addressing outliers is critical. How much deviation from normal counts as an outlier? The answer can be very different across businesses, so we need to set a standard for distinguishing outliers.
2. Explore Visually: Creating visualizations, such as box plots, scatter plots, and histograms, is the most straightforward way to understand data distribution and identify potential outliers. This can also give insights into understanding the outliers’ impact on the overall data distribution.
3. Statistical Methods: Use statistical methods to identify outliers, such as the Z-score (the number of standard deviations a data point lies from the mean) or the interquartile range (IQR). Data points beyond a chosen threshold can be considered outliers (see the sketch after this list).
4. Domain Knowledge: Leverage your domain expertise to determine whether certain outliers are valid and meaningful data points. Sometimes, outliers may hold critical information or insights that should not be discarded.
5. Build Anomaly Detection Algorithms: When a dataset is large, which is usually true in a business setting, we need something scalable to detect outliers effectively. There are a lot of use cases in anomaly detection, especially in time series data. One common anomaly detection algorithm is Isolation Forest, which isolates anomalies using a decision tree structure.
6. Monitor Model Outputs: Besides detecting outliers in the input data, we can find potential outliers through abnormal model outputs. Those outputs help us trace back to the data and identify the outliers. For exploratory analysis, consider conducting analyses with and without outliers to compare results and understand their impact on insights. Similarly, we can evaluate model metrics with and without outliers to estimate their impact. Outliers affect different models differently, so understanding their influence helps with model selection.
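As a rough illustration of points 3 and 5, the sketch below applies the Z-score rule, the IQR rule, and an Isolation Forest to a hypothetical revenue series. The thresholds (3 standard deviations, 1.5×IQR, contamination=0.02) are conventional defaults, not universal settings, and should be calibrated to the business context from point 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Hypothetical daily revenue: mostly around 100 with one injected spike of 400.
rng = np.random.default_rng(42)
revenue = pd.Series(np.append(rng.normal(loc=100, scale=5, size=60), 400.0))

# Z-score rule: flag points more than 3 standard deviations from the mean.
z_scores = (revenue - revenue.mean()) / revenue.std()
z_outliers = revenue[z_scores.abs() > 3]

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = revenue.quantile([0.25, 0.75])
iqr = q3 - q1
iqr_outliers = revenue[(revenue < q1 - 1.5 * iqr) | (revenue > q3 + 1.5 * iqr)]

# Isolation Forest: scales to large, multivariate data; -1 marks anomalies.
iso = IsolationForest(contamination=0.02, random_state=0)
labels = iso.fit_predict(revenue.to_frame())
iso_outliers = revenue[labels == -1]

print(z_outliers, iqr_outliers, iso_outliers, sep="\n\n")
```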
After detecting outliers, we must decide what to do with them. Should we exclude them or impute them? In some cases, it might be appropriate to remove outliers if they are due to data entry errors or if they substantially distort model performance. However, this should be done carefully and with justification. Consider the following approaches; a short sketch of transformation and winsorization follows the list:
- Transformations: Apply data transformations such as logarithmic or square root transformations to reduce the impact of extreme values. This can help normalize the data distribution and make it more amenable to analysis.
- Truncation: Set a threshold beyond which data points are considered outliers and cap their values at that threshold. This approach retains the data but limits its influence on the analysis.
- Winsorization: Similar to truncation, winsorization replaces extreme values with less extreme ones from the data’s distribution, typically the values at chosen percentiles (for example, the 5th and 95th), thus reducing the impact of outliers.
- Robust Algorithms: Use algorithms that are less sensitive to outliers, such as robust regression or clustering methods that assign lower weights to extreme values.
- Impute Outliers: If you believe an outlier comes from human error and is not systematic, consider treating it as a missing value and imputing it before feeding the data into any model, following the missing value imputation methods above.
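The following sketch shows a log transformation and percentile-based winsorization (implemented via clipping) on a hypothetical order-value series; the 5th/95th percentile bounds are an illustrative choice, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical order values with a heavy right tail.
order_value = pd.Series([20, 25, 22, 30, 28, 24, 26, 500], dtype=float)

# Log transform: compresses extreme values while preserving their order.
log_value = np.log1p(order_value)

# Winsorization via clipping at the 5th and 95th percentiles:
# extreme points are kept but pulled back to the chosen bounds.
lower, upper = order_value.quantile([0.05, 0.95])
winsorized = order_value.clip(lower=lower, upper=upper)

print(pd.DataFrame({
    "raw": order_value,
    "log1p": log_value,
    "winsorized": winsorized,
}))
```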
When it comes to solving real business problems with data science, it is important to remember the saying, "Garbage in, garbage out." No matter how advanced your algorithms are, they can only be as good as the data they’re fed. Establishing a robust data evaluation pipeline ensures that the data you’re working with is high quality, accurate, and reliable. Although improving data quality alone cannot guarantee insightful data-driven decisions or solutions for a business, a robust data-cleaning process lays the foundation for any further analysis.
Outside school and artificial projects, the journey from raw data to valuable insights is a collaborative effort between data scientists, data engineers, domain experts, and decision-makers, all committed to turning quality data into impactful results. This is what I discussed in my previous article, where I highlighted the importance of communication, collaboration, and business acumen as a data scientist.
I hope that this article has provided useful guidance for establishing a strong groundwork and practical workflow for data cleaning in solving real business problems. If you have any additional advice or comments, please feel free to share them with me.
Thanks for reading! Lastly, don’t forget to:
- Check these other articles of mine if interested;
- Subscribe to my email list;
- Sign up for medium membership;
- Or follow me on YouTube and watch my most recent YouTube video.