Pre-process like a Pro — A Must-do list for Data-Engineers

If you are somebody who spends most of your day bent over SQL scripts, Excel spreadsheets, Python and dashboards: hello, fellow Data Engineer! I come in peace. I am not here to throw the tip of my nose high up in the air, exhale with force and explain the difference between Data Science and Data Engineering (I hear that is a trending topic in academic circles now). There is work to be done before any science can be done on any data, and Data Engineers are usually the first responders at the scene.

While a movie (or job description) version of a Data Engineer would be smiling away, firmly shaking hands and closing deals with expensive-looking business partners (with line graphs pointing upwards in the background), real-life Data Engineering is mostly toiling away in the coal mine of legacy data and monster logs to extract that one tiny ounce of what seems like a diamond, but may or may not be one (What’s the canary in YOUR coal mine?).

Attention! This is NOT the real life of a Data-Engineer. Image from Pexels

Whether you deliver clean data for further exploration or offer insights from raw data to support real-life decisions, you have probably realized by now that the vast majority of businesses don’t have a “Data-Culture” in place.

A real Data-Engineer is probably buried somewhere under there. Image from Pexels.

I can vouch for the fact that you are not alone and neither is your business. Imagine what happened when they first invented Excel. Most people went “Nah, we don’t need it, we don’t have the capacity to teach our employees to type, we’re good on paper”. We all know by now that this attitude won’t last.

To navigate the daily obstacles of data pre-processing, understanding the scene and building initial algorithms, here are some things I have internalized.

1. This will take time. Respect this and Invest in it.

It is okay to take time. Breathe, and pre-process.

Understand that when it comes to pre-processing, you are not running a sprint before you “get to the real part”. Pre-processing IS most of the real part. Invest time in this. Make 100% sure that every data point means exactly what you think it means. If you feed questionable data into your algorithm, you get a questionable basis to make questionable decisions. True story. In Bavarian German this is called:

“Kuddlmuddl nei, Kuddlmuddl naus”

Translation : Garbage in, Garbage out

2. Ask. Ask. Ask

Keep asking. Keep. Asking. Image from Pexels.

To make doubly sure your numbers mean what they say, ask them all possible questions. Watch out for anomalies. Here is an example list of questions I ask myself:

Are the values in a certain range?
Do they always occur in this range?
Are there garbage values? If so, what do you want to do with them so that you don't lose information?
Should I zoom in here, or should I zoom out of the curve?
Are the numbers always round or always not round? (I’d be suspicious to see a nice round series of 100.00 where most others are 45.24, 33.34, 231.73 etc.)
Are there unexpected patterns? (Example: every day at 8:00 AM your meter reads the same value. Does this mean your sensor has not started yet?)
Why are there recorded entries when everyone was on holiday? Are these garbage values?
Are some of these numbers references or serial numbers of some kind that don't mean anything?
Which part of the data set is a code? Which is a number that carries “value”? (Example: street number 14 is not greater than street number 13, but 14 meters is greater than 13 meters. One is a dimension and the other is a measure. You don't want to use the same kind of logic on both numbers for further processing.)
Was everyone recording the data (in case of manual documentation) with the same understanding?
Are there spaces or other garbage symbols that don't (or do?) make a difference when taken out?
Is the data time-related? Can I use the time parameter, or can I take it out?
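A few of these questions can be asked in code. Here is a minimal sketch using pandas, assuming a made-up table of meter readings. The column names, the sample values, the expected range of 0 to 500 and the helper names `sanity_report` and `untrimmed` are all hypothetical, so swap in your own.

```python
import pandas as pd

# Hypothetical meter readings; names and values are invented for illustration.
readings = pd.DataFrame({
    "meter_id": [" A1", "A1", "A2 ", "A2", "A3"],
    "value": [45.24, 33.34, 100.00, 231.73, -5.25],
})

def sanity_report(df, column, low, high):
    """Answer a few of the questions above for one numeric column."""
    values = df[column]
    return {
        # Are the values in the expected range? (Missing values count as out of range.)
        "out_of_range": int((~values.between(low, high)).sum()),
        # Suspiciously round numbers among otherwise fractional ones.
        "round_values": int((values % 1 == 0).sum()),
        # Entries that are missing altogether.
        "missing": int(values.isna().sum()),
    }

def untrimmed(df, column):
    """Count text entries that carry leading or trailing whitespace."""
    text = df[column].astype(str)
    return int((text != text.str.strip()).sum())

report = sanity_report(readings, "value", low=0, high=500)
print(report)
print(untrimmed(readings, "meter_id"))
```

The counts only tell you where to look. Whether a flagged value is garbage or a genuine reading is still a question for whoever recorded the data.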

2. Expand your tool-set

Tools are the key to civilization. Expand your box.

If you have been sticking to Excel, expand your tool-set to either R or Python. Automate small jobs like VLOOK-ing it UP or creating joins. When needed, pivot it. Write small but reusable scripts to automate tasks which you know will come up again sometime later. If you are using Python, invest time in getting the hang of pandas and seaborn/matplotlib so that you don't have to “Stack-overflow” the same things repeatedly. (If you are not already doing it, do “Stack-overflow” some things from time to time ;) ).
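As a rough sketch of what that automation can look like, here is a VLOOKUP-style lookup and a pivot in pandas. The two tables and the helper names `lookup_region` and `amount_by_region` are invented for illustration. The left join is my assumption about what a VLOOKUP replacement usually needs: every row of the main table survives, matched or not.

```python
import pandas as pd

# Hypothetical tables; names and values are invented for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "customer_id": ["C1", "C2", "C1", "C9"],
    "amount": [10.0, 20.0, 5.0, 7.5],
})
customers = pd.DataFrame({
    "customer_id": ["C1", "C2"],
    "region": ["North", "South"],
})

def lookup_region(orders, customers):
    """VLOOKUP equivalent: a left join keeps every order, matched or not."""
    # validate raises an error if a customer_id appears twice in the lookup table.
    return orders.merge(customers, on="customer_id", how="left",
                        validate="many_to_one")

def amount_by_region(joined):
    """Pivot: total amount per region, with unmatched orders labelled explicitly."""
    labelled = joined.fillna({"region": "unknown"})
    return labelled.pivot_table(index="region", values="amount", aggfunc="sum")

joined = lookup_region(orders, customers)
summary = amount_by_region(joined)
print(joined)
print(summary)
```

Order 4 refers to a customer that is not in the lookup table, so it ends up under "unknown" instead of silently disappearing from the totals.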

3. Use charts once in a while and go “Hey! What’s going on here?”

Look at your charts and say “This does not feel correct”. Image from Pexels.

If your data has measures and values, plot them at various steps of pre-processing. Make diverse plots to get a visual picture of the situation, and let what you see guide the next pre-processing step. Your eyes are the first and very best algorithm for recognizing anomalies and trends. So, do this “Optical-check” before you proceed any further. Plot the easy values right there in Excel and larger sets with whichever tools you please (I use Excel, Matplotlib or Tableau).
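One way to do the “Optical-check” is to plot the same series before and after a pre-processing step, side by side. This is only a sketch with matplotlib: the sensor values, the plausible range of 0 to 50 and the helper `clip_spikes` are made up, and the output file name is arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # render to a file, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical hourly sensor values with one flat stretch and one spike.
raw = pd.Series([21.3, 21.3, 21.3, 22.1, 23.4, 98.0, 24.0, 24.6, 25.1, 24.8])

def clip_spikes(series, low, high):
    """Example pre-processing step: blank out values outside a plausible range."""
    return series.where(series.between(low, high))

cleaned = clip_spikes(raw, low=0, high=50)

# Same data, two stages, shared x-axis for easy comparison.
fig, (ax_raw, ax_clean) = plt.subplots(1, 2, figsize=(8, 3), sharex=True)
ax_raw.plot(raw.index, raw.values, marker="o")
ax_raw.set_title("Raw")
ax_clean.plot(cleaned.index, cleaned.values, marker="o")
ax_clean.set_title("After range filter")
fig.tight_layout()
fig.savefig("optical_check.png")
```

In the raw panel the spike flattens everything else. Once it is masked, the flat stretch at the start becomes visible, and that is the next thing to ask about.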

4. If using Machine Learning, tread carefully

What did you just say?! Image from Pexels.

Machine Learning has the allure of being a black box that spits out magically correct answers. It is not. It is absolutely not. Any algorithm is only as good as the information you feed it.

Where possible, use non-black-box solutions.
If one class of data dominates the other in your input (Example in classification: you get a single 1 for every 100 0s?), find acceptable methods to stratify or “weight” your labels.
Too many parameters and too little data? Try Random Forests and SVMs as opposed to Neural Networks. BUT, pre-process the data set to make it easier for the algorithm. The more abstract your input, the less reliable your output.
Less data means more pre-processing on your side. Think of it like mashing a banana to feed a baby. The smaller your baby, the more you mash.
Be as specific as possible in the task you assign to your model. A wide range of necessary predictions means more data. In most cases of application-oriented, daily-life machine learning, we have neither the kind of data nor the capacity to process it.
DON'T forget to normalize your data to meaningfully fit a range wherever and whenever possible.
You see categories, you one-hot them.
Train your models at different points in time with updated information. Test on multiple test-sets to see if it REALLY generalizes. (Don’t be disappointed if it doesn’t.)
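Three of these points (weighting labels, normalizing and one-hot encoding) fit into a few lines of pandas. Treat this as a sketch under assumptions: the table is invented, the helpers `balanced_class_weights` and `min_max` are hypothetical names, the weighting formula (total rows divided by number of classes times class count) is one common "balanced" heuristic among several, and min-max scaling is just one way to normalize.

```python
import pandas as pd

# Hypothetical training table; column names and values are invented.
data = pd.DataFrame({
    "length_m": [13.0, 14.0, 20.0, 27.0, 13.0, 18.0],
    "material": ["steel", "wood", "steel", "glass", "wood", "steel"],
    "label": [0, 0, 0, 0, 0, 1],
})

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = labels.value_counts()
    return (len(labels) / (len(counts) * counts)).to_dict()

def min_max(column):
    """Rescale a numeric column to [0, 1]; assumes the column is not constant."""
    return (column - column.min()) / (column.max() - column.min())

# The rare class gets the larger weight.
weights = balanced_class_weights(data["label"])

# One-hot the category, then normalize the measure.
features = pd.get_dummies(data[["length_m", "material"]],
                          columns=["material"], dtype=float)
features["length_m"] = min_max(features["length_m"])

print(weights)
print(features)
```

When you split into training and test sets, take the minimum and maximum from the training part only and reuse them on the test part, otherwise information from the test set leaks into the model.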

5. Respect the “ritual” and lap up the meaning

Respect the Ritual, it is not trivial ;)

Data-Engineering is... fun. Don’t be afraid of enjoying the process just because it feels like you are far from the goal. Every new insight is a step closer and every theory disproved is a tidbit of information gained. Stand your ground and pre-process to perfection. Realize that no finding is trivial and no chart deserves a “But we knew it already!”. No, not like this you didn’t. Spend more than a moment looking at insights, not just charts. That is the major role of a Data Engineer.

Pre-process with care. Most other problems that arise later have generic and widely accepted solutions as long as your pre-processing game is on point!