Data Science

Why You’ll Regret Training ML Models

Before thinking about these 6 things

Background Image from Unsplash, much thanks to the creator @awmleer

I know, you are excited about your new problem and can’t wait to throw Machine Learning at it. But since you are a great data scientist, you know that there are some things you just have to think about before starting. From your first data analysis to production, many things can go terribly wrong. So let’s take a few minutes to think about what you can do to make sure your next project is a success.

First, as with every project, you need to know what your goals are. Is it a fully autonomous or a human-assisted ML system? Understanding the problem landscape can save you and your organization from wasting a lot of time.

Additionally, you need to understand what has already been achieved in the field and where the limitations are. If you don’t, you will end up with a system that cannot be used or, even worse, one that solves a problem that is not worth solving. The following sections cover the stages of the preparation process and should give you a few new ideas on how to approach your next project.

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” ― Abraham Lincoln

Look At The Data

As the old saying goes: garbage in, garbage out. If the data you are trying to use for your ML model is bad, your model will not stand the test of reality. Checking the data before ever using it is the most important duty of any data worker out there, no matter if you are a Machine Learning Engineer, Data Analyst, Data Scientist or Data Engineer.

In the worst case, you will be working on an unsolvable problem for months to come. You should at least look at the data distribution, the quality of the collection process, and the quality of the stored data.

Has a lot of the data been gathered in the same month?
What is the data source and from what time period?
Was it collected through a survey, where only certain demographics responded?
Is it gathered from some website, that may not represent your audience?

These are some of the basic questions that you should be able to answer. If you can’t, at least communicate that there is a risk because you don’t know. Other crucial things to look out for are missing or incomplete records, which can be an early indicator that there is something about the data you don’t know yet. Do some exploratory data analysis and plot the data from every angle you can find. All of this often uncovers useful insights into how to proceed.
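A first pass over these questions can be as small as the sketch below. It assumes each record is a plain dictionary with a date string in a field called `collected_at`; that field name, the other fields, and the sample records are made up for illustration and are not from a real dataset.

```python
from collections import Counter
from datetime import date

def audit(records, date_field="collected_at"):
    """Summarize the collection period, records per month, and missing values per field."""
    dates = [date.fromisoformat(r[date_field]) for r in records if r.get(date_field)]
    per_month = Counter(d.strftime("%Y-%m") for d in dates)
    fields = {key for r in records for key in r}
    # A value counts as missing when it is None or an empty string.
    missing = {f: sum(1 for r in records if r.get(f) in (None, "")) for f in sorted(fields)}
    return {
        "n_records": len(records),
        "first_date": min(dates) if dates else None,
        "last_date": max(dates) if dates else None,
        "records_per_month": dict(per_month),
        "missing_per_field": missing,
    }

# Hypothetical sample records, for illustration only.
records = [
    {"collected_at": "2021-03-02", "age": 34, "source": "survey"},
    {"collected_at": "2021-03-15", "age": None, "source": "survey"},
    {"collected_at": "2021-03-28", "age": 51, "source": "website"},
    {"collected_at": "2021-07-01", "age": 29, "source": ""},
]

report = audit(records)
print(report["records_per_month"])
print(report["missing_per_field"])
```

If most records land in a single month or one field is mostly empty, that is exactly the kind of risk worth raising before any training starts.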

If you get bored, remind yourself that it is much easier to fix a doomed project at this stage than it will be in a few months, when your entire model is trained and all hyperparameters are tuned to perfection.

Worst case, you lose a few days. Best case, you save a few months.

Don’t Look At The Data

Contradiction, you may scream. Fair point, but hear me out. You will make assumptions about your data, and that is great. But you don’t want to make untestable assumptions and potentially overfit. This may create a solution that only works on the data you have seen so far. Remember, generalization is your highest goal: your model’s ability to adapt properly to new, previously unseen data. This is exactly why there is a test set.

Just as your model can overfit on the test data when it gets to look at it, you too will overfit subconsciously as a human system by choosing hyperparameters, assumptions, and architectures that perform well on your test set evaluation. I have several approaches to remove my own imperfections from the training process. But remember that you will never be able to remove them fully.

My favorite approach here is to split the data by time, train on the older data, and then test on the newer data. Even better, I train on all the data I have and test on the data that is gathered while I develop my model. This way, I can ensure that my monkey brain doesn’t overfit unconsciously on the test data. Key takeaway: don’t look at your test data after every epoch of training.
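A time-based split can be sketched in a few lines. The example assumes every sample carries its collection date under a hypothetical `collected_at` key; the cutoff date and the sample values are invented for illustration.

```python
from datetime import date

def split_by_time(samples, cutoff, date_key="collected_at"):
    """Train on samples collected before the cutoff, test on everything from the cutoff on."""
    train = [s for s in samples if s[date_key] < cutoff]
    test = [s for s in samples if s[date_key] >= cutoff]
    return train, test

# Hypothetical samples, for illustration only.
samples = [
    {"collected_at": date(2021, 1, 10), "x": 0.2, "y": 0},
    {"collected_at": date(2021, 2, 3), "x": 0.7, "y": 1},
    {"collected_at": date(2021, 4, 21), "x": 0.4, "y": 0},
    {"collected_at": date(2021, 6, 5), "x": 0.9, "y": 1},
    {"collected_at": date(2021, 8, 17), "x": 0.1, "y": 0},
]

train, test = split_by_time(samples, cutoff=date(2021, 6, 1))
print(len(train), len(test))
```

Because the split depends only on the date, no test sample is older than any training sample, which mimics how the model will meet new data in production.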

You Need More Data

Are my articles about data? I guess that's why I brand my work as ‘Data with Sandro’. But seriously, if you don’t have enough data, you cannot start training anything that generalizes well.

Sadly, it is not easy to tell right from the beginning whether you have enough data. One indicator is the noise you see in the data: if there is a lot of noise, you need more data; if the signal is clear, you can get away with less.

To solve this, you can either gather or label more data. Data augmentation techniques or creating synthetic data can also work well. This is not ideal, but it can work if there are transformations that don’t destroy the meaning of the sample point. A good example is changing the background of an image if you are working on an image classification task.
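As a minimal sketch of a label-preserving transformation, the example below mirrors images horizontally. This is a simpler stand-in for swapping the background, chosen because it fits in a few lines. Whether a flip keeps the meaning intact is an assumption that depends on the task: it usually holds for cats, but not for digits or text. The tiny image and its label are made up.

```python
def horizontal_flip(image):
    """Mirror an image given as a list of pixel rows."""
    return [list(reversed(row)) for row in image]

def augment(dataset):
    """Return the original (image, label) pairs plus one flipped copy of each."""
    augmented = list(dataset)
    augmented.extend((horizontal_flip(image), label) for image, label in dataset)
    return augmented

# Tiny 2x3 grayscale "image", hypothetical data.
image = [[0, 1, 2],
         [3, 4, 5]]
dataset = [(image, "cat")]

augmented = augment(dataset)
print(len(augmented))
print(augmented[1])
```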

One more detail: if you are working with a limited amount of data, be sure to use fewer parameters for your model, since a big neural network will almost certainly just memorize all data points and not generalize at all. Key takeaway: if you are uncomfortable with the amount of data at hand, make sure to gather or somehow get more.
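To get a feeling for how quickly parameters pile up, you can count them before training anything. The sketch below assumes a plain fully connected network; the layer sizes and the dataset size are hypothetical, and comparing parameters to samples is only a crude warning sign, not a rule.

```python
def count_dense_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

n_samples = 500  # hypothetical dataset size

small = count_dense_parameters([20, 8, 1])
large = count_dense_parameters([20, 512, 512, 1])

print(f"small network: {small} parameters, {small / n_samples:.2f} per sample")
print(f"large network: {large} parameters, {large / n_samples:.2f} per sample")
```

With hundreds of parameters per sample, the large network has plenty of room to memorize the training set, which is the failure mode described above.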

Ask The People Who Know The Field

Also called domain experts, these creatures know their field and have been working on the subject for years. They are aware of easy indicators and know where to get data for your problem. Most importantly, they know which problems are valuable to solve and which solutions would save them the most time.

Sometimes solving just a single problem class already solves 80% of the problem. The famous 80–20 rule, also known as the Pareto Principle, applies in Data Science too. Identifying the solutions that are potentially the most productive may be hard for you as an outsider, but domain experts have been exposed to the same problems for years.

Another topic here is interpretability. Domain experts will let you know about regulatory constraints or stakeholders that need interpretable methods to understand the decisions your models produce.

As we know, understanding the problem and the data is crucial. So be sure to also ask them about the data collection process to steer your data analysis in the right direction. Finally, they are the best advertisers you can find and may help you promote your work inside your company or their industry to the people who would benefit most from it. All in all, a clear win-win. Go talk to them; you can still reinvent the wheel after that.

Others Throw Machine Learning At Problems Too

Both academia and industry have tons of creative programmers and data scientists, and you are probably not the first person to throw machine learning at this problem.

“Only a fool learns from his own mistakes. The wise man learns from the mistakes of others.” ― Otto von Bismarck

This is exactly why, at the beginning of a project, you should search for academic papers or blog posts by people who did something similar. Literature research will help you implement state-of-the-art models and preprocessing pipelines. More importantly, it will help you avoid the most common pitfalls and the mistakes others have made. Is the data clearly biased? Does a certain type of model occur in all papers? Is pretraining on another dataset done very often? Or is there even standard software that solves this problem pretty well? These are questions that should be answered in the first few pages you read.

Another source of information is your own company. There may have been similar projects that tried to automate the task without machine learning. These can show you the limits of conventional approaches and indicate which parts of the problem are solvable without machine learning, so you can focus first on the parts they considered unsolvable. They may also give you an early look into company politics and fears that may need to be addressed. Key takeaway: do your homework before starting.

What About Production?

Start with the end in mind, as they say. And as mentioned in the beginning, your goal should always be a model that generalizes well in production. After all, the worth you provide is not the accuracy on the test set but the value your model creates when it is up and running.

Make sure to think early about what data is available at the moment you make your predictions. Or at least make sure that you don't train on unrealistic, post-corrected data. Another important aspect is how fast your answer needs to be ready. Is a response tomorrow sufficient, or does it need to happen within the next few milliseconds? Knowing how much a prediction is allowed to cost is also important in case you are planning to use APIs, which can be a little more costly. How can you ensure the service is always running, and how bad would it be if it were offline for a few hours? Are there hardware constraints? If you are planning to run the model on a Raspberry Pi without an internet connection, you probably won’t be able to use a huge neural network. Do you want your model to learn continuously from new data? Or is manually retraining it once per year under controlled conditions safer?
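Two of these questions are easy to turn into a quick check on paper: which features actually exist at prediction time, and what a paid API would cost per month. The sketch below is only an illustration under assumed numbers; the feature names, delays, request volume, and price are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    name: str
    delay_hours: float  # hours after the event until the value is actually known

def usable_at_prediction_time(features, prediction_delay_hours):
    """Keep only the features whose values exist when the prediction is made."""
    return [f.name for f in features if f.delay_hours <= prediction_delay_hours]

def monthly_api_cost(requests_per_day, price_per_request, days=30):
    """Rough monthly cost of a paid prediction API."""
    return requests_per_day * price_per_request * days

# Hypothetical features of an order; the corrected total only exists a day later.
features = [
    Feature("basket_size", 0.0),
    Feature("customer_age", 0.0),
    Feature("manually_corrected_total", 24.0),
]

usable = usable_at_prediction_time(features, prediction_delay_hours=0.0)
cost = monthly_api_cost(requests_per_day=10_000, price_per_request=0.002)

print(usable)
print(round(cost, 2))
```

Any feature that drops out of the list should also drop out of the training data; otherwise the model learns from information it will never see in production.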

In the end, you are building software, and having a proper software architecture is a must. Be sure to plan for production, or you will be busy working on bug reports instead of creating new models.

Conclusion

Taking your time to really understand the problem may sometimes feel like a waste of time, but remember that it is easier to fix a project at the beginning than at the end.

A machine learning model, after all, is the result of a long journey, and planning this journey involves answering a lot of questions. But once you have all of them figured out, I am sure your next project will be a huge success.
