The downfall of data science and the rise of data engineering?
Is data scientist still the “sexiest job of the 21st century,” and what does that mean for you?
Over the course of the pandemic (and post-pandemic) we have seen tremendous growth in the number of data engineering job postings, which has to a large extent outpaced the growth in data science job postings. Likewise, a plethora of articles have popped up on Towards Data Science and other sites arguing that data engineering is the new “king” of the field [1, 2].
As someone who has worked as both a data engineer and a data scientist, I thought I’d chime in on this shift and dispel some misconceptions about the two job titles.
Context
Fundamentally, data science cannot exist without data engineering, while data engineering can exist without data science. (In this article I’m defining a data scientist as someone who primarily builds ML or statistical models.) Now you might say, “I’m a data scientist and we don’t have a data engineering team.” What that really means is that you are doing your own data engineering work. As a matter of fact, a good percentage of data scientists probably spend most of their time doing DE work, because data science needs lots of clean data to power its models. In contrast, data engineering doesn’t NEED data science: even a business analyst can pull sanitized data into Excel or Tableau to analyze it. With this in mind, let’s look at what has happened over the past five or six years.
Starting back in 2016, according to Glassdoor and several other sites, data scientist emerged as an extremely popular and well-compensated job title. Eager to jump on the AI bandwagon, many companies hired data scientists without really knowing what a data scientist did. Once hired, many of these data scientists quickly realized that the company’s data was nowhere near usable enough to build any sort of useful machine learning model. Data was scattered across many different places (in-house servers, AWS, and even people’s personal laptops). Over a period of several years, companies began to catch on and hire who they actually needed in the first place: data engineers. This was particularly noticeable during the pandemic, when many companies stopped hiring data scientists while growing their data engineer hiring. At most companies data scientists are viewed as more of a luxury than a necessity, whereas data engineers are crucial (e.g., most companies can survive without fancy models but need basic data).
This is reflected in the fact that in the post-pandemic job market data scientist hiring has crept back up, but data engineering hiring clearly remains higher overall [3]. If a recession does occur in the near future, my guess is that we would again see a much greater decline in hiring for data scientists than for data engineers.
Is data engineering harder than data science and why are data engineers so sought after by recruiters?
A question I frequently see online is: given this trend and the (often) higher salary, is DE more difficult than DS? My basic answer is both yes and no. To sum it up succinctly, data engineering is full of tasks that should be trivial but are not. While a data scientist might spend their intellectual energy researching new deep learning models or designing relevant features, a data engineer is often dealing with poorly formed data and constantly failing pipelines. At many companies data infrastructure and data quality have been neglected for years, and people expect you to magically fix them.
For instance, an annoying problem I frequently encountered as a data engineer was standardizing date-time formats so rows would join properly. This included converting all date-times to the same time zone and writing code to handle the edge cases of improperly formatted date-times. This seemingly trivial problem took several days to solve, as it required closely examining all the formats and tracking down the appropriate time zone codes. Humans are imperfect, and given a large enough dataset there are bound to be tons of edge cases.
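To make the date-time problem concrete, here is a minimal Python sketch of the kind of normalization involved. It is not the code I used at the time: the format list, the `to_utc` helper, and the assumed source offset are all hypothetical, and a real dataset would need many more formats. In practice you would likely pass a `zoneinfo.ZoneInfo` for the source time zone; a fixed offset is used here only to keep the example self-contained.

```python
from datetime import datetime, timedelta, timezone, tzinfo
from typing import Optional

# Hypothetical formats seen in the source data; extend as new ones turn up.
KNOWN_FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",  # e.g. 2022-08-05T14:30:00+0200 (offset included)
    "%Y-%m-%d %H:%M:%S",    # e.g. 2022-08-05 14:30:00 (no offset)
    "%m/%d/%Y %H:%M",       # e.g. 08/05/2022 14:30 (no offset)
)


def to_utc(raw: str, assumed_tz: tzinfo) -> Optional[datetime]:
    """Parse a timestamp string and convert it to UTC.

    Strings without an offset are assumed to be in `assumed_tz`.
    Returns None when no known format matches.
    """
    text = raw.strip()
    for fmt in KNOWN_FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
        if parsed.tzinfo is None:
            parsed = parsed.replace(tzinfo=assumed_tz)
        return parsed.astimezone(timezone.utc)
    return None


# Assumed source offset (UTC-5) for timestamps that carry no offset.
source_tz = timezone(timedelta(hours=-5))
for raw in ["2022-08-05T14:30:00+0200", " 08/05/2022 14:30", "5th of August"]:
    print(repr(raw), "->", to_utc(raw, source_tz))
```

Returning `None` instead of raising lets a pipeline route unparseable rows to a review table rather than failing the whole batch, which is usually where the remaining edge cases get discovered.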
Another problem data engineers face is constantly changing data formats and APIs. Even internally, other engineering teams often aren’t the best at notifying data engineering of changes to APIs or data output formats. This causes pipelines to fail, and you are faced with hours of looking through Airflow logs (or those of whatever tool you’re using) to find the culprit. Public APIs are often even worse and can go entirely offline or change their output format without any notice. Even if you are notified in a timely manner, it is still extremely tedious and difficult to change the code and redeploy the pipeline without any downtime (or to back-fill the data missed while the pipeline was down).
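One common way to shorten the log hunt is to check each incoming record against the shape you expect before it enters the pipeline. The sketch below is only an illustration in plain Python: the `EXPECTED_SCHEMA` fields and the `schema_problems` helper are hypothetical, and many teams would reach for a dedicated validation library instead.

```python
from typing import Any, Dict, List

# Hypothetical expected shape of one upstream record.
EXPECTED_SCHEMA: Dict[str, type] = {
    "order_id": int,
    "amount": float,
    "created_at": str,
}


def schema_problems(record: Dict[str, Any],
                    schema: Dict[str, type] = EXPECTED_SCHEMA) -> List[str]:
    """Return human-readable differences between a record and the schema."""
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            actual = type(record[field]).__name__
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {actual}"
            )
    # Fields the upstream team added (or renamed) without telling anyone.
    for field in sorted(record.keys() - schema.keys()):
        problems.append(f"unexpected field: {field}")
    return problems


# An upstream change renamed created_at and turned order_id into a string.
changed = {"order_id": "1001", "amount": 19.99, "createdAt": "2022-08-05"}
for problem in schema_problems(changed):
    print(problem)
```

Failing fast with a message like this at the ingestion step tends to point at the upstream change directly, instead of surfacing as a cryptic error several tasks later.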
At points data engineering can also become political, for example when determining who gets access to what data or lobbying for access to certain protected data. While the CTO, CDO, and other leaders might determine an organization’s data policies, it is often DE that implements the actual access controls. Similarly, data engineers often have to at least be aware of whatever regulatory framework governs their data. For instance, I remember having to design mechanisms to delete customer records from our data lake to comply with GDPR. In other cases I’ve had to design pipelines to strip data of personally identifiable information for compliance with HIPAA.
Data engineers usually support multiple teams, including data science, data visualization, and basic analysts. This may sound like a good thing, but it also means more responsibility. When things run smoothly people barely notice data engineering, but when people don’t get their data they tend to be very upset and raise a stink. Because of this, data engineers, more often than data scientists, have to rotate on-call assignments in case pipelines fail. Different teams also have different requirements, and you often need to create slightly different variations of the same dataset to appease them all.
Another reason good data engineers are harder to find is that the field just isn’t popular in the way data science is. Very few people in college or grad school aspire to be data engineers. Even a quick look at subscriber counts shows 797k for r/datascience versus 69k for r/dataengineering (as of 8/5/2022). Similarly, we now see bootcamps galore for data science but far fewer for data engineering.
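As a rough illustration of what a PII-stripping step can look like, here is a small Python sketch. The field names, the `strip_pii` helper, and the key handling are all hypothetical. Replacing an identifier with a keyed hash is pseudonymization, which on its own is not enough to satisfy HIPAA or GDPR; this shows the shape of the pipeline step, not a compliance recipe.

```python
import hashlib
import hmac
from typing import Any, Dict

# Hypothetical direct identifiers to drop entirely.
PII_FIELDS = {"name", "email", "phone"}
# Hypothetical identifier to keep joinable by replacing it with a keyed hash.
ID_FIELD = "customer_id"


def strip_pii(record: Dict[str, Any], secret_key: bytes) -> Dict[str, Any]:
    """Drop direct identifiers and pseudonymize the customer ID."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    if ID_FIELD in cleaned:
        message = str(cleaned[ID_FIELD]).encode("utf-8")
        cleaned[ID_FIELD] = hmac.new(
            secret_key, message, hashlib.sha256
        ).hexdigest()
    return cleaned


# The key would come from a secrets manager in practice, never from source code.
example_key = b"not-a-real-key"
row = {"customer_id": 42, "name": "Ada", "email": "ada@example.com", "amount": 10}
print(strip_pii(row, example_key))
```

Because the same key always maps a given customer to the same hash, downstream teams can still join and count by customer without ever seeing the raw identifier.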
Therefore, it is much more difficult for organizations to find talented data engineers. This makes data engineers in most instances more highly sought after than data scientists.
Skills and career path trajectory
Data science hiring in general places a greater emphasis on educational attainment than data engineering hiring does. Often you need a master’s degree or higher to be hired at a lot of companies. Data scientists need good mathematical and statistical knowledge in order to explain what their models are doing. Certain companies even require a track record of academic publications (though this is rarer). While it is possible to get hired without an advanced degree (myself being an example), it is often an arduous process.
Data engineering hiring is much more experience-based. Hiring managers will want to see that you are capable with tools like Airflow, Docker, Spark/Flink, and Kubernetes, along with a variety of flavors of SQL and NoSQL. For data engineering there are few theoretical skills that can substitute for working with real-world data. A good way for aspiring data engineers to gain valuable skills is to build an updating dataset based on open source APIs. Courses and certifications from the major cloud providers can also help candidates.
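For anyone wondering what the core of such an updating dataset might look like, below is a minimal Python sketch of the load step using SQLite. The `observations` table, its columns, and the `upsert_records` helper are hypothetical, and the API call itself is left out; the records here stand in for whatever a scheduled fetch would return.

```python
import sqlite3
from typing import Any, Dict, Iterable


def upsert_records(conn: sqlite3.Connection,
                   records: Iterable[Dict[str, Any]]) -> int:
    """Insert new rows, overwrite rows whose id already exists,
    and return the total row count afterwards."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS observations ("
        "id TEXT PRIMARY KEY, value REAL, fetched_at TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO observations (id, value, fetched_at) "
        "VALUES (:id, :value, :fetched_at)",
        records,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM observations").fetchone()[0]


# Two simulated runs: the second one re-delivers station-1 with a new value.
conn = sqlite3.connect(":memory:")
first_run = [
    {"id": "station-1", "value": 21.5, "fetched_at": "2022-08-05T00:00:00Z"},
    {"id": "station-2", "value": 19.0, "fetched_at": "2022-08-05T00:00:00Z"},
]
second_run = [
    {"id": "station-1", "value": 22.1, "fetched_at": "2022-08-06T00:00:00Z"},
]
print(upsert_records(conn, first_run))
print(upsert_records(conn, second_run))
```

Keying rows on a stable ID makes reruns idempotent, so a failed or repeated run does not duplicate data. Wrapping this in a scheduler such as Airflow and adding a real fetch step is where most of the learning happens.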
Switching back and forth between the two career paths is difficult yet doable. As a former data engineer, I found shifting to data scientist challenging despite the proximity of the two roles. Almost all companies and recruiters automatically pegged me as a data engineer and would send me exclusively data engineering positions. They would even “suggest” DE positions that I might be better suited for when I applied to DS positions. I had to extensively rework my resume to feature the more data science oriented aspects of my previous jobs.
Similarly, the higher you go in one area the more boxed in you become. Data engineers usually become senior data engineers, who then become lead data engineers and then managers. Data scientists continue up a similar (senior, lead, manager) progression. Eventually the two roles might re-converge at a director of analytics or chief data officer type role, though in some organizations the two groups sit under completely different VPs (e.g., at one of my companies data science reported to the chief marketing officer and data engineering reported to the chief technology officer).
Machine Learning Engineers (MLOps)
A more recent trend is companies creating roles called Machine Learning Engineers (MLEs). In theory these roles are supposed to focus on the deployment of machine learning models. In practice, however, most companies put the cart before the horse and hire for these roles when they have no actual models to deploy. Therefore, MLEs often end up doing work to support data science. In this respect there is a fair amount of overlap with data engineering; however, MLEs generally do more work on the data science code base (e.g., taking models out of notebooks) and on spinning up DS infrastructure, such as clusters for experiments or experiment tracking tools. My guess is that in the future MLEs will do more actual deployment, but most companies are just not at that stage right now.
Reward and impact
As someone with an interest in research, I almost instinctively found data science more fascinating. When I worked as a data engineer I was frequently frustrated with the tedious tasks and annoyed that I wasn’t developing cutting-edge deep learning models. I wanted to take the field of AI/ML forward as a whole, not speed up some pesky data stream. I knew that data engineering was important; I just didn’t want to be the one doing it.
Over time, though, I’ve come to appreciate the rewards of data engineering more. Creating a high-fidelity dataset can in some ways prove more useful and longer lasting than creating a slightly different NN architecture. Think about it: all these years later, ImageNet, CIFAR, and COCO are still used in almost every computer vision paper. In NLP, SQuAD is still used for Q&A, and BERT would not have been possible without large-scale scraping efforts. There are also rare, extremely rewarding moments in data engineering when you see your pipelines processing gigabytes of data per second and your tables populating in real time.
That said, I’m happy that designing datasets is not the primary function of my job at this point. Tweaking deep learning architectures and implementing/combining mechanisms from different papers seems much more intellectually stimulating.
Future Trends
Over time, as companies’ data infrastructure matures, I think we will see a greater surge in data scientist hiring. However, the need for data engineers will likely remain strong. Even as data infrastructure matures, data engineers will still be needed to handle failing pipelines and changing data formats. So in a certain respect data engineering may be the more stable career in the near future. That said, I recommend going into it only if you find the idea of constructing datasets truly appealing and are ready to accept all the tediousness that comes with it.
