Tomer Gabay


Succeeding in Data Science: 4 Essential Skills You Didn’t Learn in University

How to bridge the gap between academia and employment in the data science field

Photo by krakenimages on Unsplash

I’ve held multiple positions as a data scientist in different organizations and companies since graduating from a Dutch university in the data science field. What has surprised me and some other graduates is the gap between the skills you are taught at university and the skills required to be a valuable data science employee. In this article, I’d like to highlight the four biggest gaps between the skills taught at uni and the skills I’ve found necessary to perform well as a data science employee.

I’ve included useful resources in each section to help you enhance your skills in that subject area, should you wish to improve your proficiency.

Building solid data science projects

At university it is common to work ‘Notebook-style’, using Jupyter Notebooks. The most solid project structure I set up at uni was calling different modules from a main.py file and using a requirements.txt. But using a setup.py? Or a pyproject.toml? Nope.

What companies require from a data science project is quite different. Your data science project should often be built as an installable package, e.g. installable through pip install <package_name>. A requirements.txt is sometimes still necessary, but the package’s dependencies should at least be specified in the pyproject.toml (nowadays preferred over setup.py; see PEP 518, which introduced the file, and PEP 621, which standardizes the project metadata it holds).
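
As a rough illustration of what that means in practice, here is a minimal pyproject.toml sketch. It assumes setuptools as the build backend; the project name, version and dependency list are hypothetical placeholders, not taken from a real project.

```toml
# Build backend declaration (PEP 518). setuptools is an assumption here;
# other backends such as hatchling or flit work in a similar way.
[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"

# Project metadata (PEP 621). All values below are placeholders.
[project]
name = "my-ds-project"
version = "0.1.0"
requires-python = ">=3.9"
dependencies = [
    "pandas",
    "scikit-learn",
]
```

With a file like this in the project root, running pip install . from that directory should build and install the package together with its dependencies.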

Besides being installable, your data science project should often be able to run as a Docker container-based API in the cloud. By calling your API with REST requests, data can be processed through pipelines or passed to a machine learning model to obtain predictions.
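
To make the idea of a prediction API concrete, here is a minimal sketch that uses only Python’s standard library. In a real project you would more likely use a web framework and a trained model; the predict function, the /predict endpoint and the JSON layout below are hypothetical and only serve to show the request and response flow.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Placeholder 'model' (hypothetical): returns the mean of the inputs."""
    return sum(features) / len(features)


class PredictionHandler(BaseHTTPRequestHandler):
    """Answers POST /predict with a JSON body like {"features": [1, 2, 3]}."""

    def do_POST(self):
        if self.path != "/predict":
            self.send_error(404, "Unknown endpoint")
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length))
            result = {"prediction": predict(payload["features"])}
        except (ValueError, KeyError, TypeError, ZeroDivisionError):
            # Malformed JSON, missing key, wrong types, or an empty list.
            self.send_error(400, "Invalid request body")
            return
        body = json.dumps(result).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the API; inside a container this would be the entry point."""
    HTTPServer(("0.0.0.0", port), PredictionHandler).serve_forever()
```

Calling serve() starts the server, after which a POST request to /predict with a JSON body returns the prediction as JSON. Packaging this in a Docker image then mostly comes down to installing the package and running that entry point.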

If you recognize that you’re also mostly doing data science projects ‘Notebook-style’, or that you are not familiar with how to build an installable Python project, I’d recommend reading e.g.

If you’re able to build an installable package, but you’re not sure how to build and run your data science project as a Docker-based API, read:

Working in a cloud-based environment

Nowadays, virtually every company and organization uses at least some cloud-based software. Currently, the most popular platforms are Microsoft Azure, Google Cloud Platform (GCP) and Amazon Web Services (AWS). At my university, I hardly used any of the resources available on these cloud platforms. However, in working life, knowledge of cloud platforms is almost as essential for data scientists as their programming skills. Some examples of (cloud) engineering tasks I have had to do as a data scientist so far:

  • Deploy Python applications in a Docker container
  • Build CI/CD pipelines
  • Host Python packages as artifacts in the Cloud
  • Build data pipelines

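Having some knowledge of one of the platforms above is already a huge plus when applying for a data science position. Each platform has its own learning tracks and certificates, with different tracks for different roles, so you can both learn and prove your cloud knowledge. Some relevant tracks for a data scientist on each platform are:

  • Azure DP-100 (https://learn.microsoft.com/en-us/certifications/azure-data-scientist/)
  • GCP Machine Learning Engineer (https://cloud.google.com/learn/certification/machine-learning-engineer)
  • AWS Data Analytics (https://aws.amazon.com/certification/certified-data-analytics-specialty/)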

Getting such a certificate will definitely make you more valuable as a (potential) data science employee.

Working with other data scientists

One of the most important differences between data science at a university and data science in real life is that in real life you have to work together, continuously, while at university there are often only a few team-based projects over the course of an entire bachelor's or master's.

So far, every company and organization I have worked for used either Scrum or DevOps. I’ve not had a single job interview where knowledge of and/or experience with one of these methods wasn’t a plus. Besides working with Scrum or DevOps, code reviewing is a critical task for a data scientist. This means you have to be able to write clear and readable code for others to review, as well as quickly and critically judge someone else’s code. A very interesting article about how to give constructive feedback in a code review is “How to Do Code Reviews Like a Human (Part One)” (https://mtlynch.io/human-code-reviews-1/).

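Also, you can’t get away (any longer) with unclear function or variable names, or objects which don’t adhere to Python’s PEP-8 naming conventions. If you want to learn more about how to write high-quality code, see “6 Python Best Practices that Distinguish Senior Developers from Juniors” (https://towardsdatascience.com/6-python-best-practices-that-distinguish-seniors-from-juniors-84199d4cac3c).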
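
As a small illustration of the PEP-8 naming conventions mentioned above, the sketch below uses the three most common patterns. The names themselves (MAX_CUSTOMER_AGE, CustomerRecord, age_in) are made up for this example.

```python
# Constants: UPPER_CASE_WITH_UNDERSCORES
MAX_CUSTOMER_AGE = 120


class CustomerRecord:  # Classes: CapWords
    def __init__(self, full_name, birth_year):
        # Attributes and variables: lowercase_with_underscores
        self.full_name = full_name
        self.birth_year = birth_year

    def age_in(self, year):  # Functions and methods: lowercase_with_underscores
        """Return the customer's age in the given year."""
        return year - self.birth_year

    def has_plausible_age(self, year):
        """Check that the age falls between 0 and MAX_CUSTOMER_AGE."""
        return 0 <= self.age_in(year) <= MAX_CUSTOMER_AGE
```

Compare this with names like calc(x, y) or class customer_record: the code would work just as well, but a reviewer has to spend extra effort working out what each name stands for.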

Working with real-life data

At uni, most of the datasets students encounter are already heavily preprocessed. Often those datasets’ columns have meaningful names, wrong entries have been deleted and the datatypes are already set correctly. The data you’ll encounter while working for a company or organization is regularly not even close to the standard of the datasets you encounter at uni, unless you work at a proper data-driven (tech) company.

A few examples of messy real-life data I’ve come across while working:

  • Crime data in which some crimes had a date of perpetration in the future.
  • Column names without any naming convention [name, Address, JobDescription, place_of_birth]
  • Duplicate columns with different values [job_title, JobTitle]
  • DateTime values from different timezones without any timezone information.
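
The kinds of problems listed above can often be detected with a few small helper functions. The sketch below uses only the standard library; the function names and the choice of snake_case as the target convention are assumptions made for illustration, and in practice you would likely apply the same ideas to a pandas DataFrame.

```python
import re
from datetime import datetime, timezone


def to_snake_case(name):
    """Convert names such as 'JobDescription' or 'Address' to snake_case."""
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    return re.sub(r"[\s\-]+", "_", name).lower()


def colliding_columns(columns):
    """Group columns that map to the same normalized name (likely duplicates)."""
    groups = {}
    for column in columns:
        groups.setdefault(to_snake_case(column), []).append(column)
    return {key: names for key, names in groups.items() if len(names) > 1}


def localize(timestamp, assumed_tz):
    """Attach an assumed timezone to a naive timestamp; keep aware ones as-is."""
    if timestamp.tzinfo is not None:
        return timestamp
    return timestamp.replace(tzinfo=assumed_tz)


def is_future(timestamp, now=None):
    """Flag timezone-aware timestamps that lie after 'now' (default: current UTC)."""
    now = now or datetime.now(timezone.utc)
    return timestamp > now
```

For example, colliding_columns(["name", "Address", "job_title", "JobTitle"]) reports that job_title and JobTitle collide. Which timezone to assume for naive timestamps is exactly the kind of question you will have to ask the business; the code cannot answer it for you.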

Because real-life data is much messier than data prepared for students, expect that as a data scientist you’ll spend most of your time talking to the business to ask for explanations of columns and values, cleaning data, and combining data from different sources, rather than performing data analysis or building machine learning models.

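For more on data cleaning, see e.g. “The Ultimate Guide to Data Cleaning” (https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4).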

If you want to practice on some real-life (messy) datasets, here are ten datasets to practice your data-cleaning skills on: https://analyticsindiamag.com/10-datasets-for-data-cleaning-practice-for-beginners/

To Conclude

The skills that universities teach for data science may not be sufficient for excelling as a data science employee. As a data scientist who has worked in multiple organizations, I have noticed four major gaps in skills that require addressing:

  • building solid data science projects.
  • working in a cloud-based environment.
  • collaborating with other data scientists.
  • working with real-life data.

Using the resources mentioned in this article, you can close possible gaps in your knowledge or skills in these highly valued areas!

Of course, each university and each degree programme has its own data science curriculum, and some universities may prepare you better or worse for working as a data science employee. Also, it has been several years since I graduated from university, so perhaps universities’ data science curricula have improved since then, for example by adding (more) cloud-related courses.

Resources

Solid Python projects

  • https://packaging.python.org/en/latest/tutorials/packaging-projects/
  • https://towardsdatascience.com/how-to-convert-your-python-project-into-a-package-installable-through-pip-a2b36e8ace10
  • https://readmedium.com/how-to-start-any-professional-python-package-project-9f66538ebc2
  • https://towardsdatascience.com/deploy-a-machine-learning-model-as-an-api-on-aws-43e92d08d05b

Cloud-based environment

  • https://learn.microsoft.com/en-us/certifications/azure-data-scientist/
  • https://cloud.google.com/learn/certification/machine-learning-engineer
  • https://aws.amazon.com/certification/certified-data-analytics-specialty/

Working with other data scientists

  • https://www.scrum.org/resources/what-scrum-module
  • https://en.wikipedia.org/wiki/DevOps
  • https://mtlynch.io/human-code-reviews-1/
  • https://peps.python.org/pep-0008/
  • https://readmedium.com/6-python-best-practices-that-distinguish-seniors-from-juniors-84199d4cac3c

Working with real-life data

  • https://towardsdatascience.com/the-ultimate-guide-to-data-cleaning-3969843991d4
  • https://analyticsindiamag.com/10-datasets-for-data-cleaning-practice-for-beginners/

Data Science
Technology
Programming
Machine Learning
Career Advice