As A Student I Failed To Hone This Development Skill Employers Secretly Crave But It’s Not Too Late For You

As a data science student, don’t get so distracted by technical requirements that you overlook opportunities to develop an underrated, critical skill.

I don’t think this is an unpopular opinion: I consider discussion boards in a data science course (or any STEM subject) to be as useful as a FIXME comment in two-year-old archived code.

As a former teaching assistant and almost-teacher, I find most exercises that force collaboration to be a distraction from “real” learning. At best, these assignments can provide a needed boost to struggling students who virtually “show up” to the course every week. At worst, they are examples of professors and curricula designers “going through the motions” of course engagement without the substance.

Which brings me to another exercise I never appreciated as a student but now, as a professional developer, wish I had taken more seriously:

Peer code reviews.

After spending a solid year of professional work getting the GitHub equivalent of red ink on my code, courtesy of my very thorough (yet very patient) senior team members, best coding and development practices finally clicked.

There was only one thing that took even longer to learn than submitting effective code.

And that was submitting effective code feedback, a skill that, like other aspects of data science, was glossed over in my formal education.

An artist’s rendition of me snoozing through student code reviews. Photo by Zhang Kenny on Unsplash.

Make no mistake. Code feedback is not a technical skill. Sure, you are commenting on the outcome of a technology-oriented product but you’re accomplishing this through critical analysis and clear communication.

In a review, you will rarely make a major technical contribution like an entire Airflow DAG, Python function or SQL query.

So it’s better to understand how to identify and communicate the need for a revision rather than concerning yourself with specific implementation. After all, that’s your teammate’s job.

There are plenty of articles on “how to give a code review” so I’m not going to delve too deeply into “10 things you need to include in your code review.”

While I wish there had been more emphasis on the code review process in my data science program, one useful aspect of that exercise was that my professors provided a rubric for each code review we conducted.

Notice I said rubric and not checklist. Getting too granular can be detrimental to a code review process and, if your team is working on deadline, requesting changes on the spelling in a log message (something I’ve shamelessly done) doesn’t necessarily take priority.

Like writing, there are general coding principles and best practices, but everyone has their own style and if something is functional and efficient, there is no reason to impose your own preferences on someone just because “this is how I would do it.”

So, in the spirit of rubric-based feedback, below are the areas I use to mentally “grade” each pull request (PR) I receive. If you’re required to provide code feedback in school but not given guidance, feel free to steal these focus areas to use as a starting point.

The Essentials

When I was a tutor we distinguished between burning priorities and lesser concerns by using the terms “global” and “local” (like variable definitions). I start at the global level, looking at the code overall.

Before you pick apart each line, look at the code as a whole. Beyond the obvious “Will this run?”, my next question is: Does this build contain everything it needs (and nothing it doesn’t)?

This is the most basic level of my review. Thinking of the particular type of build I’m reviewing, typically some Python or SQL code deployed on GCP infrastructure, I check to make sure “everything” is there.

For context, when I deploy a cloud function on GCP I usually submit four files:

Main
Config
Requirements
GitHub Actions yml

If a teammate submits code for a cloud function I’m first making sure these were included and, because we sometimes templatize our code, I’m making sure that the commit doesn’t include aspects of previous work, especially when we specify naming conventions of the cloud function and associated pub/sub trigger.

I’ll do a quick check between main and config to make sure the variables referenced in main are defined in config and, most importantly, that every variable defined is actually used. I check the dependencies defined in the requirements file. If I’m skeptical or have enough time, I might do a quick search on PyPI to check if a package version is deprecated or a newer version is available.

Then I’ll check the main file to make sure these packages are properly imported.

If package versions are outdated, this can cause issues that trigger vague and really frustrating errors that undermine a script’s durability.
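
Part of this check can be roughed out in code. Below is a minimal sketch, assuming a requirements.txt-style file with one package per line and a main file written in Python. The function names are hypothetical, and the comparison assumes a package’s install name matches its import name, which is not always true (google-cloud-bigquery, for instance, is imported as google.cloud), so treat the output as a list of things to look at rather than a verdict.

```python
import ast


def imported_modules(source: str) -> set[str]:
    """Return the top-level module names imported by a Python source string."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import config"
            modules.add(node.module.split(".")[0])
    return modules


def required_packages(requirements_text: str) -> set[str]:
    """Return lowercased package names from requirements-style text, ignoring versions."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#")[0].strip()  # drop comments and blank lines
        if not line:
            continue
        # Cut the name off at the first version specifier or extras bracket
        for sep in ("==", ">=", "<=", "~=", "!=", ">", "<", "["):
            line = line.split(sep)[0]
        names.add(line.strip().lower())
    return names


def unused_requirements(main_source: str, requirements_text: str) -> set[str]:
    """Return packages listed in requirements that main never imports."""
    imported = {name.lower() for name in imported_modules(main_source)}
    return required_packages(requirements_text) - imported


# Example with made-up file contents
main_source = "import requests\nimport json\nfrom os import path\n"
requirements_text = "requests==2.31.0\npandas>=2.0  # dataframes\n"
leftovers = unused_requirements(main_source, requirements_text)
```

Here leftovers would flag pandas, which is the kind of templated leftover I’d ask about in a review.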

Durability

After asking “Will this run?”, my next question is: How likely is it that this will break? Has the developer considered all possible scenarios and implemented proper error handling?

For an ETL pipeline, I ask:

Is the API request written correctly?
Is the API response logged and, ideally, is the status_code logged?
Has the developer handled the possibility of no data returned both in the API request function and in the main script where it is referenced?
If downloading or streaming a large volume of data: Has the developer properly logged any download status returned from the API?
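
To make those questions concrete, here is a rough sketch of what I’d hope to see, not a definitive implementation. It assumes the response object exposes status_code, text and json() the way a requests.Response does, and that the payload is a JSON object holding its records under a “results” key. That key, the function names and the FakeResponse stand-in are all hypothetical.

```python
import logging

logger = logging.getLogger("etl")


def extract_records(response, key="results"):
    """Return the list of records in an API response, or [] if there are none."""
    logger.info("API responded with status_code=%s", response.status_code)
    if response.status_code != 200:
        logger.error("Request failed: %s", response.text)
        return []
    payload = response.json() or {}
    records = payload.get(key) or []
    if not records:
        # A successful call with no data is its own scenario; log it
        logger.warning("Request succeeded but returned no data")
    return records


def run_pipeline(response):
    """Handle the empty case again where the request function is referenced."""
    records = extract_records(response)
    if not records:
        logger.info("Nothing to load; exiting early")
        return 0
    # The load step would go here
    return len(records)


class FakeResponse:
    """Stand-in for a real HTTP response, for demonstration only."""

    def __init__(self, status_code, payload=None, text=""):
        self.status_code = status_code
        self.text = text
        self._payload = payload

    def json(self):
        return self._payload


rows_loaded = run_pipeline(FakeResponse(200, {"results": [{"id": 1}, {"id": 2}]}))
```

Note that the empty case is handled twice on purpose: once where the response is parsed and once in the script that calls it.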

If the script contains a SQL query, I typically run it or, in BigQuery, at least copy/paste it into the query environment to see if it will run.

For anything beyond a minor change, I (and my senior peers) ask the submitting developer how they’ve tested what they want to put into production.

This is something you should ask, even at the student level.

A simple: How is this tested? Or, to be more precise: How did your test mimic the environment you want to ultimately run this script in?

If your peer responds: “It ran in my environment”, you might be in for a rough time.

Syntax

At the lower level I check syntax.

Despite writing frequently and coming from a liberal arts background, I’m terrible at naming variables. More than once my teammate has asked me to rename “a_df” or “another_df.” It’s a nit-picky review item, but syntax is important, especially for someone unfamiliar with your code. Your variable names should be descriptive enough to tell your reviewer and future developers what each one contains.

For instance, instead of “data”, it might be better to call a variable “json_response” so you know the kind of data returned. If you’re downloading a CSV from an AWS S3 bucket, instead of calling the response variable “df” you might call it “s3_csv.” These are small changes that help anyone looking at the code in the future.

This also extends to function names.

Not only should you make sure that functions are accurately named, but you should also make sure that they do only what they’re intended to do. For example, a function called “make_request” should really only make an API request, not make a request, start a VM and prepare a training set.
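
As a small illustration, here is one way that separation might look. This is a sketch under assumptions: fetch is a hypothetical callable standing in for whatever HTTP client you use, and prepare_training_set and its label_field parameter are made up for the example.

```python
def make_request(fetch, url):
    """Make the API request and return the JSON response. Nothing else."""
    json_response = fetch(url)
    return json_response


def prepare_training_set(json_response, label_field):
    """Split records into (features, labels), kept out of make_request on purpose."""
    features, labels = [], []
    for record in json_response:
        labels.append(record[label_field])
        features.append({k: v for k, v in record.items() if k != label_field})
    return features, labels


# Usage with a fake fetch function so the example runs without a network call
def fake_fetch(url):
    return [{"x": 1, "y": 0}, {"x": 2, "y": 1}]


json_response = make_request(fake_fetch, "https://example.com/api")
features, labels = prepare_training_set(json_response, "y")
```

Each name now tells the reviewer what the function does and what the variable holds, and each function can be tested on its own.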

Legibility

The last thing I look for (and you should too) is legibility.

In over two years of professional work, I’ve needed more than once to open, understand and contribute to a code file that hasn’t been touched in months or years.

There are minimal tweaks you can make to ensure a script is readable the moment you commit it and for months and/or years into the future. Assuming the script you’re reviewing is perfectly functional, structurally sound and syntactically correct, then you want to look for opportunities to make it as legible as possible.

Here are small tweaks I often suggest:

Docstring summaries for each function
Type hints for each function parameter
Clear and frequent logging messages
In-line comments explaining any engineering choices/complex procedures
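
Here is a short sketch that applies all four tweaks to one function. The function itself is hypothetical, and the inline comment describes an assumption about the data made up for the example.

```python
import logging

logger = logging.getLogger(__name__)


def drop_duplicate_rows(rows: list[dict], key: str) -> list[dict]:
    """Return rows with duplicates removed, keeping the first row seen per key."""
    seen = set()
    unique_rows = []
    for row in rows:
        # Keep the first occurrence, assuming the source is ordered oldest-first
        if row[key] in seen:
            continue
        seen.add(row[key])
        unique_rows.append(row)
    logger.info(
        "Kept %d of %d rows after de-duplicating on '%s'",
        len(unique_rows),
        len(rows),
        key,
    )
    return unique_rows


deduped = drop_duplicate_rows(
    [{"id": 1, "v": "a"}, {"id": 1, "v": "b"}, {"id": 2, "v": "c"}], "id"
)
```

None of these additions change what the function does, which is exactly why they are easy to suggest in a review.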

If you’re reviewing a PR that is all-but-perfect, chances are your peer could still improve the neatness and legibility of their script.

Even as a professional developer, a review can sometimes feel like a box to check. However, I hope you’ve seen that even without a working knowledge of the code, there is almost always an opportunity to elevate someone else’s work through actionable feedback.

Being able to look at and comment on code without running it or even having the entire context is a skill that takes time to refine. And you might find that even if you make a suggestion, that doesn’t necessarily mean your peer will implement it.

Unlike notes on an essay or a product review, code reviews are not one-sided communications. They open a dialogue between two parties. Experienced developer and new engineer. Domain knowledge-equipped product manager and contract coder.

Sometimes, only the code’s creator and owner will have the kind of context required to intimately understand its function and purpose.

And that’s ok. As long as they listen to the people who have to run it.
