6 Tips for Writing Clean and Maintainable Python Functions for Data Engineering
What does clean and maintainable code mean?
Clean code is code whose intent is easy to comprehend. The intent is expressed clearly and concisely, and the reader is neither left searching for further documentation nor surprised by returned results. The ultimate goal of clean coding is to make it easy for colleagues, users and future maintainers to work with your code.
Maintainable code is code that is easy to modify and extend over time. It is designed and structured so that developers can make changes, fix issues, and add new features without introducing unintended side effects. Writing maintainable code reduces the time and effort required to support and enhance pipelines and improves collaboration among developers. The definitions of ‘clean’ and ‘maintainable’ overlap considerably, and it's common to see both used in the same phrase.
Why is clean and maintainable code important in Data Engineering?
Businesses never stop asking questions of their data. New products, clients and stakeholders create an ever-rotating data landscape, and with these changes come new demands on the data pipelines that support decision-making. To incorporate new business requirements, a Data Engineering codebase needs to be readily adaptable, moving with the winds of change rather than against them. It should let engineers respond to the demands of the business without creating friction for those who maintain it.
The first step to achieving this is to communicate the intent of a codebase or pipeline clearly via clean and maintainable code. Clean and maintainable code provides the robust foundation upon which enhancements to pipelines are easily made. In this way, Data Engineers can respond to stakeholder requests with confidence and efficiency.
Thankfully, there is a clear set of principles you can follow when writing Python functions to help ensure you achieve these ambitions. The following tips are easy to implement, and with a bit of practice, problematic code will become easy to spot and easy to fix.
So let's dive in.
1. Conform to the Single Responsibility Principle
The Single Responsibility Principle (SRP) is a programming philosophy which states that a class or function should do one thing and do it well. Adhering to the SRP in Python ensures that each function has a clear and distinct responsibility, which promotes modularity and flexibility and reduces the likelihood of introducing bugs when making changes to the codebase.
Consider the example below. Here we have a function that builds API request parameters, makes an API POST request using those parameters, extracts the relevant data from the response and then saves that data to a JSON file stored in an S3 bucket:
import requests
import json
import boto3


def get_data(
    access_token,
    url,
    target_app,
    interval_granularity,
    interval_type,
    queue_IDs,
    aws_access_key,
    aws_secret_access_key,
    bucket_name,
    key_name,
):
    # Request headers
    headers = {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"bearer {access_token}",
        "target_app": f"auth_for{target_app}",
    }
    payload = {
        "queue_ids": queue_IDs,
        "interval_type": f"{interval_type}",
        "granularity": f"PT{interval_granularity}",
    }
    # Call API
    queues_data = requests.post(url, headers=headers, json=payload)
    data = queues_data.json()["results"][0]["data"]
    # Write json to s3
    session = boto3.Session(
        aws_access_key_id=aws_access_key,
        aws_secret_access_key=aws_secret_access_key
    )
    s3 = session.resource('s3')
    s3.Object(bucket_name, key_name).put(
        Body=json.dumps(data),
        ContentType='application/json'
    )
    return data

While this function performs the job we want without error, there are several issues with it.
- Code Complexity: When a function takes on multiple responsibilities, it becomes harder to understand. This can make the code difficult to maintain, debug, and extend. While it may be obvious to you, as the author, what this code should be doing, when you include multiple responsibilities within a single function, you run an increased risk of that intent being misinterpreted by future users and maintainers of your code.
- Dependency issues: Because this function has multiple concerns, it also has multiple dependencies, as each concern requires code from a different library. This can lead to tangled dependencies and create tight coupling between different parts of the codebase. Changes in one responsibility might inadvertently affect other unrelated parts, making it harder to isolate and test individual components.
- Lack of reusability: Large functions like the above are hard to reuse. For example, if we wanted to save the file to a local drive instead of S3, we could not use this function.
- Testing and Debugging: Breaking the SRP in this manner makes testing and debugging more challenging. With multiple responsibilities within a single component, it becomes harder to isolate and test specific behaviours. Debugging issues or finding the root cause of problems can also be more time-consuming, as the complexity of intertwined responsibilities increases.
Let's take a look at how we can re-write this to adhere to the SRP:
import json
from dataclasses import dataclass

import boto3
import requests


@dataclass
class RequestParams:
    url: str
    headers: dict  # request headers, as built by header()
    payload: dict  # JSON body, as built by payload_dict()


def header(access_token, target_app):
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"bearer {access_token}",
        "target_app": f"auth_for{target_app}",
    }


def payload_dict(interval_type, interval_granularity):
    return {
        "interval_type": f"{interval_type}",
        "granularity": f"PT_{interval_granularity}",
    }


def request_params(starting_url, target_app, headers, payload):
    return RequestParams(
        url=starting_url + target_app,
        headers=headers,
        payload=payload,
    )


def api_req(request_params):
    return requests.post(
        request_params.url, headers=request_params.headers, json=request_params.payload
    )


def get_data(response):
    # response is a requests.Response, so decode the JSON body before indexing
    return response.json()["results"][0]["data"]


def get_s3_session(aws_access_key, aws_secret_access_key):
    session = boto3.Session(
        aws_access_key_id=aws_access_key,
        aws_secret_access_key=aws_secret_access_key,
    )
    return session.resource("s3")


def save_to_s3(data, bucket_name, key_name, s3_session):
    s3_session.Object(bucket_name, key_name).put(
        Body=json.dumps(data), ContentType="application/json"
    )

Writing our code in this way achieves several things. By adhering to the SRP, each function has a clear and focused responsibility, leading to simpler and more manageable code that is easier to read and understand. Writing functions in a modular manner makes them reusable. If we have another part of the codebase that is also looking to save a JSON file to S3, we can reuse our save_to_s3() function. Finally, these functions can now be tested independently, allowing for greater specificity and making it easier to write tests as we don't have to deal with handling side effects. Taken together, this all promotes code reuse, code readability and ultimately enables flexible and scalable development.
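To make the testing benefit concrete, here is a minimal sketch of how three of the functions above could be checked without a network connection or AWS credentials. header(), payload_dict() and save_to_s3() are copied from the example; FakeS3 and FakeS3Object are hypothetical stand-ins written for this illustration and are not part of boto3.

```python
import json


def header(access_token, target_app):
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"bearer {access_token}",
        "target_app": f"auth_for{target_app}",
    }


def payload_dict(interval_type, interval_granularity):
    return {
        "interval_type": f"{interval_type}",
        "granularity": f"PT_{interval_granularity}",
    }


def save_to_s3(data, bucket_name, key_name, s3_session):
    s3_session.Object(bucket_name, key_name).put(
        Body=json.dumps(data), ContentType="application/json"
    )


class FakeS3Object:
    """Hypothetical stand-in that records what would have been uploaded."""

    def __init__(self, uploads, bucket_name, key_name):
        self.uploads = uploads
        self.location = (bucket_name, key_name)

    def put(self, Body, ContentType):
        self.uploads[self.location] = (Body, ContentType)


class FakeS3:
    """Hypothetical stand-in exposing only the Object() call save_to_s3 uses."""

    def __init__(self):
        self.uploads = {}

    def Object(self, bucket_name, key_name):
        return FakeS3Object(self.uploads, bucket_name, key_name)


def test_header_contains_bearer_token():
    assert header("abc123", "app_one")["Authorization"] == "bearer abc123"


def test_payload_granularity_is_prefixed():
    assert payload_dict("linear", "hour")["granularity"] == "PT_hour"


def test_save_to_s3_uploads_json():
    fake_s3 = FakeS3()
    save_to_s3({"calls": 3}, "my-bucket", "queues.json", fake_s3)
    body, content_type = fake_s3.uploads[("my-bucket", "queues.json")]
    assert json.loads(body) == {"calls": 3}
    assert content_type == "application/json"


if __name__ == "__main__":
    test_header_contains_bearer_token()
    test_payload_granularity_is_prefixed()
    test_save_to_s3_uploads_json()
```

Each test touches exactly one function, so a failure points straight at the responsibility that broke. Writing the equivalent test for the original ten-parameter get_data() would mean faking both the HTTP call and S3 at once.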
2. Keep number of parameters minimal
This one is heavily linked to the previous tip, and in fact, can be used as a means to spot a function that is breaking the SRP and requires separation of concerns. While there is no hard rule as to how many parameters is too many, a general rule of thumb is that any more than three or four should raise an eyebrow. That's not to say that five parameters is instantly a problem, more that it should give you cause for suspicion and prompt an evaluation of your function’s abstractions. In the preceding example, our problematic get_data() had 10 parameters! This should be an easy giveaway that a function is doing too many things.
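When a function genuinely needs several values that belong together, one option is to group them into a small object so the signature stays short. The sketch below is only an illustration: S3Location and describe_upload are hypothetical names that do not appear in the original example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class S3Location:
    """Hypothetical grouping of the two values that identify an S3 object."""

    bucket_name: str
    key_name: str


def describe_upload(record_count: int, location: S3Location) -> str:
    """Takes two parameters instead of three; the location travels as one unit."""
    return (
        f"Uploading {record_count} records to "
        f"s3://{location.bucket_name}/{location.key_name}"
    )


location = S3Location(bucket_name="my-bucket", key_name="queues/2023-06.json")
message = describe_upload(42, location)
print(message)
```

Grouping only helps when the values really do form one concept. Bundling unrelated parameters into a single object hides a long signature without fixing the underlying problem.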
3. Name functions and variables clearly
Ideally, a reader would not even need to read your implementation, but just your function signature, in order to get a good idea of what the function is going to do. A good tip for naming a Python function is to choose a descriptive and meaningful name that accurately conveys the purpose or action performed by the function. Don’t be afraid to be a little verbose here: clear communication of intent is more important than brevity when it comes to naming functions. Here are a few guidelines to consider when naming your Python functions:
1. Be clear and explicit: Use a name that clearly describes what the function does or the problem it solves. Avoid vague or ambiguous names that may lead to confusion or misunderstanding.
2. Use the verb-noun convention: Follow the convention of using verb-noun pairs or verb phrases to describe the action performed by the function. This helps to make the function’s purpose explicit and readable. For example, `calculate_average`, `validate_input`, or `send_email`.
3. Be consistent: Maintain consistency in your function naming style throughout your codebase. If you have existing naming conventions, follow them to ensure uniformity and ease of understanding.
4. Avoid abbreviations and acronyms: Unless the abbreviation or acronym is widely known and commonly used, it is best to avoid them in function names. Clarity and readability should take precedence over brevity.
Let's take our modular functions from the previous example and give them better names:
def create_request_header(access_token, target_app):
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"bearer {access_token}",
        "target_app": f"auth_for{target_app}",
    }


def create_request_payload(interval_type, interval_granularity):
    return {
        "interval_type": f"{interval_type}",
        "granularity": f"PT_{interval_granularity}",
    }


def create_request_params(starting_url, target_app, headers, payload):
    return RequestParams(
        url=starting_url + target_app,
        headers=headers,
        payload=payload,
    )


def call_api(request_params):
    return requests.post(
        request_params.url, headers=request_params.headers, json=request_params.payload
    )


def extract_data_from_response(response):
    # response is a requests.Response, so decode the JSON body before indexing
    return response.json()["results"][0]["data"]


def get_s3_session_object(aws_access_key, aws_secret_access_key):
    session = boto3.Session(
        aws_access_key_id=aws_access_key,
        aws_secret_access_key=aws_secret_access_key,
    )
    return session.resource("s3")


def save_json_to_s3(data, bucket_name, key_name, s3_session):
    s3_session.Object(bucket_name, key_name).put(
        Body=json.dumps(data), ContentType="application/json"
    )

Remember, the goal of a well-named function is to make the code more readable, self-explanatory, and maintainable. Choosing a descriptive and meaningful name goes a long way in enhancing the understandability and clarity of your code. Admittedly this can be hard sometimes, but if it is hard, this may be a good indication that your function does not have a clearly defined single responsibility, and is therefore violating the SRP.
4. Use Docstrings
Well-written docstrings not only help other developers understand and use your code but also serve as a valuable resource for yourself in the future. Taking the time to write clear and informative docstrings can greatly enhance the usability and maintainability of your Python code. A good docstring should describe the purpose of the function or class. Remember, a docstring is your opportunity to convey the intention of your code. Start the docstring with a brief one-line summary that describes the purpose of the function, method, or module. This summary should be concise and give a clear overview of what the code does. Depending on the function and its context, you may want to elaborate by providing a more detailed description that provides additional information about the functionality, any relevant parameters, return values, or exceptions raised. You may deem it appropriate to use headings for this such as “Parameters,” “Return Value,” “Raises,” etc.
Let's add some docstrings to our previous functions:
def create_request_header(access_token, target_app):
    """Returns a dictionary containing the items required in request headers
    for the MyRandomAppName API.

    Parameters
    ----------
    access_token: OAuth token required for API access
    target_app: name of specific endpoint to be queried
    """
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"bearer {access_token}",
        "target_app": f"auth_for{target_app}",
    }


def create_request_payload(interval_type, interval_granularity):
    """Builds a dictionary containing the items required to specify the
    payload for the MyRandomAppName API POST request.

    Parameters
    ----------
    interval_type: type of interval required in results, e.g. linear
    interval_granularity: granularity of interval, e.g. minutes, hours
    """
    return {
        "interval_type": f"{interval_type}",
        "granularity": f"PT_{interval_granularity}",
    }


def create_request_params(starting_url, target_app, headers, payload):
    """Returns a RequestParams object that contains all the parameters required
    for the call_api() function.

    Parameters
    ----------
    starting_url: Base URL that holds all available options for the target_app
        e.g. 'https://api.MyRandomAppName.com.au/api/v2/analytics/'
    target_app: specific app from the MyRandomAppName to be queried.
    headers: Dictionary containing the POST request headers
    payload: Dictionary containing the POST request payload
    """
    return RequestParams(
        url=starting_url + target_app,
        headers=headers,
        payload=payload,
    )


def call_api(request_params):
    """Calls the MyRandomAppName API using a POST request and returns the response"""
    return requests.post(
        request_params.url, headers=request_params.headers, json=request_params.payload
    )


def extract_data_from_response(response):
    """Decodes the JSON body of the response returned from the MyRandomAppName
    API POST request and returns the required data"""
    return response.json()["results"][0]["data"]


def get_s3_session_object(aws_access_key, aws_secret_access_key):
    """Returns an S3 session connection object"""
    session = boto3.Session(
        aws_access_key_id=aws_access_key,
        aws_secret_access_key=aws_secret_access_key,
    )
    return session.resource("s3")


def save_json_to_s3(data, bucket_name, key_name, s3_session):
    """Uploads a JSON file to S3 and saves it in the specified bucket"""
    s3_session.Object(bucket_name, key_name).put(
        Body=json.dumps(data), ContentType="application/json"
    )

Remember to use clear and readable language. Use proper grammar, punctuation, and formatting to enhance readability. Avoid acronyms as much as possible (industry-standard acronyms such as ‘S3’ are probably ok).
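Docstrings are useful to tooling as well as to readers. As a small illustration, the standard library's inspect.getdoc() returns a function's docstring with its indentation cleaned up, which is the same text that help() displays. The extract_data() function below is a hypothetical, shortened variant that takes an already-decoded dictionary, an assumption made purely to keep the sketch self-contained.

```python
import inspect


def extract_data(response_json):
    """Return the data section of a decoded API response.

    Parameters
    ----------
    response_json: dictionary decoded from the response body
    """
    return response_json["results"][0]["data"]


# getdoc() strips the leading indentation that the source layout adds
docstring = inspect.getdoc(extract_data)

# The one-line summary is what most tools show first
summary_line = docstring.splitlines()[0]
print(summary_line)
```

This is one reason the one-line summary matters: it is often the only part of the docstring a reader sees in an editor tooltip or an API listing.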
5. Use Type annotations
Python types convey information. They provide a representation of behaviours that both humans and computers can reason about. Type annotations are extra syntax that tells the user which types your function expects as arguments. They can also state the type of the value your function returns. These annotations act as hints: they provide helpful information to the reader, but Python does not enforce them at runtime. When you use type annotations, future maintainers and users of your code have a clearer idea of how your function was intended to be used, and you reduce the cognitive burden on a reader trying to understand your code. This makes updating your code much easier, resulting in a stable and maintainable codebase. When combined with type checkers such as mypy, type annotations can detect errors in your code at a much earlier stage in development. A deeper dive into Python Type Annotations can be found here.
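A quick way to see that annotations are hints rather than runtime checks is to call an annotated function with the wrong type. In the sketch below, double_interval is a hypothetical helper invented for this illustration.

```python
from typing import get_type_hints


def double_interval(interval: int) -> int:
    """Hypothetical helper: doubles a numeric interval."""
    return interval * 2


# The annotations are stored on the function for tools and readers to inspect
hints = get_type_hints(double_interval)

# Python does not check them at call time: passing a str runs without error
# and returns a repeated string rather than a doubled number
unexpected = double_interval("15")

print(hints)
print(repr(unexpected))
```

Nothing fails at runtime, which is exactly why a static type checker is worth running: a tool such as mypy would be expected to flag the double_interval("15") call before the code ever executes.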
Let's add type annotations to our functions:
import json
from dataclasses import dataclass
from typing import Dict, Literal

import boto3
import requests
from boto3.resources.base import ServiceResource


@dataclass
class RequestParams:
    url: str
    headers: Dict[str, str]
    payload: Dict[str, str]


def create_request_header(
    access_token: str,
    target_app: Literal["app_one", "app_two"],
) -> Dict[str, str]:
    """Returns a dictionary containing the items required in request headers
    for the MyRandomAppName API.

    Parameters
    ----------
    access_token: OAuth token required for API access
    target_app: name of specific endpoint to be queried
        choose either "app_one" or "app_two"
    """
    return {
        "Content-Type": "application/json",
        "Accept": "application/json",
        "Authorization": f"bearer {access_token}",
        "target_app": f"auth_for{target_app}",
    }


def create_request_payload(
    interval_type: Literal["linear", "non-linear"],
    interval_granularity: Literal["hour", "day", "week"],
) -> Dict[str, str]:
    """Builds a dictionary containing the items required to specify the
    payload for the MyRandomAppName API POST request.

    Parameters
    ----------
    interval_type: type of interval required in results, e.g. linear
    interval_granularity: granularity of interval, e.g. hour, day, week
    """
    return {
        "interval_type": f"{interval_type}",
        "granularity": f"PT_{interval_granularity}",
    }


def create_request_params(
    starting_url: str,
    target_app: Literal["app_one", "app_two"],
    headers: Dict[str, str],
    payload: Dict[str, str],
) -> RequestParams:
    """Returns a RequestParams object that contains all the parameters required
    for the call_api() function.

    Parameters
    ----------
    starting_url: Base URL that holds all available options for the target_app
        e.g. 'https://api.MyRandomAppName.com.au/api/v2/analytics/'
    target_app: specific app from the MyRandomAppName to be queried.
    headers: Dictionary containing the POST request headers
    payload: Dictionary containing the POST request payload
    """
    return RequestParams(
        url=starting_url + target_app,
        headers=headers,
        payload=payload,
    )


def call_api(request_params: RequestParams) -> requests.Response:
    """Calls the API using a POST request and returns the response"""
    return requests.post(
        request_params.url, headers=request_params.headers, json=request_params.payload
    )


def extract_data_from_response(response: requests.Response) -> dict:
    """Decodes the JSON body of the response returned from the MyRandomAppName
    API POST request and returns the required data"""
    return response.json()["results"][0]["data"]


def get_s3_session_object(
    aws_access_key: str, aws_secret_access_key: str
) -> ServiceResource:
    """Returns an S3 session connection object"""
    session = boto3.Session(
        aws_access_key_id=aws_access_key,
        aws_secret_access_key=aws_secret_access_key,
    )
    # session.resource() returns a service resource, not a Session
    return session.resource("s3")


def save_json_to_s3(
    data: dict,
    bucket_name: str,
    key_name: str,
    s3_session: ServiceResource,
) -> None:
    """Uploads a JSON file to S3 and saves it in the specified bucket"""
    s3_session.Object(bucket_name, key_name).put(
        Body=json.dumps(data), ContentType="application/json"
    )

6. Only request information you actually need
This next tip is as much about performance as it is about writing clean and maintainable code. When writing a function, think carefully about what it actually needs. When working with DataFrames, for instance, consider whether you need the entire DataFrame or just one or two Series to produce the desired output. The following function operates only on a single column named ‘week_num_col’:
import pandas as pd


def turn_week_num_str_into_list(df: pd.DataFrame) -> pd.DataFrame:
    """Turns strings of week numbers in the week_num_col column
    into a list of strings.
    i.e - '1,2,3,4' -> ['1','2','3','4']
    """
    df['week_num_col'] = df['week_num_col'].astype(str).str.split(",")
    return df

So rather than take in the entire DataFrame, this function can be rewritten to take just a single Series:

def turn_week_num_str_into_list(target_column: pd.Series) -> pd.Series:
    """Turns strings of week numbers into a list of strings.
    i.e - '1,2,3,4' -> ['1','2','3','4']"""
    return target_column.astype(str).str.split(",")

which can be called in the following manner:

df['week_num_col'] = turn_week_num_str_into_list(df['week_num_col'])

This version can now be reused easily, as any column can be passed as ‘target_column’. It also no longer modifies the caller's DataFrame in place, and the narrower signature makes it obvious which data the function depends on. Note that passing a whole DataFrame to a function does not copy it, so this change by itself saves little memory; the real memory savings come from applying the same principle when loading data, by reading only the columns you need.
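The same idea applies one step earlier, when the data is loaded. As a minimal sketch, assuming the data arrives as CSV, the usecols parameter of pandas' read_csv keeps only the listed columns. The file contents and the store_id and notes columns are invented for this illustration.

```python
import io

import pandas as pd

# Stand-in for a file on disk; the contents are invented for this sketch
csv_file = io.StringIO(
    "store_id,week_num_col,notes\n"
    '1,"1,2,3",first\n'
    '2,"4,5",second\n'
)

# usecols tells read_csv to keep only the listed columns
df = pd.read_csv(csv_file, usecols=["week_num_col"])

# Same transformation as above, applied to the single column we loaded
weeks = df["week_num_col"].astype(str).str.split(",")

print(list(df.columns))
print(weeks.tolist())
```

On a wide table this can make a noticeable difference, since the unused columns are never materialised in the DataFrame at all.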
Summary
And that's it. Six easy tips to help ensure your Python code remains clean and maintainable. Following these principles will mean both you and future maintainers will have a much easier time enhancing or fixing your code in the future, creating a more efficient Data Engineering team that can rapidly respond to the ever-changing requirements of a business.
