Marcin Kozak


Building Generator Pipelines in Python

This article proposes an elegant way to build generator pipelines

Generator pipelines: a straight road to the solution. Photo by Matthew Brodeur on Unsplash

In software, a pipeline means performing multiple operations (e.g., calling function after function) in a sequence, for each element of an iterable, in such a way that the output of each element is the input of the next.

In Python, you can build pipelines in various ways, some simpler than others. In this article, we will discuss an elegant way of doing this: generator pipelines.

Let me present a simplistic example. It will be easier for us to work with a pipeline consisting of simple operations, so that we do not have to focus on what each operation is doing. Let’s say, for example, that for each number from a range of integers, you want to apply the following calculations:

  • take the square root
  • double the number and add 12
  • square the number
  • add the square root of pi
  • add 75
  • return the result after rounding to two decimal digits

Of course, no one does things like that, but imagine that each step performs an operation on a particular element of the iterable. For instance, the iterable can contain a number of files, and you want to read them, preprocess them, apply an NLP model to each text, analyze the results, and return them. Or, you have a number of files with some quantitative data, and you want to read each file, check and preprocess the data, use an ML model, then return the results in some organized way. These two examples can be organized in a pipeline just like the above example, because each has an iterable with items to which a sequence of operations is to be applied one after another, with the output of one being the input of the next.

A common approach

Probably the most natural approach is to take each element of the iterable and apply these calculations step by step. So, when you have two functions to apply to x, say, f1() and then f2(), you can do it by f2(f1(x)). As simple as that — but when there are more functions to apply and they take more arguments, the resulting code representing f2(f1(x)) becomes a difficult-to-understand monster. You will see it below.

First, let’s define what we will need to build a pipeline that performs operations on numbers.

Note: The paragraph below is a digression that describes type hints in the code. You can skip it if you’re not interested.

As you can see in the code above, I decided to use type aliases. In many situations, I consider such an approach much more readable than using raw type hints (Kozak 2022). Thus, I’ve created type alias XType to represent the type of x. Another type alias, PipeItems, represents an iterable with items to be then processed in the pipeline we will create; it thus takes objects of XType. In our example, these are numbers, but generally, this can be any object. Functions power() and add() also take values of XType as an input, but this does not have to be the case; it depends on the operations that create the pipeline. A function representing a particular operation should take an object of the type that the previous operation returns.

Our first approach, in which we simply apply all the functions one after another (hence the function’s name, get_pipeline_all_calc), could look something like this:
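A sketch of such a function, with the pipeline’s six steps nested into a single call (the helper functions here are assumed one-liners):

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

def get_pipeline_all_calc(items):
    # all six operations crammed into one nested call, innermost first:
    # sqrt -> double -> +12 -> square -> +sqrt(pi) -> +75 -> round
    return (rounding(add(add(power(
        add(double(power(x, .5)), 12), 2), math.pi ** .5), 75))
        for x in items)
```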

What?!

How do you even read such code, let alone understand it? I tried a dozen or so versions, just to find out which of them was best, and… none was. So, since I needed to keep the lines short, I chose this one, although it is definitely not readable. But again, none of them was.

Below, you will find what black has done with this code, after using flag -l 79, which keeps the maximum line length at 79:
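It could come out something like this (a sketch; black’s exact output may differ, and the helpers are again assumed one-liners):

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

def get_pipeline_all_calc(items):
    return (
        rounding(
            add(
                add(
                    power(add(double(power(x, .5)), 12), 2),
                    math.pi ** .5,
                ),
                75,
            )
        )
        for x in items
    )
```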

I don’t think it helped much… Decide for yourself whether this code is comprehensible to you.

Such code is not only difficult to read and understand, but also to write and rewrite. Try it yourself. For instance, return to the description of the pipeline and implement it yourself. Or add a new step to the function above; say, before adding the square root of pi, raise the number to the power of 3. Do it just to see how difficult it is to use this approach.

The above version works correctly, however crazy and incomprehensible it looks. The Zen of Python says that “Simple is better than complex,” so there should be a better way.

And there is: generator pipelines.

Generator pipelines

The get_pipeline_all_calc() function returns a generator, so we can call it a generator pipeline. Since it returns a generator object, we will have to evaluate its items. This is one great thing about generator pipelines: After being created, a pipeline (formally, it is a generator object) can be evaluated lazily, that is, on demand. You can do it all at once, you can do it in steps, or you can evaluate the subsequent items when you want or need to.

Nevertheless, many would disagree that this is a generator pipeline; rather, it’s calculations pressed into a generator. True generator pipelines use a generator at each step of the pipeline.

Thus, to build a generator pipeline, you make each step a generator based on the generator from the previous step. The first step uses the iterable with the original items.

This is actually simpler than it sounds. The below function shows one way to achieve this:
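A sketch of such a function, with one generator expression per step (the helper functions are assumed one-liners):

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

def get_pipeline_original(items):
    # each step is a generator built on top of the previous one
    step_1 = (power(x, .5) for x in items)            # take the square root
    step_2 = (double(x) for x in step_1)              # double the number
    step_3 = (add(x, 12) for x in step_2)             # ... and add 12
    step_4 = (power(x, 2) for x in step_3)            # square the number
    step_5 = (add(x, math.pi ** .5) for x in step_4)  # add the square root of pi
    step_6 = (add(x, 75) for x in step_5)             # add 75
    return (rounding(x) for x in step_6)              # round to two decimal digits
```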

You could certainly use different naming, and this could improve readability a little. For instance, instead of the names step_i, where i represents the step’s ID, you could use meaningful names: step_1 could become root_squared, step_2 doubled, and step_3 added_12. While such naming is not really that helpful here, in many situations it will be. You could also use map() in each step (we will return to map() a little later).

For me, the generator-pipeline version, get_pipeline_original(), is much more readable and easier to update (i.e., add a new operation). I see each step in the chain, and so it’s easy to add a new one, unlike in get_pipeline_all_calc().

The code, however, does not feel perfect. Can we improve it even more?

Can we make the generator pipeline even better?

The get_pipeline_original() function shows a typical way of creating a generator pipeline. I don’t think it’s used often, despite its readability. While readable, it is not perfectly readable. For example, I do not like the visual clutter, which results mainly from the for loops that create the generator expressions in each step. Perhaps most importantly, to understand such a pipeline, you need to think about two things at once: the operations and the generator expressions. Between them, the former constitute the essence, while the latter are methodological specifics.

As I wrote above, we could use the map() function instead. But this would not help much, unfortunately: We would have to understand all the maps.

Indeed, it is the pipeline’s operations that constitute its essence. We should focus on them, because if we want to understand a pipeline, we need to understand the following aspects of each of its steps:

  • input into the operation
  • what the operation does
  • output from the operation

Return to the above code. We see these aspects, but when trying to understand the code, much of our focus goes to understanding the generator expressions.

To make this code more readable, we could try to decrease the amount of code that does not describe its essence. Ask yourself the following question: Does each operation need a dedicated generator expression?

The answer is, of course not. And since it does not, we may try to simplify the code, for instance, by removing all the generator expressions but one. Consider this code:
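A sketch of this idea, again with assumed one-line helpers:

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

def calculate(x):
    # the whole pipeline of operations for a single element
    x = power(x, .5)
    x = double(x)
    x = add(x, 12)
    x = power(x, 2)
    x = add(x, math.pi ** .5)
    x = add(x, 75)
    return rounding(x)

def get_pipeline_proposed(items):
    # a single generator applies the whole pipeline to each element
    return (calculate(x) for x in items)
```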

Here, calculate() contains a pipeline of operations for a single element, while get_pipeline_proposed() creates a generator of the pipelines — so, a generator pipeline — for each element of the items iterable. Before, in get_pipeline_all_calc(), all the operations were called in one line, but here they are not; they are presented as a pipeline, and each line represents one operation. We can easily see that each operation, except for the first one, takes the output of the previous operation as input.

Let’s return to this sentence: True generator pipelines use a generator at each step of the pipeline. We cannot say this about this approach, as the calculate() function consists of a pipeline, and then this pipeline is looped over in the generator. Theoretical considerations on naming aside, I think we can still call this approach, like the one with get_pipeline_all_calc(), a generator pipeline: it’s a pipeline evaluated using a generator.

In my opinion, the get_pipeline_proposed() and calculate() functions are more readable than any other version we have considered above, including get_pipeline_original(). The former needs fewer characters and shorter lines than the latter, and as a result it does not suffer from the visual clutter.

We could use different naming in calculate(), though. What do you think about the below version?
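A sketch of calculate() rewritten with descriptive intermediate names (the names themselves are assumptions, as are the one-line helpers):

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

def calculate(x):
    # each intermediate result gets its own descriptive name
    rooted = power(x, .5)
    doubled = double(rooted)
    moved = add(doubled, 12)
    squared = power(moved, 2)
    pied = add(squared, math.pi ** .5)
    increased = add(pied, 75)
    return rounding(increased)
```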

I think it’s a matter of preference. In this example, I’d go for the version with x, as it suggests that what we get from each step is still the same variable, but after being processed. But when each step does a different operation, e.g., reads a file, processes text, runs an NLP model, then such naming could increase readability.

In such situations, chaining of functions using a pipe operator can result in even more readable code. I will discuss this in another article, as it deserves its own focus.

Generator pipelines with map()

Can’t we use map() to create a generator pipeline? Isn’t that what map() was created for?

Indeed, we can. To this end, we will use the calculate() function defined above, and the pipeline-creating function becomes as follows:
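A sketch of this version (calculate() and the one-line helpers are repeated here so the snippet is self-contained):

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

def calculate(x):
    x = power(x, .5)
    x = double(x)
    x = add(x, 12)
    x = power(x, 2)
    x = add(x, math.pi ** .5)
    x = add(x, 75)
    return rounding(x)

def get_pipeline_proposed_map(items):
    # map() lazily applies calculate() to each element of items
    return map(calculate, items)
```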

That’s it! It’s another version of a generator pipeline, built with a map() function.

Benchmarking

Code readability is one thing; performance is another. A traditional generator pipeline creates as many generators as there are steps in the pipeline. While creating a generator is cheap in terms of performance, it still means creating and using several generators instead of just one. The proposed version (whether using the generator expression or map()) does not create that many generators, but it still creates the same number of intermediate objects, so is its evaluation as expensive as the evaluation of the generators?

For benchmarking, I will use the timeit module. If you want to learn more about this module, you can read my article from Towards Data Science, in which I explain some interesting intricacies of this package.

We’ll run the benchmarks in order to compare the performance of the following four approaches:

  • get_pipeline_all_calc(): the first version, with all the calculations being done one after another;
  • get_pipeline_original(): the traditional generator pipeline, in which each calculation is done using a dedicated generator;
  • get_pipeline_proposed(): the proposed generator pipeline, which is a modification of the first version; and
  • get_pipeline_proposed_map(): the proposed generator pipeline, created using map().

The benchmarking code is too long and repetitive to present here, so you can find it in this GitHub gist. You will find the results here.

Here are the conclusions from the benchmarks:

  • The quickest among the four was get_pipeline_all_calc(), that is, creating one generator that calls all functions at once. This is not surprising, as this version has the lowest overhead of creating generators, calling functions, and creating objects.
  • The slowest was the classical generator pipeline, get_pipeline_original(), which uses as many generators as there are operations. This does not come as a surprise either, as this version suffers from the overhead of creating as many generators as there are operations in the pipeline.
  • The proposed solution using a generator expression, get_pipeline_proposed(), is around the middle.
  • The proposed solution using map(), get_pipeline_proposed_map(), is slower than get_pipeline_all_calc() but quicker than get_pipeline_proposed().
  • The above results held for all lengths of the iterable, that is, [100, 1_000, 10_000, 100_000, 1_000_000]. This was to be expected, as in this example the execution times for the elements of the iterable were practically the same.

Nevertheless, note that we analyzed a pipeline consisting of very quick operations. If one or more of them took a long time to perform, the benchmarks would show almost no difference between the methods, because the overhead of creating generators or creating several objects in calculate() would be negligible.

If you want to check this, change the above-defined double() function to the following:
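For instance (the one-second pause is an assumed stand-in for a slow operation):

```python
import time

def double(x):
    # simulate a long-running operation before doing the actual work
    time.sleep(1)
    return 2 * x
```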

and use number=1. You will see no meaningful difference between the four approaches. This is because this time, the overhead of creating generators and additional objects in calculate() is negligible compared to the execution time of the operations.

So, remember that it only makes sense to consider performance when the pipeline itself is very quick. But the main aim of this experiment was to show that the proposed structure of generator pipelines is not significantly slower than the other approaches. This proved to be true, so you do not have to worry about performance when using this approach. What’s more, if performance does matter, the proposed solution will be even quicker than traditional generator pipelines, if only a little bit.

Conclusion

A traditional generator pipeline consists of as many generators as operations in the pipeline. Generator-pipeline code is easier to understand than a generator that calls all the functions at once. This is why, between the two alternatives, the generator pipeline is preferred.

In the case of quick operations, however, such an approach is slower than calling all the operations at once, one after another, in one generator. Interestingly, the map() version is then the quicker of the two proposed variants. For longer operations, however, the difference becomes negligible, and you will get practically the same results for all the versions.

In this article, I have shown that generator pipelines that use one generator are indeed more readable than the other versions. In particular, this call:
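Sketched with assumed one-line helpers, the call could look like this:

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

results = (
    rounding(add(add(power(add(double(power(x, .5)), 12), 2),
                     math.pi ** .5), 75))
    for x in range(5)
)
```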

is less readable than the following generator pipeline:
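Sketched with the same kind of assumed one-line helpers:

```python
import math

# assumed helper functions
def power(x, n): return x ** n
def add(x, y): return x + y
def double(x): return 2 * x
def rounding(x): return round(x, 2)

items = range(5)
step_1 = (power(x, .5) for x in items)            # take the square root
step_2 = (double(x) for x in step_1)              # double the number
step_3 = (add(x, 12) for x in step_2)             # ... and add 12
step_4 = (power(x, 2) for x in step_3)            # square the number
step_5 = (add(x, math.pi ** .5) for x in step_4)  # add the square root of pi
step_6 = (add(x, 75) for x in step_5)             # add 75
results = (rounding(x) for x in step_6)           # round to two decimal digits
```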

Both approaches lead to the very same generator, results. They are different though, and in this case, brevity does not come with readability.

The generator pipeline is indeed more readable, but this does not mean its readability cannot be increased even more. Each operation (step) is itself a generator, which helps us understand what’s happening in each step.

What I do not like in this code is its repetitiveness and unnecessary visual clutter. The main reason behind this is the for loops in the generator expressions. A first glance may suggest they are important, though the truth is that they are not.

One more thing: Those who have created several generator pipelines know that creating them can sometimes be tricky; not necessarily difficult to code, but tricky, from time to time. The first approach — calling function after function in one long chain — can be a much bigger challenge, especially when you need to add a new operation somewhere in the middle.

This is why I wrote this article. I propose an in-between approach, one that joins the brevity of the former approach with the readability of the latter. In doing so, this approach avoids unnecessary fragments of code: those repeated in every step even though they do nothing essential. In such a pipeline, we have several essential items:

  • the generator: the truth is we have only one iterable, so we should have only one generator; in the proposed structure of a generator pipeline, step1 creates this generator;
  • the operations: in the example above, they are represented by functions, each function performing one operation; in our case, they are function1, function2, function3 and function4;
  • the resulting generator object: this is the object you evaluate when you need to get the results, but note that it’s started in the first step, and the last step still uses it.

Using the structure of a generator pipeline proposed in this article, we will get a generator pipeline of a new type, which will look like the following:
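A sketch of this structure, using the function1…function4 names from the list above (their bodies here are just placeholders, and get_pipeline is an assumed name):

```python
# placeholder operations; a real pipeline would do meaningful work here
def function1(x): return x + 1
def function2(x): return x * 2
def function3(x): return x - 3
def function4(x): return x ** 2

def calculate(x):
    # the pipeline of operations for a single element
    x = function1(x)
    x = function2(x)
    x = function3(x)
    x = function4(x)
    return x

def get_pipeline(items):
    # the one and only generator of the pipeline
    return (calculate(x) for x in items)
```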

I mentioned above that writing get_pipeline_all_calc() and get_pipeline_original() could be tricky. This structure is not tricky. It’s straightforward. It is clear. And while still being longer than the first approach, it does not add unnecessary code the way the original pipelines do. I definitely prefer this version to either of the other two, though I do understand and accept that this is also a matter of preference.

In terms of performance, the new generator pipelines lie in between, with the map() version being more performant than the generator-expression version. But we have to remember that more often than not this does not matter: if a pipeline takes some computation time, the overhead of creating several more generators (as is done in the original generator pipeline) is negligible.

Certainly, this article does not offer anything novel. Many of us have used similar pipelines, just not calling them “generator pipelines”; or not calling them whatsoever, for that matter.

Since traditional generator pipelines are often proposed as a nice way to create memory-efficient pipelines (e.g., Langdon 2012, Uzan 2020 and Kalkman 2021), this article proposes creating generator pipelines in a new way, by wrapping all the operations in one function and creating a generator using this function.

This approach is simpler, more efficient, and more readable. In short, it is better.
