Marcin Kozak

Summary

The article discusses the use of the built-in Python module timeit for benchmarking code execution time, comparing its snippet-based and callable-based APIs, and providing examples of its usage.

Abstract

The article begins by explaining the importance of benchmarking code for understanding Python behavior and optimizing code. It introduces the timeit module as a popular tool for benchmarking code execution time and discusses its two APIs: the snippet-based API and the callable-based API. The snippet-based API is more commonly used and allows for benchmarking code snippets, while the callable-based API is less known and allows for benchmarking functions. The article provides examples of using both APIs and discusses the differences between them. It also highlights the importance of using the timeit.repeat() function instead of timeit.timeit() for more stable results. The article concludes by recommending the use of timeit as a benchmarking tool and discussing the balance between code complexity and performance.

Bullet points

  • Benchmarking is important for understanding Python behavior and optimizing code.
  • The timeit module is a popular tool for benchmarking code execution time.
  • timeit has two APIs: the snippet-based API and the callable-based API.
  • The snippet-based API is more commonly used and allows for benchmarking code snippets.
  • The callable-based API is less known and allows for benchmarking functions.
  • The timeit.repeat() function should be used instead of timeit.timeit() for more stable results.
  • timeit is a benchmarking tool, not a profiling tool.
  • It is important to find the right balance between code complexity and performance.
  • Other benchmarking tools include cProfile, perftester, and ycecream.

Benchmarking Python code with timeit

The most popular tool for time benchmarking of Python code, the built-in timeit module offers more than most Pythonistas know

The timeit module is Python’s stopwatch. Photo by Tsvetoslav Hristov on Unsplash

Benchmarking is seldom done for the fun of it, even if it is a lot of fun indeed. In addition to this fun, it can help you in:

  • understanding Python behavior; you can learn what is quicker and what is slower, which in turn can help you understand the language;
  • optimizing your code.

If you think you spend too much time benchmarking some random code snippets, don’t worry. I’ve been there. To be honest, I still do it quite frequently. Don’t be ashamed of this: Benchmarking helps you understand the intricacies of the language. With time, you will notice you can guess how fast or slow a particular snippet should be. From time to time, however, even your “benchmarking nose” will mislead you, so benchmarking is often time well spent. Programming is a lot of fun after all, isn’t it?

Likely the most popular module for benchmarking Python code snippets in terms of execution time is the timeit module. Python also offers other time-related benchmarking tools, but timeit should definitely constitute your first step, for the following reasons:

  • it’s the most popular one among time-related benchmarking tools in Python;
  • it’s part of the standard library, so you don’t have to install it; and
  • other tools are often wrappers around timeit.

Thus, I think that if you want to use these other tools, you should first learn how to use timeit and interpret its results. This article aims to help you with that. I will show you a little-known feature of this module: benchmarking functions instead of code snippets. I will also show you situations in which timeit results can be misleading.

Using timeit

Most users do not know that the timeit module offers two APIs, which is why you will find mainly one of them in use. The two APIs are as follows:

  • Snippet-based API. Most people mentioning timeit think of this very API. It has two advantages: it’s relatively easy to use, and you can benchmark almost anything, as it benchmarks code snippets.
  • Callable-based API. Most people actually do not know this API whatsoever. It aims to benchmark callables. Its syntax is, unfortunately, less natural than that of the snippet-based API, but it can be useful in some situations. I will show such a situation later in this article.

Both APIs use the same functions, but differently. Two functions you should know are timeit.timeit() and timeit.repeat(). The truth is, I almost always use the latter, as it simply runs the former several times and provides the individual results of each timeit.timeit() run, so you get more stable results. I recommend you do the same. For this reason, I will discuss the timeit.repeat() function; timeit.timeit()’s API is very similar, with only one difference: the lack of the repeat argument.

The simplest use is as follows:
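
A minimal call could look like this (the snippet is the one analyzed just below):

    import timeit

    # Benchmark the snippet with the defaults:
    # number=1_000_000 calls per session, repeat=5 sessions.
    results = timeit.repeat("[_ for _ in range(10)]")
    print(results)  # five timings, one per session, in seconds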

And that’s it! It will measure the execution time of the code snippet provided as a string. In our case, it’s [_ for _ in range(10)], which means that we measure how much time Python needs to create a list using a list comprehension, a million times. One million is the default value of the number argument. Remember that each of the million calls of the command is conducted one after another, in the same session; as I will show later, this can be tricky in some situations.

Let’s analyze the function’s signature:
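
For reference, the signature is as follows (timer defaults to time.perf_counter):

    timeit.repeat(stmt='pass', setup='pass', timer=time.perf_counter,
                  repeat=5, number=1000000, globals=None)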

  • stmt is the code snippet you want to benchmark, provided as a string;
  • number is the number of times stmt will be called in one session, provided as an integer; defaults to 1_000_000;
  • setup is code you want to run before running stmt for number times, provided as a string; it’s run only once per session, at the beginning of the session;
  • timer is the timer used; the default one is perf_counter(), and since it’s currently considered the best built-in timer, in most situations it’s best not to touch it;
  • repeat is the number of sessions to be run, each session consisting of number calls of stmt; you will get the results of all these sessions; provided as an integer, it defaults to 5;
  • globals is a dictionary of globals to be provided; you can use it instead of setup or along with it.

For small and quick snippets, there is no need to change number and repeat, unless you want your benchmarks to provide very stable results. If you do, you should increase these numbers, depending on how stable you want the results to be and how much time you want the benchmark to run.

For longer-running snippets, however, you may wish to decrease number, repeat, or both, as otherwise the benchmark could take far too much time. You will notice, however, that with values of number and repeat that are too small, the results can become unstable.

Above, we used the snippet-based API. It’s time to discuss the little-known callable-based API:
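
For example, assuming make_list() wraps the same comprehension, the two calls being compared could look like this:

    import timeit

    def make_list():
        return [_ for _ in range(10)]

    # Snippet-based API: the code to benchmark is passed as a string.
    t_snippet = timeit.repeat("[_ for _ in range(10)]")

    # Callable-based API: the function object itself is passed.
    t_callable = timeit.repeat(make_list)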

You may think that the two above calls to timeit.repeat() should provide similar results because they benchmark the same thing, namely, creating a list using a list comprehension from a range object of length 10. But that’s not true: the first one indeed benchmarks creating a list that way, but the latter does not, or rather not only. This is because the latter also includes the overhead of calling the make_list() function, and sometimes this overhead can be quite significant. We can actually analyze this:
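
For instance, a sketch that uses the snippet-based API for both measurements:

    # Both calls use the snippet-based API, so any difference between
    # t1 and t2 comes from the overhead of the function call itself.
    t1 = timeit.repeat("make_list()", globals=globals())
    t2 = timeit.repeat("[_ for _ in range(10)]")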

The first of the above calls is equivalent to timeit.repeat(make_list), but it uses the same API as the second. So, if we see a difference between t1 and t2, it will be due to the overhead of calling a function a million times.

On my machine (32 GB, four physical and eight logical cores, run in WSL 1), I got the following results:
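
I compared them with a small best() helper, which simply takes the minimum of the repetitions; more on why in a moment:

    def best(results):
        # The minimum of the session timings is treated as the best estimate.
        return min(results)

    print(f"best(t1) = {best(t1):.4f}")
    print(f"best(t2) = {best(t2):.4f}")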

The difference is rather small, isn’t it? Can we trust this benchmark, given how little time it took? Let’s rerun the benchmarks, just in case, with number=10_000_000 and repeat=10. Again, best(t1) was bigger, with 4.1292 against best(t2) of 3.7315. You have to remember, however, that these are not that long benchmarks… They took 78 seconds altogether, and the next time I ran them, I got 3.9846 against 3.8373. So if you want to be sure of your benchmarks, use much bigger values of both number and repeat. When the difference in execution time of two (or more) snippets is large, however, you do not have to use big values.

As for the comparison of the two APIs, remember:

The two APIs — the snippet-based and the callable-based — can produce different results even when they benchmark code that does the same thing. This is because the results of the callable-based API also include the overhead of calling the callable.

Which one to choose? It depends on the situation. If you want to compare the execution time of two functions, this is exactly what the callable-based API was created for. Or, when your code does something (e.g., creates a list, as above) inside a function, the callable-based API will better reflect the actual situation. Nonetheless, if you simply want to compare two snippets, there’s no need to wrap them in functions. Remember, however, about scopes: when you encapsulate your code inside a function, everything runs in the local scope and namespace of that function. As I will show later, this can make a difference.

Example: Creating an empty dictionary

Now, imagine we want to benchmark two ways of creating an empty dictionary: {} and dict(). We can do it in the following way:
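
For example, along these lines:

    import timeit

    literal = timeit.repeat("{}")      # creating an empty dict with the literal
    func = timeit.repeat("dict()")     # creating it by calling dict()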

What’s your guess?

If you voted for the literal being quicker, you got it right. On my machine, best(literal) gave 0.0183 while best(func) gave 0.0600, so quite a difference. We can analyze the bytecode of both approaches to see where this difference in performance comes from:
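
One way is the dis module; the exact output differs between Python versions, so I leave running it to you:

    import dis

    dis.dis("{}")      # the literal compiles to a single BUILD_MAP instruction
    dis.dis("dict()")  # dict() needs a name lookup plus a call instruction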

As you see, dict() uses one more bytecode operation, the one that calls the dict() function; the literal is likely quicker because it does not have to do that. If you see some similarities with the previous example, you’re right: we see the overhead of calling the dict() function.

Interestingly, if you create your own function as follows:
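
(a hypothetical version; the name below is only illustrative)

    def empty_dict():
        # A plain Python function that returns a dict literal.
        return {}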

you will see it’s quicker than the dict() function itself. I leave checking this to you as an exercise. (This is the fun of benchmarking I told you about above!)

Why the minimum value?

You may wonder why I used the minimum value, not the mean, to compare the two benchmark results (see the best() function above). Why not the mean, as a central tendency measure, and the variance (or the standard deviation, or still a different measure) as a measure of variation?

This is because benchmarking is not a normal situation, so we should not apply a typical statistical approach to analyze its results. In the best conditions, all runs of the same code (that includes no randomness) should take the same time. However, the operating system is performing a lot of different tasks during benchmarking, so the subsequent benchmarks take different amounts of time. Remember that this is not because of the code itself, but because of the operating system’s processes.

Therefore, we should take the minimum value, as it is the closest to the real execution time, without any disruptions. In the same way, we should not pay attention to variation in the results across the runs. It does not measure how variable the actual execution time of this particular code snippet is; in fact, there should be almost no variation. Instead, such variation rather measures how variable the other processes run by the operating system were, and so how variable their effects on the benchmarking results were. This is why analyzing benchmarks is so different from analyzing other kinds of data, and that’s why the minimum value represents the results of benchmarks better than any measure of central tendency.

Another example

Let’s consider a more complex scenario. This time, we will use setup to set up the environment.
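
For instance, with illustrative foo_list() and foo_array() implementations that build a list and an array of n integers (the original bodies may differ):

    import timeit

    # The environment is prepared in setup: the import and the two functions.
    setup_code = """
    from array import array
    def foo_list(n): return list(range(n))
    def foo_array(n): return array('i', range(n))
    """

    number, repeat = 10_000, 5
    results_list = timeit.repeat("foo_list(1000)", setup=setup_code,
                                 number=number, repeat=repeat)
    results_array = timeit.repeat("foo_array(1000)", setup=setup_code,
                                  number=number, repeat=repeat)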

Again, before running the code, analyze it and try to guess which function should be quicker. (This time it’s not that simple, as you need to know how the array module works.)

Note that you can simplify the above code a little bit. Such simplification makes sense especially when you have more snippets to compare:
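
For example, we can keep the snippets in a dictionary and benchmark them in a loop:

    snippets = {
        "list": "foo_list(1000)",
        "array": "foo_array(1000)",
    }

    results = {
        name: timeit.repeat(stmt, setup=setup_code, number=number, repeat=repeat)
        for name, stmt in snippets.items()
    }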

Here, foo_array() turns out to be almost five times slower than foo_list(), quite a difference.

Earlier, I promised to show you an example in which the callable-based API makes more sense than the snippet-based one. This is such a situation. Notice that using functions looks more natural this time than using code snippets, because the snippets above simply call functions.

The functions take an argument, which makes it a little trickier to use the callable-based API. We need to use lambda, which is a minor drawback of this API:
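
For example, assuming foo_list() and foo_array() are defined in the current session:

    import timeit
    from array import array

    def foo_list(n):
        return list(range(n))

    def foo_array(n):
        return array("i", range(n))

    # A lambda wraps each call so that the argument can be passed in.
    t_list = timeit.repeat(lambda: foo_list(1000), number=10_000, repeat=5)
    t_array = timeit.repeat(lambda: foo_array(1000), number=10_000, repeat=5)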

We do not need to use setup, as we have already imported the array module. When we use the callable-based API, we run benchmarks in the current environment, unless we affect it by changing the globals argument of timeit.repeat(). So, if the environment is heavy with big objects, the benchmark can actually show worse performance than one run in an almost-empty environment.

You should know what the value returned by timeit.repeat() is. It is a list whose length is equal to repeat, with each value representing the total time, in seconds, of running the number calls of the code snippet/callable. You can calculate the average time of running the snippet; for example, sum(results["array"]) / (number*repeat) will give us the mean time of a single call of the foo_array() function, in seconds.
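
For example:

    # Mean time of a single foo_array() call, in seconds.
    mean_array = sum(results["array"]) / (number * repeat)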

Watch out!

You have to remember what I mentioned before: timeit functions run the same command one after another in the same session. This is particularly important when you work with mutable objects, but not only.

Let’s consider an example. Imagine you want to compare a generator expression with the corresponding list. Say, you will consume the two objects in a for loop that does nothing — that way, you compare the overhead of using the two types of objects. For instance,
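a benchmark along these lines (a sketch of the setup being discussed):

    import timeit

    t_gen = timeit.repeat(
        "for _ in x_gen: pass",
        setup="x_gen = (_ for _ in range(100))",
    )
    t_list = timeit.repeat(
        "for _ in x_list: pass",
        setup="x_list = [_ for _ in range(100)]",
    )
    print(min(t_gen), min(t_list))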

And we get 0.016 for the generator expression vs 0.669 for the list. That’s an amazing result! Iterating over a list of 100 elements is over 42 times slower than iterating over a generator expression?!

Or… is it?

Whenever you see such crazy results, double-check the code. I am not saying that such results never happen, but rather that you should make sure the code is correct.

Do you see the problem in this code? If not, analyze it once more. There is something wrong with it.

The problem results from using a generator. You can use a generator only once, and then it’s empty. So, in this code, it’s iterated over only once, and then it’s not iterated over anymore for the simple reason that it’s empty. The list, however, is iterated over each time.

That’s why the corresponding snippet takes so much time: The for loop loops over x_gen only once, during the first call, and then does nothing as x_gen is empty. With the list, however, it loops over it every time the snippet is called. Hence the difference.

We can solve it in an easy way, using timeit.repeat(): we can use number=1 and a large value for repeat. That way, each repetition (session) will actually iterate over the generator expression, as it will be recreated in each subsequent session.
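
For example (the value of repeat below is only illustrative):

    number, repeat = 1, 100_000  # with number=1, setup re-runs in every session

    t_gen = timeit.repeat(
        "for _ in x_gen: pass",
        setup="x_gen = (_ for _ in range(100))",
        number=number, repeat=repeat,
    )
    t_list = timeit.repeat(
        "for _ in x_list: pass",
        setup="x_list = [_ for _ in range(100)]",
        number=number, repeat=repeat,
    )

    # Scale the per-session times up by repeat so the numbers are easier to read.
    print(min(t_gen) * repeat, min(t_list) * repeat)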

I’ve multiplied the results by repeat because otherwise they would be very small, each representing the time of a single run of the for loop.

Now we have 15.400 for the generator expression against 0.900 for the list. This time — when we used a correct approach — the list was about 17 times quicker than a generator expression.

A similar situation can happen when you operate on a mutable object: If it’s affected in every call, then the next call will work with this updated version of the object, not the original one. Hence each call works with a different object. An example could be benchmarking how append works for lists. Every time you append an item to a list, the list will grow longer, and so the subsequent appends are incomparable. Play with this to see how this works, and for the pure fun of benchmarking.
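
For instance, a naive append benchmark keeps mutating one and the same list across all timed calls:

    import timeit

    x = []
    timeit.repeat("x.append(1)", globals=globals(), number=1_000_000, repeat=5)
    print(len(x))  # 5_000_000: every timed call appended to the very same list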

Alternatives

Python offers various time benchmarking alternatives. Three that I would like to point out are:

  • cProfile, a built-in Python profiler; this is an extremely useful tool for anyone for whom performance matters;
  • perftester, a package for performance testing, in terms of execution time and memory usage, that also offers benchmarking and profiling tools; and
  • ycecream, a package for sweet debugging and benchmarking of Python code.

Conclusion

The timeit module is perhaps the easiest way to benchmark code. It’s a built-in solution, and its API is relatively easy to use. I definitely recommend it, but consider using the timeit.repeat() function instead of timeit.timeit().

With time, you will notice that quite often it does not really matter whether the code is optimized in every single detail. For example, does it really matter if you find a way of saving 10 milliseconds when the code runs for 10 hours? Sometimes this can make sense, but we have to remember that improving performance often comes with some cost, like less-readable code.

To summarize:

  • Optimize performance when it really matters. Otherwise, strive for code readability.
  • Try to find the right balance between code complexity and performance.
  • Code optimization takes time. Is it worth spending 10 hours on optimizing code to save a second per month? The answer, as usual, is “it depends,” but always remember to ask yourself (and others on the team) this very question.
  • The timeit module offers built-in methods for benchmarking execution time. It’s pretty simple to use, but you should consider it a benchmarking tool, not a profiling tool.
  • When you know that one function is quicker than another, is as readable as the other, and using it in code does not take more time — use it. Why on earth would you use a slower function/method in such a situation?
  • If you want to learn various intricacies of Python, timeit can help a lot. Combined with other tools, it can help you understand how Python works.
  • If you want to learn more about code optimization, profiling and related topics in Python, Gorelick and Ozsvald’s book (2020) is your friend.

Thanks for reading. I hope you’ve enjoyed the article and the timeit module. If you did, be aware that it’s not the end of the story: In later articles, I will discuss other benchmarking tools.
