Mutation Testing with Python

Test the tests — automatically, by applying common mistakes

Based on the Monster Character set by macrovector

We need to kill the mutants — no, I’m not a villain from the X-Men comics. I’m a software engineer who wants to improve unit tests.

In this article you will learn what mutation testing is and how it can help you write better tests. The examples are in Python, but the concepts hold in general, and I also list tools for other languages.

Why do we need mutation testing?

Unit tests have the issue that it’s unclear when your tests are good enough. Do you cover the important edge cases? How do you test the quality of your unit tests?

Typical mistakes are slight confusions: accessing list[i] instead of list[i-1], letting the loop run for i < n instead of i <= n, or initializing a variable with None instead of the empty string. There are a lot of those slight changes, which are usually just called “typos” or “off-by-one” mistakes. When I make them, it is often because I didn’t think about that part thoroughly enough.

Mutation testing tests your unit tests. The key idea is to apply those minor changes to the code, one at a time, and run the unit tests that could fail. Each changed version of the code is called a mutant. If a unit test fails, the mutant was killed, which is what we want: it shows that the test suite would catch this kind of mistake. Of course, we assume that the unit tests themselves are correct or at worst incomplete. Hence you can see mutation testing as an alternative to test coverage. In contrast to test coverage, a mutation testing tool can directly show you the places and types of mistakes that your tests would not catch right now.
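To make the idea concrete, here is a minimal hand-written sketch. The function last_item, its mutant and both tests are hypothetical and only illustrate how one test lets a mutant survive while another kills it; a real tool generates and runs the mutants for you.

```python
def last_item(items):
    """Return the last element of a non-empty list."""
    return items[len(items) - 1]


def last_item_mutant(items):
    """Hand-made mutant: the off-by-one change `- 1` -> `- 2`."""
    return items[len(items) - 2]


def weak_test(func):
    """Passes for both versions: with one element, index 0 and index -1 coincide."""
    return func([7]) == 7


def strong_test(func):
    """Only passes for the original implementation."""
    return func([1, 2, 3]) == 3


# The weak test cannot tell the versions apart, so the mutant survives.
print(weak_test(last_item), weak_test(last_item_mutant))
# The strong test fails for the mutant, so the mutant is killed.
print(strong_test(last_item), strong_test(last_item_mutant))
```

The weak test still gives full line coverage of last_item, which is exactly the gap mutation testing is meant to expose.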

Which mutation testing tools are there?

There are a couple of tools like cosmic-ray, but Anders Hovmöller did a pretty amazing job by creating mutmut. As of August 2020, mutmut is the best library for Python to do mutation testing.

To run the examples in this article, you have to install mutmut:

pip install mutmut

In other languages, you might want to try these:

C / C++: mull
Java: PIT (GitHub)
JavaScript: Stryker
PHP: Infection (formerly called humbug)
Ruby: mutant
Rust: mutagen
Swift: muter

Why isn’t branch and line coverage enough?

It is pretty easy to get to a high line coverage by creating bad tests. For example, take this code:

def fibonacci(n: int) -> int:
    """Get the n-th Fibonacci number, starting with 0 and 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return b  # BUG! should be a!

def test_fibonacci():
    fibonacci(10)

This smoke test already adds some value as it makes sure that the code does not crash for a single input. However, it would not find any logic bug because an assert statement is missing. This pattern can quickly drive the line coverage up to 100%, but you are then still lacking good tests.

A mutation test cannot be fooled as easily. It would mutate the code and, for example, initialize b with 0 instead of 1:

- a, b = 0, 1
+ a, b = 0, 0

The test would still succeed and thus the mutant would survive, which means the mutation testing framework would complain that this line was not properly tested. In other words:

Mutation testing provides a more rigorous alternative to line coverage. It still cannot guarantee that a tested line is correct, but it can show you potential bugs that your current test suite would not detect.
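As a rough sketch of how an assertion changes the picture, the following compares a corrected fibonacci (returning a) with the mutant from the diff above. The names fibonacci_mutant and check are made up for this illustration; in practice the tool creates the mutant and pytest runs the test.

```python
def fibonacci(n: int) -> int:
    """Get the n-th Fibonacci number, starting with 0 and 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fibonacci_mutant(n: int) -> int:
    """The same function with the mutation `a, b = 0, 0` applied."""
    a, b = 0, 0
    for _ in range(n):
        a, b = b, a + b
    return a


def check(func) -> bool:
    """A test with real assertions; returns whether they hold for func."""
    return func(0) == 0 and func(1) == 1 and func(10) == 55


print(check(fibonacci))         # the original passes
print(check(fibonacci_mutant))  # the mutant always returns 0 and fails
```

The same assertions would also have caught the original return b bug, because that version returns 1 for n = 0.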

Create the mutants!

As always, I use my small mpu library as an example. At the moment, it has a 99% branch and 99% line coverage.

$ mutmut run

- Mutation testing starting -

These are the steps:
1. A full test suite run will be made to make sure we
   can run the tests successfully and we know how long
   it takes (to detect infinite loops for example)
2. Mutants will be generated and checked

Results are stored in .mutmut-cache.
Print found mutants with `mutmut results`.

Legend for output:
🎉 Killed mutants.   The goal is for everything to end up in this bucket.
⏰ Timeout.          Test suite took 10 times as long as the baseline so were killed.
🤔 Suspicious.       Tests took a long time, but not long enough to be fatal.
🙁 Survived.         This means your tests needs to be expanded.
🔇 Skipped.          Skipped.

1. Running tests without mutations
⠧ Running...Done

2. Checking mutants
⠸ 1818/1818  🎉 1303  ⏰ 1  🤔 6  🙁 508  🔇 0

This takes over 1.5 hours for mpu. mpu is a small project, with only about 2000 lines of code:

Language     files          blank        comment        code
---------------------------------------------------------------
Python       22            681           1399           2046

One pytest run of the mpu example project takes roughly 9 seconds and the slowest 3 tests are:

1.03s call     tests/test_main.py::test_parallel_for
0.80s call     tests/test_string.py::test_is_email
0.41s call     tests/test_io.py::test_download_without_path

In the end, you will see how many mutants were successfully killed (🎉), how many received a timeout (⏰) and how many survived (🙁). The timeouts are especially annoying as they make the mutmut runs slower, even though the code and the tests might still be fine.
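The numbers above can be put into perspective with a bit of arithmetic. This is only a back-of-the-envelope sketch: it treats "over 1.5 hours" as exactly 1.5 hours, and the explanation for the low average in the note below is my assumption, not something I measured.

```python
# Counts taken from the mutmut output above.
killed, timeout, suspicious, survived, skipped = 1303, 1, 6, 508, 0
total = killed + timeout + suspicious + survived + skipped
assert total == 1818

killed_share = killed / total
survived_share = survived / total

# One full pytest run takes roughly 9 seconds.
baseline_seconds = 9
upper_bound_hours = total * baseline_seconds / 3600

# Observed wall time, using 1.5 hours as a lower bound.
observed_hours = 1.5
avg_seconds_per_mutant = observed_hours * 3600 / total

print(f"{killed_share:.1%} killed, {survived_share:.1%} survived")
print(f"{upper_bound_hours:.1f} h if every mutant needed a full test run")
print(f"{avg_seconds_per_mutant:.1f} s per mutant on average")
```

Roughly 72% of the mutants are killed. A full 9-second run per mutant would add up to about 4.5 hours, while the observed average is about 3 seconds per mutant, presumably because a run can stop as soon as one test fails.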

Which mutations are applied?

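mutmut 2.0 creates the following mutants (source: https://github.com/boxed/mutmut/blob/9fc568648ba81d193f986c25ab60cbee0660dd33/mutmut/__init__.py#L433-L446):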

Operator mutations: About 30 different patterns like replacing + by -, * by ** and similar, but also > by >=.
Keyword mutations: Replacing True by False, in by not in and similar.
Number mutations: Python lets you write numbers in several ways: 0b100 is the same as 4, 0o100 is 64, 0x100 is 256, .12 is 0.12 and so on. The number mutations try to capture mistakes in this area. mutmut simply adds 1 to the number.
Name mutations: The name mutations capture copy vs deepcopy and "" vs None.
Argument mutations: Replaces keyword arguments one by one, from dict(a=b) to dict(aXXX=b).
or_test and and_test: and ↔ or
String mutation: Adding XX to the string.
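To get a feeling for what such mutations look like, here is a toy mutator built on the standard library ast module. It is explicitly not how mutmut is implemented. The class ToyMutator and the function f are made up, it only covers the + to - swap and the "add 1 to a number" rule, and it applies both changes at once, whereas a real tool creates one mutant per change.

```python
import ast

# The literal notations mentioned above are just different spellings.
assert 0b100 == 4 and 0o100 == 64 and 0x100 == 256 and .12 == 0.12


class ToyMutator(ast.NodeTransformer):
    """Toy mutator: swaps + for - and adds 1 to numeric literals."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # mutate the operands first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node

    def visit_Constant(self, node):
        # bool is a subclass of int, so exclude True/False explicitly.
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return ast.copy_location(ast.Constant(value=node.value + 1), node)
        return node


source = "def f(x):\n    return x + 0x10\n"
tree = ast.fix_missing_locations(ToyMutator().visit(ast.parse(source)))
mutated_source = ast.unparse(tree)  # ast.unparse requires Python 3.9+
print(mutated_source)

# Compile the mutated tree to see how the behaviour changed.
namespace = {}
exec(compile(tree, "<mutant>", "exec"), namespace)
print(namespace["f"](20))
```

The original f(20) returns 36. The mutated version computes x - 17, so any test asserting on the result would kill it.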

Those can be grouped into three very different kinds of mutations: value mutations (string mutation, number mutation), decision mutations (switch if-else blocks, e.g. the or_test / and_test and the keyword mutations) and statement mutations (removing or changing a line of code).

The value mutations are most often false positives for me. I’m not certain if I could write my code or my tests in another way to fix this. I’ve briefly discussed it with the library author, but apparently he does not have the same issue. If you’re interested in that discussion, see issue #175 (https://github.com/boxed/mutmut/issues/175).
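A typical false positive of this kind is a mutated log message. The sketch below is hypothetical: parse_port is not part of mpu, and the exact placement of the XX markers is only illustrative. It shows why such a mutant survives: both versions behave identically as far as any reasonable test is concerned.

```python
import logging

logger = logging.getLogger(__name__)


def parse_port(value: str) -> int:
    """Parse a TCP port, falling back to 8080 for invalid input."""
    try:
        return int(value)
    except ValueError:
        logger.warning("Invalid port %r, using the default", value)
        return 8080


def parse_port_mutant(value: str) -> int:
    """String mutation applied: only the log message differs."""
    try:
        return int(value)
    except ValueError:
        logger.warning("XXInvalid port %r, using the defaultXX", value)
        return 8080


# Tests that check return values cannot distinguish the two versions.
for raw in ["80", "abc"]:
    print(raw, parse_port(raw) == parse_port_mutant(raw))
```

Killing this mutant would require a test that asserts on the exact log text, which usually makes the test suite more brittle without making the code safer.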

How can I get an HTML report with mutmut?

$ mutmut html

gives you

Index page of the mutmut HTML report. Image by Martin Thoma.

The complete pd.py report. Image by Martin Thoma.

As you can see, the index claims that 108 mutants survived, but the HTML report only shows one. That one is also a false positive, as a change in the logging message does not cause any issue.

Alternatively, you can use the JUnit XML output to generate a report:

$ pip install junit2html
$ mutmut junitxml > mutmut-results.xml
$ junit2html mutmut-results.xml mutmut-report.html

The report shows this index page:

Test report generated from JUnit XML. Image by Martin Thoma

Clicking on one mutant, you get this:

Mutant #3 was killed, but mutant #4 survived. I did not use the global variable “countries” anywhere in the tests. Image by Martin Thoma.

The issue with this generated HTML report is that it shows many results for a single line of code and does not group them. It would be way more useful if the failures were grouped by file and if one could see the code with the lines that have surviving mutants highlighted.
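Such a grouping could be sketched with a few lines of Python on top of the JUnit XML. Everything about the input here is an assumption: the sample XML is made up, and I assume that each testcase element carries file and line attributes and that surviving mutants are marked with a failure child. Check the real mutmut-results.xml before relying on this.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample in the general JUnit XML shape; not real mutmut output.
SAMPLE_XML = """\
<testsuites>
  <testsuite name="mutmut">
    <testcase name="Mutant #3" file="example/a.py" line="12"/>
    <testcase name="Mutant #4" file="example/a.py" line="12">
      <failure message="survived"/>
    </testcase>
    <testcase name="Mutant #9" file="example/b.py" line="40">
      <failure message="survived"/>
    </testcase>
  </testsuite>
</testsuites>
"""


def survivors_by_file(xml_text: str) -> dict:
    """Map each file to the sorted line numbers that have a failing test case."""
    grouped = defaultdict(set)
    for case in ET.fromstring(xml_text).iter("testcase"):
        if case.find("failure") is not None:
            grouped[case.get("file", "<unknown>")].add(int(case.get("line", 0)))
    return {path: sorted(lines) for path, lines in grouped.items()}


print(survivors_by_file(SAMPLE_XML))
```

With a mapping like this, one could print each source file and mark the listed lines, which is the view I was missing in the generated report.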

Mutation Testing for Machine Learning Systems

I’ve searched for cool applications of machine learning to generate mutants in code, but I’ve only found “Machine Learning Approach in Mutation Testing” from 2012 (12 citations).

I was hoping to find data-based code mutant generation techniques. For example, one could search for git commits which are bug fixes by examining the commit message. If the fix is rather short, this is a kind of mutation one could test for. Instead of generating all possible mutants, one could sample from the mutants in a way that takes the most promising ones first: the ones that are least likely to be perceived as false positives.

Other work was more focused on making machine learning systems more robust (DeepMutation, DeepGauge, an Evaluation). I don’t know this stream of work well enough to write about it. But it sounds similar to techniques I know:

To overcome scarcity in training data, various data augmentation techniques such as rotations, flips, or color adjustments are applied. You can actually see those as mutations.
Also, in the GAN setting where you have a generator and a discriminator, you could argue that the generator produces mutants and the discriminator should tell them apart.
In order to force the network to learn more robust features, a technique called dropout (TensorFlow, Lasagne) is commonly used. You could say that a part of the input or the internal representation is randomly mutated by setting it to zero.
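Two of these "mutations" are easy to sketch in plain Python. This is only an illustration with hypothetical helper functions, not the TensorFlow or Lasagne implementation. The dropout variant shown is the common "inverted" form, which rescales the kept values so that the expected sum stays the same.

```python
import random


def horizontal_flip(image):
    """Data augmentation: mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`, rescale the rest.

    `rate` must be smaller than 1.
    """
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


image = [[1, 2, 3], [4, 5, 6]]
print(horizontal_flip(image))

rng = random.Random(0)  # seeded so that the run is reproducible
print(dropout([1.0, 2.0, 3.0, 4.0], rate=0.5, rng=rng))
```

In both cases the label stays the same while the input is changed slightly, which is where the analogy to mutants ends: here the model is supposed to be insensitive to the change, whereas a good test suite should notice it.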

What’s next?

In this series, we already had:

Part 1: The basics of Unit Testing in Python
Part 2: Patching, Mocks and Dependency Injection
Part 3: How to test Flask applications with Databases, Templates and Protected Pages
Part 4: tox and nox
Part 5: Structuring Unit Tests
Part 6: CI-Pipelines
Part 7: Property-based Testing
Part 8: Mutation Testing
Part 9: Static Code Analysis — Linters, Type Checking, and Code Complexity
Part 10: Pytest Plugins to Love

Let me know if you’re interested in other topics around testing with Python.