An Algorithm That Saved Biology and Shaped Modern Medicine

Finding patterns is an essential part of research in Biology.

Be it finding repeating patterns in DNA, RNA, or in a string of amino acids in a protein, the applications of pattern matching can be found in multiple places.

Let’s dive in deep!

The Human DNA

The genetic information of a cell is found in its nuclei (also in mitochondria, but that is for some other day).

This information is in form of DNA (deoxyribonucleic acid) that is structured as chromosomes.

Human DNA is a polymer that is composed of two chains formed by monomers called nucleotides.

This polymer is very long and contains millions of nucleotides.

These chains twist around each other and form a double helical structure.

There are four types of Nucleotides in each chain and these are:

Adenine (A)
Thymine (T)
Guanine (G)
Cytosine (C )

Image from Wikipedia (Ignore the RNA side of the image)

A Ground Breaking Research

Let’s say that our hypothetical research group found out that a sequence of nucleotides (gene)ATTCCA , if present in one’s DNA causes early heart attacks.

It’s all over on the television and people are gathering to get their DNA tested.

If we sequence one’s DNA, we will get a long string of millions of nucleotides.

ATCGATGCAAGGTCGATCGATGCGCGCTAGCGATCGGCATCGATCGAGTCGATCGATCGATCGA...

Searching for our sequence of interest is a time-taking and tedious task for a human being and hence we need a computer to help with this.

Programmer Stepping In

Let’s say that we have a programmer in our research group.

We ask them to write an algorithm that can find this pattern in our long string of nucleotides.

Problem Statement

A problem statement is formulated that explicitly mentions what we need our computer to do.

Given two strings:

- the string of all nucleotides in the DNA (dna )

- the string of the gene of interest (gene )

Find all occurrences of gene in dna

A Naïve Solution

The programmer starts with a brute force approach to solve this problem.

Steps

We save the length of dna in dna_length variable and the length of gene in gene_length variable.
We set pointer_1 to the beginning of dna string and increment it using a for loop till index dna_length — gene_length + 1
We set pointer_2 to the beginning of gene string and increment it using a while loop
The if statement checks if nucleotides at both pointers’ indexes are equal.
If this is the case, pointer_2is incremented, and if not pointer_1 is incremented.
If pointer_2 transverses the whole length of gene and is able to match with the last nucleotide, we print that a match has been found.

Great job!

But is this algorithm efficient?

This algorithm runs in:

Time complexity of O (gene_length * (dna_length — gene_length + 1) ) because we are looping through both of the strings.
Space complexity of O(1) because we are not creating any new intermediate data structure to be used by our algorithm.

An Efficient Approach: Knuth–Pratt-Morris algorithm

Steps in the Knuth-Pratt-Morris Algorithm!

As defined by Wikipedia, this algorithm searches for occurrences of a string within another string by employing the observation that when a mismatch occurs, the first string itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

The time complexity of this search algorithm is O(dna_length + gene_length ).

The space complexity of this search algorithm is O(gene_length ).

But how does the algorithm do this in linear time?

The Magic Ingredient

To perform the pattern search in linear time, the algorithm performs a step where the gene string is pre-processed to construct an intermediate array that contains the longest proper prefix that is also a proper suffix of the string.

Confused? Let’s clear things up.

Example 1

Let’s take a gene string i.e. ATGA

The prefixes of this string are A , AT , ATG , ATGA .

The suffixes of this string are A,GA , TGA , ATGA .

When we consider the proper prefixes and suffixes, we exclude the whole string prefix and suffix, respectively.

Hence, the proper prefixes of this string are A , AT , ATG (excluded ATGA ).

The proper suffixes of this string are A,GA , TGA (excluded ATGA ).

Example 2

Let’s take another gene string i.e. ATATAGTA .

Let’s make an array that contains longest proper prefix that is also a proper suffix of the string.

|A|T|A|T|A|G|T|A|

[0,0,1,2,3,0,0,1]

Look at the substring till index 2 i.e. ATA

The proper prefixes of it are A , AT

The proper suffix of it are A , TA

The longest proper prefix that is also a proper suffix of this substring is A . The length of it is 1 . Hence, we put 1 at this index in the array.

Example 3

Look at the substring till index 4i.e. ATATA

The proper prefixes of it are A , AT , ATA , ATAT

The proper suffix of it are A , TA , ATA , TATA

The proper prefixes that are also proper suffixes of this substring areA and ATA.

The longest of them i.e. ATA has a length of 3 .

Hence, we put 3 at this index in the array.

Creating The Intermediate Array

Let’s write up a function that generates this intermediate array.

This function uses two pointers i and j at indices 1 and 0 respectively.

This function will return the intermediate array i.e. [0,1,2,0,1,2,3,3] .

Searching Our Gene of Interest in the DNA

We will use two pointers for this.

i is a pointer on the dna string.

j is a pointer on the gene string.

While searching we try to match each nucleotide in the gene with the nucleotide in the DNA.

But this will make us loop through both strings like in the naïve approach.

To avoid this we use our intermediate array to decide the next characters to be matched (avoiding characters that will match anyways) i.e. by changing j according to our generated intermediate array arr .

Let’s try it out!

This outputs to:

Gene found at index 0
Gene found at index 2
Gene found at index 4
Gene found at index 6
Gene found at index 8
Gene found at index 10
Gene found at index 12
Gene found at index 14

And this is how Knuth-Pratt-Morris algorithm makes things easy and saves time for biologists!

Resources

Knuth-Morris-Pratt algorithm - Wikipedia

In computer science, the Knuth-Morris-Pratt string-searching algorithm (or KMP algorithm) searches for occurrences of a…

en.wikipedia.org

The Knuth-Morris-Pratt Algorithm in my own words

For the past few days, I've been reading various explanations of the Knuth-Morris-Pratt string searching algorithms…

jakeboxer.com

KMP Algorithm for Pattern Searching - GeeksforGeeks

Given a text txt[0..n-1] and a pattern pat[0..m-1], write a function search(char pat[], char txt[]) that prints all…

www.geeksforgeeks.org

Thanks for reading this article!

If you are someone new to Python or programming overall, check out my new book called ‘The No Bulls**t Guide To Learning Python’ below:

The No Bulls**t Guide To Learning Python

Are you someone who is thinking about learning to program but does not know where to start?I have the solution for you…

bamaniaashish.gumroad.com

Join Medium with my referral link — Dr. Ashish Bamania

Read every story from Dr. Ashish Bamania (and thousands of other writers on Medium). Your membership fee directly…

bamania-ashish.medium.com