Understand and Build FP-Growth Algorithm in Python
Frequency Pattern Mining using FP-tree and conditional FP-tree in Python
What is FP-Growth
FP-growth is an improved version of the Apriori Algorithm which is widely used for frequent pattern mining(AKA Association Rule Mining). It is used as an analytical process that finds frequent patterns or associations from data sets. For example, grocery store transaction data might have a frequent pattern that people usually buy chips and beer together. The Apriori Algorithm produces frequent patterns by generating itemsets and discovering the most frequent itemset over a threshold “minimal support count”. It greatly reduces the size of the itemset in the database by one simple principle:
If an itemset is frequent, then all of its subsets must also be frequent.
However, the Apriori Algorithm has a major shortfall. Using Apriori required multiple scans of the database to check the support count of each item and itemsets. When the database is huge, this will cost a significant amount of disk I/O and computing power. Therefore the FP-Growth algorithm is created to overcome this shortfall. It only scans the database twice and used a tree structure(FP-tree) to store all the information. The root represents null, each node represents an item, while the association of the nodes is the itemsets with the order maintained while forming the tree. The FP-tree is concise and is used to directly generating large itemsets. Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets.
FP-tree Pseudocode and Explanation
- Step 1: Deduce the ordered frequent items. For items with the same frequency, the order is given by the alphabetical order.
- Step 2: Construct the FP-tree from the above data
- Step 3: From the FP-tree above, construct the FP-conditional tree for each item (or itemset).
- Step 4: Determine the frequent patterns.
Let’s take a look at an example of how to generate an FP-tree. Find all frequent itemsets with support ≥2. First, find all items with support count ≥ 2.

Before constructing the FP-tree, reorder the transaction based on their item frequency. When start constructing the FP-tree, we need to create a header table that records the location of all item node with a link list. Each time a new node is added to the tree, it needs to be linked to the last node with the same item. The header table is essential to build conditional FP-tree in the later steps.

Now we have scanned the database twice and build the FP-tree. All the transaction information is contained in the tree. The only thing left to be done is to recursively construct the conditional FP-tree to find all frequent itemsets with support count larger than 2. Use the header table we created in the last step and start from the item with the least frequent count to the most(E, D, C, B, A).

Finally, we can easily generate all frequent itemsets from those conditional FP-trees {E}, {A, D, E}, {A, E}, {D, E}, {D}, {A, D}, {C, D}, {C}, {A, C}, {B}, {A, B}, {A}.

Python Implementation
Here is some sample code to build FP-tree from scratch and find all frequency itemsets in Python 3.
In conclusion, FP-tree is still the most efficient and scalable method for mining the complete set of frequent patterns in a dataset. Most programming languages including Python, R and even Pyspark have well-supported libraries to generate FP-tree.






