Hands-on Tutorial

Python Runtime Profiling Using SnakeViz: How to Inspect Code Performance

Determine which parts of your Python code take the most time to run

After working through this tutorial step by step, you will know the basics of Python script profiling: how to profile your own script or function, and how to determine which parts of it take the most time to run.

Introduction to Python script profiling

In production, bugs are not the only concern with a Python script; its run time matters too. Performance becomes an issue as the data volume grows, and at that point the script has to be restructured and optimized.

What if the script is too complex to check line by line? Is there a tool that summarizes where the time goes, so that we don't have to read every line of every file? Yes: this is what profiling is for, and Python has several modules for it.

In this tutorial, we will use cProfile to collect the profiling data and SnakeViz to visualize it.

Introduction to cProfile

cProfile is a built-in module for profiling Python code, and it is commonly used for this purpose by Python programmers. Unlike the profile module, which is written in pure Python, cProfile is implemented as a C extension, so it adds less overhead to the code being measured. Why do developers prefer cProfile over other profiling tools? Probably because of what comes bundled with it:

It is a standard built-in module, and a single command is enough to start profiling
It generates many statistics about script performance. Besides profiling a function itself, cProfile also measures the functions that it calls
It is flexible: we can profile a whole script or a specific function in the script

How to use the cProfile

To follow the tutorial, we need the modules listed below. The two main ones are cProfile and snakeviz. cProfile, pstats, time, and io are part of the Python standard library, so they are available with any Python installation, including Anaconda. snakeviz is a third-party package and must be installed separately, for example with pip install snakeviz.

cProfile — used for deterministic profiling of Python script
pstats — used for analyzing the Python script profiling data
time — used for calculating the runtime of Python script
io — used for dealing with various types of I/O
snakeviz — used for making Python script profiling visualization

Once snakeviz is installed (for example, from the Anaconda Prompt), import the modules into a Jupyter Notebook. Conveniently, Jupyter Notebook can also be run online without a local installation.

# Module for deterministic profiling of Python scripts/programs
import cProfile
import pstats
# Module for timing
import time
# Module for dealing with various types of I/O
import io
# Module for visualizing profiling results
import snakeviz

Create Python script profiling using run()

Our first example of Python script profiling is simple. It is a single function, foo(), that sleeps for 1 second and then prints the string foo. To profile it, cProfile provides the run() function, which takes the Python code to profile (here, a call to our function) as a string.

# One function
def foo():
    time.sleep(1)
    print('foo')
# Run the profiling
cProfile.run('foo()')

The output of Python script profiling part 1 (Image by Author)

Rows in the stats table represent the unique functions called and columns are the information related to these unique functions. Read the detailed information at https://jiffyclub.github.io/snakeviz/.
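
To make the columns concrete, here is a minimal sketch that is not part of the original notebook. It uses two hypothetical functions, caller() and helper(). In the stats table, ncalls is the number of calls, tottime is the time spent in the function itself (excluding the functions it calls), and cumtime is the time including those calls. The sketch reads the numbers through get_stats_profile(), which assumes Python 3.9 or newer.

```python
import cProfile
import pstats


def helper():
    # The loop runs inside helper(), so its cost counts toward helper's tottime
    total = 0
    for i in range(500_000):
        total += i * i
    return total


def caller():
    # Does almost no work itself, but calls helper() twice
    helper()
    helper()


profiler = cProfile.Profile()
profiler.runcall(caller)

# get_stats_profile() is assumed to be available (Python 3.9+)
profile = pstats.Stats(profiler).get_stats_profile()
caller_row = profile.func_profiles['caller']
helper_row = profile.func_profiles['helper']

print('caller: ncalls={} tottime={:.3f} cumtime={:.3f}'.format(
    caller_row.ncalls, caller_row.tottime, caller_row.cumtime))
print('helper: ncalls={} tottime={:.3f} cumtime={:.3f}'.format(
    helper_row.ncalls, helper_row.tottime, helper_row.cumtime))
```

caller() should show a tottime close to zero and a cumtime at least as large as that of helper(), because nearly all of its time is spent inside the two helper() calls.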

Moving on from the simple function above, we will create a more complex example that consists of a loop and a print operation. Two functions, loopingSomething() and printSomething(), are created and called from the main function main().

# More than one function
# 1 Function for looping
def loopingSomething():
    index = []
    for i in range(100):
        index.append(i)
# 2 Function for printing
def printSomething():
    print('Print something here!')
# 3 Bundle above functions
def main():
    # First function
    loopingSomething()
    # Second function
    printSomething()
# Run the profiling
cProfile.run('main()')

The output of Python script profiling part 2 (Image by Author)

Modify the profiling output using Profile class

Sometimes we want to order the stats table by the number of calls or by the time spent in a given function. Called with a single argument, as above, run() orders the table by filename:lineno(function). It does accept a sort argument, but for finer control over the output we can work with the Profile class directly.

The Profile class is useful to modify the output generated by profiling.

The method enable() makes the profiler start collecting profiling information; it is followed by the call to the main function. The method disable() makes the profiler stop collecting.

The data collected by cProfile is then sorted with sort_stats(), a method of the pstats.Stats class. Its argument determines which column is used as the sort key.

Finally, the output is printed with print_stats().

# Create a Profile object
profiler = cProfile.Profile()
profiler.enable()
main()
profiler.disable()
# Sort the profiling based on 'ncalls'
stats = pstats.Stats(profiler).sort_stats('ncalls')
# Print the profiling
stats.print_stats()

The output of Python script profiling part 3 (Image by Author)

Each function in the output is prefixed with its directory path. The paths can be removed with the strip_dirs() method if they clutter the report.

# Remove directory names
stats = pstats.Stats(profiler).strip_dirs().sort_stats('ncalls')
stats.print_stats()

The output of Python script profiling part 4 (Image by Author)
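
As a variation on the steps above, the following sketch uses the Profile object as a context manager instead of calling enable() and disable() by hand, sorts by cumulative time, and limits the report to the first five rows. It assumes Python 3.8 or newer for the context manager; the functions are cut-down copies of the ones in this tutorial, and the report is captured in a buffer only so that it can be inspected afterwards.

```python
import cProfile
import io
import pstats


def loopingSomething():
    index = []
    for i in range(100):
        index.append(i)
    return index


def main():
    loopingSomething()


# Profiling starts on entering the block and stops on leaving it (Python 3.8+)
with cProfile.Profile() as profiler:
    main()

# Capture the report in a buffer, strip directory names,
# sort by cumulative time, and keep only the first five rows
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.strip_dirs().sort_stats(pstats.SortKey.CUMULATIVE).print_stats(5)

report = buffer.getvalue()
print(report)
```

Passing a number to print_stats() is handy on larger scripts, where the full table can run to hundreds of rows.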

Save the cProfile output into different formats

Now that we can print and order the profiling output, let's look at how to save it in different file formats: raw, txt, or CSV.

To save in raw format, we use the dump_stats() method, whose argument is the path (directory and filename) of the output file. In this tutorial, we save the cProfile output in the folder data under the filename cProfileExport.

# Export the profiler output into file
stats = pstats.Stats(profiler)
stats.dump_stats('../data/cProfileExport')
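
The raw file is not human-readable, but it can be loaded again later by passing its path to pstats.Stats. The sketch below shows the round trip under a few assumptions: work() is a hypothetical stand-in for the profiled function, and the file is written to a temporary directory rather than to ../data so that the example runs anywhere.

```python
import cProfile
import os
import pstats
import tempfile


def work():
    # Hypothetical workload, used only to have something to profile
    return sorted(range(1000), reverse=True)


profiler = cProfile.Profile()
profiler.runcall(work)

# Hypothetical location; the tutorial writes to '../data/cProfileExport'
export_path = os.path.join(tempfile.mkdtemp(), 'cProfileExport')
pstats.Stats(profiler).dump_stats(export_path)

# Later, possibly in another session: load the raw file and analyze it again
loaded = pstats.Stats(export_path)
loaded.strip_dirs().sort_stats('cumulative').print_stats(5)
```

This is also the format that SnakeViz reads, so the same file can be reused for the visualization later in the tutorial.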

The output can also be saved in txt format. To do this, create an io.StringIO object and pass it as the stream argument of pstats.Stats, so that print_stats() writes into the buffer instead of the console.

result = io.StringIO()
stats = pstats.Stats(profiler, stream = result).sort_stats('ncalls')
stats.print_stats()
# Save it into disk
with open('../data/cProfileExport.txt', 'w+') as f:
    f.write(result.getvalue())

Saving the output in CSV format takes more effort: the printed report has to be transformed so that it can be parsed as comma-separated values.

result = io.StringIO()
stats = pstats.Stats(profiler, stream = result).sort_stats('ncalls')
stats.print_stats()
result = result.getvalue()
# Chop the string into a csv-like buffer
result = 'ncalls' + result.split('ncalls')[-1]
result = '\n'.join([','.join(line.rstrip().split(None, 6)) for line in result.split('\n')])
# Save it into disk
with open('../data/cProfileExport.csv', 'w+') as f:
    f.write(result)

The cProfile in raw, txt, and CSV format (Image by Author)
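
An alternative that avoids parsing the printed report is sketched below. It is not the approach used in this tutorial: it reads the statistics through get_stats_profile(), which assumes Python 3.9 or newer, and writes them with the csv module. The function work() is a hypothetical stand-in, and the column names are chosen to match the cleaned table shown later. One caveat: get_stats_profile() keys its rows by function name, so two functions with the same name in different files would overwrite each other.

```python
import cProfile
import csv
import io
import pstats


def work():
    # Hypothetical workload, used only to have something to profile
    return [str(i) for i in range(1000)]


profiler = cProfile.Profile()
profiler.runcall(work)

# get_stats_profile() is assumed to be available (Python 3.9+)
profile = pstats.Stats(profiler).strip_dirs().get_stats_profile()

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['ncalls', 'tottime', 'percall', 'cumtime',
                 'percall_2', 'filename:lineno(function)'])
for name, row in profile.func_profiles.items():
    # Rebuild the usual filename:lineno(function) label
    location = '{}:{}({})'.format(row.file_name, row.line_number, name)
    writer.writerow([row.ncalls, row.tottime, row.percall_tottime,
                     row.cumtime, row.percall_cumtime, location])

csv_text = buffer.getvalue()
print(csv_text)
```

Because the csv module quotes fields when needed, function labels that contain spaces or commas stay in a single column.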

Although the cProfile output is now saved as CSV, it still needs some data manipulation before it reads properly.

# Import module for data manipulation
import pandas as pd
# Load the data
df_profiling = pd.read_csv('../data/cProfileExport.csv', sep = ',')
# Print the dimension
print('Dimension data: {} rows and {} columns'.format(len(df_profiling), len(df_profiling.columns)))
df_profiling.head()

Raw data of cProfile from CSV format (Image by Author)

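The manipulation is simple: it only involves concatenating two columns and reordering the columns. Function descriptions that contain a space, such as the built-in method entries, were split across two columns by the export, and the call counts ended up in the index, which is why ncalls is restored from it. The steps are listed below.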

# Fill missing values with empty strings so the concatenation below works
df_profiling.fillna(value = '', inplace = True)
# Rejoin the function description that was split across two columns
df_profiling['filename:lineno(function)'] = df_profiling['percall.1'] + ' ' + df_profiling['filename:lineno(function)']
# Drop the helper column and rename the remaining columns
del df_profiling['percall.1']
cols = ['tottime', 'percall', 'cumtime', 'percall_2', 'filename:lineno(function)']
df_profiling.columns = cols
# Restore ncalls from the index, then reset the index
df_profiling['ncalls'] = df_profiling.index
df_profiling.reset_index(drop = True, inplace = True)
# Reorder columns
cols = ['ncalls', 'tottime', 'percall', 'cumtime', 'percall_2', 'filename:lineno(function)']
df_profiling = df_profiling[cols]

Cleaned data of cProfile from CSV format (Image by Author)

How to visualize the cProfile object using SnakeViz

In practice, our Python scripts and functions are often complex, which makes the cProfile output harder to read than we might expect. To analyze the performance of the functions in a script, it is easier to look at a visualization (a diagram or graph) than to read the cProfile output row by row.

SnakeViz is a third-party Python package that builds an interactive, browser-based visualization from the profile file generated by cProfile. It has two visualization styles: icicle and sunburst charts.

Icicle chart: the time a function takes to run is represented by the width of a rectangle. The root function is at the top of the viz and has the largest width. It calls the sub-functions shown below it, and so on. The narrower a rectangle, the less time the function took
Sunburst chart: the time a function takes to run is represented by the angular extent of an arc. The root function is the circle in the middle of the viz. It calls the sub-functions shown around it, and so on

Create profiling viz for the machine learning function

Let’s create a machine learning pipeline for the iris data set. Its goal is to produce a classification model that predicts the species class from the given characteristics: petal length, petal width, sepal length, and sepal width.

The model is a logistic regression, used as a baseline, and the metric is accuracy. The function, called irisDataClassification(), prints out the final accuracy.

This is a common real-world situation in which runtime optimization is needed as the data grows or the algorithm becomes more complex. Profiling helps us assess which parts of the code or which functions must be optimized.

def irisDataClassification():    
    # Import modules
    from sklearn import datasets
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    # Import some data to play with
    iris = datasets.load_iris()
    X, y = iris.data, iris.target
    
    # Data splitting
    X_train, X_test, y_train, y_test = train_test_split(
        X,
        y,
        test_size = 0.2,
        random_state = 1,
        stratify = y)
    
    # Create logistic regression object
    model = LogisticRegression()
    
    # Data modelling with logistic regression
    model.fit(X_train, y_train)
    
    # Create prediction using testing data
    y_pred = model.predict(X_test)
    
    # Print out the accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(accuracy)

The snakeviz module gives us two ways to generate the visualization: from a Jupyter Notebook or from the terminal. When the scripts become more complex, I suggest using the terminal, because it opens the visualization in a new browser tab.

Create a visualization using Jupyter Notebook

It’s easy to display the visualization: load the snakeviz extension and pass the call irisDataClassification() to the %snakeviz magic command.

# Python script profiling viz with SnakeViz
%load_ext snakeviz
%snakeviz irisDataClassification()

Notification of Snakeviz in Jupyter Notebook (Image by Author)

Python script profiling using SnakeViz in Jupyter Notebook (Image by Author)

Create a visualization using the Terminal

First, save the function irisDataClassification() in a .py file and profile it so that the results are written to a profile file, for example by running python -m cProfile -o iris_classification.prof followed by the name of the script. Next, open the terminal and change to the folder where the profile file is located. Then run the snakeviz command below, followed by the name of the profile file, and a new tab will automatically open for the viz. Adjust the filename to your needs.

snakeviz iris_classification.prof
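
The snakeviz command expects a profile file, so that file has to be created first. One way to do it from Python is sketched below, under stated assumptions: workload() is a hypothetical stand-in that keeps the example self-contained, and in practice it would be replaced by irisDataClassification(). The output filename is chosen to match the command above.

```python
import cProfile
import pstats


def workload():
    # Hypothetical stand-in; replace with irisDataClassification()
    return sum(len(str(i)) for i in range(10_000))


profiler = cProfile.Profile()
profiler.runcall(workload)

# Write the raw profile; the filename matches the snakeviz command above
prof_path = 'iris_classification.prof'
profiler.dump_stats(prof_path)

# Sanity check: the file can be read back by pstats
print(pstats.Stats(prof_path).total_calls)
```

Once the file exists, running snakeviz on it from the same folder should open the visualization in the browser.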

Python script profiling using SnakeViz in Icicle chart (Image by Author)

The default visualization style is the icicle chart, but we can switch to the sunburst chart with the Style dropdown.

Python script profiling using SnakeViz in Sunburst chart (Image by Author)

The stats table is displayed at the bottom of the viz.

Detailed output of Python script profiling (Image by Author)

Explanation of elements in snakeviz

The profiling visualization created by SnakeViz shows a lot of information to support our Python script profiling. Learn about the individual elements in detail at https://jiffyclub.github.io/snakeviz/.

Conclusion

Profiling really matters when we run Python scripts in production. With profiling, we can find out which functions take the most time to run and optimize those first. Runtime optimization is crucial when the production environment charges for the resources being used.

References

[1] M. Davis. SNAKEVIZ (2013). https://jiffyclub.github.io/snakeviz/.

[2] Shrivarsheni. cProfile — How to profile your python code (2020). https://www.machinelearningplus.com/python/cprofile-how-to-profile-your-python-code/.