Pandas 2.0 — Numpy vs Arrow back-end

A Dataframe speed test. Who wins?

I’ve written a couple of articles recently comparing the speed of Pandas vs Polars and Pandas vs Psycopg, all based on the reading and writing of data into and out of a Postgres database. Check out those articles here:

Pandas v Polars: A database speed test. Who wins?

We have a process that runs against our AWS RDS Postgres database. It’s based on a Python Jupyter Notebook and reads a…

levelup.gitconnected.com

Pandas v Psycopg:

A Postgres database speed test. Who wins?

medium.com

With the advent of Pandas version 2.0, it made sense to test out what performance enhancements if any, of one of the biggest features the new version of Pandas offers, namely, different back-end data handling.

Historically, the back-end data processing of Pandas relied heavily on the Numpy library. This included the Numpy data types and memory management. With the introduction of Pandas 2.0, it’s now possible to use Arrow — via the Pyarrow library — as the favoured back-end data engine.

Arrow is a relatively new addition to the Apache software foundation. It’s a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. Pyarrow promises:

Efficient Data Structures: Pyarrow includes efficient, in-memory data structures that can be shared between Python and other Arrow-compatible languages without copying or changing the underlying memory representation.

Data Interchange: Pyarrow provides a common data format for sharing information between various systems for processing, storage, and transmission. This allows you to use multiple systems with your data, all without the overhead of data serialization or deserialization.

Computational Libraries: Pyarrow includes computational libraries that can work directly with Arrow’s in-memory data structures. These libraries are optimized for modern CPUs and leverage features like SIMD (Single Instruction, Multiple Data) for added performance.

Integration with pandas: Pyarrow has built-in functionality for converting data between its format and Pandas data structures. This allows for more efficient operations when using pandas for computation and Arrow for storage and transport.

Parquet and ORC file formats support: Apache Arrow includes libraries for reading and writing data in the Parquet and ORC file formats. These file formats are often used in Cloud and Hadoop-based big data systems and support efficient, columnar storage of complex, nested data.

All in all, Pyarrow promises improved performance and memory management and this is what we’ll be testing out. To do that I’ll be using the three input database tables and files similar to those that I used in my Pandas v Psycopg article. Each database table will have 10 columns with a mixture of int, floats, and text data columns. The file data will be the same as the table data.

There will be six tests for each of the Numpy and Arrow back-ends. Three populating a data frame from each of the database tables and three doing the same but from local CSV files. After each load into the data frame, we’ll perform some simple computations on the data and present the timings (in seconds), and memory usage for each test.

Here is the code we’ll use:

from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd

query = f"SELECT * FROM your_table"

# Create the SQLAlchemy engine with your postgres database connection string
engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')

# Query the database and save the result in a DataFrame,
# only the backend directibve will change
with engine.connect() as conn:
  df1=pd.read_sql_query(text(query), con = conn,dtype_backend='numpy_nullable'))
  # changed to this for pyarrow backend test case
  # df1=pd.read_sql_query(text(query), con = conn,dtype_backend='pyarrow'))
  
  # dataframe memory usage
  df1.info(memory_usage="deep")

  # do some calcs

  df_grouped = df1.groupby(by=["text_field2","text_field1","text_field3"])["float_field1"].sum()


  df1['new_number1'] = df1['int_field1'] * df1['float_field2']

  df1['new_number2'] = df1.text_field1.str.len() * df1.text_field2.str.len() * df1.text_field3.str.len() * df1.text_field4.str.len()

The only difference between the Arrow and Numpy code is the back-end directive stipulating either Numpy or Arrow. The time taken column shows how much time was taken to read the data from the database into a data frame, and then perform a series of computations on the data frame data.

Database Table-> Dataframe (Numpy)

 Name     |   Rows        |  Columns    |Dataframe Memory Use|Time taken     |
+---------+---------------+-------------+--------------------+---------------+
| Table_1 |  196608       |    10       |     56 MB          |2.5            |
| Table_2 |  1769472      |    10       |    541 MB          |18             |
| Table_3 |  7077888      |    10       |   2100 MB          |76             |
+---------+---------------+-------------+--------------------+---------------+

And now the timings using Arrow

Database Table-> Dataframe (Arrow)

Name      |   Rows        |  Columns    |Dataframe Memory Use|Time taken    |
+---------+---------------+-------------+--------------------+---------------+
| Table_1 |  196608       |    10       |     18 MB          |2              |
| Table_2 |  1769472      |    10       |    164 MB          |18             |
| Table_3 |  7077888      |    10       |    656 MB          |70             |
+---------+---------------+-------------+--------------------+---------------+

So, in terms of the time taken, there isn’t much difference between the two, however, the Pyarrow back-end uses far less memory than the default Numpy — roughly saving one-third of the memory in each test case. Depending on the RAM available to you this could be an extremely useful saving.

Let’s repeat the above tests, this time reading into the data frames from local files rather than the database. The data in the files was the same as in the database tables. First, here’s the code we used.

import pandas as pd


# Read the data from the file into a DataFrame
df1 = pd.read_csv('file1.csv',dtype_backend='numpy_nullable')
# changed to this for the arrow backend test case
# df1 = pd.read_csv('file1.csv',engine='pyarrow',dtype_backend='pyarrow')

    
df_grouped = df1.groupby(by=["text_field2","text_field1","text_field3"])["float_field1"].sum()
df1['new_number'] = df1['int_field1'] * df['float_field2']
df1['new_number2'] = df1.text_field1.str.len() * df1.text_field2.str.len() * df1.text_field3.str.len() * df1.text_field4.str.len()

File -> Dataframe (Numpy)

Name     |   Rows        |  File Size  |Dataframe Memory Use|Time taken     |
+---------+---------------+------------+--------------------+---------------+
| File_1 |  196608       |    8.5MB    |     60 MB          |0.5            |
| File_2 |  1769472      |    76 MB    |    541 MB          |4.23           |
| File_3 |  7077888      |   304 MB    |   2100 MB          |16             |
+---------+---------------+-------------+--------------------+--------------+

Now for the arrow back-end timings

File -> Dataframe (Pyarrow)

Name     |   Rows        |  File Size  |Dataframe Memory Use|Time taken     |
+---------+---------------+------------+--------------------+---------------+
| File_1 |  196608       |    8.5MB    |     18 MB          |0.14           |
| File_2 |  1769472      |    76 MB    |    164 MB          |1.03           |
| File_3 |  7077888      |   304 MB    |    656 MB          |4.28           |
+---------+---------------+-------------+--------------------+--------------+

Once again, the Pyarrow back-end trumps Numpy, this time in both memory management and run-time

That’s all for me for now. Hopefully, you find this helpful. If you do, please clap and follow me for more content like this.

If you are not already a medium member and appreciate content like this please consider joining using this link. If you do, I earn a (very) small commission which helps me as a writer.