avatarThomas Reid

Summary

The article compares the performance of Pandas 2.0 using Numpy versus Arrow (Pyarrow) as the back-end for data handling, focusing on memory usage and speed in various data frame operations.

Abstract

The author of the article has conducted a series of tests to evaluate the performance enhancements introduced by Pandas 2.0, particularly the option to use Arrow as a back-end data engine through the Pyarrow library. This is a departure from the historical reliance on Numpy for Pandas' data processing. The tests involved populating data frames from database tables and local CSV files, followed by computations, and then measuring the time taken and memory usage. The results indicate that while there is no significant difference in speed between Numpy and Arrow, the Pyarrow back-end consistently uses less memory, approximately one-third less in each test case. This memory efficiency can be particularly beneficial for systems with limited RAM. Additionally, when reading data from files, Pyarrow outperforms Numpy in both memory management and runtime, suggesting that Pyarrow is a superior choice for these operations.

Opinions

  • The author suggests that the memory savings from using Pyarrow can be substantial, which is advantageous for systems with constrained memory resources.
  • The performance of Pyarrow in terms of speed is comparable to that of Numpy when reading data from a database.
  • Pyarrow demonstrates superior performance in both memory usage and speed when reading data from local files.
  • The author implies that the integration of Pyarrow with Pandas is seamless, allowing for efficient operations and data interchange without significant changes to existing Pandas workflows.
  • The author encourages readers to consider Pyarrow as a back-end for Pandas, especially for those who value memory efficiency and are working with large datasets.
  • By providing detailed code examples and performance metrics, the author conveys a strong endorsement for the adoption of Pyarrow with Pandas 2.0 for data handling tasks.
Image by Author

Pandas 2.0 — Numpy vs Arrow back-end

A Dataframe speed test. Who wins?

I’ve written a couple of articles recently comparing the speed of Pandas vs Polars and Pandas vs Psycopg, all based on the reading and writing of data into and out of a Postgres database. Check out those articles here:

With the advent of Pandas version 2.0, it made sense to test out what performance enhancements if any, of one of the biggest features the new version of Pandas offers, namely, different back-end data handling.

Historically, the back-end data processing of Pandas relied heavily on the Numpy library. This included the Numpy data types and memory management. With the introduction of Pandas 2.0, it’s now possible to use Arrow — via the Pyarrow library — as the favoured back-end data engine.

Arrow is a relatively new addition to the Apache software foundation. It’s a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. Pyarrow promises:

Efficient Data Structures: Pyarrow includes efficient, in-memory data structures that can be shared between Python and other Arrow-compatible languages without copying or changing the underlying memory representation.

Data Interchange: Pyarrow provides a common data format for sharing information between various systems for processing, storage, and transmission. This allows you to use multiple systems with your data, all without the overhead of data serialization or deserialization.

Computational Libraries: Pyarrow includes computational libraries that can work directly with Arrow’s in-memory data structures. These libraries are optimized for modern CPUs and leverage features like SIMD (Single Instruction, Multiple Data) for added performance.

Integration with pandas: Pyarrow has built-in functionality for converting data between its format and Pandas data structures. This allows for more efficient operations when using pandas for computation and Arrow for storage and transport.

Parquet and ORC file formats support: Apache Arrow includes libraries for reading and writing data in the Parquet and ORC file formats. These file formats are often used in Cloud and Hadoop-based big data systems and support efficient, columnar storage of complex, nested data.

All in all, Pyarrow promises improved performance and memory management and this is what we’ll be testing out. To do that I’ll be using the three input database tables and files similar to those that I used in my Pandas v Psycopg article. Each database table will have 10 columns with a mixture of int, floats, and text data columns. The file data will be the same as the table data.

There will be six tests for each of the Numpy and Arrow back-ends. Three populating a data frame from each of the database tables and three doing the same but from local CSV files. After each load into the data frame, we’ll perform some simple computations on the data and present the timings (in seconds), and memory usage for each test.

Here is the code we’ll use:

from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd

query = f"SELECT * FROM your_table"

# Create the SQLAlchemy engine with your postgres database connection string
engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')

# Query the database and save the result in a DataFrame,
# only the backend directibve will change
with engine.connect() as conn:
  df1=pd.read_sql_query(text(query), con = conn,dtype_backend='numpy_nullable'))
  # changed to this for pyarrow backend test case
  # df1=pd.read_sql_query(text(query), con = conn,dtype_backend='pyarrow'))
  
  # dataframe memory usage
  df1.info(memory_usage="deep")

  # do some calcs

  df_grouped = df1.groupby(by=["text_field2","text_field1","text_field3"])["float_field1"].sum()


  df1['new_number1'] = df1['int_field1'] * df1['float_field2']

  df1['new_number2'] = df1.text_field1.str.len() * df1.text_field2.str.len() * df1.text_field3.str.len() * df1.text_field4.str.len()

The only difference between the Arrow and Numpy code is the back-end directive stipulating either Numpy or Arrow. The time taken column shows how much time was taken to read the data from the database into a data frame, and then perform a series of computations on the data frame data.

Database Table-> Dataframe (Numpy)

 Name     |   Rows        |  Columns    |Dataframe Memory Use|Time taken     |
+---------+---------------+-------------+--------------------+---------------+
| Table_1 |  196608       |    10       |     56 MB          |2.5            |
| Table_2 |  1769472      |    10       |    541 MB          |18             |
| Table_3 |  7077888      |    10       |   2100 MB          |76             |
+---------+---------------+-------------+--------------------+---------------+

And now the timings using Arrow

Database Table-> Dataframe (Arrow)

Name      |   Rows        |  Columns    |Dataframe Memory Use|Time taken    |
+---------+---------------+-------------+--------------------+---------------+
| Table_1 |  196608       |    10       |     18 MB          |2              |
| Table_2 |  1769472      |    10       |    164 MB          |18             |
| Table_3 |  7077888      |    10       |    656 MB          |70             |
+---------+---------------+-------------+--------------------+---------------+

So, in terms of the time taken, there isn’t much difference between the two, however, the Pyarrow back-end uses far less memory than the default Numpy — roughly saving one-third of the memory in each test case. Depending on the RAM available to you this could be an extremely useful saving.

Let’s repeat the above tests, this time reading into the data frames from local files rather than the database. The data in the files was the same as in the database tables. First, here’s the code we used.

import pandas as pd


# Read the data from the file into a DataFrame
df1 = pd.read_csv('file1.csv',dtype_backend='numpy_nullable')
# changed to this for the arrow backend test case
# df1 = pd.read_csv('file1.csv',engine='pyarrow',dtype_backend='pyarrow')

    
df_grouped = df1.groupby(by=["text_field2","text_field1","text_field3"])["float_field1"].sum()
df1['new_number'] = df1['int_field1'] * df['float_field2']
df1['new_number2'] = df1.text_field1.str.len() * df1.text_field2.str.len() * df1.text_field3.str.len() * df1.text_field4.str.len()

File -> Dataframe (Numpy)

Name     |   Rows        |  File Size  |Dataframe Memory Use|Time taken     |
+---------+---------------+------------+--------------------+---------------+
| File_1 |  196608       |    8.5MB    |     60 MB          |0.5            |
| File_2 |  1769472      |    76 MB    |    541 MB          |4.23           |
| File_3 |  7077888      |   304 MB    |   2100 MB          |16             |
+---------+---------------+-------------+--------------------+--------------+

Now for the arrow back-end timings

File -> Dataframe (Pyarrow)

Name     |   Rows        |  File Size  |Dataframe Memory Use|Time taken     |
+---------+---------------+------------+--------------------+---------------+
| File_1 |  196608       |    8.5MB    |     18 MB          |0.14           |
| File_2 |  1769472      |    76 MB    |    164 MB          |1.03           |
| File_3 |  7077888      |   304 MB    |    656 MB          |4.28           |
+---------+---------------+-------------+--------------------+--------------+

Once again, the Pyarrow back-end trumps Numpy, this time in both memory management and run-time

That’s all for me for now. Hopefully, you find this helpful. If you do, please clap and follow me for more content like this.

If you are not already a medium member and appreciate content like this please consider joining using this link. If you do, I earn a (very) small commission which helps me as a writer.

Pyarrow
Numpy
Pandas
Python
Speed Test
Recommended from ReadMedium