Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5408

Abstract

v><p>medium.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*36K8H7FZvSK85Oq0tv7VwQ.png)"></div> </div> </div> </a> </div><p id="f3fc"><b>Postgres -> CSV file using Pandas</b></p><p id="b1e6">The Pandas code is very straightforward. Make sure you have the following libraries pip installed first:</p><div id="c90e"><pre>pip install pandas psycopg2 sqlalchemy </pre></div><p id="e8b3">Substitute your own values for the database credentials etc …</p><div id="436e"><pre><span class="hljs-keyword">from</span> sqlalchemy <span class="hljs-keyword">import</span> create_engine <span class="hljs-keyword">from</span> sqlalchemy.sql <span class="hljs-keyword">import</span> text <span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd

query = <span class="hljs-string">f""" SELECT * FROM your_table """</span> <span class="hljs-comment"># Create the SQLAlchemy engine with your postgres database connection string</span> engine = create_engine(<span class="hljs-string">'postgresql://your_user:your_pass@your_hostname:your_port/your_dbname'</span>) <span class="hljs-comment"># Query the database and save the result in a DataFrame</span> <span class="hljs-keyword">with</span> engine.connect() <span class="hljs-keyword">as</span> conn: df=pd.read_sql_query(text(query), con = conn) <span class="hljs-comment"># Write the DataFrame to a CSV file</span> df.to_csv(<span class="hljs-string">"output.csv"</span>, index=<span class="hljs-literal">False</span>)</pre></div><p id="790a">I ran the above code against three tables containing increasing data volumes. Each table was comprised of 10 integer columns. Here are the results (timings in seconds):</p><div id="5c7f"><pre>Postgres -> CSV file using Pandas

Name | Rows | Columns | Disk Space taken |Time to read | File output size | +---------+---------------+-------------+--------------------+---------------+------------------+ | Table_1 | 22,500 | 10 | 1.5 MB |.48 | 485 KB | | Table_2 | 2,880,000 | 10 | 187 MB | 31 | 62 MB | | Table_3 | 11,500,000 | 10 | 1300 MB |253 | 495 MB | +---------+---------------+-------------+--------------------+---------------+------------------+ </pre></div><p id="775d">Note, that if your input table is really huge you may run into memory issues. If that is the case you can try using a chunking strategy to read and write the data.</p><p id="24dc"><b>Postgres -> CSV file using Psycopg2</b></p><p id="77bf">One of the fastest ways to get data into and out of a Postgres database to CSV files using Psycopg is to use the <b>copy_export</b> command. Under the hood, it uses PostgreSQL’s COPY command which is a fast way to load large datasets into and out of PostgreSQL because it avoids the overhead of executing individual SQL INSERT commands.</p><p id="9df7">Here is the code I used.</p><div id="1db0"><pre><span class="hljs-keyword">import</span> psycopg2

<span class="hljs-comment"># Set up a connection to the PostgreSQL database</span> connection = psycopg2.connect( host=<span class="hljs-string">"localhost"</span>, database=<span class="hljs-string">"your_database"</span>, user=<span class="hljs-string">"your_username"</span>, password=<span class="hljs-string">"your_password"</span> ) <span class="hljs-comment"># Create a cursor object</span> cur = connection.cursor() <span class="hljs-comment"># Define the COPY command</span> copy_cmd = <span class="hljs-string">"COPY your_table TO STDOUT WITH (FORMAT CSV, HEADER)"</span> <span class="hljs-comment"># Open the file to write to and execute the COPY command</span> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">'output.csv'</span>, <span class="hljs-string">'w'</span>) <span class="hljs-keyword">as</span> f: cur.copy_expert(copy_cmd, f) <span class="hljs-comment"># Close the connection</span> connection.close()</pre></div><p id="7e20">I ran this against the same three test Postgres tables and the timings (in seconds) are shown below:</p><div id="30cb"><pre>Postgres -> CSV file using Psycopg2

Options

e dataframe to a database table. Again, the code is straightforward.</p><div id="3602"><pre><span class="hljs-keyword">import</span> pandas <span class="hljs-keyword">as</span> pd <span class="hljs-keyword">from</span> sqlalchemy <span class="hljs-keyword">import</span> create_engine

<span class="hljs-comment"># Define the connection string and create engine</span> engine = create_engine(<span class="hljs-string">'postgresql://your_username:your_password@localhost/your_database'</span>) <span class="hljs-comment"># Read the data from the file into a DataFrame</span> df = pd.read_csv(<span class="hljs-string">'output.csv'</span>) <span class="hljs-comment"># Write the data to the database</span> df.to_sql(<span class="hljs-string">'your_table'</span>, engine, if_exists=<span class="hljs-string">'replace'</span>, index=<span class="hljs-literal">False</span>)</pre></div><p id="a843">And here are the timings (in seconds). For the input files, I used the output files of the <b>Postgres ->CSV file using Pandas </b>operation.</p><div id="b3ba"><pre>CSV-> Postgres file using Pandas

Name | Rows | Columns | File size |Time to write | +---------+---------------+-------------+--------------------+---------------+ | Table_1 | 22,500 | 10 | 485 KB |2 | | Table_2 | 2,880,000 | 10 | 62 MB |215 | | Table_3 | 11,500,000 | 10 | 495 MB |2031 | +---------+---------------+-------------+--------------------+---------------+</pre></div><p id="671b"><b>CSV -> Postgres using Psycopg2</b></p><div id="c78c"><pre><span class="hljs-keyword">import</span> psycopg2

<span class="hljs-comment"># Set up a connection to the PostgreSQL database</span> connection = psycopg2.connect( host=<span class="hljs-string">"localhost"</span>, database=<span class="hljs-string">"your_database"</span>, user=<span class="hljs-string">"your_username"</span>, password=<span class="hljs-string">"your_password"</span> ) <span class="hljs-comment"># Create a cursor object</span> cur = connection.cursor() <span class="hljs-comment"># Define the COPY command</span> copy_cmd = <span class="hljs-string">"COPY your_table FROM STDIN WITH (FORMAT CSV, HEADER)"</span> <span class="hljs-comment"># Open the file to read from and execute the COPY command</span> <span class="hljs-keyword">with</span> <span class="hljs-built_in">open</span>(<span class="hljs-string">'input.csv'</span>, <span class="hljs-string">'r'</span>) <span class="hljs-keyword">as</span> f: cur.copy_expert(copy_cmd, f) <span class="hljs-comment"># Commit the transaction</span> connection.commit() <span class="hljs-comment"># Close the connection</span> connection.close()</pre></div><p id="b05b">And here are the timings (in seconds). For the input files, I used the output files of the <b>Postgres -> CSV file using Psycopg </b>operation.</p><div id="f44a"><pre>CSV-> Postgres file using Pandas

Name | Rows | Columns | File size |Time to write | +---------+---------------+-------------+--------------------+---------------+ | Table_1 | 22,500 | 10 | 485 KB |0.14 | | Table_2 | 2,880,000 | 10 | 62 MB |5 | | Table_3 | 11,500,000 | 10 | 495 MB |23 | +---------+---------------+-------------+--------------------+---------------+</pre></div><p id="b746">Once again pyscopg2 blows Pandas out of the water, being anywhere between 14 and 100 times faster, noting that the relative speed improvement increased substantially with increasing data volumes.</p><blockquote id="e264"><p><i>That’s all for me for now. Hopefully, you found this article useful. If you did, please check out my profile page at <a href="https://medium.com/@thomas_reid">this link</a>. From there, you can see my other published stories and subscribe to get notified when I post new content.</i></p></blockquote><p id="6992">If you liked this content, you might find these articles interesting too.</p><div id="ccbb" class="link-block"> <a href="https://betterprogramming.pub/duckdb-ed960853ce65"> <div> <div> <h2>DuckDB</h2> <div><h3>A Deep Dive into a Fast and Efficient Database Using Python</h3></div> <div><p>betterprogramming.pub</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*ZnAlLXNTb0mEl6ZsGanMqA.png)"></div> </div> </div> </a> </div><div id="4cf7" class="link-block"> <a href="https://levelup.gitconnected.com/pandas-2-0-numpy-vs-arrow-back-end-b16f874d0883"> <div> <div> <h2>Pandas 2.0 — Numpy vs Arrow back-end</h2> <div><h3>A Dataframe speed test. Who wins?</h3></div> <div><p>levelup.gitconnected.com</p></div> </div> <div> <div style="background-image: url(https://miro.readmedium.com/v2/resize:fit:320/1*_X6t8oY5Gi-SZCEtdrRiGw.png)"></div> </div> </div> </a> </div></article></body>

Pandas v Psycopg

A Postgres database speed test. Who wins?

Following on from a story I wrote comparing the speed of Pandas and Polars libraries in terms of reading and writing data — from and to — a Postgres database I thought it might be interesting to do a similar comparison between Pandas and Psycopg2.

If you need to get data from or to a Postgres database table from or to a local file, read on for the winner.

You can find the Pandas v Polars article at the link below:

Pandas v Polars: A database speed test. Who wins?

We have a process that runs against our AWS RDS Postgres database. It’s based on a Python Jupyter Notebook and reads a…

levelup.gitconnected.com

Pandas

I don’t think I need to explain much about what Pandas is. Its use in Python code is ubiquitous and is one of the main tools that people use to load, explore, visualise and process large amounts of data in Python.

Psycopg

Psycopg is one of the most popular PostgreSQL database libraries for the Python programming language. It implements the Python Database API Specification v2.0, allowing Python applications to communicate with PostgreSQL databases.

Psycopg is designed for efficiency and thread safety. It provides a high-level, Pythonic interface for connecting to a PostgreSQL database, executing SQL statements, managing transactions, and fetching results, while also offering low-level access to PostgreSQL-specific features for advanced use cases.

Using Psycopg, Python applications can perform a variety of database operations. These include executing SQL queries and commands, manipulating large object storage in PostgreSQL, managing transactions, and handling notifications from the PostgreSQL database.

The library also supports a variety of PostgreSQL features, such as prepared statements, multiple cursors, asynchronous notifications, and COPY commands for bulk data transfers. Additionally, it supports advanced data types and methods provided by PostgreSQL, including geometric types, arrays, hstore, JSON, and others.

Psycopg’s connection pooling functionalities facilitate efficient use of resources in multi-threaded environments. Furthermore, it’s capable of handling many concurrent connections, making it a good fit for web application backends and data-intensive applications.

In summary, Psycopg is an essential tool for Python developers working with PostgreSQL databases. It combines the power of PostgreSQL with the ease and expressiveness of Python, making it an ideal choice for database-driven Python applications.

Pandas v Psycopg

For the purposes of this speed test our objective is to get data from a Postgres database table into a CSV file on your local PC or laptop. We will also go in the reverse direction, i.e. assume we have a local data file in CVS format that we want to populate a Postgres database table with.

All the code will be Python v3.11, running on a Jupyter Notebook. The version of Pandas we are using is 2.0.3 , using the default numpy backend for data typing and we will be using version 2.9.3 of the Psycopg library.

If you want to try out the below code snippets you will obviously need access to a Postgres database.

For Pandas, getting data from a database table to a local file system involves two steps, the first is reading the database table into a dataframe and the second is writing the dataframe to a file. Pandas doesn’t interact with databases directly. Instead, it farms out that aspect to third-party database “engines”, one of the most popular of which is SQLAlchemy.

Here is some code that does that. I used the %%time magic in Jupyter to time the operation. If you want to find out more about Jupyter magics and other useful tips on using Jupyter see my article below:

Using Jupyter Notebooks? — Be more productive with these shortcuts

I think it’s fair to say that Jupyter notebooks are fast becoming the de-facto tool of choice for many data engineers…

medium.com

Postgres -> CSV file using Pandas

The Pandas code is very straightforward. Make sure you have the following libraries pip installed first:

pip install pandas psycopg2 sqlalchemy

Substitute your own values for the database credentials etc …

from sqlalchemy import create_engine
from sqlalchemy.sql import text
import pandas as pd

query = f"""
        SELECT *
        FROM your_table
    """
# Create the SQLAlchemy engine with your postgres database connection string
engine = create_engine('postgresql://your_user:your_pass@your_hostname:your_port/your_dbname')
# Query the database and save the result in a DataFrame
with engine.connect() as conn:
    df=pd.read_sql_query(text(query), con = conn)
# Write the DataFrame to a CSV file
df.to_csv("output.csv", index=False)

I ran the above code against three tables containing increasing data volumes. Each table was comprised of 10 integer columns. Here are the results (timings in seconds):

Postgres -> CSV file using Pandas
===================================

Name     |   Rows        |  Columns    |   Disk Space taken |Time to read   | File output size |
+---------+---------------+-------------+--------------------+---------------+------------------+
| Table_1 |  22,500       |    10       |   1.5 MB           |.48            |   485 KB         |
| Table_2 |  2,880,000    |    10       |   187 MB           | 31            |    62 MB         |
| Table_3 |  11,500,000   |    10       |   1300 MB          |253            |   495 MB         |
+---------+---------------+-------------+--------------------+---------------+------------------+

Note, that if your input table is really huge you may run into memory issues. If that is the case you can try using a chunking strategy to read and write the data.

Postgres -> CSV file using Psycopg2

One of the fastest ways to get data into and out of a Postgres database to CSV files using Psycopg is to use the copy_export command. Under the hood, it uses PostgreSQL’s COPY command which is a fast way to load large datasets into and out of PostgreSQL because it avoids the overhead of executing individual SQL INSERT commands.

Here is the code I used.

import psycopg2

# Set up a connection to the PostgreSQL database
connection = psycopg2.connect(
    host="localhost",
    database="your_database",
    user="your_username",
    password="your_password"
)
# Create a cursor object
cur = connection.cursor()
# Define the COPY command
copy_cmd = "COPY your_table TO STDOUT WITH (FORMAT CSV, HEADER)"
# Open the file to write to and execute the COPY command
with open('output.csv', 'w') as f:
    cur.copy_expert(copy_cmd, f)
# Close the connection
connection.close()

I ran this against the same three test Postgres tables and the timings (in seconds) are shown below:

Postgres -> CSV file using Psycopg2
===================================

Name     |   Rows        |  Columns    |   Disk Space taken |Time to read    | File output size |
+---------+---------------+-------------+--------------------+---------------+------------------+
| Table_1 |  22,500       |    10       |   1.5 MB           |.20            |   485 KB         |
| Table_2 |  2,880,000    |    10       |   187 MB           |5              |    62 MB         |
| Table_3 |  11,500,000   |    10       |   1300 MB          |43             |   495 MB         |
+---------+---------------+-------------+--------------------+---------------+------------------+

So, to summarize the timings on getting Postgres data to a local file, the Psycopg2 method is approx 4–6 times faster on the tested data volumes.

CSV -> Postgres using Pandas

For Pandas, once again we have a two-stage process. Read the file into a dataframe, then load the dataframe to a database table. Again, the code is straightforward.

import pandas as pd
from sqlalchemy import create_engine

# Define the connection string and create engine
engine = create_engine('postgresql://your_username:your_password@localhost/your_database')
# Read the data from the file into a DataFrame
df = pd.read_csv('output.csv')
# Write the data to the database
df.to_sql('your_table', engine, if_exists='replace', index=False)

And here are the timings (in seconds). For the input files, I used the output files of the Postgres ->CSV file using Pandas operation.

CSV-> Postgres file using Pandas
=================================

Name     |   Rows        |  Columns     |   File size        |Time to write  |
+---------+---------------+-------------+--------------------+---------------+
| Table_1 |  22,500       |    10       |   485 KB           |2              |
| Table_2 |  2,880,000    |    10       |    62 MB           |215            |
| Table_3 |  11,500,000   |    10       |   495 MB           |2031           |
+---------+---------------+-------------+--------------------+---------------+

CSV -> Postgres using Psycopg2

import psycopg2

# Set up a connection to the PostgreSQL database
connection = psycopg2.connect(
    host="localhost",
    database="your_database",
    user="your_username",
    password="your_password"
)
# Create a cursor object
cur = connection.cursor()
# Define the COPY command
copy_cmd = "COPY your_table FROM STDIN WITH (FORMAT CSV, HEADER)"
# Open the file to read from and execute the COPY command
with open('input.csv', 'r') as f:
    cur.copy_expert(copy_cmd, f)
# Commit the transaction
connection.commit()
# Close the connection
connection.close()

And here are the timings (in seconds). For the input files, I used the output files of the Postgres -> CSV file using Psycopg operation.

CSV-> Postgres file using Pandas
=================================

Name     |   Rows        |  Columns     |   File size        |Time to write  |
+---------+---------------+-------------+--------------------+---------------+
| Table_1 |  22,500       |    10       |   485 KB           |0.14           |
| Table_2 |  2,880,000    |    10       |    62 MB           |5              |
| Table_3 |  11,500,000   |    10       |   495 MB           |23             |
+---------+---------------+-------------+--------------------+---------------+

Once again pyscopg2 blows Pandas out of the water, being anywhere between 14 and 100 times faster, noting that the relative speed improvement increased substantially with increasing data volumes.

That’s all for me for now. Hopefully, you found this article useful. If you did, please check out my profile page at this link. From there, you can see my other published stories and subscribe to get notified when I post new content.

If you liked this content, you might find these articles interesting too.

DuckDB

A Deep Dive into a Fast and Efficient Database Using Python

betterprogramming.pub

Pandas 2.0 — Numpy vs Arrow back-end

A Dataframe speed test. Who wins?

levelup.gitconnected.com