avatarLynn Kwong

Free AI web copilot to create summaries, insights and extended knowledge, download it at here

5678

Abstract

p><p id="4bd3">Before we can write this data frame to the database, we need to create an SQLAlchemy connection to interact with our database:</p><div id="1fd0"><pre>from sqlalchemy import create_engine

db_url = <span class="hljs-string">"mysql+pymysql://root:root@localhost:13306/data"</span>

engine = create_engine( db_url, pool_size=5, pool_recycle=3600, isolation_level=<span class="hljs-string">"AUTOCOMMIT"</span> ) con = engine.connect()</pre></div><p id="90a3">Then we can write the data frame created above to our MySQL database using the connection created. However, there are several pitfalls that should be avoided.</p><h2 id="ea2b">Pitfalls when writing to SQL using pandas</h2><p id="1b1f">Let’s explore different arguments for the <code>pandas.to_sql()</code> function and check out the pitfalls that should be avoided in practice.</p><p id="99ee">Let’s first run it with no additional arguments:</p><div id="f14d"><pre>df_to_sql.to_sql(<span class="hljs-string">"student_scores"</span>, con=con)</pre></div><p id="77d4">When this command is run, we will see the error:</p><div id="2af5"><pre><span class="hljs-symbol">ValueError:</span> Table <span class="hljs-comment">'student_scores' already exists.</span></pre></div><p id="766a">This happens because <code>pandas.to_sql()</code> by default will raise a <code>ValueError</code> if the table already exists. To append data to an existing table, we can specify the <code>if_exists</code> parameter to be <code>append:</code></p><div id="0665"><pre>df_to_sql.to_sql(<span class="hljs-string">"student_scores"</span>, con=con, if_exists=<span class="hljs-string">"append"</span>)</pre></div><p id="f8bd">However, when this command is run, we will see a new error:</p><div id="7914"><pre><span class="hljs-symbol">OperationalError:</span> (pymysql.err.OperationalError) (<span class="hljs-number">1054</span>, <span class="hljs-string">"Unknown column 'index' in 'field list'"</span>)</pre></div><p id="5615">This is because the index of the data frame will be written to the database by default. Since we don’t have an <code>index</code> column in our table, an error will be raised.</p><p id="1f3f">We set the <code>index</code> parameter to be <code>False</code> to skip the data frame index when writing it to the database:</p><div id="a389"><pre>df_to_sql.to_sql(<span class="hljs-string">"student_scores"</span>, con=con, if_exists=<span class="hljs-string">"append"</span>, index=<span class="hljs-literal">False</span>)</pre></div><p id="5fce">This time everything works properly. Now if you check the table in the database, you will see six rows inserted.</p><p id="d1c1">Note that if the <code>isolation_level</code> is not set to be <code>AUTOCOMMIT</code> when creating the SQLAlchemy engine, you would need to run <code>con.commit()</code> explicitly to persist the changes to the database.</p><p id="d3f1">Now let’s see what will happen when we run the above command again:</p><div id="ee9e"><pre>df_to_sql.to_sql(<span class="hljs-string">"student_scores"</span>, con=con, if_exists=<span class="hljs-string">"append"</span>, index=<span class="hljs-literal">False</span>)</pre></div><p id="9d93">This time a new error is raised about duplicate keys:</p><div id="1acc"><pre><span class="hljs-symbol">IntegrityError:</span> (pymysql.err.IntegrityError) (<span class="hljs-number">1062</span>, <span class="hljs-string">"Duplicate entry '1-Literature' for key 'student_scores.PRIMARY'"</span>)</pre></div><p id="5534">The reason is that the above command will append the records to the table, and since they are already in the database, they cannot be inserted again due to the limitation of the primary key.</p><p id="1995">Unfortunately, there is no such <code>ON DUPLICATE KEY UPDATE</code> or <code>upsert</code> functionality of <code>pandas.to_sql()</code>. We can, however, specify <code>if_exists</code> to <code>replace</code> to <i>drop the table</i> before inserting new values. Note that it’s “drop the table” but not “truncate the table” which is pretty buggy in my opinion.</p><p id="14ff">When the table is dropped and recreated, the indexes will be dropped and the data types will be assigned automatically. Let’s see how it works:</p><div id="6d44"><pre>df_to_sql.to_sql(<span class="hljs-string">"student_scores"</span>, con=con, if_exists=<span class="hljs-string">"replace"</span>, index=<span class="hljs-literal">False</span>)</pre></div><p id="0c40">This time the command works. Let’s check the data definition language (DDL) of the <code>student_scores</code> table as demonstrated in <a href="https://towardsdatascience.com/learn-common-database-managing-commands-as-a-data-engineer-4d199cfb15ae">this post</a>:</p><div id="bce6"><pre><span class="hljs-keyword">SHOW</span> <span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> data.student_scores;

<span class="hljs-keyword">CREATE</span> <span class="hljs-keyword">TABLE</span> student_scores ( student_id <span class="hljs-type">bigint</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">NULL</span>, subject text, score <span class="hljs-type">bigint</span> <span class="hljs-keyword">DEFAULT</span> <span class="hljs-keyword">NULL</span> ) ENGINE<span class="hljs-operator">=</span>InnoDB <span class="hljs-keyword">DEFAULT</span> CHARSET<span class="hljs-operator">=</span>utf8mb4 <span class="hljs-keyword">COLLATE</span><span class="hljs-operator">=</span>utf8mb4_0900_ai_ci</pre></div><p id="b932">As we can see, the indexes are dropped, including the primary key, and the data types are assigned automatically, which is not desired in most

Options

practical cases.</p><p id="6a86">Therefore it’s better to truncate the existing table before inserting new records or use <code>ON DUPLICATE KEY UPDATE</code> or <code>upsert</code> to update existing ones, which is a bit out of the scope of this post and will be covered in future ones.</p><h2 id="2b72">Read data from SQL with pandas</h2><p id="1dad">While it’s not so common to write data to SQL with <i>pandas</i>, it’s way more common to read data from SQL using <i>pandas</i>, and it’s also much simpler.</p><p id="193c">We can use <code>pandas.read_sql()</code> to read data from a SQL database. <code>pandas.read_sql()</code> is a wrapper function for <code>pandas.read_sql_table()</code> and <code>pandas.read_sql_query()</code>.</p><p id="e443">In most cases, you can just use <code>pandas.read_sql()</code>. However, when you need to specify the schema when reading by the table name, you would need to use <code>pandas.read_sql_table()</code>.</p><p id="02d7">Let’s first try to read by table name:</p><div id="6c7f"><pre>df_from_sql = pd.read_sql(<span class="hljs-string">"data.student_scores"</span>, con=con)</pre></div><p id="082e">This will raise an error:</p><div id="385b"><pre><span class="hljs-symbol">ObjectNotExecutableError:</span> <span class="hljs-built_in">Not</span> an executable <span class="hljs-type">object</span>: <span class="hljs-comment">'data.student_scores'</span></pre></div><p id="ac4d">Because we cannot specify a schema with <code>pandas.read_sql()</code>. We need to specify a schema with <code>pandas.read_sql_table()</code>:</p><div id="27f3"><pre>df_from_sql = pd.read_sql_table(<span class="hljs-string">"student_scores"</span>, con, schema=<span class="hljs-string">"data"</span>)</pre></div><p id="7075">And this command will work.</p><p id="53db">We can specify which columns to read when reading by table name:</p><div id="a3f9"><pre>df_from_sql = pd.read_sql_table( <span class="hljs-string">"student_scores"</span>, con, schema=<span class="hljs-string">"data"</span>, columns=[<span class="hljs-string">"student_id"</span>, <span class="hljs-string">"subject"</span>, <span class="hljs-string">"score"</span>] )</pre></div><p id="bcc1">Besides reading by the table name, we can also read by plain SQL query. This time you may be surprised that it may not work as you expected:</p><div id="526e"><pre>df_from_sql = pd.read_sql(<span class="hljs-string">"SELECT * FROM data.student_scores"</span>, con=con) df_from_sql = pd.read_sql_query(<span class="hljs-string">"SELECT * FROM data.student_scores"</span>, con=con)</pre></div><p id="c38e">Both these two commands will fail with this error:</p><div id="3998"><pre><span class="hljs-symbol">ObjectNotExecutableError:</span> <span class="hljs-built_in">Not</span> an executable <span class="hljs-type">object</span>: <span class="hljs-comment">'SELECT * FROM data.student_scores'</span></pre></div><p id="3ea3">This is because, with newer versions of SQLAlchemy, we cannot pass a plain SQL query as a string anymore, but instead need to construct a TextClause:</p><div id="210e"><pre><span class="hljs-keyword">from</span> sqlalchemy import <span class="hljs-keyword">text</span>

df_from_sql = pd.read_sql(<span class="hljs-keyword">text</span>(<span class="hljs-string">"SELECT * FROM data.student_scores"</span>), con=con) df_from_sql = pd.read_sql_query(<span class="hljs-keyword">text</span>(<span class="hljs-string">"SELECT * FROM data.student_scores"</span>), con=con)</pre></div><p id="7557">Now both commands will work and we can use <code>pandas.read_sql()</code> for simplicity.</p><p id="98d0">In this post, we have covered how to write data to and read data from a SQL database using <i>pandas</i>. A MySQL database is used because it’s a commonly used one in practice. However, the Python code for <i>pandas</i> will work for all SQL databases with minor adjustments because SQLAlchemy, a universal and versatile Python SQL toolkit, is used for database connection.</p><p id="a4e6">We have covered the fundamentals in this post which should be enough if you work with plain SQL queries. A more advanced post will come soon and will cover how to work with SQLAlchemy <a href="https://docs.sqlalchemy.org/en/20/core/metadata.html#sqlalchemy.schema.Table"><code>Ta</code>ble</a> and <a href="https://docs.sqlalchemy.org/en/20/orm/mapping_styles.html"><code></code>ORM</a> models in <i>pandas</i>.</p><h2 id="bab7">Related posts:</h2><ul><li><a href="https://betterprogramming.pub/how-to-execute-plain-sql-queries-with-sqlalchemy-627a3741fdb1">How to Execute Plain SQL Queries With SQLAlchemy in Python</a></li><li><a href="https://towardsdatascience.com/learn-common-database-managing-commands-as-a-data-engineer-4d199cfb15ae">Important MySQL Data Definition Language (DDL) commands we should know for managing our tables</a></li></ul><h1 id="c8e7">Level Up Coding</h1><p id="dfce">Thanks for being a part of our community! Before you go:</p><ul><li>👏 Clap for the story and follow the author 👉</li><li>📰 View more content in the <a href="https://levelup.gitconnected.com/?utm_source=pub&amp;utm_medium=post">Level Up Coding publication</a></li><li>💰 Free coding interview course ⇒ <a href="https://skilled.dev/?utm_source=luc&amp;utm_medium=article">View Course</a></li><li>🔔 Follow us: <a href="https://twitter.com/gitconnected">Twitter</a> | <a href="https://www.linkedin.com/company/gitconnected">LinkedIn</a> | <a href="https://newsletter.levelup.dev">Newsletter</a></li></ul><p id="0eb2">🚀👉 <a href="https://jobs.levelup.dev/talent/welcome?referral=true"><b>Join the Level Up talent collective and find an amazing job</b></a></p></article></body>

How to Work with SQL Databases Efficiently in Pandas

Learn an efficient way to work with SQL data using pandas

Image by GraphicMama-team (Panda Character) in Pixabay

A major benefit of working with SQL data in pandas is that we can manipulate a large amount of data much more efficiently. If you have been writing aggregation queries using GROUP BY in SQL, or worse, looping through database records one by one in Python, you should try to use pandas to perform these tasks and it can be magnitudes more efficient.

In this post, we will introduce how to write data to and read data from a SQL database using pandas. Importantly, some pitfalls are demonstrated with simple code and potential errors, which can help us avoid these pitfalls and work more efficiently in practice.

Preparation

In this post, we will spin up a local MySQL server with Docker. All SQL databases will work similarly.

# Create a volume to persist the data.
$ docker volume create mysql8-data

# Create the container for MySQL.
$ docker run --name mysql8 -d -e MYSQL_ROOT_PASSWORD=root -p 13306:3306 -v mysql8-data:/var/lib/mysql mysql:8

# Connect to the local MySQL server in Docker.
$ docker exec -it mysql8 mysql -u root -proot

mysql> SELECT VERSION();
+-----------+
| VERSION() |
+-----------+
| 8.0.31    |
+-----------+
1 row in set (0.00 sec)

Then we will create a virtual environment and install all the libraries needed, which include pandas, SQLAlchemy, and pymysql:

conda create --name pandas-sql python=3.11
conda activate pandas-sql

pip install -U "pandas>=1.5.0,<1.6.0"
pip install -U "SQLAlchemy>=2.0.0,<2.1.0"
pip install -U "pymysql>=1.0.0,<1.1.0"
pip install -U "cryptography>=40.0.0,<40.1.0"

Note that version ranges are specified for the libraries so maintenance upgrades can be picked up automatically when you run these commands in the future.

Especially, SQLAlchemy will be used to interact with our MySQL database (with cryptography as the cryptographic package for authentication) and pymysql is a driver specifically for MySQL.

Create the database and table

When working with a SQL database, it’s better to create the database and table in advance so we can better define the data types and indexes for the efficiency of storage and reading.

CREATE DATABASE `data`;

CREATE TABLE `data`.`student_scores` (
  `student_id` smallint NOT NULL,
  `subject` varchar(50) NOT NULL,
  `score` tinyint DEFAULT '0',
  PRIMARY KEY (`student_id`,`subject`),
  KEY `ix_subject` (`subject`),
  KEY `ix_score` (`score`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci
;

We can create tables with pandas on the fly, but the indexes will not be created as we will see soon.

Write to SQL with pandas

Let’s first check out how to write data to a SQL table using pandas.

We need to create a data frame that can be mapped as the rows to be written:

import pandas as pd

df_to_sql = pd.DataFrame(
    columns=["student_id", "subject", "score"],
    data=[
        [1, "Literature", 90],
        [1, "Math", 60],
        [2, "Literature", 80],
        [2, "Math", 80],
        [3, "Literature", 70],
        [3, "Math", 95],
    ],
)

With columns the columns or fields of the data frame are specified, and the data is specified as a list of lists, each of which corresponds to a row to be written.

Before we can write this data frame to the database, we need to create an SQLAlchemy connection to interact with our database:

from sqlalchemy import create_engine

db_url = "mysql+pymysql://root:root@localhost:13306/data"

engine = create_engine(
    db_url, pool_size=5, pool_recycle=3600, isolation_level="AUTOCOMMIT"
)
con = engine.connect()

Then we can write the data frame created above to our MySQL database using the connection created. However, there are several pitfalls that should be avoided.

Pitfalls when writing to SQL using pandas

Let’s explore different arguments for the pandas.to_sql() function and check out the pitfalls that should be avoided in practice.

Let’s first run it with no additional arguments:

df_to_sql.to_sql("student_scores", con=con)

When this command is run, we will see the error:

ValueError: Table 'student_scores' already exists.

This happens because pandas.to_sql() by default will raise a ValueError if the table already exists. To append data to an existing table, we can specify the if_exists parameter to be append:

df_to_sql.to_sql("student_scores", con=con, if_exists="append")

However, when this command is run, we will see a new error:

OperationalError: (pymysql.err.OperationalError) (1054, "Unknown column 'index' in 'field list'")

This is because the index of the data frame will be written to the database by default. Since we don’t have an index column in our table, an error will be raised.

We set the index parameter to be False to skip the data frame index when writing it to the database:

df_to_sql.to_sql("student_scores", con=con, if_exists="append", index=False)

This time everything works properly. Now if you check the table in the database, you will see six rows inserted.

Note that if the isolation_level is not set to be AUTOCOMMIT when creating the SQLAlchemy engine, you would need to run con.commit() explicitly to persist the changes to the database.

Now let’s see what will happen when we run the above command again:

df_to_sql.to_sql("student_scores", con=con, if_exists="append", index=False)

This time a new error is raised about duplicate keys:

IntegrityError: (pymysql.err.IntegrityError) (1062, "Duplicate entry '1-Literature' for key 'student_scores.PRIMARY'")

The reason is that the above command will append the records to the table, and since they are already in the database, they cannot be inserted again due to the limitation of the primary key.

Unfortunately, there is no such ON DUPLICATE KEY UPDATE or upsert functionality of pandas.to_sql(). We can, however, specify if_exists to replace to drop the table before inserting new values. Note that it’s “drop the table” but not “truncate the table” which is pretty buggy in my opinion.

When the table is dropped and recreated, the indexes will be dropped and the data types will be assigned automatically. Let’s see how it works:

df_to_sql.to_sql("student_scores", con=con, if_exists="replace", index=False)

This time the command works. Let’s check the data definition language (DDL) of the student_scores table as demonstrated in this post:

SHOW CREATE TABLE `data`.`student_scores`;

CREATE TABLE `student_scores` (
  `student_id` bigint DEFAULT NULL,
  `subject` text,
  `score` bigint DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci

As we can see, the indexes are dropped, including the primary key, and the data types are assigned automatically, which is not desired in most practical cases.

Therefore it’s better to truncate the existing table before inserting new records or use ON DUPLICATE KEY UPDATE or upsert to update existing ones, which is a bit out of the scope of this post and will be covered in future ones.

Read data from SQL with pandas

While it’s not so common to write data to SQL with pandas, it’s way more common to read data from SQL using pandas, and it’s also much simpler.

We can use pandas.read_sql() to read data from a SQL database. pandas.read_sql() is a wrapper function for pandas.read_sql_table() and pandas.read_sql_query().

In most cases, you can just use pandas.read_sql(). However, when you need to specify the schema when reading by the table name, you would need to use pandas.read_sql_table().

Let’s first try to read by table name:

df_from_sql = pd.read_sql("data.student_scores", con=con)

This will raise an error:

ObjectNotExecutableError: Not an executable object: 'data.student_scores'

Because we cannot specify a schema with pandas.read_sql(). We need to specify a schema with pandas.read_sql_table():

df_from_sql = pd.read_sql_table("student_scores", con, schema="data")

And this command will work.

We can specify which columns to read when reading by table name:

df_from_sql = pd.read_sql_table(
    "student_scores", con, schema="data", columns=["student_id", "subject", "score"]
)

Besides reading by the table name, we can also read by plain SQL query. This time you may be surprised that it may not work as you expected:

df_from_sql = pd.read_sql("SELECT * FROM data.student_scores", con=con)
df_from_sql = pd.read_sql_query("SELECT * FROM data.student_scores", con=con)

Both these two commands will fail with this error:

ObjectNotExecutableError: Not an executable object: 'SELECT * FROM data.student_scores'

This is because, with newer versions of SQLAlchemy, we cannot pass a plain SQL query as a string anymore, but instead need to construct a TextClause:

from sqlalchemy import text

df_from_sql = pd.read_sql(text("SELECT * FROM data.student_scores"), con=con)
df_from_sql = pd.read_sql_query(text("SELECT * FROM data.student_scores"), con=con)

Now both commands will work and we can use pandas.read_sql() for simplicity.

In this post, we have covered how to write data to and read data from a SQL database using pandas. A MySQL database is used because it’s a commonly used one in practice. However, the Python code for pandas will work for all SQL databases with minor adjustments because SQLAlchemy, a universal and versatile Python SQL toolkit, is used for database connection.

We have covered the fundamentals in this post which should be enough if you work with plain SQL queries. A more advanced post will come soon and will cover how to work with SQLAlchemy Table and ORM models in pandas.

Related posts:

Level Up Coding

Thanks for being a part of our community! Before you go:

🚀👉 Join the Level Up talent collective and find an amazing job

Pandas
Python
Sql
Database
Pitfalls
Recommended from ReadMedium