Say Goodbye to Slow Analytic Queries — How to Use DuckDB in Python for Lightning-Fast Results!

OLTP (Online Transaction Processing) databases are designed to handle high volumes of transactions with low latency. However, when it comes to performing complex analytical queries, OLTP databases can be slow and inefficient.
Now DuckDB is here for the rescue. It is an embedded, in-process, column-oriented SQL OLAP database management system that is designed to be used in data analysis.
In this post, we will focus on how to use DuckDB in Python.
Installation
First, you need to install the DuckDB Python package using pip. Open your terminal and run the following command:
pip install duckdb
A Simple Example
Once you have installed the package, you can use DuckDB in your Python code. Here’s an example of how to create a DuckDB database and insert some data into it:
import duckdb
# create a new DuckDB database
con = duckdb.connect(database=':memory:')
# create a table
con.execute('CREATE TABLE users (id INTEGER, name VARCHAR(255))')
# insert some data
con.execute("INSERT INTO users VALUES (1, 'John')")
con.execute("INSERT INTO users VALUES (2, 'Jane')")
# query the data
result = con.execute("SELECT * FROM users").fetchall()
for row in result:
print(row)In this example,
- we create a new DuckDB database using the
connectfunction. - We then create a table called
users - We insert some data into it.
- Finally, we query the data using a
SELECTstatement and print the results.
A Complex Example
Now we have a basic understanding of how to use it, it’s time to tackle a more complex example.
Suppose your company is using AWS CloudFront CDN service, and the billing shows there is a significant increase in usage in some locations.
You need to identify which location and URI are being visited the most during a certain period.
Here is a sample log file made to the CloudFront distribution, with fields including request date and time, edge location, IP address, HTTP method, and various technical details such as user agent, SSL protocol, and response size.
#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
2023-04-30 00:51:11 DUB2-C1 21354 89.199.19.161 GET fafadfafaf.cloudfront.net /dc/58/1f/122a25fee0c11ssf25785c64fb077e7d9a8b2_scaled_v1_400.jpg 200 - AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Version/4.0 - - Hit aujfSSSuSbssoCBE04RqrZuGyJp-CTVFXaJhxxx6wi0R0lbxxxMLaWIwoUg== s3.xxx.fm https 211 0.001 - TLSv1.3 TLS_AES_128_GCM_SHA256 Hit HTTP/1.1 - - 42178 0.001 Hit image/jpeg 20889 - -
2023-04-30 00:51:08 DEN52-C1 9969 216.21.213.132 GET fafadfafaf.cloudfront.net /ce/3e/72/5913fsf0fada069f0etyl1308e097b6cba91a_scaled_v1_200.jpg 200 - AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Version/4.0 - - Hit LZF3BTTHSEYoU-5MsuQEw3J68rqVTw6VCYXLnCo9GgWNGyeKzxxxGTc3q2Sw== s3.xxx.fm https 211 0.016 - TLSv1.3 TLS_AES_128_GCM_SHA256 Hit HTTP/1.1 - - 54418 0.016 Hit image/jpeg 9466 - -
2023-04-30 00:51:24 LAX50-P2 41914 104.175.83.75 GET fafadfafaf.cloudfront.net /02/d7/e9/58d13fafe3a45396d108l1c04ee8899454cfe2_scaled_v1_400.jpg 200 - AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Version/4.0 - - Hit ikHTlI68yTTfI4nT81bofYNhkJP4W1ITfdNmxxx6nj121oBRvS4X2xxxREZbpvww== s3.xxx.fm https 211 0.000 - TLSv1.3 TLS_AES_128_GCM_SHA256 Hit HTTP/1.1 - - 37360 0.000 Hit image/jpeg 41447 - -How to use DuckDB to do data analysis?
Step 1: Configure DuckDB
import duckdb
conn = duckdb.connect(database="cloudfront-log.db", read_only=False)
conn.execute('''
INSTALL 'httpfs';
LOAD 'httpfs';
INSTALL json;
LOAD json;
SET s3_region='ap-northeast-1';
SET s3_access_key_id='YOUR_S3_ACCES_KEY';
SET s3_secret_access_key='YOUR_SECRET_ACCESS_KEY';
SET enable_progress_bar = true;
''')Step 2: Create a Table
conn.execute('''
CREATE TABLE logs (
date STRING,
time STRING,
x_edge_location STRING,
sc_bytes STRING,
c_ip STRING,
cs_method STRING,
cs_host STRING,
cs_uri_stem STRING,
sc_status STRING,
cs_Referer STRING,
cs_User_Agent STRING,
cs_uri_query STRING,
cs_Cookie STRING,
x_edge_result_type STRING,
x_edge_request_id STRING,
x_host_header STRING,
cs_protocol STRING,
cs_bytes STRING,
time_taken STRING,
x_forwarded_for STRING,
ssl_protocol STRING,
ssl_cipher STRING,
x_edge_response_result_type STRING,
cs_protocol_version STRING,
fle_status STRING,
fle_encrypted_fields STRING,
c_port STRING,
time_to_first_byte STRING,
x_edge_detailed_result_type STRING,
sc_content_type STRING,
sc_content_len STRING,
sc_range_start STRING,
sc_range_end STRING)
''')Step 2: Importing Data
If the logs are stored in many small files, we can use a regular expression to import the data into the table.
file_regs = [
'/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-24*.gz',
'/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-25*.gz',
'/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-26*.gz',
'/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-27*.gz',
'/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-28*.gz',
'/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-29*.gz',
'/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-30*.gz',
]
for file_reg in file_regs:
print("file_reg: ", file_reg)
sql = f"INSERT INTO logs SELECT * FROM read_csv('{file_reg}', skip=2, sep='\t', ignore_errors=true, auto_detect=TRUE)"
conn.execute(sql)Or if the log is a big file, we can use the following COPY command.
COPY logs FROM '{single_file}' (FORMAT csv, DELIMITER '\t', SKIP 2, AUTO_DETECT true, IGNORE_ERRORS true);Step 3: Check the Data
To check the data, we can fetch some data to have a quick look.
# fetch some data
df = conn.execute(f"SELECT * FROM logs limit 10").fetchdf()
print(df.head(10))Step 4: Find the solution
Now, we can solve our problem by executing the following query.
sql = f"SELECT \
x_edge_location, \
cs_uri_stem, \
count(*) as cnt \
FROM logs \
WHERE x_edge_location ='DUB56-P1' \
GROUP BY 1, 2 \
ORDER BY cnt DESC"
df = conn.execute(sql).fetchdf()DuckDB is a highly effective analytical SQL tool that you should consider using in your next data project. With its impressive capabilities, you can expect to achieve great results and gain critical insights. Don’t hesitate to give it a try and see the difference it can make!






