Summary

The article provides a comprehensive guide on using DuckDB with Python for efficient data analysis, particularly for analyzing AWS CloudFront logs.

Abstract

The article introduces DuckDB, an in-process, column-oriented SQL OLAP database system, and explains its advantages for executing complex analytical queries over traditional OLTP databases. It details the installation process of DuckDB's Python package and demonstrates how to create a DuckDB database, insert data, and perform queries using Python. The article also presents a complex example of analyzing AWS CloudFront CDN service logs to identify the most visited locations and URIs within a specific period. The steps include configuring DuckDB to work with large datasets, creating tables, importing data using either regular expressions for multiple files or a COPY command for a single large file, checking the imported data, and executing analytical queries to extract valuable insights. The author emphasizes DuckDB's effectiveness in data analytics and encourages its use in data projects for significant performance improvements.

Opinions

The author conveys that OLTP databases are not well-suited for complex analytical queries, implying a need for specialized tools like DuckDB for such tasks.
DuckDB is presented as a solution for data analysts seeking efficient and in-memory database management systems for analytical SQL operations.
The author suggests that using DuckDB can result in "lightning-fast" query performance.
The article suggests that DuckDB, with its ability to integrate with Python and handle data from various sources, can be a valuable asset in data analysis workflows.
By illustrating a real-world scenario involving AWS CloudFront logs, the author offers an opinion that DuckDB is versatile enough to handle complex data analysis tasks at scale.

Say Goodbye to Slow Analytic Queries — How to Use DuckDB in Python for Lightning-Fast Results!

OLTP (Online Transaction Processing) databases are designed to handle high volumes of transactions with low latency. However, when it comes to performing complex analytical queries, OLTP databases can be slow and inefficient.

Now DuckDB is here for the rescue. It is an embedded, in-process, column-oriented SQL OLAP database management system that is designed to be used in data analysis.

In this post, we will focus on how to use DuckDB in Python.

Installation

First, you need to install the DuckDB Python package using pip. Open your terminal and run the following command:

pip install duckdb

A Simple Example

Once you have installed the package, you can use DuckDB in your Python code. Here’s an example of how to create a DuckDB database and insert some data into it:

import duckdb

# create a new DuckDB database
con = duckdb.connect(database=':memory:')

# create a table
con.execute('CREATE TABLE users (id INTEGER, name VARCHAR(255))')

# insert some data
con.execute("INSERT INTO users VALUES (1, 'John')")
con.execute("INSERT INTO users VALUES (2, 'Jane')")

# query the data
result = con.execute("SELECT * FROM users").fetchall()
for row in result:
    print(row)

In this example,

we create a new DuckDB database using the connect function.
We then create a table called users
We insert some data into it.
Finally, we query the data using a SELECT statement and print the results.

A Complex Example

Now we have a basic understanding of how to use it, it’s time to tackle a more complex example.

Suppose your company is using AWS CloudFront CDN service, and the billing shows there is a significant increase in usage in some locations.

You need to identify which location and URI are being visited the most during a certain period.

Here is a sample log file made to the CloudFront distribution, with fields including request date and time, edge location, IP address, HTTP method， and various technical details such as user agent, SSL protocol, and response size.

#Fields: date time x-edge-location sc-bytes c-ip cs-method cs(Host) cs-uri-stem sc-status cs(Referer) cs(User-Agent) cs-uri-query cs(Cookie) x-edge-result-type x-edge-request-id x-host-header cs-protocol cs-bytes time-taken x-forwarded-for ssl-protocol ssl-cipher x-edge-response-result-type cs-protocol-version fle-status fle-encrypted-fields c-port time-to-first-byte x-edge-detailed-result-type sc-content-type sc-content-len sc-range-start sc-range-end
2023-04-30 00:51:11 DUB2-C1 21354 89.199.19.161 GET fafadfafaf.cloudfront.net /dc/58/1f/122a25fee0c11ssf25785c64fb077e7d9a8b2_scaled_v1_400.jpg 200 - AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Version/4.0 - - Hit aujfSSSuSbssoCBE04RqrZuGyJp-CTVFXaJhxxx6wi0R0lbxxxMLaWIwoUg== s3.xxx.fm https 211 0.001 - TLSv1.3 TLS_AES_128_GCM_SHA256 Hit HTTP/1.1 - - 42178 0.001 Hit image/jpeg 20889 - -
2023-04-30 00:51:08 DEN52-C1 9969 216.21.213.132 GET fafadfafaf.cloudfront.net /ce/3e/72/5913fsf0fada069f0etyl1308e097b6cba91a_scaled_v1_200.jpg 200 - AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Version/4.0 - - Hit LZF3BTTHSEYoU-5MsuQEw3J68rqVTw6VCYXLnCo9GgWNGyeKzxxxGTc3q2Sw== s3.xxx.fm https 211 0.016 - TLSv1.3 TLS_AES_128_GCM_SHA256 Hit HTTP/1.1 - - 54418 0.016 Hit image/jpeg 9466 - -
2023-04-30 00:51:24 LAX50-P2 41914 104.175.83.75 GET fafadfafaf.cloudfront.net /02/d7/e9/58d13fafe3a45396d108l1c04ee8899454cfe2_scaled_v1_400.jpg 200 - AppleWebKit/537.36%20(KHTML,%20like%20Gecko)%20Version/4.0 - - Hit ikHTlI68yTTfI4nT81bofYNhkJP4W1ITfdNmxxx6nj121oBRvS4X2xxxREZbpvww== s3.xxx.fm https 211 0.000 - TLSv1.3 TLS_AES_128_GCM_SHA256 Hit HTTP/1.1 - - 37360 0.000 Hit image/jpeg 41447 - -

How to use DuckDB to do data analysis?

Step 1: Configure DuckDB

import duckdb

conn = duckdb.connect(database="cloudfront-log.db", read_only=False)
conn.execute('''
                INSTALL 'httpfs'; 
                LOAD 'httpfs';
                
                INSTALL json;
                LOAD json;

                SET s3_region='ap-northeast-1';
                SET s3_access_key_id='YOUR_S3_ACCES_KEY';
                SET s3_secret_access_key='YOUR_SECRET_ACCESS_KEY';

                SET enable_progress_bar = true;
            ''')

Step 2: Create a Table

conn.execute('''
   CREATE TABLE logs (
             date STRING,
             time STRING,
             x_edge_location STRING,
             sc_bytes STRING,
             c_ip STRING,
             cs_method STRING,
             cs_host STRING,
             cs_uri_stem STRING,
             sc_status STRING,
             cs_Referer  STRING,
             cs_User_Agent STRING,
             cs_uri_query STRING,
             cs_Cookie STRING,
             x_edge_result_type STRING,
             x_edge_request_id STRING,
             x_host_header STRING,
             cs_protocol STRING,
             cs_bytes STRING,
             time_taken STRING,
             x_forwarded_for STRING,
             ssl_protocol STRING,
             ssl_cipher STRING,
             x_edge_response_result_type STRING,
             cs_protocol_version STRING,
             fle_status STRING,
             fle_encrypted_fields STRING,
             c_port STRING,
             time_to_first_byte STRING,
             x_edge_detailed_result_type STRING,
             sc_content_type STRING,
             sc_content_len STRING,
             sc_range_start STRING,
             sc_range_end STRING)
     ''')

Step 2: Importing Data

If the logs are stored in many small files, we can use a regular expression to import the data into the table.

file_regs = [
    '/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-24*.gz',
    '/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-25*.gz',
    '/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-26*.gz',
    '/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-27*.gz',
    '/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-28*.gz',
    '/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-29*.gz',
    '/Users/jerry.an/cloudfront/E1JPIDV7XCO3TK.2023-04-30*.gz',
]


for file_reg in file_regs:
    print("file_reg: ", file_reg)
    sql = f"INSERT INTO logs SELECT * FROM read_csv('{file_reg}', skip=2, sep='\t', ignore_errors=true, auto_detect=TRUE)"
    conn.execute(sql)

Or if the log is a big file, we can use the following COPY command.

COPY logs FROM '{single_file}' (FORMAT csv, DELIMITER '\t', SKIP 2, AUTO_DETECT true, IGNORE_ERRORS true);

Step 3: Check the Data

To check the data, we can fetch some data to have a quick look.

# fetch some data
df = conn.execute(f"SELECT * FROM logs limit 10").fetchdf()
print(df.head(10))

Step 4: Find the solution

Now, we can solve our problem by executing the following query.

sql = f"SELECT \
            x_edge_location, \
            cs_uri_stem, \
            count(*) as cnt \
        FROM logs \
        WHERE x_edge_location ='DUB56-P1' \
        GROUP BY 1, 2 \
        ORDER BY cnt DESC"

df = conn.execute(sql).fetchdf()

DuckDB is a highly effective analytical SQL tool that you should consider using in your next data project. With its impressive capabilities, you can expect to achieve great results and gain critical insights. Don’t hesitate to give it a try and see the difference it can make!