avatarWei Huang

Summary

The article demonstrates how to use DuckDB to efficiently analyze a 32GB dataset for Anti-Money Laundering (AML) without the need for extensive infrastructure or complex data pipelines.

Abstract

The article, intended for personal learning and development, discusses the use of DuckDB, an in-process SQL OLAP database management system, to handle a large 32GB AML dataset. The author contrasts the limitations of traditional tools like Pandas for large datasets with DuckDB's capabilities, emphasizing DuckDB's performance, ease of use, and ability to work with data that exceeds available memory. The article covers the background of the data analysis challenge, the setup and installation process for DuckDB, and the execution of SQL queries to analyze the data. The author concludes by highlighting DuckDB's features, such as its simplicity, speed, and compatibility with various programming languages, and suggests that it is a suitable tool for data professionals dealing with medium-sized datasets.

Opinions

  • The author believes that the focus in data projects should be on the problem or solution rather than just the tools used.
  • Data modeling is considered more crucial than tooling for the success of data projects.
  • Traditional data analysis tools like Pandas are insufficient for large datasets due to memory constraints and performance issues.
  • The author suggests that the process of setting up infrastructure and navigating internal requirements can consume a significant portion of time in data analysis projects.
  • DuckDB is praised for its ability to handle large datasets efficiently, its ease of installation and use, and its compatibility with various data formats and programming languages.
  • The author is impressed with DuckDB's performance, particularly its speed in querying a 32GB dataset on modest hardware.
  • There is an appreciation for DuckDB's SQLite-like interface, which minimizes the learning curve for new users.
  • The article encourages readers to consider DuckDB for their data analysis needs, especially for medium-sized datasets that are too large for tools like Pandas but do not necessarily require big data infrastructure.

A Simple Demo To Analyze 32GB AML Data With DuckDB

Weekend build and learn

Photo by Carlos Muza on Unsplash

Please note this article is solely for personal learning and development purposes during the weekend. It is not intended to promote or compare any Big Data vendors currently on the market. :)

Background

I recently reconnected with an old friend who currently holds a high-level role in data management.

During our conversation, we covered various topics, such as recent advancements in data technology and marketplace growth. Here are a few noteworthy points we discussed:

  1. It's common for individuals and organizations to have a relatively small amount of data. You will unlikely require petabytes (PB) of data for your latest project. Instead, you have gigabytes (GB) or even megabytes (MB) of data.
  2. It's important to remember that the core basis for utilizing data is determining the problem, solution, or story we want to extract for the benefit of the business. The focus should be on more than just the tools being used.
  3. One of the most crucial steps for the success of any data project is data modeling, NOT Tooling.

That's interesting! You may have noticed that I "bolded" the NOT Tooling. The reason for this is that I believe the tooling is only meant to assist with the job at hand, which is in line with my previous writing:

Context matters.

Why Did DuckDB Catch My Attention?

As a data engineer or analyst, it is common to encounter situations like the one we will use as an example.

Your boss wants you to analyze this IBM Transactions for Anti-Money Laundering (AML) publicly available data in Kaggle(the world's largest data science community with powerful tools and resources to help you achieve your data science goals.)

And get to a summary of those data points. Sounds easy? Let's open VSCode with Jupyter Notebook and start with Pandas!

Wait. Oh, boy! This is a 8 GB zip file with a 32GB CSV unzip size.

Photo by Aziz Acharki on Unsplash

As many know, Panda is an excellent resource for managing small datasets, typically less than 2–3 GB.

However, when the dataset grows beyond this limit, it is not advisable to utilize Pandas.

Pandas loads the complete dataset into memory before processing the dataframe.

Therefore, when the dataset surpasses the memory size, it can lead to memory errors.

Pandas loads the entire data into memory before processing the dataframe. So, if the dataset size is larger than the memory, you will run into memory errors.

In summary:

  1. Pandas is single-threaded.
  2. Dataset size is restricted to the memory of the machine.
  3. Inefficient with Memory.

What should we do then?

“Let's contact the Big Data Team to have the data ingested into the Datalake, or should we utilize the Cloud to begin the analysis?”

In addition to the infrastructure requirements, it is essential to consider your company's delivery model and team structure for this simple request.

  1. In Agile, generating a JIRA story for the sprint may be necessary.
  2. If you are on the waterfall, you must request another team to complete this task.

Based on the above description, around 70% of the time seems devoted to infrastructure and internal processes/requirements.

Additionally, it's essential to remember that the cost of time, human resources, and infrastructure will continue to be a factor as time progresses.

Photo by RC.xyz NFT gallery on Unsplash

How about low-level code for the pipeline?

When analyzing data, sometimes you need a bit of Python and SQL coding. Even if you work as a data engineer, you might find yourself writing code for this task. But is this approach too overkill?

Can we do it differently, faster, and more easier?

For example, DuckDB will be the best option for the job, which is why DuckDB caught most people's attention.

Photo by Ross Sokolovski on Unsplash

So What's DuckDB

Based on the DuckDB website:

DuckDB is an in-process SQL OLAP database management system.

  1. Fast: DuckDB is designed to support OLAP (Online analytical processing). It contains a columnar-vectorized query execution engine, leading to better performance in OLAP queries.
  2. Simple: DuckDB adopts these ideas of simplicity (SQLite) and embedded operation with no external dependencies, neither for compilation nor during run-time.
  3. Feature-Rich: DuckDB provides severe data management features, such as transactional guarantees (ACID properties) and bulk-optimized Multi-Version Concurrency Control (MVCC). Data can be stored in persistent, single-file databases, etc.
  4. Free and Open: DuckDB is Open Source. The entire source code is freely available on GitHub. DuckDB is deeply integrated into Python and R for efficient interactive data analysis. DuckDB provides APIs for Java, C, C++, and others.

Ok, it's a lot of lectures and sounds promising. But is it true? Let's find out by using it for our use case above.

Installation

Install the DuckDB is simple, as promised, just like SQLite installation.

## brew install
brew install duckdb

## or Direct download
wget https://github.com/duckdb/duckdb/releases/download/v0.8.1/duckdb_cli-osx-universal.zip

Dowload the data and prepare the folder structure

mkdir data
cd data
wget https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml/download?datasetVersionNumber=7
unzip archive

cd ..
mkdir database

Duckdb in action

Before we start, let's look at our current machine, the spec as follows:

Hardware Overview:
  Model Name: MacBook Pro
      Model Identifier: MacBookPro16,1
      Processor Name: 6-Core Intel Core i7
      Processor Speed: 2.6 GHz
      Number of Processors: 1
      Total Number of Cores: 6
      L2 Cache (per Core): 256 KB
      L3 Cache: 12 MB
      Memory: 16 GB

The data volume is two times larger than our memory, and we want to avoid it every time we open the DuckDB to reload the data to process; we will need to persist data in the disk.

❯ duckdb 'db/ibm_aml.db'
v0.8.1 6536a77232
Enter ".help" for usage hints.
D PRAGMA temp_directory='data/tmp.tmp';
D CREATE TABLE demo AS SELECT * FROM read_csv_auto("data/archive/*.csv");

The total load time is around seven mins for 430 million records with 32 GB of data.

Sanity tests the table and data.

D SHOW TABLES;
┌──────────────────┐
│       name       │
│     varchar      │
├──────────────────┤
│ demo             │
└──────────────────┘
D DESCRIBE demo;
┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│    column_name     │ column_type │  null   │   key   │ default │ extra │
│      varcharvarcharvarcharvarcharvarchar │ int32 │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ TimestampVARCHAR     │ YES     │         │         │       │
│ from_bank          │ VARCHAR     │ YES     │         │         │       │
│ Account            │ VARCHAR     │ YES     │         │         │       │
│ to_bank            │ VARCHAR     │ YES     │         │         │       │
│ Account_1          │ VARCHAR     │ YES     │         │         │       │
│ amount_received    │ DOUBLE      │ YES     │         │         │       │
│ receiving_currency │ VARCHAR     │ YES     │         │         │       │
│ amount_paid        │ DOUBLE      │ YES     │         │         │       │
│ payment_currency   │ VARCHAR     │ YES     │         │         │       │
│ payment_format     │ VARCHAR     │ YES     │         │         │       │
│ is_laundering      │ BIGINT      │ YES     │         │         │       │
├────────────────────┴─────────────┴─────────┴─────────┴─────────┴───────┤
│ 11 rows                                                      6 columns │
└────────────────────────────────────────────────────────────────────────┘
D SELECT * FROM demo limit 1;
┌──────────────────┬───────────┬───────────┬─────────┬───────────┬─────────────────┬────────────────────┬─────────────┬──────────────────┬────────────────┬───────────────┐
│    Timestamp     │ from_bank │  Account  │ to_bank │ Account_1 │ amount_received │ receiving_currency │ amount_paid │ payment_currency │ payment_format │ is_laundering │
│     varcharvarcharvarcharvarcharvarchardoublevarchardoublevarcharvarchar     │     int64     │
├──────────────────┼───────────┼───────────┼─────────┼───────────┼─────────────────┼────────────────────┼─────────────┼──────────────────┼────────────────┼───────────────┤
│ 2022/09/01 00:15020800104D70 │ 020800104D70 │         8095.07 │ US Dollar          │     8095.07 │ US Dollar        │ Reinvestment   │             0 │
└──────────────────┴───────────┴───────────┴─────────┴───────────┴─────────────────┴────────────────────┴─────────────┴──────────────────┴────────────────┴───────────────┘

Jupyter Notebook and Panda dataframe

Let's open our VS Code and start with some analysis work on the data.

For demo propers, we will stop here. You can get the code here:

Final Thoughts

What we have learned so far:

  1. Starting to use DuckDB is easy.
  2. It's a fantastic new tool for those involved in Data Analysis, Data Science, and Data Engineering.
  3. Users can load data to memory or disk, a great feature.
  4. The API for variable programming languages (Python, Java, Rust, R, C, C++, etc.) enables support for JSON, CSV, and Parquet files, including the capability to load S3 Parquet. This provides additional capabilities for building a Data Pipeline, particularly useful in challenging scenarios.
  5. The performance is impressive, with fast query execution for up to 32 GB of data. However, additional testing may be required, such as testing on smaller hardware machines. (next weekend, build and learn, maybe?)
  6. The SQLite-like interface is another excellent design feature, and there is almost NO learning curve to get started.

For your next data analysis requirements, give DuckDB a try.

I will also continue testing and monitoring these tools for suitability in the "weekend build and learn" use case.

Thank you for taking the time to read my content. Please follow me on Medium to stay updated on my future articles.

Duckdb
Olap
Sql
Dbms
Recommended from ReadMedium