A Simple Demo To Analyze 32GB AML Data With DuckDB
Weekend build and learn
Please note this article is solely for personal learning and development purposes during the weekend. It is not intended to promote or compare any Big Data vendors currently on the market. :)
Background
I recently reconnected with an old friend who currently holds a high-level role in data management.
During our conversation, we covered various topics, such as recent advancements in data technology and marketplace growth. Here are a few noteworthy points we discussed:
- It's common for individuals and organizations to have a relatively small amount of data. You will unlikely require petabytes (PB) of data for your latest project. Instead, you have gigabytes (GB) or even megabytes (MB) of data.
- It's important to remember that the core basis for utilizing data is determining the problem, solution, or story we want to extract for the benefit of the business. The focus should be on more than just the tools being used.
- One of the most crucial steps for the success of any data project is data modeling, NOT Tooling.
That's interesting! You may have noticed that I "bolded" the NOT Tooling. The reason for this is that I believe the tooling is only meant to assist with the job at hand, which is in line with my previous writing:
Context matters.
Why Did DuckDB Catch My Attention?
As a data engineer or analyst, it is common to encounter situations like the one we will use as an example.
Your boss wants you to analyze this IBM Transactions for Anti-Money Laundering (AML) publicly available data in Kaggle(the world's largest data science community with powerful tools and resources to help you achieve your data science goals.)
And get to a summary of those data points. Sounds easy? Let's open VSCode with Jupyter Notebook and start with Pandas!
Wait. Oh, boy! This is a 8 GB zip file with a 32GB CSV unzip size.
As many know, Panda is an excellent resource for managing small datasets, typically less than 2–3 GB.
However, when the dataset grows beyond this limit, it is not advisable to utilize Pandas.
Pandas loads the complete dataset into memory before processing the dataframe.
Therefore, when the dataset surpasses the memory size, it can lead to memory errors.
Pandas loads the entire data into memory before processing the dataframe. So, if the dataset size is larger than the memory, you will run into memory errors.
In summary:
- Pandas is single-threaded.
- Dataset size is restricted to the memory of the machine.
- Inefficient with Memory.
What should we do then?
“Let's contact the Big Data Team to have the data ingested into the Datalake, or should we utilize the Cloud to begin the analysis?”
In addition to the infrastructure requirements, it is essential to consider your company's delivery model and team structure for this simple request.
- In Agile, generating a JIRA story for the sprint may be necessary.
- If you are on the waterfall, you must request another team to complete this task.
Based on the above description, around 70% of the time seems devoted to infrastructure and internal processes/requirements.
Additionally, it's essential to remember that the cost of time, human resources, and infrastructure will continue to be a factor as time progresses.
How about low-level code for the pipeline?
When analyzing data, sometimes you need a bit of Python and SQL coding. Even if you work as a data engineer, you might find yourself writing code for this task. But is this approach too overkill?
Can we do it differently, faster, and more easier?
For example, DuckDB will be the best option for the job, which is why DuckDB caught most people's attention.
So What's DuckDB
Based on the DuckDB website:
DuckDB is an in-process SQL OLAP database management system.
- Fast: DuckDB is designed to support OLAP (Online analytical processing). It contains a columnar-vectorized query execution engine, leading to better performance in OLAP queries.
- Simple: DuckDB adopts these ideas of simplicity (SQLite) and embedded operation with no external dependencies, neither for compilation nor during run-time.
- Feature-Rich: DuckDB provides severe data management features, such as transactional guarantees (ACID properties) and bulk-optimized Multi-Version Concurrency Control (MVCC). Data can be stored in persistent, single-file databases, etc.
- Free and Open: DuckDB is Open Source. The entire source code is freely available on GitHub. DuckDB is deeply integrated into Python and R for efficient interactive data analysis. DuckDB provides APIs for Java, C, C++, and others.
Ok, it's a lot of lectures and sounds promising. But is it true? Let's find out by using it for our use case above.
Installation
Install the DuckDB is simple, as promised, just like SQLite installation.
## brew install
brew install duckdb
## or Direct download
wget https://github.com/duckdb/duckdb/releases/download/v0.8.1/duckdb_cli-osx-universal.zipDowload the data and prepare the folder structure
mkdir data
cd data
wget https://www.kaggle.com/datasets/ealtman2019/ibm-transactions-for-anti-money-laundering-aml/download?datasetVersionNumber=7
unzip archive
cd ..
mkdir databaseDuckdb in action
Before we start, let's look at our current machine, the spec as follows:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: MacBookPro16,1
Processor Name: 6-Core Intel Core i7
Processor Speed: 2.6 GHz
Number of Processors: 1
Total Number of Cores: 6
L2 Cache (per Core): 256 KB
L3 Cache: 12 MB
Memory: 16 GBThe data volume is two times larger than our memory, and we want to avoid it every time we open the DuckDB to reload the data to process; we will need to persist data in the disk.
❯ duckdb 'db/ibm_aml.db'
v0.8.1 6536a77232
Enter ".help" for usage hints.
D PRAGMA temp_directory='data/tmp.tmp';
D CREATE TABLE demo AS SELECT * FROM read_csv_auto("data/archive/*.csv");The total load time is around seven mins for 430 million records with 32 GB of data.
Sanity tests the table and data.
D SHOW TABLES;
┌──────────────────┐
│ name │
│ varchar │
├──────────────────┤
│ demo │
└──────────────────┘
D DESCRIBE demo;
┌────────────────────┬─────────────┬─────────┬─────────┬─────────┬───────┐
│ column_name │ column_type │ null │ key │ default │ extra │
│ varchar │ varchar │ varchar │ varchar │ varchar │ int32 │
├────────────────────┼─────────────┼─────────┼─────────┼─────────┼───────┤
│ Timestamp │ VARCHAR │ YES │ │ │ │
│ from_bank │ VARCHAR │ YES │ │ │ │
│ Account │ VARCHAR │ YES │ │ │ │
│ to_bank │ VARCHAR │ YES │ │ │ │
│ Account_1 │ VARCHAR │ YES │ │ │ │
│ amount_received │ DOUBLE │ YES │ │ │ │
│ receiving_currency │ VARCHAR │ YES │ │ │ │
│ amount_paid │ DOUBLE │ YES │ │ │ │
│ payment_currency │ VARCHAR │ YES │ │ │ │
│ payment_format │ VARCHAR │ YES │ │ │ │
│ is_laundering │ BIGINT │ YES │ │ │ │
├────────────────────┴─────────────┴─────────┴─────────┴─────────┴───────┤
│ 11 rows 6 columns │
└────────────────────────────────────────────────────────────────────────┘
D SELECT * FROM demo limit 1;
┌──────────────────┬───────────┬───────────┬─────────┬───────────┬─────────────────┬────────────────────┬─────────────┬──────────────────┬────────────────┬───────────────┐
│ Timestamp │ from_bank │ Account │ to_bank │ Account_1 │ amount_received │ receiving_currency │ amount_paid │ payment_currency │ payment_format │ is_laundering │
│ varchar │ varchar │ varchar │ varchar │ varchar │ double │ varchar │ double │ varchar │ varchar │ int64 │
├──────────────────┼───────────┼───────────┼─────────┼───────────┼─────────────────┼────────────────────┼─────────────┼──────────────────┼────────────────┼───────────────┤
│ 2022/09/01 00:15 │ 020 │ 800104D70 │ 020 │ 800104D70 │ 8095.07 │ US Dollar │ 8095.07 │ US Dollar │ Reinvestment │ 0 │
└──────────────────┴───────────┴───────────┴─────────┴───────────┴─────────────────┴────────────────────┴─────────────┴──────────────────┴────────────────┴───────────────┘Jupyter Notebook and Panda dataframe
Let's open our VS Code and start with some analysis work on the data.






