26 Dec, 2023 - About 3 minutes
DuckDB Mock Environment
Intro
In this article I will go through DuckDB, an in-process SQL OLAP database management system, and set up some mock data to run a few tests.
Setup
To set up the DuckDB client, install the following pip package:
pip install -U duckcli
duckcli provides autocompletion in the terminal.
NOTE: There are other client options, such as ODBC or Node.js; check the official page for those cases.
Generate Mock Data
DuckDB reads data directly from files and supports CSV, Parquet, JSON, Excel, and more.
Now let's generate some data and import it into our SQL engine.
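The steps below use jafgen, the Jaffle Shop synthetic data generator. Assuming the standard PyPI package name, it can be installed with pip:
pip install jafgen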
jafgen --years 1
This will generate six CSV files with mock data in the following directory:
jaffle-data/
Let's start the client and create a managed table for each file.
duckcli mydatabase.db
The CSVs have a header line, so let's create the tables using commands like the following:
CREATE TABLE raw_items AS SELECT * FROM read_csv_auto('jaffle-data/raw_items.csv', header = true);
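The same pattern applies to the remaining files. A minimal sketch, assuming the other CSVs follow the same raw_*.csv naming (adjust to whatever jafgen actually produced in jaffle-data/):
-- hypothetical file names; check the contents of jaffle-data/ first
CREATE TABLE raw_orders AS SELECT * FROM read_csv_auto('jaffle-data/raw_orders.csv', header = true);
CREATE TABLE raw_customers AS SELECT * FROM read_csv_auto('jaffle-data/raw_customers.csv', header = true);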
There are other options, such as defining a custom delimiter or specifying the columns explicitly. Check the official page for more details on CSV import.
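For illustration, a hedged sketch of those options using read_csv, with a hypothetical schema for one of the files:
-- explicit delimiter and column definitions (assumed column names and types)
CREATE TABLE raw_stores AS
SELECT * FROM read_csv('jaffle-data/raw_stores.csv',
    delim = ',',
    header = true,
    columns = {'id': 'VARCHAR', 'name': 'VARCHAR', 'opened_at': 'TIMESTAMP'});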
mydatabase.db> show tables;
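As a quick sanity check, we can run a simple count against one of the new tables:
SELECT count(*) FROM raw_items;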
Disclaimer
DuckDB seems to be blazing fast, and it also has the option to run in-memory. It is important, however, to identify in which use cases this backend brings benefits and in which it does not.
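As a side note on the in-memory option: assuming the official duckdb CLI rather than duckcli, launching it without a database path opens a transient in-memory database:
duckdb   # no path given, so the database lives only in memory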
When to Use It or Not
DuckDB aims to automatically achieve high performance by using well-chosen default configurations and having a forgiving architecture. Of course, there are still opportunities for tuning the system for specific workloads. The Performance Guide page contains guidelines and tips for achieving good performance when loading and processing data with DuckDB.
When to use DuckDB
- Processing and storing tabular datasets, e.g., from CSV or Parquet files
- Interactive data analysis, e.g., joining and aggregating multiple large tables (see the sketch after this list)
- Concurrent large changes to multiple large tables, e.g., appending rows, adding/removing/updating columns
- Large result set transfer to client
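A minimal sketch of such an interactive join and aggregate over the mock tables, assuming raw_items carries an order_id column referencing raw_orders.id:
-- top 10 orders by number of items (assumed column names)
SELECT o.id, count(*) AS item_count
FROM raw_orders AS o
JOIN raw_items AS i ON i.order_id = o.id
GROUP BY o.id
ORDER BY item_count DESC
LIMIT 10;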
When not to use DuckDB
- High-volume transactional use cases (e.g., tracking orders in a webshop)
- Large client/server installations for centralized enterprise data warehousing
- Writing to a single database from multiple concurrent processes
- Multiple concurrent processes reading from a single writable database
Conclusion
In this article I went through the process of setting up DuckDB in a local environment and loading some data into it. This database shows some interesting benchmark results, and I would suggest you try it out, especially if your use case doesn't involve transactional data or multiple concurrent processes reading from a single writable database; it therefore seems very interesting for staging processes, development environments, or single-threaded CDC processes.
I will certainly use this more in the future. I also have very positive feedback regarding the documentation you can find on the official website.
I haven't found direct support for Delta yet, although it supports Parquet.
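For reference, reading Parquet follows the same pattern as CSV; a minimal sketch with a hypothetical path:
SELECT * FROM read_parquet('path/to/file.parquet') LIMIT 10;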
If you want to better understand why to choose DuckDB, please check the article Why DuckDB.