28 Jan, 2024 - About 8 minutes
ClickHouse
Intro
ClickHouse is a true column-oriented DBMS. Data is stored by columns, and during query execution it is processed by arrays (vectors, or chunks of columns). Whenever possible, operations are dispatched on arrays rather than on individual values. This is called “vectorized query execution”, and it helps lower the cost of actual data processing.
Architecture
ClickHouse was initially built as a prototype to do just a single task well: to filter and aggregate data as fast as possible. That’s what needs to be done to build a typical analytical report, and that’s what a typical GROUP BY query does. The ClickHouse team has made several high-level decisions that, when combined, made achieving this task possible:
- Column-oriented storage: Source data often contain hundreds or even thousands of columns, while a report can use just a few of them. The system needs to avoid reading unnecessary columns to avoid expensive disk read operations.
- Indexes: Memory-resident ClickHouse data structures allow the reading of only the necessary columns, and only the necessary row ranges of those columns.
- Data compression: Storing different values of the same column together often leads to better compression ratios (compared to row-oriented systems) because in real data a column often has the same, or not so many different, values for neighboring rows. In addition to general-purpose compression, ClickHouse supports specialized codecs that can make data even more compact.
- Vectorized query execution: ClickHouse not only stores data in columns but also processes data in columns. This leads to better CPU cache utilization and allows for SIMD CPU instruction usage.
- Scalability: ClickHouse can leverage all available CPU cores and disks to execute even a single query — not only on a single server, but across all CPU cores and disks of a cluster as well.
Attention to Low-Level Details
But many other database management systems use similar techniques. What really makes ClickHouse stand out is attention to low-level details. Most programming languages provide implementations for most common algorithms and data structures, but they tend to be too generic to be effective.
Setup
To install, run the following script:

```shell
curl https://clickhouse.com/ | sh
```
You can start the server with the following command:

```shell
./clickhouse server
```
In a different terminal, let's start a client with:

```shell
./clickhouse client
```
Tests
Let's start by creating a table from sample data:

```sql
CREATE TABLE my_first_table
```
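The statement above got truncated; here is a minimal sketch of what the full DDL could look like, assuming the columns used in the INSERT further below (the exact column types are assumptions, not the guide's literal schema):

```sql
-- Sketch of the full DDL; column types are assumptions
CREATE TABLE my_first_table
(
    user_id UInt32,
    message String,
    timestamp DateTime,
    metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp)
```

The PRIMARY KEY here defines the sort order of the data on disk, which is what MergeTree uses to skip row ranges at query time.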
The statement uses traditional SQL DDL, with one extension: the ENGINE clause specifying the execution engine. The MergeTree engine provides improved performance for managed tables, but there are also engines to integrate with external systems such as BigQuery, S3, Kafka, PostgreSQL, …
Insert some data:

```sql
INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
```
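The VALUES list was cut off; a plausible completion, with purely illustrative rows (`now()` and `yesterday()` are built-in ClickHouse functions):

```sql
-- Hypothetical sample rows; the values are illustrative only
INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
    (101, 'Hello, ClickHouse!',              now(),       -1.0),
    (102, 'Inserts happen in batches',       yesterday(),  1.41421),
    (101, 'Granules are the smallest chunks', now() - 5,    3.14159)
```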
Query
```sql
SELECT *
```
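The query was also truncated; an assumed continuation that simply reads everything back, ordered by the sort key:

```sql
-- Assumed completion of the truncated query
SELECT *
FROM my_first_table
ORDER BY timestamp
```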
Now let's create a table on top of external data in S3, and a materialized table using the MergeTree engine fed from it.
Create data from S3 sources
First, create our managed table:

```sql
CREATE TABLE trips
```
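The DDL above was truncated; a heavily abbreviated, hypothetical version of the NYC taxi schema (the real dataset has many more columns — only `pickup_ntaname`, queried later, is taken from the article itself):

```sql
-- Abbreviated, hypothetical schema; the real table has many more columns
CREATE TABLE trips
(
    trip_id          UInt32,
    pickup_datetime  DateTime,
    dropoff_datetime DateTime,
    total_amount     Float32,
    pickup_ntaname   LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY pickup_datetime
```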
Now let's create our raw table. Take note of the ENGINE being used:

```sql
CREATE TABLE trips_raw
```
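A sketch of how the raw table could be declared with the S3 table engine — the bucket URL is a placeholder and the column list mirrors the abbreviated schema assumed above; the S3 engine takes a path (glob patterns allowed) and a format name:

```sql
-- The S3 table engine; URL, glob and format are placeholders
CREATE TABLE trips_raw
(
    trip_id          UInt32,
    pickup_datetime  DateTime,
    dropoff_datetime DateTime,
    total_amount     Float32,
    pickup_ntaname   LowCardinality(String)
)
ENGINE = S3('https://BUCKET.s3.amazonaws.com/nyc-taxi/trips_*.gz', 'TabSeparatedWithNames')
```

No data is copied at CREATE time — the files are read from S3 on every query against `trips_raw`.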
Let's select from our raw table:

```sql
SELECT DISTINCT(pickup_ntaname)
```
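A plausible completion of the truncated query, listing a handful of distinct neighborhood names straight from S3:

```sql
-- Assumed completion; LIMIT keeps the result small
SELECT DISTINCT(pickup_ntaname)
FROM trips_raw
LIMIT 10
```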
The S3 table engine supports parallel reads. Writes are only supported if the table definition does not contain glob patterns.

You will need to configure access credentials in config.xml if you are using private data or need to write data. You can also define individual configuration files in the conf.d directory, like the following example:
```xml
<clickhouse>
```
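The snippet was cut off after the root element; a sketch of an S3 credentials fragment for conf.d, with placeholder endpoint name, URL and keys:

```xml
<!-- Placeholder endpoint name, URL and credentials -->
<clickhouse>
    <s3>
        <my_endpoint>
            <endpoint>https://BUCKET.s3.amazonaws.com/</endpoint>
            <access_key_id>ACCESS_KEY_ID</access_key_id>
            <secret_access_key>SECRET_ACCESS_KEY</secret_access_key>
        </my_endpoint>
    </s3>
</clickhouse>
```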
But we are not going to do this, and will just load the public data locally.

Let's load data from our raw table into the materialized one:

```sql
INSERT INTO trips SELECT * FROM trips_raw
```
This will take some time; you can check the progress bar on the console:

```
0 rows in set. Elapsed: 109.103 sec. Processed 10.00 million rows, 815.59 MB (91.66 thousand rows/s., 7.48 MB/s.)
```
Let's validate that we have the data:

```sql
buldozer :) SELECT COUNT(1) FROM trips
```
Let’s execute a DISTINCT query to force a heavier execution plan.
```sql
SELECT DISTINCT(*) FROM trips
```
This execution was done on a single server with an AMD processor:

```
10000840 rows in set. Elapsed: 18.121 sec. Processed 10.00 million rows, 3.07 GB (551.89 thousand rows/s., 169.19 MB/s.)
```
Still, it scanned 10M records in 18 seconds — not bad :D
ClickHouse Local
ClickHouse Local is very useful if you want to query data directly on top of files without having to install a server.
This option is also very interesting to incorporate into a deployment pipeline.
Query CSV
```shell
./clickhouse local -q "SELECT * FROM 'review_?.csv'"
```
Query Parquet
```shell
./clickhouse local -q "
```
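The command was truncated after the opening quote; a sketch of what a multi-line Parquet query could look like (the file glob is a hypothetical name — `clickhouse local` infers the Parquet format from the extension):

```shell
# Hypothetical file glob; the query spans multiple lines inside the quotes
./clickhouse local -q "
    SELECT count(*)
    FROM 'trips_*.parquet'
"
```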
Integrations
ClickHouse has many integrations available:
- Core integrations: built or maintained by ClickHouse, they are supported by ClickHouse and live in the ClickHouse GitHub organization
- Partner integrations: built or maintained, and supported by, third-party software vendors
- Community integrations: built or maintained and supported by community members. No direct support is available besides the public GitHub repositories and community Slack channels
The available documentation also seems well prepared.
Production
This article only scratches the surface of the options available to set up ClickHouse; one should read the Scale Out section to understand how best to deploy it in production.
Conclusion
In this article we followed the quick-start guide to set up a ClickHouse server, loaded a public dataset with 10M records from S3, and performed a SELECT DISTINCT query on that table.
ClickHouse presents itself as a very interesting OLAP solution. If you are considering a solution for analytical reporting, this is something to have on your radar.
I also liked the fact that you can start small, with your service fully deployed and maintained by yourself, while keeping the capability to scale horizontally or move to a cloud offering, which grants the support characteristics that are so often missing in open-source software.
The fact that there is a huge list of adopters, and a company providing support and defining a roadmap for the product, also brings reassurance when adopting it.
The documentation provides several integration patterns that are worth checking.
This is definitely one that I will keep an eye on.
Well done ClickHouse ;)