Strapi and React TODO Application

Intro

In this article I will build a Todo app with Strapi as the backend component and React as the frontend. The guide was originally written by Chigozie Oduah; check the reference links, as he has some very interesting articles about Strapi.

What is Strapi

Strapi is an open-source, Node.js-based headless CMS: you model collections through an admin UI and Strapi automatically exposes them through REST APIs, which is exactly what we need for the backend of this Todo app.

Setup backend with Strapi

I will be using bun to set up packages because of its improved performance; check out their page if you want to know more.

Let’s start by creating our backend with the command

bunx create-strapi-app todo-list --quickstart

This creates a new folder called todo-list and starts the Strapi development server.

You should now access http://localhost:1337/admin in the browser and create your admin account so that we can start creating a new collection.

If you need to restart the development environment, enter the todo-list folder and run:

bun run develop

Building the Backend

Now, for our Todo application, let's create a collection.

  1. Navigate to Content-Type Builder
  2. Select Create new collection type
  3. Call it Todo

Strapi uses this name to reference this collection within our application. Strapi automatically uses the display name to fill the rest of the text boxes.

Create the following fields:

  • item : Type ( Text )

Then hit Save. Since our application is a simple Todo list, that single field will do the job.

Add test entries

After the collection is created, we add some test entries.

  1. Go to the Content Manager
  2. Select the Todo collection and choose Create new entry
  3. After filling in the item field, hit Save and then Publish

Repeat the previous steps to add more entries.

Create API endpoints for our collection

We now create API endpoints for the Todo collection. These endpoints allow the frontend to interact with our collection.

  1. Navigate to Settings
  2. Click on Roles under Users & Permissions Plugin.
  3. Click on Public to open the permissions given to the public role.
  4. Toggle the Todo dropdown under Permissions and check Select all to allow public access to our collection without authentication.
  5. Hit Save

After performing these steps you should be able to access the API, and the following endpoints should now be working:

  • Find (GET /api/todos): returns all the items in our Todo collection.
  • Create (POST /api/todos): creates a new item in our Todo collection.
  • Find one (GET /api/todos/:id): returns a single item from our Todo collection.
  • Update (PUT /api/todos/:id): updates an item in our Todo collection.
  • Delete (DELETE /api/todos/:id): deletes an item from our Todo collection.

Great, that was easy. Now let's set up our frontend React application to interact with these API endpoints.

Setup frontend React App

Now let's create the frontend application. In the parent folder, run the following command:

bunx create-react-app todo-frontend

Next create the following two files for the environment variables:

  • .env.development
REACT_APP_BACKEND=http://localhost:1337/
  • .env.production
REACT_APP_BACKEND=/

You can run the frontend application with the following command

bun run start

Then open http://localhost:3000 in the browser, which will show the default React application.

Let's replace the contents of src/App.js with the following:

import { useState, useEffect } from 'react';
import TodoItem from './TodoItem';
import './App.css';

function App() {
  const [todos, setTodos] = useState([]);
  const [newTodo, setNewTodo] = useState("");

  useEffect(() => {
    // update the list of todos
    // when the component is rendered for the first time
    update();
  }, []);

  // This function updates the component with the
  // current todo data stored in the server
  function update() {
    fetch(`${process.env.REACT_APP_BACKEND}api/todos`)
      .then(res => res.json())
      .then(todo => {
        setTodos(todo.data);
      })
  }

  // This function sends a new todo to the server
  // and then calls the update method to refresh the
  // component
  function addTodo(e) {
    e.preventDefault();
    let item = newTodo;
    let body = {
      data: {
        item
      }
    };

    fetch(`${process.env.REACT_APP_BACKEND}api/todos`, {
      method: "POST",
      headers: {
        'Content-type': 'application/json'
      },
      body: JSON.stringify(body)
    })
      .then(() => {
        setNewTodo("");
        update();
      })
  }

  return (
    <div className="app">
      <main>
        {/* we centered the "main" tag in our style sheet */}

        {/* This form collects the item we want to add to our todo list, and sends it to the server */}
        <form className="form" onSubmit={addTodo}>
          <input type="text" className="todo_input" placeholder="Enter new todo" value={newTodo} onChange={e => setNewTodo(e.currentTarget.value)} />
          <button type="submit" className="todo_button">Add todo</button>
        </form>

        {/* This is a list view of all the todos in the "todos" state variable */}
        <div>
          {
            todos.map((todo, i) => {
              return <TodoItem todo={todo} key={i} update={update} />
            })
          }
        </div>

      </main>
    </div>
  )
}

export default App;

Create the file src/TodoItem.jsx with the following content:

import { useState } from "react";
import './App.css';

function TodoItem({ todo, update }) {

  // Our component uses the "edit" state
  // variable to switch between editing
  // and viewing the todo item
  const [edit, setEdit] = useState(false);
  const [newTodo, setNewTodo] = useState("");

  // This function changes the to-do that
  // is rendered in this component.
  // This function is called when the
  // form to change a todo is submitted
  function changeTodo(e) {
    e.preventDefault();
    let item = newTodo;
    let pos = todo.id;
    let body = {
      data: {
        item
      }
    };

    fetch(`${process.env.REACT_APP_BACKEND}api/todos/${pos}`, {
      method: "PUT",
      headers: {
        'Content-type': 'application/json'
      },
      body: JSON.stringify(body)
    })
      .then(() => {
        setEdit(false);
        update();
      })
  }

  // This function deletes the to-do that
  // is rendered in this component.
  // This function is called when the
  // button to delete a todo is clicked
  function deleteTodo(e) {
    e.preventDefault();
    let pos = todo.id;

    fetch(`${process.env.REACT_APP_BACKEND}api/todos/${pos}`, {
      method: "DELETE"
    })
      .then(() => {
        update();
      })
  }

  return <div className="todo">
    {/*
      The block below toggles between two components
      depending on the current value of the "edit"
      state variable
    */}
    { !edit
      ? <div className="name">{todo.attributes.item}</div>
      : <form onSubmit={changeTodo}>
          <input className="todo_input" type="text" placeholder="Enter new todo" value={newTodo} onChange={e => setNewTodo(e.currentTarget.value)} />
          <button className="todo_button" type="submit">Change todo</button>
        </form>
    }
    <div>
      <button className="delete" onClick={deleteTodo}>delete</button>
      <button className="edit" onClick={() => {
        // this button toggles the "edit" state variable
        setEdit(!edit)

        // we add this snippet below to make sure that our "input"
        // for editing starts with the current value of the item when
        // it is toggled. This allows anyone using it to see the current
        // value in the element, so they don't have to write it again
        setNewTodo(todo.attributes.item)
      }}>edit</button>
    </div>
  </div>
}

export default TodoItem;

Also replace src/App.css with the following content:

.app {
  display: flex;
  justify-content: center;
  text-align: center;
}

.todo_input {
  height: 16px;
  padding: 10px;
  border-top-left-radius: 8px;
  border-bottom-left-radius: 8px;
  border: 2px solid blueviolet;
}

.todo_button {
  border: 2px solid blueviolet;
  background-color: transparent;
  height: 40px;
  border-top-right-radius: 8px;
  border-bottom-right-radius: 8px;
}

.todo {
  display: flex;
  justify-content: space-between;
  margin-top: 5px;
  font-weight: 700;
  margin-bottom: 5px;
  min-width: 340px;
}

.edit {
  width: 66px;
  font-weight: 700;
  background: blueviolet;
  border: none;
  border-top-right-radius: 5px;
  height: 33px;
  border-bottom-right-radius: 5px;
  color: white;
  font-size: medium;
}

.delete {
  width: 66px;
  font-weight: 700;
  background: white;
  border: 2px solid blueviolet;
  border-top-left-radius: 5px;
  height: 33px;
  color: blueviolet;
  border-bottom-left-radius: 5px;
  font-size: medium;
}

.form {
  padding-top: 27px;
  padding-bottom: 27px;
}

.name {
  max-width: 190.34px;
  text-align: left;
}

After this last update you should have a working todo app at http://localhost:3000/.

Deployment

I’ve seen several articles where developers bundle the frontend application into Strapi's public folder to keep a single server installation, but according to Strapi this is not a good practice.

Conclusion

In this article we set up Strapi as the backend for a Todo list application, along with a React frontend that takes advantage of the provided APIs in a headless architecture.

Strapi allows you to quickly set up APIs for collections that can be defined and managed through the provided UI. This is very useful if you would like to decouple the development process, or if you don't want to implement backend functionality from scratch.

The level of customization available would require more extensive exploration. The back office allows you to create auth tokens, webhooks, SSO, and internationalization, and also has a marketplace area to add more functionality.

It is also worth mentioning that you can leverage Strapi Cloud to deploy your production applications.

References

dbt-ClickHouse

Integrating dbt and ClickHouse

In this article we will follow the integration steps to use dbt and ClickHouse with sample IMDB data.

Configure ClickHouse sources

Set up ClickHouse; check my previous article if you would like more information on this product.

Then connect with a client and run the following DDL scripts

CREATE DATABASE imdb;

CREATE TABLE imdb.actors
(
id UInt32,
first_name String,
last_name String,
gender FixedString(1)
) ENGINE = MergeTree ORDER BY (id, first_name, last_name, gender);

CREATE TABLE imdb.directors
(
id UInt32,
first_name String,
last_name String
) ENGINE = MergeTree ORDER BY (id, first_name, last_name);

CREATE TABLE imdb.genres
(
movie_id UInt32,
genre String
) ENGINE = MergeTree ORDER BY (movie_id, genre);

CREATE TABLE imdb.movie_directors
(
director_id UInt32,
movie_id UInt64
) ENGINE = MergeTree ORDER BY (director_id, movie_id);

CREATE TABLE imdb.movies
(
id UInt32,
name String,
year UInt32,
rank Float32 DEFAULT 0
) ENGINE = MergeTree ORDER BY (id, name, year);

CREATE TABLE imdb.roles
(
actor_id UInt32,
movie_id UInt32,
role String,
created_at DateTime DEFAULT now()
) ENGINE = MergeTree ORDER BY (actor_id, movie_id);

After creating the source tables, let's fill them with data hosted on S3 by running the following commands.

INSERT INTO imdb.actors
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_actors.tsv.gz',
'TSVWithNames');

INSERT INTO imdb.directors
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_directors.tsv.gz',
'TSVWithNames');

INSERT INTO imdb.genres
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies_genres.tsv.gz',
'TSVWithNames');

INSERT INTO imdb.movie_directors
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies_directors.tsv.gz',
'TSVWithNames');

INSERT INTO imdb.movies
SELECT *
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_movies.tsv.gz',
'TSVWithNames');

INSERT INTO imdb.roles(actor_id, movie_id, role)
SELECT actor_id, movie_id, role
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/imdb/imdb_ijs_roles.tsv.gz',
'TSVWithNames');
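
After the inserts finish, it can be useful to sanity-check that the source tables were populated. A minimal query along these lines is enough (the exact row counts will depend on the dataset snapshot):

SELECT 'actors' AS table_name, count() AS row_count FROM imdb.actors
UNION ALL
SELECT 'movies', count() FROM imdb.movies
UNION ALL
SELECT 'roles', count() FROM imdb.roles;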

Setup dbt

Start by setting up the dbt environment:

pip install dbt-core
pip install dbt-clickhouse

Initialize the dbt project:

dbt init imdb

Update the dbt_project.yml file and make sure the actors model configuration is added:

models:
  imdb:
    # Config indicated by + applies to all files under models/actors/
    actors:
      +materialized: view

Create the file models/actors/schema.yml with the following content:

version: 2

sources:
  - name: imdb
    tables:
      - name: directors
      - name: actors
      - name: roles
      - name: movies
      - name: genres
      - name: movie_directors

Create the file models/actors/actor_summary.sql with the following content:

{{ config(order_by='(updated_at, id, name)', engine='MergeTree()', materialized='table') }}

with actor_summary as (
    SELECT id,
           any(actor_name) as name,
           uniqExact(movie_id) as num_movies,
           avg(rank) as avg_rank,
           uniqExact(genre) as genres,
           uniqExact(director_name) as directors,
           max(created_at) as updated_at
    FROM (
        SELECT {{ source('imdb', 'actors') }}.id as id,
               concat({{ source('imdb', 'actors') }}.first_name, ' ', {{ source('imdb', 'actors') }}.last_name) as actor_name,
               {{ source('imdb', 'movies') }}.id as movie_id,
               {{ source('imdb', 'movies') }}.rank as rank,
               genre,
               concat({{ source('imdb', 'directors') }}.first_name, ' ', {{ source('imdb', 'directors') }}.last_name) as director_name,
               created_at
        FROM {{ source('imdb', 'actors') }}
        JOIN {{ source('imdb', 'roles') }} ON {{ source('imdb', 'roles') }}.actor_id = {{ source('imdb', 'actors') }}.id
        LEFT OUTER JOIN {{ source('imdb', 'movies') }} ON {{ source('imdb', 'movies') }}.id = {{ source('imdb', 'roles') }}.movie_id
        LEFT OUTER JOIN {{ source('imdb', 'genres') }} ON {{ source('imdb', 'genres') }}.movie_id = {{ source('imdb', 'movies') }}.id
        LEFT OUTER JOIN {{ source('imdb', 'movie_directors') }} ON {{ source('imdb', 'movie_directors') }}.movie_id = {{ source('imdb', 'movies') }}.id
        LEFT OUTER JOIN {{ source('imdb', 'directors') }} ON {{ source('imdb', 'directors') }}.id = {{ source('imdb', 'movie_directors') }}.director_id
    )
    GROUP BY id
)

select *
from actor_summary
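
To give an idea of how this model would be consumed downstream, here is a minimal sketch of a second model (a hypothetical models/actors/top_actors.sql, not part of the original guide) that references it with dbt's ref() function instead of source():

{{ config(materialized='view') }}

-- top actors by average movie rank, built on top of actor_summary
select name,
       num_movies,
       avg_rank
from {{ ref('actor_summary') }}
where num_movies > 5
order by avg_rank desc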

Configure the ClickHouse connection in the file ~/.dbt/profiles.yml:

imdb:
  target: dev
  outputs:
    dev:
      type: clickhouse
      schema: imdb_dbt
      host: localhost
      port: 8123
      user: default
      password: ''
      secure: False

After these updates, run the dbt debug command to make sure the connection is working properly:

dbt debug
00:31:58 Running with dbt=1.7.6
00:31:58 dbt version: 1.7.6
00:31:58 python version: 3.11.6
00:31:58 python path: /home/rramos/Development/local/dbt/bin/python
00:31:58 os info: Linux-6.6.10-zen1-1-zen-x86_64-with-glibc2.38
00:31:58 Using profiles dir at /home/rramos/.dbt
00:31:58 Using profiles.yml file at /home/rramos/.dbt/profiles.yml
00:31:58 Using dbt_project.yml file at /home/rramos/Development/local/dbt/imdb/dbt_project.yml
00:31:58 adapter type: clickhouse
00:31:58 adapter version: 1.7.1
00:31:58 Configuration:
00:31:58 profiles.yml file [OK found and valid]
00:31:58 dbt_project.yml file [OK found and valid]
00:31:58 Required dependencies:
00:31:58 - git [OK found]
...
00:31:58 Registered adapter: clickhouse=1.7.1
00:31:58 Connection test: [OK connection ok]

If the connection test passes, you just need to build the model with dbt:

dbt run

You should see output similar to the following:

dbt run
00:38:13 Running with dbt=1.7.6
00:38:13 Registered adapter: clickhouse=1.7.1
00:38:13 Unable to do partial parsing because a project config has changed
00:38:15 Found 1 model, 6 sources, 0 exposures, 0 metrics, 421 macros, 0 groups, 0 semantic models
00:38:15
00:38:15 Concurrency: 1 threads (target='dev')
00:38:15
00:38:15 1 of 1 START sql view model `imdb`.`actor_summary` ............................. [RUN]
00:38:15 1 of 1 OK created sql view model `imdb`.`actor_summary` ........................ [OK in 0.17s]
00:38:15
00:38:15 Finished running 1 view model in 0 hours 0 minutes and 0.27 seconds (0.27s).
00:38:15
00:38:15 Completed successfully
00:38:15
00:38:15 Done. PASS=1 WARN=0 ERROR=0 SKIP=0 TOTAL=1

Test querying the model:

SELECT *
FROM imdb_dbt.actor_summary
WHERE num_movies > 5
ORDER BY avg_rank DESC

Conclusion

In this article I went through the process of setting up a ClickHouse database and configuring dbt to build models on top of IMDB test data for actors, directors, movies, etc.

These two systems work like a charm together.
ClickHouse shows great performance for analytical queries, and dbt compiles and runs your analytics code against your data platform, enabling you and your team to collaborate on a single source of truth for metrics, insights, and business definitions.

I would like to extend this exercise by incorporating GitHub Actions that run dbt tests before promoting to production.
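
As a rough sketch of what such a check could look like: a singular dbt test is just a SQL file under tests/ that returns the rows that should not exist (the file name below, tests/assert_actor_summary_id_unique.sql, is hypothetical):

-- a dbt singular test: any returned row makes the test fail
select id,
       count() as occurrences
from {{ ref('actor_summary') }}
group by id
having count() > 1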

References

ClickHouse

Intro

ClickHouse is a true column-oriented DBMS. Data is stored by columns, and during query execution data is processed in arrays (vectors or chunks of columns). Whenever possible, operations are dispatched on arrays rather than on individual values. This is called “vectorized query execution”, and it helps lower the cost of actual data processing.

Architecture

ClickHouse was initially built as a prototype to do just a single task well: to filter and aggregate data as fast as possible. That’s what needs to be done to build a typical analytical report, and that’s what a typical GROUP BY query does. The ClickHouse team has made several high-level decisions that, when combined, made achieving this task possible:

Column-oriented storage: Source data often contain hundreds or even thousands of columns, while a report can use just a few of them. The system needs to avoid reading unnecessary columns to avoid expensive disk read operations.

Indexes: Memory resident ClickHouse data structures allow the reading of only the necessary columns, and only the necessary row ranges of those columns.

Data compression: Storing different values of the same column together often leads to better compression ratios (compared to row-oriented systems) because in real data a column often has the same, or not so many different, values for neighboring rows. In addition to general-purpose compression, ClickHouse supports specialized codecs that can make data even more compact.

Vectorized query execution: ClickHouse not only stores data in columns but also processes data in columns. This leads to better CPU cache utilization and allows for SIMD CPU instructions usage.

Scalability: ClickHouse can leverage all available CPU cores and disks to execute even a single query. Not only on a single server but all CPU cores and disks of a cluster as well.
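
As a rough illustration of the kind of workload these decisions target, consider a hypothetical wide events table with many columns: a typical report only touches a handful of them, so only those columns are read from disk, the sparse index limits the row ranges scanned, and the aggregation runs over column chunks:

-- hypothetical events table: only event_time, country and user_id are read
SELECT toDate(event_time) AS day,
       country,
       count() AS events,
       uniqExact(user_id) AS users
FROM events
WHERE event_time >= now() - INTERVAL 30 DAY
GROUP BY day, country
ORDER BY day, events DESC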

Attention to Low-Level Details

But many other database management systems use similar techniques. What really makes ClickHouse stand out is attention to low-level details. Most programming languages provide implementations for most common algorithms and data structures, but they tend to be too generic to be effective.

Setup

To install it, run the following script:

curl https://clickhouse.com/ | sh

One can start the server with the following command

./clickhouse server

In a different terminal, let's start a client with:

./clickhouse client

Tests

Let's start by creating a table for some sample data:

CREATE TABLE my_first_table
(
user_id UInt32,
message String,
timestamp DateTime,
metric Float32
)
ENGINE = MergeTree
PRIMARY KEY (user_id, timestamp)

The statement uses traditional SQL DDL, with one extra piece of information regarding the table engine. The MergeTree engine provides improved performance for managed tables, but there are also engines to integrate with external systems such as BigQuery, S3, Kafka, PostgreSQL, and more.

Insert some data

INSERT INTO my_first_table (user_id, message, timestamp, metric) VALUES
(101, 'Hello, ClickHouse!', now(), -1.0 ),
(102, 'Insert a lot of rows per batch', yesterday(), 1.41421 ),
(102, 'Sort your data based on your commonly-used queries', today(), 2.718 ),
(101, 'Granules are the smallest chunks of data read', now() + 5, 3.14159 )

Query the data:

SELECT *
FROM my_first_table
ORDER BY timestamp

Now let's create a table over external data in S3 and then a MergeTree table populated from it.

Load data from S3 sources

First create our managed table

CREATE TABLE trips
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` String,
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` String,
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(pickup_date)
ORDER BY pickup_datetime
SETTINGS index_granularity = 8192

Now let’s create our raw table. Take into account the ENGINE being used.

CREATE TABLE trips_raw
(
`trip_id` UInt32,
`vendor_id` Enum8('1' = 1, '2' = 2, '3' = 3, '4' = 4, 'CMT' = 5, 'VTS' = 6, 'DDS' = 7, 'B02512' = 10, 'B02598' = 11, 'B02617' = 12, 'B02682' = 13, 'B02764' = 14, '' = 15),
`pickup_date` Date,
`pickup_datetime` DateTime,
`dropoff_date` Date,
`dropoff_datetime` DateTime,
`store_and_fwd_flag` UInt8,
`rate_code_id` UInt8,
`pickup_longitude` Float64,
`pickup_latitude` Float64,
`dropoff_longitude` Float64,
`dropoff_latitude` Float64,
`passenger_count` UInt8,
`trip_distance` Float64,
`fare_amount` Float32,
`extra` Float32,
`mta_tax` Float32,
`tip_amount` Float32,
`tolls_amount` Float32,
`ehail_fee` Float32,
`improvement_surcharge` Float32,
`total_amount` Float32,
`payment_type_` Enum8('UNK' = 0, 'CSH' = 1, 'CRE' = 2, 'NOC' = 3, 'DIS' = 4),
`trip_type` UInt8,
`pickup` FixedString(25),
`dropoff` FixedString(25),
`cab_type` Enum8('yellow' = 1, 'green' = 2, 'uber' = 3),
`pickup_nyct2010_gid` Int8,
`pickup_ctlabel` Float32,
`pickup_borocode` Int8,
`pickup_ct2010` String,
`pickup_boroct2010` FixedString(7),
`pickup_cdeligibil` String,
`pickup_ntacode` FixedString(4),
`pickup_ntaname` String,
`pickup_puma` UInt16,
`dropoff_nyct2010_gid` UInt8,
`dropoff_ctlabel` Float32,
`dropoff_borocode` UInt8,
`dropoff_ct2010` String,
`dropoff_boroct2010` FixedString(7),
`dropoff_cdeligibil` String,
`dropoff_ntacode` FixedString(4),
`dropoff_ntaname` String,
`dropoff_puma` UInt16
) ENGINE = S3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/nyc-taxi/trips_{0..9}.gz', 'TabSeparatedWithNames', 'gzip');

Let’s select from our raw table

SELECT DISTINCT(pickup_ntaname)
FROM trips_raw
LIMIT 10;

The S3 table engine supports parallel reads. Writes are only supported if the table definition does not contain glob patterns.

You will need to configure access credentials in config.xml if you are using private data or if you need to write data. You can also define individual configuration files in the conf.d directory, like the following example.

<clickhouse>
    <s3>
        <endpoint-name>
            <endpoint>https://dalem-files.s3.amazonaws.com/test/</endpoint>
            <access_key_id>key</access_key_id>
            <secret_access_key>secret</secret_access_key>
            <!-- <use_environment_credentials>false</use_environment_credentials> -->
            <!-- <header>Authorization: Bearer SOME-TOKEN</header> -->
        </endpoint-name>
    </s3>
</clickhouse>

But we are not going to do this and just load the public data locally.

Let's load the data from our raw table into the MergeTree table.

INSERT INTO trips SELECT * FROM trips_raw

This will take some time, and you can check the progress bar on the console.

0 rows in set. Elapsed: 109.103 sec. Processed 10.00 million rows, 815.59 MB (91.66 thousand rows/s., 7.48 MB/s.)
Peak memory usage: 1.21 GiB.

Let's validate that we have the data:

buldozer :) SELECT COUNT(1) FROM trips

SELECT COUNT(1)
FROM trips

Query id: 8956eff9-d76e-4a1e-a766-9bcc5dd1f6cd

┌──count()─┐
│ 10000840 │
└──────────┘

1 row in set. Elapsed: 0.002 sec.

Let’s execute a DISTINCT query to force a heavier execution plan.

SELECT DISTINCT(*) FROM trips

This execution was done on a single server with an AMD processor.

10000840 rows in set. Elapsed: 18.121 sec. Processed 10.00 million rows, 3.07 GB (551.89 thousand rows/s., 169.19 MB/s.)
Peak memory usage: 631.59 MiB.

Still, it scanned 10M records in 18 seconds, not bad :D
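
For contrast, a more typical analytical query over the same trips table, touching only a couple of its columns, is the kind of workload ClickHouse is optimized for (a quick sketch, timings will vary):

SELECT passenger_count,
       round(avg(fare_amount), 2) AS avg_fare,
       count() AS total_trips
FROM trips
GROUP BY passenger_count
ORDER BY passenger_count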

ClickHouse Local

ClickHouse Local is very useful if you want to query data directly on top of files without having to install a server.

This option is also very interesting to incorporate into a deployment pipeline.

Query CSV

./clickhouse local -q "SELECT * FROM 'review_?.csv'"

Query Parquet

./clickhouse local -q "
SELECT count()
FROM s3('https://datasets-documentation.s3.eu-west-3.amazonaws.com/house_parquet/house_0.parquet')"

Integrations

ClickHouse has many integrations available:

  • Core integrations: built or maintained by ClickHouse, they are supported by ClickHouse and live in the ClickHouse GitHub organization
  • Partner integrations: built or maintained, and supported by, third-party software vendors
  • Community integrations: built or maintained and supported by community members. No direct support is available besides the public GitHub repositories and community Slack channels

Also the available documentation seems well prepared.

Production

This article only scratches the surface of the available options for setting up ClickHouse; you should read the Scale Out section to understand how best to deploy it in production.

Conclusion

In this article we followed the quick-start guide to set up a ClickHouse server, loaded a public dataset with 10M records from S3, and performed a SELECT DISTINCT query on that table.

ClickHouse presents itself as a very interesting OLAP solution. If you are considering a solution for analytical reporting, this is something to have on your radar.

I also liked the fact that you can start small, with the service fully deployed and maintained by you, while still having the capability to scale horizontally or move to the cloud offering, which grants the support characteristics that are most of the time missing in open-source software.

The fact that there is a huge list of adopters and a company providing support and defining a roadmap for the product also brings reassurance when adopting it.

The documentation provides several integration patterns that are worth checking.

This is definitely one that I will keep an eye on.

Well done, ClickHouse ;)

References

GitHub Actions

Intro

In this article I will go through the steps to set up GitHub Actions to deploy Hexo pages to GitHub Pages on every push.

GitHub Actions

GitHub Actions makes it easy to automate all your software workflows, now with world-class CI/CD. Build, test, and deploy your code right from GitHub. Make code reviews, branch management, and issue triaging work the way you want.

Check the available documentation, as there are several interesting examples.

Hexo Setup

This tech notes site is maintained with Hexo, a Markdown blog framework.

The framework generates static HTML content from Markdown articles, and one needs to carry out the following steps to update content:

  • Create/Update Markdown content
  • Execute hexo generate to generate the content
  • Validate the pages with hexo serve
  • Deploy to your hosting service with hexo deploy

Setup

Add the following file, .github/workflows/hexo-deploy.yml, to your Hexo repo with this content:

name: Deploy

on: [push]

jobs:
  build:
    runs-on: ubuntu-latest
    name: Deploying updated Hexo articles
    steps:
      - name: Checkout
        uses: actions/checkout@v2
        with:
          submodules: 'true'
          token: ${{ secrets.ACCESS_TOKEN }}

      # Caching dependencies to speed up workflows. (GitHub will remove any cache entries that have not been accessed in over 7 days.)
      - name: Cache node modules
        uses: actions/cache@v1
        id: cache
        with:
          path: node_modules
          key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}
          restore-keys: |
            ${{ runner.os }}-node-

      - name: Install Dependencies
        if: steps.cache.outputs.cache-hit != 'true'
        run: npm ci

      # Deploy hexo blog website.
      - name: Deploy
        id: deploy
        uses: sma11black/hexo-action@v1.0.4
        with:
          deploy_key: ${{ secrets.HEXO_DEPLOY_KEY }}
          user_name: <GITHUB_USER>   # (or delete this input setting to use bot account)
          user_email: <GITHUB_EMAIL> # (or delete this input setting to use bot account)
          commit_msg: ${{ github.event.head_commit.message }}

      # Use the output from the `deploy` step (used to test the action)
      - name: Get the output
        run: |
          echo "${{ steps.deploy.outputs.notify }}"

  • Replace <GITHUB_USER> with your user account and <GITHUB_EMAIL> with your email address
  • Generate a new ssh key with the command ssh-keygen -t rsa -C "<GITHUB_EMAIL>", making sure to use your email address
    • This step generates two files; the public key needs to be configured on the destination repo as an allowed deploy key
    • On the source repo you need to configure a secret (HEXO_DEPLOY_KEY in the workflow above) holding the private key
  • Configure a personal access token and register it as a secret named ACCESS_TOKEN on the source repository

That's it, you just need to start pushing changes.

NOTE: This assumes the Hexo source repository was already configured for the destination GitHub Pages account.

Multi repos

It is important to note that you cannot assign the same deploy key to several repositories.

That is why I used a personal token, but there should be better alternatives.

Conclusion

GitHub Actions is a really powerful CI/CD tool, and for this type of static content generation it works rather well.

I had several issues with GitHub submodules where the authentication was not passing. If you use the same approach for themes, you may end up in the same situation, and using a token-based approach would be preferable.

Also, the ssh keys being bound to a single repo caused some initial confusion; there should be a better way to set up the authentication, but I didn't explore it in detail.

Also, a package-lock.json file is required for this to work, and it is advisable to keep your source repo private.

This workflow can certainly be improved, for example by including tests or grammar validation before deploying.

References

Microsoft Fabric

Intro

Fabric is a centralized product delivered by Microsoft as SaaS that combines several services such as data lake storage, orchestration, processing, visualization, and AI.

Billing is based on the amount of processing and the amount of storage used.

OneLake is the storage solution, where all data is automatically indexed for discovery; lineage and governance are configured with the support of Purview.

Data can be virtualized from external storage locations in different cloud providers through shortcuts that work like pointers, so no data duplication is needed.

Data is also stored in the Delta format, guaranteeing ACID-compliant characteristics.

It integrates with Power BI for reporting, with the familiar look and feel.

Data Activator is the real-time processing component which triggers actions based on rules, like automated reports or procedures.

The product also brings Copilot features for Power BI, data science notebooks, and Data Factory cleaning processes.

Components

  • OneLake
  • Data Factory
  • Synapse
  • Data Activator
  • PowerBI

Integrations

Databricks integration is possible through Delta UniForm, and Unity Catalog integrates with OneSecurity.

Copilot

There are several Copilot integrations, but the one with Power BI is really interesting for producing quick visualizations from natural language prompts.

Conclusion

Not much to conclude here, as the product still needs to be tested in depth on my part. I have some concerns about vendor lock-in, but the fact that it sits on top of a data lake and supports open formats brings some reassurance.

One thing that also troubles me is that if a client just wants to use a single feature, they would need to acquire the full solution.

For instance, D365 used to allow clients to export data from that system to a data lake with a built-in feature of the product, and now it relies on Fabric only. If this strategy is going to be followed for the rest of the MS portfolio (Data Factory, for example), even if it brings increased stability or support, I find the lack of flexibility concerning, as it forces clients to adopt the full solution.

References