Databoost.dev

Intro

Over the past week I've started a forum for Data Professionals to discuss challenges and ideas in a safe discussion space.

The motivation came from the big wave of layoffs hitting tech at the beginning of the year: a way to support some of my past colleagues, not lose track of them, and also to share job opportunities or support those just getting started.

The forum is based on https://nodebb.org and this article explains how I've set up the hosting of the application.

Feel free to join if you have an interest in data. There are no fees, and we could certainly use help producing content.

Domain Registration

Not much to describe here: this part normally needs to be paid on a yearly basis.
I've used http://www.godaddy.com but http://www.amen.pt also offers some good rates.

Hosting

For this one I picked Google Cloud. As it has a Free Tier, you can achieve zero-cost hosting as long as you don't go above the usage limits.

It is important to pay attention to the machine type you provision and the region.
An e2-micro with a 30GB standard disk stays within the free tier if it runs in one of the following US regions (a gcloud sketch follows the note below):

  • Oregon: us-west1
  • Iowa: us-central1
  • South Carolina: us-east1

NOTE: Check the official docs as this may change.
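
As a sketch, provisioning such an instance with the gcloud CLI could look roughly like this (instance name, zone and image are just examples, not necessarily what I used):

# Hypothetical example: create a free-tier-eligible e2-micro with a 30GB boot disk
gcloud compute instances create databoost \
  --machine-type=e2-micro \
  --zone=us-west1-b \
  --boot-disk-size=30GB \
  --image-family=ubuntu-2204-lts \
  --image-project=ubuntu-os-cloud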

NodeBB

Installation of NodeBB was straightforward following the official documentation.

The project is pretty well organized and the documentation is very good

I've set up some extra plugins: SSO with GitHub and Google, Google Analytics, Reactions, and a couple of themes.

I opted for the MongoDB option rather than Redis.

After some customization on the backend side it was done.
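
For reference, the documented install boils down to roughly the following (a sketch assuming Node.js and MongoDB are already installed; check the official docs for the current branch and prerequisites):

# Clone NodeBB, run the interactive setup (database, admin user, URL) and start it
git clone -b v3.x https://github.com/NodeBB/NodeBB.git nodebb
cd nodebb
./nodebb setup
./nodebb start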

GCS

Google Cloud Storage offers the following on the free tier:

  • 5 GB-months of regional storage (US regions only) per month
  • 5,000 Class A Operations per month

So I decided to set up a cron job to back up the application to a bucket; a sketch follows below.
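
A minimal sketch of that backup job (paths and bucket name are hypothetical, and it assumes mongodump and the Google Cloud SDK are installed on the VM):

#!/bin/bash
# nightly-backup.sh - dump the NodeBB MongoDB database and copy the archive to a GCS bucket.
# Run it from cron, e.g. once per day.
set -euo pipefail
STAMP=$(date +%F)
mongodump --archive="/tmp/nodebb-${STAMP}.gz" --gzip
gsutil cp "/tmp/nodebb-${STAMP}.gz" gs://my-nodebb-backups/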

Cloudflare

I also changed the domain's DNS nameservers from GoDaddy to Cloudflare, as they also have a free tier that lets me take advantage of caching and other benefits like DDoS protection.

ntfy

This service is simple and very useful. I've configured a few things to use it, for example to alert me if there are any issues with the backups or with access to the system. Take a look at their Examples section for some of the things you can achieve.
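
Publishing a notification is just an HTTP POST; for example, a backup script could do something like this (the topic name is a placeholder):

# Send a message to an ntfy topic; anyone subscribed to it gets the push notification
curl -d "databoost.dev: nightly backup failed" https://ntfy.sh/databoost-alerts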

SSL

I also set up https://certbot.eff.org in order to have SSL enabled, but I still need to check how this overlaps with the functionality provided by Cloudflare and whether I should change the setup.
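
For reference, obtaining and installing the certificate can look roughly like this (a sketch assuming the forum sits behind nginx):

sudo certbot --nginx -d databoost.dev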

Mailjet

I also configured Mailjet and Sendmail to relay all mail to that service. The free tier lets me use it, with some limitations, for newsletter setup and campaign management.

Google Analytics

It is always good to have visibility on engagement, so I activated the GA plugin and set up the tracking.

This part still needs some work.

Conclusion

In this article I went through the choices I made to set up databoost.dev, a forum for Data Professionals, and the hosting and other SaaS options used to achieve a "zero" cost setup.

Feel free to join and support us.

References

pico.sh

Intro

We think of pico.sh as a hacker lab where we can experiment with new ways to interact with the web.

Features

Pico supports the following services:

  • prose.sh - A blog platform for hackers
  • pastes.sh - A pastebin for hackers
  • feeds.sh - An rss email notification service

The paid version, called pico+, brings more services.

pico+

The paid version of pico brings extra services:

  • pgs.sh - 10GB asset storage
  • tuns.sh - Full access
  • imgs.sh - 2GB image registry storage
  • prose.sh - 1GB image storage
  • Beta access

Setup

Prose

Prose.sh - This service allows you to upload GitHub-flavored Markdown and it will generate the HTML content to display; you just need to sync the data.

Create a post, e.g. ~/blog/hello-world.md

# hello world!

This is my first blog post.

Check out some resources:

- [pico.sh](https://pico.sh)
- [antoniomika](https://antoniomika.me)
- [bower.sh](https://bower.sh)

Cya!

And just publish with rsync

rsync ~/blog/* prose.sh:/

There are some special files you can set up to customize the CSS or add a footer (see the sketch after the list):

  • _styles.css
  • _footer.md
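
For example, a minimal sketch (the contents are just placeholders):

echo "Thanks for reading!" > ~/blog/_footer.md
echo "body { font-family: monospace; }" > ~/blog/_styles.css
rsync ~/blog/_footer.md ~/blog/_styles.css prose.sh:/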

But that’s pretty much it

Check the following doc for more information

Pastes

You can also use the pastebin service

echo "foobar" | ssh pastes.sh

You can define an expiration

echo "foobar" | ssh pastes.sh FILENAME expires=2023-12-12

It will generate a URL for your paste,
e.g.: https://ruimsramos.pastes.sh/1709216080780412798

Conclusion

This article was about pico, a "hacker lab" service as they advertise it.

It is extremely fast if you want a pastebin option to share some data, or to quickly upload some markdown notes to prose, when you don't want to worry about setting up something fancy just to publish them and prefer to focus on the writing.

The paid services like tuns.sh and imgs.sh also seem powerful, the latter especially if you want to integrate with GitHub Actions for instance, but I didn't evaluate that version.

References

Parquet Compression

Introduction

I was reading this article where Philippe Rivière and Éric Mauvière took 200GB of Parquet data and prepared it down to 549kB.

This work touches some very relevant points regarding Data Engineering procedures and best practices. I would suggest reading the article, as it explains in detail what they applied at each stage and how.

Use Case

This new fascinating dataset just dropped on Hugging Face. French public domain newspapers 🤗 references about 3 million newspapers and periodicals with their full text OCR’ed and some meta-data. The data is stored in 320 large parquet files. The data loader for this Observable framework project uses DuckDB to read these files (altogether about 200GB) and combines a minimal subset of their metadata — title and year of publication, most importantly without the text contents —, into a single highly optimized parquet file.

Undoubtedly, this dataset proves immensely valuable for training and processing Large Language Models (LLMs).

Best Practices

I firmly believe that these best practices should be applied not only to Parquet but also to other columnar formats.

These are the key factors you should take into consideration:

1. Select only the columns that you will use

This is one of the simplest optimizations you can do. Remember that the data is stored in a columnar way, so picking only the columns that matter not only filters very quickly but also significantly reduces the volume.
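
As a sketch of what this looks like with DuckDB (the file layout and column names are hypothetical, roughly matching the title/year subset kept in the article):

# Keep only the light metadata columns, dropping the heavy OCR text column
duckdb -c "COPY (SELECT title, year FROM read_parquet('newspapers/*.parquet')) TO 'subset.parquet' (FORMAT PARQUET)"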

2. Apply the most appropriate compression algorithm

The majority of contemporary data formats support compression. When examining the most common ones for Parquet—such as LZO, Snappy, and Gzip—we observe several notable differences (ref: sheet)

For instance, gzip cannot be split, which means that if you are going to process the data with a distributed engine like Spark, you must use the driver to handle all of the decompression.

LZO strikes a better balance between speed and compression rate when compared to Snappy. In this specific case, I would also recommend exploring Brotli as the datasets seem to contain text. Choosing an effective algorithm is crucial.
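
Continuing the hypothetical DuckDB sketch, the codec is just an option on the Parquet writer (ZSTD here; Brotli can be worth trying for text-heavy columns if your build supports it):

duckdb -c "COPY (SELECT title, year FROM read_parquet('newspapers/*.parquet')) TO 'subset.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"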

3. Sort the data

While it may not seem immediately relevant, aligning the rows in this manner results in extended streaks of constant values across multiple columns, improving the compression ratio achieved by the compression algorithm.
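
In the same hypothetical sketch, sorting is just an ORDER BY before writing, so the long runs of repeated values end up physically adjacent:

duckdb -c "COPY (SELECT title, year FROM read_parquet('newspapers/*.parquet') ORDER BY year, title) TO 'sorted.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)"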

Thoughts

They took it a step further by implementing additional optimizations, such as increasing the row_group_size. What’s crucial to highlight here is the significant gains achievable through the application of good engineering practices, resulting in faster and more cost-effective processes.
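
With DuckDB, the row group size is another option on the writer (the value below is arbitrary, just to illustrate the knob):

duckdb -c "COPY (SELECT title, year FROM read_parquet('newspapers/*.parquet') ORDER BY year, title) TO 'final.parquet' (FORMAT PARQUET, COMPRESSION ZSTD, ROW_GROUP_SIZE 100000)"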

Additionally, DuckDB is exceptionally fast for executing these types of processes. While I’m eager to test it out, unfortunately, I find myself short on both time and disk space!

References

Sendmail Relay Configuration

Intro

In this article I will go through the process of setting up Sendmail to relay email to MailJet service.

There are several options to setup relaying on your web hosting service, and also several providers that you can consider.

Incorporating the SMTP relay with Mailjet lets you take advantage of other services they provide, such as campaign management.

Requirements

  • For this setup you will need to have access to your server and permissions to install software.
  • Create an account on the Mailjet service
  • Have permissions to change your domain's DNS records

MailJet

For this setup we are considering the Mailjet service, but you can use a different one.
Depending on the tier level, you will have different limitations.

The Free tier allows:

  • 200 emails per day
  • 1,500 contacts
  • 6,000 emails per month

It is a good point to start from, and you can increase it later if it makes sense.

DNS

SPF & DKIM are authentication systems that tell Internet Service Providers (ISPs), like Gmail and Yahoo, that incoming mail has been sent from an authorized system, and that it is not spam or email spoofing. To set Mailjet as an authorized sender and improve your deliverability, you need to modify your DNS records to include DKIM signature and SPF.

This document provides more detailed information

But basically you will need to include 2 TXT records

  • type: TXT , host: @ , value: “v=spf1 include:spf.mailjet.com ~all”

If you run a DNS query on your domain for TXT records, you should see that info:

dig -t TXT yourdomain.com

You also need to include the DKIM record; follow the instructions provided.

There is also an option to validate whether the configuration is working properly.

Add Domains

You will also need to configure the domains that will be allowed and validate the senders.

At the following URL you can make those changes:

API Keys

The last step would be to create an API key for your service.

Go to the following URL and create a new key; note it down as it will be required later.

Ok, now let’s configure our MTA

Configure Sendmail

For this setup you will need access to your hosting server and to be able to install software.

The following instructions are for a Ubuntu base distribution.

Install packages

sudo apt-get install sendmail

Configuration

In this setup we will configure Sendmail to relay all email over SMTP using the authentication provided by the service.

Start by editing the file /etc/mail/sendmail.mc and add the following content at the end:

dnl # Default Mailer setup
MAILER_DEFINITIONS
define(`SMART_HOST', `in-v3.mailjet.com')dnl
define(`RELAY_MAILER_ARGS', `TCP $h 587')dnl
define(`ESMTP_MAILER_ARGS', `TCP $h 587')dnl
define(`confAUTH_OPTIONS', `A p')dnl
TRUST_AUTH_MECH(`EXTERNAL DIGEST-MD5 CRAM-MD5 LOGIN PLAIN')dnl
define(`confAUTH_MECHANISMS', `EXTERNAL GSSAPI DIGEST-MD5 CRAM-MD5 LOGIN PLAIN')dnl
FEATURE(`authinfo',`hash -o /etc/mail/authinfo/smtp-auth.db')dnl
MAILER(`local')dnl
MAILER(`smtp')dnl

We need to set up authentication. Remember the API key you created earlier: you will need to put the information associated with API_KEY and API_SECRET in the file /etc/mail/authinfo/smtp-auth.

sudo mkdir /etc/mail/authinfo
sudo nano /etc/mail/authinfo/smtp-auth

AuthInfo: "U:root" "I:API_KEY" "P:API_SECRET"

Example:

AuthInfo: "U:root" "I:1233450786523741256e" "P:ety555qtfgdghsd88wrfer"

After this you need to run the following command to update the service configuration files:

make -C /etc/mail

And restart sendmail service

systemctl restart sendmail

Test

In order to test, you can execute the following command:

echo "Test Email" | mail -s "Subject Here" recipient@example.com 

You can now check in the Mailjet Stats section whether your mail passes through there.

Troubleshooting

You can use the mailq command to see if there is mail being blocked, and check the logs in /var/log/mail.log to see if there is some issue.

Conclusion

In this article we went through the configuration of the Sendmail service to relay emails through Mailjet, covering the necessary configuration on both the DNS and Mailjet sides to ensure seamless email delivery from your web hosting server.

References

Redpanda

Intro

In this article I'll go through the Redpanda quickstart guide, spinning up a Redpanda cluster in Docker to evaluate it on Linux.

Requirements

Make sure you have docker and docker-compose installed.

Setup

For lightweight testing, we are going to start a single Redpanda broker.

Create the following docker-compose.yml file with the content:

version: "3.7"
name: redpanda-quickstart
networks:
redpanda_network:
driver: bridge
volumes:
redpanda-0: null
services:
redpanda-0:
command:
- redpanda
- start
- --kafka-addr internal://0.0.0.0:9092,external://0.0.0.0:19092
# Address the broker advertises to clients that connect to the Kafka API.
# Use the internal addresses to connect to the Redpanda brokers'
# from inside the same Docker network.
# Use the external addresses to connect to the Redpanda brokers'
# from outside the Docker network.
- --advertise-kafka-addr internal://redpanda-0:9092,external://localhost:19092
- --pandaproxy-addr internal://0.0.0.0:8082,external://0.0.0.0:18082
# Address the broker advertises to clients that connect to the HTTP Proxy.
- --advertise-pandaproxy-addr internal://redpanda-0:8082,external://localhost:18082
- --schema-registry-addr internal://0.0.0.0:8081,external://0.0.0.0:18081
# Redpanda brokers use the RPC API to communicate with each other internally.
- --rpc-addr redpanda-0:33145
- --advertise-rpc-addr redpanda-0:33145
# Tells Seastar (the framework Redpanda uses under the hood) to use 1 core on the system.
- --smp 1
# The amount of memory to make available to Redpanda.
- --memory 1G
# Mode dev-container uses well-known configuration properties for development in containers.
- --mode dev-container
# enable logs for debugging.
- --default-log-level=debug
image: docker.redpanda.com/redpandadata/redpanda:v23.3.5
container_name: redpanda-0
volumes:
- redpanda-0:/var/lib/redpanda/data
networks:
- redpanda_network
ports:
- 18081:18081
- 18082:18082
- 19092:19092
- 19644:9644
console:
container_name: redpanda-console
image: docker.redpanda.com/redpandadata/console:v2.4.3
networks:
- redpanda_network
entrypoint: /bin/sh
command: -c 'echo "$$CONSOLE_CONFIG_FILE" > /tmp/config.yml; /app/console'
environment:
CONFIG_FILEPATH: /tmp/config.yml
CONSOLE_CONFIG_FILE: |
kafka:
brokers: ["redpanda-0:9092"]
schemaRegistry:
enabled: true
urls: ["http://redpanda-0:8081"]
redpanda:
adminApi:
enabled: true
urls: ["http://redpanda-0:9644"]
ports:
- 8080:8080
depends_on:
- redpanda-0

And start the execution with docker-compose up -d

Start Streaming

Let’s use the rpk command-line tool to create a topic, produce messages to it, and consume messages.

Get information about the cluster with the command

docker exec -it redpanda-0 rpk cluster info

Now let's create a topic called chat-room:

docker exec -it redpanda-0 rpk topic create chat-room

Producing messages for that topic

docker exec -it redpanda-0 rpk topic produce chat-room

Consuming one message from the topic

docker exec -it redpanda-0 rpk topic consume chat-room --num 1
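
Since the compose file also exposes the HTTP Proxy on port 18082, you can produce over plain HTTP as well; a sketch (the JSON payload is just a placeholder):

curl -s -X POST "http://localhost:18082/topics/chat-room" \
  -H "Content-Type: application/vnd.kafka.json.v2+json" \
  -d '{"records":[{"value":{"user":"demo","msg":"hello from the HTTP proxy"}}]}'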

You can also install rpk directly on your system and connect to the broker:

curl -LO https://github.com/redpanda-data/redpanda/releases/latest/download/rpk-linux-amd64.zip

Then unzip the file and put the rpk binary on your PATH, e.g.: unzip rpk-linux-amd64.zip -d ~/.local/bin/

You can test the connection to your broker with:

rpk cluster info -X brokers=127.0.0.1:19092

Generating Mock Data

Let's use the following command from our references to produce mock data.

Leave one terminal open with the following command

rpk topic consume Products -X brokers=127.0.0.1:19092

In a different terminal, create the following file schema.avsc:

{
  "type": "record",
  "name": "Products",
  "namespace": "exp.products.v1",
  "fields": [
    { "name": "id", "type": "string" },
    { "name": "productId", "type": ["null", "string"] },
    { "name": "title", "type": "string" },
    { "name": "price", "type": "int" },
    { "name": "isLimited", "type": "boolean" },
    { "name": "sizes", "type": ["null", "string"], "default": null },
    { "name": "ownerIds", "type": { "type": "array", "items": "string" } }
  ]
}

Make sure to install datagen

npm install -g @materializeinc/datagen

Create the following .env file

# Kafka Brokers
KAFKA_BROKERS=localhost:19092

# For Kafka SASL Authentication:
SASL_USERNAME=
SASL_PASSWORD=
SASL_MECHANISM=

# For Kafka SSL Authentication:
SSL_CA_LOCATION=
SSL_CERT_LOCATION=
SSL_KEY_LOCATION=

# Connect to Schema Registry if using '--format avro'
SCHEMA_REGISTRY_URL=
SCHEMA_REGISTRY_USERNAME=
SCHEMA_REGISTRY_PASSWORD=

Then execute the following command

datagen -s schema.avsc -n 10

And you have just generated mock data based on the provided schema file.
Take a look at the following repo for more details on datagen.

Conclusion

Redpanda provides a very quick way to get a Kafka-compatible environment up, which is especially good for developers. This article didn't go deep into performance evaluations of Kafka vs Redpanda, but their benchmarks are worth assessing if that means reducing your Kafka footprint.

I'll probably leave that for another article. I would also like to test the SASL options and the Schema Registry.

References