23 Dec, 2022 - About 5 minutes
SODA
Intro
Bernard Denys kindly shared with me that their product is available to the open-source community on GitHub. SODA provides a data monitoring platform. In this article I will take some time to explore a quick setup, how to use it, and some final comments around it.
What does Soda SQL do
Soda SQL allows you to:
- Stop your pipeline when bad data is detected
- Extract metrics and column profiles through super-efficient SQL
- Take full control over metrics and queries through declarative config files
5m tutorial
There is a 5m tutorial available on their official page, so let’s have a go…
Install
I’m using Linux and I’m lazy, so I just installed the required pip package:
pip3 install soda-sql
Let’s check that it is working:
$ soda
Great! Now let’s create a dummy warehouse and run some tests.
Create a dummy data warehouse
docker run --name soda_sql_tutorial_db --rm -d \
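    -p 5432:5432 \
    -v soda_sql_tutorial_postgres:/var/lib/postgresql/data:rw \
    -e POSTGRES_USER=sodasql \
    -e POSTGRES_DB=sodasql \
    -e POSTGRES_HOST_AUTH_METHOD=trust \
    postgres:9.6.17
# NOTE: the flags and image tag above are my reconstruction of the tutorial's
# Postgres container; check the official 5m tutorial for the exact command.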
And load it with data with the following command
docker exec soda_sql_tutorial_db \
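    sh -c "wget -qO - https://raw.githubusercontent.com/sodadata/soda-sql/main/tests/demo/demodata.sql | psql -U sodasql -d sodasql"
# NOTE: reconstructed from the tutorial; the demo-data URL is an assumption
# based on the sodadata/soda-sql repo layout.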
According to the tutorial, one can remove the created container and volume with the following commands:
docker stop soda_sql_tutorial_db
docker volume rm soda_sql_tutorial_postgres  # volume name assumed from the tutorial
Create warehouse directory
mkdir soda_sql_tutorial
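The tutorial then generates the warehouse configuration inside this directory with soda create; a minimal sketch, where the database and username flags are assumptions based on the tutorial’s Postgres setup:

cd soda_sql_tutorial
soda create postgres -d sodasql -u sodasql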
This created the warehouse.yml with the connection settings; the credentials are stored separately in your home directory under ~/.soda/env_vars.yml.
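For reference, the generated warehouse.yml should look roughly like this (a sketch based on the tutorial; the host and schema values are assumptions):

name: soda_sql_tutorial
connection:
  type: postgres
  host: localhost
  username: env_var(POSTGRES_USERNAME)
  password: env_var(POSTGRES_PASSWORD)
  database: sodasql
  schema: public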
Analyse table scan YAML files
The following command will analyze the existing tables and fill the ./tables/ directory with one scan YAML file per table. For a large data warehouse, these files can also be written manually.
Well, I only have 5m, so let’s give it a go.
soda analyze
Hum! Interesting: it queries the information schema and generates a file, which I will assume is one per table, with several metrics for validation.
table_name: demodata
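# …followed by the metrics to compute and a default test. This continuation is
# an abbreviated sketch; the metric names come from the soda-sql docs and the
# exact generated list may differ.
metrics:
  - row_count
  - missing_count
  - missing_percentage
  - values_count
  - valid_count
tests:
  - row_count > 0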
Run a scan
Each scan requires a warehouse YAML and a scan YAML as input. The scan command will collect the configured metrics and run the defined tests against them.
soda scan warehouse.yml tables/demodata.yml
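Since the whole point is stopping a pipeline on bad data, the scan is what carries that signal; a minimal sketch, assuming soda scan returns a non-zero exit code when tests fail:

soda scan warehouse.yml tables/demodata.yml || echo "bad data detected, aborting the pipeline"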
And that’s it for the tutorial.
One can then integrate this with an orchestration tool such as Airflow. It is recommended to use the PythonVirtualenvOperator to prevent any dependency conflicts.
Example DAG:
from airflow import DAG
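# NOTE: a sketch based on the Airflow example in the soda-sql docs; the file
# paths, schedule, and requirements pin are assumptions, not the official DAG.
from datetime import datetime, timedelta

from airflow.operators.python import PythonVirtualenvOperator


def run_soda_scan(warehouse_yml_file, scan_yml_file):
    # Imported here so the package resolves inside the operator's virtualenv.
    from sodasql.scan.scan_builder import ScanBuilder

    scan_builder = ScanBuilder()
    scan_builder.warehouse_yml_file = warehouse_yml_file
    scan_builder.scan_yml_file = scan_yml_file
    scan = scan_builder.build()
    scan_result = scan.execute()
    if scan_result.has_test_failures():
        # Failing the task is what stops the pipeline on bad data.
        raise ValueError("Soda scan found test failures")


with DAG(
    "soda_sql_scan",
    default_args={"owner": "airflow", "retries": 1, "retry_delay": timedelta(minutes=5)},
    start_date=datetime(2022, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    soda_sql_scan = PythonVirtualenvOperator(
        task_id="soda_sql_scan_demodata",
        python_callable=run_soda_scan,
        requirements=["soda-sql"],  # pin this to the version you installed
        system_site_packages=False,
        op_kwargs={
            "warehouse_yml_file": "/path/to/warehouse.yml",
            "scan_yml_file": "/path/to/tables/demodata.yml",
        },
    )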
There is also some work being done to support a SodaScanOperator for Airflow.
Warehouse types
At the moment SODA supports the following technologies:
- Snowflake
- AWS Athena
- GCP BigQuery
- PostgreSQL
- Redshift
- Spark SQL (coming soon)
Conclusion
This quick tutorial makes it easy to test the tool, but I would like to extend the tests to a larger solution to check how it behaves. I would also like to see Spark SQL as a warehouse option, as most of my teams’ work is centered on that technology.
Topics for extended testing:
- Schema evolution impact on the quality test setup
- Not using a Warehouse solution but a Datalake
- Providing schemas from a Datalake and using a processing engine on top of it (e.g. Spark schemas and Parquet files)
- Testing with Delta tables
- Defining levels of data-quality analysis
- Understanding better how the scans work
The tool seems to have good potential and is very simple, which should speed up adoption. Also, sharing it with the open-source community seems like a good choice to grow the user base while still offering an enterprise services solution.