Data Engineering Toolchain
This article is still W.I.P and solely exists for my own purpose.
Welcome to my "Data Engineering toolchain", a curated collection of invaluable tools, links and resources gathered throughout my journey.
The Collection of Tools That I Use and Swear By
graph LR
    A[1. Data processors] --> B1[Python]
    A --> B2[SQL]
    A --> B3[Spark]
    A --> B4[Snowflake]
    A --> B5[Flink]
    C[2. Code testing] --> D1[Pytest]
    E[3. Data quality] --> F1[Great Expectations]
    G[4. Data visualization] --> H1[Metabase]
    I[5. Orchestration] --> J1[Airflow]
    I --> J2[dbt]
    I --> J3[Databricks]
    K[6. Monitoring & alerting] --> L1[Prometheus]
    K --> L2[Grafana]
    M[7. Data infrastructure] --> N1[Docker]
    M --> N2[Terraform]
    M --> N3[AWS]
    O[8. Code hosting & CI/CD] --> P1[GitHub]
👇 List of lesser-known but useful tools I use and like 👇
Cronitor
A quick and simple editor for cron schedule expressions. Cronitor is my go-to tool when scheduling DAG (Directed Acyclic Graph) runs, because cron schedule expressions are difficult to read and configure by hand.
Link: Cronitor
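As a reminder of what a cron schedule expression actually encodes, here is a minimal Python sketch that labels the five fields of a standard expression. The helper name `describe_cron` and the sample schedule are my own illustrative assumptions, not part of Cronitor.

```python
# Field order of a standard five-field cron expression.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict:
    """Map each part of a five-field cron expression to its field name."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}: {expression!r}")
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    # "30 2 * * 1-5" means 02:30 on Monday through Friday.
    for field, value in describe_cron("30 2 * * 1-5").items():
        print(f"{field:>12}: {value}")
```

This only names the fields; it does not validate ranges or handle non-standard extensions such as a seconds field.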
ntfy (pronounced notify)
ntfy (pronounced notify) is a simple HTTP-based pub-sub notification service. It allows you to send notifications to your phone or desktop via scripts from any computer, and/or using a REST API. It's infinitely flexible, and 100% free software.
Link: ntfy
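Since ntfy is plain HTTP, publishing a notification from a script can be sketched with nothing but the Python standard library. This assumes the public ntfy.sh server; the topic name `my-pipeline-alerts` and the function names are made up for illustration.

```python
import urllib.request


def build_notification(topic, message, title=None, server="https://ntfy.sh"):
    """Build (but do not send) an HTTP POST that publishes to an ntfy topic."""
    request = urllib.request.Request(
        f"{server}/{topic}",
        data=message.encode("utf-8"),
        method="POST",
    )
    if title:
        # ntfy reads the notification title from the "Title" header.
        request.add_header("Title", title)
    return request


def send_notification(topic, message, title=None):
    """Publish the message; anyone subscribed to the topic receives it."""
    request = build_notification(topic, message, title)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


if __name__ == "__main__":
    # Hypothetical topic name: pick your own, since topics act like passwords.
    print(send_notification("my-pipeline-alerts", "Nightly DAG finished", title="Airflow"))
```

Subscribing to the same topic in the ntfy phone or web app is enough to receive the message.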
DevUtils
Offline development toolkit. Whether you need to format JSON, debug a JWT token, or convert a UNIX timestamp, DevUtils can solve any of this in just one click. Get access to a huge set of development utilities that all work offline.
I can't recommend this tool enough. You have to see it to believe it. You will never regret using DevUtils.
Link: DevUtils
Peewee
The Python standard library offers only low-level database access (the sqlite3 module) and no ORM, which has been a longstanding gap for the community. There are boatloads of ORMs and query builders for Python. SQLAlchemy, Django ORM, ringing any bells?
Personally, I prefer using Peewee as my go-to ORM for Python.
Link: Peewee
Mockingbird
Generate mock data streams for your next data project. It is as easy as 1-2-3.
- Connect: Select a destination and configure your project settings.
- Build the schema: Define the structure of your data from scratch or using a template.
- Generate data: Mockingbird will generate and stream events to your selected destination.
Link: Mockingbird
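To make the three steps above concrete, here is a rough Python stand-in for the same idea: define a schema, generate events from it, and stream them to a destination. This is not Mockingbird's API; the schema fields, function names and the `print` destination are all assumptions for illustration.

```python
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical schema: field name -> function that produces one mock value.
SCHEMA = {
    "event_id": lambda rng: str(uuid.UUID(int=rng.getrandbits(128))),
    "user_id": lambda rng: rng.randint(1, 1000),
    "action": lambda rng: rng.choice(["view", "click", "purchase"]),
    "timestamp": lambda rng: datetime.now(timezone.utc).isoformat(),
}


def generate_events(schema, count, seed=None):
    """Yield `count` mock events that follow `schema`."""
    rng = random.Random(seed)
    for _ in range(count):
        yield {field: make(rng) for field, make in schema.items()}


def stream(events, destination=print):
    """Send each event to the destination as one JSON line; return the count sent."""
    sent = 0
    for event in events:
        destination(json.dumps(event))
        sent += 1
    return sent


if __name__ == "__main__":
    stream(generate_events(SCHEMA, 3, seed=42))
```

Swapping `print` for a function that posts to a queue or an HTTP endpoint would turn this into a very small event producer.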
novu
novu is an open-source notification infrastructure for developers. novu consists of simple components and APIs for managing all communication channels in one place: Email, SMS, Direct, and Push.
Build a real-time notification centre using novu's embeddable components or connect your custom UI with novu's notification feed API.
npx novu init
Link: novu
micro
Micro is a modern and intuitive terminal-based text editor. It serves as an alternative to the nano editor. Whenever I am required to modify a configuration file or create a script on a VPS, I rely on micro as my go-to tool.
You can run a real interactive shell from within micro. You could open up a split with code on one side and bash on the other -- all from within micro. Micro supports Sublime-style multiple cursors, directly giving us editing power in the terminal.
Link: micro
Poetry
Poetry helps you declare, manage and install dependencies of Python projects, ensuring you have the right stack everywhere. Poetry simplifies packaging and managing dependencies in Python.
Poetry allows you to declare the libraries your project depends on and it will manage (install/update) them for you. Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.
Poetry replaces setup.py, requirements.txt, setup.cfg, MANIFEST.in and Pipfile with a simple pyproject.toml based project format.
Poetry requires Python 3.8+. It is multi-platform and the goal is to make it work equally well on Linux, macOS and Windows.
Link: Poetry
KubeBlocks
Building data infrastructure on K8s is becoming increasingly popular. However, the most prominent obstacles are the difficulties of integrating with cloud providers, the lack of reliable operators, and the steep learning curve of K8s.
KubeBlocks helps developers and platform engineers manage database workloads (MySQL, PostgreSQL, Redis, MongoDB, Kafka and vector databases) on K8s inside your own cloud account. It supports multiple clouds, including AWS, Azure, GCP, and Alibaba Cloud.
When adopting a multi-cloud or hybrid cloud strategy, it is essential to prioritize application portability and use software or services that offer consistent functionality across different infrastructures.
KubeBlocks offers an open-source option that helps application developers and platform engineers set up feature-rich services for RDBMS, NoSQL, streaming and analytical systems. You don't need to be a K8s professional: anyone can set up a full-stack, production-ready data infrastructure in minutes.
Link: KubeBlocks
Protobuf
Protobuf, short for Protocol Buffers, is a Google project designed to simplify and enhance the serialisation of structured data. This makes it easier to transmit the data over a wire or store it in files.
In comparison to other serialisation methods, Protobuf is more optimized for scenarios where data is transmitted between multiple micro-services in a platform-neutral way. Because of this, many developers prefer using Protobuf for data serialisation.
Key features
Binary transfer format
In Protobuf, data is transmitted in a binary format, which makes messages smaller, uses less bandwidth, and allows faster transmission.
Separation of context and data
Protobuf separates context and data: field names and types live in a schema (a .proto file) rather than in each message, whereas JSON and XML repeat field names alongside every value.
Message format
The message format is optimized for efficiency and effectiveness.
Link: Protobuf
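To see why the binary format is compact, the sketch below hand-encodes a single integer field the way Protobuf's wire format does (a tag byte followed by a base-128 varint) and compares it with the equivalent JSON. It is a teaching sketch covering only non-negative varint fields, not a replacement for the official library; the function names are my own.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low_bits = value & 0x7F
        value >>= 7
        if value:
            out.append(low_bits | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low_bits)
            return bytes(out)


def encode_int_field(field_number: int, value: int) -> bytes:
    """Encode one varint field: a tag (field number + wire type 0), then the value."""
    tag = (field_number << 3) | 0  # wire type 0 = varint
    return encode_varint(tag) + encode_varint(value)


if __name__ == "__main__":
    binary = encode_int_field(1, 150)
    text = json.dumps({"a": 150}, separators=(",", ":")).encode()
    print(binary.hex(), len(binary))  # 089601 3
    print(text, len(text))            # b'{"a":150}' 9
```

The binary message carries only the field number, not the field name; the name lives in the schema, which is the separation of context and data described above.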
Cockpit
The easy-to-use, integrated, glanceable, and open web-based graphical interface for your servers.
Cockpit makes Linux discoverable. You don’t have to remember commands at a command-line.
See your server in a web browser and perform system tasks with a mouse. It’s easy to start containers, administer storage, configure networks, and inspect logs. Basically, you can think of Cockpit like a graphical “desktop interface”, but for individual servers.
Cockpit is intended for everyone, especially those who are:
- new to Linux (including Windows admins)
- familiar with Linux and want an easy, graphical way to administer servers
- expert admins who mainly use other tools but want an overview on individual systems
Link: Cockpit
ssh-audit
ssh-audit is a tool for SSH server and client configuration auditing. It checks the banner, key exchange, encryption, MAC, compression, compatibility, security, and more.
pip3 install ssh-audit
Link: ssh-audit
Trino
Trino, a query engine that runs at ludicrous speed. It is a fast distributed SQL query engine for big data analytics that helps you explore your data universe.
- Speed: Trino is a highly parallel and distributed query engine that is built from the ground up for efficient, low latency analytics.
- Scale: The largest organizations in the world use Trino to query exabyte scale data lakes and massive data warehouses alike.
- Simplicity: Trino is an ANSI SQL compliant query engine that works with BI tools such as R, Tableau, Power BI, Superset and many others.
- In-place analysis: You can natively query data in Hadoop, S3, Cassandra, MySQL, and many others, without the need for complex, slow, and error-prone processes for copying the data.
- Runs everywhere: Trino is optimized for both on-premise and cloud environments such as Amazon, Azure, Google Cloud, and others.
Link: Trino
MinIO
MinIO is a high-performance, S3 compatible object store. It is built for large scale AI/ML, data lake and database workloads. It is software-defined and runs on any cloud or on-premises infrastructure.
- Simple: Simplicity is the foundation for exascale data infrastructure, both technically and operationally. No other object store lets you go from download to production in less time.
- High Performance: MinIO is the world's fastest object store, with published GET/PUT results that exceed 325 GiB/sec and 165 GiB/sec, respectively, on 32 nodes of NVMe drives and a 100GbE network.
- Kubernetes-Native: With a native Kubernetes operator integration, MinIO supports all the major Kubernetes distributions on public, private and edge clouds.
Link: MinIO