
Data Engineering Toolchain

This article is still a work in progress and exists solely for my own purposes.

Welcome to my "Data Engineering toolchain", a curated collection of invaluable tools, links and resources gathered throughout my journey.

The Collection of Tools That I Use and Swear By

graph LR
    A[1. Data processors] --> B1[Python]
    A --> B2[SQL]
    A --> B3[Spark]
    A --> B4[Snowflake]
    A --> B5[Flink]

    C[2. Code testing] --> D1[Pytest]

    E[3. Data quality] --> F1[Great Expectations]

    G[4. Data visualization] --> H1[Metabase]

    I[5. Orchestration] --> J1[Airflow]
    I --> J2[dbt]
    I --> J3[Databricks]

    K[6. Monitoring & alerting] --> L1[Prometheus]
    K --> L2[Grafana]

    M[7. Data infrastructure] --> N1[Docker]
    M --> N2[Terraform]
    M --> N3[AWS]

    O[8. Code hosting & CI/CD] --> P1[GitHub]

👇 List of lesser-known but useful tools I use and like 👇

Cronitor

The quick and simple editor for cron schedule expressions. Cronitor is my go-to tool when scheduling DAG (Directed Acyclic Graph) runs, because cron schedule expressions are difficult to understand and configure.
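
For instance, the expression "0 6 * * 1-5" fires at 06:00 on weekdays, which is much easier to verify in Cronitor than to work out by hand. Below is a minimal, hypothetical Airflow 2.x DAG sketch that uses such an expression (the dag_id and task are placeholders; older Airflow versions take the expression via schedule_interval rather than schedule):

    # Hypothetical Airflow 2.x DAG scheduled with a cron expression checked in Cronitor.
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python import PythonOperator


    def extract():
        print("running the daily extract")


    with DAG(
        dag_id="daily_extract",          # placeholder name
        schedule="0 6 * * 1-5",          # 06:00 on weekdays; use schedule_interval on Airflow < 2.4
        start_date=datetime(2024, 1, 1),
        catchup=False,
    ) as dag:
        PythonOperator(task_id="extract", python_callable=extract)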

Link: Cronitor

ntfy (pronounced notify)

ntfy (pronounced notify) is a simple HTTP-based pub-sub notification service. It allows you to send notifications to your phone or desktop via scripts from any computer, and/or using a REST API. It's infinitely flexible, and 100% free software.
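
For example, here is a minimal Python sketch that publishes a message to a hypothetical topic on the public ntfy.sh server (the Title, Priority and Tags headers are optional):

    # Minimal sketch: publish a notification to a hypothetical ntfy topic.
    import requests

    requests.post(
        "https://ntfy.sh/my-pipeline-alerts",   # placeholder topic name
        data="Nightly load finished successfully",
        headers={"Title": "ETL done", "Priority": "default", "Tags": "tada"},
        timeout=10,
    )

Anyone subscribed to that topic in the ntfy app or a browser receives the message.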

Link: ntfy

DevUtils

Offline development toolkit. Whether you need to format JSON, debug a JWT token, or convert a UNIX timestamp, DevUtils can handle any of these in just one click. Get access to a huge set of development utilities that all work offline.

I can't recommend this tool enough. You have to see it to believe it. You will never regret using DevUtils.

Link: DevUtils

Peewee

The Python standard library ships low-level database modules such as sqlite3, but no built-in ORM, which has long pushed the community toward third-party options. There are boatloads of ORMs and query builders for Python. SQLAlchemy or Django ORM, ringing any bells?

Personally, I prefer using Peewee as my go-to ORM for Python.
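
A minimal sketch of what working with Peewee looks like (the model and the SQLite file are hypothetical):

    # Minimal Peewee sketch: define a model, create its table, insert and query.
    from peewee import CharField, IntegerField, Model, SqliteDatabase

    db = SqliteDatabase("pipeline.db")       # placeholder database file


    class JobRun(Model):                     # hypothetical model
        name = CharField()
        rows_loaded = IntegerField(default=0)

        class Meta:
            database = db


    db.connect()
    db.create_tables([JobRun])

    JobRun.create(name="daily_extract", rows_loaded=1234)
    for run in JobRun.select().where(JobRun.rows_loaded > 1000):
        print(run.name, run.rows_loaded)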

Link: Peewee

Mockingbird

Generate mock data streams for your next data project. It is as easy as 1-2-3 (a rough Python sketch of the same idea follows the steps below).

  1. Connect
    Select a destination and configure your project settings.

  2. Build the schema
    Define the structure of your data from scratch or using a template.

  3. Generate data
    Mockingbird will generate and stream events to your selected destination.
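
The sketch below is not Mockingbird's API; it is just a rough Python illustration of the same idea: derive events from a schema and stream them to a destination endpoint (placeholder URL).

    # Generic illustration of schema-driven mock event streaming (not Mockingbird's API).
    import json
    import random
    import time
    import uuid

    import requests

    SCHEMA = {
        "user_id": lambda: str(uuid.uuid4()),
        "amount": lambda: round(random.uniform(1, 500), 2),
        "ts": lambda: int(time.time()),
    }
    DESTINATION = "https://example.com/ingest"   # placeholder destination

    for _ in range(10):
        event = {field: generate() for field, generate in SCHEMA.items()}
        requests.post(
            DESTINATION,
            data=json.dumps(event),
            headers={"Content-Type": "application/json"},
            timeout=10,
        )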

Link: Mockingbird

novu

novu is an open-source notification infrastructure for developers. novu consists of simple components and APIs for managing all communication channels in one place: Email, SMS, Direct, and Push.

Build a real-time notification centre using novu's embeddable components or connect your custom UI with novu's notification feed API.

npx novu init
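
Once a workflow exists, triggering a notification is a single API call. A rough Python sketch, assuming novu's v1 events/trigger REST endpoint, with placeholder workflow, subscriber and API key values:

    # Rough sketch: trigger a novu workflow over its REST API (assumed v1 events/trigger endpoint).
    import requests

    requests.post(
        "https://api.novu.co/v1/events/trigger",
        headers={"Authorization": "ApiKey <NOVU_API_KEY>"},   # placeholder API key
        json={
            "name": "pipeline-failed",                        # hypothetical workflow trigger id
            "to": {"subscriberId": "data-team"},              # hypothetical subscriber
            "payload": {"dag": "daily_extract", "error": "timeout"},
        },
        timeout=10,
    )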

Link: novu

micro

Micro is a modern and intuitive terminal-based text editor. It serves as an alternative to the nano editor. Whenever I am required to modify a configuration file or create a script on a VPS, I rely on micro as my go-to tool.

You can run a real interactive shell from within micro, for example a split with code on one side and bash on the other. Micro also supports Sublime-style multiple cursors, bringing serious editing power to the terminal.

Link: micro

Poetry

Poetry helps you declare, manage and install dependencies of Python projects, ensuring you have the right stack everywhere. Poetry simplifies packaging and managing dependencies in Python.

Poetry allows you to declare the libraries your project depends on and it will manage (install/update) them for you. Poetry offers a lockfile to ensure repeatable installs, and can build your project for distribution.

Poetry replaces setup.py, requirements.txt, setup.cfg, MANIFEST.in and Pipfile with a simple pyproject.toml based project format.

Poetry requires Python 3.8+. It is multi-platform and the goal is to make it work equally well on Linux, macOS and Windows.

Link: Poetry

KubeBlocks

Building data infrastructure on K8s is becoming increasingly popular. However, the most prominent obstacles are the difficulty of integrating with cloud providers, the lack of reliable operators, and the steep learning curve of K8s.

KubeBlocks helps developers and platform engineers manage database workloads (MySQL, PostgreSQL, Redis, MongoDB, Kafka and vector databases) on K8s inside your own cloud account. It supports multiple clouds, including AWS, Azure, GCP, and Alibaba Cloud.

When adopting a multi-cloud or hybrid cloud strategy, it is essential to prioritize application portability and use software or services that offer consistent functionality across different infrastructures.

KubeBlocks offers an open-source option that helps application developers and platform engineers set up feature-rich services for RDBMS, NoSQL, streaming and analytical systems. No need to be a K8s professional: anyone can set up a full-stack, production-ready data infrastructure in minutes.

Link: KubeBlocks

Protobuf

Protobuf, short for Protocol Buffers, is a Google project designed to simplify and enhance the serialisation of structured data. This makes it easier to transmit the data over a wire or store it in files.

Compared to other serialisation methods, Protobuf is better suited to scenarios where data is transmitted between multiple microservices in a platform-neutral way. Because of this, many developers prefer Protobuf for data serialisation. (A small Python sketch follows the key features below.)

Key features

Binary transfer format

In Protobuf, data is transmitted in a compact binary format, which allows for faster transmission thanks to smaller payloads and lower bandwidth usage.

Separation of context and data

Protobuf separates the context (the schema defined in a .proto file) from the data itself, so field names never travel with the message; JSON and XML, by contrast, repeat that context in every payload.

Message format

The message format is compact and optimized for fast encoding and decoding.
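
To make this concrete, suppose a schema such as message Event { string name = 1; int64 ts = 2; } is compiled with protoc into a Python module (conventionally event_pb2). Serialising and parsing would then look roughly like this:

    # Sketch of Protobuf serialisation in Python; event_pb2 is the module protoc would
    # generate from a hypothetical schema: message Event { string name = 1; int64 ts = 2; }
    import event_pb2

    event = event_pb2.Event(name="page_view", ts=1700000000)
    payload = event.SerializeToString()    # compact binary; no field names on the wire

    decoded = event_pb2.Event()
    decoded.ParseFromString(payload)       # the .proto schema supplies the context
    print(decoded.name, decoded.ts)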

Link: Protobuf

Cockpit

The easy-to-use, integrated, glanceable, and open web-based graphical interface for your servers.

Cockpit makes Linux discoverable. You don’t have to remember commands at the command line.

See your server in a web browser and perform system tasks with a mouse. It’s easy to start containers, administer storage, configure networks, and inspect logs. Basically, you can think of Cockpit like a graphical “desktop interface”, but for individual servers.

Cockpit is intended for everyone, especially those who are:

  • new to Linux
    (including Windows admins)

  • familiar with Linux
    and want an easy, graphical way to administer servers

  • expert admins
    who mainly use other tools but want an overview on individual systems

Link: Cockpit

ssh-audit

ssh-audit is a tool for ssh server & client configuration auditing.

SSH server & client security auditing (banner, key exchange, encryption, mac, compression, compatibility, security, etc)

pip3 install ssh-audit

Link: ssh-audit

Trino

Trino is a query engine that runs at ludicrous speed: a fast, distributed SQL query engine for big data analytics that helps you explore your data universe. (A minimal Python client sketch follows the feature list below.)

  • Speed
    Trino is a highly parallel and distributed query engine that is built from the ground up for efficient, low-latency analytics.

  • Scale
    The largest organizations in the world use Trino to query exabyte scale data lakes and massive data warehouses alike.

  • Simplicity
    Trino is an ANSI SQL-compliant query engine that works with BI tools such as R, Tableau, Power BI, Superset and many others.

  • In-place analysis
    You can natively query data in Hadoop, S3, Cassandra, MySQL, and many others, without the need for complex, slow, and error-prone processes for copying the data.

  • Runs everywhere
    Trino is optimized for both on-premise and cloud environments such as Amazon, Azure, Google Cloud, and others.
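
A minimal sketch of querying Trino from Python, assuming the trino client package and placeholder connection details:

    # Minimal sketch using the trino Python client (placeholder host, catalog, schema, query).
    import trino

    conn = trino.dbapi.connect(
        host="trino.example.com",   # placeholder coordinator host
        port=8080,
        user="analyst",
        catalog="hive",
        schema="default",
    )
    cur = conn.cursor()
    cur.execute("SELECT event_name, count(*) FROM events GROUP BY event_name LIMIT 10")
    for row in cur.fetchall():
        print(row)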

Link: Trino

MinIO

MinIO is a high-performance, S3-compatible object store. It is built for large-scale AI/ML, data lake and database workloads. It is software-defined and runs on any cloud or on-premises infrastructure. (A minimal Python client sketch follows the feature list below.)

  • Simple
    Simplicity is the foundation for exascale data infrastructure - both technically and operationally. No other object store lets you go from download to production in less time.

  • High Performance
    MinIO is the world's fastest object store, with published GET/PUT results that exceed 325 GiB/s and 165 GiB/s respectively on 32 nodes of NVMe drives and a 100 GbE network.

  • Kubernetes-Native
    With a native Kubernetes operator integration, MinIO supports all the major Kubernetes distributions on public, private and edge clouds.
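
A minimal sketch using the minio Python SDK, with placeholder endpoint, credentials and bucket:

    # Minimal sketch using the minio Python SDK (placeholder endpoint, credentials, bucket).
    from minio import Minio

    client = Minio(
        "minio.example.com:9000",    # placeholder endpoint
        access_key="<ACCESS_KEY>",
        secret_key="<SECRET_KEY>",
        secure=True,
    )

    if not client.bucket_exists("raw-events"):
        client.make_bucket("raw-events")

    client.fput_object("raw-events", "2024/01/01/events.parquet", "events.parquet")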

Link: MinIO
