Data Engineering Best Practices
This page is still W.I.P and solely exists for my own reference.
Avoid Multiple SparkSessions and SparkContexts
Creating multiple SparkSessions and SparkContexts can cause problems. It's a best practice to use the SparkSession.builder.getOrCreate() method, which returns the existing SparkSession if one is already running, or creates a new one if needed.
```python
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create spark_session, reusing an existing session if there is one
spark_session = SparkSession.builder.getOrCreate()

# Print spark_session
print(spark_session)
```
BigQuery Best Practices
Cost reduction
- Avoid `SELECT *`
- Price your queries before running them (e.g., use a dry run to estimate bytes processed)
- Use clustered or partitioned tables (see the sketch after this list)
- Use streaming inserts with caution
- Materialise query results in stages
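As a rough sketch of the clustered/partitioned point above, the DDL below creates a date-partitioned table clustered on a frequently filtered column. The project, dataset, table, and column names are made up for illustration.

```sql
-- Hypothetical names; swap in your own project, dataset, and columns.
-- Partitioning and clustering let BigQuery prune data (and cost) for
-- queries that filter on event_timestamp and customer_id.
CREATE TABLE `my_project.analytics.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id
AS
SELECT
  event_id,
  customer_id,
  event_timestamp,
  payload
FROM `my_project.raw_layer.events_staging`;
```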
Query performance
- Filter on partitioned columns
- Denormalise data where appropriate
- Use nested or repeated columns
- Use external data sources appropriately (avoid them when query performance is a priority)
- Reduce data before using a `JOIN` (see the sketch after this list)
- Do not treat `WITH` clauses as prepared statements
- Avoid oversharding tables
- Avoid JavaScript user-defined functions
- Use approximate aggregation functions (HyperLogLog++)
- Apply `ORDER BY` last, in the outermost query, to maximise performance
- Optimise your join patterns
- As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
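A hedged sketch of the join-related points above, with invented table and column names: data is filtered and aggregated before the `JOIN`, the larger table is referenced first, and an approximate (HyperLogLog++-based) aggregation replaces an exact distinct count.

```sql
-- Hypothetical tables and columns, for illustration only.
WITH recent_orders AS (
  SELECT
    customer_id,
    APPROX_COUNT_DISTINCT(order_id) AS approx_orders   -- HyperLogLog++ under the hood
  FROM `my_project.sales.orders`                        -- largest table, placed first
  WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)  -- reduce data early
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.region,
  o.approx_orders
FROM recent_orders AS o
JOIN `my_project.sales.customers` AS c                  -- smaller dimension table
  ON c.customer_id = o.customer_id;
```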
dbt Best Practices Checklist
Model & Query Structure
✅ Replace hard-coded table names with `source()` or `ref()` → Ensures modularity, dynamic references, and better lineage (see the sketch after this checklist).
✅ Refactor queries based on a style guide (e.g., imports, CTEs, formatting) → Improves readability and maintainability.
✅ Move any source-centric transformations to the staging layer → Prevents duplication, maintains modularity.
✅ Break out staging models with joins into separate 1:1 models → Improves DAG readability and modular design.
✅ Ensure each model has a clear purpose → Staging models should clean data; marts should aggregate for reporting.
✅ Review the DAG to prevent circular dependencies in marts → Avoids difficult-to-debug pipelines.
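As a minimal sketch of the `source()`/`ref()` and staging-layer points, assuming a source named `shop` with a `raw_orders` table (both invented for this example):

```sql
-- models/staging/stg_orders.sql  (hypothetical file, source, and column names)
with source as (
    -- dynamic reference to the raw table instead of a hard-coded name
    select * from {{ source('shop', 'raw_orders') }}
),

renamed as (
    select
        id                      as order_id,
        customer_id,
        cast(amount as numeric) as order_amount,
        created_at              as ordered_at
    from source
)

select * from renamed
```

A downstream mart would then select from `{{ ref('stg_orders') }}` rather than the raw table, which is what gives dbt its lineage graph and DAG.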
Debugging & Issue Resolution
✅ Check the `/target/` folder when models fail → View the compiled SQL for easier debugging (see the sketch after this checklist).
✅ Debug macros by isolating logic, printing logs, or running snippets → Helps pinpoint issues efficiently.
✅ Use `dbt debug` before running models → Ensures database connection and configuration are correct.
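For the `/target/` point, this is roughly what a compiled model looks like (illustrative only; the exact path and relation names depend on your project and profile). The `ref()` call has been resolved to a concrete relation, so the SQL can be pasted straight into a warehouse console for debugging.

```sql
-- target/compiled/my_project/models/marts/fct_orders.sql  (illustrative)
select
    customer_id,
    count(order_id) as order_count
from "analytics"."dbt_dev"."stg_orders"   -- {{ ref('stg_orders') }}, resolved
group by customer_id
```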
Performance Optimisation (For Big Data & PostgreSQL)
✅ Use `is_incremental()` for large tables instead of full refreshes → Reduces load, optimises compute costs (see the sketch after this checklist).
✅ Partition large tables (if supported by the warehouse) → Improves query performance.
✅ Leverage indexes on primary keys & frequently queried columns → PostgreSQL benefits from well-defined indexes.
✅ Prefer `table` materialisation over `view` for expensive transformations → Reduces computation overhead.
✅ Ensure proper clustering & ordering of data for faster retrieval → Especially for frequently queried columns.
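A minimal `is_incremental()` sketch with hypothetical model and column names: on the first run (or with `--full-refresh`) the `where` clause is skipped, and on subsequent runs only rows newer than those already loaded are processed.

```sql
-- models/marts/fct_events.sql  (hypothetical incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    customer_id,
    event_timestamp,
    payload
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only pick up rows newer than what is already in the target table
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```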
Data Quality & Testing (Great Expectations & dbt Tests)
✅ Define `unique` and `not_null` tests in `schema.yml` → Ensures data integrity and quality (see the sketch after this checklist).
✅ Use `dbt test` to validate relationships & referential integrity → Avoids orphaned or duplicate records.
✅ Add column-level constraints where applicable (e.g., data types, ranges) → Prevents unexpected data issues.
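A small `schema.yml` sketch for the testing points above (model and column names are placeholders); `dbt test` will run the `unique`, `not_null`, and `relationships` checks defined here.

```yaml
# models/staging/schema.yml  (hypothetical models and columns)
version: 2

models:
  - name: stg_orders
    description: "One cleaned row per order from the raw source"
    columns:
      - name: order_id
        description: "Primary key of the order"
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```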
dbt Configuration & Best Practices
✅ Move model-specific configs from `dbt_project.yml` to individual models (if needed) → Reduces project-level clutter.
✅ Use `config()` blocks for clear, inline model settings → Makes it easier to track per-model configurations (see the sketch after this checklist).
✅ Define meaningful descriptions in `schema.yml` for documentation → Helps maintain clarity on model usage.
✅ Use `meta` fields for additional metadata (e.g., owner, SLA) → Useful for governance and documentation.
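A sketch of an inline `config()` block with `meta` fields (the model name, owner, and SLA values are invented); keeping this next to the model makes its materialisation and ownership visible without digging through `dbt_project.yml`. Setting `meta` inside `config()` assumes a reasonably recent dbt version.

```sql
-- models/marts/dim_customers.sql  (hypothetical model)
{{ config(
    materialized='table',
    tags=['daily'],
    meta={'owner': 'data-engineering', 'sla': '24h'}
) }}

select * from {{ ref('stg_customers') }}
```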
Monitoring & Logging
✅ Enable dbt logging (see `logs/dbt.log`) to track run issues → Helps diagnose failures faster.
✅ Regularly check `dbt source freshness` for data latency issues → Ensures upstream sources are updated as expected (see the sketch after this checklist).
✅ Automate dbt runs with scheduling (e.g., Airflow, dbt Cloud) → Ensures timely updates and monitoring.
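A sketch of a source freshness configuration (source, table, and column names are placeholders); `dbt source freshness` compares the `loaded_at_field` against these thresholds and warns or errors when data is stale.

```yaml
# models/staging/sources.yml  (hypothetical source definition)
version: 2

sources:
  - name: shop
    schema: raw
    tables:
      - name: raw_orders
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```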
Why These Are dbt Best Practices
These practices follow dbt's core philosophy of:
- Modularity → Keeping transformations in the right layers (`staging`, `marts`, etc.).
- Maintainability → Using style guides, removing hard-coded values, and structuring queries well.
- Performance Optimisation → Leveraging `is_incremental()`, indexing, materialisation strategies.
- Data Quality → Using `dbt test`, schema validation, and referential integrity.
- Debugging & Monitoring → Checking `/target/`, logging, DAG review, `dbt source freshness`.
These are widely recommended in the dbt community and by teams managing large-scale production dbt pipelines.