Data Engineering Best Practices
This page is still W.I.P and solely exists for my own reference.
Avoid Multiple SparkSessions and SparkContexts
Creating multiple SparkSessions and SparkContexts can cause problems. It's a best practice to use the SparkSession.builder.getOrCreate() method, which returns the existing SparkSession if one is already running, or creates a new one if needed.
```python
# Import SparkSession from pyspark.sql
from pyspark.sql import SparkSession

# Create spark_session, reusing an existing session if there is one
spark_session = SparkSession.builder.getOrCreate()

# Print spark_session
print(spark_session)
```
BigQuery Best Practices
Cost reduction
- Avoid `SELECT *`
- Price your queries before running them (e.g., use a dry run to estimate bytes processed)
- Use clustered or partitioned tables (see the sketch after this list)
- Use streaming inserts with caution
- Materialise query results in stages
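As a rough sketch of the clustered/partitioned point above, the DDL below creates a date-partitioned table clustered on a frequently filtered column. The project, dataset, table, and column names are made up for illustration.

```sql
-- Hypothetical names; swap in your own project, dataset, and columns.
-- Partitioning and clustering let BigQuery prune data (and cost) for
-- queries that filter on event_timestamp and customer_id.
CREATE TABLE `my_project.analytics.events`
PARTITION BY DATE(event_timestamp)
CLUSTER BY customer_id
AS
SELECT
  event_id,
  customer_id,
  event_timestamp,
  payload
FROM `my_project.raw_layer.events_staging`;
```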
Query performance
- Filter on partitioned columns
- Denormalise data where appropriate
- Use nested or repeated columns
- Use external data sources appropriately (avoid them when query performance is a priority)
- Reduce data before using a `JOIN` (see the sketch after this list)
- Do not treat `WITH` clauses as prepared statements
- Avoid oversharding tables
- Avoid JavaScript user-defined functions
- Use approximate aggregation functions (HyperLogLog++)
- Apply `ORDER BY` last, in the outermost query, to maximise performance
- Optimise your join patterns
- As a best practice, place the table with the largest number of rows first, followed by the table with the fewest rows, and then place the remaining tables by decreasing size.
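A hedged sketch of the join-related points above, with invented table and column names: data is filtered and aggregated before the `JOIN`, the larger table is referenced first, and an approximate (HyperLogLog++-based) aggregation replaces an exact distinct count.

```sql
-- Hypothetical tables and columns, for illustration only.
WITH recent_orders AS (
  SELECT
    customer_id,
    APPROX_COUNT_DISTINCT(order_id) AS approx_orders   -- HyperLogLog++ under the hood
  FROM `my_project.sales.orders`                        -- largest table, placed first
  WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)  -- reduce data early
  GROUP BY customer_id
)
SELECT
  c.customer_id,
  c.region,
  o.approx_orders
FROM recent_orders AS o
JOIN `my_project.sales.customers` AS c                  -- smaller dimension table
  ON c.customer_id = o.customer_id;
```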
dbt Best Practices Checklist
Model & Query Structure
✅ Replace hard-coded table names with `source()` or `ref()` → Ensures modularity, dynamic references, and better lineage (see the sketch after this checklist).
✅ Refactor queries based on a style guide (e.g., imports, CTEs, formatting) → Improves readability and maintainability.
✅ Move any source-centric transformations to the staging layer → Prevents duplication, maintains modularity.
✅ Break out staging models with joins into separate 1:1 models → Improves DAG readability and modular design.
✅ Ensure each model has a clear purpose → Staging models should clean data; marts should aggregate for reporting.
✅ Review the DAG to prevent circular dependencies in marts → Avoids difficult-to-debug pipelines.
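As a minimal sketch of the `source()`/`ref()` and staging-layer points, assuming a source named `shop` with a `raw_orders` table (both invented for this example):

```sql
-- models/staging/stg_orders.sql  (hypothetical file, source, and column names)
with source as (
    -- dynamic reference to the raw table instead of a hard-coded name
    select * from {{ source('shop', 'raw_orders') }}
),

renamed as (
    select
        id                      as order_id,
        customer_id,
        cast(amount as numeric) as order_amount,
        created_at              as ordered_at
    from source
)

select * from renamed
```

A downstream mart would then select from `{{ ref('stg_orders') }}` rather than the raw table, which is what gives dbt its lineage graph and DAG.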
Debugging & Issue Resolution
✅ Check the `/target/` folder when models fail → View the compiled SQL for easier debugging (see the sketch after this checklist).
✅ Debug macros by isolating logic, printing logs, or running snippets → Helps pinpoint issues efficiently.
✅ Use `dbt debug` before running models → Ensures database connection and configuration are correct.
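For the `/target/` point, this is roughly what a compiled model looks like (illustrative only; the exact path and relation names depend on your project and profile). The `ref()` call has been resolved to a concrete relation, so the SQL can be pasted straight into a warehouse console for debugging.

```sql
-- target/compiled/my_project/models/marts/fct_orders.sql  (illustrative)
select
    customer_id,
    count(order_id) as order_count
from "analytics"."dbt_dev"."stg_orders"   -- {{ ref('stg_orders') }}, resolved
group by customer_id
```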
Performance Optimisation (For Big Data & PostgreSQL)
✅ Use `is_incremental()` for large tables instead of full refreshes → Reduces load, optimises compute costs (see the sketch after this checklist).
✅ Partition large tables (if supported by the warehouse) → Improves query performance.
✅ Leverage indexes on primary keys & frequently queried columns → PostgreSQL benefits from well-defined indexes.
✅ Prefer `table` materialisation over `view` for expensive transformations → Reduces computation overhead.
✅ Ensure proper clustering & ordering of data for faster retrieval → Especially for frequently queried columns.
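A minimal `is_incremental()` sketch with hypothetical model and column names: on the first run (or with `--full-refresh`) the `where` clause is skipped, and on subsequent runs only rows newer than those already loaded are processed.

```sql
-- models/marts/fct_events.sql  (hypothetical incremental model)
{{ config(
    materialized='incremental',
    unique_key='event_id'
) }}

select
    event_id,
    customer_id,
    event_timestamp,
    payload
from {{ ref('stg_events') }}

{% if is_incremental() %}
  -- only pick up rows newer than what is already in the target table
  where event_timestamp > (select max(event_timestamp) from {{ this }})
{% endif %}
```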
Data Quality & Testing (Great Expectations & dbt Tests)
✅ Define `unique` and `not_null` tests in `schema.yml` → Ensures data integrity and quality (see the sketch after this checklist).
✅ Use `dbt test` to validate relationships & referential integrity → Avoids orphaned or duplicate records.
✅ Add column-level constraints where applicable (e.g., data types, ranges) → Prevents unexpected data issues.
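A small `schema.yml` sketch for the testing points above (model and column names are placeholders); `dbt test` will run the `unique`, `not_null`, and `relationships` checks defined here.

```yaml
# models/staging/schema.yml  (hypothetical models and columns)
version: 2

models:
  - name: stg_orders
    description: "One cleaned row per order from the raw source"
    columns:
      - name: order_id
        description: "Primary key of the order"
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```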
dbt Configuration & Best Practices
✅ Move model-specific configs from `dbt_project.yml` to individual models (if needed) → Reduces project-level clutter.
✅ Use `config()` blocks for clear, inline model settings → Makes it easier to track per-model configurations (see the sketch after this checklist).
✅ Define meaningful descriptions in `schema.yml` for documentation → Helps maintain clarity on model usage.
✅ Use `meta` fields for additional metadata (e.g., owner, SLA) → Useful for governance and documentation.
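A sketch of an inline `config()` block with `meta` fields (the model name, owner, and SLA values are invented); keeping this next to the model makes its materialisation and ownership visible without digging through `dbt_project.yml`. Setting `meta` inside `config()` assumes a reasonably recent dbt version.

```sql
-- models/marts/dim_customers.sql  (hypothetical model)
{{ config(
    materialized='table',
    tags=['daily'],
    meta={'owner': 'data-engineering', 'sla': '24h'}
) }}

select * from {{ ref('stg_customers') }}
```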
Monitoring & Logging
✅ Enable dbt logging (see `logs/dbt.log`) to track run issues → Helps diagnose failures faster.
✅ Regularly check `dbt source freshness` for data latency issues → Ensures upstream sources are updated as expected (see the sketch after this checklist).
✅ Automate dbt runs with scheduling (e.g., Airflow, dbt Cloud) → Ensures timely updates and monitoring.
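A sketch of a source freshness configuration (source, table, and column names are placeholders); `dbt source freshness` compares the `loaded_at_field` against these thresholds and warns or errors when data is stale.

```yaml
# models/staging/sources.yml  (hypothetical source definition)
version: 2

sources:
  - name: shop
    schema: raw
    tables:
      - name: raw_orders
        loaded_at_field: _loaded_at
        freshness:
          warn_after: {count: 12, period: hour}
          error_after: {count: 24, period: hour}
```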
Why These Are dbt Best Practices
These practices follow dbt's core philosophy of:
- Modularity → Keeping transformations in the right layers (`staging`, `marts`, etc.).
- Maintainability → Using style guides, removing hard-coded values, and structuring queries well.
- Performance Optimisation → Leveraging `is_incremental()`, indexing, materialisation strategies.
- Data Quality → Using `dbt test`, schema validation, and referential integrity.
- Debugging & Monitoring → Checking `/target/`, logging, DAG review, `dbt source freshness`.
These are widely recommended in the dbt community and by teams managing large-scale production dbt pipelines.