CDC vs SCD: Key Differences, Use Cases & When to Use Each
When working with data pipelines and warehouses, two important concepts often come up: Change Data Capture (CDC) and Slowly Changing Dimensions (SCD). While they both deal with handling data changes, they serve different purposes. In this article, we will explore their differences and how they complement each other.
What is Change Data Capture (CDC)?
Change Data Capture (CDC) is a technique for identifying and capturing changes in a source database, so they can be replicated elsewhere, such as a data warehouse. CDC is crucial for real-time and incremental data processing, reducing the need to reload entire datasets.
Key Features of CDC
- Tracks all changes (INSERT, UPDATE, DELETE) in the source system.
- Works across all types of tables (fact, dimension, etc.).
- Used for real-time or batch ETL processes.
- Commonly implemented using:
- Log-based CDC (reads from database logs).
- Trigger-based CDC (uses database triggers).
- Timestamp-based CDC (compares timestamps).
- Versioning-based CDC (stores historical records in the source table).
What is Slowly Changing Dimensions (SCD)?
Slowly Changing Dimensions (SCD) refers to techniques used in data warehouses to handle changes in dimension tables while preserving historical accuracy. Unlike CDC, which captures every change, SCD focuses on tracking changes in descriptive attributes that evolve over time.
Types of SCD
There are several types of SCD, but the most common ones are:
- SCD Type 1: Overwrites old data (no history maintained).
- SCD Type 2: Retains history by adding a new record for each change (e.g., a new row with an effective date or version number).
- SCD Type 3: Stores only a limited history (e.g., keeping the current and previous value in separate columns).
CDC vs. SCD: A Side-by-Side Comparison
| Feature | Change Data Capture (CDC) | Slowly Changing Dimensions (SCD) |
|---|---|---|
| Purpose | Captures all changes in a source system. | Manages historical changes in dimension tables. |
| Scope | Applies to all tables (fact and dimension). | Applies only to dimension tables in a warehouse. |
| Types of Changes | Detects INSERT, UPDATE, DELETE. | Tracks changes in descriptive attributes (e.g., customer address). |
| Data Retention | Used for real-time replication or incremental loads. | Can overwrite, store limited history, or keep full history. |
| Common Uses | Real-time ETL, replication, streaming, auditing. | Data warehouse history tracking, reporting, analytics. |
| Implementation Methods | Log-based, trigger-based, timestamp-based, versioning. | SCD Type 1 (overwrite), Type 2 (new row), Type 3 (limited history). |
| Example Scenario | Detecting changes in an employee’s shift records. | Keeping track of an employee’s job title changes over time. |
How CDC and SCD Work Together
Although CDC and SCD serve different functions, they often work together in data pipelines. For example:
- CDC detects changes in a transactional database (OLTP) and sends them to the data warehouse.
- The data warehouse applies SCD techniques to store and manage historical data in dimension tables.
Example Use Case: Cleaning Company Workforce Monitoring System
- CDC is used to track changes in an OLTP PostgreSQL database, capturing modifications in shift records, payroll, and employee details.
- SCD Type 2 is implemented in the warehouse to preserve the history of employee records, ensuring past job titles, pay grades, and locations remain accessible.
SQL Implementation Examples
SCD Type 1: Overwrite (No History)
Type 1 simply overwrites the old value. Use when historical tracking isn't needed.
-- SCD Type 1: Update customer address (no history kept)
UPDATE dim_customer
SET
address = src.new_address,
city = src.new_city,
updated_at = CURRENT_TIMESTAMP
FROM staging_customer src
WHERE dim_customer.customer_id = src.customer_id
AND (dim_customer.address != src.new_address
OR dim_customer.city != src.new_city);
SCD Type 2: Version Rows (Full History)
Type 2 creates a new row for each change, preserving complete history.
-- SCD Type 2 table structure
CREATE TABLE dim_employee (
employee_sk INT PRIMARY KEY, -- Surrogate key
employee_id INT, -- Natural/business key
name VARCHAR(100),
job_title VARCHAR(100),
department VARCHAR(100),
effective_date DATE,
end_date DATE,
is_current BOOLEAN DEFAULT TRUE
);
-- Step 1: Expire old records
UPDATE dim_employee
SET
end_date = CURRENT_DATE - INTERVAL '1 day',
is_current = FALSE
WHERE employee_id IN (
SELECT employee_id FROM staging_employee
WHERE staging_employee.job_title != dim_employee.job_title
)
AND is_current = TRUE;
-- Step 2: Insert new version
INSERT INTO dim_employee (employee_id, name, job_title, department, effective_date, end_date, is_current)
SELECT
s.employee_id,
s.name,
s.job_title,
s.department,
CURRENT_DATE,
'9999-12-31',
TRUE
FROM staging_employee s
JOIN dim_employee d ON s.employee_id = d.employee_id
WHERE d.is_current = FALSE
AND d.end_date = CURRENT_DATE - INTERVAL '1 day';
Pro tip: Use hash comparisons (
MD5/SHA) on SCD2 columns to efficiently detect changes in ETL pipelines.
Frequently Asked Questions
When should I use CDC vs SCD?
Use CDC when: You need real-time or near real-time data replication, streaming analytics, or efficient incremental loads from OLTP systems.
Use SCD when: You need to track historical changes in dimension tables for reporting, compliance, or trend analysis.
Most pipelines use both: CDC captures changes from source systems, then SCD techniques store them appropriately in the warehouse.
What's the difference between SCD Type 1, 2, and 3?
Type 1 (Overwrite): No history kept. Use for corrections or non-critical attributes.
Type 2 (New Rows): Full history preserved. Use for audit trails, compliance, and trend analysis.
Type 3 (Limited History): Only stores previous + current value. Use when only one prior value matters.
Can I use CDC with Snowflake, BigQuery, or Databricks?
Yes. Modern platforms support CDC natively:
Snowflake: Streams and Tasks for CDC, Time Travel for point-in-time queries.
BigQuery: Change history with _CHANGE_TYPE column, MERGE statements.
Databricks: Delta Lake Change Data Feed (CDF).
Why do SCD Type 2 tables need surrogate keys?
Because the same business entity (e.g., employee_id = 123) can have multiple rows representing different time periods. The surrogate key (employee_sk) uniquely identifies each version, while the natural key identifies the business entity across all versions.
Conclusion
Both Change Data Capture (CDC) and Slowly Changing Dimensions (SCD) are essential for handling data changes but serve distinct purposes:
- CDC ensures that all changes are captured efficiently for real-time processing.
- SCD ensures that historical data is preserved for analytical and reporting needs.
By using CDC to extract changes from OLTP systems and SCD to structure historical data in a warehouse, organisations can build robust data architectures that support both real-time and analytical requirements.
Post Tags: