Monitoring Row Counts Across Multiple Tables in SQL
Introduction
When working with large datasets, it is crucial to keep track of table sizes to monitor data growth, detect anomalies, and optimise queries. One of the easiest ways to get an overview of the number of rows in each table is by using a simple SQL query that counts rows across multiple tables at once.
In this article, we'll walk through an efficient SQL query that retrieves row counts for multiple tables and sorts them in descending order, making it easy to analyse data distribution. We focus on PostgreSQL and BigQuery.
The Query
The following SQL query uses UNION ALL
to combine row count queries from multiple tables into a single result set.
PostgreSQL Example:
SELECT 'yellow_tripdata' AS table_name, COUNT(*) AS row_count
FROM your_table_schema.yellow_tripdata
UNION ALL
SELECT 'stg_yellow_tripdata', COUNT(*)
FROM your_table_schema.stg_yellow_tripdata
UNION ALL
SELECT 'green_tripdata', COUNT(*)
FROM your_table_schema.green_tripdata
UNION ALL
SELECT 'stg_green_tripdata', COUNT(*)
FROM your_table_schema.stg_green_tripdata
UNION ALL
SELECT 'fhv_tripdata', COUNT(*)
FROM your_table_schema.fhv_tripdata
UNION ALL
SELECT 'stg_fhv_tripdata', COUNT(*)
FROM your_table_schema.stg_fhv_tripdata
UNION ALL
SELECT 'dim_fhv_trips', COUNT(*)
FROM your_table_schema.dim_fhv_trips
UNION ALL
SELECT 'dim_trips_unioned', COUNT(*)
FROM your_table_schema.dim_trips_unioned
UNION ALL
SELECT 'fact_trips', COUNT(*)
FROM your_table_schema.fact_trips
UNION ALL
SELECT 'dim_monthly_zone_revenue', COUNT(*)
FROM your_table_schema.dim_monthly_zone_revenue
UNION ALL
SELECT 'dim_zones', COUNT(*)
FROM your_table_schema.dim_zones
UNION ALL
SELECT 'taxi_zone_lookup', COUNT(*)
FROM your_table_schema.taxi_zone_lookup
ORDER BY row_count DESC;
BigQuery Example:
SELECT 'yellow_tripdata' AS table_name, COUNT(*) AS row_count
FROM `PROJECT_ID.DATASET_ID.yellow_tripdata`
UNION ALL
SELECT 'stg_yellow_tripdata', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.stg_yellow_tripdata`
UNION ALL
SELECT 'green_tripdata', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.green_tripdata`
UNION ALL
SELECT 'stg_green_tripdata', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.stg_green_tripdata`
UNION ALL
SELECT 'fhv_tripdata', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.fhv_tripdata`
UNION ALL
SELECT 'stg_fhv_tripdata', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.stg_fhv_tripdata`
UNION ALL
SELECT 'dim_fhv_trips', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.dim_fhv_trips`
UNION ALL
SELECT 'dim_trips_unioned', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.dim_trips_unioned`
UNION ALL
SELECT 'fact_trips', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.fact_trips`
UNION ALL
SELECT 'dim_monthly_zone_revenue', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.dim_monthly_zone_revenue`
UNION ALL
SELECT 'dim_zones', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.dim_zones`
UNION ALL
SELECT 'taxi_zone_lookup', COUNT(*)
FROM `PROJECT_ID.DATASET_ID.taxi_zone_lookup`
ORDER BY row_count DESC;
Sample Output
After executing the query, you will get a result similar to the following:
Row | Table Name | Row Count |
---|---|---|
1 | dim_trips_unioned | 114,562,355 |
2 | fact_trips | 113,073,110 |
3 | yellow_tripdata | 109,047,518 |
4 | stg_yellow_tripdata | 107,749,398 |
5 | fhv_tripdata | 43,244,696 |
6 | stg_fhv_tripdata | 43,136,433 |
7 | dim_fhv_trips | 22,968,391 |
8 | green_tripdata | 7,778,101 |
9 | stg_green_tripdata | 6,812,957 |
10 | dim_monthly_zone_revenue | 12,053 |
11 | taxi_zone_lookup | 265 |
12 | dim_zones | 263 |
Explanation
1. Using COUNT(*)
For each table, COUNT(*)
retrieves the total number of rows. This is useful for understanding table size and detecting anomalies (e.g., unexpectedly empty or excessively large tables).
2. Using UNION ALL
The UNION ALL
operator is used instead of UNION
to avoid duplicate removal, making the query more efficient.
3. Assigning Table Names
Each subquery includes a manually assigned table name ('table_name' AS table_name
) to label the output.
4. Sorting Results
The final result is sorted in descending order by row_count
to highlight the largest tables first.
Why This Query is Useful
- Quick Insights: Provides an instant overview of table sizes.
- Performance Monitoring: Helps detect anomalies in data ingestion.
- Optimisation: Identifies large tables that may need indexing or partitioning.
Alternative Approach: Using Information Schema
Why Use INFORMATION_SCHEMA in BigQuery?
In BigQuery, querying INFORMATION_SCHEMA
can be a cost-effective alternative to using COUNT(*)
, especially for large tables. Since BigQuery charges based on the amount of data processed, running a COUNT(*)
query on a large dataset can become expensive as it scans all rows in the table. On the other hand, INFORMATION_SCHEMA.TABLE_STORAGE
provides metadata-based row counts without scanning the table's data, resulting in faster and cheaper queries.
However, note the trade-off:
INFORMATION_SCHEMA
row counts may not always be up-to-date, as metadata refresh rates can cause a slight delay.- If you require real-time accuracy, using
COUNT(*)
might still be necessary.
Instead of manually listing tables, you can dynamically query row counts using information_schema.tables
:
PostgreSQL Alternative:
SELECT relname AS table_name,
n_live_tup AS row_count
FROM pg_stat_user_tables
ORDER BY row_count DESC;
BigQuery Alternative:
SELECT table_id AS table_name,
row_count
FROM `PROJECT_ID.DATASET_ID.__TABLES__`
ORDER BY row_count;
SELECT table_name,
total_rows
FROM `PROJECT_ID.region-REGION`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY total_rows DESC;
Example usage for a London-based dataset:
SELECT table_name,
total_rows
FROM `dbt-sandbox-451313.region-europe-west2`.INFORMATION_SCHEMA.TABLE_STORAGE
ORDER BY total_rows DESC;
Conclusion
This query provides an efficient way to track row counts across multiple tables in PostgreSQL and BigQuery. By using UNION ALL
, manual table naming, and sorting, you can quickly identify trends, anomalies, and optimisation opportunities.
Try running this query in your database and see how your table sizes compare!
Post Tags: