Skip to main content

Learn to Generate Sample Data Using SQL

Learning to sample a subset of a table using SQL is a skill to master.

We have a total of 4 articles on this topic.

  1. Learn to generate data using SQL
  2. Learn to generate random data using SQL
  3. Learn to generate sample data using SQL (You are reading now!)
  4. Learn to split a dataset for testing and training using SQL

Extracting a small subset of a table is commonly referred to as sampling. There are several reasons to use sampling, such as:

  1. Performing estimations on large datasets: When dealing with large tables, there are instances where we are willing to sacrifice precision for the sake of speed. By sampling a portion of the table, we can obtain quicker but less accurate results.

  2. Producing a training set: During data analysis utilizing machine learning models, it is often crucial to train the model using a subset of the data known as a training set. Sampling the table allows us to generate this training set.

Sampling with LIMIT

An dead simple approach to retrieve a random section of a table is by combining random with LIMIT:

# Sampling with LIMIT

WITH sample AS (
    SELECT *
    FROM products
    ORDER BY random()
    LIMIT 10000
)

SELECT count(*) FROM sample;

#output

 count
───────
 10000
(1 row)

Time: 205.765 ms

To obtain a sample of 10,000 random rows from the products table, follow these steps:

  1. Randomly sort the table by utilising the random() function.
  2. Retrieve the first 10,000 rows using the LIMIT 10000 clause.

It is important to note that this sampling method requires the database to sort the entire dataset and select the initial N rows. This method is suitable for smaller datasets, it may become highly inefficient and lead to excessive memory consumption and CPU usage when dealing with very large datasets.

Using TABLESAMPLE

PostgreSQL offers table sampling methods that are better suited for larger tables. To sample a table, employ the TABLESAMPLE keyword within the FROM clause and specify the sampling method along with its respective arguments.

The SYSTEM sampling method:

To utilise the SYSTEM sampling method and extract a 10% sample from the table, execute the following query:

# Using the SYSTEM sampling method

WITH sample AS (
    SELECT *
    FROM products TABLESAMPLE SYSTEM(10)
)
SELECT count(*) FROM sample;

#output

 count
───────
 95400
(1 row)

Time: 13.654 ms

The products table comprises 1 million rows, and the resulting sample yielded slightly fewer than 100,000 rows.

The SYSTEM sampling method operates by selecting blocks from the table. For instance, when sampling 10% from a table that is stored in 10 blocks, the SYSTEM sampling method only reads one block and retrieves the rows within it.

It is worth noting that when sampling 10% of the table using the SYSTEM sampling method, it proves significantly faster compared to the previous method employing random and LIMIT, completing in 205ms as opposed to 13ms. The SYSTEM sampling method operates at the storage level, resulting in its exceptional speed.

The BERNOULLI sampling method

Going forward, we have a choice, and we got one more sampling method called BERNOULLI. To sample 10% of the table using BERNOULLI, we can use the following query:

# Using the BERNOULLI sampling method

WITH sample AS (
    SELECT *
    FROM products TABLESAMPLE BERNOULLI(10)
)
SELECT count(*) FROM sample;

# output

 count
────────
 100364
(1 row)

Time: 54.663 ms

By utilising the BERNOULLI sampling method, we obtained 100,364 rows in 54ms. This method exhibits faster performance compared to random and LIMIT, but it is slower than the SYSTEM sampling method.

Unlike the SYSTEM sampling method, which selects a sample of the table blocks, the BERNOULLI sampling method scans the entire table and randomly selects rows. This implies that BERNOULLI is slower than SYSTEM, but it produces more evenly distributed random results.

Repeatable sampling

There are instances when it proves beneficial to generate the same sample from a specific dataset. This situation may arise when implementing an incremental rollout of an experiment, conducting A/B testing, or gradually enabling a feature flag.

Let's consider the following example:

# Creating a table to sample

DROP TABLE IF EXISTS products;

CREATE TABLE products AS
    SELECT n AS id, false AS blackfriday_sale
    FROM generate_series(1, 100000) AS n;

This will create a table with 100,000 products, each with a boolean field to indicate whether they are enrolled in the BlackFriday sales.

To gradually implement the blackfriday_sale, our initial step involves ensuring that none of the products are enrolled. This is achieved by setting the blackfriday_sale column to false for all products.

To enroll 5% of the products in the blackfriday_sale, execute the following UPDATE command:

# Enroll 5% of products to blackfriday_sale

UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
    SELECT id
    FROM products TABLESAMPLE BERNOULLI(5)
);

The command above uses the BERNOULLI sampling method to sample 5% of the products, and enroll them into the blackfriday_sale by setting blackfriday_sale = true.

Let's aim to expand the blackfriday_sale rollout to include 25% of the products. Given that we have already enrolled 5% in the blackfriday_sale, we need to enroll an additional 20%:

# Enroll 20% of products to blackfriday_sale, to reach a total of 25%

UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
    SELECT id
    FROM products TABLESAMPLE BERNOULLI(20)
);

Executing the above two commands will sample 25% of the products and enroll them in the blackfriday_sale if they are not already enrolled.

However, it is important to note that after executing the second query, there is no assurance that approximately 25% of the products will be enrolled in the blackfriday_sale.

# Attempting to enroll additional products into the blackfriday_sale

# 1. Add 5% to blackfriday_sale

UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
    SELECT id
    FROM products TABLESAMPLE BERNOULLI(5)
);

# 2. Add 20% to blackfriday_sale, to reach a total of 25%

UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
    SELECT id
    FROM products TABLESAMPLE BERNOULLI(20)
);

# check how many products are enrolled in the blackfriday sale

SELECT blackfriday_sale, COUNT(*)
FROM products
GROUP BY 1;

Output:

 blackfriday_sale	count
f	75898
t	24102

Observe that even after enrolling an additional 25% in the "blackfriday_sale," the overall count of enrolled products remains below 25%.

The BERNOULLI sampling method functions by randomly selecting rows if the second query happens to sample products already enrolled in the initial 5%; the resulting overall sample size will be less than 25%.

To be able to incrementally enroll products into the blackfriday_sale who are not already enrolled, we can use a technique similar to generating reproducible random data.

Similar to the random function, the BERNOULLI sampling method also accepts a seed value for sampling, utilising the REPEATABLE keyword:

# Enroll 5% of products to the blackfriday_sale using a seed value

# Enroll 5% to blackfriday_sale

UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
    SELECT id
    FROM products TABLESAMPLE BERNOULLI(5) REPEATABLE (5040)
);

# output:
UPDATE 5066

Executing this query will enroll 5% of the products in the blackfriday_sale. Have you noticed that we utilised BERNOULLI(5) REPEATABLE (5040) to specify a seed value for the random function. In this instance, the seed value is set as 5040, which can be any value.

If we repeat the execution of the same UPDATE command, using the same seed, it will sample the exact same rows. Should we execute the identical command again, it will not add any additional products.

# Enroll 5% to blackfriday_sale

UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
    SELECT id
    FROM products TABLESAMPLE BERNOULLI(5) REPEATABLE (5040)
);

# output: UPDATE 0

If we want to increase the rollout of the blackfriday sale to 25%, we can now simply set the sampling percent to 25%. Yes, that simple.

# Enroll 25% to blackfriday_sale

UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
    SELECT id
    FROM products TABLESAMPLE BERNOULLI(25) REPEATABLE (5040)
);

# output: UPDATE 19923

Since we applied a seed value for both UPDATE commands, the initial 5% of the products sampled are encompassed within the 25% that were sampled in the second command. YaY!

Performance comparison

Here is the timing for sampling 10% of a table containing 1M rows:

Sampling Method Execution time
random() 205ms
BERNOULLI 54ms
SYSTEM 13ms

Which method should we choose?

If we are working with a small dataset where sample size is crucial, it is recommended to use random() in conjunction with LIMIT.

However, if we are dealing with a large dataset, it is advisable to start with the BERNOULLI method. Only consider switching to SYSTEM if the performance noticeably deteriorates.

Thanks a lot for reading. We have a total of 4 articles on this topic.

  1. Learn to generate data using SQL
  2. Learn to generate random data using SQL
  3. Learn to generate sample data using SQL (You already read!)
  4. Learn to split a dataset for testing and training using SQL

Post Tags: