Learn to Generate Sample Data Using SQL
Learning to sample a subset of a table using SQL is a skill to master.
We have a total of 4 articles on this topic.
- Learn to generate data using SQL
- Learn to generate random data using SQL
- Learn to generate sample data using SQL (You are reading now!)
- Learn to split a dataset for testing and training using SQL
Extracting a small subset of a table is commonly referred to as sampling. There are several reasons to use sampling, such as:
-
Performing estimations on large datasets: When dealing with large tables, there are instances where we are willing to sacrifice precision for the sake of speed. By sampling a portion of the table, we can obtain quicker but less accurate results.
-
Producing a training set: During data analysis utilizing machine learning models, it is often crucial to train the model using a subset of the data known as a training set. Sampling the table allows us to generate this training set.
Sampling with LIMIT
An dead simple approach to retrieve a random section of a table is by combining random
with LIMIT
:
# Sampling with LIMIT
WITH sample AS (
SELECT *
FROM products
ORDER BY random()
LIMIT 10000
)
SELECT count(*) FROM sample;
#output
count
───────
10000
(1 row)
Time: 205.765 ms
To obtain a sample of 10,000 random rows from the products
table, follow these steps:
- Randomly sort the table by utilising the
random()
function. - Retrieve the first 10,000 rows using the
LIMIT 10000
clause.
It is important to note that this sampling method requires the database to sort the entire dataset and select the initial N rows. This method is suitable for smaller datasets, it may become highly inefficient and lead to excessive memory consumption and CPU usage when dealing with very large datasets.
Using TABLESAMPLE
PostgreSQL offers table sampling methods that are better suited for larger tables. To sample a table, employ the TABLESAMPLE
keyword within the FROM
clause and specify the sampling method along with its respective arguments.
The SYSTEM
sampling method:
To utilise the SYSTEM
sampling method and extract a 10% sample from the table, execute the following query:
# Using the SYSTEM sampling method
WITH sample AS (
SELECT *
FROM products TABLESAMPLE SYSTEM(10)
)
SELECT count(*) FROM sample;
#output
count
───────
95400
(1 row)
Time: 13.654 ms
The products
table comprises 1 million rows, and the resulting sample yielded slightly fewer than 100,000 rows.
The SYSTEM
sampling method operates by selecting blocks from the table. For instance, when sampling 10% from a table that is stored in 10 blocks, the SYSTEM
sampling method only reads one block and retrieves the rows within it.
It is worth noting that when sampling 10% of the table using the SYSTEM
sampling method, it proves significantly faster compared to the previous method employing random
and LIMIT
, completing in 205ms as opposed to 13ms. The SYSTEM
sampling method operates at the storage level, resulting in its exceptional speed.
The BERNOULLI
sampling method
Going forward, we have a choice, and we got one more sampling method called BERNOULLI
. To sample 10% of the table using BERNOULLI
, we can use the following query:
# Using the BERNOULLI sampling method
WITH sample AS (
SELECT *
FROM products TABLESAMPLE BERNOULLI(10)
)
SELECT count(*) FROM sample;
# output
count
────────
100364
(1 row)
Time: 54.663 ms
By utilising the BERNOULLI
sampling method, we obtained 100,364 rows in 54ms. This method exhibits faster performance compared to random
and LIMIT
, but it is slower than the SYSTEM
sampling method.
Unlike the SYSTEM
sampling method, which selects a sample of the table blocks, the BERNOULLI
sampling method scans the entire table and randomly selects rows. This implies that BERNOULLI
is slower than SYSTEM, but it produces more evenly distributed random results.
Repeatable sampling
There are instances when it proves beneficial to generate the same sample from a specific dataset. This situation may arise when implementing an incremental rollout of an experiment, conducting A/B testing, or gradually enabling a feature flag.
Let's consider the following example:
# Creating a table to sample
DROP TABLE IF EXISTS products;
CREATE TABLE products AS
SELECT n AS id, false AS blackfriday_sale
FROM generate_series(1, 100000) AS n;
This will create a table with 100,000 products, each with a boolean field to indicate whether they are enrolled in the BlackFriday sales.
To gradually implement the blackfriday_sale, our initial step involves ensuring that none of the products are enrolled. This is achieved by setting the blackfriday_sale
column to false
for all products.
To enroll 5% of the products in the blackfriday_sale, execute the following UPDATE
command:
# Enroll 5% of products to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5)
);
The command above uses the BERNOULLI
sampling method to sample 5% of the products, and enroll them into the blackfriday_sale by setting blackfriday_sale = true
.
Let's aim to expand the blackfriday_sale rollout to include 25% of the products. Given that we have already enrolled 5% in the blackfriday_sale, we need to enroll an additional 20%:
# Enroll 20% of products to blackfriday_sale, to reach a total of 25%
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(20)
);
Executing the above two commands will sample 25% of the products and enroll them in the blackfriday_sale if they are not already enrolled.
However, it is important to note that after executing the second query, there is no assurance that approximately 25% of the products will be enrolled in the blackfriday_sale.
# Attempting to enroll additional products into the blackfriday_sale
# 1. Add 5% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5)
);
# 2. Add 20% to blackfriday_sale, to reach a total of 25%
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(20)
);
# check how many products are enrolled in the blackfriday sale
SELECT blackfriday_sale, COUNT(*)
FROM products
GROUP BY 1;
Output:
blackfriday_sale count
f 75898
t 24102
Observe that even after enrolling an additional 25% in the "blackfriday_sale," the overall count of enrolled products remains below 25%.
The BERNOULLI
sampling method functions by randomly selecting rows if the second query happens to sample products already enrolled in the initial 5%; the resulting overall sample size will be less than 25%.
To be able to incrementally enroll products into the blackfriday_sale who are not already enrolled, we can use a technique similar to generating reproducible random data.
Similar to the random
function, the BERNOULLI
sampling method also accepts a seed value for sampling, utilising the REPEATABLE
keyword:
# Enroll 5% of products to the blackfriday_sale using a seed value
# Enroll 5% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5) REPEATABLE (5040)
);
# output:
UPDATE 5066
Executing this query will enroll 5% of the products in the blackfriday_sale. Have you noticed that we utilised BERNOULLI(5)
REPEATABLE (5040)
to specify a seed value for the random function. In this instance, the seed value is set as 5040, which can be any value.
If we repeat the execution of the same UPDATE
command, using the same seed, it will sample the exact same rows. Should we execute the identical command again, it will not add any additional products.
# Enroll 5% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5) REPEATABLE (5040)
);
# output: UPDATE 0
If we want to increase the rollout of the blackfriday sale to 25%, we can now simply set the sampling percent to 25%. Yes, that simple.
# Enroll 25% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(25) REPEATABLE (5040)
);
# output: UPDATE 19923
Since we applied a seed value for both UPDATE
commands, the initial 5% of the products sampled are encompassed within the 25% that were sampled in the second command. YaY!
Performance comparison
Here is the timing for sampling 10% of a table containing 1M rows:
Sampling Method | Execution time |
---|---|
random() |
205ms |
BERNOULLI |
54ms |
SYSTEM |
13ms |
Which method should we choose?
If we are working with a small dataset where sample size is crucial, it is recommended to use random()
in conjunction with LIMIT
.
However, if we are dealing with a large dataset, it is advisable to start with the BERNOULLI
method. Only consider switching to SYSTEM
if the performance noticeably deteriorates.
Thanks a lot for reading. We have a total of 4 articles on this topic.
- Learn to generate data using SQL
- Learn to generate random data using SQL
- Learn to generate sample data using SQL (You already read!)
- Learn to split a dataset for testing and training using SQL
Post Tags: