# Learn to Generate Sample Data Using SQL

Learning to sample a subset of a table using SQL is a skill to master.

We have a total of 4 articles on this topic.

- Learn to generate data using SQL
- Learn to generate
**random**data using SQL - Learn to generate
**sample**data using SQL (You are reading now!) - Learn to
**split**a dataset for testing and training using SQL

Extracting a small subset of a table is commonly referred to as **sampling**. There are several reasons to use sampling, such as:

Performing estimations on large datasets: When dealing with large tables, there are instances where we are willing to sacrifice precision for the sake of speed. By sampling a portion of the table, we can obtain quicker but less accurate results.

Producing a training set: During data analysis utilizing machine learning models, it is often crucial to train the model using a subset of the data known as a training set. Sampling the table allows us to generate this training set.

## Sampling with `LIMIT`

An dead simple approach to retrieve a random section of a table is by combining `random`

with `LIMIT`

:

```
# Sampling with LIMIT
WITH sample AS (
SELECT *
FROM products
ORDER BY random()
LIMIT 10000
)
SELECT count(*) FROM sample;
#output
count
───────
10000
(1 row)
Time: 205.765 ms
```

To obtain a sample of 10,000 random rows from the `products`

table, follow these steps:

- Randomly sort the table by utilising the
`random()`

function. - Retrieve the first 10,000 rows using the
`LIMIT 10000`

clause.

It is important to note that this sampling method requires the database to sort the entire dataset and select the initial N rows. This method is suitable for smaller datasets, it may become highly inefficient and lead to excessive memory consumption and CPU usage when dealing with very large datasets.

## Using `TABLESAMPLE`

PostgreSQL offers table sampling methods that are better suited for larger tables. To sample a table, employ the `TABLESAMPLE`

keyword within the `FROM`

clause and specify the sampling method along with its respective arguments.

### The `SYSTEM`

sampling method:

To utilise the `SYSTEM`

sampling method and extract a 10% sample from the table, execute the following query:

```
# Using the SYSTEM sampling method
WITH sample AS (
SELECT *
FROM products TABLESAMPLE SYSTEM(10)
)
SELECT count(*) FROM sample;
#output
count
───────
95400
(1 row)
Time: 13.654 ms
```

The `products`

table comprises 1 million rows, and the resulting sample yielded slightly fewer than 100,000 rows.

The `SYSTEM`

sampling method operates by selecting blocks from the table. For instance, when sampling 10% from a table that is stored in 10 blocks, the `SYSTEM`

sampling method only reads one block and retrieves the rows within it.

It is worth noting that when sampling 10% of the table using the `SYSTEM`

sampling method, it proves significantly faster compared to the previous method employing `random`

and `LIMIT`

, completing in 205ms as opposed to 13ms. The `SYSTEM`

sampling method operates at the storage level, resulting in its exceptional speed.

### The `BERNOULLI`

sampling method

Going forward, we have a choice, and we got one more sampling method called `BERNOULLI`

. To sample 10% of the table using `BERNOULLI`

, we can use the following query:

```
# Using the BERNOULLI sampling method
WITH sample AS (
SELECT *
FROM products TABLESAMPLE BERNOULLI(10)
)
SELECT count(*) FROM sample;
# output
count
────────
100364
(1 row)
Time: 54.663 ms
```

By utilising the `BERNOULLI`

sampling method, we obtained 100,364 rows in 54ms. This method exhibits faster performance compared to `random`

and `LIMIT`

, but it is slower than the `SYSTEM`

sampling method.

Unlike the `SYSTEM`

sampling method, which selects a sample of the table blocks, the `BERNOULLI`

sampling method scans the entire table and randomly selects rows. This implies that `BERNOULLI`

is slower than SYSTEM, but it produces more evenly distributed random results.

## Repeatable sampling

There are instances when it proves beneficial to generate the same sample from a specific dataset. This situation may arise when implementing an incremental rollout of an experiment, conducting A/B testing, or gradually enabling a feature flag.

Let's consider the following example:

```
# Creating a table to sample
DROP TABLE IF EXISTS products;
CREATE TABLE products AS
SELECT n AS id, false AS blackfriday_sale
FROM generate_series(1, 100000) AS n;
```

This will create a table with 100,000 products, each with a boolean field to indicate whether they are enrolled in the BlackFriday sales.

To gradually implement the blackfriday_sale, our initial step involves ensuring that none of the products are enrolled. This is achieved by setting the `blackfriday_sale`

column to `false`

for all products.

To enroll 5% of the products in the blackfriday_sale, execute the following `UPDATE`

command:

```
# Enroll 5% of products to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5)
);
```

The command above uses the `BERNOULLI`

sampling method to sample 5% of the products, and enroll them into the blackfriday_sale by setting `blackfriday_sale = true`

.

Let's aim to expand the blackfriday_sale rollout to include 25% of the products. Given that we have already enrolled 5% in the blackfriday_sale, we need to enroll an additional 20%:

```
# Enroll 20% of products to blackfriday_sale, to reach a total of 25%
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(20)
);
```

Executing the above two commands will sample 25% of the products and enroll them in the blackfriday_sale if they are not already enrolled.

However, it is important to note that after executing the second query, there is no assurance that approximately 25% of the products will be enrolled in the blackfriday_sale.

```
# Attempting to enroll additional products into the blackfriday_sale
# 1. Add 5% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5)
);
# 2. Add 20% to blackfriday_sale, to reach a total of 25%
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(20)
);
# check how many products are enrolled in the blackfriday sale
SELECT blackfriday_sale, COUNT(*)
FROM products
GROUP BY 1;
```

Output:

```
blackfriday_sale count
f 75898
t 24102
```

Observe that even after enrolling an additional 25% in the "blackfriday_sale," the overall count of enrolled products remains below 25%.

The `BERNOULLI`

sampling method functions by randomly selecting rows if the second query happens to sample products already enrolled in the initial 5%; the resulting overall sample size will be less than 25%.

To be able to incrementally enroll products into the blackfriday_sale who are not already enrolled, we can use a technique similar to generating reproducible random data.

Similar to the `random`

function, the `BERNOULLI`

sampling method also accepts a seed value for sampling, utilising the `REPEATABLE`

keyword:

```
# Enroll 5% of products to the blackfriday_sale using a seed value
# Enroll 5% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5) REPEATABLE (5040)
);
# output:
UPDATE 5066
```

Executing this query will enroll 5% of the products in the blackfriday_sale. Have you noticed that we utilised `BERNOULLI(5)`

`REPEATABLE (5040)`

to specify a seed value for the random function. In this instance, the seed value is set as 5040, which can be any value.

If we repeat the execution of the same `UPDATE`

command, using the same seed, it will sample the exact same rows. Should we execute the identical command again, it will not add any additional products.

```
# Enroll 5% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(5) REPEATABLE (5040)
);
# output: UPDATE 0
```

If we want to increase the rollout of the blackfriday sale to 25%, we can now simply set the sampling percent to 25%. Yes, that simple.

```
# Enroll 25% to blackfriday_sale
UPDATE products
SET blackfriday_sale = true
WHERE NOT blackfriday_sale
AND id IN (
SELECT id
FROM products TABLESAMPLE BERNOULLI(25) REPEATABLE (5040)
);
# output: UPDATE 19923
```

Since we applied a seed value for both `UPDATE`

commands, the initial 5% of the products sampled are encompassed within the 25% that were sampled in the second command. YaY!

## Performance comparison

Here is the timing for sampling 10% of a table containing 1M rows:

Sampling Method | Execution time |
---|---|

`random()` | 205ms |

`BERNOULLI` | 54ms |

`SYSTEM` | 13ms |

Which method should we choose?

If we are working with a small dataset where sample size is crucial, it is recommended to use `random()`

in conjunction with `LIMIT`

.

However, if we are dealing with a large dataset, it is advisable to start with the `BERNOULLI`

method. Only consider switching to `SYSTEM`

if the performance noticeably deteriorates.

Thanks a lot for reading. We have a total of 4 articles on this topic.

- Learn to generate data using SQL
- Learn to generate
**random**data using SQL - Learn to generate
**sample**data using SQL (You already read!) - Learn to
**split**a dataset for testing and training using SQL

*Post Tags:*