Skip to main content

Learn to Generate Random Data Using SQL

Learning to generate random data using SQL is an incredible skill and a massive time-saver.

We have a total of 4 articles on this topic.

  1. Learn to generate data using SQL
  2. Learn to generate random data using SQL (You are reading now)
  3. Learn to generate sample data using SQL
  4. Learn to split a dataset for testing and training using SQL

Generating random data is beneficial for a multitude of reasons, particularly when it comes to testing, development, and benchmarking. Frequently, there is a requirement to generate random data for these purposes.

For instance, you may need to populate data in your local development environment, or you might want to evaluate query optimization techniques and necessitate a substantial table for that purpose.

The RANDOM function

To generate random numbers, PostgreSQL offers the "random" function. This function produces a value ranging from 0 to 1:

SELECT  random();

output:

random
0.604956973344088

Note: Executing random() function multiple times, and on each occasion, you will receive a unique random number between 0 and 1.

Generating random integers

As observed in the preceding example, the random function generates a random number that is greater than or equal to 0 and less than 1. Unfortunately, it does not accept any arguments to control the range. So, what if we require a random integer?

To generate values within different ranges, we can incorporate the random function into an expression. For instance, to generate a random floating-point number between 0 and 99, we can multiply the output of random() by 100.

# Generates a random float between 0 and 99
SELECT random() * 100;

output:

?column?
45.9825816098601

If our aim is to generate a random integer ranging from 0 to 99, we can accomplish this by rounding down the number using the floor function:

# Generate random number between 0 and 99
SELECT floor(random() * 100) as n;

output:

n
35

By implementing the floor function, the outcome of the expression will be rounded down, yielding an integer rather than a floating-point value.

To generate a random integer within the range of 10 to 99, we can expand the expression as follows:

# Generate a random integer between 10 and 99
SELECT 10 + floor(random() * 90) as n;

output:

n
83

The above expression utilizes the previously mentioned random expression and adds 10 to it. The random expression, floor(random() * 90), generates a random integer ranging from 0 to 89. By adding 10 to the result of this expression, we obtain an integer within the range of 11-99.

Converting float to integer

In the previous section, we used the floor function to convert a floating-point number into an integer. However, there are two additional functions that can be utilized for this purpose.

  1. floor: This function rounds a float down to the nearest integer. For example, floor(2.6) = 2.
  2. round: This function rounds a float to the nearest integer. For instance, round(2.6) = 3 and round(2.4) = 2.
  3. ceil: This function rounds a float up to the nearest integer. For example, ceil(2.4) = 3.

To illustrate the usage of each of these functions, we can use the following query:

# Different ways to round a number
SELECT
  n,
  floor(n) as _floor_,
  round(n) as _round_,
  ceil(n) as _ceil_
FROM (VALUES
  (2.0),
  (2.2),
  (2.5),
  (2.7),
  (3.0)
) AS t(n)

output:

n	_floor_	_round_	_ceil_
2.0	  2	      2	    2
2.2	  2	      2	    3
2.5	  2	      3	    3
2.7	  2	      3	    3
3.0	  3	      3	    3

In the output of the query, we can observe the conversion of each function where the float is transformed into an integer. It is particularly fascinating to note the behavior of the round function when the value is exactly halfway, such as 2.5. Based on the obtained results, it appears that PostgreSQL rounds the values of 2.5 up to 3.

Random choice

Up until now, we have explored the generation of random floating-point numbers and integers. However, there are instances when we require selecting random values from a given list of possibilities.

By employing the knowledge we have gained thus far, we can leverage the random function to pick a random value from an array:

SELECT (array['red', 'orange', 'green'])[1 + floor(random() * 3)] AS color

output:

color
red

To begin, we establish an array called "colors" that encompasses the values red, orange, and green. In order to select a random value from this array, we utilize the random function to generate a random index ranging from 1 to 3. Please note that in PostgreSQL, array indexing commences at 1.

Producing reproducible random data

Suppose we are conducting a performance test on an extensive table. In order to generate test data, we compose a script that creates a table and fills it with millions of rows containing random data.

Once the script is finalised, we share it with another developer. However, when they attempt to execute the script on their local environment, they obtain entirely different results. This discrepancy arises when the locally generated random data differs from the data we used.

To illustrate this scenario, execute the provided code multiple times:

# Generating a random float
SELECT random() as n;

Note: Every time you execute the above code you get a different result.

This scenario precisely illustrates what occurs when our fellow developer executes our script on their local environment—they obtain distinct outcomes.

Therefore, the question arises: How can we generate random results that can be reproduced?

In PostgreSQL, we can achieve reproducible random data by specifying the random seed. This can be accomplished by utilizing the setseed function:

# Using setseed to generating reproducible random numbers
SELECT setseed(0.2030);
SELECT random() as n;

Output:

setseed
 n
0.568829658441246

Believe me, we will now get the exact same value every time we execute the above query.

The setseed function accepts a random "seed" number ranging from -1 to 1. The "seed" value influences the values produced in all subsequent calls to the random function.

Tip: For future demonstrations or when sharing scripts containing random data, it is advisable to include the setseed function. This ensures that the results can be easily reproduced by others.

Thanks a lot for reading. We have a total of 4 articles on this topic.

  1. Learn to generate data using SQL
  2. Learn to generate random data using SQL (You already read!)
  3. Learn to generate sample data using SQL
  4. Learn to split a dataset for testing and training using SQL

Post Tags: