Learn to Generate Data Using SQL

Author: Narendra Mandadapu
Posted: Jul 14, 2023

Learning to generate data using SQL is an incredible skill and a massive time-saver. This skill is the single highest ROI (Return on Investment) skill that I have learned in my life, allowing me to efficiently manipulate data whenever the need arises.

We have a total of 4 articles on this topic.

Learn to generate data using SQL (You are reading now)
Learn to generate random data using SQL
Learn to generate sample data using SQL
Learn to split a dataset for testing and training using SQL

Data generation is incredibly useful in various scenarios. On occasion, we require generated data for training purposes, allowing us to practice and improve our skills. In other instances, when conducting performance simulations, a substantial dataset is necessary to carry out accurate assessments.

Moreover, there are times when generating a sequence, like a time dimension in a data warehouse or an axis table for joining in an anomaly detection system, becomes essential.

The SELECT clause
The UNION ALL command
The VALUES LIST keyword
The UNNEST function
The GENERATE_SERIES function
The GENERATE_SERIES function with row numbers
Exercise: Generate a Date Dimension
- Instructions:
- Guidelines:

There are several reasons to generate data in SQL, and in this article we’ll learn several ways to do that.

The `SELECT` clause

The most straightforward way to to generate data in SQL is using the SELECT clause:

SELECT
  1 AS id,
  'data-engineering' AS skill

output:

id	skill
1	data-engineering

This query will produce a single row with the id column and the skill column.

Note: The above query is not fetching data from any table. In SQL, the only required clause is SELECT.

The `UNION ALL` command

To combine the results of multiple queries we can use UNION ALL:

SELECT
  1 AS id,
  'data-engineering' AS skill
UNION ALL
SELECT
  2,
  'public-speaking'

output:

id	skill
1	data-engineering
2	public-speaking

We used UNION ALL to concatenate the output of two queries.

To use UNION ALL, it’s important to keep the types and the number of columns similar in all of the concatenated queries.

Using UNION ALL to concatenate query results is a widely used practice, but it can become quite laborious when generating large volumes of data.

The `VALUES LIST` keyword

In the previous example, our objective was to generate a collection of rows based on predefined values. We achieved this by employing bare SELECT clauses and combining the outcomes using UNION ALL. Generating data from a predefined list of values is a frequent occurrence, prompting SQL to provide a more expedient approach through the utilization of the VALUES keyword.

SELECT
  *
FROM
  (
    VALUES
      (1, 'data-engineering'),
      (2, 'public-speaking')
  ) AS t(id, skill)

output:

id	skill
1	data-engineering
2	public-speaking

VALUES is primarily advantageous when generating smaller datasets, but what if we require a significantly larger number of rows to be generated?

The `UNNEST` function

To generate small sets of one-dimensional data, we can use a special PostgreSQL array function:

SELECT
  unnest(array[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) AS id

output:

id
1
2
.
.
9
10

The unnest function takes an array and converts it into a tabular format. This functionality is highly beneficial when creating a table from a given list of values.

However, it is important to note that unnest has certain limitations compared to VALUES. Specifically, it can only generate a one-dimensional table consisting of elements of the same data type.

The `GENERATE_SERIES` function

To generate a data series, we were previously required to supply the unnest function with an array of values. However, PostgreSQL offers a convenient table function called generate_series specifically designed for this purpose:

SELECT
  *
FROM
  generate_series(0, 5) AS t(id)

output:

The generate_series function accepts three arguments: start, stop, and step. In the previous example, we did not specify a step value, so the default step of 1 was utilised.

To define the output column names and data types, we employed a table alias list as t(<col1>, <col2>, ...).

If we wish to generate a distinct series, we have the flexibility to specify a different step value:

SELECT
  *
FROM
  generate_series(
    0, -- start
    10, -- stop
    2 -- step
    ) AS t(id)

output:

By utilizing the third argument, step, we successfully generated a sequence of 5 numbers ranging from 0 to 10.

It's worth noting that the generate_series function is not limited to integers; it can be applied to various other data types as well.

One prevalent use case is generating date ranges:

SELECT
  *
FROM
  generate_series(
    '2021-01-01 UTC' :: timestamptz, -- start
    '2021-01-02 UTC' :: timestamptz, -- stop
    interval '1 hour' -- step
    ) AS t(dd)

output:

dd
2021-01-01 00:00:00+00
2021-01-01 01:00:00+00
..
..
2021-01-01 23:00:00+00
2021-01-02 00:00:00+00

To generate a 24-hour range, we utilize the generate_series function by specifying a start and end date while setting the step to a 1 hour interval. It's important to note that the step can be any valid interval, such as 1 minute, 5 minutes, or 2 hours.

In this particular case, we require an hourly series, thus we employ a 1-hour interval.

The resulting output of the function is dependent on the input provided. If integers are used, we obtain a range of integers. Conversely, if a timestamp is provided, we obtain a range of timestamps.

The `GENERATE_SERIES` function with row numbers

The generate_series function is a table function. There is a little a clever technique with table functions which enables us to incorporate row numbers into the output:

SELECT
  *
FROM
  generate_series(
    '2021-01-01' :: timestamptz, -- start
    '2021-01-02' :: timestamptz, -- stop
    interval '1 hour' -- step
    ) WITH ORDINALITY AS t(day, id)

output:

day	id
2021-01-01 00:00:00+00	1
2021-01-01 01:00:00+00	2
..
..
2021-01-01 23:00:00+00	24
2021-01-02 00:00:00+00	25

By utilizing the WITH ORDINALITY clause, the results now feature an additional column that represents the row number. To assign names to these columns, we can employ the table alias function t(...). In this scenario, day is designated for the timestamp column, while id represents the row number.

Exercise: Generate a Date Dimension

When building a data warehouse, a frequent requirement is to generate a date dimension. A date dimension is a table that stores dates along with associated attributes, enabling us to analyse data across various time periods.

A basic representation of a date dimension can resemble the following:

day	month	year	quarter	week
2013-01-01	January	2013	Q1	Tuesday
2013-01-02	January	2013	Q1	Wednesday





2023-30-12	December	2023	Q4	Saturday
2023-31-12	December	2023	Q4	Sunday

We can now leverage this time dimension to join tables and group data based on different attributes, enabling analysis across different time periods.

Instructions:

To apply the knowledge you have gained thus far, you will create a date dimension that fulfills the following requirements:

Daily date dimension for a period of 10 years, between 2013-01-01 until 2023-12-31.
The date dimension will include the following columns:
- day: For example, ‘2023-08-02.’
- month: Name of month. For example, August.”
- year: 4-digit year, such as “2023.”
- quarter: The quarter number prefixed with “Q,” such as “Q3.”
- week: The name of the day of the week, such as Wednesday.”

Guidelines:

To create your query, utilize generate_series to generate a list of dates. Consider the appropriate start date, end date, and step value to generate an accurate date range.
Use date formatting, as well as date/time functions and operators, to extract and format the date in the desired format.

Solution

with days as (
  SELECT
    *
  FROM
    generate_series(
      '2013-01-01', '2023-12-31', interval '1 day'
    ) AS t(day)
)
SELECT
  day,
  to_char(day, 'Month') AS month,
  date_part('year', day) AS year,
  'Q' || date_part('quarter', day) AS quarter,
  to_char(day, 'Day') AS week
FROM
  days;

Output:

day	        month	    year	quarter	week
2013-01-01 	January 	2013	Q1	Tuesday
2013-01-02 	January 	2013	Q1	Wednesday
...
...
2023-12-30 	December 	2023	Q4	Saturday
2023-12-31 	December 	2023	Q4	Sunday

Thanks a lot for reading. I have a total of 4 articles on this topic.

Learn to generate data using SQL (You already read!)
Learn to generate random data using SQL
Learn to generate sample data using SQL
Learn to split a dataset for testing and training using SQL

Post Tags:

Previous: Python Libraries Every Data Engineer Should Know
Next: Learn to Generate Random Data Using SQL