Learn to Generate Data Using SQL
Learning to generate data using SQL is an incredible skill and a massive time-saver. This skill is the single highest ROI (Return on Investment) skill that I have learned in my life, allowing me to efficiently manipulate data whenever the need arises.
We have a total of 4 articles on this topic.
- Learn to generate data using SQL (You are reading now)
- Learn to generate random data using SQL
- Learn to generate sample data using SQL
- Learn to split a dataset for testing and training using SQL
Data generation is incredibly useful in various scenarios. On occasion, we require generated data for training purposes, allowing us to practice and improve our skills. In other instances, when conducting performance simulations, a substantial dataset is necessary to carry out accurate assessments.
Moreover, there are times when generating a sequence, like a time dimension in a data warehouse or an axis table for joining in an anomaly detection system, becomes essential.
There are several reasons to generate data in SQL, and in this article weβll learn several ways to do that.
The SELECT
clause
The most straightforward way to to generate data in SQL is using the SELECT clause:
SELECT
1 AS id,
'data-engineering' AS skill
output:
id skill
1 data-engineering
This query will produce a single row with the id
column and the skill
column.
Note: The above query is not fetching data from any table. In SQL, the only required clause is SELECT
.
The UNION ALL
command
To combine the results of multiple queries we can use UNION ALL
:
SELECT
1 AS id,
'data-engineering' AS skill
UNION ALL
SELECT
2,
'public-speaking'
output:
id skill
1 data-engineering
2 public-speaking
We used UNION ALL
to concatenate the output of two queries.
To use UNION ALL
, itβs important to keep the types and the number of columns similar in all of the concatenated queries.
Using UNION ALL
to concatenate query results is a widely used practice, but it can become quite laborious when generating large volumes of data.
The VALUES LIST
keyword
In the previous example, our objective was to generate a collection of rows based on predefined values. We achieved this by employing bare SELECT
clauses and combining the outcomes using UNION ALL
. Generating data from a predefined list of values is a frequent occurrence, prompting SQL to provide a more expedient approach through the utilization of the VALUES
keyword.
SELECT
*
FROM
(
VALUES
(1, 'data-engineering'),
(2, 'public-speaking')
) AS t(id, skill)
output:
id skill
1 data-engineering
2 public-speaking
VALUES
is primarily advantageous when generating smaller datasets, but what if we require a significantly larger number of rows to be generated?
The UNNEST
function
To generate small sets of one-dimensional data, we can use a special PostgreSQL array function:
SELECT
unnest(array[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]) AS id
output:
id
1
2
.
.
9
10
The unnest
function takes an array and converts it into a tabular format. This functionality is highly beneficial when creating a table from a given list of values.
However, it is important to note that unnest
has certain limitations compared to VALUES
. Specifically, it can only generate a one-dimensional table consisting of elements of the same data type.
The GENERATE_SERIES
function
To generate a data series, we were previously required to supply the unnest
function with an array of values. However, PostgreSQL
offers a convenient table function called generate_series
specifically designed for this purpose:
SELECT
*
FROM
generate_series(0, 5) AS t(id)
output:
id
0
1
2
3
4
5
The generate_series
function accepts three arguments: start
, stop
, and step
. In the previous example, we did not specify a step value, so the default step of 1
was utilized.
To define the output column names and data types, we employed a table alias list as t(<col1>, <col2>, ...)
.
If we wish to generate a distinct series, we have the flexibility to specify a different step value:
SELECT
*
FROM
generate_series(
0, -- start
10, -- stop
2 -- step
) AS t(id)
output:
id
0
2
4
6
8
10
By utilizing the third argument, step
, we successfully generated a sequence of 5 numbers ranging from 0 to 10.
It's worth noting that the generate_series
function is not limited to integers; it can be applied to various other data types as well.
One prevalent use case is generating date ranges:
SELECT
*
FROM
generate_series(
'2021-01-01 UTC' :: timestamptz, -- start
'2021-01-02 UTC' :: timestamptz, -- stop
interval '1 hour' -- step
) AS t(dd)
output:
dd
2021-01-01 00:00:00+00
2021-01-01 01:00:00+00
..
..
2021-01-01 23:00:00+00
2021-01-02 00:00:00+00
To generate a 24-hour range, we utilize the generate_series
function by specifying a start and end date while setting the step to a 1
hour interval. It's important to note that the step
can be any valid interval, such as 1
minute, 5
minutes, or 2
hours.
In this particular case, we require an hourly series, thus we employ a 1-hour interval.
The resulting output of the function is dependent on the input provided. If integers are used, we obtain a range of integers. Conversely, if a timestamp is provided, we obtain a range of timestamps.
The GENERATE_SERIES
function with row numbers
The generate_series
function is a table function. There is a little a clever technique with table functions which enables us to incorporate row numbers into the output:
SELECT
*
FROM
generate_series(
'2021-01-01' :: timestamptz, -- start
'2021-01-02' :: timestamptz, -- stop
interval '1 hour' -- step
) WITH ORDINALITY AS t(day, id)
output:
day id
2021-01-01 00:00:00+00 1
2021-01-01 01:00:00+00 2
..
..
2021-01-01 23:00:00+00 24
2021-01-02 00:00:00+00 25
By utilizing the WITH ORDINALITY
clause, the results now feature an additional column that represents the row number. To assign names to these columns, we can employ the table alias function t(...)
. In this scenario, day
is designated for the timestamp column, while id
represents the row number.
Exercise: Generate a Date Dimension
When building a data warehouse, a frequent requirement is to generate a date dimension. A date dimension is a table that stores dates along with associated attributes, enabling us to analyse data across various time periods.
A basic representation of a date dimension can resemble the following:
day | month | year | quarter | week |
---|---|---|---|---|
2013-01-01 | January | 2013 | Q1 | Tuesday |
2013-01-02 | January | 2013 | Q1 | Wednesday |
2023-30-12 | December | 2023 | Q4 | Saturday |
2023-31-12 | December | 2023 | Q4 | Sunday |
We can now leverage this time dimension to join tables and group data based on different attributes, enabling analysis across different time periods.
Instructions:
To apply the knowledge you have gained thus far, you will create a date dimension that fulfills the following requirements:
- Daily date dimension for a period of 10 years, between 2013-01-01 until 2023-12-31.
- The date dimension will include the following columns:
- day: For example, β2023-08-02.β
- month: Name of month. For example, August.β
- year: 4-digit year, such as β2023.β
- quarter: The quarter number prefixed with βQ,β such as βQ3.β
- week: The name of the day of the week, such as Wednesday.β
Guidelines:
-
To create your query, utilize
generate_series
to generate a list of dates. Consider the appropriate start date, end date, and step value to generate an accurate date range. -
Use date formatting, as well as date/time functions and operators, to extract and format the date in the desired format.
Solution
with days as (
SELECT
*
FROM
generate_series(
'2013-01-01', '2023-12-31', interval '1 day'
) AS t(day)
)
SELECT
day,
to_char(day, 'Month') AS month,
date_part('year', day) AS year,
'Q' || date_part('quarter', day) AS quarter,
to_char(day, 'Day') AS week
FROM
days;
Output:
day month year quarter week
2013-01-01 January 2013 Q1 Tuesday
2013-01-02 January 2013 Q1 Wednesday
...
...
2023-12-30 December 2023 Q4 Saturday
2023-12-31 December 2023 Q4 Sunday
Thanks a lot for reading. I have a total of 4 articles on this topic.
- Learn to generate data using SQL (You already read!)
- Learn to generate random data using SQL
- Learn to generate sample data using SQL
- Learn to split a dataset for testing and training using SQL
Post Tags: