Learn to Split a Dataset for Testing and Training Using SQL
A frequent task in data analysis involves dividing a dataset into training and testing portions. The training dataset is utilised to train the model, while the test dataset is employed to assess the model's performance.
We have a total of 4 articles on this topic.
- Learn to generate data using SQL
- Learn to generate random data using SQL
- Learn to generate sample data using SQL
- Learn to split a dataset for testing and training using SQL (You are reading now)
Creating a sales table
To put into practice the knowledge we have acquired thus far, let us begin by generating a sales table with randomly generated data:
# Creating a sales table for practice
CREATE TABLE sales AS
SELECT
id,
'2023-01-01'::date + interval '1 day' * ceil(random() * 365) as sales_date,
10 + ceil(90 * random()) as amount_charged,
random() > 0.6 as shipped_status
FROM
generate_series(1, 10) AS id
ORDER BY
1;
select * from sales;
Output:
id sales_date amount_charged shipped_status
1 2023-05-16 00:00:00 93 f
2 2023-11-30 00:00:00 99 f
3 2023-10-25 00:00:00 53 f
4 2023-12-17 00:00:00 64 t
5 2023-12-31 00:00:00 93 t
6 2023-05-28 00:00:00 77 t
7 2023-04-24 00:00:00 80 t
8 2023-11-01 00:00:00 60 f
9 2023-05-23 00:00:00 19 f
10 2023-05-05 00:00:00 72 f
The sales table comprises the sales date, amount, and an indication of whether the item was shipped.
To generate the table, apply the techniques we have learned throughout the previous articles:
# Generate a random day in 2023
select '2023-01-01'::date + interval '1 day' * ceil(random() * 365) as sales_date;
Output:
sales_date
2023-11-30 00:00:00
Generate a random date within the year 2023 by adding a random number of days between 1 and 365 to January 1st, 2023.
Then, generate a random amount charged ranging from 11 to 100:
# Generate a random charged amount between 11 and 100
select 10 + ceil(90 * random()) as amount_charged;
Output:
amount_charged
13
Lastly, generate the parameter that determines whether the item is shipped or not:
# Generate the parameter to determine the shipped status
select random() > 0.6 as shipped_status;
Output:
shipped_status
f
Within our fabricated data, we aim to include 40% shipped status sales. To accomplish this, we generate a boolean value using an expression that will yield true approximately 40% of the time.
Here is an example showcasing the structure of the data:
select * FROM sales;
Output:
# Content of sales table
id sales_date amount_charged shipped_status
1 2023-05-16 00:00:00 93 f
2 2023-11-30 00:00:00 99 f
3 2023-10-25 00:00:00 53 f
4 2023-12-17 00:00:00 64 t
5 2023-12-31 00:00:00 93 t
6 2023-05-28 00:00:00 77 t
7 2023-04-24 00:00:00 80 t
8 2023-11-01 00:00:00 60 f
9 2023-05-23 00:00:00 19 f
10 2023-05-05 00:00:00 72 f
Splitting the data to test and train
To generate a table that closely resembles an existing table in PostgreSQL, we can employ the following commands:
# Creating test and train tables
CREATE TABLE sales_training AS TABLE sales WITH NO DATA;
CREATE TABLE sales_test AS TABLE sales WITH NO DATA;
This syntax proves to be immensely valuable. We merely instruct PostgreSQL to create a table resembling another table, without including any data. In this scenario, we have created two tables, sales_training
and sales_test
, mirroring the structure of the sales
table but devoid of data.
Now, our objective is to divide the data within the sale table between sales_training
and sales_test
. We intend for our training set to encompass 80% of the rows, which in this case amounts to 8 rows:
# Populate the test and training set tables
WITH
training_sales_ids AS (
INSERT INTO sales_training
SELECT * FROM sales
ORDER BY random() LIMIT 8
RETURNING id
)
INSERT INTO sales_test
SELECT * FROM sales
WHERE id NOT IN (
SELECT id
FROM training_sales_ids
);
To populate the training data, we make a selection from the sales table. We use ORDER BY random()
to shuffle the rows to ensure randomisation. We then insert only the first 8 rows into the sales_training
table.
To populate the test table with the remaining rows, we keep the IDs of the training rows by specifying RETURNING id
in a common table expression. We then insert rows into sales_test
and exclude rows in training_sales_ids
.
This is the result:
SELECT * FROM sales_training;
SELECT * FROM sales_test;
Output:
# sales_training table data
id sales_date amount_charged shipped_status
2 2023-11-30 00:00:00 99 f
10 2023-05-05 00:00:00 72 f
1 2023-05-16 00:00:00 93 f
6 2023-05-28 00:00:00 77 t
7 2023-04-24 00:00:00 80 t
3 2023-10-25 00:00:00 53 f
8 2023-11-01 00:00:00 60 f
4 2023-12-17 00:00:00 64 t
# sales_test table data
id sales_date amount_charged shipped_status
5 2023-12-31 00:00:00 93 t
9 2023-05-23 00:00:00 19 f
And there you have it, a sales training dataset and a sales test dataset, created within the database using SQL!
Thanks a lot for reading. I have a total of 4 articles on this topic.
- Learn to generate data using SQL
- Learn to generate random data using SQL
- Learn to generate sample data using SQL
- Learn to split a dataset for testing and training using SQL (You already read!)
Post Tags:
- Previous: Learn to Generate Sample Data Using SQL
- Next: Anatomy of an SQL Query