Skip to main content

Learn to Split a Dataset for Testing and Training Using SQL

A frequent task in data analysis involves dividing a dataset into training and testing portions. The training dataset is utilised to train the model, while the test dataset is employed to assess the model's performance.

We have a total of 4 articles on this topic.

  1. Learn to generate data using SQL
  2. Learn to generate random data using SQL
  3. Learn to generate sample data using SQL
  4. Learn to split a dataset for testing and training using SQL (You are reading now)

Creating a sales table

To put into practice the knowledge we have acquired thus far, let us begin by generating a sales table with randomly generated data:

# Creating a sales table for practice

CREATE TABLE sales AS
  SELECT
      id,
      '2023-01-01'::date + interval '1 day' * ceil(random() * 365) as sales_date,
      10 + ceil(90 * random()) as amount_charged,
      random() > 0.6 as shipped_status
  FROM
      generate_series(1, 10) AS id
  ORDER BY
      1;

select * from sales;

Output:

id	sales_date	  amount_charged	  shipped_status
1	2023-05-16 00:00:00	93	f
2	2023-11-30 00:00:00	99	f
3	2023-10-25 00:00:00	53	f
4	2023-12-17 00:00:00	64	t
5	2023-12-31 00:00:00	93	t
6	2023-05-28 00:00:00	77	t
7	2023-04-24 00:00:00	80	t
8	2023-11-01 00:00:00	60	f
9	2023-05-23 00:00:00	19	f
10	2023-05-05 00:00:00	72	f

The sales table comprises the sales date, amount, and an indication of whether the item was shipped.

To generate the table, apply the techniques we have learned throughout the previous articles:

# Generate a random day in 2023

select '2023-01-01'::date + interval '1 day' * ceil(random() * 365) as sales_date;

Output:

sales_date
2023-11-30 00:00:00

Generate a random date within the year 2023 by adding a random number of days between 1 and 365 to January 1st, 2023.

Then, generate a random amount charged ranging from 11 to 100:

# Generate a random charged amount between 11 and 100

select 10 + ceil(90 * random()) as amount_charged;

Output:

amount_charged
13

Lastly, generate the parameter that determines whether the item is shipped or not:

# Generate the parameter to determine the shipped status

select random() > 0.6 as shipped_status;

Output:

shipped_status
f

Within our fabricated data, we aim to include 40% shipped status sales. To accomplish this, we generate a boolean value using an expression that will yield true approximately 40% of the time.

Here is an example showcasing the structure of the data:

select * FROM sales;

Output:

# Content of sales table

id	sales_date	  amount_charged	  shipped_status
1	2023-05-16 00:00:00	93	f
2	2023-11-30 00:00:00	99	f
3	2023-10-25 00:00:00	53	f
4	2023-12-17 00:00:00	64	t
5	2023-12-31 00:00:00	93	t
6	2023-05-28 00:00:00	77	t
7	2023-04-24 00:00:00	80	t
8	2023-11-01 00:00:00	60	f
9	2023-05-23 00:00:00	19	f
10	2023-05-05 00:00:00	72	f

Splitting the data to test and train

To generate a table that closely resembles an existing table in PostgreSQL, we can employ the following commands:

# Creating test and train tables

CREATE TABLE sales_training AS TABLE sales WITH NO DATA;

CREATE TABLE sales_test AS TABLE sales WITH NO DATA;

This syntax proves to be immensely valuable. We merely instruct PostgreSQL to create a table resembling another table, without including any data. In this scenario, we have created two tables, sales_training and sales_test, mirroring the structure of the sales table but devoid of data.

Now, our objective is to divide the data within the sale table between sales_training and sales_test. We intend for our training set to encompass 80% of the rows, which in this case amounts to 8 rows:

# Populate the test and training set tables

WITH
training_sales_ids AS (
    INSERT INTO sales_training
    SELECT * FROM sales
    ORDER BY random() LIMIT 8
    RETURNING id
)
INSERT INTO sales_test
SELECT * FROM sales
WHERE id NOT IN (
  SELECT id
  FROM training_sales_ids
);

To populate the training data, we make a selection from the sales table. We use ORDER BY random() to shuffle the rows to ensure randomisation. We then insert only the first 8 rows into the sales_training table.

To populate the test table with the remaining rows, we keep the IDs of the training rows by specifying RETURNING id in a common table expression. We then insert rows into sales_test and exclude rows in training_sales_ids.

This is the result:

SELECT * FROM sales_training;

SELECT * FROM sales_test;

Output:

# sales_training table data

id	sales_date	  amount_charged	  shipped_status
2	2023-11-30 00:00:00	99	f
10	2023-05-05 00:00:00	72	f
1	2023-05-16 00:00:00	93	f
6	2023-05-28 00:00:00	77	t
7	2023-04-24 00:00:00	80	t
3	2023-10-25 00:00:00	53	f
8	2023-11-01 00:00:00	60	f
4	2023-12-17 00:00:00	64	t
# sales_test table data

id	sales_date	  amount_charged	  shipped_status
5	2023-12-31 00:00:00	93	t
9	2023-05-23 00:00:00	19	f

And there you have it, a sales training dataset and a sales test dataset, created within the database using SQL!

Thanks a lot for reading. I have a total of 4 articles on this topic.

  1. Learn to generate data using SQL
  2. Learn to generate random data using SQL
  3. Learn to generate sample data using SQL
  4. Learn to split a dataset for testing and training using SQL (You already read!)

Post Tags: