Optimizing SQL: Simplifying Queries with Window Functions

Enterprise PostgreSQL Solutions

Comments are off

Optimizing SQL: Simplifying Queries with Window Functions

The term “Window Functions” never really give much away in terms of the capability and various options they provide. So, it made sense to explore these and while doing so, I thought it’s worth sharing. Turns out, Window functions are quite useful and can simplify writing complicated SQL queries.

When fetching data, you could fetch a tabular list of individual tuples or group those with aggregate functions. However, there are times when we need to include aggregate values in individual tuples; e.g. a list of products with each row containing category-wise average price, or perhaps, there is a need to calculate value from the previous or the next tuple. You can imagine that SQL becomes rather complicated.

Definition of Window Functions

“Window functions give us the ability to perform calculations over a set of rows that are related to the current query row.”

This is similar to the ability that aggregate functions provide except that where aggregate functions return a single row per each group, Window functions return all rows.

Let’s dig straight into the Window functions.

Setting Up The Environment

Let’s create an employee and a department table with department ID as a foreign key in the employee table.

CREATE TABLE dept
(
    dept_id SERIAL PRIMARY KEY,
    dept_name VARCHAR(50) NOT NULL
);

CREATE TABLE emp
(
    emp_id SERIAL PRIMARY KEY, 
    emp_name VARCHAR(50) NOT NULL,
    salary_amount decimal(8,2),
    dept_id integer references dept(dept_id)
);

Time to populate the data into the department table.

INSERT INTO dept(dept_name) 
VALUES 
('Finance'),
('Human Resource'),
('IT');

Let’s populate the employee table with a list of famous scientists, their departments and their salaries. The departments and salaries are assigned using random functions.

INSERT INTO emp (emp_name, salary_amount, dept_id) 
VALUES
('Albert Einstein', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Blaise Pascal', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Caroline Herschel', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Dorothy Hodgkin', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Elizabeth Blackburn', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Flossie Wong-Staal', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Geraldine Seydoux', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Ingrid Daubechies', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Jacqueline K. Barton', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Lise Meitner', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Marie Curie', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Max Planck', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Niels Bohr', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Polly Matzinger', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Richard Phillips Feynman', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Sarah Boysen', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Sir Isaac Newton', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1)),
('Wolfgang Ernst Pauli', FLOOR(RANDOM() * 5 + 1) * 1500, FLOOR(RANDOM() * 3 + 1));

So now the two tables are populated with the data, we are all set to jump into the Window functions.

Simple Queries with Window Functions

Let’s try to retrieve data for all employees along with the average salary in their department. Without Window functions, the query will look something like this:

SELECT emp_name, 
dept_name, 
salary_amount, 
(
    SELECT AVG(salary_amount)::DECIMAL(8,2)
    FROM emp e2
    WHERE e2.dept_id = e1.dept_id
    GROUP BY dept_id
) AS average_salary
FROM emp e1 INNER JOIN dept USING (dept_id)
ORDER BY dept_name;

Using Window functions and the query looks like:

SELECT emp_name,
dept_name,
salary_amount,
AVG (salary_amount) OVER 
(
   PARTITION BY dept_name
)::decimal(8,2) average_salary
FROM emp INNER JOIN dept USING (dept_id);

The query is shorter and avoids the need for subquery statements in the select clause. To state the obvious, the query with Window functions is only 8 lines compared with 11 lines without those. Obviously, for more complex queries, the difference becomes even more prominent.

Understanding Window Function Syntax

I’m going to briefly explain the syntax here to give you an idea of the basic structure of Window functions. For detailed syntax, please refer to the PostgreSQL documentation

<WINDOW FUNCTION>(<COMMA SEPARATED ARGUMENTS OR EXPRESSIONS>)
OVER <FRAME DEFINITION>

WINDOW DEFINITION is set of “partition by”, “order by” and “frame definition” clauses. Partition by clause effectively groups data rows based on the given criteria, order by sorts the data, and frame definition clause defines correlation of data with the current query row. From Window functions definition stated earlier, we can pickup 2 key parts; (1) calculations, (2) set of rows that are associated with the current query row. The “Window” function is the calculation we intend to perform, and the set of rows and association is determined by the window definition.

For the complete list of Window functions, please refer to PostgreSQL documentation. Some of the more useful ones are:

Function NameReturn TypeDescription
row_number()bigintThe row number within a group of rows.
rank()bigintRank, but in case multiple rows have the same rank, you’ll end up with gaps in the ranking numbers; e.g. if there are two rows with rank 1 within a group, the next rank will be 3.
dense_rank()bigintSimilar to the rank function, however, this does not create any gaps.
lag(value…)same type as valueAllows you to refer to previous row(s)
lead(value…)same type as valueAllows you to refer to the next row. 
first_value(value…)same type as valueFirst value in a group.
last_value(value…)same type as valueLast value in a group.

Lots of Window Functions in a Single Query

Let’s quickly have a look at the power of these functions in practice against the tables we created earlier in this blog.

The following query selects all rows with row number and ranking based on an employee’s salary in a department.

SELECT emp_name,
dept_name,
salary_amount,
ROW_NUMBER() OVER(PARTITION BY dept_name ORDER BY salary_amount),
RANK() OVER(PARTITION BY dept_name ORDER BY salary_amount),
DENSE_RANK() OVER(PARTITION BY dept_name ORDER BY salary_amount)
FROM emp e1 INNER JOIN dept USING (dept_id);
         emp_name         |   dept_name    | salary_amount | row_number | rank | dense_rank 
--------------------------+----------------+---------------+------------+------+------------
 Blaise Pascal            | Finance        |       1500.00 |          1 |    1 |          1
 Flossie Wong-Staal       | Finance        |       4500.00 |          2 |    2 |          2
 Marie Curie              | Finance        |       7500.00 |          3 |    3 |          3
 Dorothy Hodgkin          | Finance        |       7500.00 |          4 |    3 |          3
 Sarah Boysen             | Human Resource |       1500.00 |          1 |    1 |          1
 Max Planck               | Human Resource |       3000.00 |          2 |    2 |          2
 Ingrid Daubechies        | Human Resource |       4500.00 |          3 |    3 |          3
 Albert Einstein          | Human Resource |       6000.00 |          4 |    4 |          4
 Geraldine Seydoux        | Human Resource |       7500.00 |          5 |    5 |          5
 Sir Isaac Newton         | IT             |       1500.00 |          1 |    1 |          1
 Lise Meitner             | IT             |       3000.00 |          2 |    2 |          2
 Niels Bohr               | IT             |       3000.00 |          3 |    2 |          2
 Wolfgang Ernst Pauli     | IT             |       6000.00 |          4 |    4 |          3
 Richard Phillips Feynman | IT             |       6000.00 |          5 |    4 |          3
 Elizabeth Blackburn      | IT             |       6000.00 |          6 |    4 |          3
 Caroline Herschel        | IT             |       6000.00 |          7 |    4 |          3
 Polly Matzinger          | IT             |       7500.00 |          8 |    8 |          4
 Jacqueline K. Barton     | IT             |       7500.00 |          9 |    8 |          4
(18 rows)

The following query selects all rows with row number and shows first and last values, previous or next value in the group, and differences with the current row.

SELECT emp_name,
dept_name,
salary_amount,
LAG(salary_amount, 1) OVER(PARTITION BY dept_name ORDER BY salary_amount),
salary_amount - LAG(salary_amount, 1) OVER(PARTITION BY dept_name ORDER BY salary_amount) diff_from_previous,
LEAD(salary_amount, 1) OVER(PARTITION BY dept_name ORDER BY salary_amount),
salary_amount - LEAD(salary_amount, 1) OVER(PARTITION BY dept_name ORDER BY salary_amount) diff_from_next,
FIRST_VALUE(salary_amount) OVER(PARTITION BY dept_name ORDER BY salary_amount),
LAST_VALUE(salary_amount) OVER(PARTITION BY dept_name ORDER BY salary_amount)
FROM emp e1 INNER JOIN dept USING (dept_id);
         emp_name         |   dept_name    | salary_amount |   lag   | diff_from_previous |  lead   | diff_from_next | first_value | last_value 
--------------------------+----------------+---------------+---------+--------------------+---------+----------------+-------------+------------
 Blaise Pascal            | Finance        |       1500.00 |         |                    | 4500.00 |       -3000.00 |     1500.00 |    1500.00
 Flossie Wong-Staal       | Finance        |       4500.00 | 1500.00 |            3000.00 | 7500.00 |       -3000.00 |     1500.00 |    4500.00
 Marie Curie              | Finance        |       7500.00 | 4500.00 |            3000.00 | 7500.00 |           0.00 |     1500.00 |    7500.00
 Dorothy Hodgkin          | Finance        |       7500.00 | 7500.00 |               0.00 |         |                |     1500.00 |    7500.00
 Sarah Boysen             | Human Resource |       1500.00 |         |                    | 3000.00 |       -1500.00 |     1500.00 |    1500.00
 Max Planck               | Human Resource |       3000.00 | 1500.00 |            1500.00 | 4500.00 |       -1500.00 |     1500.00 |    3000.00
 Ingrid Daubechies        | Human Resource |       4500.00 | 3000.00 |            1500.00 | 6000.00 |       -1500.00 |     1500.00 |    4500.00
 Albert Einstein          | Human Resource |       6000.00 | 4500.00 |            1500.00 | 7500.00 |       -1500.00 |     1500.00 |    6000.00
 Geraldine Seydoux        | Human Resource |       7500.00 | 6000.00 |            1500.00 |         |                |     1500.00 |    7500.00
 Sir Isaac Newton         | IT             |       1500.00 |         |                    | 3000.00 |       -1500.00 |     1500.00 |    1500.00
 Lise Meitner             | IT             |       3000.00 | 1500.00 |            1500.00 | 3000.00 |           0.00 |     1500.00 |    3000.00
 Niels Bohr               | IT             |       3000.00 | 3000.00 |               0.00 | 6000.00 |       -3000.00 |     1500.00 |    3000.00
 Wolfgang Ernst Pauli     | IT             |       6000.00 | 3000.00 |            3000.00 | 6000.00 |           0.00 |     1500.00 |    6000.00
 Richard Phillips Feynman | IT             |       6000.00 | 6000.00 |               0.00 | 6000.00 |           0.00 |     1500.00 |    6000.00
 Elizabeth Blackburn      | IT             |       6000.00 | 6000.00 |               0.00 | 6000.00 |           0.00 |     1500.00 |    6000.00
 Caroline Herschel        | IT             |       6000.00 | 6000.00 |               0.00 | 7500.00 |       -1500.00 |     1500.00 |    6000.00
 Polly Matzinger          | IT             |       7500.00 | 6000.00 |            1500.00 | 7500.00 |           0.00 |     1500.00 |    7500.00
 Jacqueline K. Barton     | IT             |       7500.00 | 7500.00 |               0.00 |         |                |     1500.00 |    7500.00
(18 rows)

Useful in Calculating Running Total

Another useful query that I personally like is the one where we can calculate running totals which can be really useful especially in financial reports.

SELECT emp_name,
dept_name,
salary_amount,
SUM(salary_amount) OVER
(
    ORDER BY salary_amount
    ROWS BETWEEN UNBOUNDED PRECEDING
    AND CURRENT ROW
)::decimal(8,2) running_total
FROM emp e1 INNER JOIN dept USING (dept_id);
         emp_name         |   dept_name    | salary_amount | running_total 
--------------------------+----------------+---------------+---------------
 Sir Isaac Newton         | IT             |       1500.00 |       1500.00
 Blaise Pascal            | Finance        |       1500.00 |       3000.00
 Sarah Boysen             | Human Resource |       1500.00 |       4500.00
 Niels Bohr               | IT             |       3000.00 |       7500.00
 Lise Meitner             | IT             |       3000.00 |      10500.00
 Max Planck               | Human Resource |       3000.00 |      13500.00
 Flossie Wong-Staal       | Finance        |       4500.00 |      18000.00
 Ingrid Daubechies        | Human Resource |       4500.00 |      22500.00
 Wolfgang Ernst Pauli     | IT             |       6000.00 |      28500.00
 Caroline Herschel        | IT             |       6000.00 |      34500.00
 Elizabeth Blackburn      | IT             |       6000.00 |      40500.00
 Richard Phillips Feynman | IT             |       6000.00 |      46500.00
 Albert Einstein          | Human Resource |       6000.00 |      52500.00
 Polly Matzinger          | IT             |       7500.00 |      60000.00
 Dorothy Hodgkin          | Finance        |       7500.00 |      67500.00
 Geraldine Seydoux        | Human Resource |       7500.00 |      75000.00
 Marie Curie              | Finance        |       7500.00 |      82500.00
 Jacqueline K. Barton     | IT             |       7500.00 |      90000.00
(18 rows)

Conclusion

Certainly, Window functions provide a great ability and power for writing SQL queries that require returning a complete dataset with certain calculations on a set of rows. With great power, comes great responsibility.

The simplification may be at the cost of slight performance degradation, so make sure you have reviewed the query plan paying particular attention to any additional sorts may be introduced by Window functions. 

One caveat is that you can’t use Window functions in a where clause. You could, however, create a CTE with Window functions and use that in the where clause. All in all, I found these super useful and I’ll be looking forward to parallelisation in these in future releases of PostgreSQL.