The Database Doctor
Musing about Databases

TPC series - TPC-H Query 10 - Histograms and Functional Dependency

Welcome back to the TPC-H series, dear reader. And happy holidays to those of you who've already shut down.

In today's educational blog, I'm going to teach you about:

  - Histograms, and why an optimiser needs them to pick a good join order
  - Functional dependencies, and how they let an engine aggregate before joining
  - Bloom filters, and how they accelerate scans

This is a lot of ground to cover in the 5-15 minutes I have your attention. Every deep dive starts at the surface, so let us jump into the deep sea.

Query 10

Here is the query.

Like other TPC-H queries, it looks almost too simple to be worth talking about. And as always, the TPC-H council has a few surprises in store for us.

SELECT c_custkey,
       c_name,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue,
       c_acctbal,
       n_name,
       c_address,
       c_phone,
       c_comment
FROM customer,
     orders,
     lineitem,
     nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate >= '1994-06-01'
  AND o_orderdate < '1994-09-01'
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
GROUP BY c_custkey,
         c_name,
         c_acctbal,
         c_phone,
         n_name,
         c_address,
         c_comment
ORDER BY revenue DESC

A note on shorthand notation

Going forward, I will be using the join symbol ⨝ (Unicode code point U+2A1D) in my blogs.

This:

X ⨝ Y

Is shorthand for: join X with Y, in that order, with Y as the right-hand input of the join.

For example, this:

lineitem ⨝ orders

Means (assuming hash joins): build a hash table on orders and probe it with the rows streaming out of lineitem.

This notation doesn't fully correspond to that used by relational purists, who might say that join is commutative, so the parentheses and ordering don't matter. From the perspective of returning the right result, this is true (for inner joins):

X ⨝ Y = Y ⨝ X

This property of relational algebra allows us to reorder joins in the first place.

But the ordering of the expression does matter a lot - as we have seen. In fact, join ordering is the most interesting decision an optimiser makes. The logical rewrites relational algebra allows by reorganising expressions are trite. Let us, therefore, use a notation where the order we write things in matters and has meaning in terms of execution. You can call it the "physical join order" if logical/physical separation still has any useful meaning to you (for me, it lost its usefulness as an abstraction a long time ago).

The notation I use is designed so left deep joins won't have any parentheses, which in turn means bushy joins are easy to spot by simply skimming the written expression.

Selectivity of Filters

As always, filter selectivity:

Filter                                                       Selectivity  Cardinality
o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'   4%           ~50K
l_returnflag = 'R'                                           22%          ~1.3M
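
If you have a TPC-H scale factor 1 database loaded, you can check these numbers yourself. A sketch that should work on most SQL engines (adjust the date literal syntax if your engine insists):

-- orders has 1.5M rows at SF1; roughly 4% survive the date filter
SELECT AVG(CASE WHEN o_orderdate >= '1994-06-01'
                 AND o_orderdate < '1994-09-01'
                THEN 1.0 ELSE 0.0 END) AS orderdate_selectivity
FROM orders;

-- lineitem has ~6M rows at SF1; roughly 22% survive the returnflag filter
SELECT AVG(CASE WHEN l_returnflag = 'R' THEN 1.0 ELSE 0.0 END) AS returnflag_selectivity
FROM lineitem;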

Optimal Query Plan Analysis

Before reading on, please do the analysis I've taught you in the blogs so far. In particular, think about bushy joins.

Ready? Let us see if we agree...

Driving Table

Even after applying the filter l_returnflag = 'R', the lineitem table is still the largest stream. We start from there.

Join to orders

The only other filter we have is on orders. Next, we want to join our lineitem stream with the filtered orders.

In our new notation, the current tree is now: lineitem ⨝ orders.

Getting to customer and nation

To get from lineitem ⨝ orders to nation, we need to go via customer.

Remember, when a query has a structure that says: "you must go via X to get to Y" we often have the opportunity to do a bushy join.

We're now faced with a tradeoff:

  - Go bushy: build a separate customer ⨝ nation tree and join the rest of the query into it, or
  - Stay left deep: join customer onto the lineitem ⨝ orders stream first, then join nation onto the result.

You can see the resulting plans on the SQL Arena Query 10 page.

ClickHouse and Postgres pick the latter plan, which also consumes the smallest amount of memory.

But DuckDB picks the bushy plan, the one that consumes more memory. Not only that, the bushy plan in fact creates more joins than the left deep plan!

Here are the numbers:

Engine      Total Join Operators
PostgreSQL  1.6K
DuckDB      1.7K

The difference is small, but it is there. And it is particularly bad because the bushy join also consumes more memory. Why doesn't the DuckDB plan reduce the number of joins like we've seen bushy joins do?

Why Bushy doesn't always reduce joins

First, consider the result of lineitem ⨝ orders, taking both filters in the query into account. The output of just this join is ~108K rows.

In a left deep tree, we can estimate the joins performed later in the tree: every row coming out of lineitem ⨝ orders matches exactly one customer (o_custkey is a foreign key into customer), and every customer matches exactly one nation. Each of the two remaining joins therefore performs ~108K joins.

To illustrate what that looks like in an actual database, here is the annotated PostgreSQL plan:

Estimate    Actual  Operator
   51264     30170  SORT SUM(l_extendedprice * ('1' - l_discount))
   51264     30170  GROUP BY SORT c_custkey, n_name AGGREGATE c_name, SUM(l_extendedprice * ('1' - l_discount)), c_acctbal, c_address, c_phone, c_comment
   64080    108325  SORT c_custkey, n_name
   64080    108325  INNER JOIN HASH ON c_nationkey = n_nationkey  <--- Second PK/FK join, ~108K joins
      25        25  │└TABLE SCAN nation
   64080    108325  INNER JOIN HASH ON o_custkey = c_custkey      <--- First PK/FK join, ~108K
  187500    150000  │└TABLE SCAN customer
   64080    108325  INNER JOIN HASH ON l_orderkey = o_orderkey    <-- Output of combined filters
   88810     54070  │└TABLE SCAN orders WHERE (o_orderdate >= '1994-06-01') AND (o_orderdate < '1994-09-01')
 1745655   1392750  TABLE SCAN lineitem WHERE l_returnflag = 'R'

But wait, how many rows does the customer table have? Answer: 150K - more than the number of rows that come out of lineitem ⨝ orders (~108K).

That means that using a bushy join tree will do this:

lineitem ⨝ ((customer ⨝ nation) ⨝ orders)

DuckDB wisely picks orders to be on the build side - because it is smaller than (customer ⨝ nation). But because customer itself is bigger than the output of lineitem ⨝ orders, we end up with more joins in the bushy plan than we do in the left deep tree.

Why can't DuckDB see the correct plan?

To realise that the left deep plan is best, you first have to know something else about the query: that the result of the filtered lineitem ⨝ orders is smaller than customer.

And DuckDB does not know:

Estimate    Actual  Operator
       -     30170  SORT SUM(l_extendedprice * (1 - l_discount))
  410170     30170  PROJECT c_custkey, c_name, revenue, c_acctbal, n_name, c_address, c_phone, c_comment
  410170     30170  GROUP BY HASH #0, #1, #2, #3, #4, #5, #6 AGGREGATE SUM(#7)
  410170    108325  PROJECT c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment, l_extendedprice * (1.000 - l_discount)
  410170    108325  INNER JOIN HASH ON l_orderkey = o_orderkey <---- DuckDB thinks this join produces 410K rows
  304964     54071  │└INNER JOIN HASH ON c_custkey = o_custkey <---- Notice how the error below propagates ---v
  300000     54071  │ │└TABLE SCAN orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  144230    149995  │ INNER JOIN HASH ON c_nationkey = n_nationkey
      25        25  │ │└TABLE SCAN nation
  150000    149995  │ TABLE SCAN customer WHERE c_custkey >= 3
 1999607   1392700  TABLE SCAN lineitem WHERE l_returnflag = 'R'

The root cause of this mis-estimation is the poor estimation of the filter on o_orderdate. To estimate such a filter correctly, you need histograms in your statistics — and DuckDB doesn't (AFAIK) use histograms for estimation.

Both SQL Server and PostgreSQL have histograms - and it shows! Here are the estimates for orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01':

Engine      Estimate  Actual
DuckDB      300K      54K
PostgreSQL  88K       54K
SQL Server  54K       54K
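
On PostgreSQL you can inspect the histogram behind that estimate yourself. A sketch, assuming statistics are up to date (pg_stats is PostgreSQL's standard statistics view):

ANALYZE orders;

-- histogram_bounds holds the bucket boundaries the planner uses to
-- estimate range predicates like the o_orderdate filter above
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'o_orderdate';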

Aggregate before join and functional dependency — A big win!

In the previous blog about Q9 you learned about the optimisation of aggregating before joining.

SQL Server is the only engine I currently have in the SQL Arena that understands how to do that.

For Q10 - the results are impressive. Here is the plan SQL Server makes:

Estimate    Actual  Operator
   26548     30170  SORT Expr1009
   26548     30170  INNER JOIN HASH ON n_nationkey = c_nationkey
      25        25  │└TABLE SEEK nation
   26548     30170  INNER JOIN HASH ON c_custkey = o_custkey
   26548     30170  │└PROJECT CASE WHEN Expr1034 = 0 THEN NULL ELSE Expr1035 END AS Expr1009
                    |    v---- Big reduction in joins here!
   26548     30170  │ GROUP BY o_custkey HASH AGGREGATE COUNT(Expr1010) AS Expr1034, SUM(Expr1010) AS Expr1035
   56888    108325  │ INNER JOIN HASH ON o_orderkey = l_orderkey
   54113     54071  │ │└TABLE SEEK orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  138965    108325  │ PROJECT l_extendedprice * (1. - l_discount) AS Expr1010
  138965    216327  │ TABLE SEEK lineitem WHERE (l_returnflag = 'R' AND BLOOM(l_orderkey)) AND (l_returnflag = 'R')
  150000    150000  TABLE SEEK customer

The properties we want to group on all come via customer in the query; they are:

GROUP BY c_custkey,
         c_name,
         c_acctbal,
         c_phone,
         n_name,
         c_address,
         c_comment

SQL Server realises that this means it can aggregate on o_custkey alone, before the join, reducing the number of joins needed against customer and nation by 3x.

How does it do this?

Functional Dependencies

SQL Server observes something that PostgreSQL could know, but doesn't use: c_custkey is the primary key of customer. Every other column in the GROUP BY (c_name, c_acctbal, c_phone, c_address, c_comment, and, via c_nationkey, n_name) is therefore uniquely determined by c_custkey. And since c_custkey = o_custkey, grouping on o_custkey before the join produces exactly the same groups as the original GROUP BY.

The result of this analysis (this is what is called functional dependencies between columns) is a 3x reduction in joins.
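
To see what this transformation buys us, here is roughly what it looks like if you write it by hand (my manual sketch of the rewrite, not SQL Server's internal plan):

SELECT c_custkey,
       c_name,
       r.revenue,
       c_acctbal,
       n_name,
       c_address,
       c_phone,
       c_comment
FROM (SELECT o_custkey,
             SUM(l_extendedprice * (1 - l_discount)) AS revenue
      FROM orders,
           lineitem
      WHERE l_orderkey = o_orderkey
        AND o_orderdate >= '1994-06-01'
        AND o_orderdate < '1994-09-01'
        AND l_returnflag = 'R'
      GROUP BY o_custkey) AS r,
     customer,
     nation
WHERE c_custkey = r.o_custkey
  AND c_nationkey = n_nationkey
ORDER BY r.revenue DESC

The ~108K rows coming out of lineitem ⨝ orders collapse to ~30K groups before they ever touch customer and nation, and the rewrite is only valid because c_custkey is the primary key of customer.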

But that is not all; SQL Server has an additional trick to play.

Bloom Filtering

In previous blogs, I promised to get to this subject. I'm getting there - bit by bit (pun intentional).

In Q10, we have a good example of a bloom filter in action. Look at this fragment of the SQL Server plan:

   56888    108325  INNER JOIN HASH ON o_orderkey = l_orderkey   <--- Hash Join
   54113     54071  │└TABLE SEEK orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  138965    108325  PROJECT l_extendedprice * (1. - l_discount) AS Expr1010
                                                                      v--- What is this?
  138965    216327  TABLE SEEK lineitem WHERE (l_returnflag = 'R' AND BLOOM(l_orderkey)) AND (l_returnflag = 'R')

SQL Server hash joins orders with lineitem, yet it does a SEEK into lineitem - how is that possible? There is no index supporting l_returnflag = 'R', so it must be seeking on something else.

Compare with PostgreSQL doing the same join:

   64080    108325  INNER JOIN HASH ON l_orderkey = o_orderkey
   88810     54070  │└TABLE SCAN orders WHERE (o_orderdate >= '1994-06-01') AND (o_orderdate < '1994-09-01')
 1745655   1392750  TABLE SCAN lineitem WHERE l_returnflag = 'R'

PostgreSQL ends up reading (and joining) 1.4M rows out of lineitem - compared with 216K rows for SQL Server. That is a ~6x reduction in work that needs doing!

What is a bloom filter and why does it matter to a database?

The Wikipedia entry on bloom filters is pretty good, so I am not going to repeat the full description here.

TL;DR: bloom filters are:

  - A compact, probabilistic representation of a set
  - Able to answer "is this value possibly in the set?" with no false negatives, but occasional false positives
  - Cheap to build and very cheap to probe

The algorithm SQL Server executes is this:

  1. Find all orders matching o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  2. Construct a bloom filter on the set of all o_orderkey from the above
  3. When scanning lineitem, use the bloom filter to eliminate l_orderkey values that are certainly not in the set found in step 1. This may let some false positives through
  4. Do the join lineitem ⨝ orders on what remains, correcting for the false positives

This is an extraordinarily powerful optimisation in analytical systems. It allows you to accelerate scans, sometimes by orders of magnitude.
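
If you want an SQL-level intuition for the effect, the bloom filter behaves roughly like this semijoin pushed down into the lineitem scan (a conceptual sketch only; the real filter lives inside the scan operator, is probabilistic, and may let false positives through, which the hash join then removes):

SELECT l_orderkey,
       l_extendedprice,
       l_discount
FROM lineitem
WHERE l_returnflag = 'R'
  AND l_orderkey IN (SELECT o_orderkey
                     FROM orders
                     WHERE o_orderdate >= '1994-06-01'
                       AND o_orderdate < '1994-09-01')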

Joins aren't expensive and Dimensional Models are better

There is an entire blog to be written about bloom filters and what they mean for data modellers.

But for today, take this away: bloom filters let a filter on a small dimension table be pushed into the scan of the large fact table it joins to. Most non-matching rows never even leave the scan, which makes the join itself cheap.

It also means that One Big Table (OBT) models are uniquely unsuited for bloom filters. While bloom filters on OBT models are possible (you can turn the filter into a hash over a column calculation), they are extremely inefficient to compute: they often have to operate on long strings duplicated across the large table. Worse, on a heavily denormalised table the complex filter itself must be evaluated on every row of the large table, whereas a star schema only needs to evaluate it on the much smaller dimension table.

Dimensional models allow any filter on a dimension table to be turned into a simple bloom filter over integer keys in the fact table. You only need to evaluate the complex filter once, on the dimension table, to apply it to every row in the fact table. This is computationally highly efficient.
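
As an illustration, consider a hypothetical star schema with a sales fact table and a customer_dim dimension table (these names are made up, not part of TPC-H). An engine with this optimisation evaluates the string predicates once against the small dimension table, builds a bloom filter over the surviving customer_key values, and uses it to skip most of the sales table during the scan:

SELECT SUM(s.amount)
FROM sales s,          -- large fact table: integer keys and measures only
     customer_dim c    -- small dimension table: carries the wide string columns
WHERE s.customer_key = c.customer_key
  AND c.segment = 'AUTOMOBILE'
  AND c.country = 'FRANCE'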

These trivial insights from computer science add further weight to my argument that joins aren't expensive and that dimensional models are better than One Big Table.

Fleshing out the details in a way that appeals to the masses will take me some time. Time I don't yet have, because the holidays are almost here.

Summary

Today we saw how histograms are sometimes needed to avoid bad, bushy joins. We saw how DuckDB's lack of these data structures penalises it against PostgreSQL for Q10.

We also saw how knowledge of functional dependencies can be used to greatly reduce the work queries need to do. Remember to declare those keys, my friends!

Finally, we broke the surface and did a shallow dive into bloom filters and how they greatly accelerate scans. In the case of Q10, we saw a 6x improvement.