The Database Doctor
Musing about Databases

TPC series - TPC-H Query 10 - Histograms and Functional Dependency

Welcome back to the TPC-H series, dear reader. And happy holidays to those of you who've already shut down.

In today's educational blog, I'm going to teach you about:

  - Histograms, and why an optimiser needs them to pick a good join order
  - Functional dependencies, and how they let an engine aggregate before joining
  - Bloom filters, and how they accelerate scans

This is a lot of ground to cover in the 5-15 minutes I have your attention. Every deep dive starts at the surface, so let us jump into the deep sea.

Query 10

Here is the query.

Like other TPC-H queries, it looks almost too simple to be worth talking about. And as always, the TPC-H council has a few surprises in store for us.

SELECT c_custkey,
       c_name,
       SUM(l_extendedprice * (1 - l_discount)) AS revenue,
       c_acctbal,
       n_name,
       c_address,
       c_phone,
       c_comment
FROM customer,
     orders,
     lineitem,
     nation
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND o_orderdate >= '1994-06-01'
  AND o_orderdate < '1994-09-01'
  AND l_returnflag = 'R'
  AND c_nationkey = n_nationkey
GROUP BY c_custkey,
         c_name,
         c_acctbal,
         c_phone,
         n_name,
         c_address,
         c_comment
ORDER BY revenue DESC

A note on shorthand notation

Going forward, I will be using the join symbol ⨝ (Unicode code point U+2A1D) in my blogs.

This:

X ⨝ Y

Is shorthand for: join X with Y, in that order, with Y as the right-hand input of the join.

For example, this:

lineitem ⨝ orders

Means (assuming hash joins): build a hash table on orders and probe it with the rows streaming out of lineitem.

This notation doesn't fully correspond to that used by relational purists, who might say that join is commutative, so the parentheses and ordering don't matter. From the perspective of returning the right result, this is true (for inner joins):

X ⨝ Y = Y ⨝ X

This property of relational algebra allows us to reorder joins in the first place.

But the ordering of the expression does matter a lot - as we have seen. In fact, join ordering is the most interesting decision an optimiser makes. The logical rewrites relational algebra allows by reorganising expressions are trite. Let us, therefore, use a notation where the order we write things in matters and has meaning in terms of execution. You can call it the "physical join order" if logical/physical separation still has any useful meaning to you (for me, it lost its usefulness as an abstraction a long time ago).

The notation I use is designed so left deep joins won't have any parentheses, which in turn means bushy joins are easy to spot by simply skimming the written expression.

Selectivity of Filters

As always, filter selectivity:

Filter                                                       Selectivity  Cardinality
o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'   4%           ~50K
l_returnflag = 'R'                                           22%          ~1.3M
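
If you have a TPC-H scale factor 1 database loaded, you can check these numbers yourself. A sketch that should work on most SQL engines (adjust the date literal syntax if your engine insists):

-- orders has 1.5M rows at SF1; roughly 4% survive the date filter
SELECT AVG(CASE WHEN o_orderdate >= '1994-06-01'
                 AND o_orderdate < '1994-09-01'
                THEN 1.0 ELSE 0.0 END) AS orderdate_selectivity
FROM orders;

-- lineitem has ~6M rows at SF1; roughly 22% survive the returnflag filter
SELECT AVG(CASE WHEN l_returnflag = 'R' THEN 1.0 ELSE 0.0 END) AS returnflag_selectivity
FROM lineitem;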

Optimal Query Plan Analysis

Before reading on, please do the analysis I've taught you in the blogs so far. In particular, think about bushy joins.

Ready? Let us see if we agree...

Driving Table

Even after applying the filter l_returnflag = 'R', the lineitem table is still the largest stream. We start from there.

Join to orders

The only other filter we have is on orders. Next, we want to join our lineitem stream with the filtered orders.

In our new notation, the current tree is now: lineitem ⨝ orders.

Getting to customer and nation

To get from lineitem ⨝ orders to nation, we need to go via customer.

Remember, when a query has a structure that says: "you must go via X to get to Y" we often have the opportunity to do a bushy join.

We're now faced with a tradeoff:

  - Go bushy: build a separate customer ⨝ nation tree and join the rest of the query into it, or
  - Stay left deep: join customer onto the lineitem ⨝ orders stream first, then join nation onto the result.

You can see the resulting plans on the SQL Arena Query 10 page.

ClickHouse and Postgres pick the latter plan, which also consumes the smallest amount of memory.

But DuckDB picks the bushy plan, the one that consumes more memory. Not only that, the bushy plan in fact creates more joins than the left deep plan!

Here are the numbers:

Engine      Total Join Operators
PostgreSQL  1.6K
DuckDB      1.7K

The difference is small, but it is there. And it is particularly bad because the bushy join also consumes more memory. Why doesn't the DuckDB plan reduce the number of joins like we've seen bushy joins do?

Why Bushy doesn't always reduce joins

First, consider the result of lineitem ⨝ orders, taking both filters in the query into account. The output of just this join is ~108K rows.

In a left deep tree, we can estimate the joins performed later in the tree: every row coming out of lineitem ⨝ orders matches exactly one customer (o_custkey is a foreign key into customer), and every customer matches exactly one nation. Each of the two remaining joins therefore performs ~108K joins.

To illustrate what that looks like in an actual database, here is the annotated PostgreSQL plan:

Estimate    Actual  Operator
   51264     30170  SORT SUM(l_extendedprice * ('1' - l_discount))
   51264     30170  GROUP BY SORT c_custkey, n_name AGGREGATE c_name, SUM(l_extendedprice * ('1' - l_discount)), c_acctbal, c_address, c_phone, c_comment
   64080    108325  SORT c_custkey, n_name
   64080    108325  INNER JOIN HASH ON c_nationkey = n_nationkey  <--- Second PK/FK join, ~108K joins
      25        25  │└TABLE SCAN nation
   64080    108325  INNER JOIN HASH ON o_custkey = c_custkey      <--- First PK/FK join, ~108K
  187500    150000  │└TABLE SCAN customer
   64080    108325  INNER JOIN HASH ON l_orderkey = o_orderkey    <-- Output of combined filters
   88810     54070  │└TABLE SCAN orders WHERE (o_orderdate >= '1994-06-01') AND (o_orderdate < '1994-09-01')
 1745655   1392750  TABLE SCAN lineitem WHERE l_returnflag = 'R'

But wait, how many rows does the customer table have? Answer: 150K - more than the number of rows that come out of lineitem ⨝ orders (~108K).

That means that using a bushy join tree will do this:

lineitem ⨝ ((customer ⨝ nation) ⨝ orders)

DuckDB wisely picks orders to be on the build side - because it is smaller than (customer ⨝ nation). But because customer itself is bigger than the output of lineitem ⨝ orders, we end up with more joins in the bushy plan than we do in the left deep tree.

Why can't DuckDB see the correct plan?

To realise that the left deep plan is best, you first have to know something else about the query: that the result of the filtered lineitem ⨝ orders is smaller than customer.

And DuckDB does not know:

Estimate    Actual  Operator
       -     30170  SORT SUM(l_extendedprice * (1 - l_discount))
  410170     30170  PROJECT c_custkey, c_name, revenue, c_acctbal, n_name, c_address, c_phone, c_comment
  410170     30170  GROUP BY HASH #0, #1, #2, #3, #4, #5, #6 AGGREGATE SUM(#7)
  410170    108325  PROJECT c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment, l_extendedprice * (1.000 - l_discount)
  410170    108325  INNER JOIN HASH ON l_orderkey = o_orderkey <---- DuckDB thinks this join produces 410K rows
  304964     54071  │└INNER JOIN HASH ON c_custkey = o_custkey <---- Notice how the error below propagates ---v
  300000     54071  │ │└TABLE SCAN orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  144230    149995  │ INNER JOIN HASH ON c_nationkey = n_nationkey
      25        25  │ │└TABLE SCAN nation
  150000    149995  │ TABLE SCAN customer WHERE c_custkey >= 3
 1999607   1392700  TABLE SCAN lineitem WHERE l_returnflag = 'R'

The root cause of this mis-estimation is the poor estimation of the filter on o_orderdate. To estimate such a filter correctly, you need histograms in your statistics — and DuckDB doesn't (AFAIK) use histograms for estimation.

Both SQL Server and PostgreSQL have histograms - and it shows! Here are the estimates for orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01':

Engine      Estimate  Actual
DuckDB      300K      54K
PostgreSQL  88K       54K
SQL Server  54K       54K
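
On PostgreSQL you can inspect the histogram behind that estimate yourself. A sketch, assuming statistics are up to date (pg_stats is PostgreSQL's standard statistics view):

ANALYZE orders;

-- histogram_bounds holds the bucket boundaries the planner uses to
-- estimate range predicates like the o_orderdate filter above
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'o_orderdate';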

Aggregate before join and functional dependency — A big win!

In the previous blog about Q9 you learned about the optimisation of aggregating before joining.

SQL Server is the only engine I currently have in the SQL Arena that understands how to do that.

For Q10 - the results are impressive. Here is the plan SQL Server makes:

Estimate    Actual  Operator
   26548     30170  SORT Expr1009
   26548     30170  INNER JOIN HASH ON n_nationkey = c_nationkey
      25        25  │└TABLE SEEK nation
   26548     30170  INNER JOIN HASH ON c_custkey = o_custkey
   26548     30170  │└PROJECT CASE WHEN Expr1034 = 0 THEN NULL ELSE Expr1035 END AS Expr1009
                    |    v---- Big reduction in joins here!
   26548     30170  │ GROUP BY o_custkey HASH AGGREGATE COUNT(Expr1010) AS Expr1034, SUM(Expr1010) AS Expr1035
   56888    108325  │ INNER JOIN HASH ON o_orderkey = l_orderkey
   54113     54071  │ │└TABLE SEEK orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  138965    108325  │ PROJECT l_extendedprice * (1. - l_discount) AS Expr1010
  138965    216327  │ TABLE SEEK lineitem WHERE (l_returnflag = 'R' AND BLOOM(l_orderkey)) AND (l_returnflag = 'R')
  150000    150000  TABLE SEEK customer

The properties we want to group on all come via customer in the query; they are:

GROUP BY c_custkey,
         c_name,
         c_acctbal,
         c_phone,
         n_name,
         c_address,
         c_comment

SQL Server realises that this means it can aggregate on o_custkey alone, before the join, reducing the number of joins needed against customer and nation by 3x.

How does it do this?

Functional Dependencies

SQL Server observes something that PostgreSQL could know, but doesn't use: c_custkey is the primary key of customer. Every other column in the GROUP BY (c_name, c_acctbal, c_phone, c_address, c_comment, and, via c_nationkey, n_name) is therefore uniquely determined by c_custkey. And since c_custkey = o_custkey, grouping on o_custkey before the join produces exactly the same groups as the original GROUP BY.

The result of this analysis (this is what is called functional dependencies between columns) is a 3x reduction in joins.
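
To see what this transformation buys us, here is roughly what it looks like if you write it by hand (my manual sketch of the rewrite, not SQL Server's internal plan):

SELECT c_custkey,
       c_name,
       r.revenue,
       c_acctbal,
       n_name,
       c_address,
       c_phone,
       c_comment
FROM (SELECT o_custkey,
             SUM(l_extendedprice * (1 - l_discount)) AS revenue
      FROM orders,
           lineitem
      WHERE l_orderkey = o_orderkey
        AND o_orderdate >= '1994-06-01'
        AND o_orderdate < '1994-09-01'
        AND l_returnflag = 'R'
      GROUP BY o_custkey) AS r,
     customer,
     nation
WHERE c_custkey = r.o_custkey
  AND c_nationkey = n_nationkey
ORDER BY r.revenue DESC

The ~108K rows coming out of lineitem ⨝ orders collapse to ~30K groups before they ever touch customer and nation, and the rewrite is only valid because c_custkey is the primary key of customer.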

But that is not all; SQL Server has an additional trick to play.

Bloom Filtering

In previous blogs, I promised to get to this subject. I'm getting there - bit by bit (pun intentional).

In Q10, we have a good example of a bloom filter in action. Look at this fragment of the SQL Server plan:

   56888    108325  INNER JOIN HASH ON o_orderkey = l_orderkey   <--- Hash Join
   54113     54071  │└TABLE SEEK orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  138965    108325  PROJECT l_extendedprice * (1. - l_discount) AS Expr1010
                                                                      v--- What is this?
  138965    216327  TABLE SEEK lineitem WHERE (l_returnflag = 'R' AND BLOOM(l_orderkey)) AND (l_returnflag = 'R')

SQL Server hash joins orders with lineitem, yet it does a SEEK into lineitem - how is that possible? There is no index supporting l_returnflag = 'R', so it must be seeking on something else.

Compare with PostgreSQL doing the same join:

   64080    108325  INNER JOIN HASH ON l_orderkey = o_orderkey
   88810     54070  │└TABLE SCAN orders WHERE (o_orderdate >= '1994-06-01') AND (o_orderdate < '1994-09-01')
 1745655   1392750  TABLE SCAN lineitem WHERE l_returnflag = 'R'

PostgreSQL ends up reading (and joining) 1.4M rows out of lineitem - compared with 216K rows for SQL Server. That is a ~6x reduction in work that needs doing!

What is a bloom filter and why does it matter to a database?

The Wikipedia entry on bloom filters is pretty good, so I am not going to repeat the full description here.

TL;DR: bloom filters are:

  - A compact, probabilistic representation of a set
  - Able to answer "is this value possibly in the set?" with no false negatives, but occasional false positives
  - Cheap to build and very cheap to probe

The algorithm SQL Server executes is this:

  1. Find all orders matching o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
  2. Construct a bloom filter on the set of all o_orderkey from the above
  3. When scanning lineitem, use the bloom filter to eliminate l_orderkey values that are certainly not in the set found in step 1. This may let some false positives through
  4. Do the join lineitem ⨝ orders on what remains, correcting for the false positives

This is an extraordinarily powerful optimisation in analytical systems. It allows you to accelerate scans, sometimes by orders of magnitude.
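
If you want an SQL-level intuition for the effect, the bloom filter behaves roughly like this semijoin pushed down into the lineitem scan (a conceptual sketch only; the real filter lives inside the scan operator, is probabilistic, and may let false positives through, which the hash join then removes):

SELECT l_orderkey,
       l_extendedprice,
       l_discount
FROM lineitem
WHERE l_returnflag = 'R'
  AND l_orderkey IN (SELECT o_orderkey
                     FROM orders
                     WHERE o_orderdate >= '1994-06-01'
                       AND o_orderdate < '1994-09-01')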

Joins aren't expensive and Dimensional Models are better

There is an entire blog to be written about bloom filters and what they mean for data modellers.

But for today, take this away: bloom filters let a filter on a small dimension table be pushed into the scan of the large fact table it joins to. Most non-matching rows never even leave the scan, which makes the join itself cheap.

It also means that One Big Table (OBT) models are uniquely unsuited for bloom filters. While bloom filters on OBT models are possible (you can turn the filter into a hash over a column calculation), they are extremely inefficient to compute: they often have to operate on long strings duplicated across the large table. Worse, on a heavily denormalised table the complex filter itself must be evaluated on every row of the large table, whereas a star schema only needs to evaluate it on the much smaller dimension table.

Dimensional models allow any filter on a dimension table to be turned into a simple bloom filter over integer keys in the fact table. You only need to evaluate the complex filter once, on the dimension table, to apply it to every row in the fact table. This is computationally highly efficient.
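
As an illustration, consider a hypothetical star schema with a sales fact table and a customer_dim dimension table (these names are made up, not part of TPC-H). An engine with this optimisation evaluates the string predicates once against the small dimension table, builds a bloom filter over the surviving customer_key values, and uses it to skip most of the sales table during the scan:

SELECT SUM(s.amount)
FROM sales s,          -- large fact table: integer keys and measures only
     customer_dim c    -- small dimension table: carries the wide string columns
WHERE s.customer_key = c.customer_key
  AND c.segment = 'AUTOMOBILE'
  AND c.country = 'FRANCE'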

These trivial insights from computer science add further weight to my argument that joins aren't expensive and that dimensional models are better than One Big Table.

Fleshing out the details in a way that appeals to the masses will take me some time. Time I don't yet have, because the holidays are almost here.

Summary

Today we saw how histograms are sometimes needed to avoid bad, bushy joins. We saw how DuckDB's lack of these data structures penalises it against PostgreSQL for Q10.

We also saw how knowledge of functional dependencies can be used to greatly reduce the work queries need to do. Remember to declare those keys, my friends!

Finally, we broke the surface and did a shallow dive into bloom filters and how they greatly accelerate scans. In the case of Q10, we saw a 6x improvement.