TPC series - TPC-H Query 10 - Histograms and Functional Dependency
Welcome back to the TPC-H series, dear reader. And happy holidays to those of you who've already shut down.
In today's educational blog, I'm going to teach you about:
- The importance of histograms
- When not to do bushy joins
- Functional dependencies and how they speed up queries
- Bloom filters
This is a lot of ground to cover in the 5-15 minutes or so I have your attention. Every deep dive starts at the surface — let us jump into the deep sea.
Query 10
Here is the query.
Like other TPC-H queries, it looks almost too simple to be worth talking about. And as always, the TPC-H council has some surprises in store for us.
SELECT c_custkey,
c_name,
SUM(l_extendedprice * (1 - l_discount)) AS revenue,
c_acctbal,
n_name,
c_address,
c_phone,
c_comment
FROM customer,
orders,
lineitem,
nation
WHERE c_custkey = o_custkey
AND l_orderkey = o_orderkey
AND o_orderdate >='1994-06-01'
AND o_orderdate < '1994-09-01'
AND l_returnflag = 'R'
AND c_nationkey = n_nationkey
GROUP BY c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
ORDER BY revenue DESC
A note on shorthand notation
Going forward, I will be using the ⨝ symbol (Unicode code point U+2A1D) in my blogs.
This:
X ⨝ Y
is shorthand for:
- Look up values in a hash table of Y (or seek into it) while iterating over X
For example, this:
orders ⨝ (customer ⨝ nation)
means (assuming hash joins):
- Pre-join customer with nation using a hash table on nation
- Build a hash table on the result of customer ⨝ nation (because it is on the right side of the join to orders)
- Iterate over orders, looking into the hash table made by (customer ⨝ nation)
This notation doesn't fully correspond to that used by relational purists, who might say that join is commutative, so the parentheses and ordering don't matter. From the perspective of returning the right result, this is true (for inner joins):
X ⨝ Y = Y ⨝ X
This property of relational algebra allows us to reorder joins in the first place.
But the ordering of the expression does matter a lot - as we have seen. In fact, join ordering is the most interesting problem an optimiser solves. The logical rewrites relational algebra allows by reorganising expressions are trite. Let us, therefore, use a notation where the order in which we write the expression matters and has meaning in terms of execution. You can call it the "physical join order" if the logical/physical separation still has any useful meaning to you (for me, it lost its usefulness as an abstraction a long time ago).
The notation I use is designed so left deep joins won't have any parentheses, which in turn means bushy joins are easy to spot by simply skimming the written expression.
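To make the execution semantics of the notation concrete, here is a minimal Python sketch of a hash join; the relations and keys below are toy data invented for illustration, not real TPC-H output.

```python
# Sketch of what X ⨝ Y means here: build a hash table on Y (the right side),
# then probe it while streaming over X (the left side). Toy data only.

def hash_join(X, Y, x_key, y_key):
    build = {}                                   # build phase: hash table on Y
    for y_row in Y:
        build.setdefault(y_row[y_key], []).append(y_row)
    for x_row in X:                              # probe phase: iterate over X
        for y_row in build.get(x_row[x_key], []):
            yield {**x_row, **y_row}

customer = [{"c_custkey": 1, "c_nationkey": 7}, {"c_custkey": 2, "c_nationkey": 9}]
nation   = [{"n_nationkey": 7, "n_name": "FRANCE"}, {"n_nationkey": 9, "n_name": "PERU"}]
orders   = [{"o_orderkey": 10, "o_custkey": 1}, {"o_orderkey": 11, "o_custkey": 2}]

# orders ⨝ (customer ⨝ nation): the bushy sub-join is computed first and ends up
# on the build side of the join to orders.
cust_nation = list(hash_join(customer, nation, "c_nationkey", "n_nationkey"))
print(list(hash_join(orders, cust_nation, "o_custkey", "c_custkey")))
```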
Selectivity of Filters
As always, filter selectivity:
| Filter | Selectivity | Cardinality |
|---|---|---|
| o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01' | 4% | ~50K |
| l_returnflag = 'R' | 22% | 1.3M |
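If you want to check these numbers yourself, here is a hedged sketch that generates TPC-H at scale factor 1 with DuckDB's tpch extension and counts the qualifying rows (it assumes the extension is available in your DuckDB build):

```python
# Sketch: reproduce the filter selectivities on a locally generated TPC-H SF1 dataset.
import duckdb

con = duckdb.connect()
con.execute("INSTALL tpch")
con.execute("LOAD tpch")
con.execute("CALL dbgen(sf=1)")      # generates the TPC-H tables in memory

orders_hit = con.execute("""
    SELECT count(*) FROM orders
    WHERE o_orderdate >= DATE '1994-06-01' AND o_orderdate < DATE '1994-09-01'
""").fetchone()[0]
orders_total = con.execute("SELECT count(*) FROM orders").fetchone()[0]

lineitem_hit = con.execute("SELECT count(*) FROM lineitem WHERE l_returnflag = 'R'").fetchone()[0]
lineitem_total = con.execute("SELECT count(*) FROM lineitem").fetchone()[0]

print(f"orders filter:   {orders_hit:,} rows ({orders_hit / orders_total:.1%})")
print(f"lineitem filter: {lineitem_hit:,} rows ({lineitem_hit / lineitem_total:.1%})")
```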
Optimal Query Plan Analysis
Before reading on, please do the analysis I've taught you in the blogs so far. In particular, think about bushy joins.
Ready? Let us see if we agree...
Driving Table
Even after filtering on l_returnflag = 'R', the lineitem table is still the largest stream.
We start from that.
Join to orders
The only other filter we have is on orders.
Next, we want to join our lineitem stream with the filtered orders.
In our new notation, the current tree is now: lineitem ⨝ orders.
Getting to customer and nation
To get from lineitem ⨝ orders to nation, we need to go via customer.
Remember, when a query has a structure that says: "you must go via X to get to Y" we often have the opportunity to do a bushy join.
We're now faced with a tradeoff:

- Bushy: reduce the total number of joins with a bushy join tree
  - The tree will either be: lineitem ⨝ orders ⨝ (customer ⨝ nation)
  - Or maybe we bush out orders too: lineitem ⨝ (orders ⨝ (customer ⨝ nation)) or lineitem ⨝ ((customer ⨝ nation) ⨝ orders)
  - All these trees will need extra memory, since we need to store n_name, c_address etc. in the hash table that will eventually join to lineitem
- Left deep: join directly using a left deep tree
  - The tree will be: lineitem ⨝ orders ⨝ customer ⨝ nation
  - This results (does it?) in a higher number of joins, but reduces the total memory needed to run the query
You can see the resulting plans on the SQL Arena Query 10 page.
ClickHouse and Postgres pick the latter plan, which also consumes the smallest amount of memory.
But DuckDB picks the bushy plan, the one that consumes more memory. Not only that, the bushy plan in fact creates more joins than the left deep plan!
Here are the numbers:
| Engine | Total Join Operators |
|---|---|
| PostgreSQL | 1.6K |
| DuckDB | 1.7K |
The difference is small, but it is there. And it is particularly bad because the bushy join also consumes more memory. Why doesn't the DuckDB plan reduce the number of joins like we've seen bushy joins do?
Why Bushy doesn't always reduce joins
First, consider the result of lineitem ⨝ orders, taking both filters in the query into account.
The outcome of just this join is ~108K rows.
In a left deep tree, we can estimate the resulting joins later in the tree like this:

- First: lineitem ⨝ orders ⨝ customer
  - Another 108K joins from joining to customer (because we're joining on a key)
- Then: lineitem ⨝ orders ⨝ customer ⨝ nation
  - Another 108K joins (as above), because we haven't reduced the stream
- Total extra joins after lineitem ⨝ orders: ~216K
To illustrate what that looks like in an actual database, here is the annotated PostgreSQL plan:
Estimate Actual Operator
51264 30170 SORT SUM(l_extendedprice * ('1' - l_discount))
51264 30170 GROUP BY SORT c_custkey, n_name AGGREGATE c_name, SUM(l_extendedprice * ('1' - l_discount)), c_acctbal, c_address, c_phone, c_comment
64080 108325 SORT c_custkey, n_name
64080 108325 INNER JOIN HASH ON c_nationkey = n_nationkey <--- Second PK/FK join, ~108K joins
25 25 │└TABLE SCAN nation
64080 108325 INNER JOIN HASH ON o_custkey = c_custkey <--- First PK/FK join, ~108K
187500 150000 │└TABLE SCAN customer
64080 108325 INNER JOIN HASH ON l_orderkey = o_orderkey <-- Output of combined filters
88810 54070 │└TABLE SCAN orders WHERE (o_orderdate >= '1994-06-01') AND (o_orderdate < '1994-09-01')
1745655 1392750 TABLE SCAN lineitem WHERE l_returnflag = 'R'
But wait, how many rows does the customer table have?
Answer: 150K - more than the number of rows that come out of orders ⨝ lineitem (~108K).
That means that using a bushy join tree will do this:

- Pre-join (customer ⨝ nation) (on a FK/PK relationship): 150K joins
- Then do: orders ⨝ (customer ⨝ nation)
  - resulting in another 150K joins if orders is on the build side
  - or another 54K joins if orders is on the probe side (but requiring a large hash table)
- The resulting tree is one of:
  - lineitem ⨝ ((customer ⨝ nation) ⨝ orders)
  - lineitem ⨝ (orders ⨝ (customer ⨝ nation))
  - lineitem ⨝ (orders ⨝ customer ⨝ nation)
- Total extra joins after lineitem ⨝ orders, assuming we build on or join directly to orders: ~300K
DuckDB wisely picks orders to be on the build side - because it is smaller than (customer ⨝ nation). But because customer itself is bigger than the output of lineitem ⨝ orders, we end up with more joins in the bushy plan than we do in the left deep tree:
- ~300K extra joins in the bushy plan
- ~216K extra joins for the left deep plan
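As a back-of-the-envelope check, here is the same comparison with the round numbers quoted above plugged in (these are the approximate cardinalities from this post, not fresh measurements):

```python
# Extra joins needed after lineitem ⨝ orders (~108K surviving rows) at SF1.
lineitem_x_orders = 108_000      # output of the filtered lineitem ⨝ orders
customer_rows = 150_000          # rows in customer

# Left deep: every surviving row probes customer, then probes nation.
left_deep_extra = lineitem_x_orders + lineitem_x_orders       # ~216K

# Bushy: pre-join customer ⨝ nation (150K probes), then join that result to a
# hash table built on the filtered orders (another 150K probes).
bushy_extra = customer_rows + customer_rows                    # ~300K

print(f"left deep: ~{left_deep_extra:,} extra joins")
print(f"bushy:     ~{bushy_extra:,} extra joins")
```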
Why can't DuckDB see the correct plan?
To realise that the left deep plan is best, you first have to know something else about the query.
You need to know that the result of the filtered lineitem ⨝ orders is smaller than the size of customer.
And DuckDB does not know:
Estimate Actual Operator
- 30170 SORT SUM(l_extendedprice * (1 - l_discount))
410170 30170 PROJECT c_custkey, c_name, revenue, c_acctbal, n_name, c_address, c_phone, c_comment
410170 30170 GROUP BY HASH #0, #1, #2, #3, #4, #5, #6 AGGREGATE SUM(#7)
410170 108325 PROJECT c_custkey, c_name, c_acctbal, c_phone, n_name, c_address, c_comment, l_extendedprice * (1.000 - l_discount)
410170 108325 INNER JOIN HASH ON l_orderkey = o_orderkey <---- Duck Thinks the join is 410K
304964 54071 │└INNER JOIN HASH ON c_custkey = o_custkey <---- Notice how the error below propagates ---v
300000 54071 │ │└TABLE SCAN orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
144230 149995 │ INNER JOIN HASH ON c_nationkey = n_nationkey
25 25 │ │└TABLE SCAN nation
150000 149995 │ TABLE SCAN customer WHERE c_custkey >= 3
1999607 1392700 TABLE SCAN lineitem WHERE l_returnflag = 'R'
The root cause of this mis-estimation is the poor estimation of the filter on o_orderdate.
To estimate such a filter correctly, you need histograms in your statistics — and DuckDB doesn't (AFAIK) use
histograms for estimation.
Both SQL Server and PostgreSQL have histograms — and it shows!
Here are the estimates for orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01':
| Engine | Estimate | Actual |
|---|---|---|
| DuckDB | 300K | 54K |
| PostgreSQL | 88K | 54K |
| SQL Server | 54K | 54K |
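To illustrate what a histogram buys you, here is a minimal sketch of an equi-height histogram estimating the o_orderdate range predicate; the bucket count and date boundaries are assumptions for illustration, not the statistics any of the engines above actually keep.

```python
# Sketch: estimating a date-range predicate with an equi-height histogram.
# 100 buckets over o_orderdate, each holding roughly the same number of rows.
from bisect import bisect_left, bisect_right
from datetime import date

total_rows = 1_500_000                       # orders at SF1
lo_date, hi_date = date(1992, 1, 1), date(1998, 8, 2)
bounds = [lo_date + (hi_date - lo_date) * i // 100 for i in range(101)]
rows_per_bucket = total_rows / 100

def estimate(lo, hi):
    # Count the buckets that fall entirely inside [lo, hi); a real optimiser
    # also interpolates within the partially covered edge buckets.
    first = bisect_left(bounds, lo)
    last = bisect_right(bounds, hi)
    return max(last - first - 1, 0) * rows_per_bucket

print(estimate(date(1994, 6, 1), date(1994, 9, 1)))   # ~45K, close to the 54K actual
```

Without any histogram, all an optimiser can do for a range predicate is apply a fixed default selectivity, which is exactly the kind of guess that produced DuckDB's 300K estimate above.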
Aggregate before join and functional dependency — A big win!
In the previous blog about Q9 you learned about the optimisation of aggregating before joining.
SQL Server is the only engine I currently have in the SQL Arena that understands how to do that.
For Q10 - the results are impressive. Here is the plan SQL Server makes:
Estimate Actual Operator
26548 30170 SORT Expr1009
26548 30170 INNER JOIN HASH ON n_nationkey = c_nationkey
25 25 │└TABLE SEEK nation
26548 30170 INNER JOIN HASH ON c_custkey = o_custkey
26548 30170 │└PROJECT CASE WHEN Expr1034 = 0 THEN NULL ELSE Expr1035 END AS Expr1009
| v---- Big reduction in joins here!
26548 30170 │ GROUP BY o_custkey HASH AGGREGATE COUNT(Expr1010) AS Expr1034, SUM(Expr1010) AS Expr1035
56888 108325 │ INNER JOIN HASH ON o_orderkey = l_orderkey
54113 54071 │ │└TABLE SEEK orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
138965 108325 │ PROJECT l_extendedprice * (1. - l_discount) AS Expr1010
138965 216327 │ TABLE SEEK lineitem WHERE (l_returnflag = 'R' AND BLOOM(l_orderkey)) AND (l_returnflag = 'R')
150000 150000 TABLE SEEK customer
The columns we want to group on all come via customer in the query; they are:
GROUP BY c_custkey,
c_name,
c_acctbal,
c_phone,
n_name,
c_address,
c_comment
SQL Server realises that this means it can aggregate only on o_custkey, reducing the number of joins
needed on customer and nation by 3x.
How does it do this?
Functional Dependencies
SQL Server observes something that PostgreSQL could know, but doesn't:

- I'm asked to GROUP BY c_custkey
- All the other columns in the GROUP BY depend on that key
  - Either directly, by coming from the customer table
  - Or indirectly, via the join to nation over c_nationkey
- That must mean that grouping by c_custkey on its own produces the same rows as grouping on all the dependent columns
- Grouping by c_custkey is the same as grouping on o_custkey - because there is a PK/FK relation
- Which in turn means that there is no need for me to join to customer before I group
  - I can take the reduction in the stream that comes from grouping before I join to customer and nation
The result of this analysis (which relies on the functional dependencies between columns) is a 3x reduction in joins.
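Here is a minimal Python sketch of the idea (not SQL Server's actual operator tree): collapse the revenue stream to one row per o_custkey before touching customer, so the join probes once per customer instead of once per order line. The data below is toy data.

```python
# Sketch: aggregate before join. Instead of probing the customer hash table once
# per (lineitem ⨝ orders) row, group the revenue by o_custkey first and probe
# once per distinct customer. Toy data, invented for illustration.
from collections import defaultdict

joined_stream = [            # (o_custkey, l_extendedprice * (1 - l_discount))
    (1, 100.0), (1, 250.0), (2, 80.0), (2, 20.0), (2, 5.0),
]
customer = {1: {"c_name": "Customer#000000001"}, 2: {"c_name": "Customer#000000002"}}

# GROUP BY o_custkey early -- valid because every other GROUP BY column is
# functionally dependent on c_custkey, which equals o_custkey via the PK/FK join.
revenue = defaultdict(float)
for custkey, amount in joined_stream:
    revenue[custkey] += amount

# Now the join to customer (and onwards to nation) sees one row per customer.
for custkey, total in revenue.items():
    print(custkey, customer[custkey]["c_name"], total)
```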
But that is not all; SQL Server has an additional trick to play.
Bloom Filtering
In previous blogs, I promised to get to this subject. I'm getting there - bit by bit (pun intentional).
In Q10, we have a good example of a bloom filter in action. Look at this fragment of the query in SQL Server:
56888 108325 INNER JOIN HASH ON o_orderkey = l_orderkey <--- Hash Join
54113 54071 │└TABLE SEEK orders WHERE o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
138965 108325 PROJECT l_extendedprice * (1. - l_discount) AS Expr1010
v--- What is this?
138965 216327 TABLE SEEK lineitem WHERE (l_returnflag = 'R' AND BLOOM(l_orderkey)) AND (l_returnflag = 'R')
SQL Server hash joins orders with lineitem, yet it does a SEEK into lineitem - how is that possible?
There is no index supporting l_returnflag = 'R' - it must be seeking on something else?
Compare with PostgreSQL doing the same join:
64080 108325 INNER JOIN HASH ON l_orderkey = o_orderkey
88810 54070 │└TABLE SCAN orders WHERE (o_orderdate >= '1994-06-01') AND (o_orderdate < '1994-09-01')
1745655 1392750 TABLE SCAN lineitem WHERE l_returnflag = 'R'
PostgreSQL ends up reading (and joining) 1.4M rows out of lineitem - compared with 216K rows for SQL Server.
That is a ~6x reduction in work that needs doing!
What is a bloom filter and why does it matter to a database?
The Wikipedia entry on bloom filters is pretty good, so I am not going to repeat the full description here.
TL;DR: bloom filters are:
- A probabilistic data structure that allows you to discard rows that aren't in a set
- Compact in terms of memory use
- Cheap for a modern CPU to evaluate, allowing SIMD evaluation
- Faster than hash joins for evaluating set membership (but probabilistic)
The algorithm SQL Server executes is this:

1. Find all orders matching o_orderdate >= '1994-06-01' AND o_orderdate < '1994-09-01'
2. Construct a bloom filter on the set of all o_orderkey values from the above
3. When scanning lineitem, use the bloom filter to eliminate l_orderkey values that are certainly not in the set found in step 1. This may emit false positives
4. Do the join lineitem ⨝ orders on what remains, correcting for false positives
This is an extraordinarily powerful optimisation in analytical systems. It allows you to accelerate scans, sometimes by orders of magnitude.
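Here is a minimal, pure-Python sketch of that algorithm. A real engine uses SIMD-friendly blocked bloom filters and cheaper hash functions; the filter below, its sizes, and the toy keys are all assumptions made for illustration.

```python
# Sketch: bloom-filtered scan. Build a filter from the qualifying o_orderkey
# values, use it to discard lineitem rows early, then let the exact hash join
# remove any false positives.
import hashlib

class Bloom:
    def __init__(self, nbits=1 << 20, nhashes=3):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.nhashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "little") % self.nbits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key):            # False means "definitely not in the set"
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

qualifying_orderkeys = {10, 11, 42}           # 1. orders matching the date filter (toy keys)

bloom = Bloom()                               # 2. bloom filter over their o_orderkey values
for key in qualifying_orderkeys:
    bloom.add(key)

lineitem_keys = [10, 13, 42, 99, 11, 7]       # 3. scan lineitem, dropping certain misses early
survivors = [k for k in lineitem_keys if bloom.might_contain(k)]

# 4. the exact join on what remains corrects for any false positives
print([k for k in survivors if k in qualifying_orderkeys])
```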
Joins aren't expensive and Dimensional Models are better
There is an entire blog to be written about bloom filters and what they mean for data modellers.
But for today, take this away:
- Bloom filters allow you to "transfer" any filter from one table to another via a join
- If the two tables are related via an integer key, the filter is extremely cheap to calculate
- That in turn means that if you have a dimensional model, it is already highly tuned for this kind of filtering
It also means that One Big Table (OBT) models are uniquely unsuited to bloom filters. While bloom filters on OBT models are possible (you can turn the filter into a general hash function over a column calculation), they are very inefficient to compute: they often have to operate on the many duplicated strings in the large table. Bloom filters on heavily denormalised tables must also evaluate the complex filter on every row of the large table, instead of evaluating it only once on a smaller dimension table as a star schema allows.
Dimensional models allow any filter on a dimension table to be turned into a simple bloom filter of integers applied to the fact table. You only need to evaluate the complex filter once, on the dimension table, to apply it to every row in the fact table. This is computationally highly efficient.
These trivial insights from computer science add further weight to my argument that joins aren't expensive and dimensional models are better. Fleshing out the details in a way that appeals to the masses will take me some time - time I don't yet have, because the holidays are nearly here.
Summary
Today we saw how histograms are sometimes needed to avoid bad bushy joins. We saw how DuckDB's lack of these data structures penalises it against PostgreSQL for Q10.
We also saw how knowledge of functional dependencies can be used to greatly reduce the work queries need to do. Remember to declare those keys, my friends!
Finally, we broke the surface and did a shallow dive into bloom filters and how they greatly accelerate scans. In the case of Q10, we saw a 6x improvement.

