Here’s a surprising anomaly that showed up in a question on the OTN forums a little while ago. Consider a simple query that uses a hash join between two tables.
select
/*+
ordered
use_hash(t2)
parallel(t1 2)
parallel(t2 2)
pq_distribute(t2 hash hash)
*/
t1.padding,
t2.padding
from t1, t2
where t2.n1 = t1.n1
and t2.small_vc = t1.small_vc
;
When it runs serially the join completes in memory and the only I/O you see comes from the two tablescans. When the query runs parallel something causes a spill to the temporary tablespace. Here’s the code to build the tables, followed by the execution plans from 9.2.0.8 and 10.2.0.3:
alter session set workarea_size_policy = manual;
alter session set hash_area_size = 10495760;
create table t1
nologging -- adjust as necessary
as
with generator as (
select --+ materialize
rownum id
from all_objects
where rownum <= 3000
)
select
rownum n1,
lpad(rownum,6,'0') small_vc,
lpad(rownum,200,'0') padding
from
generator v1,
generator v2
where
rownum <= 10000
;
create table t2
nologging -- adjust as necessary
as
with generator as (
select --+ materialize
rownum id
from all_objects
where rownum <= 3000
)
select
1 + mod(rownum,10000) n1,
lpad(1 + mod(rownum,10000),6,'0') small_vc,
lpad(rownum,500,'0') padding
from
generator v1,
generator v2
where
rownum <= 20000
;
-- gather statistics here
Execution plan: 9.2.0.8
-------------------------------------------------------------------
| Id| Operation | Name| Rows | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------------------------
| 0| SELECT STATEMENT | | 2 | | | |
|* 1| HASH JOIN | | 2 | 98,02 | P->S | QC (RAND) |
| 2| TABLE ACCESS FULL| T1 |10325 | 98,00 | P->P | HASH |
| 3| TABLE ACCESS FULL| T2 |20212 | 98,01 | P->P | HASH |
-------------------------------------------------------------------
Execution plan 10.2.0.3
--------------------------------------------------------------------------------
| Id | Operation | Name | Rows | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2 | | | |
| 1 | PX COORDINATOR | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 2 | Q1,02 | P->S | QC (RAND) |
|* 3 | HASH JOIN BUFFERED | | 2 | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 10000 | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10000 | 10000 | Q1,00 | P->P | HASH |
| 6 | PX BLOCK ITERATOR | | 10000 | Q1,00 | PCWC | |
| 7 | TABLE ACCESS FULL| T1 | 10000 | Q1,00 | PCWP | |
| 8 | PX RECEIVE | | 20000 | Q1,02 | PCWP | |
| 9 | PX SEND HASH | :TQ10001 | 20000 | Q1,01 | P->P | HASH |
| 10 | PX BLOCK ITERATOR | | 20000 | Q1,01 | PCWC | |
| 11 | TABLE ACCESS FULL| T2 | 20000 | Q1,01 | PCWP | |
--------------------------------------------------------------------------------
I’ve created table t1 with 10,000 rows and a (unique) column n1 that ranges from 1 to 10,000. Table t2 then has 20,000 rows, again with an n1 column that ranges from 1 and 10,000 in this case with two rows per value. When you join the two tables on their n1 columns, the result set will have 20,000 rows.
The table-creation code is more complex than it really needs to be, but I wrote it that way in case I needed to create a few million rows in each table.
You will notice that I have switched to manual workarea management so that I can set the hash_area_size (rather than letting Oracle pick a dynamic limit based on the pga_aggregate_target and current PGA usage). This is simply to make the test case reproducible. I’ve also included the pq_distribute() hint in the query to tell the optimizer to use the hash/hash distribution rather than broadcasting the first table (which could be forced by the hint /*+ pq_distribute(t2, broadcast, none) */). In my example, the optimizer took the hash/hash option automatically and I have just included the hint for reasons of reproducibility, and to demonstrate the hint syntax.
Mechanisms:
Checking the execution plans, we can build the following picture:

One set of PX slaves scans t1 (referenced as para_1 in the picture) and distributes the data by hashing (on the n1 column) to the second set of slaves by writing virtual table (or table queue) Q0, then the same set of slaves scans table t2 (para_2) and distributes the data in the same way, using virtual table Q1.
The second set joins the two virtual tables by reading Q0 to build the hash table then probing the hash table with each row from Q1 as it arrives, passing each join result to the query coordinator by writing to virtual table Q2.
Data Volume
So, I’ve got 10,000 rows of about 200 bytes each for a total of about 2MB, and a table of 20,000 rows of about 500 bytes each for a total of about 10MB, and a join result of 20,000 rows of about 700 bytes for a total of 14MB. Since we are running parallel 2, we might expect each PX slave to see about 5,000 rows (1 MB) from t1, 10,000 rows (5MB) and 10,000 rows (7MB) of result.
Since we set the hash_area_size to 10MB, we seem to have enough memory to hold the hash table completely in memory. We could even hold the hash table and all the data we use from t2 in memory – we could even hold the hash table and the result set in memory. So why to we see writes and reads to the temporary tablespace?
Analysis
The first thing to do, of course, is to find out where that I/O comes from. Enabling events 10046 (wait tracing) and 10104 (hash join tracing) is a good starting point.
The 10104 trace (9i version shown below) starts like this (the line numbers are my addition to the normal trace):
*** HASH JOIN STATISTICS (INITIALIZATION) ***
1 Original memory: 10495760
2 Memory after all overhead: 10342503
3 Memory for slots: 9691136
4 Calculated overhead for partitions and row/slot managers: 651367
5 Hash-join fanout: 8
6 Number of partitions: 9
7 Number of slots: 13
8 Multiblock IO: 91
9 Block size(KB): 8
10 Cluster (slot) size(KB): 728
11 Minimum number of bytes per block: 8160
12 Bit vector memory allocation(KB): 512
13 Per partition bit vector length(KB): 64
14 Maximum possible row length: 1047
15 Estimated build size (KB): 2636
16 Estimated Row Length (includes overhead): 540
17 # Immutable Flags:
18 BUFFER the output of the join for Parallel Query
The trace reports 10MB of memory available for the operating the hash join (line 1).
The optimizer has pre-empted the possibility of the hash table spilling to disc by allowing for 9 partitions (line 6) although it talks about the “fanout” being 8 (line 5). Now the number of partitions is “always” a power of 2 and “always” matches the fanout – so what’s that extra partition for ?
We see that the optimizer has decided to use units of 91 blocks for any I/O to the temporary tablespace (line 8 ) – this translates into a size of 728KB (line 10) as the unit I/O (“cluster”, or “slot”) size. (In terms of parameters, this setting can be dictated by the parameter hash_multiblock_io_count - which is hidden in newer versions of Oracle).
This “slot” size has a very important impact on the efficiency of the hash join. To start with each partition needs minimum of one slot to hold data. This means that the minimum size in memory of the hash table will have to be at least 5.6MB (728KB * 8 ) because the hash table is going to be partitioned 8 ways. We can see this a little further down the trace file, as the hash table build completes.
*** HASH JOIN BUILD HASH TABLE (PHASE 1) ***
Total number of partitions: 8
Number of partitions which could fit in memory: 8
Number of partitions left in memory: 8
Total number of slots in in-memory partitions: 8
Total number of rows in in-memory partitions: 5029
(used as preliminary number of buckets in hash table)
Estimated max # of build rows that can fit in avail memory: 77792
### Partition Distribution ###
Partition:0 rows:577 clusters:1 slots:1 kept=1
Partition:1 rows:605 clusters:1 slots:1 kept=1
Partition:2 rows:634 clusters:1 slots:1 kept=1
Partition:3 rows:630 clusters:1 slots:1 kept=1
Partition:4 rows:635 clusters:1 slots:1 kept=1
Partition:5 rows:645 clusters:1 slots:1 kept=1
Partition:6 rows:665 clusters:1 slots:1 kept=1
Partition:7 rows:638 clusters:1 slots:1 kept=1
>
This means the entire hash table has been built completely in memory (kept = 1 for every partition) with a huge amount of space to spare. We have 5,029 rows in memory, and could handle an estimated 77,792 in the space allocated.
After the build completes, the 10046 trace details show the process reading data from the next virtual queue then, a bit further down the file, we get the following extra clues from the 10104 trace:
kxhfWrite: hash-join is spilling to disk
kxhfWrite: Writing dba=23177 slot=8 part=8
...
qerhjFetch: PQ Partition rows:10058 clusters:8 in-memory slots 4
We have been dumping partition 8 to disc – but the hash table is in partitions 0 to 7, so what are we dumping?
There are only two possibilities – either Oracle is copying the incoming probe table to partition 8 before it does the join, or it is writing the result of the join to partition 8 before re-reading it and forwarding it to the query coordinator. A couple of simple checks (e.g. adjust the n1 values in t2 to ensure that the join returns no data) shows that it’s the result set that is being dumped.
Looking back at lines 17 and 18 of the trace file, we can see that this was dictated by the “immutable flags” which were set to “BUFFER the output of the join for Parallel Query”.
Why ?
This behaviour is astonishing – there is clearly no need to do things this way; it’s a total waste of effort and time. Why write, re-read, then forward the result set when it could simply be sent as it was created ? Alternatively, why doesn’t the PX slave write its virtual table (Q2) to the temporary tablespace and let the Query Co-ordinator read it ? The code seems to be doing something quite bizarre.
Here’s my guess. First, we’re looking at a special case; secondly, in the more general case a different read/write strategy could become quite complex – especially when the query started to run across multiple RAC instances.
Generically, a single DFO (data flow operation) may involve far more than two layers of parallel execution; but a single DFO can use only two sets of parallel execution slaves. So if you have a parallel execution plan had three “layers” of operations, what happens: slave set one does the first operation, slave set two does the second operation, and slave set one does the third operation.
But slave set two (in the middle) may have some output ready for the third operation (slave set one) before the first operation (also slave set one) has completed – which means slave set one won’t be in place to receive the output. If this happens, slave set two has to do something with the output – which explains why it has to have some sort of buffer operation.
My guess is that it is this buffering operation that appears in our example, even though in this particular case it is “obviously” not needed.
If you accept the need for some sort of buffering mechanism, the next question is, why not make one layer of slaves write to the temporary tablespace and the next layer read from it. This could cut out a huge amount of messaging between two layers of slaves – the only information that would have to be sent through the SGA is a list of which blocks in the temporary segment should be read by which slaves.
This sounds reasonable – but the code needed to handle the generic case may be sufficiently complicated that the cost/risk/benefit analysis says that it’s not worth doing.
Think about a simple cycle where slave set 1 writes to the temporary tablespace for slave set 2 to read, and slave set 2 writes to the temporary tablespace for slave set 1. We don’t need a more complicated example because slave set 1 is not going to start reading for the third operation until after it has finished the first operation, so a double layer is a complex as things can get.
But, even though we have a limit of two layers, think about the number of communication channels. If we are running degree N there are N slaves in slave set 1 which need to distribute data across N slaves in slave set 2, so we need to isolate N*N channels of communication. Then you have to double that number because you could have two layers of communication live at once.
There is Metalink note somewhere that points out the amount of memory you should allocate for PX messaging in the large pool is something like 3 * parallel_execution_message_size * parallel_max_servers * parallel_max_servers / 4 (which allows 3 buffer pages per possible channel when one query runs at degree parallel_max_servers / 2).
If you try to switch this messaging to the temporary tablespace, would you allocate one extent for each channel – probably not, as that could need a huge, and sparsely used, temporary tabelspace.
On the other hand if you tried something like one private segment per query, or per operation per query, you then have to write code that allows slaves in the same layer to negotiate with each other to ensure that they don’t overwrite each others blocks in that extent. This is the efficient, but complex, solution. And when you realise that this could include communication across a RAC interconnect, you can appreciate that Oracle Corp. might avoid rewriting the code unless it proved to be absolutely necessary for a number of really large customers.
So parallel queries may do far more I/O to the temporary tablespace than you expect (and this is just one of the reasons). The “HASH JOIN BUFFERED” operation that becomes visible in 10g (see line 3 of the plan above) may tell you where some of that I/O is coming from. (The same action appears in all versions I tested from 8i to 11g, by the way – it’s just that 10g is the first version to report the operation).
Damage limitation
If you do find that your parallel hash joins are working harder than you expect, check the plan, and check the distribution. It isn’t well known, but you can make a huge difference to the resource consumption by controlling exactly what Oracle does with hash joins. In this particular case, switching to a “broadcast” distribution bypassed the need for any sort of buffering at all, and the I/O to the temporary tablespace disappeared completely. I’ll be writing more about that option some time in the future.