Tuning with equivalent SQLs – a little challenge
I am fascinated by what I call “equal SQL”: statements that are equivalent, in that they deliver the same result but may have hugely different performance characteristics. Here’s a little case study.
Consider this example, working in the OE demonstration schema. There is a table of warehouses:
jw122pdb> select warehouse_id,warehouse_name from warehouses; WAREHOUSE_ID WAREHOUSE_NAME ------------ ----------------------------------- 1 Southlake, Texas 2 San Francisco 3 New Jersey 4 Seattle, Washington 5 Toronto 6 Sydney 7 Mexico City 8 Beijing 9 Bombay
and an inventory of products at each warehouse:
jw122pdb> desc inventories Name Null? Type ----------------------------------------------------------- -------- ---------------------------------------- PRODUCT_ID NOT NULL NUMBER(6) WAREHOUSE_ID NOT NULL NUMBER(3) QUANTITY_ON_HAND NOT NULL NUMBER(8)
I want to find out which products are stocked in both Toronto and Bombay. These are five solutions:
select product_id from inventories where warehouse_id=5 intersect select product_id from inventories where warehouse_id=9; select product_id from inventories where warehouse_id=5 and product_id in (select product_id from inventories where warehouse_id=9); select product_id from inventories i where warehouse_id=5 and exists (select product_id from inventories j where j.warehouse_id=9 and j.product_id=i.product_id); select distinct product_id from ( (select product_id from inventories where warehouse_id=5) join (select product_id from inventories where warehouse_id=9) using (product_id)); select product_id from (select product_id from inventories where warehouse_id=5 union all select product_id from inventories where warehouse_id=9) group by product_id having count(*) > 1;
To me, the first is the most intuitive: find the products in Toronto and the products in Bombay, and the answer is the intersection. The fifth solution is in effect the same thing done manually: add the two queries together, and keep only those products that occur twice (though there could a bug in that solution – what is it, and how can you avoid it?) The second uses a subquery. The third uses a correlated subquery which is often an inefficient, iterative, structure. The fourth is perhaps the most convoluted.
Which of the five will be the most efficient? Or will the cost based optimizer be able to re-write them into a common, efficient, form? Are there any other solutions?
These are my results:
jw122pdb> set autotrace traceonly explain jw122pdb> jw122pdb> select product_id from inventories where warehouse_id=5 2 intersect 3 select product_id from inventories where warehouse_id=9; Execution Plan ---------------------------------------------------------- Plan hash value: 3944618082 ------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 114 | 1694 | 6 (34)| 00:00:01 | | 1 | INTERSECTION | | | | | | | 2 | SORT UNIQUE NOSORT| | 114 | 798 | 3 (34)| 00:00:01 | |* 3 | INDEX RANGE SCAN | INVENTORY_IX | 114 | 798 | 2 (0)| 00:00:01 | | 4 | SORT UNIQUE NOSORT| | 128 | 896 | 3 (34)| 00:00:01 | |* 5 | INDEX RANGE SCAN | INVENTORY_IX | 128 | 896 | 2 (0)| 00:00:01 | ------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("WAREHOUSE_ID"=5) 5 - access("WAREHOUSE_ID"=9) jw122pdb> select product_id from inventories where warehouse_id=5 2 and product_id in (select product_id from inventories where warehouse_id=9); Execution Plan ---------------------------------------------------------- Plan hash value: 409421562 ---------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 80 | 1120 | 2 (0)| 00:00:01 | | 1 | NESTED LOOPS | | 80 | 1120 | 2 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN| INVENTORY_IX | 114 | 798 | 2 (0)| 00:00:01 | |* 3 | INDEX RANGE SCAN| INVENTORY_IX | 1 | 7 | 0 (0)| 00:00:01 | ---------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("WAREHOUSE_ID"=5) 3 - access("WAREHOUSE_ID"=9 AND "PRODUCT_ID"="PRODUCT_ID") jw122pdb> select product_id from inventories i where warehouse_id=5 2 and exists (select product_id from inventories j where j.warehouse_id=9 and j.product_id=i.product_id); Execution Plan ---------------------------------------------------------- Plan hash value: 1721271592 ---------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 80 | 1120 | 2 (0)| 00:00:01 | | 1 | NESTED LOOPS SEMI| | 80 | 1120 | 2 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN| INVENTORY_IX | 114 | 798 | 2 (0)| 00:00:01 | |* 3 | INDEX RANGE SCAN| INVENTORY_IX | 128 | 896 | 0 (0)| 00:00:01 | ---------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("WAREHOUSE_ID"=5) 3 - access("J"."WAREHOUSE_ID"=9 AND "J"."PRODUCT_ID"="I"."PRODUCT_ID") jw122pdb> select distinct product_id from ( 2 (select product_id from inventories where warehouse_id=5) 3 join 4 (select product_id from inventories where warehouse_id=9) 5 using (product_id)); Execution Plan ---------------------------------------------------------- Plan hash value: 49070421 ----------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 80 | 1120 | 3 (34)| 00:00:01 | | 1 | SORT UNIQUE NOSORT| | 80 | 1120 | 3 (34)| 00:00:01 | | 2 | NESTED LOOPS SEMI| | 80 | 1120 | 2 (0)| 00:00:01 | |* 3 | INDEX RANGE SCAN| INVENTORY_IX | 114 | 798 | 2 (0)| 00:00:01 | |* 4 | INDEX RANGE SCAN| INVENTORY_IX | 128 | 896 | 0 (0)| 00:00:01 | ----------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("WAREHOUSE_ID"=5) 4 - access("WAREHOUSE_ID"=9 AND "PRODUCT_ID"="PRODUCT_ID") jw122pdb> select product_id from 2 (select product_id from inventories where warehouse_id=5 3 union all 4 select product_id from inventories where warehouse_id=9) 5 group by product_id having count(*) > 1; Execution Plan ---------------------------------------------------------- Plan hash value: 352515046 ------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 5 | 20 | 5 (20)| 00:00:01 | |* 1 | FILTER | | | | | | | 2 | HASH GROUP BY | | 5 | 20 | 5 (20)| 00:00:01 | | 3 | VIEW | | 242 | 968 | 4 (0)| 00:00:01 | | 4 | UNION-ALL | | | | | | |* 5 | INDEX RANGE SCAN| INVENTORY_IX | 114 | 798 | 2 (0)| 00:00:01 | |* 6 | INDEX RANGE SCAN| INVENTORY_IX | 128 | 896 | 2 (0)| 00:00:01 | ------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(COUNT(*)>1) 5 - access("WAREHOUSE_ID"=5) 6 - access("WAREHOUSE_ID"=9) jw122pdb>
This surprised me: I had expected the the third solution to be cheapest (assuming that it could be rewritten to a semijoin, as it was) and that the fourth solution would be the worst.
The take away from all this is that the way you write your code can have a huge effect on the way it runs, and you should always consider alternative formulations.
Hope you enjoyed that – you will have if you are as much of a SQL headcase as I am.
Domo’s run using DB12.2.0.1 on Windows 10.
—
John Watson
Oracle Certified Master DBA
https://skillbuilders.com