Resolving a performance problem caused by dynamic sampling in Oracle 11.2
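After a lot of careful preparation, I finally upgraded all three production data warehouses to 11.2.0.1 and applied the latest CPU patch. A major version change usually brings a few problems with it, especially on a DW system: with that much data being processed, the slightest oversight and the application team is complaining that some job is running forever again...
Twelve hours after the upgrade I logged in to check the system. One job had already been running for 2 hours and, judging by its wait events, was still reading data files. Normally this job takes only 2 minutes. The execution plan showed a cost of 154K, and the join order was clearly wrong. One of the tables is an 80 GB fact table; the other two are very small dimension tables. In the current plan the fact table is joined to one dimension table first, and the result is then joined to the other table.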
SELECT DISTINCT SALES_ORD_TRX.SALES_ORD_NO,
SALES_ORD_TRX.SALES_ORD_ITEM_NO,
SALES_ORD_TRX.COMPANY_KEY,
SALES_ORD_TRX.SALES_ORD_DATE_KEY,
SAP_MATL_DOC_HDR_TRX.POSTING_DATE_IN_THE_DOC
FROM ODS.SALES_ORD_TRX,
ODSSTAGE.SAP_MATL_DOC_HDR_TRX,
ODSSTAGE.SAP_MATL_DOC_DETL_TRX
WHERE SAP_MATL_DOC_DETL_TRX.MVT_TYPE_CODE IN ('Y79', 'Y80')
AND SAP_MATL_DOC_DETL_TRX.SPL_STK_IND = 'W'
AND SAP_MATL_DOC_HDR_TRX.MATL_DOC_NO = SAP_MATL_DOC_DETL_TRX.MATL_DOC_NO
AND SAP_MATL_DOC_HDR_TRX.MATL_DOC_YEAR =
SAP_MATL_DOC_DETL_TRX.MATL_DOC_YEAR
AND SAP_MATL_DOC_HDR_TRX.REF_DOC_NO = SALES_ORD_TRX.D_DELIV_NO
---------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib |
-------------------------------------------------
| 0 | SELECT STATEMENT | | 1010K| 83M| | 154K (1)| 00:35:59 | | | | | |
| 1 | PX COORDINATOR | | | | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 1010K| 83M| | 154K (1)| 00:35:59 | | | Q1,02 | P->S | QC (RAND) |
| 3 | SORT UNIQUE | | 1010K| 83M| 100M| 154K (1)| 00:35:59 | | | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | | | | | | | | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | | | | | | | | Q1,01 | P->P | HASH |
| 6 | NESTED LOOPS | | | | | | | | | Q1,01 | PCWP | |
| 7 | NESTED LOOPS | | 1010K| 83M| | 154K (1)| 00:35:59 | | | Q1,01 | PCWP | |
|* 8 | HASH JOIN | | 44437 | 2343K| | 5629 (1)| 00:01:19 | | | Q1,01 | PCWP | |
| 9 | PX BLOCK ITERATOR | | 51443 | 1657K| | 98 (2)| 00:00:02 | | | Q1,01 | PCWC | |
|* 10 | TABLE ACCESS FULL | SAP_MATL_DOC_HDR_TRX | 51443 | 1657K| | 98 (2)| 00:00:02 | | | Q1,01 | PCWP | |
| 11 | BUFFER SORT | | | | | | | | | Q1,01 | PCWC | |
| 12 | PX RECEIVE | | 44437 | 911K| | 5530 (1)| 00:01:18 | | | Q1,01 | PCWP | |
| 13 | PX SEND BROADCAST | :TQ10000 | 44437 | 911K| | 5530 (1)| 00:01:18 | | | Q1,00 | P->P | BROADCAST |
| 14 | PX BLOCK ITERATOR | | 44437 | 911K| | 5530 (1)| 00:01:18 | | | Q1,00 | PCWC | |
|* 15 | TABLE ACCESS FULL | SAP_MATL_DOC_DETL_TRX | 44437 | 911K| | 5530 (1)| 00:01:18 | | | Q1,00 | PCWP | |
|* 16 | INDEX RANGE SCAN | GX_SALES_ORD_TRX_09 | 23 | | | 3 (0)| 00:00:01 | | | Q1,01 | PCWP | |
| 17 | TABLE ACCESS BY GLOBAL INDEX ROWID| SALES_ORD_TRX | 23 | 759 | | 6 (0)| 00:00:01 | ROWID | ROWID | Q1,01 | PCWP | |
--------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
8 - access("SAP_MATL_DOC_HDR_TRX"."MATL_DOC_NO"="SAP_MATL_DOC_DETL_TRX"."MATL_DOC_NO" AND
"SAP_MATL_DOC_HDR_TRX"."MATL_DOC_YEAR"="SAP_MATL_DOC_DETL_TRX"."MATL_DOC_YEAR")
10 - filter("SAP_MATL_DOC_HDR_TRX"."REF_DOC_NO" IS NOT NULL)
15 - filter("SAP_MATL_DOC_DETL_TRX"."SPL_STK_IND"='W' AND ("SAP_MATL_DOC_DETL_TRX"."MVT_TYPE_CODE"='Y79' OR "SAP_MATL_DOC_DETL_TRX"."MVT_TYPE_CODE"='Y80'))
16 - access("SAP_MATL_DOC_HDR_TRX"."REF_DOC_NO"="SALES_ORD_TRX"."D_DELIV_NO")
Note
-----
- dynamic sampling used for this statement (level=8)
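My first suspicion was that the optimizer changes in 11.2.0.1 did not agree with this job, so I immediately reached for the hint /*+ optimizer_features_enable('10.2.0.4') */. The plan changed at once to what I expected: the two small tables are joined first, and the result is then joined to the fact table.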
-------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib |
--------------------------------------------------------------
| 0 | SELECT STATEMENT | | 85235 | 7241K| | 18163 (1)| 00:04:15 | | | | | |
| 1 | PX COORDINATOR | | | | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 85235 | 7241K| | 18163 (1)| 00:04:15 | | | Q1,02 | P->S | QC (RAND) |
| 3 | HASH UNIQUE | | 85235 | 7241K| 16M| 18163 (1)| 00:04:15 | | | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | 23 | 759 | | 6 (0)| 00:00:01 | | | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | 23 | 759 | | 6 (0)| 00:00:01 | | | Q1,01 | P->P | HASH |
| 6 | TABLE ACCESS BY GLOBAL INDEX ROWID| SALES_ORD_TRX | 23 | 759 | | 6 (0)| 00:00:01 | ROWID | ROWID | Q1,01 | PCWP | |
| 7 | NESTED LOOPS | | 85235 | 7241K| | 18162 (1)| 00:04:15 | | | Q1,01 | PCWP | |
|* 8 | HASH JOIN | | 3749 | 197K| | 5629 (1)| 00:01:19 | | | Q1,01 | PCWP | |
| 9 | PX RECEIVE | | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,01 | PCWP | |
| 10 | PX SEND BROADCAST | :TQ10000 | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,00 | P->P | BROADCAST |
| 11 | PX BLOCK ITERATOR | | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,00 | PCWC | |
|* 12 | TABLE ACCESS FULL | SAP_MATL_DOC_DETL_TRX | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,00 | PCWP | |
| 13 | PX BLOCK ITERATOR | | 51443 | 1657K| | 98 (2)| 00:00:02 | | | Q1,01 | PCWC | |
|* 14 | TABLE ACCESS FULL | SAP_MATL_DOC_HDR_TRX | 51443 | 1657K| | 98 (2)| 00:00:02 | | | Q1,01 | PCWP | |
|* 15 | INDEX RANGE SCAN | GX_SALES_ORD_TRX_09 | 23 | | | 3 (0)| 00:00:01 | | | Q1,01 | PCWP | |
--------------------------------------------------------------
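The customer used this hint to get the current job finished.
To get to the bottom of the problem, I kept digging. The trouble is with SAP_MATL_DOC_DETL_TRX: the first plan shows a cardinality of 44437 for it, while the second shows only 5073. Had 11g really changed its cardinality calculation for multi-table joins that much?
I then tried another hint, /*+ LEADING(SAP_MATL_DOC_HDR_TRX SAP_MATL_DOC_DETL_TRX SALES_ORD_TRX) */, which forces the SQL to use my join order. The SQL finished within 2 minutes. In this execution plan the cardinality of SAP_MATL_DOC_DETL_TRX was still only 5073. Since I had not changed the optimizer feature level this time, the optimizer upgrade itself was not the cause.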
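As a sketch of where such a hint goes: the hint comment sits directly after the SELECT keyword, before DISTINCT, and because the statement uses no table aliases, the hint refers to the tables by name without the schema prefix. The query text is the one shown earlier; only the hint is added.

```sql
-- Sketch: the original statement with the join order forced by a LEADING hint.
SELECT /*+ LEADING(SAP_MATL_DOC_HDR_TRX SAP_MATL_DOC_DETL_TRX SALES_ORD_TRX) */ DISTINCT
       SALES_ORD_TRX.SALES_ORD_NO,
       SALES_ORD_TRX.SALES_ORD_ITEM_NO,
       SALES_ORD_TRX.COMPANY_KEY,
       SALES_ORD_TRX.SALES_ORD_DATE_KEY,
       SAP_MATL_DOC_HDR_TRX.POSTING_DATE_IN_THE_DOC
  FROM ODS.SALES_ORD_TRX,
       ODSSTAGE.SAP_MATL_DOC_HDR_TRX,
       ODSSTAGE.SAP_MATL_DOC_DETL_TRX
 WHERE SAP_MATL_DOC_DETL_TRX.MVT_TYPE_CODE IN ('Y79', 'Y80')
   AND SAP_MATL_DOC_DETL_TRX.SPL_STK_IND = 'W'
   AND SAP_MATL_DOC_HDR_TRX.MATL_DOC_NO = SAP_MATL_DOC_DETL_TRX.MATL_DOC_NO
   AND SAP_MATL_DOC_HDR_TRX.MATL_DOC_YEAR = SAP_MATL_DOC_DETL_TRX.MATL_DOC_YEAR
   AND SAP_MATL_DOC_HDR_TRX.REF_DOC_NO = SALES_ORD_TRX.D_DELIV_NO;
```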
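Going back over the first execution plan, I realized I had overlooked a very important clue: dynamic sampling used for this statement (level=8)
So dynamic sampling had kicked in!
To verify this, I added the hint /*+ DYNAMIC_SAMPLING(SAP_MATL_DOC_DETL_TRX 0) */. The execution plan went back to normal, and the SQL finished within 2 minutes.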
------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib |
------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 115K| 9797K| | 22590 (1)| 00:05:17 | | | | | |
| 1 | PX COORDINATOR | | | | | | | | | | | |
| 2 | PX SEND QC (RANDOM) | :TQ10002 | 115K| 9797K| | 22590 (1)| 00:05:17 | | | Q1,02 | P->S | QC (RAND) |
| 3 | SORT UNIQUE | | 115K| 9797K| 11M| 22590 (1)| 00:05:17 | | | Q1,02 | PCWP | |
| 4 | PX RECEIVE | | | | | | | | | Q1,02 | PCWP | |
| 5 | PX SEND HASH | :TQ10001 | | | | | | | | Q1,01 | P->P | HASH |
| 6 | NESTED LOOPS | | | | | | | | | Q1,01 | PCWP | |
| 7 | NESTED LOOPS | | 115K| 9797K| | 22588 (1)| 00:05:17 | | | Q1,01 | PCWP | |
|* 8 | HASH JOIN | | 5073 | 267K| | 5629 (1)| 00:01:19 | | | Q1,01 | PCWP | |
| 9 | PX RECEIVE | | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,01 | PCWP | |
| 10 | PX SEND BROADCAST | :TQ10000 | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,00 | P->P | BROADCAST |
| 11 | PX BLOCK ITERATOR | | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,00 | PCWC | |
|* 12 | TABLE ACCESS FULL | SAP_MATL_DOC_DETL_TRX | 5073 | 104K| | 5530 (1)| 00:01:18 | | | Q1,00 | PCWP | |
| 13 | PX BLOCK ITERATOR | | 51443 | 1657K| | 98 (2)| 00:00:02 | | | Q1,01 | PCWC | |
|* 14 | TABLE ACCESS FULL | SAP_MATL_DOC_HDR_TRX | 51443 | 1657K| | 98 (2)| 00:00:02 | | | Q1,01 | PCWP | |
|* 15 | INDEX RANGE SCAN | GX_SALES_ORD_TRX_09 | 23 | | | 3 (0)| 00:00:01 | | | Q1,01 | PCWP | |
| 16 | TABLE ACCESS BY GLOBAL INDEX ROWID| SALES_ORD_TRX | 23 | 759 | | 6 (0)| 00:00:01 | ROWID | ROWID | Q1,01 | PCWP | |
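One way to check the effect of this hint without running the job is to explain the statement and look at the Rows estimate for SAP_MATL_DOC_DETL_TRX and at the Note section. The sketch below assumes EXPLAIN PLAN and DBMS_XPLAN.DISPLAY are used; the post does not say how the plans above were captured.

```sql
-- Sketch: explain the statement with dynamic sampling disabled for one table.
EXPLAIN PLAN FOR
SELECT /*+ DYNAMIC_SAMPLING(SAP_MATL_DOC_DETL_TRX 0) */ DISTINCT
       SALES_ORD_TRX.SALES_ORD_NO,
       SALES_ORD_TRX.SALES_ORD_ITEM_NO,
       SALES_ORD_TRX.COMPANY_KEY,
       SALES_ORD_TRX.SALES_ORD_DATE_KEY,
       SAP_MATL_DOC_HDR_TRX.POSTING_DATE_IN_THE_DOC
  FROM ODS.SALES_ORD_TRX,
       ODSSTAGE.SAP_MATL_DOC_HDR_TRX,
       ODSSTAGE.SAP_MATL_DOC_DETL_TRX
 WHERE SAP_MATL_DOC_DETL_TRX.MVT_TYPE_CODE IN ('Y79', 'Y80')
   AND SAP_MATL_DOC_DETL_TRX.SPL_STK_IND = 'W'
   AND SAP_MATL_DOC_HDR_TRX.MATL_DOC_NO = SAP_MATL_DOC_DETL_TRX.MATL_DOC_NO
   AND SAP_MATL_DOC_HDR_TRX.MATL_DOC_YEAR = SAP_MATL_DOC_DETL_TRX.MATL_DOC_YEAR
   AND SAP_MATL_DOC_HDR_TRX.REF_DOC_NO = SALES_ORD_TRX.D_DELIV_NO;

-- Display the plan just explained, including predicates and notes.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```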
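Dynamic sampling is generally understood to take effect when there are no statistics, but all of our tables had up-to-date statistics. So why did this happen? A bug. And even level 8 sampling reads only about a thousand blocks (32 times the default of 32), so its estimate is bound to be inaccurate.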
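The claim that statistics are present and current can be checked from the data dictionary. The query below is a sketch that assumes the session can see the ODSSTAGE tables through ALL_TAB_STATISTICS; the USER_STATS column shows whether the statistics were set manually, which is the case the bug title refers to.

```sql
-- Sketch: confirm table-level statistics exist and see when they were gathered.
SELECT owner,
       table_name,
       num_rows,
       blocks,
       last_analyzed,
       user_stats,   -- YES if the statistics were entered by a user
       stale_stats   -- YES if Oracle considers the statistics stale
  FROM all_tab_statistics
 WHERE owner = 'ODSSTAGE'
   AND table_name IN ('SAP_MATL_DOC_HDR_TRX', 'SAP_MATL_DOC_DETL_TRX')
   AND object_type = 'TABLE';
```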
After searching Metalink and talking to Oracle Support, the bug was confirmed:
Bug 9272549 - User statistics are ignored when dynamic sampling occurs (Doc ID 9272549.8)
Workaround: turn off dynamic sampling.
The fix is in 12.1. Good grief!
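Besides the per-table hint used above, dynamic sampling can also be switched off through the optimizer_dynamic_sampling parameter. The sketch below does it at session level only, which is an assumption: the post does not say at which scope the workaround was applied. Note that level 0 also stops sampling for tables that really have no statistics.

```sql
-- Show the current setting (requires SELECT privilege on V$PARAMETER).
SELECT name, value
  FROM v$parameter
 WHERE name = 'optimizer_dynamic_sampling';

-- Turn dynamic sampling off for the current session only.
ALTER SESSION SET optimizer_dynamic_sampling = 0;
```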
While I'm at it, here are the descriptions of the dynamic sampling levels (0 through 10):
Level 0: Do not use dynamic sampling.
Level 1: Sample all tables that have not been analyzed if the following criteria are met: (1) there is at least 1 unanalyzed table in the query; (2) this unanalyzed table is joined to another table or appears in a subquery or non-mergeable view; (3) this unanalyzed table has no indexes; (4) this unanalyzed table has more blocks than the number of blocks that would be used for dynamic sampling of this table. The number of blocks sampled is the default number of dynamic sampling blocks (32).
Level 2: Apply dynamic sampling to all unanalyzed tables. The number of blocks sampled is two times the default number of dynamic sampling blocks.
Level 3: Apply dynamic sampling to all tables that meet Level 2 criteria, plus all tables for which standard selectivity estimation used a guess for some predicate that is a potential dynamic sampling predicate. The number of blocks sampled is the default number of dynamic sampling blocks. For unanalyzed tables, the number of blocks sampled is two times the default number of dynamic sampling blocks.
Level 4: Apply dynamic sampling to all tables that meet Level 3 criteria, plus all tables that have single-table predicates that reference 2 or more columns. The number of blocks sampled is the default number of dynamic sampling blocks. For unanalyzed tables, the number of blocks sampled is two times the default number of dynamic sampling blocks.
Levels 5, 6, 7, 8, and 9: Apply dynamic sampling to all tables that meet the previous level criteria using 2, 4, 8, 32, or 128 times the default number of dynamic sampling blocks respectively.
Level 10: Apply dynamic sampling to all tables that meet the Level 9 criteria using all blocks in the table.