© 2013 Julian Dyke juliandyke.com
Site A - Exadata v2 Half Rack
SELECT /*+ FULL (t1) */ COUNT(*) FROM t1
takes 5 seconds
CREATE TABLE jmd1 PARALLEL 8 AS SELECT * FROM t1
takes 336 seconds
CREATE TABLE jmd2 PARALLEL 8 AS
SELECT c1, c2, c3 … c35,
CASE WHEN rank = 1
THEN TO_DATE ('01-JAN-1990','DD-MON-YYYY')
ELSE c36
END
FROM
(
SELECT c1,c2,c3,…,c35, RANK() OVER (PARTITION BY c1 ORDER BY p36) rank,
NVL((LEAD(c36) OVER (PARTITION BY c1 ORDER BY c36)),
TO_DATE( '31/12/9999','DD/MM/YYYY' ))
)
takes 722 seconds
The analytic function creates a temporary segment to perform the sort operation. Temporary segment performance is relatively poor on Exadata compared with cell smart scans.
Site B – Packaged application
DESC t2
C1    NUMBER
C2    DATE
C3    NUMBER
C4    LONG RAW
C5    DATE
C6    DATE
C7    NUMBER
C8    DATE
C9    DATE
C10   DATE
C11   DATE
c4 is always 17,000 bytes
17,000 byte LONG RAW is included in the sort data.
Table contains millions of rows so sort requires a huge temporary sort segment
SELECT c1, c2, c3, c4, c5, c6, c7,c8,c9,c10,c11 FROM t2 ORDER BY c1;
Takes days or weeks to complete
Sorts appear to involve a two stage process
1 - Sort in local process (session heap) memory
2 - If there is not sufficient space, spill to temporary segment
If sorts run in local memory we probably don’t need to tune them.
If sorts spill to the temporary tablespace this generates additional I/O and might need tuning.
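The two stages above can be illustrated with a generic external sort. This is only a minimal sketch of the general technique (sort runs in memory, spill them, merge them back); it is not Oracle's algorithm, and the names `external_sort` and `max_rows_in_memory` are hypothetical.

```python
import heapq
import pickle
import tempfile


def external_sort(rows, key, max_rows_in_memory):
    """Yield rows in key order, spilling sorted runs to disk when memory is full."""
    run_files = []   # one temporary file per spilled run
    buffer = []      # rows currently held in memory

    def spill():
        # Sort the in-memory rows and write them out as one sorted run.
        buffer.sort(key=key)
        f = tempfile.TemporaryFile()
        for row in buffer:
            pickle.dump(row, f)
        f.seek(0)
        run_files.append(f)
        buffer.clear()

    for row in rows:
        if len(buffer) >= max_rows_in_memory:
            spill()
        buffer.append(row)

    if not run_files:
        # Stage 1 only: everything fitted in memory, so no temporary I/O.
        buffer.sort(key=key)
        yield from buffer
        return

    if buffer:
        spill()

    def read_run(f):
        # Stream one sorted run back from its temporary file.
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                f.close()
                return

    # Stage 2: merge the sorted runs read back from disk.
    yield from heapq.merge(*(read_run(f) for f in run_files), key=key)


rows = [(k, "payload") for k in (5, 3, 9, 1, 7, 2, 8)]
in_memory = list(external_sort(rows, key=lambda r: r[0], max_rows_in_memory=100))
spilled = list(external_sort(rows, key=lambda r: r[0], max_rows_in_memory=3))
```

Both calls return the same ordering; only the second one performs the extra write and read of temporary files, which is the cost that may need tuning.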
This presentation was developed in the following environment:
Oracle VirtualBox 4.2.0
Base Memory = 2GB
Oracle Enterprise Linux 5 Update 6
Oracle Enterprise Edition 11.2.0.3
Automatic Memory Management
MEMORY_TARGET = 800M
Instance was restarted for each test
Ensures shared memory is initialized
Initial tests were based on the following table:
CREATE TABLE p1
(
  c1 VARCHAR2 (10) PRIMARY KEY,
  c2 NUMBER,
  c3 VARCHAR2 (100),
  c4 VARCHAR2 (200),
  c5 VARCHAR2 (300),
  c6 NUMBER
)
TABLESPACE ts01;
To ensure sorted data was in a different order to the table data, I decided to
REVERSE the primary key
However, the REVERSE function is undocumented
It works in SQL
It does not work in PL/SQL
Workaround
Create a function called MY_REVERSE
Cast VARCHAR2 to RAW
Use UTL_RAW.REVERSE function
CREATE OR REPLACE FUNCTION my_reverse (p_str IN VARCHAR2)
RETURN VARCHAR2 IS
BEGIN
  RETURN UTL_RAW.CAST_TO_VARCHAR2
    (UTL_RAW.REVERSE (UTL_RAW.CAST_TO_RAW (p_str)));
END;
/
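The effect of reversing the key can be sketched outside the database. The Python below is an illustration only: it mimics the byte reversal for single-byte (ASCII) strings and builds keys from 100000 + f, the expression used to populate the sample table; the helper name `make_key` is hypothetical.

```python
def my_reverse(s: str) -> str:
    """Reverse the bytes of an ASCII string, mirroring UTL_RAW.REVERSE on its RAW form."""
    return s.encode("ascii")[::-1].decode("ascii")


def make_key(f: int) -> str:
    # Same expression as the sample data: MY_REVERSE(TO_CHAR(100000 + f))
    return my_reverse(str(100000 + f))


keys = [make_key(f) for f in range(1, 100001)]

# Insertion order differs from sorted order, so the sort has real work to do.
print(keys[:3])          # first keys in insertion order
print(sorted(keys)[:3])  # first keys in sorted order
```

Because string reversal is a one-to-one mapping, the keys stay unique and still satisfy the primary key.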
The sample table was populated as follows (100,000 rows):
DECLARE
  l_c1 VARCHAR2(10);
  l_c2 NUMBER;
  l_c3 VARCHAR2(100) := LPAD ('X',100,'X'); -- fixed length
  l_c4 VARCHAR2(200) := LPAD ('Y',200,'Y'); -- fixed length
  l_c5 VARCHAR2(300) := LPAD ('Z',300,'Z'); -- fixed length
  l_c6 NUMBER;
BEGIN
  FOR f IN 1..100000
  LOOP
    l_c1 := MY_REVERSE (TO_CHAR (100000 + f)); -- REVERSE function
    l_c2 := MOD (f,10);
    l_c6 := MOD (f,100);
    INSERT INTO p1 VALUES (l_c1,l_c2,l_c3,l_c4,l_c5,l_c6);
  END LOOP;
END;
/
I now realise there are lots of flaws in the above, but it was sufficient to make some early progress
Gather optimizer statistics using:
EXECUTE dbms_stats.gather_table_stats (ownname=>USER,tabname=>'P1', estimate_percent=>NULL,cascade=>TRUE);
The following statements were executed in a new SQL*Plus session:
SELECT sid FROM v$mystat WHERE ROWNUM < 2;
SID
42
SID = 42
SELECT value FROM v$diag_info
WHERE name = 'Default Trace File';
VALUE
/u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3407.trc
OS PID = 3407
EXECUTE dbms_monitor.session_trace_enable;
SET PAUSE ON
SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;
EXECUTE dbms_monitor.session_trace_disable;
SET PAUSE ON prevents all rows being returned immediately
Allows sort areas to be inspected while operation is ongoing
Execution plan from the sample statement is:
SQL> SET AUTOTRACE TRACE EXPLAIN
SQL> SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;
Execution Plan
----------------------------------------------------------
Plan hash value: 101931517
-----------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |
-----------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 100K| 58M| | 15686 (1)| 00:03:09 |
| 1 | SORT ORDER BY | | 100K| 58M| 60M| 15686 (1)| 00:03:09 |
| 2 | TABLE ACCESS FULL| P1 | 100K| 58M| | 2741 (1)| 00:00:33 |
-----------------------------------------------------------------------------------
SQL_ID is
SELECT sql_id FROM v$session WHERE sid = 42;
SQL_ID
-------------
9y1u8zmhjyw11
60M / 100K = 600 bytes/row
PGA usage is summarized in
V$PGASTAT
Sort operations are described in a number of dynamic performance views including:
V$SQL_WORKAREA
V$SQL_WORKAREA_ACTIVE
V$SORT_USAGE
V$SORT_SEGMENT
Temporary tablespaces and their extents are described in:
V$TEMP_SPACE_HEADER
V$TEMP_EXTENT_POOL
V$TEMP_EXTENT_MAP
Reports various PGA statistics
SELECT name, value FROM v$pgastat;
Name                                     Value
PGA memory freed back to OS              128188416
aggregate PGA auto target                153713664
aggregate PGA target parameter           293601280
bytes processed                          64159744
cache hit percentage                     100
extra bytes read/written                 0
global memory bound                      58720256
max processes count                      36
maximum PGA allocated                    207040512
maximum PGA used for auto workareas      58753024
maximum PGA used for manual workareas    0
Continued:
SELECT name, value FROM v$pgastat;
Name                                     Value
over allocation count                    0
process count                            32
recompute count (total)                  1630
total PGA allocated                      148678656
total PGA inuse                          123791360
total PGA used for auto workareas        117760
total PGA used for manual workareas      0
total freeable PGA memory                14417920
Child cursor includes one work area for each sort operation
Each child cursor may have zero or more sort operations
Reported in V$SQL_WORKAREA
A work area is active if a child cursor owning that work area is currently executing
Reported in V$SQL_WORKAREA_ACTIVE
Returns one work area for each sort operation in the library cache
SELECT * FROM v$sql_workarea WHERE sql_id = '9y1u8zmhjyw11';
Column                    Value
ADDRESS                   000000008AFBBA30
HASH_VALUE                3776933921
SQL_ID                    9y1u8zmhjyw11
CHILD_NUMBER              0
WORKAREA_ADDRESS          000000008A6167B8
OPERATION_TYPE            SORT (v2)
OPERATION_ID              1
POLICY                    AUTO
ESTIMATED_OPTIMAL_SIZE    36852736
ESTIMATED_ONEPASS_SIZE    2166784
LAST_MEMORY_USED          0
LAST_EXECUTION            OPTIMAL
LAST_DEGREE               1
TOTAL_EXECUTIONS          0
OPTIMAL_EXECUTIONS        0
ONEPASS_EXECUTIONS        0
MULTIPASSES_EXECUTIONS    0
ACTIVE_TIME               0
MAX_TEMPSEG_SIZE          0
LAST_TEMPSEG_SIZE         0
Operation Type = SORT (V2)
Each work area has an operation type. Known operation types include:
Type   Name
1      GROUP BY (SORT)
2      IDX MAINTENANCE (SORT)
3      WINDOW (SORT)
4      ROLLUP (SORT)
5      BUFFER
6      CONNECT-BY (SORT)
7      SORT (V1)
8      SORT (V2)
9      HASH-JOIN
10     BITMAP CONSTRUCTION
11     BITMAP MERGE
12     LOAD WRITE BUFFERS
13     SPREADSHEET
14     FIC LOAD SNAKES
15     FIC LOAD TRANSACTIONS
16     FIC GENERATE CANDIDATE
17     FIC TREE COUNTING
18     FIC BITMAP ANDING
19     DST BITMAP COUNTING
20     GROUP BY (HASH)
WINDOW (SORT) is used in analytic queries
SPREADSHEET is used by the Model Clause
FIC is Frequent Itemset Counting
DST is Decision Tree
Returns one row for each currently executing work area
SELECT * FROM v$sql_workarea_active WHERE sql_id = '9y1u8zmhjyw11';
Column              Value
SQL_HASH_VALUE      3776933921
SQL_ID              9y1u8zmhjyw11
SQL_EXEC_START      08-JUN-13
SQL_EXEC_ID         16777216
WORKAREA_ADDRESS    000000008A6167B8
OPERATION_TYPE      SORT (v2)
OPERATION_ID        1
POLICY              AUTO
SID                 42
QCINST_ID
QCSID
ACTIVE_TIME         223831235
WORK_AREA_SIZE      1245184
EXPECTED_SIZE       1245184
ACTUAL_MEM_USED     117760
MAX_MEM_USED        58753024
NUMBER_PASSES       1
TEMPSEG_SIZE        63963136
TABLESPACE          TEMP
SEGRFNO#            1
SEGBLK#             67072
Returns one row for each currently executing work area
SELECT * FROM v$sort_usage;
Column          Value               Description
USERNAME        US01
USER            US01
SESSION_ADDR    00000000917A1E10    V$SESSION.SADDR
SESSION_NUM     11                  V$SESSION.SERIAL#
SQLADDR         000000008AFBF7F0    V$SESSION.PREV_SQL_ADDR
SQLHASH         774405238           V$SESSION.PREV_HASH_VALUE
SQL_ID          grh681wr2hz3q       V$SESSION.PREV_SQL_ID
TABLESPACE      TEMP
CONTENTS        TEMPORARY
SEGTYPE         SORT
SEGFILE#        201
SEGBLK#         67072
EXTENTS         61
BLOCKS          7808
SEGRFNO#        1
Returns list of statements using sort areas
Very misleading
Reports previous SQL statement executed by session
NOT current SQL statement performing sort
Reports address of parent cursor
Does not report child address or child number
Useless for multiple child cursors
Reports absolute file number as 201 and above
OK but inconsistent with other views
Returns one row for each currently executing work area
SELECT * FROM v$sort_segment;
Column             Value
TABLESPACE_NAME    TEMP
SEGMENT_FILE       0
SEGMENT_BLOCK      0
EXTENT_SIZE        128
CURRENT_USERS      1
TOTAL_EXTENTS      524
TOTAL_BLOCKS       67072
USED_EXTENTS       61
USED_BLOCKS        7808
FREE_EXTENTS       463
FREE_BLOCKS        59264
ADDED_EXTENTS      0
EXTENT_HITS        61
FREED_EXTENTS      0
FREE_REQUESTS      0
MAX_SIZE           524
MAX_BLOCKS         67072
MAX_USED_SIZE      61
MAX_USED_BLOCKS    7808
MAX_SORT_SIZE      61
MAX_SORT_BLOCKS    7808
RELATIVE_FNO       0
Returns list of currently active sort segments
Incorrectly reports
Segment absolute file number
Segment block number
Segment relative file number
Returns one row for each extent in temporary tablespace
SELECT * FROM v$temp_extent_map;
Extent   TABLESPACE_NAME   FILE_ID   BLOCK_ID   BYTES     BLOCKS   OWNER   RELATIVE_FNO
1        TEMP              1         128        1048576   128      1       1
2        TEMP              1         256        1048576   128      1       1
3        TEMP              1         384        1048576   128      1       1
....
522      TEMP              1         66816      1048576   128      1       1
523      TEMP              1         66944      1048576   128      1       1
524      TEMP              1         67072      1048576   128      1       1
Returns one row for each temporary file
SELECT * FROM v$temp_extent_pool;
Column             Value
TABLESPACE_NAME    TEMP
FILE_ID            1
EXTENTS_CACHED     524
EXTENTS_USED       61
BLOCKS_CACHED      67072
BLOCKS_USED        7808
BYTES_CACHED       549453824
BYTES_USED         63963136
RELATIVE_FNO       1
Returns one row for each temporary file
SELECT * FROM v$temp_space_header;
Column
TABLESPACE_NAME
FILE_ID
BYTES_USED
BLOCKS_USED
BYTES_FREE
BLOCKS_FREE
RELATIVE_FNO
Value
TEMP
1
0
1
550502400
67200
0
The work area table can be dumped using:
$ sqlplus / as sysdba
SQL> ORADEBUG SETMYPID
Statement processed.
SQL> ORADEBUG DUMP WORKAREATAB_DUMP 3
Statement processed.
The trace file can be copied to the local directory using:
SQL> ORADEBUG TRACEFILE_NAME
/u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3779.trc
SQL> !cp /u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3779.trc wa3.trc
The work area table dump includes a summary of SGA/PGA use:
Dumping Work Area Table (level=3)
=================================
Global SGA Info
---------------
global target: 280 MB
auto target: 145 MB
max pga: 200 MB
pga limit: 0 MB
pga limit known: 0
pga limit errors: 0
pga inuse: 118 MB
pga alloc: 143 MB
pga freeable: 13 MB
pga freed: 112 MB
pga to free: 0 %
broker request: 0
pga auto: 0 MB
pga manual: 0 MB
pga alloc (max): 197 MB
pga auto (max): 56 MB
pga manual (max): 0 MB
# workareas : 1
# workareas(max): 3
The work area table is an array containing 67 elements
Work Area Table (67 entries)
----------------------------
Entry 0: 0 are active (max=1)
Entry 1: 0 are active (max=1)
......
Entry 30: 0 are active (max=1)
Entry 31: 1 are active (max=2)
Work area 0x8d7e6ec8
wid=2, trcId=-1
sid=42, ser#=11, flags=59 (AUTO|SGA|ADVICE|ADVICE_START|DATASTATS_SET)
qc_sid=65535, qc_instid=65535, slv_grp=1
curSize=115, maxSize=57376, #maxPasses=1
tempSegStateObj=0x8f31d4d0, tempSegObjType=2
shared work area descriptor (0x8a6167b8):
flags:52 ( AUTO SGA NOSTATS )
opType:8(SORT (v2)) opId=1 proxy=(nil)
isize=31991K, osize=31991K, rowlen=326, flags=02
opt=0, 1pass=0, mpass=0, lastMem=0KB
maxTempSegSize=0KB, lastTempSegSize=0KB
Entry 32: 0 are active (max=1)
Entry 33: 1 are active (max=2)
......
Entry 65: 0 are active (max=1)
Entry 66: 1 are active (max=2)
The work area is an internal structure, probably not externalized
The temp seg state obj is a state object of type 62 (SORT SEGMENT HANDLE) in the process state dump
The shared work area descriptor is WORKAREA_ADDRESS in V$SQL_WORKAREA and V$SQL_WORKAREA_ACTIVE
Process state objects are used by PMON to release resources when a session ends abnormally
To dump the process state for operating system pid 3407:
$ sqlplus / as sysdba
SQL> ORADEBUG SETOSPID 3407
Oracle pid: 19, Unix process pid: 3407, image: oracle@vm1.juliandyke.com (TNS V1-V3)
SQL> ORADEBUG DUMP PROCESSSTATE 10
Statement processed.
Level 10 still appears to work
Process State dump ONLY includes a state object for the temporary segment
(SORT SEGMENT HANDLE)
No state objects for the work areas
SO: 0x8f31d4d0, type: 62, owner: 0x917a1e10, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x91498608, name=sort segment handle, file=ktst2.h LINE:883, pg=0
SORT SEGMENT HANDLE: tsn=3 extents=61
(enqueue) <no resource> lock_flag: 0x0, lock: 0x8f31d5a8, res: (nil) own: (nil), sess: (nil), prv: (nil)
(enqueue) released enqueue or enqueue in flux lock_flag: 0x0, lock: 0x8f31d610, res: 0x90d7b3c0 own: 0x917f8418, sess: 0x917f8418, prv: 0x8f31d620
(enqueue) <no resource> lock_flag: 0x0, lock: 0x8f31d678, res: (nil) own: (nil), sess: (nil), prv: (nil)
Oracle uses direct path writes to write blocks to the temporary segment
Default extent size is 1MB (128 x 8192 byte blocks)
For example (event 10046 / level 8 trace):
WAIT #140079396924400: nam='direct path write temp' ela= 27071 file number=201 first dba=67072 block cnt=31 obj#=78342 tim=1370678380137850
WAIT #140079396924400: nam='direct path write temp' ela= 608 file number=201 first dba=67103 block cnt=31 obj#=78342 tim=1370678380138800
WAIT #140079396924400: nam='direct path write temp' ela= 1084 file number=201 first dba=67134 block cnt=31 obj#=78342 tim=1370678380140195
WAIT #140079396924400: nam='direct path write temp' ela= 524 file number=201 first dba=67165 block cnt=31 obj#=78342 tim=1370678380140845
WAIT #140079396924400: nam='direct path write temp' ela= 145 file number=201 first dba=67196 block cnt=4 obj#=78342 tim=1370678380141102
Oracle writes each extent as 31 + 31 + 31 + 31 + 4 = 128 blocks
I think this weird pattern is required because each block contains a pointer to the next block in the temporary segment
The final block cannot be written until the next extent is allocated. When the address of the next extent is known, the final 4 blocks can be written.
In Oracle 11.2 direct path writes are implemented using the pwrite system call.
On Linux this can be traced using the strace utility
Output is written to stderr
For example, for operating system pid 3407:
strace -p 3407 2> strace.txt
pwrite(257, "\10\242\0\0\0\6A\0007++\0\0\0\1\f?\36\0\0\0\6A\0\0\0\0\0\1\6A\0"..., 253952, 549453824) = 253952
pwrite(257, "\10\242\0\0\37\6A\0007++\0\0\0\1\f\207Q\0\0\37\6A\0\37\0\0\0 \6A\0"..., 253952, 549707776) = 253952
pwrite(257, "\10\242\0\0>\6A\0007++\0\0\0\1\f\25\277\0\0>\6A\0>\0\0\0?\6A\0"..., 253952, 549961728) = 253952
pwrite(257, "\10\242\0\0]\6A\0007++\0\0\0\1\f\246u\0\0]\6A\0]\0\0\0^\6A\0"..., 253952, 550215680) = 253952
pwrite(257, "\10\242\0\0|\6A\0007++\0\0\0\1\f\224\247\0\0|\6A\0|\0\0\0}\6A\0"..., 32768, 550469632) = 32768
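The pwrite arguments can be cross-checked against the 'direct path write temp' waits with a little arithmetic. The sketch below assumes that the file offset is simply block number x block size; that is an assumption, although it agrees with every call in the strace output above. The function name `expected_pwrite` is hypothetical.

```python
BLOCK_SIZE = 8192  # bytes per block in this example database

# (first dba, block cnt) taken from the 'direct path write temp' waits
waits = [(67072, 31), (67103, 31), (67134, 31), (67165, 31), (67196, 4)]


def expected_pwrite(first_block, block_count, block_size=BLOCK_SIZE):
    """Return (byte count, file offset) that a write of these blocks would use."""
    return block_count * block_size, first_block * block_size


for first_block, block_count in waits:
    nbytes, offset = expected_pwrite(first_block, block_count)
    last_block = first_block + block_count - 1
    print(f"blocks {first_block}-{last_block}: pwrite(..., {nbytes}, {offset})")

# The five writes together cover one complete 128-block extent.
total_blocks = sum(count for _, count in waits)
```

The computed byte counts and offsets match the five pwrite calls one for one, which ties each wait event to a single system call.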
Oracle uses direct path reads to read blocks back from the temporary segment
Default read size is 7 x 8192 byte blocks (57344 bytes)
For example (event 10046 / level 8 trace):
WAIT #139938233130376: nam='direct path read temp' ela= 32 file number=201 first dba=59520 block cnt=7 obj#=78527 tim=1370770967683714
WAIT #139938233130376: nam='direct path read temp' ela= 29 file number=201 first dba=59527 block cnt=7 obj#=78527 tim=1370806245404502
WAIT #139938233130376: nam='direct path read temp' ela= 17 file number=201 first dba=59534 block cnt=7 obj#=78527 tim=1370806248875622
WAIT #139938233130376: nam='direct path read temp' ela= 15 file number=201 first dba=59541 block cnt=7 obj#=78527 tim=1370806251857467
WAIT #139938233130376: nam='direct path read temp' ela= 15 file number=201 first dba=59548 block cnt=7 obj#=78527 tim=1370806255340181
In Oracle 11.2 direct path reads are implemented using the pread system call.
For example, for operating system pid 3407:
strace -p 3407 2> strace.txt
pread(257, "\10\242\0\0\243\350@\0\251'.\0\0\0\1\f\246\306\0\0\243\350@\0#\0\0\0\244\350@\0"..., 57344, 487874560) = 57344
pread(257, "\10\242\0\0\252\350@\0\251'.\0\0\0\1\f\251\306\0\0\252\350@\0*\0\0\0\253\350@\0"..., 57344, 487931904) = 57344
To dump the block(s) of a TEMPFILE use:
ALTER SYSTEM DUMP TEMPFILE <file#> BLOCK <block#>;
For example:
ALTER SYSTEM DUMP TEMPFILE 1 BLOCK 67072;
To dump multiple block(s) of a TEMPFILE use:
ALTER SYSTEM DUMP TEMPFILE <file#>
BLOCK MIN <block#>
BLOCK MAX <block#>
For example:
ALTER SYSTEM DUMP TEMPFILE 1
BLOCK MIN 67072
BLOCK MAX 67079;
For example:
Start dump data blocks tsn: 3 file#:1 minblk 67072 maxblk 67080
Block dump from cache:
Dump of buffer cache at level 4 for tsn=3 rdba=4261376
Block dump from disk:
buffer tsn: 3 rdba: 0x00410600 (1/67072)
scn: 0x0000.002a70a0 seq: 0x01 flg: 0x0c tail: 0x70a00801
frmt: 0x02 chkval: 0x1e3e type: 0x08=unknown
Hex dump of block: st=0, typ_found=0
Dump of memory from 0x00007FBF9B56BA00 to 0x00007FBF9B56DA00
7FBF9B56BA00 0000A208 00410600 002A70A0 0C010000 [......A..p*.....]
7FBF9B56BA10 00001E3E 00410600 00000000 00410601 [>.....A.......A.]
7FBF9B56BA20 00000000 0000007F 00001FA4 00000044 [............D...]
7FBF9B56BA30 0000000E 00000266 02C10002 5A5A012C [....f.......,.ZZ]
7FBF9B56BA40 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A [ZZZZZZZZZZZZZZZZ]
Repeat 17 times
7FBF9B56BB60 5A5A5A5A 5A5A5A5A 00025A5A 006402C1 [ZZZZZZZZZZ....d.]
7FBF9B56BB70 58585858 58585858 58585858 58585858 [XXXXXXXXXXXXXXXX]
Repeat 5 times
7FBF9B56BBD0 58585858 595900C8 59595959 59595959 [XXXX..YYYYYYYYYY]
7FBF9B56BBE0 59595959 59595959 59595959 59595959 [YYYYYYYYYYYYYYYY]
Repeat 10 times
7FBF9B56BC90 59595959 59595959 59595959 00005959 [YYYYYYYYYYYYYY..]
7FBF9B56BCA0 00000266 03C10002 5A5A012C 5A5A5A5A [f.......,.ZZZZZZ]
Temporary Segment Blocks contain:
Header - 52 bytes
Body - 8136 bytes
Footer - 4 bytes
For example block 1 / 67072 - RDBA 0x410600
A208 0000 0600 0041 70A0 002A 0000 0C01 1E3E 0000 0600 0041 0000 0000 0601 0041
0000 0000 007F 0000 1FA4 0000 0044 0000 000E 0000
Fields are:
Bytes    Example      Description
0-1      0000 A208    Block Type (0x08 = Temporary Segment)
4-7      0041 0600    DBA of this block
8-11     002A 70A0    Sort number?
12-15    0C01 0000    Unknown - always 0xC01 = 3073
16-19    0000 1E3E    Check word
20-23    0041 0600    DBA of this block
24-27    0000 0000    Sequential number of block within sort (0 upwards)
28-31    0041 0601    DBA of next block
32-35    0000 0000    Sequential number of block within sort (0 upwards)
36-39    0000 007F    Blocks remaining in extent (0x7F = 127)
40-43    0000 1FA4    Bytes used (0x1FA4 = 8100)
44-47    0000 0044    Bytes unused (0x44 = 68)
48-51    0000 000E    Rows in block? (0xE = 14)
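The field layout above can be exercised with a short parser. This is a sketch under stated assumptions: the header is treated as thirteen little-endian 32-bit words (as displayed in the block dump), the field names are hypothetical labels for the descriptions in the table, and the DBA is split using the usual 10-bit file / 22-bit block layout.

```python
import struct

# The 13 words of the example header (52 bytes) for block 1 / 67072,
# written as the 32-bit values shown in the block dump.
words = [0x0000A208, 0x00410600, 0x002A70A0, 0x0C010000, 0x00001E3E,
         0x00410600, 0x00000000, 0x00410601, 0x00000000, 0x0000007F,
         0x00001FA4, 0x00000044, 0x0000000E]
header = struct.pack("<13I", *words)

# Hypothetical field names, one per 4-byte word, following the table.
FIELDS = ["type_word", "dba", "sort_number", "unknown", "check_word",
          "dba_copy", "block_seq", "next_dba", "block_seq_2",
          "blocks_remaining", "bytes_used", "bytes_unused", "row_count"]


def parse_header(raw):
    """Decode the first 52 bytes of a temporary segment block into named fields."""
    return dict(zip(FIELDS, struct.unpack("<13I", raw[:52])))


def split_dba(dba):
    """Split a data block address into (relative file number, block number)."""
    return dba >> 22, dba & 0x3FFFFF


h = parse_header(header)
print(split_dba(h["dba"]), split_dba(h["next_dba"]))
print(h["bytes_used"] + h["bytes_unused"])
```

Running the parser against other dumped blocks is a quick way to test whether fields such as the block sequence number and blocks remaining behave as described.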
Check word
Calculated using bitwise XOR as follows:
XOR of all words in block including check word is zero
typedef unsigned short ub2; /* 2-byte unsigned word */
#define BLOCK_SIZE 8192
ub2 getCheckWord (void *buffer)
{
  ub2 *sptr = (ub2 *)buffer;
  int i;
  ub2 m = 0;
  for (i = 0;i < BLOCK_SIZE / sizeof (ub2);i++)
  {
    if (i == 8) continue; /* skip the check word */
    m = m ^ sptr[i];
  }
  return m;
}
^ is bitwise XOR operator
Byte usage
Bytes used + bytes unused is always 8168
Rows in block
Appears to be number of rows in block - 2
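The check word rule is easy to try out on a test buffer. The Python sketch below ports the same XOR loop and applies it to a synthetic 8192-byte block (not real Oracle data); the helper name `xor_words` is hypothetical.

```python
import struct

BLOCK_SIZE = 8192       # bytes per block, as in the example database
CHECK_WORD_INDEX = 8    # 16-bit word at byte offset 16


def xor_words(block, skip_index=None):
    """XOR together all 16-bit words of the block, optionally skipping one."""
    words = struct.unpack(f"<{len(block) // 2}H", block)
    result = 0
    for i, word in enumerate(words):
        if i == skip_index:
            continue
        result ^= word
    return result


def get_check_word(block):
    # Same logic as the C function: XOR every word except the check word itself.
    return xor_words(block, skip_index=CHECK_WORD_INDEX)


# Synthetic block used only to exercise the rule.
block = bytearray(bytes(range(256)) * (BLOCK_SIZE // 256))
check = get_check_word(bytes(block))
struct.pack_into("<H", block, CHECK_WORD_INDEX * 2, check)

# With the check word stored, the XOR of every word in the block is zero.
print(xor_words(block))
```

The same function could be pointed at a block read from a tempfile copy to confirm the rule on real data.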
Body contains one or more sort rows:
Rows can be split across blocks
Columns are not split in this example, but it must be possible
Row starts with 4 byte value representing length of row in block
Rows are padded with zero byte(s) to ensure byte alignment for row length value
Column values are preceded by a 2-byte length word
Either maximum column length is 65535 (the largest value a 2-byte length word can hold) or some longer length value can be created
NULL values are represented by a 2-byte length word containing zeros
No indication of number of columns in row or continuation rows
Rows may be indexed in another structure that remains in memory
Columns are stored in a strange order. For the statement:
SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;
The column order is always C1, C5, C2, C3, C4
C1 is the primary key and the ORDER BY column
There is no apparent reason why C5 is next
This behaviour is entirely reproducible, but the reason for the re-ordering of the selected columns is currently unknown.
One theory is that the columns in the ORDER BY clause are stored first, followed by the longest column not included in the ORDER BY clause. This would make the sort output more deterministic where the ORDER BY columns are non-unique.
Further research is required
Example of row storage
0000 0269 # Length = 617
0006 3030 3030 3131 # C1 = 000011
012C 5A5A 5A5A 5A5A 5A5A ... 5A5A 5A5A # C5 = ‘ZZZZZZ...ZZZ’
0001 80 # C2 = 0
0064 5858 5858 5858 5858 ... 5858 5858 # C3 = ‘XXXXXX...XXX’
00C8 5959 5959 5959 5959 ... 5959 5959 # C4 = ‘YYYYYY...YYY’
Total size is 617 bytes
C1 = 2 + 6 = 8 bytes
C2 = 2 + 1 = 3 bytes
C3 = 2 + 100 = 102 bytes
C4 = 2 + 200 = 202 bytes
C5 = 2 + 300 = 302 bytes
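The row layout can be checked by encoding and decoding the example row. This is a sketch only: it assumes little-endian length fields (as seen in the x86 dumps), ignores padding and rows split across blocks, and the function names `encode_row` and `decode_row` are hypothetical.

```python
import struct


def encode_row(columns, byteorder="<"):
    """Build a sort row: 4-byte row length, then each column with a 2-byte length."""
    body = b"".join(struct.pack(byteorder + "H", len(c)) + c for c in columns)
    return struct.pack(byteorder + "I", len(body)) + body


def decode_row(raw, byteorder="<"):
    """Return (row length, list of column values) for one unsplit, unpadded row."""
    (length,) = struct.unpack_from(byteorder + "I", raw, 0)
    pos, end, columns = 4, 4 + length, []
    while pos < end:
        (n,) = struct.unpack_from(byteorder + "H", raw, pos)
        pos += 2
        columns.append(raw[pos:pos + n])
        pos += n
    return length, columns


# Columns in the stored order C1, C5, C2, C3, C4; 0x80 is the NUMBER value 0.
row = encode_row([b"000011", b"Z" * 300, b"\x80", b"X" * 100, b"Y" * 200])
length, cols = decode_row(row)
print(length, [len(c) for c in cols])
```

A NULL column, stored as a zero length word, would decode here as an empty byte string.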
Comparing the contents of the temporary segment to the original table we can conclude that (for this example)
Each extent contains 128 blocks
Data is sorted and written in batches of APPROXIMATELY 8 blocks
In each extent there will therefore be approximately 16 batches, each sorted internally
Batches do not align with block boundaries
For this example, the data is probably sorted in local process heap and then written directly to the temporary segment
Data appears to be identical between local heap and the temporary segment.
However, there is different metadata in each structure and therefore the data does not fully align in the temporary segment blocks
MEMORY_TARGET parameter set to 800M
SQL> SELECT bytes FROM v$sgainfo WHERE name = 'Granule Size';
BYTES
----------
4194304
Granule size is 4MB:
800MB / 4MB = 200 granules
SQL> SELECT grantype,COUNT(*) FROM x$ksmge
GROUP BY grantype
ORDER BY grantype;
GRANTYPE   COUNT(*)
---------- ----------
0          70   # Unallocated
1          58   # Shared pool
2          1    # Large pool
3          1    # Java pool
7          43   # Buffer cache
14         25   #
Total 198 granules reported by X$KSMGE
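The granule arithmetic can be reproduced directly from the figures on this slide. The sketch below is plain arithmetic on those numbers; the variable names are hypothetical.

```python
MB = 1024 * 1024

memory_target = 800 * MB   # MEMORY_TARGET = 800M
granule_size = 4194304     # from V$SGAINFO 'Granule Size'

# GRANTYPE -> COUNT(*) as reported by X$KSMGE
granule_counts = {0: 70, 1: 58, 2: 1, 3: 1, 7: 43, 14: 25}

expected_granules = memory_target // granule_size
reported_granules = sum(granule_counts.values())
unallocated_bytes = granule_counts[0] * granule_size

print(expected_granules, reported_granules, expected_granules - reported_granules)
print(unallocated_bytes, unallocated_bytes // MB)
```

The 70 unallocated granules come to 293601280 bytes (280 MB), the same figure that V$PGASTAT reports for "aggregate PGA target parameter".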
Unallocated granules can be determined using:
SQL> SELECT granaddr FROM x$ksmge
WHERE grantype = 0
ORDER BY granaddr;
GRANADDR
----------------
0000000060800000
0000000060C00000
0000000061000000
0000000061400000
0000000061800000
0000000061C00000
....
0000000070C00000
0000000071000000
0000000071400000
0000000071800000
0000000071C00000
70 rows selected.
The shared memory devices mapped to the unallocated granules can be determined using pmap for the operating system PID (3407):
pmap 3407
0000000060800000 4096K rwxs/dev/shm/ora_TEST_2392076_0
0000000060c00000 4096K rwxs/dev/shm/ora_TEST_2392076_1
0000000061000000 4096K rwxs/dev/shm/ora_TEST_2392076_2
0000000061400000 4096K rwxs/dev/shm/ora_TEST_2392076_3
0000000061800000 4096K rwxs/dev/shm/ora_TEST_2392076_4
0000000061c00000 4096K rwxs/dev/shm/ora_TEST_2392076_5
0000000062000000 4096K rwxs/dev/shm/ora_TEST_2392076_6
....
0000000070400000 4096K rwxs/dev/shm/ora_TEST_2392076_63
0000000070800000 4096K rwxs/dev/shm/ora_TEST_2392076_64
0000000070c00000 4096K rwxs/dev/shm/ora_TEST_2392076_65
0000000071000000 4096K rwxs/dev/shm/ora_TEST_2392076_66
0000000071400000 4096K rwxs/dev/shm/ora_TEST_2392076_67
0000000071800000 4096K rwxs/dev/shm/ora_TEST_2392076_68
0000000071c00000 4096K rwxs/dev/shm/ora_TEST_2392076_69
The mappings can also be determined using strace during process startup
However, the output is more difficult to read.
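For the addresses shown in the pmap listing, the numeric suffix of each /dev/shm file follows from the granule address. The sketch below assumes the suffix is the granule's index counted from the first mapped address (0x60800000); that holds for every line listed but is an observation, not a documented rule. The function name `shm_index` is hypothetical.

```python
GRANULE_SIZE = 4 * 1024 * 1024    # 4096K per mapping
FIRST_GRANULE_ADDR = 0x60800000   # first GRANADDR / first pmap address shown


def shm_index(granule_addr, base=FIRST_GRANULE_ADDR, granule_size=GRANULE_SIZE):
    """Return the suffix N of /dev/shm/ora_TEST_2392076_N for a granule address."""
    offset = granule_addr - base
    if offset < 0 or offset % granule_size:
        raise ValueError("address is not on a granule boundary above the base")
    return offset // granule_size


for addr in (0x60800000, 0x60C00000, 0x70400000, 0x71C00000):
    print(f"{addr:016X} -> /dev/shm/ora_TEST_2392076_{shm_index(addr)}")
```

This makes it straightforward to go from a GRANADDR value in X$KSMGE to the shared memory file backing it.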
Mappings are created for all granules during process start up
Even if shared memory is currently unallocated for that device
$ ls -l /dev/shm/ora_TEST*
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_0
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_1
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_2
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_3
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_4
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_5
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_6
....
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_63
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_64
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_65
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_66
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_67
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_68
-rw-r----- 1 oracle oinstall 0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_69
When a process starts, it creates mappings between addresses and shared memory devices for all granules (in this case 200 granules).
If granules are allocated to the SGA their size will be 4194304 (in this case); if they are not allocated the size will be 0.
Local PGA memory for a specific process can be dumped using ORADEBUG:
$ sqlplus / as sysdba
SQL> ORADEBUG SETOSPID 3407
Oracle pid: 19, Unix process pid: 3407, image: oracle@vm1.juliandyke.com (TNS V1-V3)
SQL> ORADEBUG DUMP HEAPDUMP 1025
Statement processed.
In this case the section of interest is “top uga heap”
HEAP DUMP heap name="top uga heap" desc=0xbaf9700 extent sz=0xffc0 alt=232 het=32767 rec=0 flg=2 opc=2 parent=(nil) owner=(nil) nex=(nil) xsz=0xfff8 heap=(nil) fl2=0x60, nex=(nil), dsxvers=1, dsxflg=0x0 dsx first ext=0xe8ab0008
Sorted rows appear in the heap dump:
EXTENT 0 addr=0x7f45e8927008
Chunk 7f45e8927018 sz= 65512 freeable "session heap " ds=0x7f45e8aa67c8
Dump of memory from 0x00007F45E8927018 to 0x00007F45E8937000
7F45E8927010 0000FFE9 10B38F00 [........]
......
7F45E8927400 34010007 39383533 02620431 5A5A012C [...435891.b.,.ZZ]
7F45E8927410 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A [ZZZZZZZZZZZZZZZZ]
Repeat 17 times
7F45E8927530 5A5A5A5A 5A5A5A5A 00025A5A 006405C1 [ZZZZZZZZZZ....d.]
7F45E8927540 58585858 58585858 58585858 58585858 [XXXXXXXXXXXXXXXX]
Repeat 5 times
7F45E89275A0 58585858 595900C8 59595959 59595959 [XXXX..YYYYYYYYYY]
7F45E89275B0 59595959 59595959 59595959 59595959 [YYYYYYYYYYYYYYYY]
Repeat 10 times
7F45E8927660 59595959 59595959 59595959 07055959 [YYYYYYYYYYYYYY..]
7F45E8927670 33350100 31393835 2C026204 5A5A5A01 [..535891.b.,.ZZZ]
7F45E8927680 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A [ZZZZZZZZZZZZZZZZ]
Repeat 17 times
7F45E89277A0 5A5A5A5A 5A5A5A5A C100025A 58006406 [ZZZZZZZZZ....d.X]
7F45E89277B0 58585858 58585858 58585858 58585858 [XXXXXXXXXXXXXXXX]
Repeat 5 times
7F45E8927810 C8585858 59595900 59595959 59595959 [XXX..YYYYYYYYYYY]
7F45E8927820 59595959 59595959 59595959 59595959 [YYYYYYYYYYYYYYYY]
Repeat 10 times
7F45E89278D0 59595959 59595959 59595959 00070559 [YYYYYYYYYYYYY...]
We know these rows have been sorted as the columns appear in the “wrong” order
Local heap memory shows as /dev/zero in the pmap output
For example the heap extent in the previous slide was
EXTENT 0 addr=0x7f45e8927008
Chunk 7f45e8927018 sz= 65512 freeable "session heap " ds=0x7f45e8aa67c8
Dump of memory from 0x00007F45E8927018 to 0x00007F45E8937000
Output of pmap for this area of memory is:
pmap 3407
....
00007f45e88f7000 64K rwx-/dev/zero # EXTENT 3
00007f45e8907000 64K rwx-/dev/zero # EXTENT 2
00007f45e8917000 64K rwx-/dev/zero # EXTENT 1
00007f45e8927000 64K rwx-/dev/zero # EXTENT 0
00007f45e8937000 64K rwx-/dev/zero
00007f45e8947000 64K rwx-/dev/zero
00007f45e8957000 64K rwx-/dev/zero
00007f45e8967000 64K rwx-/dev/zero
00007f45e8977000 64K rwx-/dev/zero
General conclusions:
In this example:
Sorted rows are inserted directly into the temporary segment when the PGA space allocated in local memory is exhausted
All selected columns are included in sorted rows
ORDER BY columns
Other columns
Some dynamic performance views are unreliable
In general:
Minimize number of columns in SELECT list
For some queries it may be more efficient to order the rows first, then to revisit the table for the remaining data
Avoids unnecessary writes to temporary segments
The original presentation as delivered in Scotland contained some inaccuracies that have been fixed in this version
This version also contains the results of some additional research
Thank you to the following for their invaluable advice, suggestions and additional input during the event:
Jonathan Lewis
Tony Hasler
Bjoern Rost