PPT, 1.45Mb - UK Oracle User Group

advertisement

A Presentation

Of Sorts

Julian Dyke

Independent Consultant

1

OUG Scotland Conference

June 2013

© 2013 Julian Dyke juliandyke.com

2

Motivation 1

 Site A - Exadata v2 Half Rack

SELECT /*+ FULL (t1) / COUNT(*) FROM t1

 takes 5 seconds

CREATE TABLE jmd1 PARALLEL 8 AS SELECT * FROM t1

) takes 336 seconds

Analytic function creates

CREATE TABLE jmd2 PARALLEL 8 AS

SELECT c1, c2, c3 … c35

CASE WHEN rank = 1

THEN TO_DATE (‘01-JAN-1990’,’DD-MON-YYYY’)

ELSE c36 temporary segment to perform the sort operation. Temporary segment performance is relatively poor on Exadata

(

END

FROM compared with cell smart scans

SELECT c1,c2,c3,…,c35 RANK() OVER (PARTITION BY c1 ORDER BY p36) rank,

NVL((LEAD(c36) OVER (PARTITION BY c1 ORDER BY c36)),

TO_DATE( '31/12/9999','DD/MM/YYYY' )

 takes 722 seconds

© 2013 Julian Dyke juliandyke.com

Motivation 2

Site B – Packaged application

DESC t2

C1

C2

C3

C4

C5

C6

C7

C8

C9

C10

C11

NUMBER

DATE

NUMBER

LONG RAW

DATE

DATE

NUMBER

DATE

DATE

DATE

DATE

 c4 is always 17,000 bytes

17,000 byte LONG RAW is included in the sort data.

Table contains millions of rows so sort requires a huge temporary sort segment

SELECT c1, c2, c3, c4, c5, c6, c7,c8,c9,c10,c11 FROM t2 ORDER BY c1;

 Takes days or weeks to complete

3 © 2013 Julian Dyke juliandyke.com

How Sorts Work

 Sorts appear to involve a two stage process

 1 - Sort in local process (session heap) memory

 If there is not sufficient space

 2

– Spill to temporary segment

If sorts run in local memory we probably don’t need to tune them.

If sorts spill to the temporary tablespace this generates additional I/O and might need tuning.

4 © 2013 Julian Dyke juliandyke.com

Test Environment

 This presentation was developed in the following environment:

 Oracle VirtualBox 4.2.0

 Base Memory = 2GB

 Oracle Enterprise Linux 5 Update 6

 Oracle Enterprise Edition 11.2.0.3

 Automatic Memory Management

 MEMORY_TARGET = 800M

 Instance was restarted for each test

 Ensures shared memory is initialized

5 © 2013 Julian Dyke juliandyke.com

Sample Table

 Initial tests were based on the following table:

(

CREATE TABLE p1 c1 VARCHAR2 (10) PRIMARY KEY, c2 NUMBER, c3 VARCHAR2 (100), c4 VARCHAR2 (200) c5 VARCHAR2 (300), c6 NUMBER

)

TABLESPACE ts01;

6 © 2013 Julian Dyke juliandyke.com

7

REVERSE Function

 To ensure sorted data was in a different order to the table data, I decided to

REVERSE the primary key

 However, the REVERSE function is undocumented

It works in SQL

It does not work in PL/SQL

 Workaround

 Create a function called MY_REVERSE

Cast VARCHAR to RAW

Use UTL_RAW.REVERSE function

CREATE OR REPLACE FUNCTION my_reverse (p_str IN VARCHAR2)

RETURN VARCHAR2 IS

BEGIN

RETURN UTL_RAW.CAST_TO_VARCHAR2

(UTL_RAW.REVERSE (UTL_RAW.CAST_TO_RAW (p_str)));

/

END;

© 2013 Julian Dyke juliandyke.com

8

Sample Data

 The sample table was populated as follows:

DECLARE l_c1 VARCHAR2(10); l_c2 NUMBER; l_c3 VARCHAR2(100) := LPAD ('X',100,'X') l_c4 VARCHAR2(200) := LPAD ('Y',200,'Y'); l_c5 VARCHAR2(300) := LPAD ('Z',300,'Z'); l_c6 NUMBER;

100,000 rows

BEGIN

FOR f IN 1..100000

LOOP

REVERSE function

Gather optimizer statistics using:

Fixed Length

/

END; l_c1 := MY_REVERSE (TO_CHAR (100000 + f)); l_c2 := MOD (f,10); l_c6 := MOD (f,100);

INSERT INTO p1 VALUE (l_c1,l_c2,l_c3,l_c4,l_c5,l_c6);

END LOOP;

I now realise there are lots of flaws in the above, but it was sufficient to make some early progress

EXECUTE dbms_stats.gather_table_stats (ownname=>USER,tabname=>'P1', estimate_percent=>NULL,cascade=>TRUE);

© 2013 Julian Dyke juliandyke.com

9

Sample Statement

 The following statements were executed in a new SQL*Plus session:

SELECT sid FROM v$mystat WHERE ROWNUM < 2;

SID

42

SID = 42

SELECT value FROM v$diag_info

WHERE name = 'Default Trace File';

VALUE

/u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3407.trc

OS PID = 3407

EXECUTE dbms_monitor.session_trace_enable;

SET PAUSE ON

SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;

EXECUTE dbms_monitor.session_trace_disable;

 SET PAUSE ON prevents all rows being returned immediately

 Allows sort areas to be inspected while operation is ongoing

© 2013 Julian Dyke juliandyke.com

Execution Plan

 Execution plan from the sample statement is:

SQL> SET AUTOTRACE TRACE EXPLAIN

SQL> SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;

Execution Plan

----------------------------------------------------------

Plan hash value: 101931517

-----------------------------------------------------------------------------------

| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time |

-----------------------------------------------------------------------------------

| 0 | SELECT STATEMENT | | 100K| 58M| | 15686 (1)| 00:03:09 |

| 1 | SORT ORDER BY | | 100K| 58M| 60M| 15686 (1)| 00:03:09 |

| 2 | TABLE ACCESS FULL| P1 | 100K| 58M| | 2741 (1)| 00:00:33 |

-----------------------------------------------------------------------------------

 SQL_ID is

SELECT sql_id FROM v$session WHERE sid = 42;

SQL_ID

-------------

9y1u8zmhjyw11

60M / 100K = 600 bytes/row

10 © 2013 Julian Dyke juliandyke.com

Dynamic Performance Views

 PGA usage is summarized in

 V$PGASTAT

 Sort operations are described in a number of dynamic performance views including:

 V$SQL_WORKAREA

V$SQL_WORKAREA_ACTIVE

V$SORT_USAGE

V$SORT_SEGMENT

 Temporary tablespaces and their extents are described in:

 V$TEMP_SPACE_HEADER

V$TEMP_EXTENT_POOL

V$TEMP_EXTENT_MAP

11 © 2013 Julian Dyke juliandyke.com

V$PGASTAT

 Reports various PGA statistics

SELECT name, value FROM v$pgastat;

Name

PGA memory freed back to OS aggregate PGA auto target aggregate PGA target parameter bytes processed cache hit percentage extra bytes read/written global memory bound max processes count maximum PGA allocated maximum PGA used for auto workareas maximum PGA used for manual workareas

Value

128188416

153713664

293601280

64159744

100

0

58720256

36

207040512

58753024

0

12 © 2013 Julian Dyke juliandyke.com

V$PGASTAT

 Continued:

SELECT name, value FROM v$pgastat;

Name over allocation count process count recompute count (total) total PGA allocated total PGA inuse total PGA used for auto workareas total PGA used for manual workareas total freeable PGA memory

Value

0

32

1630

148678656

123791360

117760

0

14417920

13 © 2013 Julian Dyke juliandyke.com

Work Areas

 Child cursor includes one work area for each sort operation

Each child cursor may have zero or more sort operations

Reported in V$SQL_WORKAREA

 A work area is active if a child cursor owning that work area is currently executing

 Reported in V$SQL_WORKAREA_ACTIVE

14 © 2013 Julian Dyke juliandyke.com

V$SQL_WORKAREA

 Returns one work area for each sort operation in the library cache

SELECT * FROM v$sql_workarea WHERE sql_id = ‘9y1u8zmhjyw11’;

15

Column

ADDRESS

HASH_VALUE

SQL_ID

CHILD_NUMBER

Value

000000008AFBBA30

3776933921

9y1u8zmhjyw11

0

WORKAREA_ADDRESS

OPERATION_TYPE

OPERATION_ID

POLICY

ESTIMATED_OPTIMAL_SIZE

000000008A6167B8

SORT (v2)

1 Column

AUTO

Operation Type = SORT (V2)

TOTAL_EXECUTIONS

Value

0

36852736 OPTIMAL_EXECUTIONS 0

ESTIMATED_ONEPASS_SIZE 2166784 ONEPASS_EXECUTIONS 0

LAST_MEMORY_USED 0 MULTIPASSES_EXECUTIONS 0

LAST_EXECUTION OPTIMAL ACTIVE_TIME 0

LAST_DEGREE 1 MAX_TEMPSEG_SIZE 0

LAST_TEMPSEG_SIZE

© 2013 Julian Dyke

0 juliandyke.com

16

Work Area Operation Types

 Each work area has an operation type. Known operation types include:

3

4

5

Type

1

2

8

9

6

7

10

Name

GROUP BY (SORT)

IDX MAINTENANCE (SORT)

WINDOW (SORT)

ROLLUP (SORT)

BUFFER

CONNECT-BY (SORT)

SORT (V1)

SORT (V2)

HASH-JOIN

BITMAP CONSTRUCTION

13

14

15

Type

11

12

16

17

18

19

20

Name

BITMAP MERGE

LOAD WRITE BUFFERS

SPREADSHEET

FIC LOAD SNAKES

FIC LOAD TRANSACTIONS

FIC GENERATE CANDIDATE

FIC TREE COUNTING

FIC BITMAP ANDING

DST BITMAP COUNTING

GROUP BY (HASH)

WINDOW (SORT) is used in analytic queries

SPREADSHEET is used by the Model Clause

FIC is Frequent Itemset Counting

DST is Decision Tree

© 2013 Julian Dyke juliandyke.com

V$SQL_WORKAREA_ACTIVE

 Returns one row for each currently executing work area

SELECT * FROM v$sql_workarea_active WHERE sql_id = ‘9y1u8zmhjyw11’;

17

Column

SQL_HASH_VALUE

SQL_ID

SQL_EXEC_START

SQL_EXEC_ID

WORKAREA_ADDRESS

OPERATION_TYPE

OPERATION_ID

POLICY

SID

QCINST_ID

QCSID

ACTIVE_TIME

WORK_AREA_SIZE

EXPECTED_SIZE

© 2013 Julian Dyke

Value

3776933921

3776933921

08-JUN-13

16777216

000000008A6167B8

SORT (v2)

1

AUTO

42

Column

ACTUAL_MEM_USED

MAX_MEM_USED

NUMBER_PASSES

TEMPSEG_SIZE

223831235

1245184

1245184

TABLESPACE

SEGRFNO#

SEGBLK#

Value

117760

58753024

1

63963136

TEMP

1

67072 juliandyke.com

18

V$SORT_USAGE

 Returns one row for each currently executing work area

SELECT * FROM v$sort_usage;

Column

USERNAME

USER

SESSION_ADDR

SESSION_NUM

SQLADDR

SQLHASH

SQL_ID

TABLESPACE

CONTENTS

SEGTYPE

SEGFILE#

SEGBLK

EXTENTS

BLOCKS

SEGRFNO#

© 2013 Julian Dyke

Value

US01

US01

00000000917A1E10

11

000000008AFBF7F0

774405238 grh681wr2hz3q

TEMP

TEMPORARY

SORT

201

67072

61

7808

1

Description

V$SESSION.SADDR

V$SESSION.SERIAL#

V$SESSION.PREV_SQL_ADDR

V$SESSION.PREV_HASH_VALUE

V$SESSION.PREV_SQL_ID

juliandyke.com

V$SORT_USAGE

 Returns list of statements using sort areas

 Very misleading

 Reports previous SQL statement executed by session

 NOT current SQL statement performing sort

 Reports address of parent cursor

 Does not report child address or child number

 Useless for multiple child cursors

 Reports absolute file number as 201 and above

 OK but inconsistent with other views

19 © 2013 Julian Dyke juliandyke.com

20

V$SORT_SEGMENT

 Returns one row for each currently executing work area

SELECT * FROM v$sort_segment;

Column

TABLESPACE_NAME

SEGMENT_FILE

SEGMENT_BLOCK

EXTENT_SIZE

CURRENT_USERS

TOTAL_EXTENTS

TOTAL_BLOCKS

USED_EXTENTS

USED_BLOCKS

FREE_EXTENTS

FREE_BLOCKS

ADDED_EXTENTS

EXTENT_HITS

FREED_EXTENTS

FREE_REQUESTS

© 2013 Julian Dyke

0

61

0

0

Value

TEMP

0

0

128

1

524

67072

61

7808

463

59264

Column

MAX_SIZE

MAX_BLOCKS

MAX_USED_SIZE

MAX_USED_BLOCKS

MAX_SORT_SIZE

MAX_SORT_BLOCKS

RELATIVE_FNO

Value

524

67072

61

7808

61

7808

0 juliandyke.com

V$SORT_SEGMENT

 Returns list of currently active sort segments

 Incorrectly reports

 Segment absolute file number

Segment block number

Segment relative file number

21 © 2013 Julian Dyke juliandyke.com

22

V$TEMP_EXTENT_MAP

 Returns one row for each extent in temporary tablespace

SELECT * FROM v$temp_extent_map;

Column

TABLESPACE_NAME

FILE_ID

BLOCK_ID

BYTES

BLOCKS

OWNER

RELATIVE_FNO

Column

TABLESPACE_NAME

FILE_ID

BLOCK_ID

BYTES

BLOCKS

OWNER

RELATIVE_FNO

© 2013 Julian Dyke

Extent 522

TEMP

1

66816

1048576

128

1

1

Extent 1

TEMP

1

128

1048576

128

1

1

Extent 523

TEMP

1

66944

1048576

128

1

1

Extent 2

TEMP

1

256

1048576

128

1

1

Extent 3

TEMP

1

384

1048576

128

1

1

Extent 524

TEMP

1

67072

1048576

128

1

1 juliandyke.com

V$TEMP_EXTENT_POOL

 Returns one row for each temporary file

SELECT * FROM v$temp_extent_pool;

Column

TABLESPACE_NAME

FILE_ID

EXTENTS_CACHEED

EXTENTS_USED

BLOCKS_CACHED

BLOCKS_USED

BYTES_CACHED

BYTES_USED

RELATIVE_FNO

Value

TEMP

1

524

61

67072

7808

549453824

63963136

1

23 © 2013 Julian Dyke juliandyke.com

V$TEMP_SPACE_HEADER

 Returns one row for each temporary file

SELECT * FROM v$temp_space_header;

Column

TABLESPACE_NAME

FILE_ID

BYTES_USED

BLOCKS_USED

BYTES_FREE

BLOCKS_FREE

RELATIVE_FNO

Value

TEMP

1

0

1

550502400

67200

0

24 © 2013 Julian Dyke juliandyke.com

Work Area Table Dump

 The work area table can be dumped using:

$ sqlplus / as sysdba

SQL> ORADEBUG SETMYPID

Statement processed.

SQL> ORADEBUG DUMP WORKAREATAB_DUMP 3

Statement processed.

 The trace file can be copied to the local directory using:

SQL> ORADEBUG TRACEFILE_NAME

/u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3779.trc

SQL> !cp /u01/app/oracle/diag/rdbms/test/TEST/trace/TEST_ora_3779.trc wa3.trc

25 © 2013 Julian Dyke juliandyke.com

26

Work Area Table Dump

 The work area table dump includes a summary of SGA/PGA use:

Dumping Work Area Table (level=3)

=================================

Global SGA Info

--------------global target: 280 MB auto target: 145 MB max pga: 200 MB pga limit: 0 MB pga limit known: 0 pga limit errors: 0 pga inuse: 118 MB pga alloc: 143 MB pga freeable: 13 MB pga freed: 112 MB pga to free: 0 % broker request: 0 pga auto: 0 MB pga manual: 0 MB pga alloc (max): 197 MB pga auto (max): 56 MB pga manual (max): 0 MB

# workareas : 1

# workareas(max): 3

© 2013 Julian Dyke juliandyke.com

Work Area Table Dump

 The work area table is an array containing 67 elements

27

Work Area Table (67 entries)

----------------------------

Entry 0: 0 are active (max=1)

Entry 1: 0 are active (max=1)

......

Entry 30: 0 are active (max=1)

Entry 31: 1 are active (max=2) work area is an internal structure

Probably not externalized

Work area 0x8d7e6ec8 wid=2, trcId=-1 sid=42, ser#=11, flags=59 (AUTO|SGA|ADVICE|ADVICE_START|DATASTATS_SET) qc_sid=65535, qc_instid=65535, slv_grp=1 curSize=115, maxSize=57376, #maxPasses=1 tempSegStateObj=0x8f31d4d0, tempSegObjType=2 temp seg state obj is state object of type

62 (SORT SEGMENT HANDLE) in shared work area descriptor (0x8a6167b8): flags:52 ( AUTO SGA NOSTATS ) process state dump opType:8(SORT (v2)) opId=1 proxy=(nil) isize=31991K, osize=31991K, rowlen=326, flags=02 opt=0, 1pass=0, mpass=0, lastMem=0KB maxTempSegSize=0KB, lastTempSegSize=0KB shared work area descriptor is

Entry 32: 0 are active (max=1)

Entry 33: 1 are active (max=2)

......

WORKAREA_ADDRESS in

V$SQL_WORKAREA and

V$SQL_WORKAREA_ACTIVE

Entry 65: 0 are active (max=1)

Entry 66: 1 are active (max=2)

© 2013 Julian Dyke juliandyke.com

State Objects

 Used by PMON to release resources when a session ends abnormally

 To dump for operating system pid 3407

$ sqlplus / as sysdba

SQL> ORADEBUG SETOSPID 3407

Oracle pid: 19, Unix process pid: 3407, image: oracle@vm1.juliandyke.com (TNS V1-V3)

SQL> ORADEBUG DUMP PROCESSSTATE 10

Statement processed.

 Level 10 still appears to work

28 © 2013 Julian Dyke juliandyke.com

State Objects

 Process State dump ONLY includes a state object for the temporary segment

(SORT SEGMENT HANDLE)

 No state objects for the work areas

SO: 0x8f31d4d0, type: 62, owner: 0x917a1e10, flag: INIT/-/-/0x00 if: 0x1 c: 0x1 proc=0x91498608, name=sort segment handle, file=ktst2.h LINE:883, pg=0

SORT SEGMENT HANDLE: tsn=3 extents=61

(enqueue) <no resource> lock_flag: 0x0, lock: 0x8f31d5a8, res: (nil) own: (nil), sess: (nil), prv: (nil)

(enqueue) released enqueue or enqueue in flux lock_flag: 0x0, lock: 0x8f31d610, res: 0x90d7b3c0 own: 0x917f8418, sess: 0x917f8418, prv: 0x8f31d620

(enqueue) <no resource> lock_flag: 0x0, lock: 0x8f31d678, res: (nil) own: (nil), sess: (nil), prv: (nil)

29 © 2013 Julian Dyke juliandyke.com

30

Direct Path Writes

 Oracle uses direct path writes to write blocks to the temporary segment

 Default block size is 1MB (128 x 8192 byte blocks)

 For example (event 10046 / level 8 trace):

WAIT #140079396924400: nam='direct path write temp' ela= 27071 file number=201 first dba=67072 block cnt=31 obj#=78342 tim=1370678380137850

WAIT #140079396924400: nam='direct path write temp' ela= 608 file number=201 first dba=67103 block cnt=31 obj#=78342 tim=1370678380138800

WAIT #140079396924400: nam='direct path write temp' ela= 1084 file number=201 first dba=67134 block cnt=31 obj#=78342 tim=1370678380140195

WAIT #140079396924400: nam='direct path write temp' ela= 524 file number=201 first dba=67165 block cnt=31 obj#=78342 tim=1370678380140845

WAIT #140079396924400: nam='direct path write temp' ela= 145 file number=201 first dba=67196 block cnt=4 obj#=78342 tim=1370678380141102

 Oracle allows writes 31 + 31 + 31 + 31 + 4 = 128 blocks

I think this weird pattern is required because each block contains a pointer to the next block in the temporary segment

The final block cannot be written until the next extent is allocated. When the address of the next extent is known, the final 4 blocks can be written.

© 2013 Julian Dyke juliandyke.com

31

Direct Path Writes

 In Oracle 11.2 direct path writes are implemented using the pwrite system call.

 On Linux this can be traced using the strace utility

 Output is written to stderr

 For example for operating system pid 3407 strace –p 3407 2> strace.txt; pwrite(257, "\10\242\0\0\0\6A\0007++\0\0\0\1\f?\36\0\0\0\6A\0\0\0\0\0\1\6A\0"...,

253952, 549453824) = 253952 pwrite(257, "\10\242\0\0\37\6A\0007++\0\0\0\1\f\207Q\0\0\37\6A\0\37\0\0\0

\6A\0"..., 253952, 549707776) = 253952 pwrite(257, "\10\242\0\0>\6A\0007++\0\0\0\1\f\25\277\0\0>\6A\0>\0\0\0?\6A\0"...,

253952, 549961728) = 253952 pwrite(257, "\10\242\0\0]\6A\0007++\0\0\0\1\f\246u\0\0]\6A\0]\0\0\0^\6A\0"...,

253952, 550215680) = 253952 pwrite(257, "\10\242\0\0|\6A\0007++\0\0\0\1\f\224\247\0\0|\6A\0|\0\0\0}\6A\0"...,

32768, 550469632) = 32768

© 2013 Julian Dyke juliandyke.com

Direct Path Reads

 Oracle uses direct path reads to read blocks back from the temporary segment

 Default read size is 7 x 8192 byte blocks (57344 bytes)

 For example (event 10046 / level 8 trace):

WAIT #139938233130376: nam='direct path read temp' ela= 32 file number=201 first dba=59520 block cnt=7 obj#=78527 tim=1370770967683714

WAIT #139938233130376: nam='direct path read temp' ela= 29 file number=201 first dba=59527 block cnt=7 obj#=78527 tim=1370806245404502

WAIT #139938233130376: nam='direct path read temp' ela= 17 file number=201 first dba=59534 block cnt=7 obj#=78527 tim=1370806248875622

WAIT #139938233130376: nam='direct path read temp' ela= 15 file number=201 first dba=59541 block cnt=7 obj#=78527 tim=1370806251857467

WAIT #139938233130376: nam='direct path read temp' ela= 15 file number=201 first dba=59548 block cnt=7 obj#=78527 tim=1370806255340181

32 © 2013 Julian Dyke juliandyke.com

Direct Path Reads

 In Oracle 11.2 direct path reads are implemented using the pread system call.

 For example for operating system pid 3407 strace –p 3407 2> strace.txt; pread(257, "\10\242\0\0\243\350@\0\251'.\0\0\0\1\f\246\306\0\0\243\350@\

0#\0\0\0\244\350@\0"..., 57344, 487874560) = 57344 pread(257, "\10\242\0\0\252\350@\0\251'.\0\0\0\1\f\251\306\0\0\252\350@\

0*\0\0\0\253\350@\0"..., 57344, 487931904) = 57344

33 © 2013 Julian Dyke juliandyke.com

Dumping TEMPFILE blocks

 To dump the block(s) of a TEMPFILE use:

ALTER SYSTEM DUMP TEMPFILE <file#> BLOCK <block#>;

 For example:

ALTER SYSTEM DUMP TEMPFILE 1 BLOCK 67072;

 To dump multiple block(s) of a TEMPFILE use:

ALTER SYSTEM DUMP TEMPFILE <file#>

BLOCK MIN <block#>

BLOCK MAX <block#>

 For example:

ALTER SYSTEM DUMP TEMPFILE 1

BLOCK MIN 67072

BLOCK MAX 67079;

34 © 2013 Julian Dyke juliandyke.com

Dumping TEMPFILE blocks

 For example:

Start dump data blocks tsn: 3 file#:1 minblk 67072 maxblk 67080

Block dump from cache:

Dump of buffer cache at level 4 for tsn=3 rdba=4261376

Block dump from disk: buffer tsn: 3 rdba: 0x00410600 (1/67072) scn: 0x0000.002a70a0 seq: 0x01 flg: 0x0c tail: 0x70a00801 frmt: 0x02 chkval: 0x1e3e type: 0x08=unknown

Hex dump of block: st=0, typ_found=0

Dump of memory from 0x00007FBF9B56BA00 to 0x00007FBF9B56DA00

7FBF9B56BA00 0000A208 00410600 002A70A0 0C010000 [......A..p*.....]

7FBF9B56BA10 00001E3E 00410600 00000000 00410601 [>.....A.......A.]

7FBF9B56BA20 00000000 0000007F 00001FA4 00000044 [............D...]

7FBF9B56BA30 0000000E 00000266 02C10002 5A5A012C [....f.......,.ZZ]

7FBF9B56BA40 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A [ZZZZZZZZZZZZZZZZ]

Repeat 17 times

7FBF9B56BB60 5A5A5A5A 5A5A5A5A 00025A5A 006402C1 [ZZZZZZZZZZ....d.]

7FBF9B56BB70 58585858 58585858 58585858 58585858 [XXXXXXXXXXXXXXXX]

Repeat 5 times

7FBF9B56BBD0 58585858 595900C8 59595959 59595959 [XXXX..YYYYYYYYYY]

7FBF9B56BBE0 59595959 59595959 59595959 59595959 [YYYYYYYYYYYYYYYY]

Repeat 10 times

7FBF9B56BC90 59595959 59595959 59595959 00005959 [YYYYYYYYYYYYYY..]

7FBF9B56BCA0 00000266 03C10002 5A5A012C 5A5A5A5A [f.......,.ZZZZZZ]

35 © 2013 Julian Dyke juliandyke.com

Temporary Segment Block Format

 Temporary Segment Blocks contain:

Header

 header – 52 bytes body – 8136 bytes footer

– 4 bytes

Body

36 © 2013 Julian Dyke

Footer juliandyke.com

37

Temporary Segment Block Header

For example block 1 / 67072 – RDBA 0x410600

A208 0000 0600 0041 70A0 002A 0000 0C01 1E3E 0000 0600 0041 0000 0000 0601 0041

0000 0000 007F 0000 1FA4 0000 0044 0000 000E 0000

 Fields are:

Bytes Example

0-1

4-7

0000 A208

0041 0600

Description

Block Type (0x08 = Temporary Segment)

DBA of this block

8-11 002A 70A0 Sort number?

12-15 0C01 0000 Unknown – always 0xC01 = 3073

16-19 0000 1E3E Check digit

20-23 0041 0600 DBA of this block

24-27 0000 0000 Sequential number of block within sort (0 upwards)

28-31 0041 0601 DBA of next block

32-35 0000 0000 Sequential number of block within sort (0 upwards)

36-39 0000 007F Blocks remaining in extent (0x7F = 127)

40-43 0000 1FA4 Bytes used (0x1FA4 = 8100)

44-47 0000 0044 Bytes unused (0x44 = 68)

48-51 0000 000E Rows in block? (0xE = 15)

© 2013 Julian Dyke juliandyke.com

38

Temporary Segment Block Header

 Check word

Calculated using bitwise XOR as follows:

XOR of all words in block including check word is zero ub2 getCheckWord (void *buffer)

{ ub2 *sptr = (ub2 *)buffer; int i; ub2 m = 0;

} for (i = 0;i < BLOCK_SIZE / sizeof (ub2);i++)

{ if (i == 8) continue; m = m ^ sptr[i]; skip the check word

} return m;

^ is bitwise XOR operator

 Byte usage

 Bytes used + bytes unused is always 8168

 Rows in block

 Appears to be number of rows in block

– 2

© 2013 Julian Dyke juliandyke.com

39

Temporary Segment Body

 Body contains one or more sort rows:

 Rows can be split across blocks

 Columns are not split in this example, but it must be possible

 Row starts with 4 byte value representing length of row in block

 Rows are padded with zero byte(s) to ensure byte alignment for row length value

 Column values are preceded by a 2-byte length word

 Either maximum column length is 65536 or some longer length value can be created

 NULL values are represented by a 2-byte length word containing zeros

 No indication of number of columns in row or continuation rows

 Rows may be indexed in another structure that remains in memory

© 2013 Julian Dyke juliandyke.com

Temporary Segment Body

 Columns are stored in a strange order. For the statement:

SELECT /*+ FULL (p1) */ c1,c2,c3,c4,c5 FROM p1 ORDER BY c1;

 The column order is always C1, C5, C2, C3, C4

C1 is the primary key and the ORDER BY column

There is no apparent reason why C5 is next

This behaviour is entirely reproducible, but the reason for the re-ordering of the selected columns is currently unknown.

One theory is that the columns in the ORDER BY clause are sorted first, followed by the longest column not included in the ORDER BY clause. This would may the sort output more deterministic where the ORDER BY columns are non-unique.

Further research is required

40 © 2013 Julian Dyke juliandyke.com

Temporary Segment Body

 Example of row storage

0000 0269 # Length = 617

0006 3030 3030 3131 # C1 = 000011

012C 5A5A 5A5A 5A5A 5A5A ... 5A5A 5A5A # C5 = ‘ZZZZZZ...ZZZ’

0001 80 # C2 = 0

0064 5858 5858 5858 5858 ... 5858 5858 # C3 = ‘XXXXXX...XXX’

00C8 5959 5959 5959 5959 ... 5959 5959 # C4 = ‘YYYYYY...YYY’

 Total size is 617 bytes

C1 = 2 + 6 = 8 bytes

C2 = 2 + 1 = 3 bytes

C3 = 2 + 100 = 102 bytes

C4 = 2 + 200 = 202 bytes

C5 = 2 + 300 = 302 bytes

41 © 2013 Julian Dyke juliandyke.com

42

Extents and Batches

 Comparing the contents of the temporary segment to the original table we can conclude that (for this example)

 Each extent contains 128 blocks

 Data is sorted and written in batches of APPROXIMATELY 8 blocks

 In each extent there will therefore be approximately 16 batches, each sorted internally

 Batches do not align with block boundaries

 For this example, the data is probably sorted in local process heap and then written directly to the temporary segment

Data appears to be identical between local heap and the temporary segment.

However, there is different metadata in each structure and therefore the data does not fully align in the temporary segment blocks

© 2013 Julian Dyke juliandyke.com

43

Granules

 MEMORY_TARGET parameter set to 800M

SQL> SELECT bytes FROM v$sgainfo WHERE name = 'Granule Size‘;

BYTES

----------

4194304

 Granule size is 4MB:

 800MB / 4MB = 200 granules

SQL> SELECT grantype,COUNT(*) FROM x$ksmge

GROUP BY grantype

ORDER BY grantype;

GRANTYPE COUNT(*)

---------- ----------

0 70

1 58

2 1

3 1

7 43

14 25

# Unallocated

# Shared pool

# Large pool

# Java pool

# Buffer cache

#

 Total 198 granules reported by X$KSMGE

© 2013 Julian Dyke juliandyke.com

PGA Granules

 Unallocated granules can be determined using:

SQL> SELECT granaddr FROM x$ksgme

WHERE grantype = 0

ORDER BY grantype;

GRANADDR

----------------

0000000060800000

0000000060C00000

0000000061000000

0000000061400000

0000000061800000

0000000061C00000

....

0000000070C00000

0000000071000000

0000000071400000

0000000071800000

0000000071C00000

70 rows selected.

44 © 2013 Julian Dyke juliandyke.com

Memory Mappings

 The shared memory devices mapped to the unallocated granules can be determined using pmap for the operating system PID (3407) pmap 3407

0000000060800000 4096K rwxs/dev/shm/ora_TEST_2392076_0

0000000060c00000 4096K rwxs/dev/shm/ora_TEST_2392076_1

0000000061000000 4096K rwxs/dev/shm/ora_TEST_2392076_2

0000000061400000 4096K rwxs/dev/shm/ora_TEST_2392076_3

0000000061800000 4096K rwxs/dev/shm/ora_TEST_2392076_4

0000000061c00000 4096K rwxs/dev/shm/ora_TEST_2392076_5

0000000062000000 4096K rwxs/dev/shm/ora_TEST_2392076_6

....

0000000070400000 4096K rwxs/dev/shm/ora_TEST_2392076_63

0000000070800000 4096K rwxs/dev/shm/ora_TEST_2392076_64

0000000070c00000 4096K rwxs/dev/shm/ora_TEST_2392076_65

0000000071000000 4096K rwxs/dev/shm/ora_TEST_2392076_66

0000000071400000 4096K rwxs/dev/shm/ora_TEST_2392076_67

0000000071800000 4096K rwxs/dev/shm/ora_TEST_2392076_68

0000000071c00000 4096K rwxs/dev/shm/ora_TEST_2392076_69

45

 The mappings can also be determined using strace during process startup

 However, the output is more difficult to read.

© 2013 Julian Dyke juliandyke.com

46

Shared Memory

 Mappings are created for all granules during process start up

 Even if shared memory is currently unallocated for that device

$ ls –l /dev/shm/ora_TEST*

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

....

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

-rw-r----- 1 oracle oinstall

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_0

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_1

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_2

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_3

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_4

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_5

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_6

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_63

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_64

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_65

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_66

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_67

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_68

0 Jun 8 08:57 /dev/shm/ora_TEST_2392076_69

When a process starts, it creates mappings between addresses and shared memory devices for all granules (in this case 200 granules).

If granules are allocated to the SGA their size will be 4194304 (in this case); if they are not allocated the size will be 0.

© 2013 Julian Dyke juliandyke.com

Local Heap Memory

 Local PGA memory for a specific process can be dumped using ORADEBUG:

$ sqlplus / as sysdba

SQL> ORADEBUG SETOSPID 3407

Oracle pid: 19, Unix process pid: 3407, image: oracle@vm1.juliandyke.com (TNS V1-V3)

SQL> ORADEBUG DUMP HEAPDUMP 1025

Statement processed.

In this case the section of interest is “top uga heap”

HEAP DUMP heap name="top uga heap" desc=0xbaf9700 extent sz=0xffc0 alt=232 het=32767 rec=0 flg=2 opc=2 parent=(nil) owner=(nil) nex=(nil) xsz=0xfff8 heap=(nil) fl2=0x60, nex=(nil), dsxvers=1, dsxflg=0x0 dsx first ext=0xe8ab0008

47 © 2013 Julian Dyke juliandyke.com

48

Local Heap Memory

 Sorted rows appear in the heap dump:

EXTENT 0 addr=0x7f45e8927008

Chunk 7f45e8927018 sz= 65512 freeable "session heap " ds=0x7f45e8aa67c8

Dump of memory from 0x00007F45E8927018 to 0x00007F45E8937000

7F45E8927010 0000FFE9 10B38F00 [........]

......

7F45E8927400 34010007 39383533 02620431 5A5A012C [...435891.b.,.ZZ]

7F45E8927410 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A [ZZZZZZZZZZZZZZZZ]

Repeat 17 times

7F45E8927530 5A5A5A5A 5A5A5A5A 00025A5A 006405C1 [ZZZZZZZZZZ....d.]

7F45E8927540 58585858 58585858 58585858 58585858 [XXXXXXXXXXXXXXXX]

Repeat 5 times

7F45E89275A0 58585858 595900C8 59595959 59595959 [XXXX..YYYYYYYYYY]

7F45E89275B0 59595959 59595959 59595959 59595959 [YYYYYYYYYYYYYYYY]

Repeat 10 times

7F45E8927660 59595959 59595959 59595959 07055959 [YYYYYYYYYYYYYY..]

7F45E8927670 33350100 31393835 2C026204 5A5A5A01 [..535891.b.,.ZZZ]

7F45E8927680 5A5A5A5A 5A5A5A5A 5A5A5A5A 5A5A5A5A [ZZZZZZZZZZZZZZZZ]

Repeat 17 times

7F45E89277A0 5A5A5A5A 5A5A5A5A C100025A 58006406 [ZZZZZZZZZ....d.X]

7F45E89277B0 58585858 58585858 58585858 58585858 [XXXXXXXXXXXXXXXX]

Repeat 5 times

7F45E8927810 C8585858 59595900 59595959 59595959 [XXX..YYYYYYYYYYY]

7F45E8927820 59595959 59595959 59595959 59595959 [YYYYYYYYYYYYYYYY]

We know these rows have been sorted as the columns appear in the “wrong” order

Repeat 10 times

7F45E89278D0 59595959 59595959 59595959 00070559 [YYYYYYYYYYYYY...]

© 2013 Julian Dyke juliandyke.com

Local Heap Memory

 Local heap memory shows as /dev/zero in the pmap output

 For example the heap extent in the previous slide was

EXTENT 0 addr=0x7f45e8927008

Chunk 7f45e8927018 sz= 65512 freeable "session heap " ds=0x7f45e8aa67c8

Dump of memory from 0x00007F45E8927018 to 0x00007F45E8937000

 Output of pmap for this area of memory is: pmap 3407

....

00007f45e88f7000 64K rwx-/dev/zero # EXTENT 3

00007f45e8907000 64K rwx-/dev/zero # EXTENT 2

00007f45e8917000 64K rwx-/dev/zero # EXTENT 1

00007f45e8927000 64K rwx-/dev/zero # EXTENT 0

00007f45e8937000 64K rwx-/dev/zero

00007f45e8947000 64K rwx-/dev/zero

00007f45e8957000 64K rwx-/dev/zero

00007f45e8967000 64K rwx-/dev/zero

00007f45e8977000 64K rwx-/dev/zero

EXTENT 0

49 © 2013 Julian Dyke juliandyke.com

50

Conclusions

 In this example:

 Sorted rows are inserted directly into temporary segment when allocated

PGA space is exhausted in local memory

 All selected columns are included in sorted rows

 ORDER BY columns

 Other columns

 Some dynamic performance views are unreliable

 In general:

 Minimize number of columns in SELECT list

 For some queries it may be more efficient to order the rows first, then to revisit the table for the remaining data

 Avoids unnecessary writes to temporary segments

© 2013 Julian Dyke juliandyke.com

Conclusions

 General conclusions:

51 © 2013 Julian Dyke juliandyke.com

Acknowledgements

 The original presentation as delivered in Scotland contained some inaccuracies that have been fixed in this version

 This version also contains the results of some additional research

 Thank you to the following for their invaluable advice, suggestions and additional input during the event:

 Jonathan Lewis

 Tony Hasler

 Bjoern Rost

52 © 2013 Julian Dyke juliandyke.com

Download