Indexing Techniques
CS 543 – Data Warehousing
Indexing
Goal: Increase the efficiency of data access by reducing the
number of I/Os required to find the desired record(s).
Library analogy: Indexed access is like using the card catalog
in a library rather than searching every shelf until the desired
book is found (i.e., it avoids a full table scan).
CS 543 - Data Warehousing (Sp 2007-2008) - Asim Karim @ LUMS
DW Indexing Issues
- Indexes and loading
- Indexing for large tables
- Index-only reads
- Selecting columns for indexing
- A staged approach
B-Tree Index
Bitmapped Index
Indexing the Fact Table
- If the DBMS does not create an index for the primary key, create one using B-tree indexing.
- In the concatenated primary key, place the primary keys of frequently accessed dimension tables first (in the high-order positions).
- Create indexes for combinations of dimension table primary keys based on query performance.
- Do not overlook indexing metric columns.
- Bitmapped indexing does not apply to fact tables; there are hardly any low-selectivity columns.
Indexing the Dimension Tables
- Create a unique B-tree index on the single-column primary key.
- Index any column that is frequently used to constrain queries.
- Create an index for each combination of columns that are frequently used together in queries.
- Index every column likely to be used in a join operation.
Hash Indexing
In contrast to B-tree indexing, hash-based indexes do not (typically) keep index values in sorted order.
- An index entry is located by hashing the index value.
- Index entries are kept in hash-organized tables rather than B-tree structures.
- An index entry contains the ROWID values for each row corresponding to the index value.
- ROWIDs are kept in sorted order to facilitate maximum I/O performance.
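The structure described above can be sketched in a few lines. This is only a toy illustration, not how any particular DBMS implements it: Python's built-in dict stands in for the hash-organized table, ROWIDs are assumed to be plain integers, and the class name HashIndex is hypothetical.

```python
from bisect import insort
from collections import defaultdict


class HashIndex:
    """Toy hash index: index value -> ROWIDs kept in sorted order."""

    def __init__(self):
        # dict is a hash table, so an entry is located by hashing the value.
        self._entries = defaultdict(list)

    def add(self, value, rowid):
        # insort keeps each ROWID list sorted as entries arrive.
        insort(self._entries[value], rowid)

    def lookup(self, value):
        # Return a copy so callers cannot disturb the sorted order.
        return list(self._entries.get(value, []))


idx = HashIndex()
for rowid, state in [(7, "CA"), (2, "MA"), (5, "CA"), (1, "CA")]:
    idx.add(state, rowid)

print(idx.lookup("CA"))  # [1, 5, 7]
print(idx.lookup("NY"))  # []
```

Note that, unlike a B-tree, this structure offers no efficient way to scan a range of index values, because the values themselves are not stored in sorted order.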
Primary Indexing
- The primary index for a table in Teradata is a specification of its partitioning column(s).
- A primary index may be defined as unique (UPI) or non-unique (NUPI).
  - Uniqueness is enforced automatically when a UPI is specified.
- The primary index provides an implicit access path to any row just by knowing its value.
- Only one primary index per table.
Primary Indexing
Primary index selection criteria:
- Common join and retrieval key.
- Distributes rows evenly across database partitions.
- Fewer than ten thousand rows per PI value when non-unique.
Primary Indexing
Trick question: What should be the primary index of the
transaction table for a large financial services firm?
create table tx
(tx_id decimal (15,0) NOT NULL
,account_id decimal (10,0) NOT NULL
,tx_amt decimal (15,2) NOT NULL
,tx_dt date NOT NULL
,tx_cd char (2) NOT NULL
....
) primary index (???);
Answer: It depends.
Primary Indexing
- Almost all joins and retrievals will come in through the account_id foreign key.
  - Want account_id as NUPI.
- If data is "lumpy" when distributed on account_id, or if accounts have very large numbers of transactions (e.g., an institutional account could easily have 10,000+ transactions):
  - Want tx_id as UPI for good data distribution.
Primary Indexing
- Joins and access via the primary index are very efficient because Teradata's sophisticated row hashing algorithms allow going directly to the data block containing the desired row.
- Single I/O operation for accessing a data row via UPI.
- Single I/O operation for accessing a data row via NUPI whenever all rows with the same PI value fit into a single block.
- Single VAMP operation for indexed retrieval.
- No spool space required.
Primary Indexing
Primary index is free!
- No storage cost.
- No index build required.
This is a direct result of the underlying hash-based file system implementation.
OLTP databases use a page-based file system and therefore do not deliver this performance advantage.
Secondary Indexing
- Secondary index structures are implemented using the same underlying structure as base tables (often referred to as subtables).
- A secondary index may be defined as unique (USI) or non-unique (NUSI).
  - Uniqueness is enforced automatically when a USI is specified.
- Up to thirty-two secondary indexes per table in Teradata.
- Unlike a primary index, secondary indexes are not "free" in terms of storage.
Secondary Indexing: NUSI
- A non-unique secondary index (NUSI) is partitioned so that each index entry is co-located on the same VAMP (Virtual Access Module Processor) as its corresponding row in the base table.
- Each row access via a NUSI is a single-VAMP operation (for that row) because the NUSI entry and data row are co-located.
- NUSI access is always performed in parallel across all VAMPs whenever it is appropriate to do so.
Secondary Indexing: NUSI
Compressed ROWID index structure:
- Hash on index value to get block location (ROWID for subtable).
- Store the index value just once, followed by all ROWIDs in the base table corresponding to the index value.
- Sorted by ROWID to facilitate maximum efficiency when accessing the base table, performing updates and deletes, etc.
- Additional blocks are allocated when the NUSI is non-selective and the compressed ROWID structure for the index value exceeds 64K.
When to Build a NUSI?
Building a NUSI helps when the selectivity of the indexed column is very high.
Cost-based optimizer will determine when to access via NUSI:
- Number of rows selected by NUSI must be less than the number of blocks in the table to justify access via NUSI (assumes even distribution of rows with NUSI value within table).
- Must also consider the cost of reading the NUSI subtable and building the ROWID spool file.
Note that the extreme efficiency of table scanning in Teradata reduces the need for secondary indexing as compared to other databases.
Secondary Indexing: USI
- A unique secondary index (USI) is partitioned by the unique column upon which the index is built.
- Row access via a USI is a two-VAMP operation.
  - First I/O is initiated on the VAMP with the USI entry.
  - Second I/O is initiated on the VAMP with the data row entry.
When to Build a USI?
- To allow data access without all-VAMP operations.
  - Increased efficiency for (very) high selectivity retrievals.
- To obtain co-location of the index with frequently joined tables.
When to Build a USI?
Example:
create table order_header
(order_id decimal(12, 0) NOT NULL
,customer_id decimal(9, 0) NOT NULL
,order_dt date NOT NULL
...
)
primary index( customer_id );
create unique index oh_order_idx (order_id) on order_header;
create table order_detail
(order_id decimal(12, 0) NOT NULL
,product_id integer NOT NULL
,extended_price_amt decimal(15,2) NOT NULL
,item_cnt integer NOT NULL
...
)
primary index( order_id );
When to Build a USI?
Example: How many customers ordered green socks in
the last month? Assume that green socks is quite
selective.
select count(distinct order_header.customer_id)
from order_header
,order_detail
,product
where order_header.order_id = order_detail.order_id
and order_header.order_dt > add_months(date, -1)
and order_detail.product_id = product.product_id
and product.product_subcategory_cd = 'SOCKS'
and product.color_cd = 'GREEN'
;
The order_id USI on the order_header table obviates the need
for all-VAMP duplication of the spool result from the
order_detail-to-product join when joining to the order_header
table.
A Simple Query
Example: What is the average age (in years) of customers who
live in California or Massachusetts, completed a graduate
degree, are consultants, and have a hobby of volleyball or chess?
select avg( (days(date) - days(customer.birth_dt)) / 365.25 )
from customer
where customer.state_cd in ('CA', 'MA')
and customer.education_cd = 'G'
and customer.occupation_cd = 'CONSULTANT'
and customer.hobby_cd in ('VOLLEYBALL', 'CHESS')
;
Sample Table Structure
Assume:
- 20M customers.
- 128-byte rows.
- 64K data block size.
Results in approximately 512 rows per block and a total of 39,063 blocks in the customer table.
Note: We are ignoring block overhead for purposes of simplicity in calculations.
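The arithmetic behind these figures can be checked with a short sketch. It assumes that 64K means 65,536 bytes and, like the slide, ignores block overhead; the variable names are illustrative only.

```python
import math

ROW_COUNT = 20_000_000     # 20M customers
ROW_WIDTH = 128            # bytes per row
BLOCK_SIZE = 64 * 1024     # 64K data block, overhead ignored

# Whole rows that fit in one block.
rows_per_block = BLOCK_SIZE // ROW_WIDTH
# The last, partially filled block still costs one I/O, hence the ceiling.
total_blocks = math.ceil(ROW_COUNT / rows_per_block)

print(rows_per_block)  # 512
print(total_blocks)    # 39063
```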
Data Demographics
Assume:
- 8% of customers live in California.
- 4% of customers live in Massachusetts.
- 4% of customers have completed a graduate degree.
- 6% of customers are consultants.
- 2% of customers have a primary hobby of chess.
- 3% of customers have a primary hobby of volleyball.
Full Table Scan Performance
- Must read every block in the table.
- Apply where clause predicates to determine which customers to include in average.
- Adjust numerator and denominator of average as appropriate.

Total I/O count = 39,063
Note: Data demographics have no (minimal) impact on
query performance when using a full table scan operation.
Single Index Structure
B-tree or hash organization of column values:
- Index entries store row IDs (RIDs), lists of RIDs, or pointers to lists of RIDs.
- Originally designed for columns with many unique values (OLTP legacy).
- Assuming an eight-byte RID, we get roughly 8K RIDs per 64K block (65,536 / 8 = 8,192; the RID block counts that follow are computed with 8,096).
Single Index Access
1. Optimizer chooses index with best selectivity based on values specified in query.
2. Access next (first) index entry corresponding to specified column value(s).
3. Use RID from index entry to locate row with specified column value.
4. Validate remaining predicates to qualify row.
5. Adjust average as appropriate.
6. Go to 2 until no more matching index values.
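The access loop above can be sketched as follows. This is a minimal illustration under stated assumptions: the table is a small in-memory dict keyed by RID, ages are precomputed instead of derived from birth dates, the optimizer's choice of education_cd is hard-coded, and all names (customers, education_index, average_age_via_index) are hypothetical.

```python
# Hypothetical customer rows keyed by RID; ages are precomputed for brevity.
customers = {
    0: {"state_cd": "CA", "education_cd": "G", "occupation_cd": "CONSULTANT",
        "hobby_cd": "CHESS", "age": 40.0},
    1: {"state_cd": "MA", "education_cd": "G", "occupation_cd": "CONSULTANT",
        "hobby_cd": "VOLLEYBALL", "age": 50.0},
    2: {"state_cd": "CA", "education_cd": "G", "occupation_cd": "ENGINEER",
        "hobby_cd": "CHESS", "age": 30.0},
    3: {"state_cd": "NY", "education_cd": "G", "occupation_cd": "CONSULTANT",
        "hobby_cd": "CHESS", "age": 60.0},
    4: {"state_cd": "CA", "education_cd": "B", "occupation_cd": "CONSULTANT",
        "hobby_cd": "CHESS", "age": 20.0},
}

# Index on education_cd: column value -> list of RIDs.
education_index = {"G": [0, 1, 2, 3], "B": [4]}

# Predicates that the chosen index does not cover.
remaining_predicates = {
    "state_cd": {"CA", "MA"},
    "occupation_cd": {"CONSULTANT"},
    "hobby_cd": {"VOLLEYBALL", "CHESS"},
}


def average_age_via_index(table, index, value, predicates):
    total, count = 0.0, 0
    for rid in index.get(value, []):      # steps 2 and 6: walk matching entries
        row = table[rid]                  # step 3: fetch the row by RID
        if all(row[col] in allowed for col, allowed in predicates.items()):
            total += row["age"]           # steps 4 and 5: qualify, then accumulate
            count += 1
    return total / count if count else None


avg_age = average_age_via_index(customers, education_index, "G", remaining_predicates)
print(avg_age)  # 45.0
```

Every RID in the index entry triggers a base table fetch, whether or not the row survives the remaining predicates; that is the cost the following slides quantify.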
Single Index Access
What are my indexing choices?
- state_cd (8% + 4% = 12% selectivity)
- education_cd (4% selectivity)
- occupation_cd (6% selectivity)
- hobby_cd (2% + 3% = 5% selectivity)
Choose education_cd because it has the best selectivity.
Single Index Performance
Access via index on education_cd:
- 800,000 RIDs (4% of 20M)
- 99 blocks of RIDs to read
But... 4% selectivity with 512 rows per block in the base table (about 20 selected rows per block on average) means that the 800,000 selected RIDs will cause access to every block in the base table!
Total I/O count = 39,063 + 99 = 39,162
Worse than a full table scan!
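These counts can be reproduced with a short calculation. The sketch assumes the 8,096 RIDs-per-block figure quoted on the Single Index Structure slide and assumes selected rows are spread independently and evenly across blocks; the variable names are illustrative.

```python
import math

ROW_COUNT = 20_000_000
TOTAL_BLOCKS = 39_063
ROWS_PER_BLOCK = 512
RIDS_PER_BLOCK = 8_096     # figure quoted on the Single Index Structure slide
selectivity = 0.04         # education_cd = 'G'

selected_rids = round(selectivity * ROW_COUNT)
rid_blocks = math.ceil(selected_rids / RIDS_PER_BLOCK)

# Chance that a given block contains none of the selected rows.
p_block_untouched = (1 - selectivity) ** ROWS_PER_BLOCK
base_blocks = math.ceil(TOTAL_BLOCKS * (1 - p_block_untouched))

total_io = rid_blocks + base_blocks
print(rid_blocks, base_blocks, total_io)  # 99 39063 39162
```

The probability of a block escaping untouched is on the order of one in a billion here, which is why the indexed path ends up reading the whole table plus the index.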
Single Index Performance
Accessing via an index helps only when the selectivity of the indexed column is very high.
Rule of thumb:
- Number of rows selected by an index should not be more than the number of blocks in the table to justify indexed access (assumes rows with selected value(s) have an "even" distribution within table).
- Must also consider cost for reading the index and sorting RIDs (if not already sorted) prior to accessing base table rows (to avoid hitting same block multiple times).
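As a rough illustration of the first rule, the check below applies it to the sample customer table (20M rows in 39,063 blocks). The function name is hypothetical, and the second rule (index read and RID sort cost) is deliberately left out.

```python
ROW_COUNT = 20_000_000
TOTAL_BLOCKS = 39_063


def index_access_justified(rows_selected, total_blocks):
    """Rule of thumb: selected rows should not exceed the table's block count."""
    return rows_selected <= total_blocks


# 4% selectivity on education_cd selects 800,000 rows: not justified.
print(index_access_justified(800_000, TOTAL_BLOCKS))  # False
# A single-row lookup easily passes.
print(index_access_justified(1, TOTAL_BLOCKS))        # True

# Largest selectivity (in percent) that still passes the rule for this table.
max_selectivity_pct = 100 * TOTAL_BLOCKS / ROW_COUNT
print(f"{max_selectivity_pct:.3f}%")  # 0.195%
```

For this table the rule is therefore quite conservative: it only admits indexes that select under about 0.2% of the rows.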
Single Index Performance
What is the break-even index selectivity (S) versus full table scan performance?
Selectivity     = S
Row Count       = 20M rows
Block Size      = 64K
Row Width       = 128 bytes
RID Width       = 8 bytes
RIDs per Block  = floor(Block Size / RID Width) = 8K
Rows per Block  = floor(Block Size / Row Width) = 512
Total Blocks    = ceiling((Row Count) / (Rows per Block)) = 39,063
Single Index Performance
RID I/Os                = (S * Row Count) / (RIDs per Block)
Indexed Base Table I/Os = (Total Blocks) * (1 - ((1 - S) ** Rows per Block))
Full Table Scan I/Os    = Total Blocks
Break-even formula:
RID I/Os + Indexed Base Table I/Os = Full Table Scan I/Os
Break-even S is less than 2%.
Single Index Performance
Larger row widths and/or smaller block sizes will generally
make indexes more desirable because there is a higher
probability that a given block will not contain a selected row
when fewer rows fit in a block.
Example: With a row width of 256 bytes (instead of 128 bytes)
the break-even selectivity for indexing becomes
approximately 2.7%.
Of course, larger row widths and/or smaller block sizes mean
that we are actually getting fewer rows per I/O, and thus the
amount of work we will do to satisfy a query will generally be
higher.
Single Index Performance
Traditional index structures work well in OLTP because
selectivity is extremely high (1 customer out of 20M or a
few accounts out of 50M).
Selectivity is 0.000005% and thus is significantly better
than the one or two percent required for break even.
Bottom line: Traditional indexing is good for OLTP style
queries, but is not so great for traditional DSS queries.
Combining Multiple Indexes
Observation: Indexed access on a single column is rarely
useful in a traditional data warehouse environment.
Idea: Combine multiple indexes to get the selectivity
required for efficient indexed access.
Combining Multiple Indexes
While none of the index choices (state, education, occupation, hobby) are
selective enough on their own to be useful...when combined we have
sufficient selectivity to make indexed access efficient.
Example: Start with 20M customers...
Column Index                              Incremental Selectivity   Selected Customers
Selectivity by state                      12%                       2,400,000
Followed with selectivity by education    4%                        96,000
Followed with selectivity by occupation   6%                        5,760
Followed with selectivity by hobby        5%                        288
Combined selectivity is 0.00144%
Note: These selectivity figures assume that column values are independent
(...and they usually are not).
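The cascade of selected customers can be recomputed directly. The sketch below relies on the same independence assumption noted above and uses integer percentages to keep the arithmetic exact; the names are illustrative.

```python
ROW_COUNT = 20_000_000
selectivity_pct = [("state", 12), ("education", 4), ("occupation", 6), ("hobby", 5)]

remaining = ROW_COUNT
selected = []
for column, pct in selectivity_pct:
    # Assumes column values are independent, so percentages simply multiply.
    remaining = remaining * pct // 100
    selected.append(remaining)
    print(f"{column:<10} {pct:>3}%  {remaining:>10,}")

combined_pct = 100 * remaining / ROW_COUNT
print(f"combined selectivity: {combined_pct:.5f}%")  # 0.00144%
```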
Combining Multiple Indexes
Must consider I/O cost of accessing four different indexes plus
cost of accessing selected blocks from base table:
State       = 297
Education   = 99
Occupation  = 149
Hobby       = 124
Base table  = 288
              =====
Total       = 957
More efficient than a full table scan!
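The per-index figures follow from the number of RIDs each index must deliver. The sketch below assumes the 8,096 RIDs-per-block figure used earlier in the deck and, for the base table, the worst case in which each of the 288 qualifying rows sits in a different block; the names are illustrative.

```python
import math

ROW_COUNT = 20_000_000
TOTAL_BLOCKS = 39_063
RIDS_PER_BLOCK = 8_096     # figure used for RID blocks earlier in the deck
selectivity_pct = {"state": 12, "education": 4, "occupation": 6, "hobby": 5}

# Blocks of RIDs read from each index.
index_ios = {
    column: math.ceil(ROW_COUNT * pct // 100 / RIDS_PER_BLOCK)
    for column, pct in selectivity_pct.items()
}

# Worst case: every qualifying row lives in its own base table block.
base_table_ios = 288

total_ios = sum(index_ios.values()) + base_table_ios
print(index_ios)   # {'state': 297, 'education': 99, 'occupation': 149, 'hobby': 124}
print(total_ios)   # 957
print(f"{total_ios / TOTAL_BLOCKS:.1%} of a full table scan")
```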
Combining Multiple Indexes
Notice that there is a significant performance benefit if we
can satisfy a query directly out of indexes without
accessing base table.
Example: How many customers live in California or
Massachusetts, completed a graduate degree, are
consultants, and have a hobby of volleyball or chess?
select count(*)
from customer
where customer.state_cd in ('CA','MA')
and customer.education_cd = 'G'
and customer.occupation_cd = 'CONSULTANT'
and customer.hobby_cd in ('VOLLEYBALL','CHESS')
;
Combining Multiple Indexes
Now we only need to consider the I/O cost of accessing the four
indexes (sans the cost of accessing base table):
State       = 297
Education   = 99
Occupation  = 149
Hobby       = 124
              =====
Total       = 669
Much more efficient than a full table scan!
Combining Multiple Indexes
Traditional combining of multiple indexes:
- Requires RID list ANDing (and sometimes ORing) to combine the multiple indexes.
- May incur overhead of RID list sorting in order to facilitate ANDing operation (depends on RDBMS indexing implementation).
This technique is useful when no one index is selective enough to
produce an efficient access path, but multiple indexes taken
together can provide the needed selectivity.
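RID list ANDing and ORing can be sketched on small sorted lists. This is only one possible approach (a merge-style intersection, which is why sorted RID lists matter); the RID values and function names are made up for illustration.

```python
def and_rid_lists(a, b):
    """Intersect two ascending RID lists in a single merge pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def or_rid_lists(a, b):
    """Union of two RID lists, returned in ascending order."""
    return sorted(set(a) | set(b))


# Hypothetical RID lists delivered by each index.
state_rids = or_rid_lists([1, 4, 6, 9], [2, 7])   # CA or MA
education_rids = [2, 3, 4, 9, 11]                 # graduate degree
occupation_rids = [0, 2, 9, 12]                   # consultant
hobby_rids = or_rid_lists([9], [10])              # chess or volleyball

qualifying = state_rids
for rid_list in (education_rids, occupation_rids, hobby_rids):
    qualifying = and_rid_lists(qualifying, rid_list)

print(state_rids)   # [1, 2, 4, 6, 7, 9]
print(qualifying)   # [9]
```

Only the RIDs that survive every AND step need to be fetched from the base table, and a count(*) query can stop at len(qualifying) without touching the base table at all.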
Bottom Line

Optimizer sophistication is critical in effectively
exploiting indexes.

Selectivity of indexes is critical in determining their
usefulness.

Indexed access paths are not nearly as useful in data
warehousing as compared to OLTP workloads.