Physical Database and Index Structures Databases He Tan

He Tan
hetan@ida.liu.se
IISLAB
IDA
Databases
Physical Database and Index Structures
[Figure: the real world is described by a model and stored in databases. The DBMS receives queries and returns answers; it handles the processing of queries and updates and the access to stored data in the physical database.]
september 2007
What is this about?
• Physical file storage on hard drive
• How to speed up access to data by using indexes

What you will learn
• How data is stored on disk
• Why is disk storage expensive?
• How to choose a suitable index
• Calculate the size of an index
Important because it affects query efficiency.

Storage hierarchy
• CPU
• Primary storage (fast, small, expensive, volatile, accessible by CPU)
  - Cache memory
  - Main memory
• Secondary storage (slow, big, cheap, permanent, inaccessible by CPU)
  - Disk

Disk memory
• byte, capacity
• track, cylinder
• block
Disk
[Figure: a disk with spindle, disk rotation, actuator arm with read/write head, a track, a block, and a cylinder of tracks.]
• A track is divided into equal-sized blocks during disk formatting (or initialization).
• A block is the unit of transfer of data between disk and main memory, e.g.
  - Read = copy block from disk to buffer in main memory.
  - Write = the opposite way.
  - R/w time = seek time + rotational delay + block transfer time.
    Seek time is the actuator movement; seek time and rotational delay together locate the block (9-60 msec.); the block transfer itself takes 1-2 msec.

Block transfer
• A block is the unit of transfer of data between disk memory and main memory.
• It is approximately 1000 to 100000 times more expensive to perform one disk operation than one memory operation:
  - Processor instruction: $10^{-9}$ s
  - Memory fetch: $10^{-8}$ s
  - Disk fetch: $10^{-3}$ s

Physical database
• How is data stored on disk?
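As a rough illustration of the disk timing formula (R/w time = seek time + rotational delay + block transfer time) and of the gap between a memory fetch and a disk fetch, the following minimal Python sketch may help. The function name `rw_time_ms` and the three timing values are hypothetical and chosen only for illustration; the powers of ten are the orders of magnitude listed on the Block transfer slide.

```python
def rw_time_ms(seek_ms: float, rotational_delay_ms: float, transfer_ms: float) -> float:
    """R/w time = seek time + rotational delay + block transfer time (all in msec)."""
    return seek_ms + rotational_delay_ms + transfer_ms


# Hypothetical timings, not measurements of any real disk.
example_rw = rw_time_ms(seek_ms=8.0, rotational_delay_ms=4.0, transfer_ms=1.5)

# Orders of magnitude from the slide, kept as exponents of ten (seconds)
# so that the ratio is computed with exact integer arithmetic.
MEMORY_FETCH_EXP = -8   # memory fetch: 10^-8 s
DISK_FETCH_EXP = -3     # disk fetch:   10^-3 s
disk_vs_memory = 10 ** (DISK_FETCH_EXP - MEMORY_FETCH_EXP)

print(f"example r/w time: {example_rw} msec")
print(f"one disk fetch costs about {disk_vs_memory} memory fetches")
```

With these figures a disk fetch is about 100000 times slower than a memory fetch, which is the upper end of the range quoted on the slide.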
Files and records
• Data is stored in files.
• A file is a sequence of records (table).
• A record is a collection of field values (tuple).
• Each field has an elementary data type.
• A collection of field names and their corresponding data types constitutes a record type.

Files and records
• The records in a file are often of the same type.
• A file contains records of the same or different length.
• A file contains records of different length when:
  - Records have the same type, but fields have variable size.
  - Records have the same type, but fields have multiple values.
  - Records have the same type, but fields are optional.
  - Records have different types.
File and record storage
• Records are allocated to blocks.
• Suppose that
  - r is the number of records in the file
  - B is the size in bytes of the block
  - R is the size in bytes of the record (fixed-length)
• Blocking factor: $bfr = \lfloor B / R \rfloor$
• Blocks needed: $b = \lceil r / bfr \rceil$

File and record storage
• File with 1 000 000 records (r)
• Block size (B) of 4096 bytes
• Record size (R) is 100 bytes (fixed-length, unspanned)
• Blocking factor (bfr) is $\lfloor B / R \rfloor = \lfloor 4096 / 100 \rfloor = 40$
• Blocks needed (b) is $\lceil r / bfr \rceil = \lceil 1000000 / 40 \rceil = 25000$
• What is the space wasted per block?

Spanned/Unspanned records
• Unspanned: wasted space per block = $B - bfr \cdot R$.
• Solution: Spanned records.
[Figure: with unspanned records, blocks i and i+1 each hold whole records followed by wasted space. With spanned records, a record that does not fit (record 3) starts in block i and continues in block i+1 via a pointer, so no space is wasted.]

From file blocks to disk blocks
• Contiguous allocation
  - cheap sequential access but expensive record addition.
• Linked allocation
  - expensive sequential access but cheap record addition.
• Linked clusters allocation.
• Indexed allocation.

File organization
How are records placed in the file?
• Heap files.
• Sorted files.
• Hash files.
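The blocking-factor arithmetic from the File and record storage slides can be checked with a few lines of Python. This is a minimal sketch for fixed-length, unspanned records; the function names are illustrative and not part of the lecture material.

```python
def blocking_factor(block_size: int, record_size: int) -> int:
    """bfr = floor(B / R): whole records that fit in one block (unspanned)."""
    return block_size // record_size


def blocks_needed(num_records: int, bfr: int) -> int:
    """b = ceil(r / bfr), computed with integer arithmetic."""
    return (num_records + bfr - 1) // bfr


def wasted_per_block(block_size: int, record_size: int) -> int:
    """Wasted space per block = B - bfr * R."""
    return block_size - blocking_factor(block_size, record_size) * record_size


B, R, r = 4096, 100, 1_000_000      # values from the slide
bfr = blocking_factor(B, R)         # 40 records per block
b = blocks_needed(r, bfr)           # 25000 blocks
wasted = wasted_per_block(B, R)     # 96 bytes unused in every block

print(bfr, b, wasted)
```

For the slide's numbers, 96 of the 4096 bytes in each block stay unused, which is the space that spanned records would reclaim.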
Heap files
• Records are added to the end of the file. Hence,
  - Cheap record addition.
  - Expensive record retrieval, removal and update, since they imply linear search:
    • Average case: $\lceil b / 2 \rceil$ block accesses.
    • Worst case: b block accesses (if it doesn't exist or several exist).
  - Moreover, record removal implies waste of space. So, periodic reorganization.
• What is the disk block and record of the i-th file record? Assume a heap file, fixed-length records, unspanned blocks, and contiguous allocation.
  - Block $\lfloor i / bfr \rfloor$, $(i \bmod bfr)$-th record in the block.

Sorted files
• Records ordered according to some field. Hence,
  - Cheap ordered record retrieval:
    • All the records: access blocks sequentially.
    • Next record: probably in the same block.
    • Random record: binary search, then worst case implies $\log_2 b$ block accesses.
  - Expensive record addition and deletion.
    • Addition: a temporary unordered file (overflow file) + periodic reorganization.
    • Deletion: deletion markers + periodic reorganization.

Hash files
• The hash function applies to the hash field and returns the position of the record in the array of records. E.g.
  position = field mod r
• Is record updating cheap or expensive? It depends on:
  - The search condition to locate the record.
  - The field to be modified.

Indexing
• An extra structure that is used to make the search of records faster.
• Is based on one or more index files.
• Stores the physical addresses of records.

Indexing
• How to choose a suitable index
• Calculate the size of an index
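The access costs of the three file organizations can be compared with a small Python sketch. It assumes that records and blocks are numbered from 0 (which is what makes the floor and mod expressions consistent) and rounds the binary-search cost up to a whole number of block accesses; both are assumptions of this example, as are the function names.

```python
import math


def locate_record(i: int, bfr: int) -> tuple[int, int]:
    """Block number and position within the block of the i-th record (0-based)."""
    return i // bfr, i % bfr


def heap_average_accesses(b: int) -> int:
    """Linear search in a heap file: about half of the b blocks on average."""
    return math.ceil(b / 2)


def sorted_worst_case_accesses(b: int) -> int:
    """Binary search over b blocks of a sorted file, rounded up."""
    return math.ceil(math.log2(b))


def hash_position(field_value: int, r: int) -> int:
    """The slide's example hash function: position = field mod r."""
    return field_value % r


b, bfr = 25_000, 40                      # values from the storage example
print(locate_record(1234, bfr))          # block 30, record 34 within that block
print(heap_average_accesses(b))          # 12500 block accesses
print(sorted_worst_case_accesses(b))     # 15 block accesses
print(hash_position(1_234_567, 1_000_000))
```

The gap between 12500 and 15 block accesses is the motivation for keeping files sorted, and later for indexes.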
Primary Index
• An ordered file with two fields, an indexing field and a block pointer.
• The data file is physically ordered on a key field on the disk.
• All data records are stored sequentially on a disk.

Primary Index
• There is one index entry for each data block.
• Each index entry contains the ordering key value of the first record in the data block plus a pointer to the block.

Blocks needed for primary index?
• File with 1 000 000 records (r)
• Block size (B) of 4096 bytes
• Record size (R) 100 bytes (fixed, unspanned)
• Index entry size is 10 bytes (fixed, unspanned)
• Blocks for the file: 25000
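One way to answer the question on the slide is sketched below in Python. It uses the rule that a primary index has one entry per data block. The lookup cost at the end (binary search on the index plus one data block access) is an extension added here for comparison and is not stated on the slide; variable names are illustrative.

```python
import math


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return (a + b - 1) // b


B = 4096             # block size in bytes
R = 100              # data record size in bytes
r = 1_000_000        # number of data records
ENTRY = 10           # index entry size in bytes

data_bfr = B // R                                  # 40 records per data block
data_blocks = ceil_div(r, data_bfr)                # 25000 data blocks
index_entries = data_blocks                        # one entry per data block
index_bfr = B // ENTRY                             # 409 entries per index block
index_blocks = ceil_div(index_entries, index_bfr)  # 62 index blocks

# Assumed search strategy: binary search on the index, then one data block.
lookup_accesses = math.ceil(math.log2(index_blocks)) + 1

print(index_blocks, lookup_accesses)
```

Under these assumptions the index occupies 62 blocks, and a lookup costs about 7 block accesses instead of about 15 for a binary search on the data file itself.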
Clustering Index
• An ordered file with two fields, an indexing field and a block pointer.
• The data file is physically ordered on a nonkey field on the disk.
• Store all data records sequentially on a disk.
[Figure: a clustering index with entries 1 to 5 pointing into data blocks 1 to 3, which hold the records 1 (Andersson, Anders), 1 (Andersson, Anders), 2 (Andersson, Anders), 3 (Andersson, Nils), 4 (Andersson, Sven), 4 (Bengtsson, Anders), 4 (Davidsson, Nils), 5 (Larsson, Anders).]

Clustering Index: block anchoring
[Figure: the same index entries 1 to 5 and the same records, now spread over data blocks 1 to 6 so that the records of each index value start in a block of their own.]

Blocks needed for clustering index
• File with 1 000 000 records
• Block size (B) of 4096 bytes
• Record size 100 bytes (fixed, unspanned)
• Index entry size is 10 bytes (unspanned)
• Assume: 100 records for each value on average (block anchoring)
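A possible calculation for the clustering index example is sketched below in Python. It uses the rule that a clustering index has one entry per distinct value of the indexing field, and it simplifies the "100 records per value on average" assumption to exactly 100 records for every value. That simplification and the variable names are assumptions of this sketch.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return (a + b - 1) // b


B = 4096                 # block size in bytes
R = 100                  # data record size in bytes
r = 1_000_000            # number of data records
ENTRY = 10               # index entry size in bytes
RECORDS_PER_VALUE = 100  # simplification: every value occurs exactly 100 times

distinct_values = r // RECORDS_PER_VALUE              # 10000 index entries
index_bfr = B // ENTRY                                # 409 entries per index block
index_blocks = ceil_div(distinct_values, index_bfr)   # 25 index blocks

# With block anchoring every value starts in a fresh data block.
data_bfr = B // R                                             # 40 records per block
blocks_per_value = ceil_div(RECORDS_PER_VALUE, data_bfr)      # 3 blocks per value
anchored_data_blocks = distinct_values * blocks_per_value     # 30000 data blocks

print(index_blocks, anchored_data_blocks)
```

Block anchoring keeps the index small (25 blocks) but costs space in the data file: 30000 blocks instead of the 25000 needed when records are packed densely.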
Terminology
• Dense index: All records in the file occur in the index.
• Sparse (nondense) index: Only some of the records occur in the index.
• Are a primary index and a clustering index dense or sparse?

Problems with Primary & Clustering Indexes
• As with any ordered file, addition and deletion become very expensive.
  - We might have to move records in the data file.
  - When the data file is changed, the anchor items might change.
  - Use some kind of unordered overflow structure.
Secondary Index
• An ordered file with two fields, an indexing field and a block pointer.
• The data do not have to be physically ordered on the indexing field.

Secondary Index
[Figure: a dense secondary index on a nonordering key field.]

Secondary Index Example
• File with 1 000 000 elements (r)
• Block size (B) of 4096 bytes
• Record size (R) 100 bytes
• Suppose that the indexing field is 6 bytes
• A block pointer is 4 bytes
• Each index entry is then 10 bytes

Secondary Index
[Figure: a dense secondary index on a nonordering nonkey field.]
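For the secondary index example on a key field, the size can be estimated as in the Python sketch below. It assumes a dense index with one entry per data record, which is what the summary of index properties lists for a unique secondary index. The variable names are illustrative.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return (a + b - 1) // b


B = 4096                 # block size in bytes
r = 1_000_000            # number of data records
FIELD_BYTES = 6          # size of the indexing field
POINTER_BYTES = 4        # size of a block pointer

entry_size = FIELD_BYTES + POINTER_BYTES     # 10 bytes per index entry
index_bfr = B // entry_size                  # 409 entries per index block
index_entries = r                            # dense: one entry per record
index_blocks = ceil_div(index_entries, index_bfr)

print(entry_size, index_bfr, index_blocks)   # 10 409 2445
```

The dense secondary index needs 2445 blocks, roughly forty times more than a primary index over the same file, because it indexes every record rather than every block.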
Secondary Index Example
• File with 1 000 000 elements (r)
• Block size (B) of 4096 bytes
• Record size (R) 100 bytes
• The indexing field is 6 bytes, block pointer is 4 bytes
• Each index entry is then 10 bytes
• On average 100 records share the same value of the indexing field
• Option 2/3
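The slide does not spell out what "Option 2/3" means. The Python sketch below assumes it refers to the usual alternatives for a nonkey field: either one index entry per record, or one index entry per distinct value with the record pointers for that value kept in separate pointer blocks. Treat this reading, the exact count of 100 records per value, and all variable names as assumptions of the example.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return (a + b - 1) // b


B = 4096                 # block size in bytes
r = 1_000_000            # number of data records
FIELD_BYTES = 6          # size of the indexing field
POINTER_BYTES = 4        # size of a pointer
RECORDS_PER_VALUE = 100  # simplification of "100 on average"

entry_size = FIELD_BYTES + POINTER_BYTES
index_bfr = B // entry_size                          # 409 entries per block

# Reading 1: one index entry per record (duplicate values repeated).
per_record_blocks = ceil_div(r, index_bfr)           # 2445 blocks

# Reading 2 (assumed): one entry per distinct value, plus a level of
# pointer blocks holding the record pointers for each value.
distinct_values = r // RECORDS_PER_VALUE             # 10000 entries
first_level_blocks = ceil_div(distinct_values, index_bfr)                 # 25
pointer_blocks_per_value = ceil_div(RECORDS_PER_VALUE * POINTER_BYTES, B)  # 1
pointer_blocks = distinct_values * pointer_blocks_per_value               # 10000

print(per_record_blocks, first_level_blocks, pointer_blocks)
```

Under the second reading the searchable part of the index shrinks to 25 blocks, at the price of one extra block access per lookup and 10000 mostly empty pointer blocks if each value gets a block of its own.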
Summary: Main index types
• Primary index
  - Index the key field
  - File ordered by the key field
• Clustering index
  - Index a nonkey field
  - File ordered by that nonkey field
• Secondary index
  - Index on a field that is not used for file ordering.

Summary: Properties of indexes
Type of index | Number of first level index entries | Dense or nondense | Block anchoring on the data file
Primary | Number of blocks in data file | Nondense | Yes
Clustering | Number of distinct index field values | Nondense | Yes/No
Secondary (unique) | Number of records in data file | Dense | No
Secondary (nonunique) | Number of records or number of distinct index field values | Dense | No

Example
Assume a file with 5 000 students. There are on average 50 students with the same family name. Each record is 200 bytes. Family name takes 20 bytes and personal number 10 bytes. The block size is 4096 bytes and a block pointer is 5 bytes.
The file is sorted on personal number.
• An index on personal number (file sorted on personal number)
• An index on family name

Example
Assume a file with 5 000 students. There are on average 50 students with the same family name. Each record is 200 bytes. Family name takes 20 bytes and personal number 10 bytes. The block size is 4096 bytes and a block pointer is 5 bytes.
The file is sorted on family name.
• An index on personal number
• An index on family name
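One possible way to work through the two student-file examples is sketched below in Python. It assumes that personal number is a key, that exactly 50 students share each family name, that the secondary indexes are dense with one entry per record, and that no block anchoring is used. These choices and the helper names are assumptions of this sketch, not answers given in the slides.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return (a + b - 1) // b


def index_blocks(entries: int, field_bytes: int, pointer_bytes: int, block_size: int) -> int:
    """Blocks needed for an index with fixed-length, unspanned entries."""
    entries_per_block = block_size // (field_bytes + pointer_bytes)
    return ceil_div(entries, entries_per_block)


B = 4096                  # block size in bytes
STUDENTS = 5_000          # number of records
R = 200                   # record size in bytes
FAMILY, PNR, PTR = 20, 10, 5
STUDENTS_PER_NAME = 50    # simplification of "50 on average"

data_blocks = ceil_div(STUDENTS, B // R)            # 20 records/block -> 250 blocks
distinct_names = STUDENTS // STUDENTS_PER_NAME      # 100 family names

# Example 1: file sorted on personal number.
primary_on_pnr = index_blocks(data_blocks, PNR, PTR, B)          # one entry per block
secondary_on_name = index_blocks(STUDENTS, FAMILY, PTR, B)       # one entry per record

# Example 2: file sorted on family name.
secondary_on_pnr = index_blocks(STUDENTS, PNR, PTR, B)           # one entry per record
clustering_on_name = index_blocks(distinct_names, FAMILY, PTR, B)  # one entry per name

print(primary_on_pnr, secondary_on_name)     # 1 31
print(secondary_on_pnr, clustering_on_name)  # 19 1
```

In both examples the index on the ordering field fits in a single block, while the index on the other field has to be dense and is therefore much larger.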