File Organization and Storage Structures

o Storage of data
– Primary Storage = Main Memory
• Fast
• Volatile
• Expensive
– Secondary Storage = Files on disks or tapes
• Non-Volatile
Secondary Storage is preferred for storing data

Basic Concepts
o Information is stored in data files
o Each file is a sequence of records
o Each record consists of one or more fields

Sno   Lname  Position  NIN        Bno
SL21  White  Manager   WK440211B  B5
SG37  Beech  Snr Asst  WL432514C  B3
SG14  Ford   Deputy    WL220658D  B3

CS3462 Introduction to Database Systems
Helena Wong, 2001

Logical Record Vs Physical Record
o Logical record
– Eg. The record of a staff member (SG37).
– “A record”
o Physical record
– The unit of transfer between disk and primary storage.
– “A page”, “A block”
Generally, a physical record consists of more than one logical record

Logical Record Vs Physical Record (Cont’d)

Sno   Lname  Position   NIN        Bno  Page
SL21  White  Manager    WK440211B  B5   1
SG37  Beech  Snr Asst   WL432514C  B3   1
SG14  Ford   Deputy     WL220658D  B3   1
SA9   Howe   Assistant  WM532187D  B7   2
SG5   Brand  Manager    WK588932E  B3   2
SL41  Lee    Assistant  WA290573K  B5   2
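
The relationship between logical records and pages can be sketched in a few lines of Python. This is only an illustration: the function name `paginate` and the capacity of three records per page are assumptions made for the example, not something the slides specify.

```python
from typing import List, Tuple

# One logical record: (Sno, Lname, Position, NIN, Bno)
Record = Tuple[str, str, str, str, str]


def paginate(records: List[Record], records_per_page: int) -> List[List[Record]]:
    """Group logical records into pages (physical records) of a fixed capacity."""
    if records_per_page < 1:
        raise ValueError("records_per_page must be at least 1")
    return [records[i:i + records_per_page]
            for i in range(0, len(records), records_per_page)]


staff: List[Record] = [
    ("SL21", "White", "Manager", "WK440211B", "B5"),
    ("SG37", "Beech", "Snr Asst", "WL432514C", "B3"),
    ("SG14", "Ford", "Deputy", "WL220658D", "B3"),
    ("SA9", "Howe", "Assistant", "WM532187D", "B7"),
    ("SG5", "Brand", "Manager", "WK588932E", "B3"),
    ("SL41", "Lee", "Assistant", "WA290573K", "B5"),
]

# Assumption for illustration: a page has room for three staff records.
pages = paginate(staff, records_per_page=3)
for page_no, page in enumerate(pages, start=1):
    print(f"Page {page_no}: {[record[0] for record in page]}")
```

Reading one record from disk therefore means transferring the whole page that contains it.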

File Organization & Access Method
o File Organization means the physical arrangement of data in a file into records and pages on secondary storage
– Eg. Ordered files, indexed sequential files etc.
o Access Method means the steps involved in storing and retrieving records from a file.
– Eg. Using an indexed access method to retrieve a record from an indexed sequential file.

Heap Files
o Heap files are files of unordered records.
o Quick insertion (no particular ordering)
– When a new record is created, it is put in the last page of the file if there is sufficient space. Otherwise a new page is added to the file.
o Slow retrieval (only allows linear search)
– Pages are read from the file until the required record is found.
o To delete a record, the record is marked as deleted. Space is reclaimed during periodic reorganization.

Ordered Files
o Ordered Files: Records are sorted on field(s) => Key
o Allow Binary Searching
Suppose one page stores one record. To search for SG37, search the middle page (6/2 = 3) first. We find that this page holds SG14, not SG37. Since SG37 is greater than SG14, we next search the middle page of the half of the file that follows page 3, and so on.
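
The difference between searching a heap file and an ordered file can be sketched as follows. The sketch keeps the simplification of one record per page. The helper `staff_key`, which orders staff numbers by letter prefix and then by numeric part, is an assumption chosen so that SG14 sorts before SG37 and SG5 before SG14; the slides do not spell out the ordering rule. Both search functions return the page number found and the number of pages read.

```python
import re
from typing import List, Optional, Tuple


def staff_key(sno: str) -> Tuple[str, int]:
    """Assumed ordering: letter prefix first, then the numeric part."""
    match = re.fullmatch(r"([A-Z]+)(\d+)", sno)
    if match is None:
        raise ValueError(f"unexpected staff number: {sno}")
    return match.group(1), int(match.group(2))


def linear_search(pages: List[str], target: str) -> Tuple[Optional[int], int]:
    """Heap file: read pages from the start until the record is found."""
    for page_no, sno in enumerate(pages, start=1):
        if sno == target:
            return page_no, page_no
    return None, len(pages)


def binary_search(pages: List[str], target: str) -> Tuple[Optional[int], int]:
    """Ordered file: repeatedly read the middle page of the remaining range."""
    low, high = 1, len(pages)
    pages_read = 0
    while low <= high:
        mid = (low + high) // 2
        pages_read += 1
        sno = pages[mid - 1]
        if sno == target:
            return mid, pages_read
        if staff_key(target) > staff_key(sno):
            low = mid + 1   # continue in the pages after the middle page
        else:
            high = mid - 1  # continue in the pages before the middle page
    return None, pages_read


heap_file = ["SL21", "SG37", "SG14", "SA9", "SG5", "SL41"]  # arrival order
ordered_file = sorted(heap_file, key=staff_key)

print(ordered_file)
print(binary_search(ordered_file, "SG37"))
print(linear_search(heap_file, "SL41"))
print(binary_search(ordered_file, "SL41"))
```

With six pages the first page read by `binary_search` is page 3, which holds SG14, matching the walk-through above.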

Ordered Files (Cont’d)
o Inserting a record
– If the appropriate page is full, the whole file may have to be reorganized => Time consuming
– Solution: use a temporary unsorted file (transaction file). Merge it into the sorted file periodically.
o Rarely used unless they come with an index => Indexed Sequential File
o Both Heap Files and Ordered Files are also called Sequential Files.

Direct Files
o Direct Files are also called Hash Files or Random Files
o No need to write records sequentially
o Use a hash function to calculate the number of the page (bucket) in which a record should be located
o Eg., use the division-remainder calculation method:
bucket_no = Record_key mod 3
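
A minimal sketch of the division-remainder method follows. The staff numbers contain letters, so the sketch assumes that only the numeric part of the key is hashed; that choice, and the function signature, are illustrative assumptions rather than something the slides state.

```python
import re

NUM_BUCKETS = 3


def bucket_no(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Division-remainder hash: numeric part of the key mod the bucket count."""
    digits = re.sub(r"\D", "", record_key)  # assumption: ignore the letter prefix
    return int(digits) % num_buckets


for sno in ["SL21", "SG37", "SG14", "SA9", "SG5", "SL41"]:
    print(sno, "-> bucket", bucket_no(sno))
```

Several keys can map to the same bucket, which is why a collision strategy is needed once a bucket fills up.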

Direct Files
o Problem: If a new record SL41 is created, which bucket should it go to?
o Collision Management
Open addressing, Unchained overflow, Chained overflow, Multiple hashing

Direct Files
Open Addressing
o Upon a collision, the system performs a linear search to find the first available slot.
o When the last bucket has been searched, the search starts again from the first bucket.
o SL41 will be inserted into: Bucket 1
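
Open addressing can be sketched as below. The sketch rests on assumptions that are not on the slides: each bucket holds two records, the hash uses only the numeric part of the staff number, and the five other staff records are already stored. Under those assumptions SL41 hashes to bucket 2, finds it full, wraps around past bucket 0 and lands in bucket 1.

```python
import re
from typing import List

NUM_BUCKETS = 3
SLOTS_PER_BUCKET = 2  # assumed capacity, not stated on the slides


def bucket_no(record_key: str) -> int:
    """Division-remainder hash on the numeric part of the key (assumption)."""
    return int(re.sub(r"\D", "", record_key)) % NUM_BUCKETS


def insert_open_addressing(buckets: List[List[str]], record_key: str) -> int:
    """Place the record in its home bucket, or in the next bucket with a free slot."""
    home = bucket_no(record_key)
    for step in range(NUM_BUCKETS):
        candidate = (home + step) % NUM_BUCKETS  # wrap to the first bucket after the last
        if len(buckets[candidate]) < SLOTS_PER_BUCKET:
            buckets[candidate].append(record_key)
            return candidate
    raise RuntimeError("no free slot in any bucket")


buckets: List[List[str]] = [[] for _ in range(NUM_BUCKETS)]
for sno in ["SL21", "SG37", "SG14", "SA9", "SG5"]:
    insert_open_addressing(buckets, sno)

placed_in = insert_open_addressing(buckets, "SL41")
print("SL41 stored in bucket", placed_in)
print(buckets)
```

A lookup has to follow the same probe sequence, so a record is not always found in its home bucket.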

Direct Files
Unchained Overflow
o An overflow area is maintained for collisions.
o SL41 will be inserted into: Bucket 3

Direct Files
Chained Overflow
o Each bucket has a synonym pointer
o Value of the synonym pointer:
– Zero: no collision occurred
– Non-zero: the overflow bucket used
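
The synonym pointer can be sketched with a small Python class. As before, the bucket capacity of two, the hash on the numeric part of the key and the numbering of overflow buckets from 3 upwards are assumptions made for the illustration; the names `Bucket`, `insert_chained` and `find_chained` are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List, Optional

NUM_PRIMARY_BUCKETS = 3
SLOTS_PER_BUCKET = 2  # assumed capacity, not stated on the slides


@dataclass
class Bucket:
    records: List[str] = field(default_factory=list)
    synonym_pointer: int = 0  # 0 = no collision; otherwise the overflow bucket number


def bucket_no(record_key: str) -> int:
    """Division-remainder hash on the numeric part of the key (assumption)."""
    return int(re.sub(r"\D", "", record_key)) % NUM_PRIMARY_BUCKETS


def insert_chained(buckets: Dict[int, Bucket], record_key: str) -> int:
    """Insert into the home bucket, following or creating synonym pointers when full."""
    current = bucket_no(record_key)
    while True:
        bucket = buckets[current]
        if len(bucket.records) < SLOTS_PER_BUCKET:
            bucket.records.append(record_key)
            return current
        if bucket.synonym_pointer == 0:
            overflow_no = max(buckets) + 1  # next unused bucket number
            buckets[overflow_no] = Bucket()
            bucket.synonym_pointer = overflow_no
        current = bucket.synonym_pointer


def find_chained(buckets: Dict[int, Bucket], record_key: str) -> Optional[int]:
    """Follow the synonym chain from the home bucket until the record is found."""
    current = bucket_no(record_key)
    while True:
        bucket = buckets[current]
        if record_key in bucket.records:
            return current
        if bucket.synonym_pointer == 0:
            return None
        current = bucket.synonym_pointer


buckets = {number: Bucket() for number in range(NUM_PRIMARY_BUCKETS)}
for sno in ["SL21", "SG37", "SG14", "SA9", "SG5"]:
    insert_chained(buckets, sno)

placed_in = insert_chained(buckets, "SL41")
print("SL41 stored in bucket", placed_in)
print("synonym pointer of bucket 2:", buckets[2].synonym_pointer)
```

Compared with open addressing, a lookup only visits the home bucket and the overflow buckets chained to it.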

Direct Files
Multiple Hashing
o Upon collision, apply a second hashing function to produce a new hash address in an overflow area.

Direct Files
Limitation (of Hashing)
Inappropriate for some retrievals:
– Based on pattern matching
eg. Find all students with ID like 98xxxxxx.
– Involving ranges of values
eg. Find all students with IDs from 50100000 to 50199999.
– Based on a field other than the hash field

Indexes
Index: A data structure that allows particular records in a file to be located more quickly
~ Index in a book

TERMINOLOGY
Data file: a file containing the logical records
Index file: a file containing the index records
Indexing field: the field used to order the index records in the index file
Key: One or more fields which can uniquely identify a record (eg. No two students have the same student ID).

An index can be sparse or dense:
Sparse: an index record for only some of the search key values (eg. Staff IDs: CS001, EE001, MA001). Applicable to ordered data files only.
Dense: an index record for every search key value (eg. Staff IDs: CS001, CS002, .. CS089, EE001, EE002, ..)
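
The difference between a dense and a sparse index can be sketched as follows. The small data file, the choice of which keys appear in the sparse index and the function names are hypothetical. The sparse lookup finds the last index entry not greater than the search key and then scans the ordered data file from there, which is why a sparse index only works on ordered data files.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical ordered data file of staff IDs (position = record number).
data_file: List[str] = ["CS001", "CS002", "CS003", "EE001", "EE002", "MA001", "MA002"]

# Dense index: one index record per search key value.
dense_index = {key: position for position, key in enumerate(data_file)}

# Sparse index: index records for only some key values, kept in key order.
sparse_index: List[Tuple[str, int]] = [
    (key, data_file.index(key)) for key in ["CS001", "EE001", "MA001"]
]


def dense_lookup(key: str) -> Optional[int]:
    """Go straight from the index record to the data record."""
    return dense_index.get(key)


def sparse_lookup(key: str) -> Optional[int]:
    """Find the last index entry <= key, then scan the ordered data file."""
    index_keys = [entry_key for entry_key, _ in sparse_index]
    entry = bisect_right(index_keys, key) - 1
    if entry < 0:
        return None  # key sorts before the first indexed value
    position = sparse_index[entry][1]
    while position < len(data_file) and data_file[position] <= key:
        if data_file[position] == key:
            return position
        position += 1
    return None


print(dense_lookup("EE002"), sparse_lookup("EE002"))
print(dense_lookup("DD001"), sparse_lookup("DD001"))
```

The sparse index is smaller, at the price of a short sequential scan after each index lookup.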

Indexes
TYPES OF INDEXES
Primary Index: An index ordered in the same way as the data file, which is sequentially ordered according to a key. (The indexing field is equal to this key.)
Secondary Index: An index that is defined on a non-ordering field of the data file. (The indexing field need not contain unique values.)
A data file can be associated with at most one primary index plus several secondary indexes.

Indexed Sequential Files
What are Indexed Sequential Files?
= A sorted data file with a primary index
Advantage of an Indexed Sequential File
Allows both sequential processing and individual record retrieval through the index.
Structure of an Indexed Sequential File
o A primary storage area
o A separate index or indexes
o An overflow area

B+-Trees
In a B+-Tree, data or indexes are stored in a hierarchy of nodes
o B => Balanced
o Consistent access time (for each access, the same number of nodes is searched)
[Figure: B+-Tree nodes, labelled “Point to data”]

B+-Trees
TERMINOLOGY
Degree (Order): The maximum number of children allowed per parent.
Depth: The maximum number of levels between the root node and a leaf node in the tree.

B+-Trees
In practice, each node in the tree is actually a page, so we can store many pointers and keys. Eg. For a page size of 4KB, the B+-Tree can be of order 512.
Access time depends more often upon depth than on breadth => Shallow trees are preferred.

B+-Trees
RULES
o The root (if not a leaf node) must have at least 2 children
o For a tree of order n, each node (except root and leaf) must have between n/2 and n pointers (children). If n/2 is not an integer, the result is rounded up.

B+-Trees
RULES (Cont’d):
o For a tree of order n, the number of key values in a leaf node must be between (n-1)/2 and (n-1). If (n-1)/2 is not an integer, the result is rounded up.
o The number of key values contained in a nonleaf node is 1 less than the number of pointers.
o The tree must always be balanced: every path from the root node to a leaf must have the same length.
o Leaf nodes are linked in order of key values.
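
The occupancy rules above reduce to a little integer arithmetic, sketched here in Python. The function names are hypothetical, and the entry size of 8 bytes per key and pointer is an assumption chosen because it reproduces the order of 512 quoted for a 4KB page; the slides do not state the entry size.

```python
from typing import Tuple


def order_for_page(page_size_bytes: int, entry_size_bytes: int) -> int:
    """Number of key/pointer entries that fit into one page."""
    return page_size_bytes // entry_size_bytes


def ceil_half(value: int) -> int:
    """value / 2 rounded up, using integer arithmetic."""
    return (value + 1) // 2


def internal_pointer_range(order: int) -> Tuple[int, int]:
    """Allowed number of pointers in a node that is neither root nor leaf."""
    return ceil_half(order), order


def leaf_key_range(order: int) -> Tuple[int, int]:
    """Allowed number of key values in a leaf node."""
    return ceil_half(order - 1), order - 1


order = order_for_page(4 * 1024, entry_size_bytes=8)  # assumed 8-byte entries
print("order:", order)
print("pointers per internal node:", internal_pointer_range(order))
print("keys per leaf node:", leaf_key_range(order))
print("order 5:", internal_pointer_range(5), leaf_key_range(5))
```

For a small tree of order 5 this gives 3 to 5 pointers per internal node and 2 to 4 key values per leaf.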

B+-Trees
Balancing can be costly to maintain.
[Figures: B+-Tree examples, adding SG14 and adding SA9]

Summary
o Basic concepts (Files, Records, Fields)
o Primary storage vs secondary storage
o Logical record vs physical record
o File Organization (and access methods)
– Heap files
– Ordered Files (Binary Search)
– Direct Files (Hashing)
– Indexes
– Indexed Sequential Files
– B+-Trees