Organizing Files for Performance

Chapter 6
Jim Skon
File Processing - Organizing file for Performance
MVNC
1
Organizing Files for Performance

Data Compression
Reclaiming space in files
Fast Searching
Keysorting
Data Compression

Making files smaller
» Use less storage, save space
» Faster Transmission
» Processed faster

Data Compression
» encoding information more efficiently
» Many techniques exist
Data Compression


Consider fields with a fixed length or a fixed set of values
A binary representation can save space
» States - 50 states - 6 bits (one byte)
» Zip - 0 to 99999 - 17 bits (three bytes)

Called Compact Notation
» Redundancy reduction
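
A minimal sketch of compact notation, assuming a hypothetical `STATES` table (only a few entries are listed here; a real table would hold all 50) and hypothetical helper names `pack_address` and `unpack_address`. The state becomes a one-byte index and the ZIP code a three-byte integer, so the pair takes 4 bytes instead of the 7 characters of text such as "OH43050".

```python
# Hypothetical lookup table; a real one would list all 50 state abbreviations.
STATES = ["AL", "AK", "AZ", "OH", "TX"]

def pack_address(state, zip_code):
    """Pack a state abbreviation and a ZIP code into 4 bytes."""
    state_code = STATES.index(state)      # needs at most 6 bits, stored in one byte
    if not 0 <= zip_code <= 99999:
        raise ValueError("ZIP code out of range")
    return bytes([state_code]) + zip_code.to_bytes(3, "big")  # 17 bits fit in 3 bytes

def unpack_address(packed):
    """Reverse pack_address: recover the abbreviation and the ZIP code."""
    return STATES[packed[0]], int.from_bytes(packed[1:4], "big")
```

The packed form is no longer readable as text, and a ZIP code with leading zeros has to be printed back with something like `f"{zip_code:05d}"`.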
Data Compression

Cost of binary representations
» File not readable as text
» Processing time for conversion
» All software must include appropriate, compatible encoding and decoding routines.
» Potential loss of flexibility
Data Compression

Suppressing repeating sequences
» Consider a picture
– Series of pixels - each a color
– Colors represented by 8 bit value
– Usually come in bunches, e.g.
– 24 23 22 22 22 22 22 25 25 25 25 25 25 65 65 66 66 66 66
» Run length encoding
– Represent long runs with a prefix (FF) followed by count, followed by color
– 24 23 FF 05 22 FF 06 25 65 65 FF 04 66
» Simple images would be small, busy images would be no bigger.
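
The run length scheme above can be sketched as follows. The names `rle_encode`, `rle_decode`, and `MIN_RUN` are illustrative, and two details are assumptions not stated on the slide: a run is only encoded when it has at least 4 pixels (a 3-byte code saves nothing on shorter runs), and a pixel whose value happens to equal the marker FF is always written as a run so the decoder cannot misread it.

```python
MARKER = 0xFF   # run-length prefix from the slide
MIN_RUN = 4     # assumption: shorter runs are cheaper to store literally

def rle_encode(pixels):
    """Replace long runs with MARKER, count, value."""
    out = []
    i = 0
    while i < len(pixels):
        value = pixels[i]
        run = 1
        # Count repeats of this value; the count must fit in one byte.
        while i + run < len(pixels) and pixels[i + run] == value and run < 255:
            run += 1
        if run >= MIN_RUN or value == MARKER:
            out += [MARKER, run, value]
        else:
            out += [value] * run
        i += run
    return out

def rle_decode(codes):
    """Expand MARKER, count, value triples back into runs."""
    out = []
    i = 0
    while i < len(codes):
        if codes[i] == MARKER:
            out += [codes[i + 2]] * codes[i + 1]
            i += 3
        else:
            out.append(codes[i])
            i += 1
    return out
```

Encoding the pixel series from the slide with this sketch produces the same 13 values shown there, down from 19.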
Data Compression

Assigning variable length codes
» Some values are more likely than others
» Use shorter codes for often used values, longer ones for less used values.
» Each code must have the property of a unique prefix
– No code is the prefix of any other code
– Thus we always know if we are at the end of a given code
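
The unique prefix property is easy to check mechanically. The function below is a small sketch (the name `is_prefix_free` is illustrative) that treats each code as a string of "0" and "1" characters and compares every pair.

```python
def is_prefix_free(codes):
    """Return True if no code is a prefix of any other code."""
    return not any(
        i != j and other.startswith(code)
        for i, code in enumerate(codes)
        for j, other in enumerate(codes)
    )
```

For example, the set 0, 01, 11 fails the check because 0 is a prefix of 01, so a decoder reading a 0 could not tell whether the code had ended.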
Variable length codes

Example:
Letter:  a    b    c    d     e     f     g
Prob:    0.4  0.1  0.1  0.1   0.1   0.1   0.1
Code:    1    010  011  0000  0001  0010  0011

Can be decoded with a binary tree!
Called Huffman code
» An algorithm exists to easily create an optimal code
» Requires that a table of codes be maintained with the file
» Most often used for fixed codes
» Example - Group 3 FAX
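
A short sketch of encoding and decoding with the code table from the example. The function names are illustrative, and instead of an explicit binary tree the decoder accumulates bits until they match a code, which is equivalent to walking the tree from the root to a leaf because no code is a prefix of another.

```python
# Code table from the example above.
CODES = {"a": "1", "b": "010", "c": "011",
         "d": "0000", "e": "0001", "f": "0010", "g": "0011"}

def encode(text):
    """Concatenate the variable length code of each letter."""
    return "".join(CODES[ch] for ch in text)

def decode(bits):
    """Read bits until they form a complete code, emit the letter, repeat."""
    by_code = {code: letter for letter, code in CODES.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in by_code:       # reached a leaf of the code tree
            out.append(by_code[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)
```

Here `encode("cab")` gives the 7 bits 0111010, and the decoder splits them back into 011, 1, 010 without any separators.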
Data Compression

Irreversible Compression
» Compression which loses some information
» Example - compress a 400x400 image into a 100x100 image by averaging groups of 16 adjacent pixels
» Saves space, but the resolution of the picture is reduced
» Used most often for visual or audio information (which has inherent redundancy)
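
The averaging example can be sketched as below, assuming the image is a list of rows of integer pixel values and that both dimensions are multiples of the block size. The name `downsample` is illustrative. Each 4x4 block of 16 pixels becomes a single pixel, so the original values cannot be recovered.

```python
def downsample(image, factor=4):
    """Average each factor x factor block of pixels into one pixel."""
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            out_row.append(sum(block) // len(block))   # integer average of 16 pixels
        result.append(out_row)
    return result
```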
Data Compression

Compression in UNIX
» pack and unpack programs
– Uses Huffman coding
– 25% to 40% savings on text files
– Much less on binary files
– Uses “.z” file suffix
» compress and uncompress programs
– Uses Lempel-Ziv compression
– No coding table needed - self coding
– Uses “.Z” file suffix
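
To illustrate why no coding table has to be stored, here is a simplified Lempel-Ziv-Welch style sketch. It is not the actual file format written by compress (which, among other things, packs codes into variable-width bit fields); the function names are illustrative. The point is that the decoder rebuilds the same table the encoder built, using only the codes it has already read.

```python
def lzw_encode(data):
    """Encode bytes as a list of integer codes, growing the table as we go."""
    table = {bytes([i]): i for i in range(256)}
    current = b""
    out = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in table:
            current = candidate
        else:
            out.append(table[current])
            table[candidate] = len(table)   # new entry, never written to the output
            current = bytes([byte])
    if current:
        out.append(table[current])
    return out

def lzw_decode(codes):
    """Rebuild the table while decoding; nothing but the codes is needed."""
    if not codes:
        return b""
    table = {i: bytes([i]) for i in range(256)}
    previous = table[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code in table:
            entry = table[code]
        elif code == len(table):            # code refers to the entry being built
            entry = previous + previous[:1]
        else:
            raise ValueError("invalid code")
        out.append(entry)
        table[len(table)] = previous + entry[:1]
        previous = entry
    return b"".join(out)
```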
Reclaiming space in files

Suppose a variable length record in the
middle of a file is modified so it is:
» Longer?
» Shorter?

Suppose a record is
» Added to the middle?
» Deleted from middle?
Reclaiming space in files


Record deletion and storage compaction
Storage compaction
» Recovering unused space in a file
» Space comes from deletion or from record size changing

Consider deleted records
» Must be able to recognize deleted records
» Have a special mark for the record
– e.g., an asterisk as the first character of the key field
– May be undeleted if not overwritten!
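
A minimal sketch of deletion by marking, assuming fixed-length records of a hypothetical 16 bytes addressed by relative record number (RRN) and any seekable binary file object. The names `delete_record` and `is_deleted` are illustrative. Only the first byte is overwritten, so the rest of the record stays in place until the slot is reused.

```python
import io

RECORD_SIZE = 16      # hypothetical fixed record length
DELETE_MARK = b"*"    # asterisk in the first character of the key field

def delete_record(f, rrn):
    """Mark record number rrn as deleted by overwriting its first byte."""
    f.seek(rrn * RECORD_SIZE)
    f.write(DELETE_MARK)

def is_deleted(f, rrn):
    """Check whether record number rrn carries the deletion mark."""
    f.seek(rrn * RECORD_SIZE)
    return f.read(1) == DELETE_MARK
```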
Dealing with Deleted records


Occasional compaction
Dynamic maintenance
Occasional compaction



A process that is run periodically, reads the file, and rewrites it with no empty space.
Could happen automatically every night/week/month.
File is unavailable while the operation is underway.
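
Occasional compaction can be sketched as a copy that skips marked records. This assumes fixed-length records, an asterisk as the deletion mark, and file-like objects for the old and new files; the name `compact` and its parameters are illustrative.

```python
import io

def compact(src, dst, record_size, delete_mark=b"*"):
    """Copy live fixed-length records from src to dst, skipping deleted ones."""
    kept = 0
    while True:
        record = src.read(record_size)
        if len(record) < record_size:    # end of file
            break
        if not record.startswith(delete_mark):
            dst.write(record)
            kept += 1
    return kept
```

Because the whole file is rewritten, the old file would typically be replaced by the new one only after the copy finishes.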
Dynamic maintenance

Delete records by marking
Reuse deleted records as new records are added or updated
Need:
» A way of knowing whether deleted records exist
» Where deleted records are, so we can jump right to them
Dynamic maintenance

Solution: linked list of deleted records
» Each deleted record contains a mark, and a pointer
to the next deleted record
» The file header contains a pointer to the first
deleted record.
Linked list of deleted records


Fixed-length records
Variable-length records
Linked list of deleted records

Fixed-length records
» Simply maintain a stack of deleted records rooted
in header record
» Deletion - add to front of list
» Addition - use record at front of list
» Minimal list maintenance cost
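
The stack of deleted fixed-length records can be sketched with an in-memory stand-in for the file. The class `FixedFile` and its field names are hypothetical: a Python list plays the role of the record slots, `first_free` plays the role of the header pointer, and -1 means the list is empty.

```python
class FixedFile:
    """In-memory stand-in for a file of fixed-length records."""

    def __init__(self):
        self.records = []      # slot i plays the role of RRN i
        self.first_free = -1   # header field: RRN of first deleted record

    def delete(self, rrn):
        # Push: the deleted slot holds the mark and a link to the old list head.
        self.records[rrn] = ("*", self.first_free)
        self.first_free = rrn

    def add(self, data):
        if self.first_free == -1:
            self.records.append(data)            # nothing to reuse: append at end
            return len(self.records) - 1
        rrn = self.first_free
        self.first_free = self.records[rrn][1]   # pop: follow the link
        self.records[rrn] = data
        return rrn
```

Both operations touch only the header and one record, which is why the maintenance cost is minimal.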
Linked list of deleted records

Variable-length records
» Store for each deleted record
– Deletion Marker
– Link to next deleted record
– record size indicator
Variable-length records

Insertion
» Which deleted record?

Deletion
» Add records to list (stack?)
» Where?
Variable-length records Insertion


Select and use a deleted record
Break up records
» pick a record
» If size of deleted record bigger, break into two - a
record to use and a new, smaller, deleted record.
» Put smaller deleted record back in list

Leave empty space at end
» pick a record
» If size of deleted record bigger, just leave empty
space at end.
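
The two options above can be sketched with free slots described as (offset, size) pairs. The function `place_record` is illustrative, and it ignores one practical detail: in a real file the leftover piece must be large enough to hold its own deletion marker, link, and size field.

```python
def place_record(slot, needed, split=True):
    """Place a record of `needed` bytes in a free slot given as (offset, size).

    Returns (offset_used, leftover_slot, wasted_bytes).
    """
    offset, size = slot
    if needed > size:
        raise ValueError("slot too small")
    extra = size - needed
    if split and extra > 0:
        # Break up: the unused tail becomes a new, smaller deleted record.
        return offset, (offset + needed, extra), 0
    # Leave empty space at the end of the reused record.
    return offset, None, extra
```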
Variable-length records Fragmentation

Recall fragmentation in fixed-length records
» At the end of fields, if fields are fixed length
» At the end of records, if fields are variable length
» Called internal fragmentation

Leaving space at the end of a variable-length record also leads to internal fragmentation.
Breaking up variable-length records gets rid of fragmentation, right? Wrong!
Variable-length records Fragmentation


As records get broken up, smaller and smaller pieces get left over.
These pieces, too small to hold a new record, are external fragmentation
Variable-length records Insertion strategy


How to pick record to use?
First Fit
» Use first deleted record found in list

Best Fit
» Use the deleted record closest in size that is still large enough

Worst Fit
» Use deleted record that is largest
» No good when not breaking up records!
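
The three strategies can be compared with a small sketch that takes the sizes of the deleted records in list order and returns the index of the one to reuse, or None if nothing is large enough. The function names are illustrative.

```python
def first_fit(free_sizes, needed):
    """Index of the first deleted record that is large enough."""
    for i, size in enumerate(free_sizes):
        if size >= needed:
            return i
    return None

def best_fit(free_sizes, needed):
    """Index of the smallest deleted record that is still large enough."""
    candidates = [i for i, size in enumerate(free_sizes) if size >= needed]
    return min(candidates, key=lambda i: free_sizes[i], default=None)

def worst_fit(free_sizes, needed):
    """Index of the largest deleted record, if it is large enough."""
    candidates = [i for i, size in enumerate(free_sizes) if size >= needed]
    return max(candidates, key=lambda i: free_sizes[i], default=None)
```

With free sizes 40, 72, 30, 100 and a 28-byte record, first fit picks the 40-byte slot, best fit the 30-byte slot, and worst fit the 100-byte slot.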
Variable-length records Insertion

How do we find the record with the desired
size?
» Search them ALL!
» Keep the records in sorted order by record size
– Increasing size facilitates Best fit
– Decreasing size facilitates worst fit (just pick first in list)
– This increases deletion time!
Variable-length records Reducing fragmentation


Merge adjacent free records
How do we know if a newly deleted record is
adjacent to a free record?
» Search the deleted list
» Keep deleted records sorted by position in file
– This makes finding of adjacent free space trivial
– Costs more at deletion time
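
A sketch of merging adjacent free records, assuming each free record is described by an (offset, size) pair. Sorting by position puts neighbors next to each other, so one pass is enough. The name `coalesce` is illustrative.

```python
def coalesce(free_slots):
    """Merge free slots, given as (offset, size) pairs, that touch each other."""
    merged = []
    for offset, size in sorted(free_slots):      # sorted by position in file
        if merged and merged[-1][0] + merged[-1][1] == offset:
            last_offset, last_size = merged[-1]
            merged[-1] = (last_offset, last_size + size)   # extend previous slot
        else:
            merged.append((offset, size))
    return merged
```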
Fast Searching

Binary Searching
» O(log n), where n is number of records
» requires file be sorted
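
A sketch of binary searching a sorted file of fixed-length records, assuming the key sits at the start of each record, padded with spaces. The function name and parameters are illustrative. Each pass through the loop costs one seek and one read, which is what makes the number of probes matter.

```python
import io

def binary_search(f, key, record_size, key_size, record_count):
    """Return the RRN of the record whose key equals `key`, or -1 if absent."""
    low, high = 0, record_count - 1
    while low <= high:
        mid = (low + high) // 2
        f.seek(mid * record_size)                 # one disk access per probe
        record_key = f.read(record_size)[:key_size].rstrip()
        if record_key == key:
            return mid
        if record_key < key:
            low = mid + 1
        else:
            high = mid - 1
    return -1
```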

Question - how do we sort file?
File Sorting

Sort in RAM
» read in entire file - sort
» Called internal sorting
» Limited by size of memory
Binary Search - Problems

Binary searching requires more than one or two accesses
» Accesses are VERY expensive
» Accesses are very random (much seek time)
» A file of 100,000 records requires an average of 16.5 accesses
» We would like to approach the speed of a direct lookup!
Binary Search - Problems

Keeping a file sorted is expensive
» Every record added must be entered in sorted
order
» Reordering is costly

Internal sorting is limited to small files
» We will see there are methods to sort a file that will not fit in memory. But it is still expensive!
Keysorting



Rather than sorting the file itself, we could sort an array of primary keys, where each key is accompanied by the address of the associated record.
The pointer could be a byte offset from the start of the file or (if records are fixed length) an RRN.
After sorting the keys, the file can be rewritten in order.
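
Keysorting can be sketched as follows, with a Python list standing in for the file so that the list index plays the role of the RRN. The names `keysort` and `key_of` are illustrative. Only the (key, RRN) pairs are sorted; the records themselves are moved once, when the file is rewritten.

```python
def keysort(records, key_of):
    """Return the RRNs of the records in key order."""
    keynodes = [(key_of(rec), rrn) for rrn, rec in enumerate(records)]
    keynodes.sort()                       # sorts the small pairs, not the records
    return [rrn for _, rrn in keynodes]
```

The sorted file is then produced by reading the records in the returned order, for example `[records[rrn] for rrn in keysort(records, key_of)]`.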
Keysorting

Advantages
» Keys can be sorted in a smaller space than the whole file
» Faster to sort (swap!) keys than entire records
Keysorting

Disadvantages
» Still limited in size to key lists which fit in memory
» Sequential processing cannot take advantage of buffering!
Keysorting



Alternative - keep the sorted list of (key, pointer) pairs around.
It is a type of index file!
It can be read in and searched in memory!
Key Sorted Index

Advantages
» Keys and pointers can be searched in memory. Only one I/O per lookup!
» File can be maintained in ANY order. Searching
and key order sequential processing still possible.
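
A sketch of a key sorted index held in memory, again with a Python list standing in for the data file and the list index as the RRN. The names `build_index` and `lookup` are illustrative. The binary search runs entirely on the in-memory pairs, and the data file is touched once, to fetch the matching record.

```python
import bisect

def build_index(records, key_of):
    """Build a sorted list of (key, RRN) pairs; the file order is unchanged."""
    return sorted((key_of(rec), rrn) for rrn, rec in enumerate(records))

def lookup(index, records, key):
    """Binary search the index in memory, then fetch the record."""
    pos = bisect.bisect_left(index, (key,))    # (key,) sorts before any (key, rrn)
    if pos < len(index) and index[pos][0] == key:
        return records[index[pos][1]]          # the single access to the data file
    return None
```

Walking the index from start to end and fetching each record gives key order sequential processing without ever sorting the file.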
Key Sorted Index

Disadvantages
» Sequential processing cannot take advantage of buffering!
» Pinned records
– Records in main file cannot change location without
invalidating index file!
– Must either maintain index in parallel, or rebuild!