CIS 256 (File Structures)

6 Organizing Files for Performance
6.1 Data Compression
Content
►Introduction to Compression
►Methods in Data Compression
– Run-Length Coding
– Huffman Coding
Data compression
► Data compression methods are used to make files smaller by re-encoding the data that goes into a file.
► There are many reasons for making files smaller:
– Smaller files use less storage, resulting in cost savings.
– They can be transmitted faster, decreasing access time or, alternatively, allowing the same access time with a lower and cheaper bandwidth.
– They can be processed faster sequentially.
Compression Techniques
► Using a Different Notation
► Suppressing Repeating Sequences
► Assigning Variable-Length Codes
► Irreversible Compression Techniques (Lossy)
Using a Different Notation
► Fixed-length fields are good candidates.
► Decrease the number of bits by finding a more compact notation.
► Cons:
– unreadable by humans
– cost in encoding time
– decoding modules increase the complexity of the software used for a particular application
► Example: the state field in the person records can be stored in 6 bits (enough for 50 states) instead of 16.
► It is classified as a redundancy reduction technique.
► With so many costs, is this kind of compression worth it?
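The state-field example can be sketched as a lookup table that maps each two-letter abbreviation (16 bits as two ASCII characters) to a small integer. This is only an illustration: `STATES` is a hypothetical five-entry subset, and `bits_needed`, `encode_state` and `decode_state` are names invented here.

```python
# Hypothetical subset; a real table would list all 50 abbreviations.
STATES = ["AL", "AK", "AZ", "AR", "CA"]
STATE_TO_CODE = {abbr: i for i, abbr in enumerate(STATES)}

def bits_needed(n_values: int) -> int:
    """Smallest number of bits that can distinguish n_values different values."""
    return max(1, (n_values - 1).bit_length())

def encode_state(abbr: str) -> int:
    """Replace the two-character abbreviation with its index in the table."""
    return STATE_TO_CODE[abbr]

def decode_state(code: int) -> str:
    """Recover the abbreviation; this lookup is the decoding cost the slide mentions."""
    return STATES[code]

print(bits_needed(50))                   # 6 bits are enough for 50 states
print(decode_state(encode_state("CA")))  # round trip back to "CA"
```

The encoded value is no longer readable by a human, and every program that touches the file needs the same table, which is the trade-off listed under the cons.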
Using a Different Notation
► The notation used for representing information can often be made more compact.
► Ex: if we are going to write a file that contains information about students, such as name, mark, and major, we can declare the mark as a byte instead of an integer and store the major as a 3-character code resolved through a lookup table. In this way we save space.
ST_REC = Name_Stu  : string[50];
         Mark_Stu  : byte;        // was int
         Major_Stu : string[3];   // was string[30], using lookup table
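As a rough illustration of the saving, the two record layouts can be compared with Python's `struct` module. The sketch assumes a 4-byte integer for the original mark and no padding between fields; the `MAJOR_CODES` table and the helper names are hypothetical.

```python
import struct

# "<" = little-endian with no padding; "50s" = 50-byte string field.
WIDE = struct.Struct("<50si30s")     # name, 4-byte int mark, 30-char major
COMPACT = struct.Struct("<50sB3s")   # name, 1-byte mark, 3-char major code

# Hypothetical lookup table from full major names to 3-character codes.
MAJOR_CODES = {"Computer Information Systems": b"CIS", "Mathematics": b"MAT"}
MAJOR_NAMES = {code: name for name, code in MAJOR_CODES.items()}

def pack_compact(name: str, mark: int, major: str) -> bytes:
    """Pack one student record using the compact notation (mark must be 0..255)."""
    return COMPACT.pack(name.encode("ascii"), mark, MAJOR_CODES[major])

def unpack_compact(buf: bytes):
    """Unpack a compact record and expand the major code through the table."""
    name, mark, code = COMPACT.unpack(buf)
    return name.rstrip(b"\x00").decode("ascii"), mark, MAJOR_NAMES[code]

print(WIDE.size, COMPACT.size)  # bytes per record before and after
print(unpack_compact(pack_compact("Sara", 87, "Mathematics")))
```

Under these assumptions each record shrinks from 84 to 54 bytes, at the price of a lookup on every read and write of the major field.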
Suppressing Repeating Sequences
► Run-length encoding (RLE): encode sequences of repeating values rather than writing all the values in the file.
► Ex: Suppose we wish to compress an image using run-length encoding, and we find that we can omit the byte 0xff from the representation of the image, so 0xff is free to serve as the run-length indicator.
- How would we encode the following sequence of hexadecimal byte values?
22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24
- The way: the first two pixels are copied in sequence. The runs of 24 and 26 are both run-length encoded as the indicator ff, the repeated value, and the run length. The remaining pixels are copied in sequence. The resulting sequence is:
22 23 ff 24 07 25 ff 26 06 25 24
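The encoding used in this example can be sketched as follows. It is a minimal version that assumes the input never contains 0xff, as the slide does; the `min_run` threshold of 4 is an assumption made here (a run of 3 costs 3 bytes either way), and the function names are illustrative.

```python
RUN_MARKER = 0xFF  # byte value assumed absent from the image data

def rle_encode(data: bytes, min_run: int = 4) -> bytes:
    """Replace each run of at least min_run equal bytes with marker, value, count."""
    out = bytearray()
    i = 0
    while i < len(data):
        value = data[i]
        run = 1
        # Measure the run starting at i; the count must fit in one byte.
        while i + run < len(data) and data[i + run] == value and run < 255:
            run += 1
        if run >= min_run:
            out += bytes([RUN_MARKER, value, run])
        else:
            out += data[i:i + run]  # short runs are copied unchanged
        i += run
    return bytes(out)

def rle_decode(encoded: bytes) -> bytes:
    """Expand marker, value, count triples and copy every other byte as is."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        if encoded[i] == RUN_MARKER:
            value, run = encoded[i + 1], encoded[i + 2]
            out += bytes([value]) * run
            i += 3
        else:
            out.append(encoded[i])
            i += 1
    return bytes(out)

pixels = bytes.fromhex("22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24")
print(rle_encode(pixels).hex(" "))  # 22 23 ff 24 07 25 ff 26 06 25 24
```

The 18 input bytes become 11, and `rle_decode` restores the original sequence exactly.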
Suppressing Repeating Sequences
► Run-length encoding (cont’d)
– an example of redundancy reduction
– cons:
• does not guarantee any particular amount of space savings
• under some circumstances, the compressed image is larger than the original image
– Why? Can you prevent this?
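One way to explore the question is with a deliberately naive variant that writes every run, even a run of length 1, as a (count, value) pair. The function `rle_pairs` and the two test inputs are made up for this illustration.

```python
def rle_pairs(data: bytes) -> bytes:
    """Naive RLE: every run, including runs of length 1, becomes (count, value)."""
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([run, data[i]])
        i += run
    return bytes(out)

no_runs = bytes(range(16))      # 16 different values, no repetition at all
long_runs = bytes([7]) * 16     # one value repeated 16 times

print(len(no_runs), len(rle_pairs(no_runs)))      # the output doubles in size
print(len(long_runs), len(rle_pairs(long_runs)))  # the output shrinks to one pair
```

Data without runs pays the per-run overhead and gets nothing back. Copying short sequences unchanged and encoding only runs long enough to repay the overhead, as in the 0xff example earlier, is one way to limit the growth, provided the indicator byte never occurs in the data.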
RLE
Ex 1: Here we have a series of blue x 6, magenta x 7, red x 3, yellow x 3 and green x 4.
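The colour series from Ex 1 can be turned into (value, count) runs with `itertools.groupby` from the standard library. This is a small sketch; the list `cells` simply spells out the series described above, and `runs` is a name chosen here.

```python
from itertools import groupby

def runs(seq):
    """Collapse consecutive equal items into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(seq)]

cells = ["blue"] * 6 + ["magenta"] * 7 + ["red"] * 3 + ["yellow"] * 3 + ["green"] * 4
print(runs(cells))
# [('blue', 6), ('magenta', 7), ('red', 3), ('yellow', 3), ('green', 4)]
```

The 23 items are described by 5 pairs.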
Assigning Variable-Length Codes
► Morse code: the oldest and most common variable-length coding scheme
► Some values occur more frequently than others
– those values should take the least amount of space
► Huffman coding
– based on probability of occurrence
• determine the probability of each value occurring
• build a binary tree with a search path for each value
• more frequently occurring values are given shorter search paths in the tree
Variable-length encoding
► Variable-length encoding is any encoding scheme in which the codes are of different lengths. More frequently occurring values are given shorter codes than less frequently occurring values. Huffman encoding is an example of variable-length encoding.
► A Huffman code determines the probability of each value occurring in the data set and then builds a binary tree in which the search path for each value represents the code for that value.
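The benefit of giving shorter codes to frequent values can be checked by computing the average number of bits per value. The counts and both code tables below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def average_bits(freqs, codes):
    """Average code length per value, weighted by how often each value occurs."""
    total = sum(freqs.values())
    return sum(freqs[sym] * len(codes[sym]) for sym in freqs) / total

freqs = {"a": 5, "b": 2, "c": 1}                # hypothetical occurrence counts
fixed = {"a": "00", "b": "01", "c": "10"}       # fixed-length: 2 bits each
variable = {"a": "0", "b": "10", "c": "11"}     # shortest code for the most frequent value

print(average_bits(freqs, fixed))     # 2.0
print(average_bits(freqs, variable))  # 1.375
```

With these counts the variable-length table needs (5*1 + 2*2 + 1*2) / 8 = 1.375 bits per value instead of 2.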
Huffman Encoding
► Compression
– Typically, in files and messages,
• each character requires 1 byte, or 8 bits
• that already wastes 1 bit for most purposes!
► Question
– What’s the smallest number of bits that can be used to store an arbitrary piece of text?
► Idea
– Find the frequency of occurrence of each character
– Encode frequent characters as short bit strings
– Encode rarer characters as longer bit strings
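The first step of the idea, finding how often each character occurs, might look like this with `collections.Counter`. The sample string is arbitrary and `char_frequencies` is a name chosen for the sketch.

```python
from collections import Counter

def char_frequencies(text: str):
    """Return (character, count) pairs, most frequent first."""
    return Counter(text).most_common()

print(char_frequencies("aaabbc"))  # [('a', 3), ('b', 2), ('c', 1)]
```

The characters at the front of this list are the candidates for the shortest bit strings.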
Huffman Encoding
► Encoding
– Use a tree
– Encode by following the tree to a leaf
– e.g.
• E is 00
• S is 011
– Frequent characters (E, T): 2-bit encodings
– Others (A, S, N, O): 3-bit encodings
Huffman Encoding
► Encoding
– Use a tree
• Inefficient in practice
– Use a direct-addressed lookup table

Character   Code
A           010
B           :
:           :
E           00
:           :
N           110
:           :
S           011
T           10

► Finding the optimal encoding
– Smallest number of bits to represent arbitrary text
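A direct-addressed table can be sketched as a list indexed by character code, so encoding is one array access per character. Only the five codes that appear on these slides are filled in, with S taken as 011 as in the tree example so that no code is a prefix of another. The table size of 128 (7-bit ASCII) and the names `TABLE` and `encode` are assumptions of this sketch.

```python
# One slot per 7-bit ASCII character; None marks characters without a code here.
TABLE = [None] * 128
for ch, code in {"A": "010", "E": "00", "N": "110", "S": "011", "T": "10"}.items():
    TABLE[ord(ch)] = code

def encode(text: str) -> str:
    """Concatenate the code of each character, found by direct indexing.

    A character with no table entry makes str.join raise TypeError.
    """
    return "".join(TABLE[ord(ch)] for ch in text)

print(encode("SEAT"))  # 0110001010
```

"SEAT" takes 10 bits here instead of 32 bits at one byte per character.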
Huffman Encoding
► Divide and conquer
– Decide on a root: n choices
– Decide on roots for sub-trees: n choices
– Repeat n times: O(n!)
► Greedy approach
– Sort characters by frequency
– Form the two lowest-weight nodes into a sub-tree
• Sub-tree weight = sum of weights of nodes
– Move the new tree to its correct place
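The greedy steps above can be sketched with a sorted list and `bisect.insort`, which finds the correct place for the new sub-tree by binary search. The frequencies in the example are hypothetical, symbols are assumed to be single-character strings, the input is assumed non-empty, and all function names are invented for this sketch.

```python
import bisect
from itertools import count

def build_huffman_tree(freqs):
    """Greedy construction: repeatedly merge the two lowest-weight trees."""
    tie = count()  # unique tie-breaker so trees themselves are never compared
    # Each entry is (weight, tie_breaker, tree); a tree is a symbol or a (left, right) pair.
    forest = sorted((w, next(tie), sym) for sym, w in freqs.items())
    while len(forest) > 1:
        # list.pop(0) is linear-time in Python; it is used here for clarity only.
        w1, _, t1 = forest.pop(0)
        w2, _, t2 = forest.pop(0)
        # Sub-tree weight = sum of weights; insort moves it to its correct place.
        bisect.insort(forest, (w1 + w2, next(tie), (t1, t2)))
    return forest[0][2]

def assign_codes(tree, prefix=""):
    """Read codes off the tree: left edge = 0, right edge = 1."""
    if not isinstance(tree, tuple):
        return {tree: prefix or "0"}  # a lone symbol still needs one bit
    left, right = tree
    codes = assign_codes(left, prefix + "0")
    codes.update(assign_codes(right, prefix + "1"))
    return codes

freqs = {"E": 8, "T": 6, "A": 3, "S": 2}  # hypothetical frequencies
print(assign_codes(build_huffman_tree(freqs)))
# {'E': '0', 'S': '100', 'A': '101', 'T': '11'}
```

With these counts the most frequent character gets a 1-bit code and the two rarest get 3-bit codes, for 35 bits in total.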
Huffman Encoding - Operation
► Initial sequence, sorted by frequency.
► Combine the lowest two into a sub-tree and move it to its correct place.
► After shifting the sub-tree to its correct place, combine the next lowest pair and move the new sub-tree to its correct place.
► Now the lowest two are the “14” sub-tree and D. Combine them and move the new tree to its correct place.
► Now the lowest two are the “25” and “30” trees. Combine them and move the new tree to its correct place.
► Combine the last two trees.
Huffman Encoding - Decoding
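Decoding can be sketched as walking the code tree one bit at a time and emitting a character whenever a leaf is reached, which works because no code is a prefix of another. The code table reuses the values shown in the encoding slides (with S as 011, as in the tree example); the nested-dictionary tree and the function names are assumptions of this sketch.

```python
CODES = {"A": "010", "E": "00", "N": "110", "S": "011", "T": "10"}

def build_tree(codes):
    """Build a nested-dict tree: inner nodes are dicts keyed by '0'/'1', leaves are characters."""
    root = {}
    for sym, code in codes.items():
        node = root
        for bit in code[:-1]:
            node = node.setdefault(bit, {})
        node[code[-1]] = sym
    return root

def decode(bits: str, root) -> str:
    """Follow one edge per bit; on reaching a leaf, emit it and restart at the root."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):
            out.append(node)
            node = root
    return "".join(out)

print(decode("0110001010", build_tree(CODES)))  # SEAT
```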
Huffman Encoding - Time Complexity
► Sort keys: O(n log n)
► Repeat n times
– Form new sub-tree: O(1)
– Move sub-tree: O(log n) (binary search)
– Total: O(n log n)
► Overall: O(n log n)
Irreversible Compression Techniques
► Compression in which information is lost.
– Ex: shrinking a raster image from 400 by 400 pixels to 100 by 100 pixels.
► There is no way to determine what the original pixels were from the one new pixel.
► Irreversible compression is less common in data files than reversible compression, but there are times when the information that is lost is of little or no value.
– Ex: speech compression.
Lossy Compression Techniques
►Some information can be sacrificed
►Less common in data files
►Shrinking raster image
– 400-by-400 pixels to 100-by-100 pixels
– 1 pixel for every 16 pixels
►Speech compression
– voice coding (the lost information is of little or no value)
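The raster example can be sketched by replacing each 4-by-4 block of pixels with a single value. Averaging the block is an assumption made here (the slides only say 1 pixel for every 16); the image is taken to be a list of rows of grey levels whose dimensions are multiples of `factor`, and `downsample` is an illustrative name.

```python
def downsample(image, factor=4):
    """Replace each factor-by-factor block with the integer average of its pixels."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out

image = [[(r + c) % 256 for c in range(400)] for r in range(400)]
small = downsample(image)
print(len(small), len(small[0]))  # 100 100

# Two different blocks collapse to the same pixel, so the original cannot be recovered.
print(downsample([[0, 4], [4, 0]], 2), downsample([[2, 2], [2, 2]], 2))  # [[2]] [[2]]
```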