Today’s topics Data Representation Memory organization Compression Upcoming Systems Connections Algorithms Reading Computer Science, Chapter 1-3 CompSci 001 3.1 Main Memory Cells Cell: A unit of main memory (typically 8 bits which is one byte) Most significant bit: the bit at the left (high-order) end of the conceptual row of bits in a memory cell Least significant bit: the bit at the right (low-order) end of the conceptual row of bits in a memory cell CompSci 001 3.2 Brookshear 1-2 Figure 1.7 The organization of a bytesize memory cell 3.3 CompSci 001 Brookshear 1-3 Main Memory Addresses Address: A “name” that uniquely identifies one cell in the computer’s main memory The names are actually numbers. These numbers are assigned consecutively starting at zero. Numbering the cells in this manner associates an order with the memory cells. 3.4 CompSci 001 1-4 Figure 1.8 Memory cells arranged by address CompSci 001 3.5 Brookshear 1-5 Memory Terminology Random Access Memory (RAM): Memory in which individual cells can be easily accessed in any order Dynamic Memory (DRAM): RAM composed of volatile memory CompSci 001 3.6 Brookshear 1-6 Measuring Memory Capacity Kilobyte: 210 bytes = 1024 bytes Megabyte: 220 bytes = 1,048,576 bytes Example: 3 KB = 3 times1024 bytes Sometimes “kibi” rather than “kilo” Example: 3 MB = 3 times 1,048,576 bytes Sometimes “megi” rather than “mega” Gigabyte: 230 bytes = 1,073,741,824 bytes Example: 3 GB = 3 times 1,073,741,824 bytes Sometimes “gigi” rather than “giga” CompSci 001 3.7 Brookshear 1-7 Mass Storage On-line versus off-line Typically larger than main memory Typically less volatile than main memory Typically slower than main memory CompSci 001 3.8 Brookshear 1-8 Mass Storage Systems Magnetic Systems Disk Tape Optical Systems CD DVD Flash Drives CompSci 001 3.9 Brookshear 1-9 Figure 1.9 A magnetic disk storage system CompSci 001 3.10 Brookshear 1-10 Figure 1.10 Magnetic tape storage CompSci 001 3.11 Brookshear 1-11 Figure 1.11 CD storage CompSci 001 3.12 Brookshear 1-12 Files File: A unit of data stored in mass storage system Fields and keyfields Physical record versus Logical record Buffer: A memory area used for the temporary storage of data (usually as a step in transferring the data) CompSci 001 3.13 Brookshear 1-13 Figure 1.12 Logical records versus physical records on a disk CompSci 001 3.14 Brookshear 1-14 Data Compression Compression is a high-profile application .zip, .mp3, .jpg, .gif, .gz, … What property of MP3 was a significant factor in what made Napster work (why did Napster ultimately fail?) Why do we care? Secondary storage capacity doubles every year Disk space fills up quickly on every computer system More data to compress than ever before CompSci 001 3.15 More on Compression What’s the difference between compression techniques? .mp3 files and .zip files? .gif and .jpg? Lossless and lossy Is it possible to compress (lossless) every file? Why? Lossy methods Good for pictures, video, and audio (JPEG, MPEG, etc.) Lossless methods Run-length encoding, Huffman, LZW, … 11 3 5 3 2 6 2 6 5 3 5 3 5 3 10 CompSci 001 3.16 Text Compression Input: String S Output: String S Shorter S can be reconstructed from S CompSci 001 3.17 Figure 1.13 The message “Hello.” in ASCII CompSci 001 3.18 Brookshear 1-18 Text Compression: Examples “abcde” in the different formats ASCII: 01100001011000100110001101100100… Fixed: 000001010011100 1 Encodings ASCII: 8 bits/character Unicode: 16 bits/character 0 Symbol ASCII Fixed length Var. length a 01100001 000 000 b 01100010 001 11 c 01100011 010 01 d 01100100 011 001 e 01100101 100 10 0 0 a Var: 000110100110 1 1 b 0 1 c d 0 0 e 1 0 0 0 CompSci 001 0 a 1 1 d c 0 e 1 b 3.19 Huffman coding: go go gophers ASCII g 103 o 111 p 112 h 104 e 101 r 114 s 115 sp. 32 3 bits Huffman 1100111 1101111 1110000 1101000 1100101 1110010 1110011 1000000 000 001 010 011 100 101 110 111 00 01 1100 1101 1110 1111 101 101 13 7 6 g o 3 3 Encoding uses tree: 0 left/1 right How many bits? 37!! Savings? Worth it? 4 3 s * 1 2 2 2 p h e r 1 1 1 1 6 4 2 2 g o 3 3 3 CompSci 001 p h e r 1 1 1 1 s * 1 2 3.20 Huffman Coding D.A Huffman in early 1950’s Before compressing data, analyze the input stream Represent data using variable length codes Variable length codes though Prefix codes Each letter is assigned a codeword Codeword is for a given letter is produced by traversing the Huffman tree Property: No codeword produced is the prefix of another Letters appearing frequently have short codewords, while those that appear rarely have longer ones Huffman coding is optimal per-character coding method CompSci 001 3.21 Building a Huffman tree Begin with a forest of single-node trees (leaves) Each node/tree/leaf is weighted with character count Node stores two values: character and count There are n nodes in forest, n is size of alphabet? Repeat until there is only one node left: root of tree Remove two minimally weighted trees from forest Create new tree with minimal trees as children, • New tree root's weight: sum of children (character ignored) Does this process terminate? How do we get minimal trees? Remove minimal trees, hummm…… CompSci 001 3.22 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 11 6 5 5 4 4 3 3 3 3 2 2 2 2 2 1 1 I E N S M A B O T G D L R U P F CompSci 001 1 3.23 C Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 2 11 1 1 F C 6 5 5 4 4 3 3 3 3 I E N S M A B O T CompSci 001 2 2 2 2 2 2 1 G D L R U P 3.24 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 2 11 3 1 1 1 2 C F P U 6 5 5 4 4 I E N S M CompSci 001 3 3 3 3 3 A B O T 2 2 2 2 2 G D L R 3.25 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 4 2 11 3 1 1 1 2 C F P U 6 5 5 I E N CompSci 001 4 4 4 S M 3 2 2 L R 2 3 3 3 3 A B O T 2 2 G D 3.26 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 4 2 11 3 1 1 1 2 C F P U 6 5 5 I E N CompSci 001 4 4 4 4 S M 3 4 2 2 2 2 G D L R 3 3 3 3 A B O T 2 3.27 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 5 3 4 2 3 O 11 6 1 1 1 2 C F P U 5 CompSci 001 I 5 5 E N 4 4 4 4 S M 4 2 2 2 2 G D L R 3 3 3 A B T 3 3.28 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 5 3 6 2 3 O 11 6 4 3 T 1 1 1 2 C F P U 6 CompSci 001 I 5 5 5 E N 4 4 4 2 2 2 2 G D L R 4 4 3 3 S M A B 3.29 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 6 5 3 6 2 3 O 11 6 4 3 T 1 1 1 2 C F P U 6 CompSci 001 6 I 5 5 5 E N 4 4 4 2 2 2 2 G D L R 4 4 S M 3 3 A B 3.30 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 6 8 5 3 6 2 3 O 11 8 4 4 S M 4 3 T 1 1 1 2 C F P U 6 CompSci 001 6 6 I 5 5 5 E N 4 4 2 2 2 2 G D L R 4 3 3 A B 3.31 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 8 5 3 6 2 3 O 11 8 4 4 S M 4 3 T 1 1 1 2 C F P U 8 CompSci 001 6 6 6 I 5 6 8 5 5 E N 4 2 2 2 2 G D L R 3 3 A B 3.32 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 10 5 8 5 6 E 3 2 3 O 11 10 4 4 S M 4 3 T 1 1 1 2 C F P U 8 CompSci 001 8 6 6 6 I 5 6 8 5 N 4 2 2 2 2 G D L R 3 3 A B 3.33 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 10 5 11 5 5 E 6 N 3 2 3 O 11 8 11 3 T 1 1 1 2 C F P U 10 CompSci 001 8 8 6 6 I 6 8 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.34 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 12 10 11 8 6 8 6 I 5 5 5 E N 3 2 3 O 12 6 11 3 T 1 1 1 2 C F P U 11 CompSci 001 10 8 8 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.35 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 16 10 11 12 8 6 8 6 I 5 5 5 E N 3 2 3 O 16 6 12 3 T 1 1 1 2 C F P U 11 CompSci 001 11 10 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.36 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 21 16 10 11 12 8 6 8 6 I 5 5 5 E N 3 2 3 O 21 6 16 3 T 1 1 1 2 C F P U 12 CompSci 001 11 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.37 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E N 3 2 3 O 23 6 21 3 T 1 1 1 2 C F P U 16 CompSci 001 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.38 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E N 3 2 3 O 37 6 23 3 T 1 1 1 2 C F P U CompSci 001 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.39 Building a tree “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E N 3 2 3 O 60 6 3 T 1 1 1 2 C F P U CompSci 001 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.40 Properties of Huffman coding Want to minimize weighted path length L(T)of tree T L(T ) d w i iLeaf ( T ) i wi is the weight or count of each codeword i di is the leaf corresponding to codeword i How do we calculate character (codeword) frequencies? Huffman coding creates pretty full bushy trees? When would it produce a “bad” tree? How do we produce coded compressed data from input efficiently? CompSci 001 3.41 Decoding a message 01100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.42 Decoding a message 1100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.43 Decoding a message 100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.44 Decoding a message 00000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R 3 3 A B 3.45 Decoding a message 0000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 3.46 Decoding a message 000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 3.47 Decoding a message 00100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 3.48 Decoding a message 0100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 3.49 Decoding a message 100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U G 4 4 2 2 2 2 G D L R 3 3 A B 3.50 Decoding a message 00001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 3.51 Decoding a message 0001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 3.52 Decoding a message 001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 3.53 Decoding a message 01001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 3.54 Decoding a message 1001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GO 4 4 2 2 2 2 G D L R 3 3 A B 3.55 Decoding a message 001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 3.56 Decoding a message 01101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 3.57 Decoding a message 1101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 3.58 Decoding a message 101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 3.59 Decoding a message 01 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 4 4 S M 3 T 1 1 1 2 C F P U GOO 4 4 2 2 2 2 G D L R 3 3 A B 3.60 Decoding a message 1 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R GOOD 3 3 A B 3.61 Decoding a message 01100000100001001101 60 37 23 21 16 10 11 12 8 11 6 8 6 I 5 5 5 E 6 N 3 2 3 O CompSci 001 3 T 1 1 1 2 C F P U 4 4 S M 4 4 2 2 2 2 G D L R GOOD 3 3 A B 3.62 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.63 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.64 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.65 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.66 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.67 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.68 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.69 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.70 Huffman Tree 2 “A SIMPLE STRING TO BE ENCODED USING A MINIMAL NUMBER OF BITS” E.g. “ A SIMPLE” “10101101001000101001110011100000” CompSci 001 3.71 Data Compression Year Scheme Bit/Cha r 1967 ASCII 7.00 1950 Huffman 4.70 1977 Lempel-Ziv (LZ77) 3.94 1984 Lempel-Ziv-Welch (LZW) – Unix 3.32 compress Why is data compression important? How well can you compress files losslessly? 1987 (LZH) used by zip and unzip 3.30 1987 Move-to-front 3.24 1987 gzip 2.71 1995 Burrows-Wheeler 2.29 1997 BOA (statistical data compression) 1.99 CompSci 001 Is there a limit? How to compare? How do you measure how much information? 3.72 Other methods Adaptive Huffman coding Lempel-Ziv algorithms Build the coding table on the fly while reading document Coding table changes dynamically Protocol between encoder and decoder so that everyone is always using the right coding scheme Works well in practice (compress, gzip, etc.) More complicated methods Burrows-Wheeler (bunzip2) PPM statistical methods CompSci 001 3.73