• Conceptual understanding, details left to …
• All information here, we won’t discuss details
• Ruby, Scheme, …
Compsci 100, Spring 2010 18.1
Multiply two near-zero numbers, what happens?
Add their logarithms: log(a)+log(b) = log(ab), invertible
What is log of 10^-13? Benefits of transform?
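A minimal sketch of why the log transform helps, assuming we multiply thirty copies of 10^-13 as doubles. The class and method names here are hypothetical, chosen only for illustration. The direct product underflows to zero, while the sum of logarithms stays an ordinary, invertible number.

```java
import java.util.Arrays;

class LogProductSketch {
    // Multiply the values directly; many near-zero factors underflow to 0.0.
    static double directProduct(double[] values) {
        double product = 1.0;
        for (double v : values) {
            product *= v;
        }
        return product;
    }

    // Add natural logs instead: log(a) + log(b) = log(ab).
    static double logOfProduct(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += Math.log(v);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] tiny = new double[30];
        Arrays.fill(tiny, 1e-13);
        System.out.println("direct product: " + directProduct(tiny));
        System.out.println("log of product: " + logOfProduct(tiny));
    }
}
```

The true product, 10^-390, is smaller than any positive double, so only the log form keeps the information.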
What is FFT: Fast Fourier Transform?
O(n log n) method for computing a Fourier Transform
Better than O(n^2), huge difference for lots of data points
Shazam? how shazam might work
Feature extraction from images: faces, edges, lines, …
Hough transform
Wavelet transforms do something too, but …
http://en.wikipedia.org/wiki/Ingrid_Daubechies
Michael Burrows and David Wheeler in 1994, BWT
By itself it is NOT a compression scheme
It’s used to preprocess data, or transform data, to make it more amenable to compression like Huffman Coding
Huff depends on redundancy/repetition, as do many compression schemes
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
http://marknelson.us/1996/09/01/bwt
Main idea in BWT: transform the data into something more compressible and make the transform fast, though it will be slower than no transform
TANSTAAFL (what does this mean?)
Invented subroutine
“Wheeler was an inspiring teacher who helped to develop computer science teaching at Cambridge from its inception in 1953, when the Diploma in Computer Science was launched as the world's first taught course in computing.”
He's one of the pioneers of the information age. His invention of Alta Vista helped open up an entire new route for the information highway that is still far from fully explored. His work history, intertwined with the development of the high-tech industry over the past two decades, is distinctly a tale of scientific genius
http://www.stanford.edu/group/gpj/cgi-bin/drupal/?q=node/60
BWT is a block transform: it requires storing n copies of the file and O(n log n) time to sort the copies (file has length n)
We can’t really do this in practice in terms of storage
Instead of storing n copies of the file, store one copy and an integer index (break file into blocks of size n)
But sorting is still O(n log n) and it’s actually worse
Each comparison in the sort looks at the entire file
In normal sort analysis the comparison is O(1), strings are small
Now we have key comparison of O(n), so sort is actually…
O(n^2 log n), why?
Remember, goal is to exploit/create repetition (redundancy)
Create repetition as follows
Consider original text: duke blue devils.
Create n copies by shifting/rotating by one character
0: "duke blue devils."
1: "uke blue devils.d"
2: "ke blue devils.du"
3: "e blue devils.duk"
4: " blue devils.duke"
5: "blue devils.duke "
6: "lue devils.duke b"
7: "ue devils.duke bl"
8: "e devils.duke blu"
9: " devils.duke blue"
10: "devils.duke blue "
11: "evils.duke blue d"
12: "vils.duke blue de"
13: "ils.duke blue dev"
14: "ls.duke blue devi"
15: "s.duke blue devil"
16: ".duke blue devils"
Once we have n copies (but not really n copies!)
Sort the copies
Remember the comparison will be O(n)
We’ll look at the last column, see next slide
• What’s true about first column?
4: " blue devils.duke"
9: " devils.duke blue"
16: ".duke blue devils"
5: "blue devils.duke "
10: "devils.duke blue "
0: "duke blue devils."
3: "e blue devils.duk"
8: "e devils.duke blu"
11: "evils.duke blue d"
13: "ils.duke blue dev"
2: "ke blue devils.du"
14: "ls.duke blue devi"
6: "lue devils.duke b"
15: "s.duke blue devil"
7: "ue devils.duke bl"
1: "uke blue devils.d"
12: "vils.duke blue de"
Properties of first column
Lexicographical order
Maximally ‘clumped’ why?
From it, can we create last?
Properties of last column
Some clumps (real files)
Can we create first? Why?
See row labeled 8:
Last char precedes first in original! True for all rows!
Can recreate everything:
Simple (code) but hard (idea)
Contains every character of original file
Why is there repetition in the last column?
Is there repetition in the first column?
Keep the last column because we can recreate the first
What’s in every column of the sorted list?
If we have the last column we can create the first
• Sorting the last column yields first
We can create every column which means if we know what row the original text is in we’re done!
• Look back at sorted rows, what row has index 0?
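One way to recreate the original from the last column can be sketched as follows. This is an illustrative sketch, not the course's provided code: the class and method names are hypothetical, and it assumes every character value is below 256. It builds the first column by counting, links the k-th occurrence of each character in the first column to its k-th occurrence in the last column, then follows those links starting from the row that holds the original text.

```java
class BwtInverseSketch {
    // last: the last column of the sorted rotations
    // originalRow: index of the sorted row that equals the original text
    static String inverse(String last, int originalRow) {
        int n = last.length();

        // count[c] becomes the first row whose first character is c
        int[] count = new int[257];
        for (int i = 0; i < n; i++) {
            count[last.charAt(i) + 1]++;
        }
        for (int c = 0; c < 256; c++) {
            count[c + 1] += count[c];
        }

        // Sorting the last column yields the first column. While placing each
        // character, remember which row of the last column it came from.
        char[] first = new char[n];
        int[] next = new int[n];
        for (int j = 0; j < n; j++) {
            char c = last.charAt(j);
            int i = count[c]++;
            first[i] = c;
            next[i] = j; // row j is row i rotated left by one character
        }

        // Walk the links: each step emits the next character of the original.
        StringBuilder text = new StringBuilder(n);
        int row = originalRow;
        for (int k = 0; k < n; k++) {
            text.append(first[row]);
            row = next[row];
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // Last column and row index read off the sorted rotations of the example text.
        System.out.println(inverse("ees  .kudvuiblled", 5));
    }
}
```

Both passes are linear in the block length, which is why only the last column and one integer need to be kept.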
How do we avoid storing n copies of the input file?
Store once with index of what the first character is
0 and “duke blue devils.” is the original string
3 and “duke blue devils.” is “e blue devils.duk”
What is 7 and “duke blue devils.”
You’ll be given a class Rotatable that can be sorted
Construct object from original text and index
When compared, use the index as a place to start
Rotatable can report the last char of any “row”
Rotatable can report its index (stored on construction)
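The idea of sorting rotations without copying them can be sketched like this. The course's Rotatable class is not shown here, so this is a stand-in under assumed names (RotationSortSketch and its methods are hypothetical): it sorts the starting indexes and compares rotations character by character in place.

```java
import java.util.Arrays;

class RotationSortSketch {
    // The rotation that starts at index start, built only for display.
    static String rotation(String text, int start) {
        return text.substring(start) + text.substring(0, start);
    }

    // Compare two rotations by their starting indexes; O(n) per comparison.
    static int compareRotations(String text, int a, int b) {
        int n = text.length();
        for (int k = 0; k < n; k++) {
            char ca = text.charAt((a + k) % n);
            char cb = text.charAt((b + k) % n);
            if (ca != cb) {
                return ca - cb;
            }
        }
        return 0;
    }

    // Starting indexes of all rotations, in sorted rotation order.
    static Integer[] sortedStarts(String text) {
        Integer[] starts = new Integer[text.length()];
        for (int i = 0; i < starts.length; i++) {
            starts[i] = i;
        }
        Arrays.sort(starts, (a, b) -> compareRotations(text, a, b));
        return starts;
    }

    // Last character of each sorted row: the character just before its start.
    static String lastColumn(String text, Integer[] starts) {
        int n = text.length();
        StringBuilder last = new StringBuilder(n);
        for (int start : starts) {
            last.append(text.charAt((start + n - 1) % n));
        }
        return last.toString();
    }

    // Row of the sorted list that holds the original text (start index 0).
    static int originalRow(Integer[] starts) {
        for (int row = 0; row < starts.length; row++) {
            if (starts[row] == 0) {
                return row;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "duke blue devils.";
        Integer[] starts = sortedStarts(text);
        System.out.println(lastColumn(text, starts) + " " + originalRow(starts));
    }
}
```

Only one copy of the text and n integers are stored, at the cost of the O(n) comparison discussed earlier.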
To transform, all we need is the last column and the index of the row that holds the original string in the list of sorted strings
We take these two pieces of information and either compress them or transform them further
After the transform we run Huff on the result
We can’t store/sort a huge file, what do we do?
Process big files in chunks/blocks
• Read block, transform block, Huff block
• Read block, transform block, Huff block…
• Block size may impact performance
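The read-a-block loop might look like the sketch below. The names BlockReaderSketch and readBlock are hypothetical, and the block size of 5 is chosen only to make the output easy to follow; a real block would be far larger. An empty array signals that no more blocks remain, and the last block may be shorter than the rest.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

class BlockReaderSketch {
    // Read up to blockSize 8-bit values; the result is shorter at end of input.
    static char[] readBlock(InputStream in, int blockSize) {
        char[] buffer = new char[blockSize];
        int count = 0;
        try {
            while (count < blockSize) {
                int value = in.read(); // 0..255, or -1 at end of stream
                if (value == -1) {
                    break;
                }
                buffer[count++] = (char) value;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return Arrays.copyOf(buffer, count);
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream("duke blue devils.".getBytes());
        while (true) {
            char[] block = readBlock(in, 5);
            if (block.length < 1) {
                break;
            }
            // Transform block, then Huff block, would happen here.
            System.out.println("[" + new String(block) + "]");
        }
    }
}
```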
First look at code for HuffProcessor.compress
Tree already made, preprocessCompress
How do writeHeader and writeCompressedData work?
public int compress(InputStream in, OutputStream out) {
    BitOutputStream bout = new BitOutputStream(out);
    BitInputStream bin = new BitInputStream(in);
    int bitCount = 0;
    myRoot = makeTree();
    makeMapEncodings(myRoot, "");
    bitCount += writeHeader(bout);
    bitCount += writeCompressedData(bin, bout);
    bout.flush();
    return bitCount;
}
Read a block of data, transform it, then huff it
To huff we write a magic number, write header/tree, and write compressed bits based on Huffman encodings
We already have huff code, need to use on a transformed bunch of characters rather than on the input file
So process input stream by passing it to BW transform which reads a chunk and returns char[] , the last column
A char is a 16-bit unsigned value. We only need an 8-bit value, but we use char because we can’t use byte
• In Java byte is signed: -128..127
• What does all that mean?
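A small sketch of what the signed byte means in practice. The class name and helper are hypothetical. An 8-bit value such as 200 does not fit in a Java byte as a positive number, while a char holds it unchanged; masking with 0xff recovers the unsigned value from a byte.

```java
class SignedByteSketch {
    // Interpret the 8 bits of a byte as an unsigned value in 0..255.
    static int toUnsigned(byte b) {
        return b & 0xff;
    }

    public static void main(String[] args) {
        int value = 200;               // an 8-bit chunk read from a file
        byte asByte = (byte) value;    // same bits, but read as a signed number
        char asChar = (char) value;    // unsigned, so the value is unchanged
        System.out.println("as byte: " + asByte);
        System.out.println("as char: " + (int) asChar);
        System.out.println("byte masked with 0xff: " + toUnsigned(asByte));
    }
}
```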
We want to use existing compression code we wrote before
Read a block of 8-bit chunks, store in char[] array
Repeat until no more blocks, last block not full?
Block as char[] , treat as stream and feed it to Huff
• Count characters, make tree, compress
We need an Adapter, something that takes char[] array and turns it into an InputStream which we feed to Huff compressor
ByteArrayInputStream, turns byte[] to stream
We can store 8-bit chunks as bytes for stream purposes
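The adapter step can be sketched as below, assuming each char in the block holds an 8-bit value. The class and method names are hypothetical. The cast to byte keeps the low 8 bits, and ByteArrayInputStream.read() hands each one back as a value in 0..255, so nothing is lost on the way to the Huffman code.

```java
import java.io.ByteArrayInputStream;

class CharBlockAdapterSketch {
    // Wrap a block of 8-bit values (held in chars) as a stream of bytes.
    static ByteArrayInputStream asStream(char[] block) {
        byte[] bytes = new byte[block.length];
        for (int k = 0; k < block.length; k++) {
            bytes[k] = (byte) block[k]; // keeps the low 8 bits
        }
        return new ByteArrayInputStream(bytes);
    }

    public static void main(String[] args) {
        ByteArrayInputStream stream = asStream(new char[] {97, 200, 0});
        int value;
        while ((value = stream.read()) != -1) {
            System.out.println(value);
        }
    }
}
```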
public int compress(InputStream in, OutputStream out) {
    BitOutputStream bout = new BitOutputStream(out);
    BitInputStream bin = new BitInputStream(in);
    int bitCount = 0;
    BurrowsWheeler bwt = new BurrowsWheeler();
    while (true) {
        // transform the next block; an empty result means no more input
        char[] chunk = bwt.transform(bin);
        if (chunk.length < 1) break;
        chunk = bwt.mtf(chunk);
        // copy the 8-bit values into a byte[] so they can be read as a stream
        byte[] array = new byte[chunk.length];
        for (int k = 0; k < array.length; k++) {
            array[k] = (byte) chunk[k];
        }
        // first pass over the block: count characters, build tree and encodings
        ByteArrayInputStream bas = new ByteArrayInputStream(array);
        preprocessInitialize(bas);
        myRoot = makeTree();
        makeMapEncodings(myRoot, "");
        // second pass over the block: write header and compressed bits
        BitInputStream blockBis = new BitInputStream(new ByteArrayInputStream(array));
        bitCount += writeHeader(bout);
        bitCount += writeCompressedData(blockBis, bout);
    }
    bout.flush();
    return bitCount;
}
Untransforming is very slick
Basically sort the last column in O(n) time
Run an O(n) algorithm to get back original block
We sort the last column in O(n) time using a counting sort, which is sometimes one phase of radix sort
Call sort: easier to code and a good first step
The counting sort leverages that we’re sorting “characters”, whatever we read when doing compression, which is an 8-bit chunk
How many different 8-bit chunks are there?
If we have an array of integers all of whose values are between 0 and 255, how can we sort by counting number of occurrences of each integer?
Suppose we have 4 occurrences of one, 1 occurrence of two, 3 occurrences of five and 2 occurrences of seven, what’s the sorted array? (we don’t know the original, just the counts)
What’s the answer? How do we write code to do this?
More than one way; as long as it’s O(n), the choice doesn’t really matter
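One of the possible ways to sort by counting, as a sketch with hypothetical names. It assumes every value is between 0 and 255: count the occurrences of each value, then write each value out as many times as it was counted.

```java
import java.util.Arrays;

class CountingSortSketch {
    // Sort values known to lie in 0..255 in O(n + 256) time.
    static int[] sort(int[] values) {
        int[] counts = new int[256];
        for (int v : values) {
            counts[v]++;
        }
        int[] sorted = new int[values.length];
        int pos = 0;
        for (int v = 0; v < counts.length; v++) {
            for (int c = 0; c < counts[v]; c++) {
                sorted[pos++] = v;
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        // 4 ones, 1 two, 3 fives, 2 sevens, in scrambled order
        int[] values = {5, 1, 7, 1, 2, 5, 1, 7, 5, 1};
        System.out.println(Arrays.toString(sort(values)));
    }
}
```

The output depends only on the counts, which is why knowing the counts alone is enough to answer the question above.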
In practice we can introduce more repetition and redundancy using a Move-to-front transform (MTF)
We’re going to compress a sequence of numbers (the 8-bit chunks we read, might be the last column from BWT)
Instead of just writing the numbers, use MTF to write
Introduce more redundancy/repetition if there are runs of characters. For example: consider “aaadddfff”
As numbers this is 97 97 97 100 100 100 102 102 102
Using MTF, start with index[k] = k
• 0,1,2,3,4,…,96,97,98,99,…,255
Search for 97, initially it’s at index[97], then MTF
• 97,0,1,2,3,4,5,…, 96,98,99,…,255
Next time we search for 97 where is it? At 0!
So, to write out 97 97 97 we actually write 97 0 0, then we write out 100, where is it? Still at 100, why? Then MTF:
100,97,0,1,2,3,…96,98,99,101,102,…
So, to write out 97 97 97 100 100 100 102 102 102 we write:
97, 0, 0, 100, 0, 0, 102, 0, 0
Lots of zeros, ones, etc. Thus more Huffable, why?
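The move-to-front steps traced above can be sketched as follows. This is an illustration with hypothetical names, not the provided course code, and it assumes all values lie in 0..255. For each value it writes the position where the value currently sits, then moves that value to the front.

```java
import java.util.Arrays;

class MoveToFrontEncodeSketch {
    static int[] encode(int[] data) {
        // start with index[k] = k
        int[] index = new int[256];
        for (int k = 0; k < index.length; k++) {
            index[k] = k;
        }
        int[] out = new int[data.length];
        for (int i = 0; i < data.length; i++) {
            int value = data[i];
            // linear search: at most 256 steps, short when values repeat
            int pos = 0;
            while (index[pos] != value) {
                pos++;
            }
            out[i] = pos;
            // shift everything before pos one slot right, put value in front
            System.arraycopy(index, 0, index, 1, pos);
            index[0] = value;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] data = {97, 97, 97, 100, 100, 100, 102, 102, 102};
        System.out.println(Arrays.toString(encode(data)));
    }
}
```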
Given n characters, we have to look through 256 indexes (worst case)
So, 256*n, this is … O(n)
Average case is much better, the whole point of MTF is to find repeats near the beginning (what about MTF complexity?)
How to untransform, undo MTF, e.g., given
97, 0, 0, 100, 0, 0, 102, 0, 0
How do we recover aaadddfff (97,97,97,100,100,…102)
Initially index[k] = k , so where is 97? O(1) look up, then MTF
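Undoing MTF can be sketched the same way, again with hypothetical names and values assumed to be in 0..255. Each number read is a position, so the lookup is O(1); the value found there is the original, and it is then moved to the front exactly as the encoder did.

```java
import java.util.Arrays;

class MoveToFrontDecodeSketch {
    static int[] decode(int[] codes) {
        // the decoder starts from the same table as the encoder: index[k] = k
        int[] index = new int[256];
        for (int k = 0; k < index.length; k++) {
            index[k] = k;
        }
        int[] out = new int[codes.length];
        for (int i = 0; i < codes.length; i++) {
            int pos = codes[i];
            int value = index[pos]; // O(1) look up
            out[i] = value;
            // mirror the encoder's move-to-front so the tables stay in step
            System.arraycopy(index, 0, index, 1, pos);
            index[0] = value;
        }
        return out;
    }

    public static void main(String[] args) {
        int[] codes = {97, 0, 0, 100, 0, 0, 102, 0, 0};
        System.out.println(Arrays.toString(decode(codes)));
    }
}
```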
Transform data: make it more “compressible”
Introduce redundancy
First do BWT, then do MTF (latter provided)
Do this in chunks
For each chunk array (after BWT and MTF) huff it
To uncompress data
Read block of huffed data, uncompress it, untransform
Undo MTF, undo BWT: this code is given to you
Don’t forget magic numbers
Cooley-Tukey FFT
Bit: Binary Digit
Box-plot
“software” used in print
Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.
The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data.
Why do we use Java in our courses (royal we?)
Object oriented
Large collection of libraries
Safe for advanced programming and beginners
Harder to shoot ourselves in the foot
Why don't we use C++ (or C)?
Standard libraries weak or nonexistent (comparatively)
Easy to make mistakes when beginning
No GUIs, complicated compilation model
What about other languages?
Perl, Python, PHP, Ruby, C, C++, Java, Scheme, ML,
Can we do something different in one language?
• Depends on what different means.
• In theory: no; in practice: yes
What languages do you know? All of them.
In what languages are you fluent? None of them
In later courses why do we use C or C++?
Closer to the machine, understand abstractions at many levels
Some problems are better suited to one language
• Writing an operating system? Linux?
import java.util.*;
import java.io.*;

public class Unique {
    public static void main(String[] args) throws IOException {
        Scanner scan = new Scanner(new File("/data/melville.txt"));
        TreeSet<String> set = new TreeSet<String>();
        while (scan.hasNext()) {
            String str = scan.next();
            set.add(str);
        }
        for (String s : set) {
            System.out.println(s);
        }
    }
}
Numerous awards, engineering and science
ACM Grace Hopper
Formerly at Bell Labs
Now Texas A&M
“There's an old story about the person who wished his computer was as easy to use as his telephone.
That wish has come true, since I no longer know how to use my telephone.”
Bjarne Stroustrup
#include <iostream>
#include <fstream>
#include <set>
#include <string>
using namespace std;

int main(){
    ifstream input("/data/melville.txt");
    set<string> unique;
    string word;
    while (input >> word){
        unique.insert(word);
    }
    set<string>::iterator it = unique.begin();
    for(; it != unique.end(); it++){
        cout << *it << endl;
    }
    return 0;
}
Rasmus Lerdorf
Qeqertarsuaq, Greenland
1995 started PHP, now part of it
http://en.wikipedia.org/wiki/PHP
Personal Home Page
No longer an acronym
Rasmus Lerdorf
<?php
$wholething = file_get_contents("file:///data/melville.txt");
$wholething = trim($wholething);
$array = preg_split("/\s+/",$wholething);
$uni = array_unique($array);
sort($uni);
foreach ($uni as $word){
    echo $word."<br>";
}
?>
BDFL for Python development
Benevolent Dictator For Life
Late 80’s began development
Python is multi-paradigm
OO, Functional, Structured, …
We're looking forward to a future where every computer user will be able to "open the hood" of their computer and make improvements to the applications inside. We believe that this will eventually change the nature of software and software development tools fundamentally.
Guido van Rossum, 1999!
#! /usr/bin/env python
import sys
import re

def main():
    f = open('/data/melville.txt', 'r')
    # raw string so \s reaches the regex engine unchanged
    words = re.split(r'\s+', f.read().strip())
    f.close()
    allWords = set()
    for w in words:
        allWords.add(w)
    for word in sorted(allWords):
        print("%s" % word)

if __name__ == "__main__":
    main()
First C book, 1978
First ‘hello world’
Ritchie: Unix too!
Turing award 1983
Kernighan: tools
Strunk and White
Everyone knows that debugging is twice as hard as writing a program in the first place. So if you are as clever as you can be when you write it, how will you ever debug it?
Brian Kernighan
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int strcompare(const void * a, const void * b){
    char ** stra = (char **) a;
    char ** strb = (char **) b;
    return strcmp(*stra, *strb);
}

int main(){
    FILE * file = fopen("/data/melville.txt","r");
    char buf[1024];
    /* room for 5000 pointers to words, so the element size is sizeof(char *) */
    char ** words = (char **) malloc(5000*sizeof(char *));
    int count = 0;
    int k;
    while (fscanf(file,"%s",buf) != EOF){
        int found = 0;
        // look for word just read
        for(k=0; k < count; k++){
            if (strcmp(buf,words[k]) == 0){
                found = 1;
                break;
            }
        }
        if (!found){
            // not found, add to list
            words[count] = (char *) malloc(strlen(buf)+1);
            strcpy(words[count],buf);
            count++;
        }
    }
Complexity of reading/storing? Allocation of memory
    qsort(words,count,sizeof(char *), strcompare);
    for(k=0; k < count; k++) {
        printf("%s\n",words[k]);
    }
    for(k=0; k < count; k++){
        free(words[k]);
    }
    free(words);
}
Sorting, printing, and freeing
How to sort? What’s analogous to comparator?
Why do we call free? Necessary in this program?