6.006
Introduction to Algorithms
Lecture 1: Document Distance
Prof. Erik Demaine

Your Professors
Prof. Erik Demaine
Prof. Piotr Indyk
Prof. Manolis Kellis

Your TAs
Kevin Kelley
Joseph Laurendi
David Wen
Tianren Qi
Nicholas Zehender

Your Textbook

Administrivia
• Handout: Course information
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn’t fill out the form already
• Sign up for the problem set server: https://alg.csail.mit.edu/
• Sign up for a Piazzza account to ask/answer questions: http://piazzza.com/
• Prereqs: 6.01 (Python), 6.042 (discrete math)
• Grades:
  – Problem sets (30%)
  – Quiz 1 (20%; Mar. 8 @ 7:30–9:30pm)
  – Quiz 2 (20%; Apr. 13 @ 7:30–9:30pm)
  – Final (30%)
• Lectures & Recitations; Homework labs; Quiz reviews
• Read collaboration policy!

Today
• Class overview
  – What’s a (good) algorithm?
  – Topics
• Document Distance
  – Vector space model
  – Algorithms
  – Python profiling & gotchas

What’s an Algorithm?
• Mathematical abstraction of a computer program
• Well-specified method for solving a computational problem
  – Typically, a finite sequence of operations
• Description might be structured English, pseudocode, or real code
• Key: no ambiguity
http://en.wikipedia.org/wiki/File:Euclid_flowchart_1.png

al-Khwārizmī (c. 780–850)
• “al-kha-raz-mi”
• Father of algebra
  – The Compendious Book on Calculation by Completion and Balancing (c. 830)
  – Linear & quadratic equations: some of the first algorithms
http://en.wikipedia.org/wiki/Al-Khwarizmi
http://en.wikipedia.org/wiki/File:Image-Al-Kit%C4%81b_al-mu%E1%B8%ABta%E1%B9%A3ar_f%C4%AB_%E1%B8%A5is%C4%81b_al-%C4%9Fabr_wa-l-muq%C4%81bala.jpg

Efficient Algorithms
• Want an algorithm that’s
  – Correct
  – Fast
  – Small space
  – General
  – Simple
  – Clever
• Mainly interested in scalability as problem size grows

Why Efficient Algorithms?
• Save wait time, storage needs, energy consumption/cost, …
• Scalability = win
  – Solve bigger problems given fixed resources (CPU, memory, disk, etc.)
• Optimize travel time, schedule conflicts, …

How to Design an Efficient Algorithm?
1. Define computational problem
2. Abstract irrelevant detail
3. Reduce to a problem you learn here (or 6.046 or algorithmic literature)
4. Else design using “algorithmic toolbox”
5. Analyze algorithm’s scalability
6. Implement & evaluate performance
7. Repeat (optimize, generalize)

Modules & Applications
   Module                    Application
1. Introduction              Document similarity
2. Binary Search Trees       Scheduling
3. Hashing                   File synchronization
4. Sorting                   Spreadsheets
5. Graph Search              Rubik’s Cube
6. Shortest Paths            Google Maps
7. Dynamic Programming       Justifying text, packing, …
8. Numbers Pictures (NP)     Computing π, collision detection, hard problem
9. Beyond                    Folding, streaming, bio

Document Distance
• Given two documents, how similar are they?
• Applications:
  – Find similar documents
  – Detect plagiarism/duplicates
  – Web search (one “document” is query)
http://www.google.com/
http://en.wikipedia.org/wiki/Wikipedia:Mirrors_and_forks/

Document Distance
• How to define “document”?
• Word = sequence of alphanumeric characters
• Document = sequence of words
  – Ignore punctuation & formatting

Document Distance
• How to define “distance”?
• Idea: focus on shared words
• Word frequencies:
  – $D(w)$ = # occurrences of word $w$ in document $D$

Vector Space Model [Salton, Wong, Yang 1975]
• Treat each document as a vector of its words
  – One coordinate for every possible word
• Example:
  – $D_1$ = “the cat”
  – $D_2$ = “the dog”
• Similarity between vectors?
  – Dot product: $D_1 \cdot D_2 = \sum_w D_1(w) \, D_2(w)$
[Figure: $D_1$ and $D_2$ drawn as vectors on axes ‘the’, ‘cat’, ‘dog’]
http://portal.acm.org/citation.cfm?id=361220
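
As a minimal sketch of the example above, each document can be turned into a list with one coordinate per vocabulary word, and the similarity is the dot product of the two lists. The helper names `to_vector` and `dot`, the three-word vocabulary, and the whitespace split are assumptions made for illustration; they are not taken from the lecture code.

```python
# Every possible word gets one coordinate (a tiny vocabulary, assumed for this example).
vocabulary = ["the", "cat", "dog"]

def to_vector(text):
    """Return the word-frequency vector of text over the fixed vocabulary."""
    words = text.split()  # simplification: split on whitespace only
    return [words.count(w) for w in vocabulary]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

d1 = to_vector("the cat")  # [1, 1, 0]
d2 = to_vector("the dog")  # [1, 0, 1]
print(dot(d1, d2))         # 1: only 'the' is shared
```

A dense vector like this is only practical for a tiny vocabulary; storing just the words that actually occur is the natural alternative for real documents.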

Vector Space Model [Salton, Wong, Yang 1975]
• Problem: Dot product not scale invariant
• Example 1:
  – $D_1$ = “the cat”
  – $D_2$ = “the dog”
  – $D_1 \cdot D_2 = 1$
• Example 2:
  – $D_1$ = “the cat the cat”
  – $D_2$ = “the dog the dog”
  – $D_1 \cdot D_2 = 4$
[Figure: both examples drawn as vectors on axes ‘the’, ‘cat’, ‘dog’]
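
Vector Space Model [Salton, Wong, Yang 1975]
• Idea: Normalize by # words: $\dfrac{D_1 \cdot D_2}{|D_1| \cdot |D_2|}$
• Geometric solution: angle between vectors
  $\theta(D_1, D_2) = \arccos \left( \dfrac{D_1 \cdot D_2}{|D_1| \cdot |D_2|} \right)$, where $|D| = \sqrt{D \cdot D}$ is the vector length
  – 0 = “identical”, 90° = orthogonal (no shared words)
[Figure: angle between $D_1$ and $D_2$ on axes ‘the’, ‘cat’, ‘dog’]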
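
The angle measure can be sketched as follows, assuming vectors are plain lists over the coordinates ('the', 'cat', 'dog') and taking the vector length to be the Euclidean norm. The function names are hypothetical. The example checks that doubling both documents leaves the angle unchanged, which is the scale invariance the plain dot product lacks.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def angle_degrees(u, v):
    """Angle between two nonzero vectors, in degrees."""
    cosine = dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))
    # Clamp to [-1, 1] to guard against floating-point rounding.
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

# Coordinates are ('the', 'cat', 'dog').
print(angle_degrees([1, 1, 0], [1, 0, 1]))  # "the cat" vs "the dog": about 60
print(angle_degrees([2, 2, 0], [2, 0, 2]))  # both documents doubled: same angle
print(angle_degrees([1, 1, 0], [1, 1, 0]))  # identical documents: 0
print(angle_degrees([0, 1, 0], [0, 0, 1]))  # no shared words: 90
```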

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product

Algorithm
1. Read documents
2. Split each document into words
   – re.findall(r'\w+', doc)
   – But how does this actually work?
3. Count word frequencies (document vectors)
4. Compute dot product
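
For reference, here is a small sketch of what the regular-expression split returns on a made-up sentence. The sample text and the lowercasing step are assumptions for illustration. Note that in Python 3 `\w` also matches underscores and non-ASCII letters, so it only approximates "alphanumeric".

```python
import re

doc = "The cat, the dog... and the 2nd cat!"

# \w+ matches each maximal run of word characters; punctuation and spaces act as separators.
words = re.findall(r"\w+", doc.lower())
print(words)
# ['the', 'cat', 'the', 'dog', 'and', 'the', '2nd', 'cat']
```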

Algorithm
1. Read documents
2. Split each document into words
   – For each line in document:
       For each character in line:
         If not alphanumeric:
           Add previous word (if any) to list
           Start new word
3. Count word frequencies (document vectors)
4. Compute dot product
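
The character-by-character split can be sketched in Python as below. This is one possible reading of the pseudocode, not the course's docdist code: the name `split_words`, the lowercasing, and the choice to end a word at the end of each line are assumptions.

```python
def split_words(lines):
    """Return the lowercase alphanumeric words found in an iterable of lines."""
    words = []
    current = []                      # characters of the word being built
    for line in lines:
        for ch in line:
            if ch.isalnum():
                current.append(ch.lower())
            else:
                if current:           # add previous word (if any) to list
                    words.append("".join(current))
                current = []          # start new word
        if current:                   # a line that ends mid-word still ends the word
            words.append("".join(current))
            current = []
    return words

print(split_words(["The cat,\n", "the dog"]))
# ['the', 'cat', 'the', 'dog']
```

Collecting characters in a list and joining once per word avoids building each word by repeated string concatenation.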

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
   a. Sort the word list
   b. For each word in word list:
      – If same as last word:
          Increment counter
      – Else:
          Add last word and its counter to list
          Reset counter to 1 (the new word’s first occurrence)
4. Compute dot product
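
A minimal sketch of counting by sorting follows. It assumes the built-in `sorted` in place of a hand-written sort, and the function name is hypothetical. The counter starts at 1 for each new word because the word itself is one occurrence, and the last word is flushed after the loop.

```python
def count_frequencies_sorted(words):
    """Return a list of (word, count) pairs, sorted by word."""
    freq = []
    last, count = None, 0
    for word in sorted(words):        # equal words become adjacent after sorting
        if word == last:
            count += 1
        else:
            if last is not None:      # add last word and its counter to list
                freq.append((last, count))
            last, count = word, 1     # the new word has been seen once
    if last is not None:              # flush the final word
        freq.append((last, count))
    return freq

print(count_frequencies_sorted(["the", "cat", "the", "dog"]))
# [('cat', 1), ('dog', 1), ('the', 2)]
```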

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   For every possible word:
     Look up frequency in each document
     Multiply
     Add to total

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   For every word in first document:
     If it appears in second document:
       Multiply word frequencies
       Add to total
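
When each document vector is a list of (word, count) pairs, the loop above might look like the sketch below. The list-of-pairs representation and the name `dot_product_scan` are assumptions. Checking whether a word appears in the second document is a linear scan here, so the worst-case work grows with the product of the two list lengths.

```python
def dot_product_scan(freq1, freq2):
    """Dot product of two documents given as lists of (word, count) pairs."""
    total = 0
    for word1, count1 in freq1:
        for word2, count2 in freq2:   # linear scan of the second document
            if word1 == word2:
                total += count1 * count2
                break                 # each word occurs at most once per list
    return total

d1 = [("cat", 1), ("the", 1)]
d2 = [("dog", 1), ("the", 1)]
print(dot_product_scan(d1, d2))  # 1
```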

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   a. Start at first word of each document (in sorted order)
   b. If words are equal:
        Multiply word frequencies
        Add to total
   c. In whichever document has lexically lesser word, advance to next word
   d. Repeat until either document out of words
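
Steps a to d can be sketched as a merge-style walk over two sorted lists of (word, count) pairs. The representation and the function name are assumptions. When the two words are equal, neither is lexically lesser, so this sketch advances in both documents.

```python
def dot_product_merge(freq1, freq2):
    """Dot product of two lists of (word, count) pairs, each sorted by word."""
    i = j = 0
    total = 0
    while i < len(freq1) and j < len(freq2):  # until either document is out of words
        word1, count1 = freq1[i]
        word2, count2 = freq2[j]
        if word1 == word2:
            total += count1 * count2
            i += 1                            # equal words: advance in both documents
            j += 1
        elif word1 < word2:
            i += 1                            # first document has the lesser word
        else:
            j += 1                            # second document has the lesser word
    return total

d1 = [("cat", 1), ("the", 1)]
d2 = [("dog", 1), ("the", 1)]
print(dot_product_merge(d1, d2))  # 1
```

Each step advances at least one index, so the number of iterations is at most the sum of the two list lengths.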

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
   a. Initialize a dictionary mapping words to counts
   b. For each word in word list:
      – If in dictionary:
          Increment counter
      – Else:
          Put 1 in dictionary (first occurrence)
4. Compute dot product
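
The dictionary-based count might look like this in Python (a sketch with a hypothetical function name). A word seen for the first time is stored with count 1, and no sorting is needed.

```python
def count_frequencies(words):
    """Return a dictionary mapping each word to its number of occurrences."""
    freq = {}
    for word in words:
        if word in freq:
            freq[word] += 1   # seen before: increment counter
        else:
            freq[word] = 1    # first occurrence
    return freq

print(count_frequencies(["the", "cat", "the", "dog"]))
# {'the': 2, 'cat': 1, 'dog': 1}
```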

Algorithm
1. Read documents
2. Split each document into words
3. Count word frequencies (document vectors)
4. Compute dot product:
   For every word in first document:
     If it appears in second document:
       Multiply word frequencies
       Add to total
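
With dictionaries as document vectors, the same loop reduces to one membership test per word of the first document. This is a sketch under that assumption, and `dot_product_dict` is a hypothetical name.

```python
def dot_product_dict(freq1, freq2):
    """Dot product of two documents given as word -> count dictionaries."""
    total = 0
    for word, count in freq1.items():
        if word in freq2:             # dictionary membership test
            total += count * freq2[word]
    return total

d1 = {"the": 2, "cat": 2}  # "the cat the cat"
d2 = {"the": 2, "dog": 2}  # "the dog the dog"
print(dot_product_dict(d1, d2))  # 4
```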

Python Implementations

Python Profiling
Culprit
Fix

Python Implementations
Version    Change                                             Time
docdist1   initial version                                    (none listed)
docdist2   add profiling                                      192.5 sec
docdist3   replace + with extend                              126.5 sec
docdist4   count frequencies using dictionary                 73.4 sec
docdist5   split words with string.translate                  18.1 sec
docdist6   change insertion sort to merge sort                11.5 sec
docdist7   no sorting, dot product with dictionary            1.8 sec
docdist8   split words on whole document, not line by line    0.2 sec
Experiments on Intel Pentium 4, 2.8 GHz, Python 2.6.2, Linux 2.6.18.
Document 1 (t2.bobsey.txt) has 268,778 lines, 49,785 words, 3,354 distinct words.
Document 2 (t3.lewis.txt) has 1,031,470 lines, 182,355 words, 8,530 distinct words.
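
As a rough illustration of the "add profiling" and "replace + with extend" steps, the sketch below profiles a small word-list builder with the standard `cProfile` and `pstats` modules. The function `build_word_list` and its input are made up for this example and are not the docdist code; timings will differ by machine, so none are shown.

```python
import cProfile
import io
import pstats

def build_word_list(lines):
    """Collect the words of all lines into one list."""
    words = []
    for line in lines:
        # extend appends in place; `words = words + line.split()` would copy the list each time
        words.extend(line.split())
    return words

profiler = cProfile.Profile()
profiler.enable()
result = build_word_list(["the cat the dog"] * 10000)
profiler.disable()

# Print the entries with the largest cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report lists, per function, the number of calls and the time spent, which is how a slow step such as repeated list concatenation can be spotted.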

Don’t Forget!
• Webpage: http://courses.csail.mit.edu/6.006/spring11/
• Sign up for recitation if you didn’t already receive a recitation assignment from us
• Sign up for the problem set server: https://alg.csail.mit.edu/
• Sign up for a Piazzza account to ask/answer questions: http://piazzza.com/