Linear Algebra: When are we
ever going to use this stuff?
Dr. Chris Pavone
CSU Chico
October 7, 2006
Objectives
• Learn a few (6) new concepts in linear
algebra.
• Test your short-term memory.
• Learn how Google uses Linear Algebra to
order search results.
• Gain an appreciation for the power of
mathematics.
Notation
Recall:
(We will be viewing matrices as linear transformations)
Definitions (short-term memory)
Example:
We can also get a matrix from a directed graph…
(HINT: The WWW is a directed graph!!!)
RECAP:
• Nonnegative Matrix: all entries ≥ 0.
• Stochastic Matrix: nonnegative and rows add up
to 1.
• Irreducible Matrix: there is a path between any 2
nodes on the graph of the matrix (the graph of the
matrix is strongly connected).
• Primitive Matrix: nonnegative, irreducible, and has
exactly 1 eigenvalue of largest magnitude (magnitude 1 when
the matrix is stochastic).
A is nonnegative, stochastic, irreducible, and
primitive.
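The four definitions in the recap can be checked numerically. The sketch below is only an illustration and is not from the slides: the function names and the two 3×3 test matrices are hypothetical. Irreducibility is tested with the standard fact that (I + A)^(n−1) is entrywise positive exactly when the graph is strongly connected, and primitivity with Wielandt's bound (a nonnegative n×n matrix is primitive exactly when its (n² − 2n + 2)-th power is entrywise positive).

```python
import numpy as np

def is_nonnegative(A):
    """All entries >= 0."""
    return bool(np.all(A >= 0))

def is_stochastic(A):
    """Nonnegative and every row sums to 1."""
    return is_nonnegative(A) and bool(np.allclose(A.sum(axis=1), 1.0))

def _positive_after(pattern, steps):
    """True if the boolean matrix power pattern**steps has no False entry."""
    P = pattern.copy()
    for _ in range(steps - 1):
        P = (P.astype(int) @ pattern.astype(int)) > 0
    return bool(P.all())

def is_irreducible(A):
    """Strong connectivity test: (I + A)^(n-1) > 0 entrywise."""
    n = A.shape[0]
    pattern = (A != 0) | np.eye(n, dtype=bool)
    return _positive_after(pattern, max(n - 1, 1))

def is_primitive(A):
    """Wielandt's test: A^(n^2 - 2n + 2) > 0 entrywise."""
    n = A.shape[0]
    return is_nonnegative(A) and _positive_after(A > 0, n * n - 2 * n + 2)

# Hypothetical test matrices (not the matrix A from the slide).
cycle = np.array([[0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0],
                  [1.0, 0.0, 0.0]])   # 1 -> 2 -> 3 -> 1
lazy = np.array([[0.5, 0.5, 0.0],
                 [0.0, 0.5, 0.5],
                 [0.5, 0.0, 0.5]])    # same cycle, plus self-loops

for name, M in [("cycle", cycle), ("lazy", lazy)]:
    print(name, is_stochastic(M), is_irreducible(M), is_primitive(M))
```

The plain cycle is stochastic and irreducible but not primitive (its three eigenvalues all have magnitude 1), while adding self-loops makes it primitive.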
The Power Method
• Given a diagonalizable matrix with a dominant
eigenvalue, the power method is an iterative
technique for computing the dominant eigenvector
(i.e., the eigenvector that corresponds to the
dominant eigenvalue).
Here is how the power method works:
Your answer will be a vector pointing in the
same direction as the “dominant” eigenvector.
Note that this says nothing about magnitude.
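The steps of the power method did not survive on this slide, so here is a minimal sketch of the usual version: multiply by the matrix, normalize, repeat. The function name `power_method`, the all-ones starting vector, the stopping tolerance, and the 3×3 matrix are all assumptions chosen for illustration. The matrix was picked so that its dominant eigenvector points along (.7071, .7071, 0).

```python
import numpy as np

def power_method(A, num_iters=100, tol=1e-10):
    """Return a unit vector approximating the dominant eigenvector of A.

    Assumes A has a single eigenvalue of largest magnitude and that the
    all-ones starting vector is not orthogonal to its eigenvector.
    """
    x = np.ones(A.shape[0])
    x = x / np.linalg.norm(x)
    for _ in range(num_iters):
        y = A @ x                      # apply the matrix
        y = y / np.linalg.norm(y)      # normalize: only direction matters
        if np.linalg.norm(y - x) < tol:
            return y
        x = y
    return x

# Hypothetical matrix with eigenvalues 3, 1, 1; the eigenvector for 3 is (1, 1, 0).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 1.0]])

x = power_method(A)
print(np.round(x, 4))
```

Normalizing at every step keeps the entries from overflowing or shrinking to zero; it does not change the direction of the iterate.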
Example:
The power method should give us a
vector pointing in the same direction as
x1=(.7071,.7071,0).
“normalize”
Here is why it works:
Normalize:
Notice how crucial it is that we have a “largest”
eigenvalue and a diagonalizable matrix. This
guarantees convergence.
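The derivation on this slide was lost, so the following is a reconstruction of the standard argument rather than the slide's own wording. The symbols are assumptions: A is n×n and diagonalizable with eigenvectors x_1, …, x_n and eigenvalues satisfying |λ_1| > |λ_2| ≥ … ≥ |λ_n|, and the starting vector x_0 has a nonzero component c_1 along x_1.

```latex
\begin{align*}
x_0 &= c_1 x_1 + c_2 x_2 + \cdots + c_n x_n, \qquad c_1 \neq 0, \\
A^k x_0 &= c_1 \lambda_1^k x_1 + c_2 \lambda_2^k x_2 + \cdots + c_n \lambda_n^k x_n \\
        &= \lambda_1^k \left( c_1 x_1
           + \sum_{i=2}^{n} c_i \left( \frac{\lambda_i}{\lambda_1} \right)^{k} x_i \right).
\end{align*}
```

Since |λ_i/λ_1| < 1 for every i ≥ 2, the sum shrinks to zero as k grows, so A^k x_0 lines up with x_1. Diagonalizability is what lets x_0 be written in terms of eigenvectors, and the strict inequality |λ_1| > |λ_2| is what makes the other terms die out.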
Disclaimer
I am not a computer scientist, and I don’t claim to
know anything about computers, the internet,
hacking, or Bill Gates. The following is “the gist” of
what I learned from doing a little research on the
Google search algorithm. Everything I learned I got
from the internet, research articles, and various
linear algebra books. There were some small
variations in what I found, and I have left out the
annoying little technicalities (I have references for
those interested). THIS IS JUST “THE GIST.”
Don’t ask me why Google does what it does, it just
does.
Flashback to 1997…
Sergey Brin received his B.S. degree in mathematics and
computer science from the University of Maryland at College Park
in 1993. Currently, he is a Ph.D. candidate in computer science at
Stanford University where he received his M.S. in 1995. He is a
recipient of a National Science Foundation Graduate Fellowship.
His research interests include search engines, information
extraction from unstructured sources, and data mining of
large text collections and scientific data.
Lawrence Page was born in East Lansing, Michigan, and received
a B.S.E. in Computer Engineering at the University of Michigan Ann
Arbor in 1995. He is currently a Ph.D. candidate in Computer
Science at Stanford University. Some of his research interests
include the link structure of the web, human computer
interaction, search engines, scalability of information access
interfaces, and personal data mining.
These guys created Google
using Linear Algebra.
Taken from “Does Google know what it’s doing,” BBC 1115-05:
Referring to Google’s first office (1998):
“It was unreconstructed 1960s California; bikes in the corridors, lava lamps
everywhere, the famous ex-Grateful Dead chef cooking delights in the Google
canteen, a grand piano in reception for the Google PhDs to tinkle on during breaks.”
Referring to Brin and Page back in ’98:
“…the founding duo was not at all clear about what the business plan actually was.”
Now:
“It is taking a million square feet of NASA property in Silicon Valley for a new
Googleplex to create space for its army of PhDs who spend their time dreaming up
ways of making search better.”
“Its main asset is the number of PhDs it has working for it, ceaselessly trying to
figure out how to extend the principle of search into everything, unbounded by time,
space and (soon) language barriers.”
“The company refuses to hire people more than a year or two out of university, for
fear that experience in the conventional business world will taint their freshness of
mind.”
What is a Google?
The name "Google" is a play on the word "googol," which was
coined by Milton Sirotta, nephew of American mathematician
Edward Kasner. A googol refers to the number represented by a 1
followed by 100 zeros. A googol is a very large number. There isn't a
googol of anything in the universe -- not stars, not dust particles, not
atoms. Google's use of the term reflects our mission to organize the
world's immense (and seemingly infinite) amount of information and
make it universally accessible and useful (taken from Google’s
website).
Number of pages in Google’s index as of 11/04:
(http://blog.searchenginewatch.com/blog/041111-084221)
But how does it work…?
The Basic Idea…
When you do a search, Google finds all the relevant pages, and then orders
the results using PageRank. PageRank is a numeric value assigned to
each page that represents how important that page is.
“…a page is important if it is pointed to by other important pages. That is,
they [Brin & Page] decided that the importance of your page (its PageRank
score) is determined by summing the PageRanks of all pages that
point to yours.”
“In building a mathematical definition of PageRank, Brin and Page also
reasoned that when an important page points to several places, its weight
(PageRank) should be distributed proportionately.”
“In other words, if YAHOO! points to your Web page, that’s good, but you
shouldn’t receive the full weight of YAHOO! because they point to many other
places. If YAHOO! points to 999 pages in addition to yours, then you should
only get credit for 1/1000 of YAHOO!’s PageRank.”
Langville, A., Meyer, C. 2004. “The Use of Linear Algebra by Web Search Engines.”
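The quoted rule can be written as one formula. The notation here is an assumption introduced for illustration: r(P) is the PageRank of page P, B_P is the set of pages that link to P, and |Q| is the number of links leaving page Q.

```latex
r(P) = \sum_{Q \in B_P} \frac{r(Q)}{|Q|}
```

In the YAHOO! example, |Q| = 1000, so the link from YAHOO! adds r(YAHOO!)/1000 to your page's score.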
How Google uses Linear Algebra:
Google turns the hyperlink structure of the WWW (a
directed graph) into a primitive stochastic matrix G
(“The Google Matrix”), and then uses the power
method to find the PageRank of each page.
Building the Google Matrix:
Then H is a nonnegative matrix.
Example:
[Figure: directed graph on nodes 1 through 6]
Suppose n=6 (day 1 of the WWW)
But we want a primitive stochastic matrix.
Problem: Some rows of H may be all zeros (i.e.,
some pages may have no links). Therefore, H is not
necessarily stochastic.
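The definition of H and the graph for the example were lost from these slides, so the sketch below rests on two assumptions. First, H[i][j] is taken to be 1/(number of links on page i) when page i links to page j and 0 otherwise, which matches the "1/1000 of YAHOO!" rule quoted earlier. Second, the link structure in `links` is assumed (it is the six-page example from the Langville and Meyer article); the names `links` and `hyperlink_matrix` are hypothetical.

```python
import numpy as np

# Assumed link structure: page -> pages it links to. Page 2 has no links.
links = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [4, 6], 6: [4]}

def hyperlink_matrix(links):
    """Build H for pages numbered 1..n: row i spreads weight evenly over page i's links."""
    n = len(links)
    H = np.zeros((n, n))
    for page, targets in links.items():
        for target in targets:
            H[page - 1, target - 1] = 1.0 / len(targets)
    return H

H = hyperlink_matrix(links)
row_sums = H.sum(axis=1)

# Pages whose row is all zeros are the reason H can fail to be stochastic.
dangling = [i + 1 for i, s in enumerate(row_sums) if s == 0]
print(np.round(H, 3))
print("row sums:", row_sums)
print("pages with no links:", dangling)
```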
Step 2:
Replace all zero rows with e = (1/n, 1/n, …, 1/n)
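Step 2 is a one-line operation on the matrix. In this sketch the function name `fix_dangling_rows` is hypothetical, and the 6×6 matrix H is an assumed example (the slide's own matrix did not survive); its second row is all zeros.

```python
import numpy as np

def fix_dangling_rows(H):
    """Return S: a copy of H with every all-zero row replaced by (1/n, ..., 1/n)."""
    n = H.shape[0]
    S = H.copy()
    zero_rows = ~H.any(axis=1)      # True for rows with no nonzero entry
    S[zero_rows] = 1.0 / n
    return S

# Assumed example: page 2 (second row) has no links.
H = np.array([
    [0,   1/2, 1/2, 0,   0,   0  ],
    [0,   0,   0,   0,   0,   0  ],
    [1/3, 1/3, 0,   0,   1/3, 0  ],
    [0,   0,   0,   0,   1/2, 1/2],
    [0,   0,   0,   1/2, 0,   1/2],
    [0,   0,   0,   1,   0,   0  ],
])

S = fix_dangling_rows(H)
print(S.sum(axis=1))   # every row of S now sums to 1
```

The rows that already had links are left untouched; only the zero rows change.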
Example (cont.):
[Figure: the same six-node directed graph]
We are still not in the clear…S may not be irreducible.
Remember: We need a primitive (nonnegative,
irreducible, and exactly 1 eigenvalue with absolute
value equal to 1) stochastic matrix in order to perform
the power method.
Step 3: Let
where 0 ≤ α ≤ 1 and
G is called the “Google Matrix”.
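The formula for Step 3 did not survive on this slide. A common form, given in the Langville and Meyer article listed in the references, is G = αS + (1 − α)E, where E is the n×n matrix with every entry equal to 1/n. The sketch below assumes that form; the function name `google_matrix` and the small matrix S are hypothetical.

```python
import numpy as np

def google_matrix(S, alpha=0.85):
    """Assumed form G = alpha*S + (1 - alpha)*E, with every entry of E equal to 1/n."""
    n = S.shape[0]
    E = np.full((n, n), 1.0 / n)
    return alpha * S + (1.0 - alpha) * E

# Hypothetical stochastic matrix that is NOT irreducible:
# pages 1 and 2 link only to each other, so page 3 is unreachable from them.
S = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 0.0],
              [1/3, 1/3, 1/3]])

G = google_matrix(S, alpha=0.85)
print(np.round(G, 4))
print("rows sum to 1:", np.allclose(G.sum(axis=1), 1.0))
print("all entries positive:", bool(np.all(G > 0)))
```

For α < 1 every entry of G is positive, so G is automatically irreducible and primitive while remaining stochastic, which is what the power method needs.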
Google applies the power method to G^T to
compute the PageRank of each page.
Google orders all pages according to each page’s
PageRank. If page m has the biggest PageRank
(i.e., g_m ≥ g_i for all i), then page m is the first result you see
after doing a search.
Keep in mind that this algorithm only RANKS pages.
Google does have a separate method for finding the
relevant pages before ranking them.
To finish our example…
[Figure: the same six-node directed graph]
Recall:
(Google uses α = .85)
Note: Using Matlab, the dominant eigenvector of G^T is
d = (.1044, .1488, .1160, .7043, .4038, .5425)
Normalize d (divide each entry by the largest entry, .7043) to get
(0.1482, 0.2113, 0.1647, 1.0000, 0.5733, 0.7703)
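The same computation can be sketched in NumPy instead of Matlab. Two things here are assumptions, because the graph and the formula for G were lost from the slides: the link structure encoded in H (taken from the six-page example in the Langville and Meyer article) and the form G = αS + (1 − α)E with every entry of E equal to 1/6. Under those assumptions the result agrees with the vectors quoted above to about three decimal places.

```python
import numpy as np

alpha = 0.85
n = 6

# Assumed hyperlink matrix; row 2 is all zeros (page 2 has no links).
H = np.array([
    [0,   1/2, 1/2, 0,   0,   0  ],
    [0,   0,   0,   0,   0,   0  ],
    [1/3, 1/3, 0,   0,   1/3, 0  ],
    [0,   0,   0,   0,   1/2, 1/2],
    [0,   0,   0,   1/2, 0,   1/2],
    [0,   0,   0,   1,   0,   0  ],
])

S = H.copy()
S[1] = 1.0 / n                                   # Step 2: fix the zero row
G = alpha * S + (1 - alpha) * np.full((n, n), 1.0 / n)   # Step 3 (assumed form)

# Dominant eigenvector of G^T: the one whose eigenvalue has the largest magnitude.
eigenvalues, eigenvectors = np.linalg.eig(G.T)
k = np.argmax(np.abs(eigenvalues))
d = np.real(eigenvectors[:, k])
d = d / np.linalg.norm(d)
if d.sum() < 0:            # an eigenvector's sign is arbitrary; make it positive
    d = -d

scaled = d / d.max()                         # largest entry becomes 1
ranking = (np.argsort(-d) + 1).tolist()      # page numbers, best first

print(np.round(d, 4))
print(np.round(scaled, 4))
print(ranking)
```

Rescaling d changes its length but not its direction, so the ordering of the pages is the same whichever normalization is used.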
Same EXACT direction
PageRanks:
[Figure: the six-node directed graph]
Why find the dominant eigenvector of G^T?
[Figure: annotated matrix-vector product, with these callouts:]
• PageRanks being contributed to page 1.
• Start with a random PageRank for each page (e.g., assume
every page has initial PageRank equal to 1).
• Proportion of page 3’s PageRank being contributed to page 1.
• The new PageRanks for each page.
Do it again, and again, and again…
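"Do it again, and again" is the power method applied to G^T. The sketch below starts every page at PageRank 1 and multiplies by G^T repeatedly. The matrix is an assumed reconstruction (the link structure and the form G = αS + (1 − α)E with α = .85 are not recoverable from the slides), and the iteration count of 100 is an arbitrary choice.

```python
import numpy as np

alpha, n = 0.85, 6

# Assumed stochastic matrix S for a six-page web (row 2 was a zero row, now 1/6 each).
S = np.array([
    [0,   1/2, 1/2, 0,   0,   0  ],
    [1/6, 1/6, 1/6, 1/6, 1/6, 1/6],
    [1/3, 1/3, 0,   0,   1/3, 0  ],
    [0,   0,   0,   0,   1/2, 1/2],
    [0,   0,   0,   1/2, 0,   1/2],
    [0,   0,   0,   1,   0,   0  ],
])
G = alpha * S + (1 - alpha) / n      # adds (1 - alpha)/n to every entry

x = np.ones(n)                       # every page starts with PageRank 1
first_step = G.T @ x                 # entry 1 = sum over j of G[j, 0] * x[j]

for _ in range(100):
    # New PageRank of page i = sum of the shares every page j passes to page i.
    x = G.T @ x

scaled = x / x.max()
print(np.round(first_step, 4))
print(np.round(scaled, 4))
```

Because each row of G sums to 1, multiplying by G^T only moves PageRank between pages: the total stays at 6, and the iterates settle on a fixed vector pointing in the direction of the dominant eigenvector.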
Did you know??? (Party Knowledge)
• “It has been reported that Google computes PageRank once every
few weeks for all documents in its web collection.”
• “The time required by Google to compute the PageRank vector
has been reported to be on the order of several days.”
• Google’s index is larger than that of any other search engine.
• Brin and Page were 23 and 24 (respectively) when they created
Google.
• Google claims that 50-100 iterations of the power method are all
that’s needed.
• The “Google Toolbar” has a PageRank indicator on it.
• PageRank has been called “the world’s largest matrix computation.”
• The WWW is not an irreducible directed graph. Google’s
justification for their method is based on the “random surfer.”
THANK YOU
References
1. Strang, G. Introduction to Linear Algebra, 2nd edition, 1998.
2. Brin, S., Page, L. The Anatomy of a Large-Scale Hypertextual
Web Search Engine. http://www-db.stanford.edu/~backrub/google.html
3. Craven, P. Google's PageRank Calculator.
http://webworkshop.net/pagerank_calculator.html
4. Craven, P. Google's PageRank Explained and how to make the most of it.
www.webworkshop.net/pagerank.html
5. Day, P. Does Google know what it's doing?
http://news.bbc.co.uk/2/hi/business/4436764.stm
6. Google. http://www.google.com/technology/
7. Langville, A., Meyer, C. The Use of the Linear Algebra by Web
Search Engines. 2004. http://www.tufts.edu/~mkilme01/siagla/articles/IMAGE.pdf
8. Langville, A., Meyer, C. A Survey of Eigenvector Methods for
Web Information Retrieval. SIAM Review, v. 47(1), 2005.
9. Larson & Edwards. Elementary Linear Algebra, 2nd edition, 1991.
10. Meyer, C. Matrix Analysis and Applied Linear Algebra, 2000.
11. Rogers, I. The Google Pagerank Algorithm and How It Works. http://www.iprcom.com/papers/pagerank/