Linear Algebra: When are we ever going to use this stuff?
Dr. Chris Pavone
CSU Chico
October 7, 2006

Objectives
• Learn a few (6) new concepts in linear algebra.
• Test your short-term memory.
• Learn how Google uses Linear Algebra to order search results.
• Gain an appreciation for the power of mathematics.

Notation
Recall: (We will be viewing matrices as linear transformations)

Definitions (short-term memory)
Example: We can also get a matrix from a directed graph… (HINT: The WWW is a directed graph!!!)

RECAP:
• Nonnegative Matrix: all entries ≥ 0.
• Stochastic Matrix: nonnegative and rows add up to 1.
• Irreducible Matrix: there is a path between any 2 nodes on the graph of the matrix (the graph of the matrix is strongly connected).
• Primitive Matrix: nonnegative, irreducible, and has exactly 1 eigenvalue with magnitude 1.
A is nonnegative, stochastic, irreducible, and primitive.

The Power Method
• Given a diagonalizable matrix with a dominant eigenvalue, the power method is an iterative technique for computing the dominant eigenvector (i.e., the eigenvector that corresponds to the dominant eigenvalue).

Here is how the power method works:
1. Choose a nonzero starting vector x_0.
2. Compute x_(k+1) = A x_k and normalize the result.
3. Repeat until the direction of x_k stops changing.
Your answer will be a vector pointing in the same direction as the “dominant” eigenvector. Note that this says nothing about magnitude.

Example: The power method should give us a vector pointing in the same direction as x_1 = (.7071, .7071, 0).

Here is why it works:
Because A is diagonalizable, we can write x_0 = c_1 x_1 + c_2 x_2 + … + c_n x_n in terms of the eigenvectors of A (assume c_1 ≠ 0). Then
A^k x_0 = c_1 λ_1^k x_1 + c_2 λ_2^k x_2 + … + c_n λ_n^k x_n.
Normalize: dividing by λ_1^k gives
c_1 x_1 + c_2 (λ_2/λ_1)^k x_2 + … + c_n (λ_n/λ_1)^k x_n,
which approaches c_1 x_1 as k grows, because |λ_i/λ_1| < 1 for every i > 1.
Notice how crucial it is that we have a “largest” eigenvalue and a diagonalizable matrix. This guarantees convergence.

Disclaimer
I am not a computer scientist, and I don’t claim to know anything about computers, the internet, hacking, or Bill Gates. The following is “the gist” of what I learned from doing a little research on the Google search algorithm. Everything I learned I got from the internet, research articles, and various linear algebra books. There were some small variations in what I found, and I have left out the annoying little technicalities (I have references for those interested). THIS IS JUST “THE GIST.” Don’t ask me why Google does what it does, it just does.

Flashback to 1997…
Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University, where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.

Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human computer interaction, search engines, scalability of information access interfaces, and personal data mining.

These guys created Google using Linear Algebra.
Taken from “Does Google know what it’s doing,” BBC 1115-05:

Referring to Google’s first office (1998):
“It was unreconstructed 1960s California; bikes in the corridors, lava lamps everywhere, the famous ex-Grateful Dead chef cooking delights in the Google canteen, a grand piano in reception for the Google PhDs to tinkle on during breaks.”

Referring to Brin and Page back in ’98:
“…the founding duo was not at all clear about what the business plan actually was.”

Now:
“It is taking a million square feet of NASA property in Silicon Valley for a new Googleplex to create space for its army of PhDs who spend their time dreaming up ways of making search better.”
“Its main asset is the number of PhDs it has working for it, ceaselessly trying to figure out how to extend the principle of search into everything, unbounded by time, space and (soon) language barriers.”
“The company refuses to hire people more than a year or two out of university, for fear that experience in the conventional business world will taint their freshness of mind.”

What is a “Google”?
The name "Google" is a play on the word "googol," which was coined by Milton Sirotta, nephew of American mathematician Edward Kasner. A googol refers to the number represented by a 1 followed by 100 zeros. A googol is a very large number. There isn't a googol of anything in the universe -- not stars, not dust particles, not atoms. Google's use of the term reflects our mission to organize the world's immense (and seemingly infinite) amount of information and make it universally accessible and useful (taken from Google’s website).

Number of pages in Google’s index as of 11/04: (http://blog.searchenginewatch.com/blog/041111-084221)

But how does it work…?

The Basic Idea…
When you do a search, Google finds all the relevant pages, and then orders the results using PageRank. PageRank is a numeric value assigned to each page that represents how important that page is.

“…a page is important if it is pointed to by other important pages. That is, they [Brin & Page] decided that the importance of your page (its PageRank score) is determined by summing the PageRanks of all pages that point to yours.”
“In building a mathematical definition of PageRank, Brin and Page also reasoned that when an important page points to several places, its weight (PageRank) should be distributed proportionately.”
“In other words, if YAHOO! points to your Web page, that’s good, but you shouldn’t receive the full weight of YAHOO! because they point to many other places. If YAHOO! points to 999 pages in addition to yours, then you should only get credit for 1/1000 of YAHOO!’s PageRank.”
Langville, A., Meyer, C. 2004. “The Use of Linear Algebra by Web Search Engines.”

How Google uses Linear Algebra:
Google turns the hyperlink structure of the WWW (a directed graph) into a primitive stochastic matrix G (“The Google Matrix”), and then uses the power method to find the PageRank of each page.

Building the Google Matrix:
Step 1: Let n be the number of pages, and let H be the n × n matrix with H_ij = 1/(number of links on page i) if page i links to page j, and H_ij = 0 otherwise. Then H is a nonnegative matrix.
Example: Suppose n = 6 (day 1 of the WWW).
[Figure: directed graph on pages 1 through 6]

But we want a primitive stochastic matrix.
Problem: Some rows of H may be all zeros (i.e., some pages may have no links). Therefore, H is not necessarily stochastic.
Step 2: Replace all zero rows with e = (1/n, 1/n, …, 1/n). Call the resulting matrix S.
Example (cont.):
[Figure: directed graph on pages 1 through 6]

We are still not in the clear… S may not be irreducible.
Remember: We need a primitive (nonnegative, irreducible, and exactly 1 eigenvalue with absolute value equal to 1) stochastic matrix in order to perform the power method.
Step 3: Let G = αS + (1 − α)E, where 0 ≤ α ≤ 1 and E is the n × n matrix whose rows all equal e. G is called the “Google Matrix”.

Google applies the power method to G^T to compute the PageRank of each page. Google orders all pages according to each page’s PageRank. If page m has the biggest PageRank (i.e., g_m ≥ g_i for all i), then page m is the first result you see after doing a search.

Keep in mind that this algorithm only RANKS pages. Google does have a separate method for finding the relevant pages before ranking them.
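The three steps and the power method can be sketched in a few lines of Python. This is only a minimal illustration, not Google's implementation: the 4-page link structure is made up (it is not the 6-page example from the slides), the function names `google_matrix` and `pagerank` are hypothetical, and the uniform starting vector is an assumption. The value alpha = 0.85 is the one quoted in the talk.

```python
# Sketch of Steps 1-3 and the power method on a small, made-up web.
# The link structure is hypothetical; pages are numbered 0..n-1.

def google_matrix(links, n, alpha=0.85):
    """Return G = alpha*S + (1 - alpha)*E as a list of rows."""
    G = []
    for i in range(n):
        out = set(links.get(i, []))
        if out:
            # Step 1: row i of H spreads page i's weight evenly over its links.
            row = [1.0 / len(out) if j in out else 0.0 for j in range(n)]
        else:
            # Step 2: a page with no links gets the uniform row e.
            row = [1.0 / n] * n
        # Step 3: mix with the uniform matrix E so every entry is positive.
        G.append([alpha * s + (1.0 - alpha) / n for s in row])
    return G


def pagerank(G, iterations=100):
    """Power method on G^T: push each page's rank along its row of G."""
    n = len(G)
    rank = [1.0 / n] * n  # uniform start (an assumption; any positive start works)
    for _ in range(iterations):
        # Entry j of (G^T rank): total rank flowing into page j.
        rank = [sum(rank[i] * G[i][j] for i in range(n)) for j in range(n)]
        total = sum(rank)
        rank = [r / total for r in rank]  # normalize so the ranks sum to 1
    return rank


# Hypothetical 4-page web: page 3 has no outgoing links.
links = {0: [1, 2], 1: [2], 2: [0, 3]}
G = google_matrix(links, n=4)
ranks = pagerank(G)
best_first = sorted(range(4), key=lambda p: ranks[p], reverse=True)
print("ranks:", [round(r, 4) for r in ranks])
print("order:", best_first)
```

Normalizing so the entries sum to 1 (rather than scaling the largest entry to 1) is a design choice; either way the vector points in the same direction, so the ordering of the pages is unchanged.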
To finish our example…
[Figure: directed graph on pages 1 through 6]
Recall: (Google uses α = .85)

Note: Using Matlab, the dominant eigenvector of G^T is
d = (.1044, .1488, .1160, .7043, .4038, .5425)
Normalize d (divide by its largest entry, .7043) to get
(0.1482, 0.2113, 0.1647, 1.0000, 0.5733, 0.7703)
Same EXACT direction.

PageRanks:
[Figure: directed graph on pages 1 through 6, labeled with each page's PageRank]

Why find the dominant eigenvector of G^T?
• Start with a random PageRank for each page (e.g., assume every page has initial PageRank equal to 1).
• Multiply by G^T. Each entry of a row of G^T is the proportion of one page’s PageRank being contributed to another (e.g., the proportion of page 3’s PageRank being contributed to page 1), so the first entry of the product is the total of the PageRanks being contributed to page 1.
• The product gives the new PageRanks for each page.
• Do it again, and again, and again…

Did you know??? (Party Knowledge)
• “It has been reported that Google computes PageRank once every few weeks for all documents in its web collection.”
• “The time required by Google to compute the PageRank vector has been reported to be on the order of several days.”
• Google’s index is larger than any other search engine.
• Brin and Page were 23 and 24 (respectively) when they created Google.
• Google claims that 50-100 iterates of the power method are all that’s needed.
• The “Google Toolbar” has a PageRank indicator on it.
• PageRank has been called “the world’s largest matrix computation.”
• The WWW is not an irreducible directed graph.
Google’s justification for their method is based on the “random surfer.”

References
1. Strang, G. Introduction to Linear Algebra, 2nd edition, 1998.
2. Brin, S., Page, L. The Anatomy of a Large-Scale Hypertextual Web Search Engine. http://www-db.stanford.edu/~backrub/google.html
3. Craven, P. Google's PageRank Calculator. http://webworkshop.net/pagerank_calculator.html
4. Craven, P. Google's PageRank Explained and how to make the most of it. www.webworkshop.net/pagerank.html
5. Day, P. Does Google know what it's doing? http://news.bbc.co.uk/2/hi/business/4436764.stm
6. Google. http://www.google.com/technology/
7. Langville, A., Meyer, C. The Use of the Linear Algebra by Web Search Engines. 2004. http://www.tufts.edu/~mkilme01/siagla/articles/IMAGE.pdf
8. Langville, A., Meyer, C. A Survey of Eigenvector Methods for Web Information Retrieval. SIAM Review, v. 47(1), 2005.
9. Larson & Edwards. Elementary Linear Algebra, 2nd edition, 1991.
10. Meyer, C. Matrix Analysis and Applied Linear Algebra, 2000.
11. Rogers, I. The Google Pagerank Algorithm and How It Works. http://www.iprcom.com/papers/pagerank/

THANK YOU