Design of a Click-tracking Network for Full-text Search Engine
Group 5: Yuan Hu, Yu Ge, Youwen Gong, Zenghui Qiu and Miao Liu

Outline
• Introduction
• Objective
• Project diagram
  – Web Crawling
  – Indexing schema
• Ranking strategies
  – PageRank Algorithms
  – Neural Network
  – Content-Based Ranking
• Software and Reference

Introduction
• Full-text Search Engine
  – search on key words
  – rank results
• What is in a Search Engine?
  – Crawling
  – Indexing
  – Ranking results of query

Objective
• Design a full-text search engine
• Rank search results in different ways

Project Diagram
[Diagram components: Website, Crawling, Text & urls, Indexing, Database, Query Function, Content-Based Ranking, Click-Tracking Network, PageRank Algorithms, Ranked results]

Web Crawling
• Main page: http://en.wikipedia.org/wiki/Machine_learning
• Depth 1: crawl all the URL links on the main page
  – http://en.wikipedia.org/wiki/Machine_learning#Decision_tree_learning
  – ……
• Depth 2: crawl all the URL links found in depth 1
  – http://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain
  – ……
# Implemented with Python urllib2 module and BeautifulSoup API
[Diagram labels: URL, Main Page, Depth 1, LINK, URL, Depth 2]

Schema for Basic Index
• Link: Row_ID, From_ID, To_ID
• Url_list: Row_ID, Url
• Word_location: Url_ID, Word_ID, Location
• Link_words: Word_ID, Link_ID
• Word_list: Row_ID, Word
# Implemented with SQLite

Results for Multiple-words Query
[Screenshot labels: Words Combination, Query function, Word location, Same url_id]
! Note that the url_ids returned are not ranked.
http://www.rasch.org/rmt/rmt232a.htm

PageRank Algorithm
• Developed by Larry Page at Stanford University in 1996.
• Measures how important a page is.
• The importance of a page is calculated from all the other pages that link to it.
http://www.rasch.org/rmt/rmt232a.htm

How to Calculate PR
PR(A) = (1 - d) + d * ( PR(B)/L(B) + …… + PR(D)/L(D) + …… )
• d: damping factor, 0 < d < 1, usually set to 0.85.
• PR(B), ……, PR(D), ……: PageRank value of each webpage linking to page A.
• L(B), ……, L(D), ……: the number of links going out of each page B, ……, D, ……
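
The calculation for a single page can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `page_rank`, the `(PR, L)` pair layout and the sample values are hypothetical and do not come from the project code.

```python
def page_rank(inbound, d=0.85):
    """Return PR(A) given (PR, L) pairs for every page linking to A.

    inbound: list of (pagerank_value, outgoing_link_count) tuples.
    d: damping factor, 0 < d < 1.
    """
    # Each linking page passes on its own PR divided by its number of outgoing links.
    share = sum(pr / float(links) for pr, links in inbound)
    return (1.0 - d) + d * share


# Hypothetical inbound pages as (PageRank value, number of outgoing links).
inbound_pages = [(0.4, 2), (0.9, 3)]
print(page_rank(inbound_pages))
```

A page with no inbound links receives only the constant term 1 - d.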
Example
PR(A) = 0.15 + 0.85 * ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) )
      = 0.15 + 0.85 * ( 0.5/4 + 0.7/4 + 0.2/1 )
      = 0.15 + 0.85 * ( 0.125 + 0.175 + 0.2 )
      = 0.15 + 0.85 * 0.5
      = 0.575
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm

How to Update the PR Value
• If we don't know what the PR values should be to begin with, assign an initial PR value to every page.
• Update: recompute every page's PR from the current values of the pages linking to it, for 20 iterations.
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm

Results for PageRank
[Screenshot: PageRank values]

Neural Network
Why?
• Make reasonable guesses about results for queries it has never seen before.
Click-tracking
• The weights are updated based on which search results the user clicked.

Neural Network
• Step 1: Setting Up the Database
• Step 2: Feeding Forward Activation
• Step 3: Training with Backpropagation

How Does the Neural Network Work?
[Figure legend: solid line = strong connection; bold text = active node]

Step 1: Setting Up the ANN Database
• Create a table for the hidden layer (red box)
• Create two tables for the connections (green boxes)

Step 2: Feeding Forward Activation
• Objective: activate the ANN.
  – Take words as inputs
  – Activate the links in the network
  – Give outputs for URLs
• Activation: hyperbolic tangent function
[Plot: hyperbolic tangent; x-axis: total input to the node]

Step 3: Training with Backpropagation
• Train the network every time someone performs a search and chooses one of the links
• The same algorithm covered in class.
• Learning rate = 0.5

Results for Neural Network
• Step 1: [Screenshot labels: From ID, Hidden node, Strength, To ID]
• Step 2: [Screenshot labels: input, relevance of URL, URL]
• Step 3: Training with one query

Results for Neural Network (contd.)
• Step 3: Training with more queries

Content-Based Ranking
Basic idea: calculate a score based only on the query and the content of the page
• Word frequency
• Document location
• Word distance

Software
• Ubuntu 11.04
• Python 2.7.3
• SQLite

Reference
• Programming Collective Intelligence - Toby Segaran
• SQLite Tutorial - ZetCode
• Dive Into Python - Mark Pilgrim

Thank you.
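
Backup sketch for Steps 2 and 3 of the neural network slides. This is a simplified in-memory illustration, not the SQLite-backed version described in the deck. The class name `ClickNet`, the layer sizes, and the initial weights (-0.2 and 0.1) are assumptions made for the example. It feeds word inputs forward through tanh nodes and applies one backpropagation update per click with learning rate 0.5.

```python
import math


def dtanh(y):
    # Slope of tanh written in terms of its output y = tanh(x).
    return 1.0 - y * y


class ClickNet(object):
    def __init__(self, n_words, n_hidden, n_urls, w_in=-0.2, w_out=0.1):
        # wi[i][j]: word i -> hidden node j; wo[j][k]: hidden node j -> URL k.
        self.wi = [[w_in] * n_hidden for _ in range(n_words)]
        self.wo = [[w_out] * n_urls for _ in range(n_hidden)]

    def feed_forward(self, word_ids):
        n_words, n_hidden, n_urls = len(self.wi), len(self.wo), len(self.wo[0])
        # Query words are switched on (1.0); all other words stay at 0.0.
        self.ai = [1.0 if i in word_ids else 0.0 for i in range(n_words)]
        self.ah = [math.tanh(sum(self.ai[i] * self.wi[i][j] for i in range(n_words)))
                   for j in range(n_hidden)]
        self.ao = [math.tanh(sum(self.ah[j] * self.wo[j][k] for j in range(n_hidden)))
                   for k in range(n_urls)]
        return list(self.ao)

    def train(self, word_ids, clicked_url, rate=0.5):
        self.feed_forward(word_ids)
        # Target is 1 for the clicked URL and 0 for every other URL.
        targets = [1.0 if k == clicked_url else 0.0 for k in range(len(self.ao))]
        # Compute both sets of deltas before changing any weight.
        out_d = [dtanh(o) * (t - o) for o, t in zip(self.ao, targets)]
        hid_d = [dtanh(h) * sum(d * w for d, w in zip(out_d, row))
                 for h, row in zip(self.ah, self.wo)]
        for j, row in enumerate(self.wo):
            for k in range(len(row)):
                row[k] += rate * out_d[k] * self.ah[j]
        for i, row in enumerate(self.wi):
            for j in range(len(row)):
                row[j] += rate * hid_d[j] * self.ai[i]


# Hypothetical query using words 0 and 2; the user clicks URL 1 each time.
net = ClickNet(n_words=3, n_hidden=2, n_urls=3)
for _ in range(10):
    net.train([0, 2], clicked_url=1)
scores = net.feed_forward([0, 2])
print(scores)
```

After repeated clicks on the same URL for the same query, that URL's output should rise above the others, which is the click-tracking behaviour the results slides describe.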