Design of a Click-tracking Network
for Full-text Search Engine
Group 5: Yuan Hu, Yu Ge, Youwen Gong, Zenghui Qiu and Miao Liu
Outline
• Introduction
• Objective
• Project diagram
– Web Crawling
– Indexing schema
• Ranking strategies
– PageRank Algorithms
– Neural Network
– Content-Based Ranking
• Software and Reference
Introduction
• Full-text Search Engine
– Search on keywords
– Rank results
• What is in a Search Engine?
– Crawling
– Indexing
– Ranking results of query
Objective
• Design a full-text search engine
• Rank search results in different ways
Project Diagram
[Figure: pipeline diagram. Stages: Website, Crawling, Text & URLs, Indexing, Database, Query Function, Ranked results. Ranking components: Content-Based Ranking, PageRank Algorithms, Click-Tracking Network.]
Web Crawling
Main page:
http://en.wikipedia.org/wiki/Machine_learning
Depth 1:
crawling all the URL links on the main page
http://en.wikipedia.org/wiki/Machine_learning#Decision_tree_learning
……
Depth 2:
crawling all the URL links found in depth 1
http://en.wikipedia.org/wiki/Decision_tree_learning#Information_gain
……
# Implemented with Python urllib2 module and BeautifulSoup API
[Figure: crawl tree. The main page URL links to the depth 1 URLs, which link to the depth 2 URLs.]
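The depth-limited crawl above can be sketched as follows. This is a minimal illustration, not the project's code. It is written for Python 3 and uses the standard html.parser module in place of urllib2 and BeautifulSoup, so it has no dependencies. The page fetcher is passed in as a parameter (fetch, a hypothetical callable that returns the HTML for a URL, or None on failure), so the logic can be tried without network access. Dropping the #fragment part of each link is also an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start_url, fetch, depth=2):
    """Breadth-first crawl; returns (pages, links).

    pages: set of URLs that were fetched
    links: set of (from_url, to_url) pairs found on those pages
    Round 1 fetches the main page and collects the depth 1 links;
    round 2 fetches those pages and collects the depth 2 links.
    """
    pages, links = set(), set()
    frontier = [start_url]
    for _round in range(depth):
        next_frontier = []
        for url in frontier:
            if url in pages:
                continue
            html = fetch(url)
            if html is None:  # page could not be retrieved
                continue
            pages.add(url)
            parser = LinkParser()
            parser.feed(html)
            for href in parser.hrefs:
                # Resolve relative links and drop the #fragment part.
                target, _fragment = urldefrag(urljoin(url, href))
                if target.startswith("http"):
                    links.add((url, target))
                    if target not in pages:
                        next_frontier.append(target)
        frontier = next_frontier
    return pages, links
```

In real use, fetch could wrap urllib.request.urlopen; for a dry run, a dictionary's get method over a few hand-written pages is enough.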
Schema for Basic Index
• Link: Row_ID, From_ID, To_ID
• Url_list: Row_ID, Url
• Word_location: Url_ID, Word_ID, Location
• Link_words: Word_ID, Link_ID
• Word_list: Row_ID, Word
# Implemented with SQLite
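A possible way to create this schema with Python's sqlite3 module is sketched below. The table and column names follow the slide (in lower case); the column types, the UNIQUE constraints, the index, and the helper names entry_id and add_page are assumptions made for illustration. The sketch is written for Python 3.

```python
import sqlite3

SCHEMA = """
CREATE TABLE url_list      (row_id INTEGER PRIMARY KEY, url TEXT UNIQUE);
CREATE TABLE word_list     (row_id INTEGER PRIMARY KEY, word TEXT UNIQUE);
CREATE TABLE word_location (url_id INTEGER, word_id INTEGER, location INTEGER);
CREATE TABLE link          (row_id INTEGER PRIMARY KEY, from_id INTEGER, to_id INTEGER);
CREATE TABLE link_words    (word_id INTEGER, link_id INTEGER);
CREATE INDEX word_location_by_word ON word_location (word_id);
"""


def create_index_db(path=":memory:"):
    """Open (or create) the database and build the empty index tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def entry_id(conn, table, column, value):
    """Return the row_id of value in table, inserting it first if missing.

    table and column are fixed names chosen by the caller's code,
    never user input, so formatting them into the SQL is safe here.
    """
    row = conn.execute(
        f"SELECT row_id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    if row is not None:
        return row[0]
    cursor = conn.execute(f"INSERT INTO {table} ({column}) VALUES (?)", (value,))
    return cursor.lastrowid


def add_page(conn, url, text):
    """Index one page: record every word together with its position."""
    url_id = entry_id(conn, "url_list", "url", url)
    for location, word in enumerate(text.lower().split()):
        word_id = entry_id(conn, "word_list", "word", word)
        conn.execute(
            "INSERT INTO word_location VALUES (?, ?, ?)",
            (url_id, word_id, location),
        )
    conn.commit()
    return url_id
```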
Results for Multiple-words Query
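[Screenshot: output of the query function for a combination of words. Each row holds a url_id and the word locations; the same url_id can appear in several rows.]
Note: the url_ids returned by the query are not yet ranked.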
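One way such a multiple-word query could be written is a self-join of the word location table, one copy per query word, keeping only rows where every copy has the same url_id. The sketch below is an illustration under assumptions: the function name match_rows, the lower-case table names, and the tiny demo_db data set are all made up for the example.

```python
import sqlite3


def match_rows(conn, query):
    """Return (rows, word_ids) for pages containing every word in query.

    Each row is (url_id, location_of_word_0, location_of_word_1, ...).
    A page appears once per combination of word positions, unranked.
    """
    word_ids = []
    for word in query.lower().split():
        row = conn.execute(
            "SELECT row_id FROM word_list WHERE word = ?", (word,)
        ).fetchone()
        if row is None:
            return [], []  # a word that was never indexed matches nothing
        word_ids.append(row[0])
    if not word_ids:
        return [], []

    fields, tables, clauses = ["w0.url_id"], [], []
    for i in range(len(word_ids)):
        fields.append(f"w{i}.location")
        tables.append(f"word_location w{i}")
        clauses.append(f"w{i}.word_id = ?")
        if i > 0:
            # Every word must occur on the same page.
            clauses.append(f"w{i - 1}.url_id = w{i}.url_id")
    sql = (
        f"SELECT {', '.join(fields)} FROM {', '.join(tables)} "
        f"WHERE {' AND '.join(clauses)}"
    )
    return conn.execute(sql, word_ids).fetchall(), word_ids


def demo_db():
    """Tiny hand-made index, purely for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE word_list (row_id INTEGER PRIMARY KEY, word TEXT);
        CREATE TABLE word_location (url_id INTEGER, word_id INTEGER, location INTEGER);
        INSERT INTO word_list VALUES (1, 'machine'), (2, 'learning');
        INSERT INTO word_location VALUES (1, 1, 0), (1, 2, 1), (1, 1, 5), (2, 1, 3);
    """)
    return conn
```

With the demo data, the query "machine learning" matches only url_id 1, which is returned twice because "machine" occurs at two locations on that page.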
http://www.rasch.org/rmt/rmt232a.htm
PageRank Algorithm
• Developed by Larry Page and Sergey Brin at Stanford University in 1996.
• Measures how important a page is.
• The importance of a page is calculated from all the other pages that link to it.
http://www.rasch.org/rmt/rmt232a.htm
How to Calculate PR
PR(A) = (1 - d) + d * ( PR(B)/L(B) + PR(C)/L(C) + PR(D)/L(D) + … )
• d: damping factor, 0 < d < 1, typically 0.85.
• PR(B), PR(C), PR(D), …: PageRank value of each webpage linking to page A.
• L(B), L(C), L(D), …: the number of links going out of pages B, C, D, …
Example
PR(A) = 0.15 + 0.85 * ( PR(B)/links(B) + PR(C)/links(C) +PR(D)/links(D) )
= 0.15 + 0.85 * ( 0.5/4 + 0.7/4 + 0.2/1 )
= 0.15 + 0.85 * ( 0.125 + 0.175 + 0.2)
= 0.15 + 0.85 * 0.5
= 0.575
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
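The arithmetic of the example can be checked with a few lines of Python. The function name pagerank_of and its argument layout are made up for this sketch; the numbers are the ones used in the example (B: PR 0.5 with 4 links, C: PR 0.7 with 4 links, D: PR 0.2 with 1 link).

```python
def pagerank_of(inbound, d=0.85):
    """PageRank of one page from the pages that link to it.

    inbound: list of (pr, out_links) pairs, one per linking page,
             where out_links is that page's number of outgoing links.
    """
    return (1 - d) + d * sum(pr / out_links for pr, out_links in inbound)


pr_a = pagerank_of([(0.5, 4), (0.7, 4), (0.2, 1)])
print(round(pr_a, 3))  # should agree with the 0.575 computed by hand
```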
How to Update the PR Value
• The PR of a page depends on the PR of the pages linking to it, which are not known at the start.
• Assign an initial PR value to every page.
• Update the PR value of every page, and repeat for 20 iterations.
http://www.cs.princeton.edu/~chazelle/courses/BIB/pagerank.htm
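The update procedure can be sketched in memory as below. Several details are assumptions, since the slides do not state them: the initial value of 1.0, the representation of the link graph as a dictionary, and the choice to compute all new values from the previous iteration's values before replacing them.

```python
def pagerank(links, d=0.85, iterations=20, initial=1.0):
    """Iteratively estimate PageRank.

    links: dict mapping each page to the list of pages it links to.
    Returns a dict mapping every page to its PR value.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # For each page, which pages link to it (duplicates counted once).
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in set(targets):
            inbound[target].append(source)
    out_count = {page: len(set(links.get(page, ()))) for page in pages}

    pr = {page: initial for page in pages}
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            total = sum(pr[src] / out_count[src] for src in inbound[page])
            new_pr[page] = (1 - d) + d * total
        pr = new_pr
    return pr
```

For example, with links = {"A": ["C"], "B": ["C"], "C": []}, page C ends up with a higher value than A and B, because it is the only page that receives links.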
Results for PageRank
[Screenshot: PageRank values]
Neural Network
Why?
• It can make reasonable guesses about results for queries it has never seen before.
Click-tracking
• The weights are updated based on the search results that the user clicked.
Neural Network
• Step 1: Setting Up the Database
• Step 2: Feeding Forward Activation
• Step 3: Training with Backpropagation
How the Neural Network Works
Solid line: Strong connections
Bold text: Active node
Step 1: Setting Up the ANN Database
• Create a table for the hidden layer (red box)
• Create two tables for the connections (green boxes)
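A possible layout for these three tables is sketched below with Python's sqlite3 module. The table names (hidden_node, word_hidden, hidden_url), the create_key column, and the default strengths of -0.2 and 0.0 for connections that do not exist yet are assumptions; the defaults follow the convention in Segaran's book listed under Reference and may differ from the project's values.

```python
import sqlite3

# Strength assumed for a connection that has not been stored yet.
DEFAULT_STRENGTH = {"word_hidden": -0.2, "hidden_url": 0.0}


def create_ann_db(path=":memory:"):
    """Create one table for hidden nodes and two for the connections."""
    conn = sqlite3.connect(path)
    conn.executescript("""
        CREATE TABLE hidden_node (row_id INTEGER PRIMARY KEY, create_key TEXT UNIQUE);
        CREATE TABLE word_hidden (from_id INTEGER, to_id INTEGER, strength REAL,
                                  PRIMARY KEY (from_id, to_id));
        CREATE TABLE hidden_url  (from_id INTEGER, to_id INTEGER, strength REAL,
                                  PRIMARY KEY (from_id, to_id));
    """)
    return conn


def get_strength(conn, table, from_id, to_id):
    """Read a connection strength, falling back to the layer's default."""
    default = DEFAULT_STRENGTH[table]  # KeyError for an unknown table name
    row = conn.execute(
        f"SELECT strength FROM {table} WHERE from_id = ? AND to_id = ?",
        (from_id, to_id),
    ).fetchone()
    return default if row is None else row[0]


def set_strength(conn, table, from_id, to_id, strength):
    """Insert a connection, or overwrite its strength if it already exists."""
    DEFAULT_STRENGTH[table]  # validates the table name before it is formatted in
    conn.execute(
        f"INSERT OR REPLACE INTO {table} (from_id, to_id, strength) VALUES (?, ?, ?)",
        (from_id, to_id, strength),
    )
    conn.commit()
```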
Step 2: Feeding Forward Activation
• Objective: activate the ANN.
– Take words as inputs
– Activate the links in the network
– Produce an output for each URL
• Activation function: hyperbolic tangent (tanh)
[Plot: tanh curve. X-axis: total input to the node]
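The feed-forward step can be sketched without a database by holding the connection strengths in nested lists. The function name feed_forward, the list layout, and the input activation of 1.0 for every query word are assumptions made for this illustration.

```python
from math import tanh


def feed_forward(w_in, w_out):
    """Propagate activation from query words to URLs.

    w_in[i][j]:  strength of the connection word i -> hidden node j
    w_out[j][k]: strength of the connection hidden node j -> URL k
    Returns (hidden_activations, url_activations).
    """
    n_words, n_hidden, n_urls = len(w_in), len(w_out), len(w_out[0])
    a_in = [1.0] * n_words  # every query word is switched on
    a_hidden = [
        tanh(sum(a_in[i] * w_in[i][j] for i in range(n_words)))
        for j in range(n_hidden)
    ]
    a_out = [
        tanh(sum(a_hidden[j] * w_out[j][k] for j in range(n_hidden)))
        for k in range(n_urls)
    ]
    return a_hidden, a_out
```

Each node sums its weighted inputs and passes the total through tanh, so every output lies between -1 and 1, and a URL with stronger connections to the active hidden nodes receives a higher output.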
Step 3: Training with Backpropagation
• Train the network every time someone performs a search and chooses one of the links
• The same algorithm covered in class
• Learning rate = 0.5
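One training step could look like the sketch below: a forward pass, then standard backpropagation with tanh units and the learning rate of 0.5 from the slide. The target of 1.0 for the clicked URL and 0.0 for the others, the function names, and the in-memory list layout are assumptions; the project stored its weights in SQLite.

```python
from math import tanh


def dtanh(y):
    """Slope of tanh, expressed through its output y = tanh(x)."""
    return 1.0 - y * y


def train_step(w_in, w_out, clicked, rate=0.5):
    """One backpropagation step; updates w_in and w_out in place.

    w_in[i][j]:  word i -> hidden node j
    w_out[j][k]: hidden node j -> URL k
    clicked:     index of the URL the user chose
    Returns the URL outputs computed before the update.
    """
    n_words, n_hidden, n_urls = len(w_in), len(w_out), len(w_out[0])

    # Forward pass with every query word switched on.
    a_in = [1.0] * n_words
    a_hid = [tanh(sum(a_in[i] * w_in[i][j] for i in range(n_words)))
             for j in range(n_hidden)]
    a_out = [tanh(sum(a_hid[j] * w_out[j][k] for j in range(n_hidden)))
             for k in range(n_urls)]

    # Output errors: the clicked URL should be 1.0, all others 0.0.
    targets = [1.0 if k == clicked else 0.0 for k in range(n_urls)]
    out_delta = [dtanh(a_out[k]) * (targets[k] - a_out[k]) for k in range(n_urls)]

    # Hidden errors, computed from the weights before they change.
    hid_delta = [
        dtanh(a_hid[j]) * sum(out_delta[k] * w_out[j][k] for k in range(n_urls))
        for j in range(n_hidden)
    ]

    # Weight updates.
    for j in range(n_hidden):
        for k in range(n_urls):
            w_out[j][k] += rate * out_delta[k] * a_hid[j]
    for i in range(n_words):
        for j in range(n_hidden):
            w_in[i][j] += rate * hid_delta[j] * a_in[i]
    return a_out
```

Calling train_step repeatedly with the same clicked index should raise that URL's output relative to the others, which is the click-tracking behaviour the slides describe.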
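Results for Neural Network
• Step 1: [Screenshot: hidden node table and connection tables with columns From ID, To ID, Strength]
• Step 2: [Screenshot: feed-forward output. Labels: input URL, relevance of URL]
• Step 3: [Screenshot: training with one query]
Results for Neural Network (contd)
• Step 3: [Screenshot: training with more queries]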
Content-Based Ranking
Basic Idea: Calculate a score based only on the
query and the content of the page
• Word frequency
• Document location
• Word distance
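The three metrics can be sketched as scoring functions over query result rows of the form (url_id, location of word 1, location of word 2, …). This is an illustration under assumptions: the function names, the normalization to a 0 to 1 scale, and the equal default weights are not taken from the slides.

```python
def normalize(scores, small_is_better=False):
    """Scale scores so that the best page gets 1.0."""
    tiny = 1e-5  # avoids division by zero
    if small_is_better:
        lowest = max(tiny, min(scores.values()))
        return {u: lowest / max(tiny, s) for u, s in scores.items()}
    highest = max(scores.values()) or tiny
    return {u: s / highest for u, s in scores.items()}


def frequency_score(rows):
    """More occurrences of the query words is better."""
    counts = {}
    for row in rows:
        counts[row[0]] = counts.get(row[0], 0) + 1
    return normalize(counts)


def location_score(rows):
    """Words appearing earlier in the document is better."""
    best = {}
    for row in rows:
        total = sum(row[1:])
        best[row[0]] = min(total, best.get(row[0], total))
    return normalize(best, small_is_better=True)


def distance_score(rows):
    """Query words appearing closer together is better."""
    if rows and len(rows[0]) <= 2:  # single-word query: all pages tie
        return {row[0]: 1.0 for row in rows}
    best = {}
    for row in rows:
        dist = sum(abs(row[i] - row[i - 1]) for i in range(2, len(row)))
        best[row[0]] = min(dist, best.get(row[0], dist))
    return normalize(best, small_is_better=True)


def rank(rows, weights=(1.0, 1.0, 1.0)):
    """Combine the three scores; returns [(url_id, score), ...], best first."""
    if not rows:
        return []
    parts = (frequency_score(rows), location_score(rows), distance_score(rows))
    totals = {}
    for weight, scores in zip(weights, parts):
        for url_id, score in scores.items():
            totals[url_id] = totals.get(url_id, 0.0) + weight * score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

For rows = [(1, 0, 1), (1, 5, 1), (2, 10, 40)], page 1 ranks first: it contains the words more often, earlier, and closer together than page 2.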
Software
• Ubuntu 11.04
• Python 2.7.3
• SQLite
Reference
• Programming Collective Intelligence, Toby Segaran
• SQLite Tutorial, ZetCode
• Dive Into Python, Mark Pilgrim
Thank you.