Semantic Web Project Results

Project Title: Semantic Web Project Results
Jordan Willis
James Reid
Submitted in partial fulfillment of
The requirements for
03-60-569
Semantic Web
Faculty of Computer Science
University of Windsor
Windsor, Ontario
Dr. J. Lu
December 15, 2013
Overview
Our project goal was to create a scalable distributed web crawler paired with a search engine for querying the crawled pages. The system uses PageRank to prioritize and sort search results, removes near-duplicates before indexing, and groups similar results at query time. Although we never achieved the distributed aspect because of resource limitations, we were able to test the individual components that make up our remote servers and to simulate the central server's search results. This document is a summary of our results and findings. Further information is available on the project web site, and the search engine can be tested at the Project Search Engine site.
Overall Design: Divide and Conquer Scheme
The distributed servers are responsible for crawling and indexing their web space, calculating the local PageRanks, and returning their k best results to the Central Search Engine, which in turn displays the k best results from all the distributed servers to the searcher. This design is meant to reduce network traffic and search time by keeping the bulk of the information on the distributed servers and exchanging only inter-domain links and search requests with the Central Search Engine. To address the possibility of duplicates being returned by the remote servers, a similarity check is performed before the results are shown to the user.
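As a rough sketch of the merge step on the Central Search Engine, assuming each remote server returns its k best hits with a score and a shingle signature (all class and field names below are our own illustration, not the project's actual code):

    import java.util.*;

    // Illustrative sketch of the central merge: collect the k-best hits
    // from every remote server, suppress near-duplicates, and return the
    // overall k best by score.
    class CentralMerger {
        static class Result {
            String url;
            double score;                 // keyword weight combined with PageRank
            Set<Long> shingles;           // shingle signature from the remote server
            List<Result> similar = new ArrayList<>();
        }

        static double jaccard(Set<Long> a, Set<Long> b) {
            Set<Long> inter = new HashSet<>(a);
            inter.retainAll(b);
            Set<Long> union = new HashSet<>(a);
            union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }

        static List<Result> mergeKBest(List<List<Result>> perServer, int k, double threshold) {
            PriorityQueue<Result> byScore =
                new PriorityQueue<>((x, y) -> Double.compare(y.score, x.score));
            for (List<Result> hits : perServer) byScore.addAll(hits);

            List<Result> merged = new ArrayList<>();
            while (!byScore.isEmpty() && merged.size() < k) {
                Result next = byScore.poll();
                boolean dup = false;
                for (Result kept : merged) {
                    if (jaccard(kept.shingles, next.shingles) >= threshold) {
                        kept.similar.add(next);  // still reachable via "view similar"
                        dup = true;
                        break;
                    }
                }
                if (!dup) merged.add(next);
            }
            return merged;
        }
    }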
Crawling
The crawling package is currently a custom Java web crawler, threaded to use all 6 CPUs on our test machine, and able to crawl roughly 20 pages per second, or 20,796 pages in 18 minutes, during timing tests. To achieve this speed we had to modify our design to perform the crawling first and run duplicate detection, indexing, and PageRank in batch mode after crawling. We also had to avoid using a SQL database to store parent-child relationships, which caused bottlenecks.
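For illustration, a minimal sketch of the threaded crawl loop under these design choices might look like the following (names are our own, and fetch() is a placeholder for the actual download and link extraction):

    import java.util.Set;
    import java.util.concurrent.*;

    // Sketch of the crawl loop (the real crawler also enforces crawl depth
    // and a stay-in-domain setting). Pages are only queued for disk here;
    // duplicate detection, indexing, and PageRank run afterwards in batch
    // mode, so no SQL database sits on the hot path.
    class CrawlerSketch {
        static class Page { String url; String rawHtml; java.util.List<String> children; }

        static final int WORKERS = 6;  // one per CPU on our test machine

        public static void main(String[] args) {
            BlockingQueue<String> urlPool = new LinkedBlockingQueue<>();
            Set<String> visited = ConcurrentHashMap.newKeySet();          // already-seen URLs
            BlockingQueue<Page> writeQueue = new LinkedBlockingQueue<>(); // drained by a disk-writer thread

            urlPool.add("http://www.uwindsor.ca/");
            ExecutorService pool = Executors.newFixedThreadPool(WORKERS);
            for (int i = 0; i < WORKERS; i++) {
                pool.submit(() -> {
                    try {
                        String url;
                        while ((url = urlPool.poll(5, TimeUnit.SECONDS)) != null) {
                            if (!visited.add(url)) continue;  // skip already-seen URLs
                            Page page = fetch(url);           // download + extract child links
                            if (page == null) continue;
                            writeQueue.add(page);             // batch phases pick this up later
                            urlPool.addAll(page.children);
                        }
                    } catch (InterruptedException ignored) { }
                });
            }
            pool.shutdown();
        }

        // Placeholder: a real fetch would use HttpURLConnection or an HTML parser.
        static Page fetch(String url) { return null; }
    }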
Below are three snapshots from the Performance Monitor during the three phases of crawling. The first phase loads the already-seen URLs into the visited pool so that they are not visited again unless specified. To speed this up, we keep the URLs in their MD5 hash form, avoiding recalculating the hashes each time crawling restarts. In the second phase we populate the URL pool from the last few pages written to disk; each page is stored as a vector containing the raw HTML and the child links found on that page. The third phase is the crawl itself, which explores the web to a limited depth with a stay-in-domain setting. Examining the third phase shows the CPUs fully utilized without network or disk bottlenecks, so we believe we could easily increase crawling speed by moving to a processor with more cores.
[Figure: Crawler Performance Monitor snapshots during the three phases: loading already-seen URLs, starting to crawl, and the middle of a crawl.]
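To make the first phase concrete, the visited pool can hold MD5 hashes rather than URL strings; a minimal sketch using only the standard java.security API:

    import java.math.BigInteger;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch of the visited pool's key: the MD5 hash is computed once,
    // persisted between crawls, and reloaded directly, so restarting never
    // re-hashes millions of URLs.
    class UrlHash {
        static String md5(String url) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(java.nio.charset.StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));  // fixed-width hex form
        }
    }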
PageRank
In our project we chose to use a domain-level rank as an approximation of PageRank, as proposed by Wu and Aberer [3] and similar to Wang and DeWitt's method of combining a ServerRank with a local PageRank [4] to approximate the global PageRank. Our PageRank algorithm computes the rank for 80,465 domains in under one second and currently takes nine iterations to converge. In our sample set we found the top domains to have the PageRanks shown below, which we compared against Gephi's PageRank. The two rankings are very similar, which is sufficient to show that our algorithm is accurate enough for our requirements, since PageRank is only one of the weights we plan to use to prioritize search results.
Our top 16 domains by PageRank (random teleports, beta = 0.8), compared against Gephi's default PageRank and Gephi with random teleports (beta = 0.8):

Rank | Our PageRank (beta = 0.8) | Score    | Gephi default PageRank | Score    | Gephi random teleports (beta = 0.8)
1    | twitter.com               | 0.001567 | twitter.com            | 0.0078   | twitter.com
2    | facebook.com              | 0.001191 | facebook.com           | 0.00599  | facebook.com
3    | akamaihd.net              | 0.001017 | akamaihd.net           | 0.004807 | akamaihd.net
4    | google.com                | 0.0009   | google.com             | 0.004713 | google.com
5    | apple.com                 | 0.000426 | apple.com              | 0.002376 | apple.com
6    | youtube.com               | 0.00041  | youtube.com            | 0.002013 | youtube.com
7    | flickr.com                | 0.000284 | linkedin.com           | 0.001496 | linkedin.com
8    | linkedin.com              | 0.000267 | flickr.com             | 0.001353 | flickr.com
9    | t.co                      | 0.000228 | t.co                   | 0.001065 | t.co
10   | export.gov                | 0.000202 | export.gov             | 0.000954 | export.gov
11   | twimg.com                 | 0.000197 | pinterest.com          | 0.000927 | vine.co
12   | vine.co                   | 0.000195 | vine.co                | 0.000920 | twimg.com
13   | pinterest.com             | 0.000167 | twimg.com              | 0.000906 | pinterest.com
14   | co.uk                     | 0.000145 | instagram.com          | 0.000777 | instagram.com
15   | instagram.com             | 0.000139 | co.uk                  | 0.000720 | co.uk
16   | tumblr.com                | 0.00012  | tumblr.com             | 0.000649 | tumblr.com
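For reference, a minimal power-iteration sketch of this domain-level PageRank with random teleports (beta = 0.8); this is illustrative code, not our production implementation:

    import java.util.*;

    // The graph is an adjacency list from a domain id to the ids of
    // domains it links to; in our runs nine iterations were enough to
    // converge on roughly 80,000 domains.
    class DomainRank {
        static double[] pageRank(List<int[]> outLinks, double beta, int iterations) {
            int n = outLinks.size();
            double[] rank = new double[n];
            Arrays.fill(rank, 1.0 / n);
            for (int it = 0; it < iterations; it++) {
                double[] next = new double[n];
                Arrays.fill(next, (1.0 - beta) / n);        // teleport share
                for (int u = 0; u < n; u++) {
                    int[] out = outLinks.get(u);
                    if (out.length == 0) {                  // dangling node: spread rank evenly
                        for (int v = 0; v < n; v++) next[v] += beta * rank[u] / n;
                    } else {
                        for (int v : out) next[v] += beta * rank[u] / out.length;
                    }
                }
                rank = next;
            }
            return rank;
        }
    }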
Near-Duplicate Detection
In our project we modified the shingling algorithm discussed in class [2] to select the k-grams in a document that follow stop words. This approach, used by SpotSigs [1], exploits the fact that stop words seldom occur in unimportant template blocks such as navigation sidebars or the links shown at the bottom of a page. By selecting the k-grams that follow stop words, the shingling algorithm focuses on the document's true content and ignores the unimportant boilerplate. In experiments this method outperformed methods using purely random k-gram selection [1].
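A simplified sketch of the stop-word-anchored selection (the stop-word list and k are illustrative; the published algorithm has more machinery than this):

    import java.util.*;

    // Instead of taking every k-gram, keep only the k-gram that follows a
    // stop word, biasing the signature toward real content and away from
    // navigation templates.
    class SpotShingles {
        static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "an", "is", "was", "to", "of"));

        static Set<String> shingles(String[] tokens, int k) {
            Set<String> sigs = new HashSet<>();
            for (int i = 0; i < tokens.length - k; i++) {
                if (!STOP_WORDS.contains(tokens[i].toLowerCase())) continue;
                // keep the k tokens immediately following the stop word
                sigs.add(String.join(" ", Arrays.copyOfRange(tokens, i + 1, i + 1 + k)));
            }
            return sigs;
        }

        // Jaccard similarity between two signatures; we flag a
        // near-duplicate at a threshold of 0.85.
        static double jaccard(Set<String> a, Set<String> b) {
            Set<String> inter = new HashSet<>(a); inter.retainAll(b);
            Set<String> union = new HashSet<>(a); union.addAll(b);
            return union.isEmpty() ? 0.0 : (double) inter.size() / union.size();
        }
    }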
Results
The shingling algorithm detected 365 near-duplicate documents in a set of 1276 documents in 8 seconds with a Jaccard similarity threshold of 0.85. In additional testing we were able to process 9996 pages in under 6 minutes.
Log
Test set of 1276
0:00:00.000 Loading pages
0:00:00.009 Enumerating File List
0:00:00.126 Loading Files
0:00:00.128 Waiting for threads to work
0:00:06.320 1000 pages loaded
0:00:06.320 Constructing Shingles
0:00:08.101 Done
Test set of 9996
0:00:00.000 Loading pages
0:00:00.008 Enumerating File List
0:00:02.625 Loading Files
0:00:02.637 Waiting for (file loading) threads to work
0:04:30.231 9996 pages loaded
0:04:30.231 Constructing Shingles
0:05:38.123 Done
Search Engine
The search engine is based on Apache Lucene, which we modified to use our PageRank in conjunction with keyword weighting to prioritize search results. We also incorporated a final similarity check, since our results are assembled from the top results of the remote servers and could therefore include duplicates when combined. Besides removing duplicates, this check lets us show more unique query results to the user, with an option to view similar results.
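One hedged way to realize this blend, assuming the PageRank was stored in a "pagerank" field at index time (the field name and the multiplicative blending formula below are our illustration, not Lucene's built-in behavior):

    import java.util.*;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.search.*;

    // Fetch more hits than needed, multiply each Lucene keyword score by a
    // PageRank boost stored at index time, and resort.
    class RankedSearch {
        static List<ScoreDoc> rerank(IndexSearcher searcher, Query query, int k) throws Exception {
            TopDocs hits = searcher.search(query, k * 4);   // over-fetch, then re-rank
            List<ScoreDoc> docs = new ArrayList<>(Arrays.asList(hits.scoreDocs));
            Map<Integer, Double> boosted = new HashMap<>();
            for (ScoreDoc sd : docs) {
                Document doc = searcher.doc(sd.doc);
                double pr = Double.parseDouble(doc.get("pagerank"));
                boosted.put(sd.doc, sd.score * (1.0 + pr)); // simple multiplicative blend
            }
            docs.sort((x, y) -> Double.compare(boosted.get(y.doc), boosted.get(x.doc)));
            return docs.subList(0, Math.min(k, docs.size()));
        }
    }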
Results
Indexing the files into Lucene is by far the most time-consuming batch process, taking approximately 80 minutes to index 111,068 pages.
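A minimal sketch of that batch indexing step, assuming a Lucene 5+-style IndexWriterConfig and the illustrative "pagerank" stored field used by the search sketch above:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.*;
    import org.apache.lucene.index.*;
    import org.apache.lucene.store.FSDirectory;

    // Stores the precomputed PageRank alongside the page text so the
    // search side can blend the two scores.
    class BatchIndexer {
        public static void main(String[] args) throws Exception {
            FSDirectory dir = FSDirectory.open(Paths.get("index"));
            IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

            // One document per crawled page; the real batch loops over the
            // pages written to disk during crawling.
            Document doc = new Document();
            doc.add(new StringField("url", "http://example.com/", Field.Store.YES));
            doc.add(new TextField("content", "page text extracted from the raw HTML", Field.Store.NO));
            doc.add(new StoredField("pagerank", "0.001567"));  // precomputed domain rank
            writer.addDocument(doc);
            writer.close();
        }
    }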
Findings
The most glaring finding was the amount of processing power required for the crawling and indexing processes, along with the difficulties we encountered while looking for a unique, novel fact in the pages we crawled. At first we hoped to locate RDF information in the crawled HTML pages; however, as we learned in class, the semantic web is for machines and HTML is for humans. Next we tried to use an FP-growth algorithm to extract some useful facts, but in testing we discovered that text mining is a project unto itself, and we abandoned the idea when the FP-growth algorithm returned nothing on 2000 sports news pages.
In an effort to discover a useful fact from all our data, we adjusted the near-duplicate algorithm to focus on the contents of the <script> tags in HTML documents and calculate a Domain Framework Similarity. This lets us group domains with similar styles to identify possible common authors, and it allows a focused mining effort for knowledge extraction based on style rather than on simple word associations. Below are the top ten domains similar to marislilly.de; once we are able to extract meaningful information from marislilly.de, we should be able to extract information from gepinsel.de just as easily. A sketch of the script-based similarity appears after the table.
Domain                  | Similarity
marislilly.de           | 0.455
gepinsel.de             | 0.441
chocolate-bit.ch        | 0.436
feinschmeckerle.de      | 0.406
dielorbeerkrone.com     | 0.400
oberstrifftsahne.com    | 0.400
depeu-japon.com         | 0.395
meinesuessewerkstatt.de | 0.395
bearnerdette.de         | 0.394
midiamundo.com          | 0.382
fashionhippieloves.com  | 0.371
References
[1] Hajishirzi, Hannaneh, Wen-tau Yih, and Aleksander Kolcz. "Adaptive near-duplicate detection via
similarity learning." Proceedings of the 33rd international ACM SIGIR conference on Research and
development in information retrieval. ACM, 2010.
[2] Rajaraman, Anand, and Jeffrey Ullman. Mining of Massive Datasets. 2013.
[3] Wu, Jie, and Karl Aberer. "Using siterank for decentralized computation of web document ranking."
Adaptive Hypermedia and Adaptive Web-Based Systems. Springer Berlin Heidelberg, 2004.
[4] Wang, Yuan, and David J. DeWitt. "Computing pagerank in a distributed internet search system."
Proceedings of the Thirtieth international conference on Very large data bases-Volume 30. VLDB
Endowment, 2004.