Project Title: Semantic Web Project Results

Jordan Willis, James Reid

Submitted in partial fulfillment of the requirements for 03-60-569 Semantic Web
Faculty of Computer Science
University of Windsor
Windsor, Ontario
Dr. J. Lu
December 15, 2013

Semantic Web Project Results

Overview

Our project goal was to create a scalable, distributed web crawler with a search engine to query the pages crawled. The engine uses a PageRank to prioritize and sort search results, removes near duplicates before indexing, and groups similar returns at search time. Although we never achieved the distributed aspect because of resource limitations, we were able to test the individual components that make up our remote servers and to simulate the central server's search results. This document is a summary of our results and findings. Further information is available on the project web site, and the search engine can be tested at the Project Search Engine site.

Overall Design: Divide and Conquer Scheme

The distributed servers are responsible for crawling and indexing their own web space, calculating the local PageRanks, and returning their k-best results to the Central Search Engine, which in turn displays the k-best results from all the distributed servers to the searcher. This design is meant to reduce network traffic and search time by keeping the bulk of the information on the distributed servers and returning only the links between domains and the search results to the Central Search Engine. To address the possibility of duplicates being returned by the remote servers, a similarity check is performed before the results are returned to the user.

Crawling

The crawling package is currently a custom Java web crawler, threaded to use all 6 CPUs on our test machine, and is able to crawl about 20 pages per second (20796 pages in 18 minutes during timed testing). To achieve this speed we had to modify our design to perform the crawling first and run duplicate detection, indexing, and PageRank in batch mode after crawling. We also had to avoid using a SQL database to store parent-child relationships, which caused bottlenecks.

Below are three snapshots from the Performance Monitor taken during the three phases of crawling. The first phase loads the URLs already seen into the visited-URL pool so they are not visited again unless specified; to speed this up we keep the URL names in their MD5 hash form, which avoids recalculating the hashes each time crawling restarts. In the second phase we populate the URL pool from the last few pages written to disk; each page is a vector containing the raw HTML and the child links found on that page. The third phase is the crawling function itself, which crawls the web to a limited depth with a stay-in-domain setting. In the third snapshot the CPUs are fully utilized without causing network or disk bottlenecks, so we believe we can easily increase our crawling speed by moving to a processor with more CPUs.

[Crawler Performance Monitor snapshots: loading already-seen URLs, starting to crawl, middle of a crawl]

Page Rank

In our project we chose to use a domain rank as an approximation of PageRank, as proposed by Wu and Aberer [3] and similar to Wang and DeWitt's method of combining a ServerRank with a local PageRank [4] to approximate a global PageRank. Our PageRank algorithm computes the PageRank for 80465 domains in under 1 second and currently takes nine iterations to converge.
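As a concrete illustration of this computation, below is a minimal power-iteration sketch over a domain link graph using random teleports with beta = 0.8 (the setting used in the comparison that follows). The class name, graph representation, and dangling-domain handling are illustrative assumptions, not our exact implementation.

```java
import java.util.*;

/* Minimal domain-level PageRank sketch: power iteration with random teleports. */
public class DomainRankSketch {

    /**
     * graph.get(u) holds the domain ids that domain u links to (ids 0..n-1).
     * Returns the rank vector after the requested number of iterations.
     */
    public static double[] rank(List<int[]> graph, double beta, int iterations) {
        int n = graph.size();
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n);                      // start from the uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int u = 0; u < n; u++) {
                int[] out = graph.get(u);
                if (out.length == 0) {
                    danglingMass += rank[u];             // domains with no out-links
                } else {
                    double share = beta * rank[u] / out.length;
                    for (int v : out) next[v] += share;  // follow-a-link probability mass
                }
            }
            // Teleport term plus redistributed dangling mass keeps next a probability vector.
            double base = (1.0 - beta) / n + beta * danglingMass / n;
            for (int v = 0; v < n; v++) next[v] += base;
            rank = next;
        }
        return rank;
    }
}
```

With beta = 0.8 and nine iterations, this is the kind of run that produced the domain ranks compared below.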
In our sample set the top domains received the following PageRank values, which we compared against Gephi's PageRank. The results are very similar to Gephi's, which is sufficient to convince us that our algorithm is correct enough for our requirements, since PageRank is only one of the weights we plan to use to prioritize search results.

Our top 16 domains by PageRank (random teleports, beta = 0.8) compared with Gephi:

Rank  Our PageRank                Gephi default PageRank       Gephi random teleports (beta = 0.8)
 1    twitter.com      0.001567   twitter.com      0.0078      twitter.com
 2    facebook.com     0.001191   facebook.com     0.00599     facebook.com
 3    akamaihd.net     0.001017   akamaihd.net     0.004807    akamaihd.net
 4    google.com       0.0009     google.com       0.004713    google.com
 5    apple.com        0.000426   apple.com        0.002376    apple.com
 6    youtube.com      0.00041    youtube.com      0.002013    youtube.com
 7    flickr.com       0.000284   linkedin.com     0.001496    linkedin.com
 8    linkedin.com     0.000267   flickr.com       0.001353    flickr.com
 9    t.co             0.000228   t.co             0.001065    t.co
10    export.gov       0.000202   export.gov       9.54E-04    export.gov
11    twimg.com        0.000197   pinterest.com    9.27E-04    vine.co
12    vine.co          0.000195   vine.co          9.20E-04    twimg.com
13    pinterest.com    0.000167   twimg.com        9.06E-04    pinterest.com
14    co.uk            0.000145   instagram.com    7.77E-04    instagram.com
15    instagram.com    0.000139   co.uk            7.20E-04    co.uk
16    tumblr.com       0.00012    tumblr.com       6.49E-04    tumblr.com

Near-Duplicate Detection

In our project we modified the shingling algorithm discussed in class [2] to select the k-grams in a document that follow stop words. The new algorithm, SpotSigs [1], uses the fact that stop words seldom occur in unimportant template blocks such as the navigation sidebar or the links shown at the bottom of a page. By selecting the k-grams that follow stop words, the shingling algorithm focuses on the document's true content and ignores the unimportant information. In experiments this method outperformed methods using purely random k-gram selection [1].

Results

The shingling algorithm detected 365 near-duplicate documents out of a set of 1276 documents in 8 seconds with a Jaccard similarity threshold of 0.85. In additional testing we were able to process 9996 pages in under 6 minutes.

Log, test set of 1276 pages:
0:00:00.000  Loading pages
0:00:00.009  Enumerating File List
0:00:00.126  Loading Files
0:00:00.128  Waiting for threads to work
0:00:06.320  1000 pages loaded
0:00:06.320  Constructing Shingles
0:00:08.101  Done

Log, test set of 9996 pages:
0:00:00.000  Loading pages
0:00:00.008  Enumerating File List
0:00:02.625  Loading Files
0:00:02.637  Waiting for (file loading) threads to work
0:04:30.231  9996 pages loaded
0:04:30.231  Constructing Shingles
0:05:38.123  Done

Search Engine

The search engine is based on Apache Lucene, which we modified to use our PageRank in conjunction with the keyword weighting to prioritize search returns. We also incorporated a final similarity check, since our results are assembled from the top results of the remote servers and could contain duplicates once combined. In addition to duplicate detection, this allows us to show more unique query returns to the user, with an option to view the similar returns.

Results

Indexing the files into Lucene is by far the most time-consuming batch process, taking approximately 80 minutes to index 111068 pages.
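To illustrate how the PageRank weight can be combined with Lucene's keyword score, the sketch below re-ranks the top Lucene hits using a per-document PageRank value stored at indexing time. The field names ("contents", "pagerank", "url"), the mixing weight, and the Lucene 4.x calls are assumptions made for the example rather than the exact classes and fields our engine uses.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

import java.io.File;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;

/* Hypothetical sketch: re-rank Lucene keyword hits with a stored PageRank field (Lucene 4.x style). */
public class PageRankSearchSketch {

    public static void main(String[] args) throws Exception {
        DirectoryReader reader = DirectoryReader.open(FSDirectory.open(new File("index")));
        IndexSearcher searcher = new IndexSearcher(reader);

        QueryParser parser = new QueryParser(Version.LUCENE_44, "contents",
                new StandardAnalyzer(Version.LUCENE_44));
        TopDocs hits = searcher.search(parser.parse("semantic web"), 50);   // k-best keyword hits

        double mix = 0.7;   // assumed weight between keyword score and PageRank
        List<AbstractMap.SimpleEntry<String, Double>> ranked = new ArrayList<>();
        for (ScoreDoc sd : hits.scoreDocs) {
            Document doc = searcher.doc(sd.doc);
            // "pagerank" is assumed to be stored per document at indexing time;
            // in practice both terms would be normalized to comparable scales before mixing.
            double pageRank = Double.parseDouble(doc.get("pagerank"));
            double combined = mix * sd.score + (1.0 - mix) * pageRank;
            ranked.add(new AbstractMap.SimpleEntry<>(doc.get("url"), combined));
        }
        ranked.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        for (AbstractMap.SimpleEntry<String, Double> e : ranked) {
            System.out.println(e.getValue() + "  " + e.getKey());
        }
        reader.close();
    }
}
```

The final similarity check here, like the near-duplicate detection described earlier, relies on stop-word-anchored shingles compared with Jaccard similarity. A minimal sketch of that step follows; the stop-word list and the 3-word shingle length are illustrative assumptions, while the 0.85 threshold matches the value reported above.

```java
import java.util.*;

/* Illustrative SpotSigs-style shingling: keep only the k-grams that follow a stop word. */
public class StopWordShingleSketch {

    private static final Set<String> STOP_WORDS =       // small illustrative list
            new HashSet<>(Arrays.asList("a", "an", "the", "is", "of", "to", "and", "in"));

    /** Build the set of 3-word shingles that immediately follow a stop word. */
    public static Set<String> shingles(String text) {
        String[] tokens = text.toLowerCase().split("\\W+");
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 < tokens.length; i++) {
            if (STOP_WORDS.contains(tokens[i])) {
                result.add(tokens[i + 1] + " " + tokens[i + 2] + " " + tokens[i + 3]);
            }
        }
        return result;
    }

    /** Jaccard similarity of two shingle sets; 0.85 is the near-duplicate threshold used above. */
    public static double jaccard(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    public static void main(String[] args) {
        String p1 = "The crawler stores the raw html of a page and the links found in the page";
        String p2 = "The crawler stores the raw html of a page and all of the links on the page";
        System.out.println(jaccard(shingles(p1), shingles(p2)) >= 0.85 ? "near-duplicate" : "distinct");
    }
}
```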
Findings

The most glaring finding was the amount of processing power required to perform the crawling and indexing processes, and the difficulty we encountered while looking for a unique, novel fact in all the pages we crawled.

At first we hoped to locate RDF information in the HTML pages we crawled; however, as we learned in class, the semantic web is for machines and HTML is for humans. Next we tried to use an FP-growth algorithm to strip out some useful facts, and in testing we discovered that text mining is a project unto itself; we abandoned the idea when the FP-growth algorithm returned nothing after processing 2000 sports news pages.

In an effort to discover a useful fact from all our data, we adjusted the near-duplicates algorithm to focus on the contents of the <script> tags in HTML documents and calculate a Domain Framework Similarity. This lets us group domains with similar styles, identify possible common authors, and allow a focused mining effort for knowledge extraction based on style rather than on simple word association. Below are the top ten domains similar to marislilly.de. Once we are able to extract meaningful information from marislilly.de, we should be able to extract information from gepinsel.de just as easily.

Domain                     Similarity
marislilly.de              0.454545454545454
gepinsel.de                0.441176470588235
chocolate-bit.ch           0.435897435897436
feinschmeckerle.de         0.40625
dielorbeerkrone.com        0.4
oberstrifftsahne.com       0.4
depeu-japon.com            0.394736842105263
meinesuessewerkstatt.de    0.394736842105263
bearnerdette.de            0.393939393939394
midiamundo.com             0.382352941176471
fashionhippieloves.com     0.371428571428571

References

[1] Hajishirzi, Hannaneh, Wen-tau Yih, and Aleksander Kolcz. "Adaptive near-duplicate detection via similarity learning." Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2010.
[2] Rajaraman, Anand, and Jeff Ullman. Mining of Massive Datasets. 2013.
[3] Wu, Jie, and Karl Aberer. "Using SiteRank for decentralized computation of web document ranking." Adaptive Hypermedia and Adaptive Web-Based Systems. Springer Berlin Heidelberg, 2004.
[4] Wang, Yuan, and David J. DeWitt. "Computing PageRank in a distributed internet search system." Proceedings of the Thirtieth International Conference on Very Large Data Bases, Volume 30. VLDB Endowment, 2004.