
SILOs - Distributed Web Archiving & Analysis using Map Reduce
CS 8803 AIAD Project Report
Anushree Venkatesh
Sagar Mehta
Sushma Rao
Contents
Contents
Motivation
Related work
An Introduction to Map Reduce
Architecture
Component Description
Challenges
Evaluation
Conclusion
Extensions
Bibliography
Appendix A
Appendix B
Motivation
The internet is currently the largest repository of information in the world. However, the average life of a page on the internet is, according to published statistics, only 44 days [14], which means we are continuously losing an incredible amount of information. To avoid this loss, the concept of web archiving was introduced as early as 1994. Preserving content is only one of the reasons for building a web archive; some of the others are as follows [7]:
• Cultural: the pace of technological change makes it difficult to maintain data that exists mostly in digital format.
• Technical: technology needs a while to stabilize, and spreading data over different technologies may lead to loss of data.
• Economic: an archive serves the interest of the public and helps them access previously hosted data as well.
• Legal: an archive helps deal with intellectual property issues.
Considering the vast amount of data available on the World Wide Web, it is practically impossible to run a crawler on a single machine, which makes a distributed architecture a requirement. Google's Map-Reduce is an efficient programming model that allows for distributed processing. Our project therefore explores the use of this model for distributed web crawling and archiving, and analyzes the results over various parameters, including efficiency, through an implementation on Hadoop [11], an open-source framework that implements map-reduce in Java.
Related work
This project aims to propose a new method for web archiving. There currently exist a number of
traditional approaches to web archiving and crawling [14]. Some of these are described below.
1. Automatic harvesting approach: a crawler is run without any restrictions per se. One of the most common examples of this approach is the Internet Archive [16].
2. Deposit approach: the author of a page submits the page himself to be included in the archive, e.g., DDB [15].
Beyond this, our project brings to the forefront the issue of the crawling mechanism itself.
There are two dimensions along which types of crawling can be defined – content based and
architecture based. The different types of existing content based methods of crawling are [4]:
1. Selective: This basically involves changing the policy of insertions and extractions in the queue of discovered URLs; it can incorporate various factors such as depth, link popularity, etc.
2. Focused: This is a refined version of a selective crawler in which only specific content is crawled.
3. Distributed: The task is distributed based on content, with crawlers running in multiple processes to parallelize the work.
4. Web dynamics based: This method keeps the data up to date by refreshing the content at regular intervals, based on chosen refresh functions and possibly with the help of intelligent agents.
The different architectural approaches to crawling are:
1. Centralized: Implements a central scheduler and downloader. This is not a very popular
architecture. Most of the crawlers that have been implemented use the distributed architecture.
2. Distributed: To our knowledge, the existing crawlers that use a distributed architecture basically implement a peer-to-peer network distribution protocol, e.g., Apoidea [5].
We instead implement the distributed architecture provided by the map-reduce framework.
An Introduction to Map Reduce
Figure 1: MapReduce execution overview [1]
The MapReduce library splits the input into M configurable segments, which are processed in parallel on a set of distributed machines. The intermediate key/value pairs generated by the map function are then partitioned into R regions and assigned to reduce workers. For every unique intermediate key generated by map, the corresponding values are passed to the reduce function. On completion of the reduce phase, we have R output files, one from each reduce worker.
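In Hadoop, whose older org.apache.hadoop.mapred API we use in the appendices, the values of M and R can be influenced through the job configuration. The following is a minimal sketch of such a configuration, mirroring the job setup in Appendices A and B; the class name, paths, and task counts are illustrative assumptions, and the mapper and reducer classes are omitted here:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class JobSetupSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JobSetupSketch.class);
        conf.setJobName("job-setup-sketch");
        conf.setNumMapTasks(8);    // hint for M: desired number of input splits / map tasks
        conf.setNumReduceTasks(4); // R: number of reduce tasks, and hence R output files
        conf.setInputPath(new Path(args[0]));   // input directory on the distributed file system
        conf.setOutputPath(new Path(args[1]));  // output directory; one part file per reduce task
        // Mapper, reducer, and input/output formats would be set here as in Appendices A and B.
        JobClient.runJob(conf);
    }
}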
Architecture
Figure 2: Architecture. The seed list feeds the distributed crawler, a map-reduce job that removes duplicate URLs, fetches pages, applies diff compression, and writes to the URL table and the page content table. Each fetched page (<URL, page content>) flows to the URL extractor, which parses out links, fills the adjacency list table, and emits <URL, parent URL> pairs to the back-links mapper, which in turn builds the back-links table; a second copy goes to the keyword extractor, which parses for keywords (<keyword, URL>) and builds the inverted index table. The graph builder consumes the adjacency list table.
Component Description
1. Seed List: This is a list of URLs that bootstraps the crawling process.
2. Distributed Crawler:
   1. It uses the map and reduce functions to distribute the crawling process among multiple worker nodes.
   2. Given an input URL, it checks for duplicates and, if the URL is new, inserts it into url_table, which is keyed by the hash of the URL. Wherever a hash is mentioned, it is a standard SHA-1 hash (a small helper illustrating this hash is sketched after the pseudocode below). The URL is then fed to the page fetcher, which issues an HTTP GET request for it.
   3. One copy of the fetched page is sent to the URL extractor and another to the keyword extractor. The page is then stored in page_content_table with hash(page_content) as the key and, if a previous version of the page exists, the compressed difference against the latest version as the value.
Map
Input <url, 1>
if( !duplicate(url) )
{
    page_content = http_get(url);
    Insert into url_table <hash(url), url, hash(page_content), time_stamp>
    Output intermediate pair <url, page_content>
}
else if( duplicate(url) && (current_time - time_stamp(url) > threshold) )
{
    page_content = http_get(url);
    Update url_table(hash(url), current_time);
    Output intermediate pair <url, page_content>
}
else
{
    Update url_table(hash(url), current_time);
}
Reduce
Input <url, page_content>
if( !exists( hash(url) in page_content_table ) )
{
    Insert into page_content_table <hash(page_content), compress(page_content)>
}
else if( hash(page_content_table(hash(url))) != hash(current_page_content) )
{
    Insert into page_content_table <hash(page_content), compress( diff_with_latest(page_content) )>
}
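Wherever hash(...) appears in the pseudocode above, a standard SHA-1 hash is meant. A minimal Java helper of the kind we have in mind (the class and method names are illustrative) is:

import java.security.MessageDigest;

public class HashUtil {
    // Returns the SHA-1 digest of the input as a lowercase hex string; used as the key
    // of url_table (hash of the URL) and page_content_table (hash of the page content).
    public static String sha1Hex(String input) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(input.getBytes("UTF-8"));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}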
3. URL Extractor:
   1. It parses the web-page contents it receives and extracts URLs from them using regular expressions (a simplified sketch follows the pseudocode below); these URLs are fed back to the distributed crawler.
   2. Additionally, for all the URLs extracted from a page, we generate an adjacency list <node, list<adjacent nodes>>, and this tuple is inserted into the adjacency list table.
   3. The map step also emits a tuple <parsed_url, parent_url>, which is fed to the back-link mapper.
Map
Input <url, page_content>
list<urls> = parse(page_content);
Insert into adjacency_list_table <hash(url), list<hash(urls)>>
For each extracted url
{
    Output intermediate pair <url, 1>, which is fed to the distributed crawler
    Output intermediate pair <url, parent_url>, which is fed to the back-link mapper
}
Reduce
Nothing needs to be done in the reduce step.
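A simplified sketch of the regular-expression based link extraction performed in the map step is shown below. The pattern is an illustrative assumption that matches only absolute, double-quoted http/https links; a real implementation would also need to handle relative links.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractorSketch {
    // Illustrative pattern: absolute http/https links inside double-quoted href attributes.
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns the URLs found in the page content; each extracted URL is emitted
    // as <url, 1> to the distributed crawler and as <url, parent_url> to the back-link mapper.
    public static List<String> parse(String pageContent) {
        List<String> urls = new ArrayList<String>();
        Matcher m = HREF.matcher(pageContent);
        while (m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }
}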
4. Back-link Mapper:
   1. It uses map-reduce to generate, for each URL, the list of URLs that point to it, i.e. the tuple <url, list<parent_urls>>. A sketch of the reduce step in Hadoop's API follows the pseudocode below.
Map
Input <parent_url, points_to_url>
Output intermediate pair <points_to_url, parent_url>
Reduce
Input <points_to_url, parent_url>
Combine all pairs <points_to_url, parent_url> that share the same points_to_url to emit <points_to_url, list<parent_urls>>
Insert into backward_links_table <points_to_url, list<parent_urls>>
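A sketch of this reduce step in the old org.apache.hadoop.mapred API (imports and job setup as in Appendix A; the comma separator for the parent list is an assumption) could look like:

public static class BackLinkReduce extends MapReduceBase implements Reducer {
    public void reduce(WritableComparable key, Iterator values,
                       OutputCollector output, Reporter reporter) throws IOException {
        // key is points_to_url; values are the parent URLs that link to it.
        StringBuilder parents = new StringBuilder();
        while (values.hasNext()) {
            if (parents.length() > 0) parents.append(",");
            parents.append(((Text) values.next()).toString());
        }
        // Emit <points_to_url, list of parent_urls> for insertion into backward_links_table.
        output.collect(key, new Text(parents.toString()));
    }
}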
5. Keyword Extractor:
   1. It parses the web-page contents it receives and extracts the words that are not stop words; these are used to build an inverted index with map-reduce (a sketch of the stop-word filtering follows the pseudocode below). The inverted index consists of key-value pairs where the key is a keyword and the value is the list of URLs on which that keyword is found.
Map
Input <url, page_content>
list<keywords> = parse(page_content);
For each keyword
    Output intermediate pair <keyword, url>
Reduce
Combine all <keyword, url> pairs with the same keyword to emit <keyword, list<urls>>
Insert into inverted_index_table <keyword, list<urls>>
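A minimal sketch of the stop-word filtering applied before emitting <keyword, url> pairs is shown below; the stop-word set here is a small illustrative subset, not the list actually used.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class KeywordFilterSketch {
    // Small illustrative subset of stop words; a real list would be much larger.
    private static final Set<String> STOP_WORDS = new HashSet<String>(
        Arrays.asList("the", "a", "an", "and", "or", "of", "to", "in", "is", "for"));

    // Tokenizes page text and drops stop words; the mapper then emits
    // <keyword, url> for each surviving keyword.
    public static List<String> keywords(String pageText) {
        List<String> result = new ArrayList<String>();
        for (String token : pageText.toLowerCase().split("\\W+")) {
            if (token.length() > 0 && !STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }
}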
6. Graph Builder:
   1. It takes as input the adjacency-list representation of the crawled web graph and builds a visual representation of the graph.
Challenges
The following were the main challenges in the project:
• Understanding the map-reduce framework and its usage.
• Installing and managing the Hadoop open-source map-reduce framework, which is a non-trivial task.
• Identifying the components of the architecture that could be ported to a map-reduce structure.
• Identifying experiments that made use of the map-reduce framework.
• Implementing the identified components and experiments in accordance with the framework.
• Coming up with a fair performance comparison between equivalent programs written within and outside map-reduce, implicitly answering the question "When is it better to have a map-reduce version of a program instead of just multi-threading the same program?"
Evaluation
Figure: a zoomed-out snapshot of the crawled data, starting from the www.gatech.edu domain. (Note that the crawled links are not restricted to gatech.edu: if a link points to an outside domain, it is crawled too.)
The above visualization is created by the graph-builder component, which uses the Graphviz library to render a graph given its adjacency-list structure. To understand how this works, consider a small example where the adjacency list is as follows:
digraph G {
    gatech -> stanford
    gatech -> cmu
    gatech -> cornell
    stanford -> MIT
    stanford -> cornell
    cornell -> MIT
}
The corresponding graph rendered by Graphviz is as follows. In essence, we automated the construction of this digraph structure while parsing out the links on each page during the distributed crawl.
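As an illustration of what the graph builder does, the following sketch writes a Graphviz DOT file from an adjacency list; the class and method names are our own, and the real component works off the adjacency list table.

import java.io.FileWriter;
import java.io.IOException;
import java.util.List;
import java.util.Map;

public class DotWriterSketch {
    // Writes a digraph in Graphviz DOT format for the given adjacency list;
    // the file can then be rendered with, e.g., "dot -Tpng graph.dot -o graph.png".
    public static void writeDot(Map<String, List<String>> adjacency, String path)
            throws IOException {
        FileWriter out = new FileWriter(path);
        out.write("digraph G {\n");
        for (Map.Entry<String, List<String>> entry : adjacency.entrySet()) {
            for (String target : entry.getValue()) {
                out.write("  \"" + entry.getKey() + "\" -> \"" + target + "\"\n");
            }
        }
        out.write("}\n");
        out.close();
    }
}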
Distributed Crawler:
We were successfully able to distribute the process of crawling using map reduce. We provided an input
file of URLs to the map task. These URLs were crawled and the fetched pages were stored as the output
of the reduce function.
We noticed that the execution time of the distributed crawler using map-reduce was much lower than the time taken by a single-threaded crawler to crawl the same number of URLs. However, as noted previously, making a fair comparison of similar functionality within and outside map-reduce is a non-trivial task and a project in itself.
Appendix A has sample code for the distributed fetch functionality that we implemented using map-reduce.
Analysis on the fetched data using map-reduce:
We also used map-reduce to analyze the data that was obtained from the fetched pages.
In these experiments, however, we noticed that the execution time of map-reduce was significantly higher than that of a single-threaded Java application performing the same task. For example, in the keyword frequency experiment described below, even a single-threaded standalone Java program was on average 10 times faster than the equivalent map-reduce version. This can be attributed to the fact that the communication overhead among the master and slave nodes in map-reduce becomes noticeable when the data set is not large enough.
Key word frequency:
In this task we wanted to find the most frequently occurring words on different, comparable websites. Using the distributed crawler, we fetched equivalent subsets from the sites www.gatech.edu, www.cmu.edu and www.cornell.edu.
On the fetched data we ran the frequency count program using map-reduce. In the map phase, the program parses out the page contents using HtmlParser [18] and, for each keyword, emits the pair <keyword, one> to signify that the particular keyword occurred once. So if a keyword occurs N times, N such pairs will be emitted across different map functions.
In the reduce phase, we sum up the counts for each keyword, giving us its frequency. The task is distributed, since we have a large number of map and reduce tasks.
Appendix B has a Java implementation of the above frequency count program in the map-reduce framework.
Verifying the Zipf law on the natural language word corpus
Another interesting experiment, related to the frequency count task above, was to verify Zipf's law, which states that the frequency of a word is inversely proportional to its rank in the frequency table. The Zipf distribution is a power-law distribution; to verify this, we plot the log of the frequency of each word count against the log of the word count itself. If the plot is linear on this scale, the distribution follows a power law.
To get the frequency of each word count, i.e. how often each frequency value occurs across words, we ran another pass of the frequency count map-reduce program, where each distinct frequency now became the keyword (see the sketch below).
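A sketch of the map step of this second pass, in the same old API as Appendix B; the assumed input is the <word, count> output of the first pass, one whitespace-separated pair per line:

public void map(WritableComparable key, Writable value,
                OutputCollector output, Reporter reporter) throws IOException {
    // Each input line is "word count"; the count itself becomes the key,
    // so the Appendix B reduce step then yields the frequency of each frequency.
    String[] fields = ((Text) value).toString().trim().split("\\s+");
    if (fields.length == 2) {
        output.collect(new Text(fields[1]), new IntWritable(1));
    }
}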
When plotted for the CMU dataset, the Zipf distribution looked as follows.
URL Depth:
We observed the average URL depth for a given domain. For example, www.gatech.edu/sports has a URL depth of 1 and www.gatech.edu/library/books has a URL depth of 2, so the average depth over these two URLs would be (1 + 2)/2 = 1.5.
Again this was done using map-reduce: the map phase emits each URL and its depth (see the sketch below), while the reduce phase sums the depths and produces the average for each domain.
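A minimal sketch of how the URL depth could be computed in the map phase (the helper name is illustrative):

import java.net.MalformedURLException;
import java.net.URL;

public class UrlDepthSketch {
    // Returns the URL depth as the number of non-empty path segments,
    // e.g. depth("http://www.gatech.edu/library/books") == 2.
    public static int depth(String url) throws MalformedURLException {
        int count = 0;
        for (String segment : new URL(url).getPath().split("/")) {
            if (segment.length() > 0) {
                count++;
            }
        }
        return count;
    }
}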
Our results are presented in the following tables:
Average URL Depth

CMU                                Cornell                            Gatech
cmu.edu                    2.73    cornell.edu                 1.34   gatech.edu              1
alumni.cmu.edu             2.18    www.gradschool.cornell.edu  1      gtalumni.org            3
www.library.cmu.edu        2.23    www.news.cornell.edu        2.57   ramblinwreck.cstv.com   2.57
www.alumniconnections.com  4.81    www.sce.cornell.edu         1      cyberbuzz.gatech.edu    2
URL Count:
URL Count refers to the number of URLs of a particular domain that are present in the pages of a domain
being crawled.
The table below shows, for example, that www.cmu.edu has links to alumni.cmu.edu, hr.web.cmu.edu, www.library.cmu.edu, etc. Note that we extracted only the domain portion of the URLs to get these counts, again using a map-reduce program similar to those in the experiments above (a sketch of the domain extraction is shown below).
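A minimal sketch of the domain extraction used here (the helper name is illustrative); a map-reduce program like the word count of Appendix B can then emit <domain, 1> in the map phase and sum the counts per domain in reduce:

import java.net.MalformedURLException;
import java.net.URL;

public class DomainExtractSketch {
    // Returns only the domain (host) portion of a URL,
    // e.g. "http://alumni.cmu.edu/events/index.html" -> "alumni.cmu.edu".
    public static String domain(String url) throws MalformedURLException {
        return new URL(url).getHost();
    }
}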
Top 6 URL domains that get traversed

CMU                                 Cornell                           Gatech
alumni.cmu.edu               92     www.cornell.edu             43    gtalumni.org                170
hr.web.cmu.edu               13     www.cuinfo.cornell.edu       2    centennial.gtalumni.org       4
www.alumniconnections.com    16     www.gradschool.cornell.edu   2    cyberbuzz.gatech.edu        236
www.carnegiemellontoday.com  10     www.news.cornell.edu         7    georgiatech.searchease.com    9
www.library.cmu.edu          69     www.sce.cornell.edu          8    ramblinwreck.cstv.com         7
www.cmu.edu                  14     www.vet.cornell.edu          1    www.gatech.edu               56
Conclusion
The basic goal of our project was to understand the map-reduce framework and use that knowledge to run some interesting experiments on a large-scale web dataset. To get the web dataset, we started out with an open-source crawler.
As we became more acquainted with the Hadoop framework, we decided to exploit the distributed processing functionality of map-reduce to write a distributed crawler. We then came up with a map-reduce algorithm for it and were able to successfully write a distributed crawler using map-reduce.
Given the time constraints of the project, and that this was also our first time working with a crawler, we believe we were able to contribute a substantial amount.
To conclude, the following might be considered the most important contributions of our system:
- It brings to light the advantage that map-reduce brings to the table when processing large amounts of data, and how the distributed processing is abstracted away by the framework, which allows a person to concentrate on the problem at hand.
- With different experiments, it shows how a variety of problems can be modeled as map-reduce problems.
- A distributed crawler using map-reduce, which to our knowledge is the first such attempt.
- A framework laid out for building a distributed archiving system.
- An inverted index generated by our system, which can be used to build a search-engine layer on top of it.
As a course project, it was extremely beneficial, as it allowed us to see the details of two major existing systems, Peer Crawler and the Hadoop framework, and to use our knowledge to integrate the concepts of crawling and map-reduce and perform experiments.
As an extension, it would be interesting to see the results if we stored the data in a better manner and implemented a search over this data for improved information retrieval from the archived data set.
Extensions
• An obvious question, especially for someone from a systems background, is when map-reduce is better than an equivalent multi-threaded program written outside map-reduce. This is a non-trivial question in itself, given that most of the optimizations are domain-specific, and it would require at minimum:
   1. An in-depth study of the Hadoop source code along with the various optimizations used
   2. Coming up with a fair comparison of the equivalent programs
• Another extension, as we previously noted, is distributed archiving using the framework we have laid out. Currently the pages are written directly to the file system; a better approach would be to store them in a relational database with different versions stored incrementally. Again this is a non-trivial task, since:
   1. The pages are in HTML and would therefore require HTML diffs before the pages are stored.
   2. When a particular version of a page is queried for, it needs to be generated at runtime from the incremental HTML diffs.
• Another interesting experiment would be to build a search-engine layer on top of the inverted index that our system generates.
Bibliography
1. Jeffrey Dean and Sanjay Ghemawat. "MapReduce: Simplified Data Processing on Large Clusters". Google, Inc.
2. Brown, A. (2006). Archiving Websites: a practical guide for information management professionals.
Facet Publishing.
3. Brügger, N. (2005). Archiving Websites. General Considerations and Strategies. The Centre for Internet
Research.
4. Pierre Baldi, Paolo Frasconi, Padhraic Smyth. “Modeling the Internet and the Web, Probabilistic
Methods and Algorithms” Chapter 6 – Advanced Crawling Techniques
5. Aameek Singh, Mudhakar Srivatsa, Ling Liu, and Todd Miller. “Apoidea: A Decentralized Peer-to-Peer
Architecture for Crawling the World Wide Web”
6. Day, M. (2003). "Preserving the Fabric of Our Lives: A Survey of Web Preservation Initiatives".
Research and Advanced Technology for Digital Libraries: Proceedings of the 7th European Conference
(ECDL): 461-472.
7. Eysenbach, G. and Trudel, M. (2005). "Going, going, still there: using the WebCite service to
permanently archive cited web pages". Journal of Medical Internet Research 7 (5).
8. Fitch, Kent (2003). "Web site archiving - an approach to recording every materially different
response produced by a website". Ausweb 03.
9. Lyman, P. (2002). "Archiving the World Wide Web". Building a National Strategy for Preservation:
Issues in Digital Media Archiving.
10. Masanès, J. (ed.) (2006). Web Archiving. Springer-Verlag.
11. http://hadoop.apache.org/
12. http://en.wikipedia.org/wiki/Hadoop
13. http://code.google.com/edu/content/submissions/uwspr2007_clustercourse/listing.html
14. http://www.clir.org/pubs/reports/pub106/web.html
15. http://deposit.ddb.de/
16. http://www.archive.org/
17. http://en.wikipedia.org/wiki/Zipf's_law
18. HtmlParser: http://htmlparser.sourceforge.net/
Appendix A
Sample Distributed fetch using map-reduce
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.util.Iterator;
import java.util.Scanner;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class MyWebPageFetcher {
public static class Map extends MapReduceBase implements Mapper {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
String downloadedStuff = "initial";
public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException
{
String line = ((Text)value).toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
String url = word.toString();
String option = "header";
MyWebPageFetcher fetcher = new MyWebPageFetcher(url);
/*if ( HEADER.equalsIgnoreCase(option) ) {
log( fetcher.getPageHeader() );
}
else if ( CONTENT.equalsIgnoreCase(option) ) {
log( fetcher.getPageContent() );
}
else {
log("Unknown option.");
}*/
downloadedStuff = fetcher.getPageContent();
//System.out.println(downloadedStuff);
//System.out.println("\n\nIn Map " + downloadedStuff + "\n\nEnd Map");
output.collect(word,new Text(downloadedStuff));
}
}
}
public static class Reduce extends MapReduceBase implements Reducer {
public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
// Concatenate all fetched page contents for this URL and emit them once.
String sum = "";
while (values.hasNext()) {
sum += ((Text)values.next()).toString();
}
output.collect(key, new Text(sum));
}}
public MyWebPageFetcher( URL aURL ){
if ( ! HTTP.equals(aURL.getProtocol()) ) {
throw new IllegalArgumentException("URL is not for HTTP Protocol: " + aURL);
}
fURL = aURL;
}
public MyWebPageFetcher( String aUrlName ) throws MalformedURLException {
this ( new URL(aUrlName) );
}
/** Fetch the HTML content of the page as simple text. */
public String getPageContent() {
String result = null;
URLConnection connection = null;
try {
connection = fURL.openConnection();
Scanner scanner = new Scanner(connection.getInputStream());
scanner.useDelimiter(END_OF_INPUT);
result = scanner.next();
}
catch ( IOException ex ) {
log("Cannot open connection to " + fURL.toString());
}
return result;
}
/** Fetch HTML headers as simple text. */
public String getPageHeader(){
StringBuilder result = new StringBuilder();
URLConnection connection = null;
try {
connection = fURL.openConnection();
}
catch (IOException ex) {
log("Cannot open connection to URL: " + fURL);
}
//not all headers come in key-value pairs - sometimes the key is
//null or an empty String
int headerIdx = 0;
String headerKey = null;
String headerValue = null;
while ( (headerValue = connection.getHeaderField(headerIdx)) != null ) {
headerKey = connection.getHeaderFieldKey(headerIdx);
if ( headerKey != null && headerKey.length()>0 ) {
result.append( headerKey );
result.append(" : ");
}
result.append( headerValue );
result.append(NEWLINE);
headerIdx++;
}
return result.toString();
}
// PRIVATE //
private URL fURL;
private static final String HTTP = "http";
private static final String HEADER = "header";
private static final String CONTENT = "content";
private static final String END_OF_INPUT = "\\Z";
private static final String NEWLINE = System.getProperty("line.separator");
private static void log(Object aObject){
//System.out.println(aObject);
/*File outFile1 = new File("bodycontent.txt");
FileWriter fileWriter1 = new FileWriter(outFile1,true);
fileWriter1.write((String)aObject);
fileWriter1.close();*/
//System.out.println((String) aObject);
}
public static void main(String[] args) throws Exception {
try {
JobConf conf = new JobConf(MyWebPageFetcher.class);
conf.setJobName("MyWebPageFetcher");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
conf.setInputPath(new Path(args[0]));
conf.setOutputPath(new Path(args[1]));
conf.setNumMapTasks(8);
conf.setNumReduceTasks(8);
JobClient.runJob(conf);
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
}
}
Appendix B
Word count using map-reduce
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {
public static class Map extends MapReduceBase implements Mapper {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter) throws IOException {
String line = ((Text)value).toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
public static class Reduce extends MapReduceBase implements Reducer {
public void reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += ((IntWritable)values.next()).get();
}
output.collect(key, new IntWritable(sum));
}
}
public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(WordCount.class);
conf.setJobName("wordcount");
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);
conf.setMapperClass(Map.class);
conf.setCombinerClass(Reduce.class);
conf.setReducerClass(Reduce.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);
conf.setInputPath(new Path(args[0]));
conf.setOutputPath(new Path(args[1]));
JobClient.runJob(conf);
}
}