Technical report: Web-crawl graph exchange
Viv Cothey
University of Wolverhampton
viv.cothey@wlv.ac.uk
Abstract
This report identifies the need for an xml application to exchange Web-crawl graphs. An
illustrative application, blinker, is described. The experience of using blinker is
discussed and the successful achievement of its design goals is reported. Some specific additional
requirements of blinker are also specified.
1. Introduction
Data collection by Web-crawling is time consuming. It also consumes both network and server
resources. It is therefore desirable to make optimal use of both the crawler and the large datasets
(Web-crawl graphs) that result. That is, Web-crawl graphs should be able to support a broad range
of Web research by different researchers employing different techniques and using different
computing environments. By reducing overall network and server resource consumption the
crawler is behaving ethically and "treading lightly" on the Web. A Web-crawl graph is taken here
to mean all the data that defines a Web-page graph within the crawl-space. Each vertex (or Web-page) is associated with a collection, possibly empty, of arcs (or outlinks) which are the hyperlinks
from the Web-page to other vertices in the graph. Each vertex is also possibly associated with a
collection of attributes.
Given that a crawler has collected a large Web-crawl graph and that this is to be shared, then a
mutually acceptable description and exchange format is required.
The problem of describing and exchanging Web-crawl graphs is a special case of the problem of
describing and exchanging graphs generally. In general this is coupled with graph visualisation and
display so that the general solution for graph data includes those aspects which relate to a particular
rendering of the graph.
Visualisation and rendering information is not required in respect of Web-crawl graphs. However
there is a need to provide appropriate additional information that describes the crawler conditions
under which the data were collected.
2. Goals for a Web-crawl graph exchange format
The following pair of goals is proposed.
Goal one: minimise modification or distortion of the data
The data describing Web-crawl graphs should correspond as faithfully as possible to the data
that are provided by each Web server. The data should be as complete as possible and
selectively retaining only a subset of the data should be avoided.
Goal two: maximise accessibility
Access to Web-crawl graphs should not require the use of proprietary software nor should it
be predicated upon the use of highly configured (for example multi-GB memory) machines.
3. Problems/challenges
The nature of the data collected when Web-crawling causes two kinds of problems in respect of
data exchange. One relates to the inclusion of arbitrary characters within the data; this is the
characterset problem. The second kind of problem arises from the magnitude of the datasets; this is
the scale problem.
What has become known as Postel's Law says "be conservative in what you do, be liberal in what
you accept from others" (Postel, 1981). Although contentious this philosophy has influenced the
practice of producing and processing Web-pages. There is a general lack of conformance
validation. For example, the text purporting to be a hyperlink in html can contain arbitrary (other
than the html reserved, "<" and "&") characters including control characters. Web clients
(browsers) act liberally to process the text and use heuristics to render the page as best they can.
The problem of dealing with invalid or malformed hyperlink urls is passed to the user. In
consequence Web-crawlers obtaining Web-page files from servers inevitably collect dirty Web-crawl graphs, such as urls that may be accidentally (or deliberately) malformed in a variety of ways
as well as HTTP (hypertext transfer protocol) headers that are malformed.
In addition it should be noted, in the context of the characterset problem, that W3C's
internationalisation efforts continue and that the ASCII (American standard code for information
interchange) based characterset for the components of a url is being expanded. This will allow, for
example, accented characters to appear within the url. In any event, the HTTP headers and their
values may be constructed arbitrarily and use the local characterset of the server.
Web-page graphs may also be very large. Processing large graphs is computationally intensive,
which constrains how they may be exchanged. This is because the interchange format must be
resolvable by the recipient regardless of the scale of the computing resource that has been employed
to create it. Hence, for example, the use of large in-memory data structures is problematic.
The proposed strategy to address the character problem is to use Unicode rather than ASCII. The
proposed strategy to address the scale problem is to serialise the graph and to ensure that any
processing can be carried out serially. These strategies combine to suggest an xml based format for
Web-crawl graph exchange.
4. An xml based format for Web-crawl graph exchange
4.1 The generic benefits of xml
The Extensible Markup Language (xml) is a syntax for marking up data (including text) with
simple human readable tags. W3C have endorsed xml as a standard for document markup.
The default character encoding used is Unicode (utf8) so that xml can be used for all
recognised languages.
xml is said to be portable, that is, xml applications can be processed without modification by
different machines. This is achieved by the format being explicit. One needs only to know
that the data in a file is in xml in order to read it rather than needing to know by some
external means the particular format arrangement necessary to access the data. In addition
software is generally available both to read (or parse) xml as well as to carry out more
sophisticated processing.
In order to correctly parse an xml file the text content must be well-formed. That is, it must
conform to the general syntax of xml. In addition to being syntactically well-formed valid
xml conforms to the structural requirements of a particular xml application. It is not
necessary for xml files to be valid since, for example, the structural requirements may not
have been specified.
There are two approaches for processing xml files. The first is serial. Serial processing
makes no assumptions about the size of the file in relation to the computing memory that is
available. In essence, the file is read line by line (although in xml an arrangement of the file
into lines is not significant). The other mode for processing xml files relies on holding a
complete representation of the data in memory. This mode which is very fast for small files
is not appropriate when considering arbitrarily large files.
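The two processing modes can be contrasted with a short sketch. This is an illustrative example only, using Python's standard xml.etree library and an invented two-node document; it is not part of the blinker toolset.

```python
import io
import xml.etree.ElementTree as ET

sample = b"<crawl><node id='_1'/><node id='_2'/></crawl>"

# In-memory mode: the whole tree is parsed at once. Fast for small
# files, but memory consumption grows with the size of the file.
tree = ET.fromstring(sample)
in_memory_count = len(tree.findall("node"))

# Serial mode: stream parse events and discard each element once it
# has been handled, so memory consumption stays roughly constant
# however large the file is.
serial_count = 0
for _event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "node":
        serial_count += 1
        elem.clear()  # release the subtree just processed

print(in_memory_count, serial_count)
```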
4.2 Graph exchange formats
A common feature of the general approach to describing and exchanging graphs is that the
graph is represented serially as a list of nodes or vertices and a list of arcs or edges. Each
node has a unique identifier and may be assigned a label. Each arc, for example, is then
described as an ordered pair of node identifiers.
Additional descriptive features list the attributes of each node, arc and edge, for example the
node co-ordinates in a particular visualization rendering.
A simple general graph description could therefore resemble:
node = 1
node = 2
node = 3
node = 4
arc = (1, 2)
arc = (1, 3)
arc = (2, 4)
arc = (4, 3)
Examples of general graph formats include Pajek's ".net" file format and GraphML
(previously GraphXML) which uses xml. The above example, when converted to Pajek .net
format (which requires the number of nodes to be stated at the outset), would be:
*Vertices 4
1 "label for node 1"
2 "label for node 2"
3 "label for node 3"
4 "label for node 4"
*Arcs
1 2
1 3
2 4
4 3
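Producing such a .net file from the generic node and arc lists is a simple serial write. The sketch below is a minimal illustration in Python (not part of any published Pajek tooling); the labels are the placeholder strings from the example above.

```python
# The four-node example graph: ordered node labels plus arc pairs.
nodes = ["label for node 1", "label for node 2",
         "label for node 3", "label for node 4"]
arcs = [(1, 2), (1, 3), (2, 4), (4, 3)]

lines = ["*Vertices %d" % len(nodes)]  # node count must come first
for i, label in enumerate(nodes, start=1):
    lines.append('%d "%s"' % (i, label))
lines.append("*Arcs")
for src, dst in arcs:
    lines.append("%d %d" % (src, dst))

net = "\n".join(lines)
print(net)
```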
4.3 Outline of a Web-crawl graph exchange format
The Web-crawl graph exchange format proposed encapsulates the graph description within a
description of the particular crawl which generated the data. The blinker (Web link crawler)
xml application illustrates this:
<blinker>
<header>
<!-- crawl specification/description -->
</header>
<crawl>
<!-- Web-page graph description -->
</crawl>
<trailer>
<!-- Web-page graph summary information -->
</trailer>
</blinker>
The Web-page graph crawl element is wrapped with header and trailer elements.
The header element contains a description that is particular to the crawler and the crawl
that collected the Web-page graph while the trailer element contains a summary of the
Web-page graph.
In addition to xml elements, such as blinker, containing other xml elements as is
illustrated, xml elements may also have their own element attributes. This feature is
exploited as described in the next Section.
4.3.1 The crawl element
The crawl element describes the Web-page graph as a list of nodes each of which is
described in a node element. Each node in the graph is uniquely identified using an "id"
element attribute. Hence in outline a small Web-page graph is described by:
<crawl>
<node id="_1">
<!-- description of node -->
</node>
<node id="_2">
<!-- description of node -->
</node>
<node id="_3">
<!-- description of node -->
</node>
<node id="_4">
<!-- description of node -->
</node>
</crawl>
In principle the arcs should be described separately as ordered pairs as noted in Section 4.2.
However for processing simplicity, the arcs are represented just by the list of outlinks in
respect of each node. Hence the xml file can be produced concurrently as the crawler
proceeds rather than being generated only retrospectively after the crawl has concluded. The
list of outlinks within the node element is a urlReferences element which may not
always be present. Both the node and each outlink are labelled by a regularized version of the
text of the relevant locating url. These are contained within a label element. It should be
noted that every arc terminal node mentioned, that is every outlink label, must also occur
(just once) as a node element label. Thus the description of each node within the node
element is represented as:
<node id="_n">
<!-- description of node -->
<label><!-- regularized form of node url --></label>
<urlReferences>
<label><!-- regularized form of outlink url --></label>
<label><!-- regularized form of outlink url --></label>
<label><!-- regularized form of outlink url --></label>
</urlReferences>
</node>
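A node element of this shape can be emitted as soon as a page has been fetched. The following sketch is hypothetical (the function name and arguments are invented for illustration); escape() guards against the xml-reserved characters that malformed urls may contain.

```python
from xml.sax.saxutils import escape

def node_element(node_id, url, outlinks):
    # Build one self-contained node element; escape() converts the
    # xml-reserved characters ("<" and "&") that malformed urls may
    # contain into entities.
    parts = ['<node id="%s">' % node_id,
             "<label>%s</label>" % escape(url)]
    if outlinks:  # the urlReferences element may not always be present
        parts.append("<urlReferences>")
        parts.extend("<label>%s</label>" % escape(u) for u in outlinks)
        parts.append("</urlReferences>")
    parts.append("</node>")
    return "".join(parts)

node_text = node_element("_1", "http://example.org/a&b",
                         ["http://example.org/c"])
print(node_text)
```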
In pursuit of Goal one, other than the textual regularization (discussed in Section 4.4)
mentioned, no other data modification is proposed. Hence, for example, loops (outlinks that
are self-referring) and multiple arcs (two or more outlinks having the same label) are included.
This may be relevant, say, when determining the frequency distribution of hyperlinks per
page employed by authors.
The node element is completed by including all the HTTP header information that is available
and any other descriptive information that is computed by the crawler. The HTTP
information consists of a collection of "attributes". Unfortunately it is not possible to include
these as node element attributes because of the uncontrolled presence of arbitrary characters
which have the consequence that the xml is not well-formed. Node attributes are therefore
included within individual attribute elements as a "type" and "value" pair. For example:
<attribute type="status-line" value="401 Authorization Required" />
A real example is illustrated in the Appendix.
4.3.2 The header element
The purpose of the header element is to contain the crawler and crawl specific information
that is needed to qualify the Web-page graph described in the crawl element. This
qualifying information is described using the configuration, crawlSpace and
crawlSeed elements. In addition the header element attributes give the start time of the
crawl and the operational name of the crawler. (Note that the crawler name need not be the
same as the name of the user agent.)
The configuration element is not discussed in detail here but an example of its usage is
given in the Appendix.
The crawlSpace element comprises a collection of subdomains and "websites" over which
the crawler is permitted to operate and from which data may be collected. Since neither
subdomains nor "websites" are urls, label elements are not used. A simple example is
shown in the Appendix while a more complex example is:
<crawlSpace>
<subDomain>immunologie.de</subDomain>
<subDomain>tu-dresden.de</subDomain>
<subDomain>drfz.de</subDomain>
<website>www.charite.de/ch/institute/</website>
<subDomain>ukaachen.de</subDomain>
<website>www.zoologie.uni-bonn.de/Immunbiologie/</website>
<subDomain>ruhr-uni-bochum.de</subDomain>
<subDomain>biozentrum.unibas.ch</subDomain>
<website>www.fz-borstel.de/</website>
<website>www.medizin.fu-berlin.de/immu/</website>
<website>www.charite.de/immunologie/</website>
<website>www.mpiib-berlin.mpg.de/</website>
<subDomain>rki.de</subDomain>
<website>www.charite.de/ch/rheuma/</website>
</crawlSpace>
The crawlSeed element comprises the collection of urls from which the crawler started to
collect data. The regularized text for each of these is contained within a label element. (The
crawlSeed element is thus structurally equivalent to the urlReferences element but is
distinguished in order to make clear its separate function.)
4.3.3 The trailer element
The trailer element both complements the header element and contains a summary of
the crawl. It complements the header in that its element attributes record the time when the
crawler finished and the duration of the crawl. The summary provided in the example shown
in the Appendix analyses the total number of nodes in the graph by their HTTP status-line
attribute. This information is used to manage the crawler and is generated by the crawler. In
principle a wide variety of other information could be summarised. Note that such a
summary is always also available by later analysing the node elements of the graph.
4.4 Text regularisation
4.4.1 Node label text regularisation (or normalisation)
Each label element contained within either the crawlSeed or urlReferences
elements contains a textual representation of a (possibly malformed) url. It is desirable, at the
least from an ethical crawler perspective, that multiple possible textual representations of the
same url be regularised into some standard form so that equivalent texts can be identified.
This enables the crawler to tread lightly and avoid requesting the same url from a server more
than once. Url fragments are therefore discarded from the url text as part of the regularisation
process.
Standard algorithms are available, for example Burke (2002). Note that these must be able to
regularise the text of malformed urls (for example where there are invalid host name
characters) in addition to normalising well-formed urls.
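A minimal sketch of such a regularisation step, assuming the rules are limited to lower-casing the scheme and host, supplying a default path, and discarding the fragment (the actual blinker rules are not specified here):

```python
from urllib.parse import urlsplit, urlunsplit

def regularise(url):
    # Lower-case the case-insensitive components, default an empty
    # path to "/", and drop any fragment so that equivalent texts
    # regularise to the same string.
    scheme, netloc, path, query, _fragment = urlsplit(url)
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))

print(regularise("HTTP://WWW.Heaven.LI/home/#top"))
print(regularise("http://www.heaven.li"))
```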
The text of a malformed url can contain arbitrary characters. Hence difficulties may be
encountered when using alternative file formats that make use of special characters including
control characters as format separators. An advantage of using xml in conjunction with url-like text is that the reserved html characters are also xml reserved characters. Hence
malformed urls cannot corrupt, either by accident or deliberately, the xml application.
4.4.2 HTTP header text regularisation
Both the HTTP header description and the text that is assigned to it can contain arbitrary
characters. In particular these may include xml reserved characters. In order to safely encode
this text, unsafe characters can be automatically converted to their xml entities. For example,
" becomes &quot;. xml parsers routinely encode or decode such entities as required. The text
can now be safely assigned to an element attribute as its value. Since the text of HTTP
headers occurs as colon-separated type: value pairs, corresponding pairs of attribute
element attributes (type and value) are used.
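As a sketch, Python's standard xml.sax.saxutils.quoteattr carries out exactly this kind of entity conversion; the header value below is modelled on the 401 response in the Appendix. This is an illustration only, not the crawler's actual code.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

def attribute_element(htype, value):
    # quoteattr() converts unsafe characters to xml entities and
    # chooses safe surrounding quotes, so arbitrary header text can
    # be carried as an element attribute value.
    return "<attribute type=%s value=%s />" % (quoteattr(htype),
                                               quoteattr(value))

elem_text = attribute_element("WWW-Authenticate", 'Basic realm="Internal"')

# Round-trip check: the element parses cleanly and the text survives.
parsed = ET.fromstring(elem_text)
print(parsed.get("value"))
```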
4.5 Serialisation
The blinker xml application which describes a Web-crawl graph is produced serially as
the crawl proceeds. The blinker xml file may thus be arbitrarily large. In contrast,
generating a Web-page graph retrospectively may be limited by the computing resource that
is available.
Well-formed xml applications can be processed serially. Syntactically, xml elements may be
nested but they may not overlap. This means that, for example, in blinker each node
element is self-contained and can be processed in isolation. It can also be expected that every
label element that appears is regularised and has a corresponding node element that has a
unique identifier. Therefore for example, a single pass through the file analysing each node
element in turn is a nodal analysis of the whole Web-page graph described. A similar double
pass of the file can be used to convert the xml file format to Pajek .net format. The xml
processing to undertake the conversion uses standard serial xml tools and memory
consumption remains constant. Arbitrarily large Web-crawl graphs can be processed in this
way.
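For example, the status-line tally recorded in the trailer element can be reproduced by a single serial pass. The sketch below uses Python's iterparse over an invented three-node fragment; memory consumption stays constant because each node element is cleared once it has been counted.

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET

sample = b"""<crawl>
<node id="_1"><attribute type="status-line" value="200 OK" /></node>
<node id="_2"><attribute type="status-line" value="401 Authorization Required" /></node>
<node id="_3"><attribute type="status-line" value="200 OK" /></node>
</crawl>"""

totals = Counter()
for _event, elem in ET.iterparse(io.BytesIO(sample), events=("end",)):
    if elem.tag == "node":
        for attr in elem.iter("attribute"):
            if attr.get("type") == "status-line":
                totals[attr.get("value")] += 1
        elem.clear()  # drop the node once counted: memory stays constant

print(dict(totals))
```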
5. Achievement of goals
5.1 Goal one: minimise modification or distortion of the data
The blinker Web-crawl graph xml application has been tested with respect to crawls over
UK, German and Spanish based Web servers. This has exposed the application to a range of
non-ASCII charactersets as well as a range of malformed url-like text and HTTP headers.
The xml application provides a systematic procedure for preserving the data provided by each
server while not compromising the integrity of the file format.
The integrity of the xml application was verified by using standard xml processing software
to parse the file and to convert the Web-page graph to Pajek .net format. The .net file
produced was then processed by Pajek.
5.2 Goal two: maximise accessibility
A blinker xml application file generated in a strict Unix type environment was exported to
another computer where it was analysed to determine the frequency distribution of one of the
node attribute parameters. The original file was analysed with respect to the same question.
The pair of analyses were carried out independently by two researchers without sharing any
information other than the question to be answered.
The pair of frequency distributions obtained were then processed by SPSS and shown to be
identical. (A later comparison of methods revealed that in one case the xml application was
parsed with a custom coded event handler that analysed each node element, while in the
other case an xml-stylesheet processor had been used to extract the particular parameter
values.)
In principle there are no restrictions on access to the exchanged Web-crawl graphs.
5.3 Outstanding issues
The xml Web-crawl graph exchange format proposed faithfully includes any email address
that is included as an outlink in a Web-page. This is in conformance with Goal one.
It is recognised that it is not ethically safe to collect and make available for exchange the
large numbers of email addresses that may be included in a Web-crawl graph.
Therefore each email address should be anonymized prior to exchange. However this should
be achieved in a way that preserves the topology of the Web-page graph.
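One hypothetical way to do this (not part of the current blinker implementation) is to replace each address with a digest of itself: identical addresses always map to the same label, so the arcs of the Web-page graph are preserved while the addresses themselves are hidden.

```python
import hashlib

def anonymise(email):
    # Replace the address with a fixed-length digest; the mapping is
    # deterministic, so every outlink to the same address still points
    # at the same (anonymised) node, preserving the graph topology.
    digest = hashlib.md5(email.encode("utf8")).hexdigest()
    return "mailto:anon-" + digest

a = anonymise("someone@example.org")
b = anonymise("someone@example.org")
c = anonymise("other@example.org")
print(a == b, a == c)
```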
Discussion of the Web-crawl graph exchange format has focussed on the crawl element that
describes the Web-page graph. The header element is important in that it describes the
crawler and qualifies the Web-page graph. As yet there is little experience or evidence on
which to base any recommendations as to either the minimum or recommended composition
of the header.
The exchange format has so far been tested in only two computing environments. It is
desirable that the range of environments be extended to include a purely proprietary
environment.
An advantage of xml and xml applications is that the format is extensible by design. The
outline format presented provides a basis for extension. For example keywords as well as
outlinks may be obtained from (html) Web-pages. These data could be included as, say, a
keywords element within each node element. The requirements of potential users have
not yet been determined.
6. Conclusions
The blinker xml application used to illustrate a proposed Web-crawl graph exchange
format meets the criteria set. That is, the format is able to comprehensively describe and
exchange without data loss:
- the Web-crawl that was undertaken, and
- the associated Web-page graph including the HTTP header data provided by each
Web server.
The Web-crawl graph exchange format has proved to be robust when exchanged and does not
require any proprietary software for access.
7. Acknowledgements
This work was supported by a grant from the Common Basis for Science, Technology and
Innovation Indicators part of the Improving Human Research Potential specific programme of the
Fifth Framework for Research and Technological Development of the European Commission. It is
part of the WISER project (Web indicators for scientific, technological and innovation research)
(Contract HPV2-CT-2002-00015) (www.webindicators.org).
Appendix:
<blinker>
<header start-time="Thu Mar 25 15:44:53 2004" name="blinker/incremental0.52">
<configuration>
<agent>blinker/incremental0.52</agent>
<admin>viv.cothey@wlv.ac.uk</admin>
<delay unit="seconds">10</delay>
<timeout unit="seconds">60</timeout>
<maxSize>unlimited</maxSize>
<maxDuration unit="days">2</maxDuration>
<maxNetworkHits>10000</maxNetworkHits>
<maxServerHits>10000</maxServerHits>
<refreshAge unit="days">28</refreshAge>
<maxPathSegments>6</maxPathSegments>
<maxDocumentsPerDirectory>1000</maxDocumentsPerDirectory>
<includeShallowQueries>no</includeShallowQueries>
</configuration>
<crawlSpace>
<subDomain>heaven.li</subDomain>
</crawlSpace>
<crawlSeed>
<label>http://www.heaven.li/</label>
</crawlSeed>
</header>
<crawl>
<node id="_1">
<attribute type="status-line" value="200 OK" />
<attribute type="protocol" value="HTTP/1.1" />
<attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
<attribute type="Accept-Ranges" value="bytes" />
<attribute type="Client-Date" value="Thu, 18 Mar 2004 08:49:15 GMT" />
<attribute type="Date" value="Thu, 18 Mar 2004 12:07:11 GMT" />
<attribute type="Title" value="www.heaven.li" />
<attribute type="Connection" value="close" />
<attribute type="Content-Length" value="289" />
<attribute type="Last-Modified" value="Wed, 18 Jul 2001 08:10:28 GMT" />
<attribute type="ETag" value="df27a-121-3b554474" />
<attribute type="Content-Type" value="text/html" />
<attribute type="method" value="GET" />
<attribute type="md5" value="af4cc05ae7a89c3a22c6cddb5a57e3e1" />
<attribute type="Client-Response-Num" value="1" />
<label>http://www.heaven.li/</label>
<urlReferences>
<label>http://www.heaven.li/private/</label>
<label>http://www.heaven.li/home/</label>
</urlReferences>
</node>
<node id="_2">
<attribute type="status-line" value="401 Authorization Required" />
<attribute type="protocol" value="HTTP/1.1" />
<attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
<attribute type="Client-Date" value="Thu, 18 Mar 2004 08:50:01 GMT" />
<attribute type="Date" value="Thu, 18 Mar 2004 12:07:56 GMT" />
<attribute type="Title" value="401 Authorization Required" />
<attribute type="X-Pad" value="avoid browser bug" />
<attribute type="Client-Transfer-Encoding" value="chunked" />
<attribute type="Connection" value="close" />
<attribute type="WWW-Authenticate" value="Basic realm=&quot;Internal&quot;" />
<attribute type="Content-Type" value="text/html; charset=iso-8859-1" />
<attribute type="method" value="GET" />
<attribute type="Client-Response-Num" value="1" />
<label>http://www.heaven.li/private/</label>
</node>
<node id="_3">
<attribute type="status-line" value="200 OK" />
<attribute type="protocol" value="HTTP/1.1" />
<attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
<attribute type="Accept-Ranges" value="bytes" />
<attribute type="Client-Date" value="Thu, 18 Mar 2004 08:49:36 GMT" />
<attribute type="Date" value="Thu, 18 Mar 2004 12:07:32 GMT" />
<attribute type="Title" value="www.heaven.li/home/" />
<attribute type="Connection" value="close" />
<attribute type="Content-Length" value="4717" />
<attribute type="Last-Modified" value="Mon, 30 Jul 2001 11:54:18 GMT" />
<attribute type="ETag" value="53b49-126d-3b654aea" />
<attribute type="Content-Type" value="text/html" />
<attribute type="method" value="GET" />
<attribute type="md5" value="1bfc62f7c115adc33e5693f4d7013d66" />
<attribute type="Client-Response-Num" value="1" />
<label>http://www.heaven.li/home/</label>
<urlReferences>
<label>http://www.heaven.li/home/byrail.jpg</label>
<label>http://www.heaven.li/home/byroad.gif</label>
</urlReferences>
</node>
<node id="_4">
<attribute type="status-line" value="200 OK" />
<attribute type="protocol" value="HTTP/1.1" />
<attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
<attribute type="Accept-Ranges" value="bytes" />
<attribute type="Client-Date" value="Thu, 18 Mar 2004 08:50:27 GMT" />
<attribute type="Date" value="Thu, 18 Mar 2004 12:08:24 GMT" />
<attribute type="Connection" value="close" />
<attribute type="Content-Length" value="392625" />
<attribute type="Last-Modified" value="Tue, 01 May 2001 12:14:27 GMT" />
<attribute type="ETag" value="53b48-5fdb1-3aeea8a3" />
<attribute type="Content-Type" value="image/gif" />
<attribute type="method" value="HEAD" />
<attribute type="Client-Response-Num" value="1" />
<label>http://www.heaven.li/home/byroad.gif</label>
</node>
<node id="_5">
<attribute type="status-line" value="200 OK" />
<attribute type="protocol" value="HTTP/1.1" />
<attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
<attribute type="Accept-Ranges" value="bytes" />
<attribute type="Client-Date" value="Thu, 18 Mar 2004 08:50:17 GMT" />
<attribute type="Date" value="Thu, 18 Mar 2004 12:08:14 GMT" />
<attribute type="Connection" value="close" />
<attribute type="Content-Length" value="87309" />
<attribute type="Last-Modified" value="Tue, 01 May 2001 12:15:27 GMT" />
<attribute type="ETag" value="53b47-1550d-3aeea8df" />
<attribute type="Content-Type" value="image/jpeg" />
<attribute type="method" value="HEAD" />
<attribute type="Client-Response-Num" value="1" />
<label>http://www.heaven.li/home/byrail.jpg</label>
</node>
</crawl>
<trailer finish-time="Thu Mar 25 15:44:53 2004" duration="0">
<total type="200 OK" value="4" />
<total type="401 Authorization Required" value="1" />
</trailer>
</blinker>
Bibliography:
Batagelj V. and Mrvar A., (2003). Pajek: analysis and visualization of large networks. In Jünger M. &
Mutzel P. (eds.), Graph drawing software (pp. 77-103). London: Springer.
Burke S. M., (2002). Perl & LWP. Farnham: O'Reilly.
Gourley D. and Totty B., (2002). HTTP: the definitive guide. Farnham: O'Reilly.
Harold E. R. and Means W. S., (2002). XML in a nutshell. Farnham: O'Reilly, 2nd edition.
Herman I. and Marshall M. S., (2000). GraphXML: an xml based graph interchange format.
Technical report INS-R0009, Centrum voor Wiskunde en Informatica, Amsterdam.
Postel J., (ed.), (1981). RFC 793: transmission control protocol.