Technical report: Web-crawl graph exchange

Viv Cothey
University of Wolverhampton
viv.cothey@wlv.ac.uk

Abstract

This report identifies the need for an xml application to exchange Web-crawl graphs. An illustrative application, blinker, is described. The experience of using blinker is discussed and the successful achievement of its design goals is reported. Some specific additional requirements of blinker are specified.

1. Introduction

Data collection by Web-crawling is time consuming. It also consumes both network and server resources. It is therefore desirable to make optimal use of both the crawler and the large datasets (Web-crawl graphs) that result. That is, Web-crawl graphs should be able to support a broad range of Web research by different researchers employing different techniques and using different computing environments. By reducing overall network and server resource consumption the crawler behaves ethically and "treads lightly" on the Web.

A Web-crawl graph is taken here to mean all the data that defines a Web-page graph within the crawl-space. Each vertex (or Web-page) is associated with a collection, possibly empty, of arcs (or outlinks), which are the hyperlinks from the Web-page to other vertices in the graph. Each vertex is also possibly associated with a collection of attributes. Given that a crawler has collected a large Web-crawl graph and that this is to be shared, a mutually acceptable description and exchange format is required.

The problem of describing and exchanging Web-crawl graphs is a special case of the problem of describing and exchanging graphs generally. The general problem is usually coupled with graph visualisation and display, so that general solutions for graph data include those aspects which relate to a particular rendering of the graph. Visualisation and rendering information is not required in respect of Web-crawl graphs.
However there is a need to provide appropriate additional information that describes the crawler conditions under which the data were collected.

2. Goals for a Web-crawl graph exchange format

The following pair of goals is proposed.

Goal one: minimise modification or distortion of the data. The data describing Web-crawl graphs should correspond as faithfully as possible to the data that are provided by each Web server. The data should be as complete as possible, and selectively retaining only a subset of the data should be avoided.

Goal two: maximise accessibility. Access to Web-crawl graphs should not require the use of proprietary software, nor should it be predicated upon the use of highly configured (for example, multi-GB memory) machines.

3. Problems/challenges

The nature of the data collected when Web-crawling causes two kinds of problems in respect of data exchange. One relates to the inclusion of arbitrary characters within the data; this is the characterset problem. The second arises from the magnitude of the datasets; this is the scale problem.

What has become known as Postel's Law says "be conservative in what you do, be liberal in what you accept from others" (Postel, 1981). Although contentious, this philosophy has influenced the practice of producing and processing Web-pages. There is a general lack of conformance validation. For example, the text purporting to be a hyperlink in html can contain arbitrary characters (other than the html reserved "<" and "&"), including control characters. Web clients (browsers) act liberally to process the text and use heuristics to render the page as best they can. The problem of dealing with invalid or malformed hyperlink urls is passed to the user.
In consequence, Web-crawlers obtaining Web-page files from servers inevitably collect dirty Web-crawl graphs, containing urls that may be accidentally (or deliberately) malformed in a variety of ways, as well as HTTP (hypertext transfer protocol) headers that are malformed. In addition it should be noted, in the context of the characterset problem, that W3C's internationalisation efforts continue and the ASCII (American standard code for information interchange) based characterset for the components of a url is being expanded. This will allow, for example, accented characters to appear within a url. In any event, the HTTP headers and their values may be constructed arbitrarily and use the local characterset of the server.

Web-page graphs also may be very large. Processing large graphs is computationally intensive, which constrains how they may be exchanged. This is because the interchange format must be resolvable by the recipient regardless of the scale of the computing resource that was employed to create it. Hence, for example, the use of large in-memory data structures is problematic.

The proposed strategy to address the characterset problem is to use Unicode rather than ASCII. The proposed strategy to address the scale problem is to serialise the graph and to ensure that any processing can be carried out serially. These strategies combine to suggest an xml based format for Web-crawl graph exchange.

4. An xml based format for Web-crawl graph exchange

4.1 The generic benefits of xml

The Extensible Markup Language (xml) is a syntax for marking up data (including text) with simple human readable tags. W3C has endorsed xml as a standard for document markup. The default character encoding is Unicode (utf-8), so that xml can be used for all recognised languages. xml is said to be portable, that is, xml applications can be processed without modification by different machines. This is achieved by the format being explicit.
One needs only to know that the data in a file are in xml in order to read them, rather than needing to know by some external means the particular format arrangement necessary to access the data. In addition, software is generally available both to read (or parse) xml and to carry out more sophisticated processing.

In order to be parsed correctly, the text content of an xml file must be well-formed. That is, it must conform to the general syntax of xml. In addition to being syntactically well-formed, valid xml conforms to the structural requirements of a particular xml application. It is not necessary for xml files to be valid since, for example, the structural requirements may not have been specified.

There are two approaches to processing xml files. The first is serial. Serial processing makes no assumptions about the size of the file in relation to the computing memory that is available. In essence, the file is read line by line (although in xml an arrangement of the file into lines is not significant). The other mode of processing xml files relies on holding a complete representation of the data in memory. This mode, which is very fast for small files, is not appropriate when considering arbitrarily large files.

4.2 Graph exchange formats

A common feature of the general approach to describing and exchanging graphs is that the graph is represented serially as a list of nodes or vertices and a list of arcs or edges. Each node has a unique identifier and may be assigned a label. Each arc, for example, is then described as an ordered pair of node identifiers. Additional descriptive features list the attributes of each node, arc or edge, for example the node co-ordinates in a particular visualisation rendering.
A simple general graph description could therefore resemble:

node = 1
node = 2
node = 3
node = 4
arc = (1, 2)
arc = (1, 3)
arc = (2, 4)
arc = (4, 3)

Examples of general graph formats include Pajek's ".net" file format and GraphML (previously GraphXML), which uses xml. The above example, when converted to Pajek .net format (which requires the number of nodes to be stated at the outset), would be:

*Vertices 4
1 "label for node 1"
2 "label for node 2"
3 "label for node 3"
4 "label for node 4"
*Arcs
1 2
1 3
2 4
4 3

4.3 Outline of a Web-crawl graph exchange format

The proposed Web-crawl graph exchange format encapsulates the graph description within a description of the particular crawl which generated the data. The blinker (Web link crawler) xml application illustrates this:

<blinker>
  <header>
    <!-- crawl specification/description -->
  </header>
  <crawl>
    <!-- Web-page graph description -->
  </crawl>
  <trailer>
    <!-- Web-page graph summary information -->
  </trailer>
</blinker>

The Web-page graph crawl element is wrapped with header and trailer elements. The header element contains a description that is particular to the crawler and the crawl that collected the Web-page graph, while the trailer element contains a summary of the Web-page graph. In addition to xml elements, such as blinker, containing other xml elements as illustrated, xml elements may also have their own element attributes. This feature is exploited as described in the next Section.

4.3.1 The crawl element

The crawl element describes the Web-page graph as a list of nodes, each of which is described in a node element. Each node in the graph is uniquely identified using an "id" element attribute.
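Returning briefly to the Pajek .net example above, the conversion from a node-and-arc description is mechanical. The following sketch shows one way to serialise it (Python; the function name and the representation of the graph as a list of labels plus a list of arc pairs are assumptions for illustration, not part of blinker):

```python
def to_pajek(labels, arcs):
    """Serialise a graph to Pajek .net text.

    labels: list of node label strings; node i+1 receives labels[i].
    arcs:   list of (source, target) pairs of 1-based node numbers.
    """
    lines = [f"*Vertices {len(labels)}"]
    for i, label in enumerate(labels, start=1):
        lines.append(f'{i} "{label}"')
    lines.append("*Arcs")
    for src, dst in arcs:
        lines.append(f"{src} {dst}")
    return "\n".join(lines)

net = to_pajek(
    ["label for node 1", "label for node 2",
     "label for node 3", "label for node 4"],
    [(1, 2), (1, 3), (2, 4), (4, 3)],
)
print(net)
```

Because *Vertices must state the node count at the outset, the complete node list is needed before any output is written; this is one reason why converting a serially produced crawl file to .net format requires more than a single pass.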
In outline, then, a small Web-page graph is described by:

<crawl>
  <node id="_1">
    <!-- description of node -->
  </node>
  <node id="_2">
    <!-- description of node -->
  </node>
  <node id="_3">
    <!-- description of node -->
  </node>
  <node id="_4">
    <!-- description of node -->
  </node>
</crawl>

In principle the arcs should be described separately as ordered pairs, as noted in Section 4.2. However, for processing simplicity, the arcs are represented just by the list of outlinks in respect of each node. Hence the xml file can be produced concurrently as the crawler proceeds, rather than being generated only retrospectively after the crawl has concluded. The list of outlinks within the node element is a urlReferences element, which may not always be present. Both the node and each outlink are labelled by a regularised version of the text of the relevant locating url. These are contained within label elements. It should be noted that every arc terminal node mentioned, that is every outlink label, must also occur (just once) as a node element label. Thus the description of each node within the node element is represented as:

<node id="_n">
  <label><!-- regularised form of node url --></label>
  <urlReferences>
    <label><!-- regularised form of outlink url --></label>
    <label><!-- regularised form of outlink url --></label>
    <label><!-- regularised form of outlink url --></label>
  </urlReferences>
</node>

In pursuit of Goal one, other than the textual regularisation mentioned (discussed in Section 4.4), no other data modification is proposed. Hence, for example, loops (outlinks that are self-referring) and multiple arcs (two or more outlinks having the same label) are included. This may be relevant, say, when determining the frequency distribution of hyperlinks per page employed by authors.
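Because each node element is self-contained, the crawler can write it as soon as the corresponding page has been fetched. A minimal sketch of such serial output follows (Python; the function, its arguments and the query-string url are hypothetical illustrations). The xml reserved characters in the url-like text are escaped, so a malformed url cannot make the file ill-formed:

```python
from xml.sax.saxutils import escape

def node_element(node_id, url, outlinks):
    """Emit one self-contained node element as url-labelled text.
    escape() converts the xml reserved characters (&, <, >) to
    entities so arbitrary url-like text stays well-formed."""
    parts = [f'<node id="_{node_id}">',
             f"  <label>{escape(url)}</label>"]
    if outlinks:  # the urlReferences element may be absent
        parts.append("  <urlReferences>")
        parts.extend(f"    <label>{escape(link)}</label>" for link in outlinks)
        parts.append("  </urlReferences>")
    parts.append("</node>")
    return "\n".join(parts)

xml_text = node_element(1, "http://www.heaven.li/",
                        ["http://www.heaven.li/private/",
                         "http://www.heaven.li/x?a=1&b=2"])
print(xml_text)
```

The same escaping is applied to the type and value texts of the attribute elements described next.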
The node element is completed by including all the HTTP header information that is available and any other descriptive information that is computed by the crawler. The HTTP information consists of a collection of "attributes". Unfortunately it is not possible to include these directly as node element attributes because of the uncontrolled presence of arbitrary characters, which would have the consequence that the xml is not well-formed. Node attributes are therefore included within individual attribute elements as a "type" and "value" pair. For example:

<attribute type="status-line" value="401 Authorization Required" />

A real example is illustrated in the Appendix.

4.3.2 The header element

The purpose of the header element is to contain the crawler and crawl specific information that is needed to qualify the Web-page graph described in the crawl element. This qualifying information is described using the configuration, crawlSpace and crawlSeed elements. In addition, the header element attributes give the start time of the crawl and the operational name of the crawler. (Note that the crawler name need not be the same as the name of the user agent.) The configuration element is not discussed in detail here, but an example of its usage is given in the Appendix.

The crawlSpace element comprises a collection of subdomains and "websites" over which the crawler is permitted to operate and from which data may be collected. Since neither subdomains nor "websites" are urls, label elements are not used.
A simple example is shown in the Appendix, while a more complex example is:

<crawlSpace>
  <subDomain>immunologie.de</subDomain>
  <subDomain>tu-dresden.de</subDomain>
  <subDomain>drfz.de</subDomain>
  <website>www.charite.de/ch/institute/</website>
  <subDomain>ukaachen.de</subDomain>
  <website>www.zoologie.uni-bonn.de/Immunbiologie/</website>
  <subDomain>ruhr-uni-bochum.de</subDomain>
  <subDomain>biozentrum.unibas.ch</subDomain>
  <website>www.fz-borstel.de/</website>
  <website>www.medizin.fu-berlin.de/immu/</website>
  <website>www.charite.de/immunologie/</website>
  <website>www.mpiib-berlin.mpg.de/</website>
  <subDomain>rki.de</subDomain>
  <website>www.charite.de/ch/rheuma/</website>
</crawlSpace>

The crawlSeed element comprises the collection of urls from which the crawler started to collect data. The regularised text for each of these is contained within a label element. (The crawlSeed element is thus structurally equivalent to the urlReferences element but is distinguished in order to make clear its separate function.)

4.3.3 The trailer element

The trailer element both complements the header element and contains a summary of the crawl. It complements the header in that its element attributes record the time when the crawler finished and the duration of the crawl. The summary provided in the example shown in the Appendix analyses the total number of nodes in the graph by their HTTP status-line attribute. This information is used to manage the crawler and is generated by the crawler. In principle a wide variety of other information could be summarised. Note that such a summary is always also available by later analysing the node elements of the graph.

4.4 Text regularisation

4.4.1 Node label text regularisation (or normalisation)

Each label element contained within either the crawlSeed or urlReferences elements contains a textual representation of a (possibly malformed) url.
It is desirable, at least from an ethical crawler perspective, that the multiple possible textual representations of the same url be regularised into some standard form so that equivalent texts can be identified. This enables the crawler to tread lightly and avoid requesting the same url from a server more than once. Url fragments are therefore discarded from the url text as part of the regularisation process. Standard algorithms are available, for example Burke (2002). Note that these must be able to regularise the text of malformed urls (for example, where there are invalid host name characters) in addition to normalising well-formed urls.

The text of a malformed url can contain arbitrary characters. Hence difficulties may be encountered when using alternative file formats that make use of special characters, including control characters, as format separators. An advantage of using xml in conjunction with url-like text is that the reserved html characters are also xml reserved characters. Hence malformed urls cannot corrupt, either by accident or deliberately, the xml application.

4.4.2 HTTP header text regularisation

Both the HTTP header description and the text that is assigned to it can contain arbitrary characters. In particular these may include xml reserved characters. In order to safely encode this text, unsafe characters can be automatically converted to their xml entities. For example, " becomes &quot;. xml parsers routinely code or decode such entities as required. The text can then be safely assigned to an element attribute as its value. Since the text of HTTP headers occurs as colon-separated type: value pairs, corresponding "type" and "value" element attribute pairs are used within each attribute element.

4.5 Serialisation

The blinker xml application which describes a Web-crawl graph is produced serially as the crawl proceeds. The blinker xml file may thus be arbitrarily large. In contrast, generating a Web-page graph retrospectively may be limited by the computing resource that is available.
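Serial production has a natural counterpart in serial consumption. As a sketch (Python, using the standard library's event-driven parser; the sample graph is a hypothetical abbreviation of the Appendix example), a single pass can tally nodes by their status-line attribute in constant memory:

```python
import io
from collections import Counter
import xml.etree.ElementTree as ET

def status_totals(stream):
    """Single pass through a blinker file: tally node elements by
    their HTTP status-line attribute (the same summary that the
    trailer element records). Each node is cleared once analysed,
    so memory use does not grow with the size of the graph."""
    totals = Counter()
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "node":
            for attr in elem.findall("attribute"):
                if attr.get("type") == "status-line":
                    totals[attr.get("value")] += 1
            elem.clear()  # discard the node's contents once analysed
    return totals

# Hypothetical, heavily abbreviated version of the Appendix graph.
sample = io.BytesIO(b"""<blinker><crawl>
<node id="_1"><attribute type="status-line" value="200 OK"/></node>
<node id="_2"><attribute type="status-line" value="401 Authorization Required"/></node>
<node id="_3"><attribute type="status-line" value="200 OK"/></node>
</crawl></blinker>""")
totals = status_totals(sample)
print(totals["200 OK"], totals["401 Authorization Required"])  # 2 1
```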
Well-formed xml applications can be processed serially. Syntactically, xml elements may be nested but they may not overlap. This means that, for example, in blinker each node element is self-contained and can be processed in isolation. It can also be expected that every label element that appears is regularised and has a corresponding node element with a unique identifier. Therefore, for example, a single pass through the file analysing each node element in turn is a nodal analysis of the whole Web-page graph described. A similar double pass of the file can be used to convert the xml file format to Pajek .net format. The xml processing to undertake the conversion uses standard serial xml tools and memory consumption remains constant. Arbitrarily large Web-crawl graphs can be processed in this way.

5. Achievement of goals

5.1 Goal one: minimise modification or distortion of the data

The blinker Web-crawl graph xml application has been tested with respect to crawls over UK, German and Spanish based Web servers. This has exposed the application to a range of non-ASCII charactersets as well as a range of malformed url-like text and HTTP headers. The xml application provides a systematic procedure for preserving the data provided by each server while not compromising the integrity of the file format. The integrity of the xml application was verified by using standard xml processing software to parse the file and to convert the Web-page graph to Pajek .net format. The .net file produced was then processed by Pajek.

5.2 Goal two: maximise accessibility

A blinker xml application file generated in a strict Unix type environment was exported to another computer where it was analysed to determine the frequency distribution of one of the node attribute parameters. The original file was analysed with respect to the same question. The pair of analyses were carried out independently by two researchers without sharing any information other than the question to be answered.
The pair of frequency distributions obtained were then processed by SPSS and shown to be identical. (A later comparison of methods revealed that in one case the xml application was parsed with a custom coded event handler that analysed each node element, while in the other case an xml-stylesheet processor had been used to extract the particular parameter values.) In principle there are no restrictions on access to the exchanged Web-crawl graphs.

5.3 Outstanding issues

The xml Web-crawl graph exchange format proposed faithfully includes any email address that is included as an outlink in a Web-page. This is in conformance with Goal one. It is recognised, however, that it is not ethically safe to collect and make available for exchange the large numbers of email addresses that may be included in a Web-crawl graph. Therefore each email address should be anonymised prior to exchange. However, this should be achieved in a way that preserves the topology of the Web-page graph.

Discussion of the Web-crawl graph exchange format has focussed on the crawl element that describes the Web-page graph. The header element is important in that it describes the crawler and qualifies the Web-page graph. As yet there is little experience or evidence on which to base any recommendations as to either the minimum or recommended composition of the header.

The exchange format has so far been tested in only two computing environments. It is desirable that the range of environments be extended to include a purely proprietary environment.

An advantage of xml and xml applications is that the format is extensible by design. The outline format presented provides a basis for extension. For example, keywords as well as outlinks may be obtained from (html) Web-pages. These data could be included as, say, a keywords element within each node element. The requirements of potential users have not yet been determined.

6.
Conclusions

The blinker xml application used to illustrate a proposed Web-crawl graph exchange format meets the criteria set. That is, the format is able to describe and exchange, comprehensively and without data loss, both the Web-crawl that was undertaken and the associated Web-page graph, including the HTTP header data provided by each Web server. The Web-crawl graph exchange format has proved to be robust when exchanged and does not require any proprietary software for access.

7. Acknowledgements

This work was supported by a grant from the Common Basis for Science, Technology and Innovation Indicators part of the Improving Human Research Potential specific programme of the Fifth Framework for Research and Technological Development of the European Commission. It is part of the WISER project (Web indicators for scientific, technological and innovation research; Contract HPV2-CT-2002-00015) (www.webindicators.org).

Appendix:

<blinker>
  <header start-time="Thu Mar 25 15:44:53 2004" name="blinker/incremental0.52">
    <configuration>
      <agent>blinker/incremental0.52</agent>
      <admin>viv.cothey@wlv.ac.uk</admin>
      <delay unit="seconds">10</delay>
      <timeout unit="seconds">60</timeout>
      <maxSize>unlimited</maxSize>
      <maxDuration unit="days">2</maxDuration>
      <maxNetworkHits>10000</maxNetworkHits>
      <maxServerHits>10000</maxServerHits>
      <refreshAge unit="days">28</refreshAge>
      <maxPathSegments>6</maxPathSegments>
      <maxDocumentsPerDirectory>1000</maxDocumentsPerDirectory>
      <includeShallowQueries>no</includeShallowQueries>
    </configuration>
    <crawlSpace>
      <subDomain>heaven.li</subDomain>
    </crawlSpace>
    <crawlSeed>
      <label>http://www.heaven.li/</label>
    </crawlSeed>
  </header>
  <crawl>
    <node id="_1">
      <attribute type="status-line" value="200 OK" />
      <attribute type="protocol" value="HTTP/1.1" />
      <attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
      <attribute type="Accept-Ranges" value="bytes" />
      <attribute type="Client-Date" value="Thu, 18 Mar 2004 08:49:15 GMT" />
      <attribute type="Date" value="Thu, 18 Mar 2004 12:07:11 GMT" />
      <attribute type="Title" value="www.heaven.li" />
      <attribute type="Connection" value="close" />
      <attribute type="Content-Length" value="289" />
      <attribute type="Last-Modified" value="Wed, 18 Jul 2001 08:10:28 GMT" />
      <attribute type="ETag" value="df27a-121-3b554474" />
      <attribute type="Content-Type" value="text/html" />
      <attribute type="method" value="GET" />
      <attribute type="md5" value="af4cc05ae7a89c3a22c6cddb5a57e3e1" />
      <attribute type="Client-Response-Num" value="1" />
      <label>http://www.heaven.li/</label>
      <urlReferences>
        <label>http://www.heaven.li/private/</label>
        <label>http://www.heaven.li/home/</label>
      </urlReferences>
    </node>
    <node id="_2">
      <attribute type="status-line" value="401 Authorization Required" />
      <attribute type="protocol" value="HTTP/1.1" />
      <attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
      <attribute type="Client-Date" value="Thu, 18 Mar 2004 08:50:01 GMT" />
      <attribute type="Date" value="Thu, 18 Mar 2004 12:07:56 GMT" />
      <attribute type="Title" value="401 Authorization Required" />
      <attribute type="X-Pad" value="avoid browser bug" />
      <attribute type="Client-Transfer-Encoding" value="chunked" />
      <attribute type="Connection" value="close" />
      <attribute type="WWW-Authenticate" value="Basic realm=&quot;Internal" />
      <attribute type="Content-Type" value="text/html; charset=iso-8859-1" />
      <attribute type="method" value="GET" />
      <attribute type="Client-Response-Num" value="1" />
      <label>http://www.heaven.li/private/</label>
    </node>
    <node id="_3">
      <attribute type="status-line" value="200 OK" />
      <attribute type="protocol" value="HTTP/1.1" />
      <attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
      <attribute type="Accept-Ranges" value="bytes" />
      <attribute type="Client-Date" value="Thu, 18 Mar 2004 08:49:36 GMT" />
      <attribute type="Date" value="Thu, 18 Mar 2004 12:07:32 GMT" />
      <attribute type="Title" value="www.heaven.li/home/" />
      <attribute type="Connection" value="close" />
      <attribute type="Content-Length" value="4717" />
      <attribute type="Last-Modified" value="Mon, 30 Jul 2001 11:54:18 GMT" />
      <attribute type="ETag" value="53b49-126d-3b654aea" />
      <attribute type="Content-Type" value="text/html" />
      <attribute type="method" value="GET" />
      <attribute type="md5" value="1bfc62f7c115adc33e5693f4d7013d66" />
      <attribute type="Client-Response-Num" value="1" />
      <label>http://www.heaven.li/home/</label>
      <urlReferences>
        <label>http://www.heaven.li/home/byrail.jpg</label>
        <label>http://www.heaven.li/home/byroad.gif</label>
      </urlReferences>
    </node>
    <node id="_4">
      <attribute type="status-line" value="200 OK" />
      <attribute type="protocol" value="HTTP/1.1" />
      <attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
      <attribute type="Accept-Ranges" value="bytes" />
      <attribute type="Client-Date" value="Thu, 18 Mar 2004 08:50:27 GMT" />
      <attribute type="Date" value="Thu, 18 Mar 2004 12:08:24 GMT" />
      <attribute type="Connection" value="close" />
      <attribute type="Content-Length" value="392625" />
      <attribute type="Last-Modified" value="Tue, 01 May 2001 12:14:27 GMT" />
      <attribute type="ETag" value="53b48-5fdb1-3aeea8a3" />
      <attribute type="Content-Type" value="image/gif" />
      <attribute type="method" value="HEAD" />
      <attribute type="Client-Response-Num" value="1" />
      <label>http://www.heaven.li/home/byroad.gif</label>
    </node>
    <node id="_5">
      <attribute type="status-line" value="200 OK" />
      <attribute type="protocol" value="HTTP/1.1" />
      <attribute type="Server" value="Apache/1.3.27 (Unix) (Red-Hat/Linux)" />
      <attribute type="Accept-Ranges" value="bytes" />
      <attribute type="Client-Date" value="Thu, 18 Mar 2004 08:50:17 GMT" />
      <attribute type="Date" value="Thu, 18 Mar 2004 12:08:14 GMT" />
      <attribute type="Connection" value="close" />
      <attribute type="Content-Length" value="87309" />
      <attribute type="Last-Modified" value="Tue, 01 May 2001 12:15:27 GMT" />
      <attribute type="ETag" value="53b47-1550d-3aeea8df" />
      <attribute type="Content-Type" value="image/jpeg" />
      <attribute type="method" value="HEAD" />
      <attribute type="Client-Response-Num" value="1" />
      <label>http://www.heaven.li/home/byrail.jpg</label>
    </node>
  </crawl>
  <trailer finish-time="Thu Mar 25 15:44:53 2004" duration="0">
    <total type="200 OK" value="4" />
    <total type="401 Authorization Required" value="1" />
  </trailer>
</blinker>

Bibliography:

Batagelj V. and Mrvar A., (2003). Pajek: analysis and visualization of large networks. In Jünger M. & Mutzel P. (eds.), Graph drawing software (pp. 77-103). London: Springer.
Burke S. M., (2002). Perl & LWP. Farnham: O'Reilly.
Gourley D. and Totty B., (2002). HTTP: the definitive guide. Farnham: O'Reilly.
Harold E. R. and Means W. S., (2002). XML in a nutshell (2nd edition). Farnham: O'Reilly.
Herman I. and Marshall M. S., (2000). GraphXML: an xml based graph interchange format. Technical report INS-R0009, Centrum voor Wiskunde en Informatica, Amsterdam.
Postel J., (ed.), (1981). RFC 793: transmission control protocol.