Interpreting logs and reports
IIPC GA 2014, Crawl engineers and operators workshop
Bert Wendland / BnF

Introduction

Job logs and reports created by Heritrix contain a lot of information, much more than is visible at first glance. Information can be obtained by extracting, filtering and evaluating certain fields of the log files. Specialised evaluation tools are available for almost everything, but it is sometimes difficult to find the right one and to adapt it to actual needs.

This presentation shows some examples of how information can be obtained by using standard unix tools. They are available by default on every unix installation and are ready to be used immediately. This brings a flexibility into the evaluation process that no specialised tool can provide. The list of examples is by no means exhaustive; it is intended to show some possibilities as inspiration for further work.

The unix tools used here are cat, grep, sort, uniq, sed, awk, wc and head, together with regular expressions and pipelines (the output of one command is used as input for the next command on the same command line).

The crawl.log used in the examples comes from a typical medium-sized job of a selective crawl. The job ran between 2014-02-24T10:26:08.273Z and 2014-03-04T16:16:30.214Z. The crawl.log contains 3,205,936 lines.

clm

The extraction of columns from log files is a basic action which is heavily used in the evaluation process. It can be realised with the awk command. Extracted columns can be rearranged in arbitrary order. They are separated by default by "white space" (one or several space or tab characters) or, with the -F option, by any other character. To facilitate daily operations, an alias "clm" (the name stands for "columns") has been created which shortens the use of the awk command:

$ awk '{print $3}'                  $ clm 3
$ awk '{print $1 $3}'               $ clm 1 3
$ awk '{print $3 $1}'               $ clm 3 1
$ awk -F ':' '{print $1 $2}'        $ clm -F ':' 1 2
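The definition of the alias itself is not reproduced on the slides. Below is a minimal sketch of how clm could be implemented as a bash function; this is an assumption, not the original BnF code, and unlike the plain awk calls shown above it prints the selected fields separated by a space (awk's default output separator).

# Hypothetical implementation of the "clm" helper as a bash function.
clm() {
    local fsopt=()                     # optional field-separator option passed on to awk
    if [ "$1" = "-F" ]; then
        fsopt=(-F "$2")
        shift 2
    fi
    local prog='{ print ' sep='' n
    for n in "$@"; do                  # build a program such as '{ print $3, $1 }'
        prog+="${sep}\$${n}"
        sep=', '
    done
    prog+=' }'
    awk "${fsopt[@]}" "$prog"
}

$ echo 'a b c' | clm 3 1
c a
$ echo 'a:b:c' | clm -F ':' 2
b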
sum_col

A perl script "sum_col" calculates the sum of the numerical values found in the first column of every line of an input data stream.

#!/usr/bin/env perl
use strict;
use warnings;

my $sum = 0;
while (<STDIN>) {
    chomp;
    my @parts = split;
    if (scalar @parts > 0) {
        # add the leading numerical value of the first column, if there is one
        if ($parts[0] =~ /^(\d+(\.\d+)?)/) {
            $sum += $1;
        }
    }
}
print $sum . "\n";

avg URI fetch duration

The crawl.log holds in its 9th column a timestamp indicating when a network fetch was begun, followed by the duration of the fetch in milliseconds, separated from the begin time by a "+" character.

2014-02-24T10:26:09.345Z 200 3339 http://www.facebook.com/robots.txt P http://www.facebook.com/antoine.adelisse text/plain #042 20140224102608798+543 sha1:EZ6YOU7YB3VVAGOD4PPMQGG3VKZN42D2 http://www.facebook.com/antoine.adelisse content-size:3534

One can extract the duration of all fetches, optionally limited via the 4th field (the URI of the downloaded document) to a particular host or domain, to compute the average URI fetch duration of a job.

$ cat crawl.log | clm 9 | clm -F '+' 2 | sum_col
2582697481
$ cat crawl.log | clm 9 | grep -cE '[0-9]+\+[0-9]+'
3197842

2,582,697,481 / 3,197,842 = 805.6 [ms]

Limited to www.facebook.com:

$ cat crawl.log | clm 4 9 | grep www.facebook.com | clm 2 | \
  clm -F '+' 2 | sum_col
70041498
$ cat crawl.log | clm 4 9 | grep www.facebook.com | wc -l
72825

70,041,498 / 72,825 = 952.4 [ms]
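The final division can also be done inside the pipeline. The following one-liner is not on the slides; it is a sketch that sums the durations and counts the fetches in a single awk pass, assuming the 9th column contains either "begintime+duration" or "-". Applied to this crawl.log, it should reproduce the figures computed step by step above:

$ cat crawl.log | clm 9 | awk -F '+' 'NF == 2 { sum += $2; n++ }
    END { if (n) printf "%d fetches, %.1f ms on average\n", n, sum / n }'
3197842 fetches, 805.6 ms on average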
nb of images per host

To know the number of images fetched per host, the relevant image MIME types can first be looked up in the mimetype report:

$ grep -i image mimetype-report.txt
1139893 33332978181 image/jpeg
 296992  7276505356 image/png
 110204   380490180 image/gif
  79039   663127491 image/jpg

$ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c
      1 -gof.pagesperso-orange.fr
      4 0.academia-assets.com
   6396 0.gravatar.com
     10 0.media.collegehumor.cvcdn.com
      4 0.static.collegehumor.cvcdn.com
[...]

The top 5:

$ cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c | \
  sort -nr | head -n 5
  89030 upload.wikimedia.org
  78764 fr.cdn.v5.futura-sciences.com
  44992 pbs.twimg.com
  38781 media.meltybuzz.fr
  36998 s-www.ledauphine.com

nb of images per seed

Very often, images embedded in a web page are not fetched from the same host as the page itself. So, instead of counting the number of images per host, it is more interesting to count the number of images collected per referring site or, even better, per seed. To make this possible, the "source-tag-seeds" option, which is off by default, must be activated in the order.xml:

<boolean name="source-tag-seeds">true</boolean>

The crawl.log will then contain in its 11th column a tag for the seed from which the URI being treated originated.

2014-02-24T10:28:10.291Z 200 7802 https://fbcdn-sphotos-f-a.akamaihd.net/hphotos-ak-frc1/t1/s261x260/1621759_10202523932754072_n.jpg X https://www.facebook.com/brice.roger1 image/jpeg #190 20140224102801374+8902 sha1:DABRRLQPPAKH3QOW7MHGSMSIDDDDRY7D https://www.facebook.com/brice.roger1 content-size:8035,3t

$ cat crawl.log | clm 11 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
  clm 1 | sort | uniq -c | sort -nr | head -n 10
 110510 http://fr.oakley.com/sports
  86479 http://www.futura-sciences.com/magazines/high-tech/infos/actu
  74621 http://fr.wikipedia.org/wiki/Sotchi_2014
  74007 http://www.linternaute.com/sport/jeux-olympiques/
  55742 http://www.tuxboard.com/les-tenues-officielles-des-j-o-de-sotchi/
  49135 http://www.meltybuzz.fr/sochi-2014/
  42308 http://www.legorafi.fr/tag/jeux-olympiques/
  39503 http://www.madmoizelle.com/sexisme-france-televisions-jeux-olymp
  38426 http://www.ledauphine.com/sport/sports-hautes-alpes
  32830 https://www.facebook.com/geoffrey.mattei

timeouts per time period

The -2 fetch status code of a URI stands for "HTTP connect failed".

2014-02-24T17:48:22.350Z -2 - http://buy.itunes.apple.com/ EX http://s.skimresources.com/js/725X1342.skimlinks.js no-type #185 - - http://wikileaksactu.wordpress.com/2013/04/07/wikileaks-et-les-jeux-olympiques-de-sotchide-2014/ le:SocketTimeoutException@HTTP,30t

The cause may lie on the server side but also in the network. An increase in the number of -2 codes during a certain time period may indicate network problems, which can then be investigated further. This is the number of -2 codes per hour:

$ grep 'Z -2 ' crawl.log | clm -F : 1 | uniq -c
      9 2014-02-24T17
     34 2014-02-24T18
     17 2014-02-24T19
     19 2014-02-24T20
     10 2014-02-24T21
     26 2014-02-24T22
     36 2014-02-24T23
    182 2014-02-25T00
      3 2014-02-25T01
      4 2014-02-25T02
      3 2014-02-25T03
      2 2014-02-25T04
      1 2014-02-25T05

It is more meaningful to extract the -2 codes from the local-errors.log, as every URI is retried several times before it arrives as "given up" in the crawl.log:

$ grep 'Z -2' local-errors.log | clm -F : 1 | uniq -c
    376 2014-02-24T15
    586 2014-02-24T16
   1320 2014-02-24T17
   1234 2014-02-24T18
   1162 2014-02-24T19
   1101 2014-02-24T20
    892 2014-02-24T21
   1008 2014-02-24T22
    928 2014-02-24T23
    999 2014-02-25T00

The number of -2 codes can also be counted per minute:

$ grep 'Z -2' local-errors.log | grep ^2014-02-24T17 | \
  clm -F : 1 2 | uniq -c
     12 2014-02-24T17 06
     19 2014-02-24T17 07
     12 2014-02-24T17 08
    157 2014-02-24T17 09
     12 2014-02-24T17 10
     15 2014-02-24T17 11
      6 2014-02-24T17 12
      5 2014-02-24T17 13

If all running instances of Heritrix use a common workspace to store their logs, an extraction from all the log files is possible to better detect peaks:

$ cat /.../jobs/*/logs/crawl.log | grep 'Z -2' | clm -F : 1 | uniq -c
    199 2012-10-15T11
    446 2012-10-15T12
    793 2012-10-15T13
    970 2012-10-15T14
    677 2012-10-15T15
    632 2012-10-15T16
    553 2012-10-15T17
    606 2012-10-15T18
    801 2012-10-15T19
    754 2012-10-15T20
    701 2012-10-15T21
    705 2012-10-15T22
    539 2012-10-15T23
    508 2012-10-16T00
    899 2012-10-16T01
   2099 2012-10-16T02
   1298 2012-10-16T03
   1064 2012-10-16T04
    929 2012-10-16T05
    983 2012-10-16T06
   1274 2012-10-16T07
   2131 2012-10-16T08
   2639 2012-10-16T09
   2288 2012-10-16T10
   2596 2012-10-16T11
   1950 2012-10-16T12
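To spot the peak hours directly instead of scanning the list by eye, the hourly counts can be sorted numerically and the highest ones kept. This variant is not on the slides, but it only combines commands already used above; applied to the listing just shown, it would yield:

$ cat /.../jobs/*/logs/crawl.log | grep 'Z -2' | clm -F : 1 | uniq -c | \
  sort -nr | head -n 5
   2639 2012-10-16T09
   2596 2012-10-16T11
   2288 2012-10-16T10
   2131 2012-10-16T08
   2099 2012-10-16T02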
detect crawler traps via URI patterns

The number of URIs fetched from www.clermont-filmfest.com (first column of the hosts report, next to the number of bytes) is higher than expected:

$ head hosts-report.txt
163438 16114773 dns: 0 13 163438 16114773 0 0 0 0
138317 15279274975 www.clermont-filmfest.com 0 125252 138317 15279274975
133786 6131883679 www.ffsa.org 0 314002 133786 6131883679 0 0 0 0
133757 6411525222 www.actu-environnement.com 0 18990 133757 6411525222
[...]

Fetched URLs are extracted (4th column of the crawl.log), sorted and written into a new file:

$ cat crawl.log | clm 4 | grep -v dns: | sort > crawl.log.4-sort

Crawler traps can then be detected by looking at the list:

$ grep www.clermont-filmfest.com crawl.log.4-sort
http://www.clermont-filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/merc9/6gd.jpg
http://www.clermont-filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/cloture/1.jpg
[...]
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=15
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=43
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=63

Another approach to detecting crawler traps is to extract URIs with a high number of URL parameters (separated by "&"). To find the highest number of URL parameters:

$ cat crawl.log | clm 4 | grep '&' | sed -r 's/[^&]//g' | wc -L
68

To extract URIs having 20 or more URL parameters:

$ grep -E '(\&.*){20,}' crawl.log | grep --color=always '\&'
http://arvise.grenoble-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=annuairesup&PROC=SAISIE_DEFAULTSTRUCTUREKSUP&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
http://arvise.grenoble-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=cataloguelien&PROC=SAISIE_LIEN&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
http://arvise.grenoble-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=core&PROC=SAISIE_NEWSGW&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y

URL parameters can be sorted and counted to detect repetitions more easily:

$ cat crawl.log | clm 4 | \
  grep 'http://cnb.avocat.fr/index.php?start=0&numero=1110' | \
  grep --color=always '\&'
http://cnb.avocat.fr/index.php?start=0&numero=1110&pre&id_param=1330690&java=false&ajax=true&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&ajax=true&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&ajax=true&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&ajax=true&show=liste_articles&numero=1110&preaction=mymodule&id_param=1330690&java=false&ajax=true&show=liste_articles&numero=1110

$ cat crawl.log | clm 4 | \
  grep 'http://cnb.avocat.fr/index.php?start=0&numero=1110' | \
  sed 's/&/\n/g' | sort | uniq -c
      5 ajax=true
      1 http://cnb.avocat.fr/index.php?start=0
      5 id_param=1330690
      5 java=false
      6 numero=1110
      4 pre
      1 preaction=mymodule
      5 show=liste_articles

arcfiles-report.txt

Heritrix does not provide a report for the ARC files written by a completed crawl job. If the option "org.archive.io.arc.ARCWriter.level" in the heritrix.properties file is set to INFO, Heritrix logs the opening and closing of ARC files in the heritrix.out file. This information can then be transformed into an arcfiles report. The same applies to WARC files.

$ grep "Opened.*arc.gz" heritrix.out
2014-02-24 10:26:08.254 INFO thread-14 org.archive.io.WriterPoolMember.createFile() Opened /dlweb/data/NAS510/jobs/current/g110high/9340_1393236979459/arcs/9340-32-20140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz.open

$ grep "Closed.*arc.gz" heritrix.out
2014-02-24 10:30:40.822 INFO thread-171 org.archive.io.WriterPoolMember.close() Closed /dlweb/data/NAS510/jobs/current/g110high/9340_1393236979459/arcs/9340-32-20140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz, size 100010570

$ cat arcfiles-report.txt
[ARCFILE] [Opened] [Closed] [Size]
9340-32-20140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz 2014-02-24T10:26:08.254Z 2014-02-24T10:30:40.822Z 100010570
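The slides do not show how the report itself is produced. A minimal awk sketch, assuming the heritrix.out lines look exactly like the two examples above (date, time, level, thread, method, Opened/Closed, path, and "size <bytes>" on the Closed line), might be:

$ awk '
    BEGIN { print "[ARCFILE] [Opened] [Closed] [Size]" }
    $6 == "Opened" && $7 ~ /\.arc\.gz\.open$/ {
        f = $7; sub(/\.open$/, "", f); sub(/.*\//, "", f)   # file name without ".open" suffix and path
        opened[f] = $1 "T" $2 "Z"                           # remember the opening timestamp
    }
    $6 == "Closed" && $7 ~ /\.arc\.gz,$/ {
        f = $7; sub(/,$/, "", f); sub(/.*\//, "", f)        # strip trailing comma and path
        print f, opened[f], $1 "T" $2 "Z", $9               # file name, opened, closed, size
    }
' heritrix.out > arcfiles-report.txt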