Interpreting Logs and Reports

IIPC GA 2014
Crawl engineers and operators workshop
Bert Wendland/BnF
Introduction
Job logs and reports created by Heritrix contain a lot of information, much more than is
visible at first glance. Information can be obtained by extracting, filtering and evaluating
certain fields of the log files. Specialised evaluation tools are available for almost every
purpose, but it is sometimes difficult to find the right one and to adapt it to actual needs.
This presentation shows some examples of how information can be obtained using standard
unix tools. They are available by default on every unix installation and are ready to be used
immediately. This brings a flexibility to the evaluation process that no specialised tool can
provide. The list of examples is by no means exhaustive; it is intended to show some
possibilities as inspiration for further work.
The unix tools used here are: cat, grep, sort, uniq, sed, awk, wc, head, regular expressions, and
pipelining (i.e. the output data of one command is used as input data for the next command in
the same command line).
The crawl.log used in the examples comes from a typical medium-sized job of a selective crawl.
The job ran between 2014-02-24T10:26:08.273Z and 2014-03-04T16:16:30.214Z. The crawl.log
contains 3,205,936 lines.
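As a first, trivial example, this line count can be verified directly with wc (a sketch; the path to the crawl.log depends on the job directory):
$ wc -l crawl.log
3205936 crawl.log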
clm
The extraction of columns from log files is a basic action which is heavily used in the evaluation
process. It can be realised with the awk command. Extracted columns can be rearranged in
arbitrary order. They are separated by default by "white space" (one or several space or tab
characters) or, with the -F option, by any other character.
To facilitate daily operations, an alias "clm" (the name stands for "columns") has been
created which shortens the use of the awk command.
$ awk '{print $3}'               →  clm 3
$ awk '{print $1 $3}'            →  clm 1 3
$ awk '{print $3 $1}'            →  clm 3 1
$ awk -F ':' '{print $1 $2}'     →  clm -F ':' 1 2
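The definition of the alias itself is not shown here; below is a minimal sketch of how such a helper could be written as a bash function (an assumption, not necessarily the BnF implementation; it prints the selected columns separated by a space):
# Hypothetical "clm" helper: forwards an optional -F <separator> to awk and
# builds the corresponding '{print ...}' program from the given column numbers.
clm() {
    local fsopt=()
    if [ "$1" = "-F" ]; then
        fsopt=(-F "$2")
        shift 2
    fi
    local fields="" col
    for col in "$@"; do
        fields+="${fields:+, }\$$col"
    done
    awk "${fsopt[@]}" "{print $fields}"
}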
sum_col
A perl script “sum_col” calculates the sum of all numerical values that can be found in the first
column of every line of an input data stream.
#!/usr/bin/env perl
use strict;
use warnings;

# Sum up the numerical values found at the beginning of the first column
# of every line read from standard input.
my $sum = 0;
while (<STDIN>) {
    chomp;
    my @parts = split;
    if (scalar @parts > 0) {
        # Accept integers and decimal numbers at the start of the field.
        if ($parts[0] =~ /^(\d+(\.\d+)?)/) {
            $sum += $1;
        }
    }
}
print $sum . "\n";
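A quick usage sketch with hypothetical input, assuming the script is executable and on the PATH:
$ printf '3 foo\n4.5 bar\nbaz\n' | sum_col
7.5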
avg URI fetch duration
The crawl.log holds in its 9th column a timestamp indicating when a network fetch was begun and
the millisecond duration of the fetch, separated from the begin-time by a “+” character.
2014-02-24T10:26:09.345Z 200 3339 http://www.facebook.com/robots.txt P http://www.facebook.com/antoine.adelisse text/plain #042 20140224102608798+543 sha1:EZ6YOU7YB3VVAGOD4PPMQGG3VKZN42D2 http://www.facebook.com/antoine.adelisse content-size:3534
One can extract the durations of all the fetches, optionally limited via the 4th field (the URI of
the downloaded document) to a particular host or domain, to compute the average URI fetch
duration of a job.
$cat crawl.log | clm 9 | clm -F '+' 2 | sum_col
2582697481
$cat crawl.log | clm 9 | grep -cE '[0-9]+\+[0-9]+'
3197842
→ 2,582,697,481 / 3,197,842 = 805.6 [ms]
$cat crawl.log | clm 4 9 | grep www.facebook.com | clm 2 | \
clm -F '+' 2 | sum_col
70041498
$cat crawl.log | clm 4 9 | grep www.facebook.com | wc -l
72825
→ 70,041,498 / 72,825 = 952.4 [ms]
nb of images per host
To know the number of images fetched per host:
$grep -i image mimetype-report.txt
1139893 33332978181 image/jpeg
 296992  7276505356 image/png
 110204   380490180 image/gif
  79039   663127491 image/jpg
$cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c
      1 -gof.pagesperso-orange.fr
      4 0.academia-assets.com
   6396 0.gravatar.com
     10 0.media.collegehumor.cvcdn.com
      4 0.static.collegehumor.cvcdn.com
The top 5:
$cat crawl.log | clm 4 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
clm -F / 3 | sed -r 's/^www[0-9a-z]?\.//' | sort | uniq -c | \
sort -nr | head -n 5
  89030 upload.wikimedia.org
  78764 fr.cdn.v5.futura-sciences.com
  44992 pbs.twimg.com
  38781 media.meltybuzz.fr
  36998 s-www.ledauphine.com
nb of images per seed
Very often, images embedded into a web page are not fetched from the same host as the page
itself. So, instead of counting the number of images per host, it is more interesting to have the
number of images collected per referring site or, even better, per seed. To make this possible,
the "source-tag-seeds" option, which is off by default, must be activated in the order.xml:
<boolean name="source-tag-seeds">true</boolean>
The crawl.log will then contain in its 11th column a tag identifying the seed from which the URI
being processed originated.
2014-02-24T10:28:10.291Z 200 7802 https://fbcdn-sphotos-f-a.akamaihd.net/hphotos-ak-frc1/t1/s261x260/1621759_10202523932754072_n.jpg X https://www.facebook.com/brice.roger1 image/jpeg #190 20140224102801374+8902 sha1:DABRRLQPPAKH3QOW7MHGSMSIDDDDRY7D https://www.facebook.com/brice.roger1 content-size:8035,3t
$cat crawl.log | clm 11 7 | grep -E ' image/(jpeg|png|gif|jpg)$' | \
clm 1 | sort | uniq -c | sort -nr | head -n 10
 110510 http://fr.oakley.com/sports
  86479 http://www.futura-sciences.com/magazines/high-tech/infos/actu
  74621 http://fr.wikipedia.org/wiki/Sotchi_2014
  74007 http://www.linternaute.com/sport/jeux-olympiques/
  55742 http://www.tuxboard.com/les-tenues-officielles-des-j-o-de-sotchi/
  49135 http://www.meltybuzz.fr/sochi-2014/
  42308 http://www.legorafi.fr/tag/jeux-olympiques/
  39503 http://www.madmoizelle.com/sexisme-france-televisions-jeux-olymp
  38426 http://www.ledauphine.com/sport/sports-hautes-alpes
  32830 https://www.facebook.com/geoffrey.mattei
timeouts per time period
The -2 fetch status code of a URI stands for “HTTP connect failed”.
2014-02-24T17:48:22.350Z -2 - http://buy.itunes.apple.com/ EX http://s.skimresources.com/js/725X1342.skimlinks.js no-type #185 - - http://wikileaksactu.wordpress.com/2013/04/07/wikileaks-et-les-jeux-olympiques-de-sotchide-2014/ le:SocketTimeoutException@HTTP,30t
Its cause may lie on the server side, but also in the network. An increase in the number of -2
codes within a certain time period might indicate network problems, which can then be further
investigated.
This is the number of -2 codes per hour:
$grep 'Z   -2 ' crawl.log | clm -F : 1 | uniq -c
      9 2014-02-24T17
     34 2014-02-24T18
     17 2014-02-24T19
     19 2014-02-24T20
     10 2014-02-24T21
     26 2014-02-24T22
     36 2014-02-24T23
    182 2014-02-25T00
      3 2014-02-25T01
      4 2014-02-25T02
      3 2014-02-25T03
      2 2014-02-25T04
      1 2014-02-25T05
timeouts per time period
It is more meaningful to extract the -2 codes from the local-errors.log, as every URI is retried
several times before it arrives as "given up" in the crawl.log:
$ grep 'Z   -2' local-errors.log | clm -F : 1 | uniq -c
    376 2014-02-24T15
    586 2014-02-24T16
   1320 2014-02-24T17
   1234 2014-02-24T18
   1162 2014-02-24T19
   1101 2014-02-24T20
    892 2014-02-24T21
   1008 2014-02-24T22
    928 2014-02-24T23
    999 2014-02-25T00
The number of -2 codes can also be counted per minute:
$ grep 'Z   -2' local-errors.log | grep ^2014-02-24T17 | \
  clm -F : 1 2 | uniq -c
     12 2014-02-24T17 06
     19 2014-02-24T17 07
     12 2014-02-24T17 08
    157 2014-02-24T17 09
     12 2014-02-24T17 10
     15 2014-02-24T17 11
      6 2014-02-24T17 12
      5 2014-02-24T17 13
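From here one can drill down further, for example to see which hosts caused the timeouts in the spike minute 2014-02-24T17:09. A sketch, assuming the local-errors.log uses the same column layout as the crawl.log (URI in the 4th column):
$ grep 'Z   -2' local-errors.log | grep ^2014-02-24T17:09 | \
  clm 4 | clm -F / 3 | sort | uniq -c | sort -nr | head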
timeouts per time period
If all the running instances of Heritrix use a common workspace to store their logs, an extraction
from all the log files is possible to better detect peaks:
$cat /.../jobs/*/logs/crawl.log | grep 'Z   -2' | clm -F : 1 | uniq -c
    199 2012-10-15T11
    446 2012-10-15T12
    793 2012-10-15T13
    970 2012-10-15T14
    677 2012-10-15T15
    632 2012-10-15T16
    553 2012-10-15T17
    606 2012-10-15T18
    801 2012-10-15T19
    754 2012-10-15T20
    701 2012-10-15T21
    705 2012-10-15T22
    539 2012-10-15T23
    508 2012-10-16T00
    899 2012-10-16T01
   2099 2012-10-16T02
   1298 2012-10-16T03
   1064 2012-10-16T04
    929 2012-10-16T05
    983 2012-10-16T06
   1274 2012-10-16T07
   2131 2012-10-16T08
   2639 2012-10-16T09
   2288 2012-10-16T10
   2596 2012-10-16T11
   1950 2012-10-16T12
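One caveat: uniq -c only counts adjacent identical lines, so when several crawl.logs are concatenated the extracted hours are not necessarily grouped. A more robust sketch simply sorts the extracted field before counting:
$cat /.../jobs/*/logs/crawl.log | grep 'Z   -2' | clm -F : 1 | sort | uniq -c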
detect crawler traps
via URI patterns
The number of URIs fetched from www.clermont-filmfest.com is higher than expected:
$head hosts-report.txt
163438 16114773 dns: 0 13 163438 16114773 0 0 0 0
138317 15279274975 www.clermont-filmfest.com 0 125252 138317 15279274975
133786 6131883679 www.ffsa.org 0 314002 133786 6131883679 0 0 0 0
133757 6411525222 www.actu-environnement.com 0 18990 133757 6411525222
Fetched URLs are extracted (4th column), sorted and written into a new file:
$cat crawl.log | clm 4 | grep -v dns: | sort > crawl.log.4-sort
Crawler traps can then be detected by looking at the list:
$grep www.clermont-filmfest.com crawl.log.4-sort
http://www.clermont-filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/merc9/6gd.jpg
http://www.clermont-filmfest.com//04_centredoc/01_archives/autour_fest/galerie/2011/04_centredoc/01_archives/autour_fest/galerie/2011/commun/cloture/1.jpg
[...]
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=15
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=43
http://www.clermont-filmfest.com/index.php?m=90&d=03&d=63&d=43&d=63&d=15&d=03&d=15&d=03&d=63
detect crawler traps
via URI patterns
Another approach to detecting crawler traps is to extract URIs that have a high number of URL
parameters (separated by "&").
To find the highest number of URL parameters:
$cat crawl.log | clm 4 | grep '&' | sed -r 's/[^&]//g' | wc -L
68
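As a complement, the URI that holds this maximum can be printed with a short awk pass; a sketch in which gsub() is only used to count the "&" characters in the 4th column:
$ awk '{ n = gsub(/&/, "&", $4); if (n > max) { max = n; uri = $4 } }
      END { print max, uri }' crawl.log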
To extract URIs having 20 or more URL parameters:
$grep -E '(\&.*){20,}' crawl.log | grep --color=always '\&'
http://arvise.grenoble-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=annuairesup&PR
OC=SAISIE_DEFAULTSTRUCTUREKSUP&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&ST
ATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&ST
ATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&ST
ATS=Y&STATS=Y&STATS=Y
http://arvise.grenoble-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=cataloguelien&
PROC=SAISIE_LIEN&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y&
STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&
STATS=Y&STATS=Y&STATS=Y&STATS=Y
http://arvise.grenoble-inp.fr/servlet/com.jsbsoft.jtf.core.SG?EXT=core&PROC=SAIS
IE_NEWSGW&ACTION=RECHERCHER&TOOLBOX=LIEN_REQUETE&STATS=Y&STATS=Y&STATS=Y&STATS=Y
&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y&STATS=Y
&STATS=Y&STATS=Y
detect crawler traps
via URI patterns
URL parameters can be sorted and counted to detect repetitions more easily:
$cat crawl.log | clm 4 | \
grep 'http://cnb.avocat.fr/index.php?start=0&numero=1110' | \
grep --color=always '\&'
http://cnb.avocat.fr/index.php?start=0&numero=1110&pre&id_param=1330690&java=fal
se&ajax=true&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&aja
x=true&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&ajax=true
&show=liste_articles&numero=1110&pre&id_param=1330690&java=false&ajax=true&show=
liste_articles&numero=1110&preaction=mymodule&id_param=1330690&java=false&ajax=t
rue&show=liste_articles&numero=1110
$cat crawl.log | clm 4 | \
grep 'http://cnb.avocat.fr/index.php?start=0&numero=1110' | \
sed 's/&/\n/g' | sort | uniq -c
      5 ajax=true
      1 http://cnb.avocat.fr/index.php?start=0
      5 id_param=1330690
      5 java=false
      6 numero=1110
      4 pre
      1 preaction=mymodule
      5 show=liste_articles
arcfiles-report.txt
Heritrix does not provide a report of the ARC files written by a completed crawl job. If the
option "org.archive.io.arc.ARCWriter.level" in the heritrix.properties file is set to INFO, Heritrix
will log the opening and closing of ARC files in the heritrix.out file. This information can then be
transformed into an arcfiles report. The same applies to WARC files.
$grep "Opened.*arc.gz" heritrix.out
2014-02-24 10:26:08.254 INFO thread-14
org.archive.io.WriterPoolMember.createFile() Opened
/dlweb/data/NAS510/jobs/current/g110high/9340_1393236979459/arcs/9340-3220140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz.open
$grep "Closed.*arc.gz" heritrix.out
2014-02-24 10:30:40.822 INFO thread-171 org.archive.io.WriterPoolMember.close()
Closed /dlweb/data/NAS510/jobs/current/g110high/9340_1393236979459/arcs/9340-3220140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz, size 100010570
$cat arcfiles-report.txt
[ARCFILE] [Opened] [Closed] [Size]
9340-32-20140224102608-00000-BnF_ciblee_2014_gulliver110.bnf.fr.arc.gz 2014-02-24T10:26:08.254Z 2014-02-24T10:30:40.822Z 100010570
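The transformation itself is not shown in the presentation; a minimal sketch with awk, assuming exactly the "Opened"/"Closed" line format above, producing one line per ARC file with its name, open time, close time and size:
$ grep -E '(Opened|Closed) .*arc\.gz' heritrix.out | awk '
    /Opened/ { f = $NF; sub(/\.open$/, "", f); sub(/.*\//, "", f)
               opened[f] = $1 "T" $2 "Z" }
    /Closed/ { f = $(NF-2); sub(/,$/, "", f); sub(/.*\//, "", f)
               print f, opened[f], $1 "T" $2 "Z", $NF }'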