Gawk Tools for Web Log Analysis

3b: Gawk for Web Log Analysis
152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140
"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453
"http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400740_1006"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145
"http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1;
SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif
HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE
6.0; Windows NT 5.1; SV1; MyIE2)"
© 2006 KDnuggets
Gawk - introduction
• A very powerful text processing and pattern matching language
• gawk is the GNU version of awk
• Syntax similar to C
See http://www.gnu.org/software/gawk/ for manual
Many awk/gawk tutorials, e.g.
• http://www.cs.hmc.edu/qref/awk.html
• http://www.cs.ucsb.edu/~sherwood/awk/
Note: The name awk comes from the initials of its designers:
Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan.
The original version of awk was written in 1977.
Gawk - running
• Several ways of running from the Unix prompt:
% gawk 'commands' file
% cat file | gawk 'commands'
% cat file | gawk -f prog.gawk
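As a quick check, the three invocation styles above can be compared on a small made-up file. This is only a sketch: the scratch directory, the file contents, and the one-line program are assumptions for illustration, and the AWK variable falls back to plain awk if gawk is not installed.

```shell
# Prefer gawk; fall back to awk (this sketch uses only features common to both).
AWK=$(command -v gawk || command -v awk)

tmp=$(mktemp -d)                              # scratch directory (hypothetical demo files)
printf 'alpha 1\nbeta 2\n' > "$tmp/file"      # made-up input
printf '{ print $1 }\n' > "$tmp/prog.gawk"    # program file for the -f form

a=$("$AWK" '{ print $1 }' "$tmp/file")              # program on the command line, file as argument
b=$(cat "$tmp/file" | "$AWK" '{ print $1 }')        # input piped in on stdin
c=$(cat "$tmp/file" | "$AWK" -f "$tmp/prog.gawk")   # program read from a file

printf '%s\n' "$a"
rm -r "$tmp"
```

All three variables should hold the same two lines, alpha and beta.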
Gawk – fields and records
• Gawk divides the file into records and fields
• Each line is a record (by default)
• Fields are delimited by a special character
• Default: white space (blank or tab)
• Can be changed with the -F option
• E.g. to have comma as a delimiter, use
gawk -F"," 'commands' file.csv
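A small sketch of the -F option, assuming a made-up comma-separated input (name, city, age). With the default whitespace delimiter, "New York" would be split across two fields; with -F"," it stays in field 2.

```shell
# Prefer gawk; fall back to awk (nothing gawk-specific is used here).
AWK=$(command -v gawk || command -v awk)

# Hypothetical CSV rows: name,city,age
second=$(printf 'ann,Boston,31\nbob,New York,45\n' | "$AWK" -F"," '{ print $2 }')
printf '%s\n' "$second"
```

The variable second should contain Boston and New York, one per line.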
Gawk fields and variables
Fields are accessed with the $ prefix
Special variables:
• $1 is the first field, $2 is the second, and so on
• $0 is a special field which is the entire line
• NF is a special variable: the number of fields in the current record
• NR is a special variable: the current record number
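To make these variables concrete, here is a minimal sketch that prints NR, NF, $1, and $0 for each record of a made-up two-line input (the input and the output wording are assumptions for illustration).

```shell
# Prefer gawk; fall back to awk (NR, NF, $0 behave the same in both).
AWK=$(command -v gawk || command -v awk)

out=$(printf 'a b c\nd e\n' |
      "$AWK" '{ print NR ": " NF " fields, first=" $1 ", line=" $0 }')
printf '%s\n' "$out"
```

The first record reports 3 fields and the second reports 2, each followed by its first field and the whole line.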
Gawk conditions
gawk -F"d" 'condition' file
• gawk processes each line of file, using the delimiter d (default is whitespace) to split each line into fields.
• When no action is given, the default action is to print the entire line, so every line for which the condition is true is printed.
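A minimal sketch of a condition with no action, assuming a made-up colon-delimited input: only the lines whose second field equals x are printed, and they are printed whole.

```shell
# Prefer gawk; fall back to awk (conditions work the same in both).
AWK=$(command -v gawk || command -v awk)

# Hypothetical records: user:group:score
matched=$(printf 'u1:x:10\nu2:y:20\nu3:x:30\n' | "$AWK" -F":" '$2 == "x"')
printf '%s\n' "$matched"
```

Here the first and third lines pass the condition; the second is skipped.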
Sample log file
• We will use the file d100.log: the first 100 lines from the Nov 16, 2005 KDnuggets log file.
• We will give useful code examples; for a full gawk introduction, see the manual and tutorials listed earlier.
• You are encouraged to try the code examples in this lecture on this file.
• You should get the same answers!
Sample log file d100.log
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /robots.txt HTTP/1.0" 200 173 "-" "msnbot/1.0
(+http://search.msn.com/msnbot.htm)"
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /gpspubs/sigkdd-kdd99-panel.html HTTP/1.0" 200 14199 "-"
"msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
ip2283.unr - - [16/Nov/2005:00:01:02 -0500] "GET /dmcourse/data_mining_course/assignments/assignment-3.html
HTTP/1.1" 200 8090 "http://www.google.com/search?hl=en&q=use+of+data+cleaning+in+data+mining&spell=1"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip2283.unr - - [16/Nov/2005:00:01:03 -0500] "GET /dmcourse/dm.css HTTP/1.1" 200 155
"http://www.kdnuggets.com/dmcourse/data_mining_course/assignments/assignment-3.html" "Mozilla/4.0
(compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip1389.net - - [16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/sld021.htm HTTP/1.1" 200 1385
"http://www.google.com/search?hs=JnE&hl=en&lr=&client=opera&rls=en&q=lift+curve&btnG=Search" "Mozilla/4.0
(compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/img021.gif HTTP/1.1" 200 7465
"http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux
i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02:47 -0500] "GET /favicon.ico HTTP/1.1" 200 899
"http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux
i686; en) Opera 8.5"
ip1946.com - - [16/Nov/2005:00:02:49 -0500] "GET /news/2001/n10/15i.html HTTP/1.0" 200 4214 "-" "Mozilla/5.0
(compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
…
Example 1:
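Lines with Status not equal to 200
• Status code is field $9 in the log file
• How many lines had a status code other than 200?
% gawk '$9 != 200' d100.log | wc
Result: 27 (the line count, which is the first number wc prints; wc -l prints only that count)
Note: to count status code equal to 200, use
'$9 == 200'
not '$9 = 200' (this sets $9 to be 200 and, because the assigned value is nonzero, matches every line)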
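The difference between !=, ==, and = can be seen on a tiny stand-in log. This is a sketch only: the three log lines below are made up in the same format as d100.log (they are not taken from it), and the variable names are hypothetical.

```shell
# Prefer gawk; fall back to awk (nothing gawk-specific is used here).
AWK=$(command -v gawk || command -v awk)

# Three made-up log lines with status codes 200, 404, 200.
log=$(cat <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1200 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "http://www.google.com/search?q=gawk" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "POST /form/ HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "TestAgent/1.0"
EOF
)

not200=$(printf '%s\n' "$log" | "$AWK" '$9 != 200' | wc -l)    # comparison: status differs from 200
is200=$(printf '%s\n' "$log" | "$AWK" '$9 == 200' | wc -l)     # comparison: status equals 200
assigned=$(printf '%s\n' "$log" | "$AWK" '$9 = 200' | wc -l)   # assignment: overwrites $9, always true

echo "not 200: $not200, equal to 200: $is200, with '=': $assigned"
```

On this stand-in input the counts are 1, 2, and 3: the assignment form prints every line, which is why it gives a misleading count.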
Example 2:
Count referrals from Google
• Gawk has powerful pattern matching
• variable ~ "pattern" is true when variable matches the regular expression pattern
• Example: how many log lines had a referral (field $11 in the log line) from google:
% gawk '$11 ~ "google"' d100.log | wc
Result: 2
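A sketch of the ~ operator on a made-up three-line log in the same format (the lines and variable names are assumptions, not data from d100.log). It also uses the standard awk operator !~, which is true when the pattern does not match.

```shell
# Prefer gawk; fall back to awk (~ and !~ exist in both).
AWK=$(command -v gawk || command -v awk)

# Made-up log lines; the referrer is field $11.
log=$(cat <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1200 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "http://www.google.com/search?q=gawk" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "POST /form/ HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "TestAgent/1.0"
EOF
)

from_google=$(printf '%s\n' "$log" | "$AWK" '$11 ~ "google"' | wc -l)    # referrer contains "google"
not_google=$(printf '%s\n' "$log" | "$AWK" '$11 !~ "google"' | wc -l)    # referrer does not

echo "from google: $from_google, others: $not_google"
```

Only the second made-up line has a google referrer, so the two counts are 1 and 2.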
Example 3: complex condition
• How many hits had GET method and status 404?
• (status 404 is the "not found" error code)
• Method is field $6 in the log, but the request is surrounded by " ", so $6 is "GET with a leading quote. Instead of testing for equality, we can match the pattern:
% gawk '$6 ~ "GET" && $9 == 404' d100.log | wc
Result: 1
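The following sketch shows why the example uses a pattern match on $6 instead of an equality test. The log lines are made up for illustration; the variable names loose, exact, and quoted are hypothetical.

```shell
# Prefer gawk; fall back to awk (nothing gawk-specific is used here).
AWK=$(command -v gawk || command -v awk)

# Made-up log lines: a GET with 200, a GET with 404, a POST with 200.
log=$(cat <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1200 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "http://www.google.com/search?q=gawk" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "POST /form/ HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "TestAgent/1.0"
EOF
)

loose=$(printf '%s\n' "$log" | "$AWK" '$6 ~ "GET" && $9 == 404' | wc -l)       # pattern match, as in the slide
exact=$(printf '%s\n' "$log" | "$AWK" '$6 == "GET" && $9 == 404' | wc -l)      # never true: $6 is "GET
quoted=$(printf '%s\n' "$log" | "$AWK" '$6 == "\"GET" && $9 == 404' | wc -l)   # equality with the quote escaped

echo "match: $loose, equality: $exact, equality with quote: $quoted"
```

The plain equality test finds nothing because of the leading quote in $6, while the pattern match and the escaped-quote equality each find the one GET request with status 404.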
Example 4a:
Counting ".html" requests
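• The requested file is field $7. We can use this condition to match files that end in .html
• Note: $ in the pattern matches the end of string
• Note: an unescaped . matches any single character; to match only a literal dot, write the pattern as /\.html$/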
% gawk '$7 ~ ".html$"' d100.log | wc
Result: 21
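A small sketch of the effect of the unescaped dot. To keep it short, the made-up input has the requested path as its only field, so the test is on $1 here rather than $7; the paths themselves are hypothetical.

```shell
# Prefer gawk; fall back to awk (both pattern forms work in either).
AWK=$(command -v gawk || command -v awk)

# ".html$": the dot matches any character, so a path ending in "xhtml" also matches.
loose=$(printf '/a.html\n/notes-xhtml\n/b.htm\n' | "$AWK" '$1 ~ ".html$"' | wc -l)

# /\.html$/: the escaped dot matches only a literal ".".
strict=$(printf '/a.html\n/notes-xhtml\n/b.htm\n' | "$AWK" '$1 ~ /\.html$/' | wc -l)

echo "unescaped dot: $loose, escaped dot: $strict"
```

On these three paths the unescaped pattern matches two lines and the escaped one matches only /a.html. On a real log the two patterns will usually give the same count, since few paths end in a non-dot character followed by html.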
Example 4b:
Counting htm or html requests
Some files may also end in .htm, so we can use
% gawk '$7 ~ ".html$|.htm$"' d100.log | wc
Result: 22
Example 4c:
Counting directory requests
• Some requests can be for a directory, e.g. a request for the homepage www.kdnuggets.com/ would contain the string "GET / HTTP/1.1".
• We can count these requests by
% gawk '$7 ~ "/$"' d100.log | wc
Result: 6
Example 4d:
Counting all HTML pages
• Or count html, htm, and directory pages together by
% gawk '$7 ~ "(html|htm|/)$"' d100.log | wc
Result: 28
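The three patterns from Examples 4b to 4d can be compared on a handful of made-up paths. As a simplification, the path is the only field in this sketch, so the tests use $1 instead of $7; the paths and variable names are hypothetical.

```shell
# Prefer gawk; fall back to awk (alternation and grouping work in both).
AWK=$(command -v gawk || command -v awk)

# Made-up request paths: two pages, two directories, one image.
paths=$(printf '/a.html\n/b.htm\n/dir/\n/\n/logo.gif\n')

pages=$(printf '%s\n' "$paths" | "$AWK" '$1 ~ ".html$|.htm$"' | wc -l)      # html or htm
dirs=$(printf '%s\n' "$paths" | "$AWK" '$1 ~ "/$"' | wc -l)                 # ends in a slash
all=$(printf '%s\n' "$paths" | "$AWK" '$1 ~ "(html|htm|/)$"' | wc -l)       # any of the three

echo "pages: $pages, directories: $dirs, combined: $all"
```

The combined count (4) equals the page count (2) plus the directory count (2), just as 28 = 22 + 6 in the slides, because no path can end in both a slash and htm or html.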
Gawk computations
• A more general form of gawk statements is
gawk '{statements;…}' file
• The statements are executed for each line of file
• Statements include the usual conditionals, loops, etc.
• Details are in the gawk manual and tutorials
Example 5: External referrers
• Example: Print referrers to html pages, excluding direct access (where referrer is "-")
• Note: the referrer field includes its surrounding double quotes, so to test if $11 is "-" we need to escape each double quote as \"
• Code:
% gawk '{if ($7~"html$" && $11!="\"-\"") print $11}' d100.log | wc
Result: 7
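A sketch of the same statement on a made-up three-line log (the lines are illustrative, not from d100.log). Printing the result instead of piping it to wc shows that the referrer keeps its surrounding double quotes.

```shell
# Prefer gawk; fall back to awk (if statements work the same in both).
AWK=$(command -v gawk || command -v awk)

# Made-up log lines: an html page with no referrer, an html page reached
# from a search, and an image requested from a page.
log=$(cat <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1200 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /jobs.html HTTP/1.1" 200 300 "http://www.google.com/search?q=gawk" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET /logo.gif HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "TestAgent/1.0"
EOF
)

# Print the referrer of each html request whose referrer is not "-".
refs=$(printf '%s\n' "$log" | "$AWK" '{ if ($7 ~ "html$" && $11 != "\"-\"") print $11 }')
printf '%s\n' "$refs"
```

Only the second line qualifies: the first has "-" as its referrer and the third is not an html page.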
Gawk statements: BEGIN, END
• To execute statements before reading the first line, we use the BEGIN keyword
• To execute statements after the last line is read, we use the END keyword
gawk 'BEGIN{stat1;…}{stat2;…}END{stat3;…}' file
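A minimal sketch of the BEGIN / main / END structure, assuming a made-up three-line input. The counter name n and the printed strings are hypothetical.

```shell
# Prefer gawk; fall back to awk (BEGIN and END are standard awk).
AWK=$(command -v gawk || command -v awk)

out=$(printf 'x\ny\nz\n' |
      "$AWK" 'BEGIN { print "start" }       # runs once, before any input is read
              { n++ }                        # runs for every input line
              END { print n, "lines read" }  # runs once, after the last line')
printf '%s\n' "$out"
```

The BEGIN block prints first, the middle block counts the three lines silently, and the END block reports the total.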
Example 6
• Sum all the object sizes for status code 200
% gawk '{if ($9 == 200) sumsize+=$10} END{print sumsize}' d100.log
Result: 396460
Note: we did not initialize sumsize; by default, an uninitialized variable is treated as zero in arithmetic (and as an empty string elsewhere)
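The sketch below repeats the sum on a made-up three-line log and then shows what an uninitialized variable looks like when no line ever updates it. The log lines are illustrative only, and /dev/null is used as an empty input.

```shell
# Prefer gawk; fall back to awk (nothing gawk-specific is used here).
AWK=$(command -v gawk || command -v awk)

# Made-up log lines: sizes 1200 and 500 have status 200, size 300 has status 404.
log=$(cat <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1200 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.htm HTTP/1.1" 404 300 "http://www.google.com/search?q=gawk" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "POST /form/ HTTP/1.1" 200 500 "http://www.kdnuggets.com/" "TestAgent/1.0"
EOF
)

total=$(printf '%s\n' "$log" | "$AWK" '{ if ($9 == 200) sumsize += $10 } END { print sumsize }')

# With no input lines, sumsize is never touched:
empty=$("$AWK" 'END { print "[" sumsize "]" }' /dev/null)   # printed as an empty string
zero=$("$AWK" 'END { print sumsize + 0 }' /dev/null)        # adding 0 forces a number

echo "total: $total, untouched: $empty, forced numeric: $zero"
```

The total is 1700 (1200 + 500). If the input might contain no matching lines, printing sumsize + 0 gives 0 instead of an empty line.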