3b: Gawk for Web Log Analysis

152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET / HTTP/1.1" 200 12453 "http://www.yisou.com/search?p=data+mining&source=toolbar_yassist_button&pid=400 740_1006" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /kdr.css HTTP/1.1" 200 145 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"
252.113.176.247 - - [16/Feb/2006:00:06:00 -0500] "GET /images/KDnuggets_logo.gif HTTP/1.1" 200 784 "http://www.kdnuggets.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; MyIE2)"

© 2006 KDnuggets

Gawk - introduction
• A very powerful text processing and pattern matching language
• gawk is the GNU version of awk
• Syntax similar to C
See http://www.gnu.org/software/gawk/ for the manual
Many awk/gawk tutorials are available, e.g.
• http://www.cs.hmc.edu/qref/awk.html
• http://www.cs.ucsb.edu/~sherwood/awk/
Note: The name awk comes from the initials of its designers: Alfred V. Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of awk was written in 1977.

Gawk - running
Several ways of running from the Unix prompt:
% gawk 'commands' file
% cat file | gawk 'commands'
% cat file | gawk -f prog.gawk

Gawk - fields and records
Gawk divides the file into records and fields
• Each line is a record (by default)
• Fields are delimited by a special character
• Default: white space (blank or tab)
• Can be changed with the -F option, e.g.
to have comma as a delimiter, use
% gawk -F"," 'commands' file.csv

Gawk fields and variables
Fields are accessed with the $ prefix:
• $1 is the first field, $2 is the second, and so on
• $0 is a special field holding the entire line
Special variables:
• NF is the number of fields in the current record
• NR is the current record number

Gawk conditions
gawk -F"d" 'condition' file
gawk processes each line of file, using the delimiter d (default is whitespace) to split the line into fields. No action is given, so gawk applies the default action: it prints the entire line whenever the condition is true.

Sample log file
We will use the file d100.log, the first 100 lines of the Nov 16, 2005 KDnuggets log file.
We give useful code examples here; for a full gawk introduction, see the gawk manual and tutorials.
You are encouraged to try the code examples in this lecture on this file.
You should get the same answers!

Sample log file d100.log
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /robots.txt HTTP/1.0" 200 173 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
ip1664.com - - [16/Nov/2005:00:00:43 -0500] "GET /gpspubs/sigkdd-kdd99-panel.html HTTP/1.0" 200 14199 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"
ip2283.unr - - [16/Nov/2005:00:01:02 -0500] "GET /dmcourse/data_mining_course/assignments/assignment-3.html HTTP/1.1" 200 8090 "http://www.google.com/search?hl=en&q=use+of+data+cleaning+in+data+mining&spell=1" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip2283.unr - - [16/Nov/2005:00:01:03 -0500] "GET /dmcourse/dm.css HTTP/1.1" 200 155 "http://www.kdnuggets.com/dmcourse/data_mining_course/assignments/assignment-3.html" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
ip1389.net - - [16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/sld021.htm HTTP/1.1" 200 1385 "http://www.google.com/search?hs=JnE&hl=en&lr=&client=opera&rls=en&q=lift+curve&btnG=Search" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - -
[16/Nov/2005:00:02:46 -0500] "GET /gpspubs/kdd99-est-ben-lift/img021.gif HTTP/1.1" 200 7465 "http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1389.net - - [16/Nov/2005:00:02:47 -0500] "GET /favicon.ico HTTP/1.1" 200 899 "http://www.kdnuggets.com/gpspubs/kdd99-est-ben-lift/sld021.htm" "Mozilla/4.0 (compatible; MSIE 6.0; X11; Linux i686; en) Opera 8.5"
ip1946.com - - [16/Nov/2005:00:02:49 -0500] "GET /news/2001/n10/15i.html HTTP/1.0" 200 4214 "-" "Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)"
…

Example 1: Lines with status not equal to 200
Status code is field $9 in the log file
How many lines had a status code other than 200?
% gawk '$9 != 200' d100.log | wc -l
Result: 27
Note: to count lines with status code equal to 200, use '$9 == 200', not '$9 = 200' (the latter is an assignment that sets $9 to 200)

Example 2: Count referrals from Google
Gawk has powerful pattern matching:
variable ~ "pattern"
Example: how many log lines had a referral (field $11 in the log line) from Google?
% gawk '$11 ~ "google"' d100.log | wc -l
Result: 2

Example 3: Complex condition
How many hits had the GET method and status 404? (Status 404 is the "not found" error code.)
The method is field $6 in the log, but the request is surrounded by " ", so $6 is "GET with a leading double quote. We therefore use a pattern match instead of an equality test:
% gawk '$6 ~ "GET" && $9 == 404' d100.log | wc -l
Result: 1

Example 4a: Counting ".html" requests
The requested file is field $7. We can use this condition to match files that end in .html
Note: $ in the pattern matches the end of the string
Note: an unescaped . in the pattern matches any single character, so this pattern is slightly broader than a literal .html suffix
% gawk '$7 ~ ".html$"' d100.log | wc -l
Result: 21

Example 4b: Counting htm or html requests
Some files may also end in .htm, so we can use
% gawk '$7 ~ ".html$|.htm$"' d100.log | wc -l
Result: 22

Example 4c: Counting directory requests
Some requests can be for a directory, e.g. a request for the homepage www.kdnuggets.com/ would have the string "GET / HTTP/1.1".
We can count these requests with
% gawk '$7 ~ "/$"' d100.log | wc -l
Result: 6

Example 4d: Counting all HTML pages
We can count html, htm, and directory pages together with
% gawk '$7 ~ "(html|htm|/)$"' d100.log | wc -l
Result: 28

Gawk computations
A more general form of a gawk program is
gawk '{statements;…}' file
The statements are executed for each line of file
Statements include the usual conditionals, loops, etc.
Details are in the gawk manual and tutorials

Example 5: External referrers
Example: Print referrers to html pages, excluding direct access (where the referrer is "-")
Note: the referrer field includes its surrounding double quotes, so to test whether $11 is "-" we need to escape each double quote as \"
Code (all on one line):
% gawk '{if ($7~"html$" && $11!="\"-\"") print $11}' d100.log | wc -l
Result: 7

Gawk statements: BEGIN, END
To execute statements before the first line is read, we use the BEGIN keyword
To execute statements after the last line is read, we use the END keyword
gawk 'BEGIN{stat1;…}{stat2;…}END{stat3;…}' file

Example 6
Sum all the object sizes for status code 200
% gawk '{if ($9 == 200) sumsize+=$10} END{print sumsize}' d100.log
Result: 396460
Note: we did not initialize sumsize; in gawk, an uninitialized variable is treated as zero in numeric context
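
Sketch: BEGIN, per-line statements, and END together
The BEGIN / per-line / END structure can be tried without d100.log. The script below is a minimal sketch under stated assumptions: the three log lines are hypothetical (made up in the same format, not taken from d100.log), the variable names hits, sumsize, and result are illustrative, and the script falls back to the system awk if gawk is not installed.

```shell
#!/bin/sh
# Use gawk if it is installed, otherwise fall back to the system awk (assumption).
AWK=$(command -v gawk || command -v awk)

# Hypothetical three-line sample in the same log format (not from d100.log).
log=$(mktemp)
cat > "$log" <<'EOF'
host1.example - - [16/Nov/2005:00:00:01 -0500] "GET /index.html HTTP/1.1" 200 1000 "-" "TestAgent/1.0"
host2.example - - [16/Nov/2005:00:00:02 -0500] "GET /missing.html HTTP/1.1" 404 250 "-" "TestAgent/1.0"
host3.example - - [16/Nov/2005:00:00:03 -0500] "GET /news/ HTTP/1.1" 200 500 "http://www.google.com/search?q=test" "TestAgent/1.0"
EOF

# BEGIN runs once before the first line is read,
# the middle block runs for every line,
# and END runs once after the last line is read.
result=$("$AWK" 'BEGIN { hits = 0; sumsize = 0 }
                 { if ($9 == 200) { hits++; sumsize += $10 } }
                 END { print hits, sumsize }' "$log")

# Prints the number of status-200 lines and the sum of their sizes.
echo "$result"

rm -f "$log"
```

The explicit initialization in BEGIN is optional here, since uninitialized variables already count as zero in numeric context; it is shown only to make the three phases visible.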