Methodologies for Population/Quantitative Genetics Animal Science 562 AWK Programming Introduction “Computer users spend a lot of time doing simple, mechanical data manipulation - changing the format of data, checking its validity, finding items with some property, adding up numbers, printing reports, and the like. All of these jobs ought to be mechanized, but it’s a real nuisance to have to write a special-purpose program in a standard language like C or Pascal each time such a task comes up. “Awk is a programming language that make it possible to handle simple, mechanical data manipulation tasks with very short programs, often only one or two lines long. An awk program is a sequence of patterns and actions that tell what to look for in the input data and what to do when it’s found.” Aho, Kernighan and Weinberger. 1988. “The AWK Programming Language” 1.1 Getting Started File emp.dat contains name, pay rate in dollars per hour, number of hours worked, one employee record per line. Beth 4.00 0 Dan 3.75 0 Kathy 4.00 10 Mark 5.00 20 Mary 5.50 22 Susie 4.25 18 Task: Print the name and pay (rate times hours) for everyone who worked more than zero hours. Program: awk ‘$3 > 0 { print $1, $2 * $3 }’ emp.dat The Structure of an AWK Program Each AWK program is a sequence of one or more pattern-action statements pattern {action} pattern {action} … Running an AWK Program Type a command line of the form awk ‘program’ input files omit the input files from the command line awk ‘program’ awk will apply the program to whatever you type next on your terminal until you type an end-of-file signal (control-d on UNIX systems). 1 Animal Science 562 AWK Programming Executing long programs awk -f progfile optional list of input files 1.2 Simple Output Only two types of data in awk: numbers and strings of characters. Awk reads one line at a time and splits each line into fields. A field is a sequence of characters that doesn’t contain any blanks or tabs. First field in current input line is called $1, the second $2, and so forth. The entire line is called $0. The number of fields can vary from line to line. Print Every Line Print Certain Fields NF, The Number of Fields awk awk awk ‘{print}’ OR awk ‘{print $0}’ ‘{print $1, $3}’ ‘{print NF, $1, $NF}’ (Any expression can be used after $ to denote a field number.) Computing and Printing awk Printing Line Numbers awk Putting Text in the Output awk ‘{print $1, $2 * $3}’ ‘{print NR, $0}’ ‘{print “total pay for “, $1, “ is “, $2 * $3}’ 1.3 Fancier Output • print statement is meant for quick and easy output • use printf statement to format the output exactly the way you want it Lining Up Fields printf statement form printf (format, value1, value2, …, valuen) where format is a string that contains text to be printed verbatim interspersed with specification of how each of the values is to be printed A specification is a % followed by a few characters that control the format of a value. Task: Use prinf to print the total pay for every employee awk ‘{printf (“total pay for %s is $%.2f \n”, $1, $2 * $3)}’ • contains two % specifications %s %.2f print the first value $1, as a string of characters print the second value, $2*$3, as a number with 2 digits after the decimal point • no blanks or new lines are produced automatically; you must create them yourself. Don’t forget the \n. 2 Animal Science 562 AWK Programming Task: Print each employee’s name and pay. awk ‘{printf (“%-8s $%6.2f \n”, $1, $2 * $3)}’ Sorting the Output Task: Print all data for each employee, along with his or her pay, sorted in order of increasing pay. awk ‘{printf (“%6.2f %s \n”, $2 * $3, $0)}’ emp.data | sort • pipes the output of awk into the sort command. 1.4 Selection • Awk patterns are good for selecting interesting lines from the input for further processing. Selection by Comparison Task: A comparison pattern to select the records of employees who earn $5.00 or more per hour. awk ‘$2 >= 5’ emp.data Selection by Computation Task: Print the pay of those employees whose total pay exceeds $50. awk ‘$2*$3 > 50 {printf(“$%.2f for %s \n”, $2*$3, $1)}’ Selection by Text Content Task: Print all lines in which the first field is Susie awk ‘$1 = =”Susie”’ Combinations of Patterns • Patterns can be combined with parentheses and the logical operators &&, ||, and !, which stand for AND, OR, and NOT, respectively. Task: Print lines where $2 is at least 4 or $3 is at least 20. awk ‘$2 >= 4 || $3 >= 20’ emp.data Data Validation • Awk is an excellent tool for checking that data has reasonable values and that it is in the right format. Task: Use comparison patterns to apply five plausibility tests to each line of emp.data. awk ‘NF !=3 {print $0, “number of fields is not equal to 3”}’ awk ‘$2 <3.35 {print $0, “rate is below minimum wage”}’ awk ‘$2 > 10 {print $0, “rate exceeds $10 per hour”}’ awk ‘$3 <0 {print $0, “negative hours worked”}’ awk ‘$3 >60 {print $0, “too many hours worked”}’ 3 Animal Science 562 AWK Programming BEGIN and END The special pattern BEGIN matches before the first line of the first input file is read, and END matches after the last line of the last file has been processed. Task: Use BEGIN to print a heading. (Note. This is a multiple line file and must be executed from a file.) BEGIN {print “Name Rate Hours”; print “ ”} {print} • You can put several statements on a single line if you separate them by semicolons. 1.5 Computing with AWK • In awk, user-created variables are not declared. Counting Task: Use a variable emp to count employees who have worked more than 15 hours. $3 > 15 {emp=emp+1} END {print emp, “employees worked more than 15 hours”} Computing Sums and Averages Task: Use the built-in variable NR to count the number of employees awk ‘END {print NR, “employees”} Task: Compute the average pay {pay = pay + $2*$3} END {print NR, “employees” print “Total pay is ”, pay print “average pay is ”, pay /NR } Handling Text • One strength of awk is its ability to handle strings of characters as conveniently as most languages handle numbers. Task: Find the employee who is paid the most per hour. $2 > maxrate {maxrate = $2; maxemp=$1}’ END print “highest hourly rate: “, maxrate, “ for “, maxemp} String Concatenation Task: Create new strings by combining old ones {names = names $1 “ “} END {print names} Built-in Functions 4 Animal Science 562 AWK Programming • Provides built-in variables that maintain frequently used quantities: number of fields, input line number • Built-in functions for computing square root logarithms random numbers 1.6 Control-Flow Statements (Note: These constructs are available in gawk and not awk at ISU.) IF-Else Statement $2 > 6 {n = n+1; pay = pay + $2 * $3} END {if (n > 0) print n, “employees, Total pay is “, pay, “average pay is “, pay/n else print “no employees are paid more than $6/hour” } While Statement Task: Show how the value of an amount of money invested at a particular interest rate grows over a number of years, using the formula value = amount (1 + rate)years. #interest1 - compute compound interest # input: amount rate years # output: compounded value at the end of each year. { i=1 while (i <= $3){ printf(“\t%.2f\n”, $1*(1+$2)^i) i=i+1 } } Try gawk -f interest1 1000 .06 5 1000 .12 5 References Aho, A. V., B. W. Kernighan, and P. J. Weinberger. 1988. The AWK Programming Language. Addison-Wesley. New York. Dougherty, D. 1990. Sed and Awk: UNIX Power Tools. O’Reilly and Associates, Inc. California. 5