AWK Programming Introduction

Methodologies for Population/Quantitative Genetics
Animal Science 562
AWK Programming
Introduction
“Computer users spend a lot of time doing simple, mechanical data
manipulation - changing the format of data, checking its validity, finding items with
some property, adding up numbers, printing reports, and the like. All of these jobs
ought to be mechanized, but it’s a real nuisance to have to write a special-purpose
program in a standard language like C or Pascal each time such a task comes up.
“Awk is a programming language that makes it possible to handle simple,
mechanical data manipulation tasks with very short programs, often only one or two
lines long. An awk program is a sequence of patterns and actions that tell what to
look for in the input data and what to do when it’s found.”
Aho, Kernighan and Weinberger. 1988. “The AWK Programming Language”
1.1 Getting Started
The file emp.data contains one employee record per line: name, pay rate in dollars per hour, and number of hours worked.
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
Task: Print the name and pay (rate times hours) for everyone who worked more than zero hours.
Program:
awk '$3 > 0 { print $1, $2 * $3 }' emp.data
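To try this end to end, the sketch below first writes the six records to a file and then runs the program. The heredoc is just one convenient way to create the file, and the name emp.data is the one used by the later examples in these notes.

```shell
# Create the sample data file: name, hourly rate, hours worked.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Print name and pay for everyone who worked more than zero hours.
awk '$3 > 0 { print $1, $2 * $3 }' emp.data
# Kathy 40
# Mark 100
# Mary 121
# Susie 76.5
```

Beth and Dan are skipped because their third field is 0, so the pattern $3 > 0 does not match their lines.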
The Structure of an AWK Program
Each AWK program is a sequence of one or more pattern-action statements:
pattern {action}
pattern {action}
…
Running an AWK Program
Type a command line of the form
awk 'program' input files
Alternatively, omit the input files from the command line:
awk 'program'
Awk then applies the program to whatever you type next on your terminal until you type an end-of-file signal (control-d on UNIX systems).
Executing long programs
If the program is long, put it in a file, say progfile, and type
awk -f progfile optional list of input files
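As a minimal sketch of the -f form, the block below saves a two-statement program in progfile and feeds it two records on standard input rather than from a file. The first statement ($3 == 0) is an illustrative addition, not part of the notes.

```shell
# Save the program in a file; each line is one pattern-action statement.
cat > progfile <<'EOF'
$3 == 0 { print $1, "did not work" }
$3 > 0  { print $1, $2 * $3 }
EOF

# No input files are named, so awk reads standard input (here, a pipe).
printf 'Beth 4.00 0\nKathy 4.00 10\n' | awk -f progfile
# Beth did not work
# Kathy 40
```

Because the program lives in a file, no shell quoting is needed around it.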
1.2 Simple Output
Awk has only two types of data: numbers and strings of characters.
Awk reads one line at a time and splits each line into fields.
A field is a sequence of characters that doesn’t contain any blanks or tabs.
The first field in the current input line is called $1, the second $2, and so forth.
The entire line is called $0.
The number of fields can vary from line to line.
Print Every Line
awk '{print}' OR awk '{print $0}'
Print Certain Fields
awk '{print $1, $3}'
NF, The Number of Fields
awk '{print NF, $1, $NF}'
(Any expression can be used after $ to denote a field number.)
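The following sketch shows field splitting on two made-up input lines with different numbers of fields. $NF picks out the last field, and $(NF-1) is an example of an expression used as a field number.

```shell
# Line 1 has three fields, line 2 has two.
printf 'Kathy 4.00 10\nBeth 4.00\n' | awk '{ print NF, $1, $NF, $(NF-1) }'
# 3 Kathy 10 4.00
# 2 Beth 4.00 Beth
```

On the second line NF is 2, so $(NF-1) is $1.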
Computing and Printing
awk '{print $1, $2 * $3}'
Printing Line Numbers
awk '{print NR, $0}'
Putting Text in the Output
awk '{print "total pay for", $1, "is", $2 * $3}'
1.3 Fancier Output
• The print statement is meant for quick and easy output.
• Use the printf statement to format the output exactly the way you want it.
Lining Up Fields
printf statement form
printf (format, value_1, value_2, …, value_n)
where format is a string that contains text to be printed verbatim, interspersed with specifications of how each of the values is to be printed.
A specification is a % followed by a few characters that control the format of a value.
Task: Use printf to print the total pay for every employee.
awk '{printf("total pay for %s is $%.2f\n", $1, $2 * $3)}' emp.data
• The format contains two % specifications:
%s prints the first value, $1, as a string of characters
%.2f prints the second value, $2*$3, as a number with 2 digits after the decimal point
• No blanks or newlines are produced automatically; you must create them yourself. Don’t forget the \n.
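A short sketch of both points, using two sample records on standard input: the first command shows %s and %.2f at work, and the second shows what happens when the \n is left out of the format.

```shell
# %s prints the name as a string; %.2f prints the pay with two decimals.
printf 'Kathy 4.00 10\nMark 5.00 20\n' |
awk '{ printf("total pay for %s is $%.2f\n", $1, $2 * $3) }'
# total pay for Kathy is $40.00
# total pay for Mark is $100.00

# Without \n in the format, successive outputs run together on one line.
printf 'Kathy 4.00 10\nMark 5.00 20\n' |
awk '{ printf("%s ", $1) } END { printf("\n") }'
# Kathy Mark   (both names on a single line)
```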
Task: Print each employee’s name and pay.
awk '{printf("%-8s $%6.2f\n", $1, $2 * $3)}' emp.data
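To see how the widths line things up, here is the same format applied to two sample records. %-8s left-justifies the name in a field 8 characters wide, and %6.2f right-justifies the pay in a field 6 characters wide.

```shell
printf 'Kathy 4.00 10\nMark 5.00 20\n' |
awk '{ printf("%-8s $%6.2f\n", $1, $2 * $3) }'
# Kathy    $ 40.00
# Mark     $100.00
```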
Sorting the Output
Task: Print all data for each employee, along with his or her pay, sorted in order of increasing pay.
awk '{printf("%6.2f %s\n", $2 * $3, $0)}' emp.data | sort
• The | pipes the output of awk into the sort command.
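A runnable sketch of the pipeline follows. One assumption differs from the command above: it uses sort -n, because plain sort compares lines as text and orders these right-aligned numbers correctly only if leading blanks are compared as ordinary characters, which can depend on the locale.

```shell
# Create the sample data file.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Prefix each record with its pay, then sort numerically on that prefix.
awk '{ printf("%6.2f %s\n", $2 * $3, $0) }' emp.data | sort -n
#   0.00 Beth 4.00 0
#   0.00 Dan 3.75 0
#  40.00 Kathy 4.00 10
#  76.50 Susie 4.25 18
# 100.00 Mark 5.00 20
# 121.00 Mary 5.50 22
```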
1.4 Selection
• Awk patterns are good for selecting interesting lines from the input for further processing.
Selection by Comparison
Task: Use a comparison pattern to select the records of employees who earn $5.00 or more per hour.
awk '$2 >= 5' emp.data
Selection by Computation
Task: Print the pay of those employees whose total pay exceeds $50.
awk '$2*$3 > 50 {printf("$%.2f for %s\n", $2*$3, $1)}' emp.data
Selection by Text Content
Task: Print all lines in which the first field is Susie.
awk '$1 == "Susie"' emp.data
Combinations of Patterns
• Patterns can be combined with parentheses and the logical operators &&, ||, and !, which stand
for AND, OR, and NOT, respectively.
Task: Print lines where $2 is at least 4 or $3 is at least 20.
awk '$2 >= 4 || $3 >= 20' emp.data
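The sketch below runs this pattern on the sample data and, as an illustration of ! and &&, an equivalent pattern written as a negation. Both select every record except Dan's, and each matching line is printed once.

```shell
# Create the sample data file.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# Rate at least 4 OR hours at least 20.
awk '$2 >= 4 || $3 >= 20' emp.data

# The same selection: NOT (rate below 4 AND hours below 20).
awk '!($2 < 4 && $3 < 20)' emp.data
```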
Data Validation
• Awk is an excellent tool for checking that data has reasonable values and that it is in the right
format.
Task: Use comparison patterns to apply five plausibility tests to each line of emp.data.
awk 'NF != 3 {print $0, "number of fields is not equal to 3"}' emp.data
awk '$2 < 3.35 {print $0, "rate is below minimum wage"}' emp.data
awk '$2 > 10 {print $0, "rate exceeds $10 per hour"}' emp.data
awk '$3 < 0 {print $0, "negative hours worked"}' emp.data
awk '$3 > 60 {print $0, "too many hours worked"}' emp.data
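The five tests can also be collected into one program file so the data is read only once. The sketch below assumes a hypothetical file name, checks.awk, and uses made-up records containing deliberate errors; the real emp.data passes all five tests and would produce no output.

```shell
# One pattern-action statement per plausibility test.
cat > checks.awk <<'EOF'
NF != 3   { print $0, "number of fields is not equal to 3" }
$2 < 3.35 { print $0, "rate is below minimum wage" }
$2 > 10   { print $0, "rate exceeds $10 per hour" }
$3 < 0    { print $0, "negative hours worked" }
$3 > 60   { print $0, "too many hours worked" }
EOF

# Made-up test records: only the first one is clean.
printf 'Beth 4.00 0\nDan 3.00 10\nKathy 4.00 10 extra\nMark 12.00 70\n' |
awk -f checks.awk
# Dan 3.00 10 rate is below minimum wage
# Kathy 4.00 10 extra number of fields is not equal to 3
# Mark 12.00 70 rate exceeds $10 per hour
# Mark 12.00 70 too many hours worked
```

A line that fails several tests, like Mark's, is reported once per failed test.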
BEGIN and END
The special pattern BEGIN matches before the first line of the first input file is read, and
END matches after the last line of the last file has been processed.
Task: Use BEGIN to print a heading. (Note: This is a multi-line program, so it must be saved in a file and run with awk -f.)
BEGIN {print "Name Rate Hours"; print ""}
{print}
• You can put several statements on a single line if you separate them by semicolons.
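A sketch of running the heading program from a file; heading.awk is a hypothetical file name chosen for this example.

```shell
# Create the sample data file.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

# BEGIN runs once before any input; the second statement prints every line.
cat > heading.awk <<'EOF'
BEGIN { print "Name Rate Hours"; print "" }
      { print }
EOF

# Prints the heading, a blank line, then the six records unchanged.
awk -f heading.awk emp.data
```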
1.5 Computing with AWK
• In awk, user-created variables are not declared.
Counting
Task: Use a variable emp to count employees who have worked more than 15 hours.
$3 > 15 {emp=emp+1}
END {print emp, "employees worked more than 15 hours"}
Computing Sums and Averages
Task: Use the built-in variable NR to count the number of employees.
awk 'END {print NR, "employees"}' emp.data
Task: Compute the average pay.
{pay = pay + $2*$3}
END {print NR, "employees"
     print "total pay is", pay
     print "average pay is", pay/NR
}
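The counting and averaging programs can be combined and run on the sample data as sketched below; summary.awk is a hypothetical file name. The variables emp and pay are never declared: each starts out as zero the first time it is used in arithmetic.

```shell
# Create the sample data file.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

cat > summary.awk <<'EOF'
$3 > 15 { emp = emp + 1 }          # count long-hours employees
        { pay = pay + $2 * $3 }    # accumulate total pay over all lines
END     { print emp, "employees worked more than 15 hours"
          print "total pay is", pay
          print "average pay is", pay / NR
        }
EOF

awk -f summary.awk emp.data
# 3 employees worked more than 15 hours
# total pay is 337.5
# average pay is 56.25
```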
Handling Text
• One strength of awk is its ability to handle strings of characters as conveniently as most
languages handle numbers.
Task: Find the employee who is paid the most per hour.
$2 > maxrate {maxrate = $2; maxemp = $1}
END {print "highest hourly rate:", maxrate, "for", maxemp}
String Concatenation
Task: Create new strings by combining old ones
{names = names $1 " "}
END {print names}
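Both text-handling programs can be tried together on the sample data, as in this sketch; text.awk is a hypothetical file name. Writing two values side by side, as in names $1 " ", concatenates them.

```shell
# Create the sample data file.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

cat > text.awk <<'EOF'
$2 > maxrate { maxrate = $2; maxemp = $1 }   # remember the best-paid name
             { names = names $1 " " }        # append each name and a blank
END          { print "highest hourly rate:", maxrate, "for", maxemp
               print names
             }
EOF

awk -f text.awk emp.data
# highest hourly rate: 5.50 for Mary
# Beth Dan Kathy Mark Mary Susie
```

The second output line ends with a blank, because a blank is appended after every name, including the last.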
Built-in Functions
• Awk provides built-in variables that maintain frequently used quantities, such as the number of fields (NF) and the input line number (NR).
• Awk also provides built-in functions for computing square roots, logarithms, and random numbers.
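A brief sketch of these built-ins, using the standard awk functions sqrt, log (natural logarithm), and rand. The value returned by rand varies from run to run, so the example only checks that it lies in the range 0 up to, but not including, 1.

```shell
# Square root and natural logarithm; a BEGIN-only program needs no input.
awk 'BEGIN { print sqrt(16), log(1) }'
# 4 0

# srand() seeds the generator; rand() returns a number r with 0 <= r < 1.
awk 'BEGIN { srand(); r = rand(); print (r >= 0 && r < 1) }'
# 1

# NR is the current line number, NF the number of fields on that line.
printf 'Kathy 4.00 10\nBeth 4.00\n' | awk '{ print NR, NF }'
# 1 3
# 2 2
```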
1.6 Control-Flow Statements
(Note: These constructs are available in gawk and not awk at ISU.)
If-Else Statement
$2 > 6 {n = n+1; pay = pay + $2 * $3}
END {if (n > 0)
         print n, "employees, total pay is", pay,
                  "average pay is", pay/n
     else
         print "no employees are paid more than $6/hour"
}
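Run on the sample data, this program takes the else branch, since nobody earns more than $6 per hour. The if test is what prevents a division by zero in pay/n. The sketch below assumes a hypothetical file name, paid6.awk, and calls awk; substitute gawk if the local awk lacks these statements.

```shell
# Create the sample data file.
cat > emp.data <<'EOF'
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
EOF

cat > paid6.awk <<'EOF'
$2 > 6 { n = n + 1; pay = pay + $2 * $3 }
END    { if (n > 0)
             print n, "employees, total pay is", pay,
                      "average pay is", pay / n
         else
             print "no employees are paid more than $6/hour"
       }
EOF

awk -f paid6.awk emp.data
# no employees are paid more than $6/hour
```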
While Statement
Task: Show how the value of an amount of money invested at a particular interest rate grows over a number of years, using the formula
value = amount (1 + rate)^years
# interest1 - compute compound interest
#   input: amount rate years
#   output: compounded value at the end of each year
{ i = 1
  while (i <= $3) {
      printf("\t%.2f\n", $1 * (1 + $2) ^ i)
      i = i + 1
  }
}
Try
gawk -f interest1
1000 .06 5
1000 .12 5
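As a check, the sketch below saves the program as interest1 and pipes in the first input line instead of typing it. It calls awk; substitute gawk where the local awk lacks the while statement.

```shell
# Save the program; each input line holds: amount rate years.
cat > interest1 <<'EOF'
# interest1 - compute compound interest
{ i = 1
  while (i <= $3) {
      printf("\t%.2f\n", $1 * (1 + $2) ^ i)
      i = i + 1
  }
}
EOF

# 1000 invested at 6% for 5 years; each output line starts with a tab.
printf '1000 .06 5\n' | awk -f interest1
#	1060.00
#	1123.60
#	1191.02
#	1262.48
#	1338.23
```

Each output line is the value at the end of year i, that is, 1000 times 1.06 raised to the power i.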
References
Aho, A. V., B. W. Kernighan, and P. J. Weinberger. 1988. The AWK Programming Language.
Addison-Wesley. New York.
Dougherty, D. 1990. Sed and Awk: UNIX Power Tools. O’Reilly and Associates, Inc. California.