Introduction to BASH, AWK, and PERL

April 13, 2015
Victor Anisimov, NCSA
FIU / SSERCA / XSEDE Workshop, Apr 4-5, 2013, Miami, FL
MOTIVATION
• Increase Productivity of Research & Development
Small computational projects take less effort to
implement in a scripting language than in a regular
compiled programming language
Scripts are more portable than binary code
Scripts are easy to maintain
Lab materials: /home/anisimov/labs.tgz on FIU cluster
Important: type "module add make" after logging in to the FIU cluster
BASH, AWK, and PERL
BASH is a Linux shell
AWK is a language for data post-processing
PERL is a versatile programming language
Common feature:
interpreted programming languages
How to decide which one I will need:
project complexity dictates which language to use
Objective of the Course
As of now:
• No prerequisites are necessary
• No change in the way you think
• No need to memorize abstract concepts
At the end of the day:
• You will learn three programming languages
• You will improve your project organization skills
• You will increase your productivity
Every Project Works with Data
• Data generation by computation
• Extraction of data from text files
• Data format conversion
• Data computation
• Data analysis and reporting
• Data archival and retrieval
Scripting languages can handle this work without
turning the data processing into a major programming
project
Projects have Complex Processing Flows
• Input to a program depends on the result of
another program
• The process includes many steps that need to be
automated
• The process is not standard and has to be
created
• The process needs to be optimized
Scripting Languages are perfect for automation of
repetitive processes
Elements of Programming Language
• Data types
• Conditional statements
• Loops
• Functions / procedures
• Input / Output
Our first guide to this virtual world is the BASH shell.
BASH Data Types
• BASH treats all variables as text strings
• Limited support for integer arithmetic
#!/bin/bash
greetings="Hello ${USER}!" # example of string
today=`date` # run a program by enclosing it in grave accents (backticks)
echo "${greetings} Today is ${today}"
N=1; let N=N+2;
echo "Integer math: 1+2=${N}"
R=0.1; R=`echo "$R+1.2" | bc -l`; echo "FP math: 0.1+1.2=${R}"
$ chmod 755 01-hello.sh
$ ./01-hello.sh
Hello victor! Today is Thu Apr 4 13:37:02 EST 2013
Integer math: 1+2=3
FP math: 0.1+1.2=1.3
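A minimal sketch of the same two kinds of arithmetic, assuming awk is available. It uses the $(( )) arithmetic expansion, an alternative to "let" that the slide does not show, and hands floating-point math to awk instead of bc. The function names add_int and add_float are hypothetical.

```shell
#!/bin/bash
# integer math with arithmetic expansion (alternative to "let")
add_int() {
    echo $(( $1 + $2 ))
}

# floating-point math delegated to awk, since BASH has no native support for it
add_float() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a + b }'
}

add_int 1 2          # integer sum
add_float 0.1 1.2    # floating-point sum, printed with one decimal place
```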
BASH Conditional Statements
One more data type: built-in variables
$# - number of arguments; $0 - script name; $1, $2, … - command-line arguments
#!/bin/bash
# supported string comparison conditions: == !=
# supported arithmetic conditions: -eq (==) -ne (!=) -lt (<) -le (<=) -gt (>) -ge (>=)
if [ $# -ne 2 ] ; then
    echo "USAGE: $0 argument1 argument2" ; exit 1
fi
if [ "$1" -gt "$2" ] ; then
    echo "True: $1 -gt $2"
else
    echo "False: $1 -gt $2"
fi
$ ./02-conditions.sh
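The conditions above can be combined with elif to handle more than two outcomes. The sketch below wraps the comparison in a hypothetical function named compare so that it can be tried without a separate script file.

```shell
#!/bin/bash
# compare two integers and report how the first relates to the second
compare() {
    if [ $# -ne 2 ] ; then
        echo "USAGE: compare integer1 integer2" ; return 1
    fi
    if [ "$1" -gt "$2" ] ; then
        echo "greater"
    elif [ "$1" -eq "$2" ] ; then
        echo "equal"
    else
        echo "less"
    fi
}

compare 7 3
compare 3 3
compare 2 9
```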
BASH Loops
• Loop over list
LIST="01 02 03 04 05"
for job in ${LIST} ; do
    echo "job number ${job}"
done
example: 03-loops.sh
• Conditional loop
N=1
while [ ${N} -le 5 ] ; do
    echo ${N}
    let N=N+1
done
• C-style loop
LIMIT=5
for ((a=1; a <= LIMIT ; a++)) ; do
    echo ${a}
done
BASH Procedures / Functions
• Functions hold code that would otherwise be repeated
#!/bin/bash
# declaration of function
filenameGenerator()
{
echo "$1.out"
}
# call the function and supply arguments
filenameGenerator 1
filenameGenerator 2
$ ./04-functions.sh
1.out
2.out
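A function's output can also be captured in a variable with command substitution. The sketch below reuses filenameGenerator from the slide; the loop and the variable name outfile are illustrative additions.

```shell
#!/bin/bash
# same function as in the slide: builds an output file name from its argument
filenameGenerator()
{
    echo "$1.out"
}

# capture the function output in a variable, then use it inside a loop
for job in 01 02 03 ; do
    outfile=$(filenameGenerator "${job}")
    echo "job ${job} writes to ${outfile}"
done
```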
BASH Input / Output
• I/O is extremely simple in BASH
cat file.out             send file content to standard output
mycode.sh | mytool.sh    send output to another program
mycode.sh > /dev/null    get rid of unwanted output
mycode.sh &> log.out &   run in the background; output and errors go to log.out
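The redirections above can be tried safely with a stand-in for mycode.sh. In this sketch the function mycode and the scratch directory are hypothetical, and "> file 2>&1" is used as the portable spelling of "&> file".

```shell
#!/bin/bash
# hypothetical stand-in for mycode.sh: one line to stdout, one to stderr
mycode() {
    echo "result 42"
    echo "warning: demo" >&2
}

tmpdir=$(mktemp -d)                        # scratch directory for the demo
mycode > "${tmpdir}/out.txt" 2> /dev/null  # keep stdout, discard stderr
mycode > "${tmpdir}/log.txt" 2>&1          # stdout and stderr into one file
mycode 2> /dev/null | tr 'a-z' 'A-Z'       # pipe stdout into another program
```

After the run, out.txt holds only the result line, while log.txt holds both the result and the warning.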
Sample BASH Project
• Perform text replacement in a file
05-project.sh
#!/bin/bash
if [ $# -ne 1 ] ; then
    echo "Usage: $0 file.coor"
else
    # create name for output file
    outfile=`echo $1 | sed 's/\.coor$/.pdb/'`
    # replace "HETATM" by "ATOM  " in the text (padded to keep the 6-character record width)
    sed 's/HETATM/ATOM  /' "$1" > "$outfile"
    # count number of processed lines
    wc -l "$outfile"
fi
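As an alternative to calling sed for the file name, the shell's own parameter expansion can strip the extension. This is a sketch, not part of the lab materials; the function name pdb_name and the file name water.coor are hypothetical.

```shell
#!/bin/bash
# derive the output name with parameter expansion instead of sed:
# ${1%.coor} removes a trailing ".coor" from the first argument
pdb_name() {
    echo "${1%.coor}.pdb"
}

pdb_name water.coor
```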
AWK
Developed by Aho, Weinberger, and Kernighan
• Although simple and powerful, BASH code can
quickly become bulky because of its limited
structural constructs
• AWK is designed to simplify data extraction and
post-processing, so it nicely complements
BASH when computational projects become a
little more involved
The Power of AWK in Action
• Compute the sum of numbers in a one-line program
#!/bin/bash
awk 'BEGIN{sum=0} {for (i = 1; i <= NF; i++) sum += $i} END{print sum}'
$ echo "1.2 2.3 3.4" | ./01-sum.sh
6.9
AWK logistics:
• The BEGIN{…} section is executed once at the beginning
• Standard input is processed by the main program body, i.e. by the second {…} block
• NF is a built-in variable equal to the number of fields in the current input line
• $1, $2, … are the individual input fields
• i is the loop index, so we can address each field as $i
• Input fields are processed in the C-style for-loop and their values are summed up
• The END{…} section is executed once at the end of execution
• Variable type is automatically recognized by awk based on operation type
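The three sections can be seen at work in a slightly extended sketch that prints a sum for every input line and a grand total at the end. NR, the built-in record counter, is not mentioned in the slides, and the function name sum_lines is hypothetical.

```shell
#!/bin/bash
# print the sum of each input line, then the grand total
sum_lines() {
    awk 'BEGIN { total = 0 }                          # runs once, before any input
         { line = 0                                   # main body: runs for every line
           for (i = 1; i <= NF; i++) line += $i
           print "line " NR ": " line
           total += line }
         END { print "total: " total }'               # runs once, after all input
}

printf '1 2 3\n10 20\n' | sum_lines
```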
AWK: Input Field Separator (option -F)
• AWK accepts custom field separators
#!/bin/bash
awk -F"$1" '{for (i = 1; i <= NF; i++) print $i}'
Use comma as field separator (the trailing "," is the script argument)
$ echo "1,a,3,b:5" | ./02-inpfields.sh ,
1
a
3
b:5
Challenge: Try using different field separators
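One possible answer to the challenge, as a sketch: a separator longer than one character is treated by awk as a regular expression, so a bracket expression splits on several characters at once. The function name split_fields is hypothetical.

```shell
#!/bin/bash
# split on either comma or colon by passing a bracket expression to -F
split_fields() {
    awk -F'[,:]' '{ for (i = 1; i <= NF; i++) print $i }'
}

echo "1,a,3,b:5" | split_fields
```

With this separator the last field of the earlier example, b:5, is split into b and 5.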
AWK: PDB-to-XYZ Format Conversion
03-convert.sh
Arrays in AWK are super easy !!!
#!/bin/bash
# Convert PDB file to XYZ format
if [ $# -ne 1 ] ; then
    echo "Usage: $0 input.pdb"
else
    cat $1 |
    awk 'BEGIN {n=0}
    { if($1 == "ATOM") {n=n+1; a[n]=$3; x[n]=$5; y[n]=$6; z[n]=$7} }
    END {
        printf "%d\n\n", n;
        for (i=1; i<=n; i++)
            printf "%-5s %7.3f %7.3f %7.3f\n", a[i], x[i],y[i],z[i];
    }'
fi
AWK: Column Block-average
04-blockaverage.sh
#!/bin/bash
# compute block-average for data from loan.out
if [ $# -ne 2 ] ; then
    echo "USAGE: $0 blocksize column" ; exit
fi
cat loan.out | awk -v blocksize=$1 -v column=$2 '
BEGIN{n=0; j=0}
{ if(NF==10) {x[n]=$column; n++} }    # read all data
END{
    nblocks = n / blocksize;
    for(i=0; i<nblocks; i++){         # loop over blocks
        aver=0.0;                     # compute average for each block
        for(nRecs=0; nRecs<blocksize && j<n; nRecs++) { aver += x[j]; j++ }
        printf "%4d %9.3f %d\n", i+1, aver/nRecs, nRecs;
    }
}'
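To experiment with block averaging without loan.out, here is a simplified sketch that reads a single column from standard input and drops the NF==10 filter. The function name block_average and the test data are hypothetical.

```shell
#!/bin/bash
# block-average a single column of numbers read from standard input
block_average() {
    awk -v blocksize="$1" '
    { x[n++] = $1 }                               # read all data into x[0..n-1]
    END {
        for (j = 0; j < n; j += blocksize) {      # loop over blocks
            aver = 0; count = 0
            for (k = j; k < j + blocksize && k < n; k++) { aver += x[k]; count++ }
            printf "%d %.3f %d\n", j / blocksize + 1, aver / count, count
        }
    }'
}

printf '1\n2\n3\n4\n5\n' | block_average 2
```

Each output line holds the block number, the block average, and the number of records in the block; the last block may be shorter than the others.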
AWK: Multiple Input Files
05-nfiles-demo.sh
06-nfiles-full.sh
• Alternative processing of input data from a file
#!/bin/bash
# alternative way of handling input files
inpfile="loan.out"                             # input file to be processed
nlines=`wc -l ${inpfile} | awk '{print $1}'`   # get number of lines
awk -v inpfile=${inpfile} -v size=${nlines} '
BEGIN{
    command = "cat " inpfile;                  # string concatenation
    for(i=0; i<size; i++) {
        command | getline;                     # getting a line from the file
        if(NF==10) print $0;                   # print entire line
    }
}'
AWK: Functions – Return Absolute Value
• Compute absolute value
#!/bin/sh
awk 'function abs(x){return ((x+0.0 < 0.0) ? -x : x)} {print abs($1)}'
$ echo -23.11 | ./07-function.sh
23.11
AWK: Writing to File
• AWK writes to file by using the mechanism of
output redirection
08-file.sh
#!/bin/sh
# redirecting output to a file
if [ $# -ne 1 ] ; then
echo "Usage $0 input.pdb" ; exit
fi
output=`echo $1 | sed 's/\.pdb/\.txt/'`
cat $1 | awk -v fname=${output} '{print $0 > fname}'
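Because the target of the redirection is an ordinary awk expression, the file name can be computed for every line. The sketch below routes each line to a file named after its first field; the sample records and the scratch directory are hypothetical.

```shell
#!/bin/bash
# route each input line to a file named after its first field
tmpdir=$(mktemp -d)                      # scratch directory for the demo
printf 'ATOM 1\nHETATM 2\nATOM 3\n' |
awk -v dir="${tmpdir}" '{ fname = dir "/" $1 ".txt"; print $0 > fname }'
ls "${tmpdir}"                           # one file per distinct first field
```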
Exercise
Write a script to optimize the loan duration
NCSA Loan Simulator (copy left) FIU Workshop 2013, will be our computational kernel
Input:
Starting balance = $ 1000.00
Annual interest  = % 7.20
Minimum payment  = % 1.00
Output:
month: 1  balance: 1006.00  charge: 6.00  payment: 259.00  interest:  6.00
month: 2  balance:  751.48  charge: 4.48  payment: 259.00  interest: 10.48
month: 3  balance:  495.43  charge: 2.95  payment: 259.00  interest: 13.43
month: 4  balance:  237.85  charge: 1.42  payment: 237.85  interest: 14.85
Simulation results:
Borrowed        1000.00
Paid            1014.85 in 4 months
Finance charge    14.85
The program is not flexible enough; so, how to get the
answer we need?
PERL
Practical Extraction and Report Language by Larry Wall
• Full-fledged (interpreted) programming language
• Highly optimized and amazingly fast
• Ideal for data processing and data extraction
• Lots of reusable modules (plug-ins) available for download
• Fast learning curve
• If you know the C language, you already know Perl
PERL: Program Structure
#!/usr/bin/perl -w
# -w enables warnings
my $inpFileName = "";    # string
my $sum = 0.0;           # floating point; the semicolon at the end of a statement is mandatory
if (@ARGV != 1) {        # number of command-line arguments
    printf " USAGE %s loan.out\n", $0; exit    # $0 is the program's own name
}
else {
    $inpFileName = $ARGV[0];    # read 1st command-line argument
    # open file descriptor for reading (<)
    unless (open INP, "<$inpFileName") { die "Error: Cannot open input file $inpFileName" }
    readData();
    close INP;                  # close file descriptor after reading is done
    print "All Done\n";
}
sub readData {
    # do the work here (will be described later)
}
PERL: Pattern Matching
Extracting specific parts from text files is often a non-trivial task
# Patterns
my $ap = "\\S+";                              # Any pattern (any non-space token)
my $lp = "\\w+\\d*";                          # Label (text) pattern
my $ip = "-?\\d+";                            # Integer pattern
my $rp = "-?\\d*\\.?\\d*";                    # Real pattern
my $ep = "[+-]?\\d+\\.?\\d*[DE]?[+-]?\\d*";   # Exponential pattern (scientific format)
mask
\s – whitespace character
\S – non-space
\w – word character (a-zA-Z0-9_)
\W – anything but word character
\d – numeric character (0-9)
\D – anything except numeric
\. – literal dot (an unescaped . matches any character)
multiplier
[+-]? – either + or - or neither
+ – one or more instances
? – optional instance
* – any number of instances, including none
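The integer and exponential patterns can be tried from the shell before using them in a Perl script. The sketch below is an assumption-laden translation, not the workshop's code: it uses grep -E, writes [0-9] in place of \d, anchors each pattern with ^ and $, and requires at least one digit after the optional decimal point. The function name classify is hypothetical.

```shell
#!/bin/bash
# classify a token as integer, real, or other using extended regular expressions
classify() {
    if printf '%s\n' "$1" | grep -Eq '^-?[0-9]+$' ; then
        echo "integer"
    elif printf '%s\n' "$1" | grep -Eq '^[+-]?[0-9]*\.?[0-9]+([DEde][+-]?[0-9]+)?$' ; then
        echo "real"
    else
        echo "other"
    fi
}

classify -42
classify 3.14
classify 1.0E-03
classify abc
```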
PERL: Arrays
@ARGV              # built-in array for command-line arguments
my @array = ();    # array declaration
# accessing array elements
for(my $i=0; $i < $nRecords; $i++) {
    printf "%9.3f \n", $array[$i];
}
# returning and passing arrays
($nRecords, $total) = readData( $ARGV[1], \@array );
sub readData {
    my ($column, $data) = @_;
    $$data[$i] = $substring;    # an array passed this way must be handled as a reference (pointer)
}
Exercise: Data Extraction Project
01-parser.pl
• Use the data from loan.out
• Read a specified column
• Sum up the values
• Extra credit: make sure that the values to be
summed up are real numbers
Useful Internet Resources
• BASH
http://tldp.org/LDP/abs/html/
• AWK
http://www.gnu.org/software/gawk/manual/gawk.html
• PERL
http://www.perl.org
book: Learning Perl, Author: Randal L. Schwartz, O’Reilly
Let us know your opinion
http://www.bitly.com/fiuworkshop
Thank you !!!