April 13, 2015

Introduction to BASH, AWK, and PERL
Victor Anisimov, NCSA
FIU / SSERCA / XSEDE Workshop, Apr 4-5, 2013, Miami, FL

MOTIVATION

• Increase Productivity of Research & Development

Scripting languages take less effort than compiled programming languages when implementing small computational projects
Scripts are more portable than binary code
Scripts are easy to maintain

Lab materials: /home/anisimov/labs.tgz on FIU cluster
Important: type "module add make" after logging in to the FIU cluster

BASH, AWK, and PERL

BASH is a Linux shell
AWK is a language for data post-processing
PERL is a versatile programming language

Common feature: all three are interpreted programming languages
How to decide which one you need: project complexity dictates which language to use

Objective of the Course

As of now:
• No prerequisites are necessary
• No change in the way you think
• No need to memorize abstract concepts

At the end of the day:
• You will learn three programming languages
• You will improve your project organization skills
• You will increase your productivity

Every Project Works with Data

• Data generation by computation
• Extraction of data from text files
• Data format conversion
• Data computation
• Data analysis and reporting
• Data archival and retrieval

Scripting languages can handle this work without turning the data processing into a major programming project

Projects have Complex Processing Flows

• Input to a program depends on the result of another program
• The process includes many steps that need to be automated
• The process is not standard and has to be created
• The process needs to be optimized

Scripting languages are well suited to automating repetitive processes

Elements of Programming Language

• Data types
• Conditional statements
• Loops
• Functions / procedures
• Input / Output

Our first guide to this virtual world is the BASH shell.

BASH Data Types

• BASH treats all variables as text strings
• Limited support for integer arithmetic

    #!/bin/bash
    greetings="Hello ${USER}!"    # example of string
    today=`date`                  # run a program by enclosing it in grave accents
    echo "${greetings} Today is ${today}"
    N=1; let N=N+2; echo "Integer math: 1+2=${N}"
    R=0.1; R=`echo "$R+1.2" | bc -l`; echo "FP math: 0.1+1.2=${R}"

    $ chmod 755 01-hello.sh
    $ ./01-hello.sh
    Hello victor! Today is Thu Apr 4 13:37:02 EST 2013
    Integer math: 1+2=3
    FP math: 0.1+1.2=1.3

BASH Conditional Statements

One more data type: built-in special parameters
$# - number of arguments; $0 - self name; $1, $2, … - command-line arguments

    #!/bin/bash
    # supported string comparison conditions: == !=
    # supported arithmetic conditions: -eq (==) -ne (!=) -lt (<) -le (<=) -gt (>) -ge (>=)
    if [ $# != 2 ] ; then
      echo "USAGE $0 argument1 argument2" ; exit
    fi
    if [ "$1" -gt "$2" ] ; then
      echo "True: $1 -gt $2"
    else
      echo "False: $1 -gt $2"
    fi

    $ ./02-conditions.sh

BASH Loops

example: 03-loops.sh

• Loop over list

    LIST="01 02 03 04 05"
    for job in ${LIST} ; do
      echo "job number ${job}"
    done

• Conditional loop

    N=1
    while [ ${N} -le 5 ] ; do
      echo ${N}
      let N=N+1
    done

• C-style loop

    for ((a=1; a <= LIMIT ; a++))

BASH Procedures / Functions

• Functions hold the repetitive part of the code

    #!/bin/bash
    # declaration of function
    filenameGenerator() {
      echo "$1.out"
    }
    # call the function and supply arguments
    filenameGenerator 1
    filenameGenerator 2

    $ ./04-functions.sh
    1.out
    2.out

BASH Input / Output

• I/O is extremely simple in BASH

    cat file.out               send file content to std output
    mycode.sh | mytool.sh      send output to another program
    mycode.sh > /dev/null      get rid of unwanted output
    mycode.sh &> log.out &     run in the background (detach from terminal), saving all output to log.out
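The four I/O forms above can be tried safely with a small sketch like the one below. It is only an illustration under stated assumptions: the scratch directory and the names workdir, file.out, and log.out are made up for the example, and the bash shorthand &> is written in its longer equivalent form, > file 2>&1, which sends both standard output and standard error to the same file.

```shell
#!/bin/bash
# Sketch of the four I/O forms; all names below are hypothetical.
workdir=$(mktemp -d)                      # scratch directory, removed at the end
printf 'alpha\nbeta\ngamma\n' > "${workdir}/file.out"

# 1. send file content to standard output
cat "${workdir}/file.out"

# 2. send output to another program (here wc counts the lines)
nlines=$(cat "${workdir}/file.out" | wc -l)
echo "lines: ${nlines}"

# 3. get rid of unwanted output
ls "${workdir}" > /dev/null

# 4. run in the background with stdout and stderr saved to a log
#    ("> file 2>&1" has the same effect as the bash shorthand "&> file")
( echo "started"; echo "a warning" >&2 ) > "${workdir}/log.out" 2>&1 &
wait                                      # block until the background job finishes

log_lines=$(wc -l < "${workdir}/log.out")
echo "log lines: ${log_lines}"

rm -rf "${workdir}"                       # clean up the scratch directory
```

The wait call is there only so the log can be inspected right away; in everyday use the trailing & is what lets you keep working in the terminal while the job runs.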
Sample BASH Project

• Perform text replacement in a text file

05-project.sh

    #!/bin/bash
    if [ $# -ne 1 ] ; then
      echo "Usage: $0 file.coor"
    else
      # create name for output file
      outfile=`echo $1 | sed 's/\.coor/\.pdb/'`
      # replace "HETATM" by "ATOM  " in the text
      # (padded to six characters so the fixed-width PDB columns stay aligned)
      cat $1 | sed 's/HETATM/ATOM  /' > $outfile
      # count number of processed lines
      wc -l $outfile
    fi

AWK

Developed by Aho, Weinberger, and Kernighan

• Although simple and powerful, BASH code can quickly become bulky because of its limited structural constructs
• AWK is designed to simplify data extraction and post-processing, so it nicely complements BASH when computational projects become a little more involved

The Power of AWK in Action

• Compute the sum of numbers in one line of code

    #!/bin/bash
    awk 'BEGIN{sum=0} {for (i = 1; i <= NF; i++) sum += $i} END{print sum}'

    $ echo "1.2 2.3 3.4" | ./01-sum.sh
    6.9

AWK logistics:
• Section BEGIN{…} is executed once at the beginning
• Standard input is processed by the main program body, i.e. by the second {…} block, once per input line
• NF is a built-in variable equal to the number of fields in the current input line
• $1, $2, … are the individual input fields
• i is the loop index, so we can address each field as $i
• Input fields are processed in the C-style for-loop and their values are summed up
• Section END{…} is executed once at the end of execution
• Variable type is automatically recognized by awk based on operation type

AWK: Input Field Separator (option -F)

• AWK accepts custom field separators

    #!/bin/bash
    awk -F"$1" '{for (i = 1; i <= NF; i++) print $i}'

Use comma as field separator (the trailing , on the command line is the comma character passed to the script):

    $ echo "1,a,3,b:5" | ./02-inpfields.sh ,
    1
    a
    3
    b:5

Challenge: Try using different field separators

AWK: PDB-to-XYZ Format Conversion

03-convert.sh

Arrays in AWK are super easy !!!
    #!/bin/bash
    # Convert PDB file to XYZ format
    if [ $# -ne 1 ] ; then
      echo "Usage: $0 input.pdb"
    else
      cat $1 | awk 'BEGIN {n=0}
      { if($1 == "ATOM") {n=n+1; a[n]=$3; x[n]=$5; y[n]=$6; z[n]=$7} }
      END {
        printf "%d\n\n", n;
        for (i=1; i<=n; i++)
          printf "%-5s %7.3f %7.3f %7.3f\n", a[i], x[i],y[i],z[i];
      }'
    fi

AWK: Column Block-average

04-blockaverage.sh

    #!/bin/bash
    # compute block-average for data from loan.out
    if [ $# -ne 2 ] ; then
      echo "USAGE: $0 blocksize column" ; exit
    fi
    cat loan.out | awk -v blocksize=$1 -v column=$2 '
    BEGIN{n=0; j=0}
    { if(NF==10) {x[n]=$column; n++} }      # read all data
    END{
      nblocks = n / blocksize;
      for(i=0; i<nblocks; i++){             # loop over blocks
        aver=0.0;                           # compute average for each block
        for(nRecs=0; nRecs<blocksize && j<n; nRecs++) {
          aver += x[j]; j++
        }
        printf "%4d %9.3f %d\n", i+1, aver/nRecs, nRecs;
      }
    }'

AWK: Multiple Input Files

05-nfiles-demo.sh
06-nfiles-full.sh

• Alternative processing of input data from a file

    #!/bin/bash
    # alternative way of handling input files
    inpfile="loan.out"                              # input file to be processed
    nlines=`wc -l ${inpfile} | awk '{print $1}'`    # get number of lines
    awk -v inpfile=${inpfile} -v size=${nlines} '
    BEGIN{
      command = "cat " inpfile;                     # string concatenation
      for(i=0; i<size; i++) {
        command | getline;                          # getting a line from the file
        if(NF==10) print $0;                        # print entire line
      }
    }'

AWK: Functions - Return Absolute Value

• Compute absolute value

    #!/bin/sh
    awk 'function abs(x){return ((x+0.0 < 0.0) ? -x : x)} {print abs($1)}'

    $ echo -23.11 | ./07-function.sh
    23.11

AWK: Writing to File

• AWK writes to a file by using the mechanism of output redirection

08-file.sh

    #!/bin/sh
    # redirecting output to a file
    if [ $# -ne 1 ] ; then
      echo "Usage $0 input.pdb" ; exit
    fi
    output=`echo $1 | sed 's/\.pdb/\.txt/'`
    cat $1 | awk -v fname=${output} '{print $0 > fname}'

Exercise

Write a script to optimize the loan duration

NCSA Loan Simulator (copy left) FIU Workshop 2013 will be our computational kernel

Input:
    Starting balance = $ 1000.00
    Annual interest  = %    7.20
    Minimum payment  = %    1.00

Output:
    month: 1 balance: 1006.00 charge: 6.00 payment: 259.00 interest:  6.00
    month: 2 balance:  751.48 charge: 4.48 payment: 259.00 interest: 10.48
    month: 3 balance:  495.43 charge: 2.95 payment: 259.00 interest: 13.43
    month: 4 balance:  237.85 charge: 1.42 payment: 237.85 interest: 14.85

    Simulation results:
    Borrowed 1000.00
    Paid 1014.85 in 4 months
    Finance charge 14.85

The program is not flexible enough; so, how do we get the answer we need?

PERL

Practical Extraction and Report Language by Larry Wall

• Full-fledged (interpreted) programming language
• Highly optimized and amazingly fast
• Ideal for data processing and data extraction
• Lots of reusable plug-ins available for download
• Fast learning curve
• If you know the C language, you already know Perl

PERL: Program Structure

    #!/usr/bin/perl -w
    # the -w switch enables warnings
    # every statement ends with a mandatory semicolon
    my $inpFileName = "";     # string
    my $sum = 0.0;            # floating point

    if (@ARGV != 1) {         # number of command-line arguments
      printf " USAGE %s loan.out\n", $0;       # $0 is self program name
      exit;
    }
    else {
      $inpFileName = $ARGV[0];                 # read 1st command-line argument
      unless (open INP, "<$inpFileName") {     # open file descriptor for reading
        die "Error: Cannot open input file $inpFileName";
      }
      readData();
      close INP;              # close file descriptor after reading is done
      print "All Done\n";
    }

    sub readData {
      # do the work here (will be described later)
    }

PERL: Pattern Matching

Extracting specific parts from text files is often a non-trivial task

    # Patterns
    my $ap = "\\S+";                                # Any pattern
    my $lp = "\\w+\\d*";                            # Label (text) pattern
    my $ip = "-?\\d+";                              # Integer pattern
    my $rp = "-?\\d*\\.?\\d*";                      # Real pattern
    my $ep = "[+-]?\\d+\\.?\\d*[DE]?[+-]?\\d*";     # Exponential pattern (scientific format)

mask
    \s - space
    \S - non-space
    \w - word character (a-zA-Z0-9_)
    \W - anything but word character
    \d - numeric character (0-9)
    \D - anything except numeric
    .  - any character
    \. - a literal period

multiplier
    [+-]? - either + or - or neither (inside brackets no | is needed; [+|-] would also match a literal |)
    +     - one or more same instances
    ?     - optional instance
    *     - any number of same instances

PERL: Arrays

    @ARGV                # built-in array for command-line arguments
    my @array = ();      # array declaration

    # accessing array elements
    for(my $i=0; $i < $nRecords; $i++) {
      printf "%9.3f \n", $array[$i];
    }

    # returning and passing arrays
    ($nRecords, $total) = readData( $ARGV[1], \@array );

    sub readData {
      my ($column, $data) = @_;
      $$data[$i] = $substring;   # an array passed this way is a reference and must be dereferenced
    }

Exercise: Data Extraction Project

01-parser.pl

• Use the data from loan.out
• Read a specified column
• Sum up the values
• Extra credit: make sure that the values to be summed up have type real

Useful Internet Resources

• BASH http://tldp.org/LDP/abs/html/
• AWK http://www.gnu.org/software/gawk/manual/gawk.html
• PERL http://www.perl.org
  book: Learning Perl, Author: Randal L. Schwartz, O'Reilly

Let us know your opinion
http://www.bitly.com/fiuworkshop

Thank you !!!