Using Unix Shell Scripts to Manage Large Data

What is a Unix shell script?
• A collection of Unix commands may be stored in a file, and csh/bash can be invoked to execute the commands in that file.
• Like other programming languages, it has variables and flow control statements, e.g.,
  – if-then-else
  – while
  – for
  – goto (csh only)
• You can run any shell simply by typing its name.

Useful Unix commands
• grep: globally searches for regular expressions in files and prints all lines that contain the expression
• cut: selects fields or characters from each line of a file
• head/tail: print the first/last n lines of a file
• wc: counts the characters/words/lines of a file
• split: reads a file and writes it in n-line pieces into a set of output files
• cat/paste: join files by rows (cat) or by columns (paste)
• join: merges two files on a common field
• awk: a POWERFUL pattern scanning and processing language
Use “man command_name” to see the help file.

Motivating example
• Genome-wide DNA methylation data
  – ~3000 samples (rows)
  – ~485,000 sites (columns)
  – Data came in batches (~300 samples per file, ~1 GB each)
  – For our analysis, we would like to:
    • Pool all samples together
    • but split into ~50,000 sites per file
  – Load to R?
    That will take ~14 GB of memory, and R takes hours to read each file
  – Using csh scripts, it only takes ~10 minutes

csh script: pool samples

    #!/bin/csh
    cd /dir
    rm -f cpg.txt
    # the first batch is copied whole, so its header row is kept
    cp -f All_Beta_Values1.txt cpg.txt
    foreach m (`seq 2 9`)
        # count the number of samples (lines minus the header row)
        @ l = `wc -l < All_Beta_Values${m}.txt` - 1
        echo "file = ${m}, nrow = $l"
        # append everything except the header row
        tail -n +2 All_Beta_Values${m}.txt >> cpg.txt
    end

csh script: split by sites

    #!/bin/csh
    cd /dir
    # reads cpg.txt.gz, i.e., this assumes the pooled cpg.txt has been gzip-compressed first
    foreach n (`seq 1 9`)
        rm -f beta2950_${n}of10.txt
        # first site column of chunk n (column 1 is always kept)
        @ l = ($n - 1) * 50000 + 2
        # last site column of chunk n
        @ r = $n * 50000 + 1
        zcat cpg.txt.gz | cut -f 1,${l}-${r} > beta2950_${n}of10.txt
    end
    # the last chunk takes all remaining columns
    zcat cpg.txt.gz | cut -f 1,450002- > beta2950_10of10.txt

Some tips
• To check whether a data file contains a header, and whether it is tab- or comma-delimited:
    > head -n 1 filename
• To check selected variables/columns (e.g., to see how missing values were coded), where m and n are column numbers (cut splits on tabs by default; add -d ',' for comma-delimited files):
    > head -n 10 filename | cut -f m,n
• To get a subset of samples by matching IDs (each line of ID.txt is treated as a pattern; add -F -w to match IDs as fixed whole words):
    > grep -f ID.txt filename
• To find a certain column:
    > zcat filename.txt.gz | head -n 1 | awk '/variable_name/{for(i=1;i<=NF;++i)if($i~/variable_name/)print NR,i,$i}'

Using scripts to generate scripts

    #!/bin/bash -l
    #PBS -l walltime=16:00:00,pmem=2800mb,nodes=13:ppn=8
    #PBS -m abe
    scripts=/path
    proc=0
    # 13 nodes x 8 processors per node = 104 jobs, numbered 0..103
    for i in $(seq 0 12)
    do
        for j in $(seq 1 8)
        do
            job=$((i*8 + j - 1))
            # write a small wrapper script for this job
            # (single quotes keep the "!" in the shebang literal)
            echo '#!/bin/bash -l' > "$scripts/sim$job.sh"
            echo "cd $scripts" >> "$scripts/sim$job.sh"
            echo "module load R" >> "$scripts/sim$job.sh"
            echo "R CMD BATCH --no-save --no-restore '--args job=$job' /path/assoc.R /path/log/sim$job.txt" >> "$scripts/sim$job.sh"
            chmod 770 "$scripts/sim$job.sh"
            # launch the wrapper in the background on allocated node index $proc
            pbsdsh -n $proc "$scripts/sim$job.sh" &
            proc=$((proc + 1))
        done
    done
    # wait for all background jobs to finish
    wait
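
Trying the pool-and-split workflow on toy data

The pooling and splitting steps from the motivating example can be tried end to end on a small made-up data set before running them on the real files. The following is a minimal bash sketch under stated assumptions, not the scripts used in the analysis: the file names (batch1.txt, pooled.txt, chunk_1of3.txt), the toy dimensions (3 batches of 2 samples, 6 sites), and the chunk size of 2 are all hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Toy version of the pool-then-split workflow.
# All file names and sizes below are hypothetical, for illustration only.
set -e

workdir=$(mktemp -d)
cd "$workdir"

# Create three small tab-delimited batch files: one header row plus two
# samples each, with an ID column followed by six "site" columns.
for b in 1 2 3; do
  {
    printf 'ID\ts1\ts2\ts3\ts4\ts5\ts6\n'
    for s in 1 2; do
      printf 'b%s_s%s\t0.1\t0.2\t0.3\t0.4\t0.5\t0.6\n' "$b" "$s"
    done
  } > "batch${b}.txt"
done

# Pool: copy the first batch whole (keeps its header), then append the
# remaining batches without their header rows.
cp batch1.txt pooled.txt
for b in 2 3; do
  tail -n +2 "batch${b}.txt" >> pooled.txt
done

# Split by columns: every output file gets column 1 (the ID) plus a
# block of `chunk` site columns.
chunk=2
nsites=6
nchunks=$(( (nsites + chunk - 1) / chunk ))
for n in $(seq 1 "$nchunks"); do
  left=$(( (n - 1) * chunk + 2 ))   # first site column of chunk n
  right=$(( n * chunk + 1 ))        # last site column of chunk n
  cut -f "1,${left}-${right}" pooled.txt > "chunk_${n}of${nchunks}.txt"
done

# Quick checks, in the spirit of the tips above.
wc -l < pooled.txt            # 7 lines: 1 header + 6 samples
head -n 1 chunk_2of3.txt      # header of the second chunk: ID, s3, s4
```

The column arithmetic is the same as in the csh script (start = (n - 1) * chunk + 2, end = n * chunk + 1), so changing chunk and nsites to the real values is a cheap way to sanity-check the boundaries before processing the full file.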