Using Unix Shell Scripts

Using Unix Shell Scripts to Manage Large Data

What is a Unix shell script?

• A collection of Unix commands may be stored in a file, and csh/bash can be invoked to execute the commands in that file.
• Like other programming languages, it has variables and flow control statements, e.g.,
  – if-then-else
  – while
  – for
  – goto
• You can run any shell simply by typing its name.

Useful Unix commands

• grep: globally searches for regular expressions in files and prints all lines that contain the expression
• cut: selects fields or characters from each line of a file
• head/tail: print the first/last n lines of a file
• wc: counts the characters/words/lines of a file
• split: reads a file and writes it in n-line pieces into a set of output files
• cat/paste: join files by rows or columns
• join: merges two files on a common field
• awk: a POWERFUL pattern scanning and processing language

Use "man command_name" to see the help file.

Motivating example

• Genome-wide DNA methylation data
  – ~3000 samples (rows)
  – ~485,000 sites (columns)
  – Data came in batches (~300 samples per file, ~1 GB each)
  – For our analysis, we would like to:
      • Pool all samples together
      • but split into ~50,000 sites per file
  – Load into R? That will take ~14 GB of memory, and R takes hours to read each file
  – Using csh scripts, it takes only ~10 minutes

csh script: pool samples

    #!/bin/csh
    cd /dir
    rm -f cpg.txt
    # start from the first batch, keeping its header line
    cp -f All_Beta_Values1.txt cpg.txt
    foreach m (`seq 2 9`)
        # count number of samples (all lines except the header)
        @ l = `wc -l All_Beta_Values${m}.txt | cut -f 1 -d " "` - 1
        echo "file = ${m}, nrow = $l"
        rm -f test.txt
        # remove the header
        tail -n $l All_Beta_Values${m}.txt > test.txt
        cat test.txt >> cpg.txt
    end

csh script: split by sites

    #!/bin/csh
    cd /dir
    foreach n (`seq 1 9`)
        rm -f beta2950_${n}of10.txt
        # start: first column of block n
        @ l = ($n - 1) * 50000 + 2
        # end: last column of block n
        @ r = $n * 50000 + 1
        # keep column 1 plus the columns of this block
        zcat cpg.txt.gz | cut -f 1,$l-$r > beta2950_${n}of10.txt
    end
    # the last file takes all remaining columns
    zcat cpg.txt.gz | cut -f 1,450002- > beta2950_10of10.txt

Some tips

• To check whether a data file contains a header, and whether it is tab- or comma-delimited:
    > head -n 1 filename
• To check a selected variable/column (e.g., to see how missing values were coded):
    > head -n 10 filename | cut -f #,#
• To get a subset of samples by matching ID:
    > grep -f ID.txt filename
• To find a certain column:
    > zcat filename.txt.gz | head -n 1 | awk '/variable_name/{for(i=1;i<=NF;++i)if($i~/variable_name/)print NR,i,$i}'

Using scripts to generate scripts

    #!/bin/bash -l
    #PBS -l walltime=16:00:00,pmem=2800mb,nodes=13:ppn=8
    #PBS -m abe
    proc=0
    for i in `seq 0 12`
    do
        for j in `seq 1 8`
        do
            job=$(($i*8+$j-1))
            scripts=/path
            # write one small job script per simulation
            echo "#!/bin/bash -l" >$scripts/sim$job.sh
            echo "cd $scripts">>$scripts/sim$job.sh
            echo "module load R" >>$scripts/sim$job.sh
            echo "R CMD BATCH --no-save --no-restore '--args job=$job' /path/assoc.R /path/log/sim$job.txt" >> $scripts/sim$job.sh
            chmod 770 $scripts/sim$job.sh
            # launch the job script in the background on processor $proc
            pbsdsh -n $proc $scripts/sim$job.sh &
            proc=$(($proc+1))
        done
    done
    # wait for all background jobs to finish
    wait
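
The pool-and-split steps in the two csh scripts can be tried on a small example before running them on the real data. The following is a minimal sketch in bash rather than csh, not the original scripts. The file names (batch1.txt, pooled.txt, part1of2.txt), the toy values, and the block size of 2 sites are all made up for illustration, and it drops the header with tail -n +2 instead of counting lines with wc.

```shell
#!/bin/bash
# Toy version of the pool-then-split workflow.
# All file names and values here are hypothetical.

workdir=$(mktemp -d)   # scratch directory; delete it when finished

# Three small tab-delimited "batches": a header line plus 2 samples,
# each with 4 sites.
for m in 1 2 3; do
  {
    printf 'ID\tcg1\tcg2\tcg3\tcg4\n'
    printf 's%sa\t0.1\t0.2\t0.3\t0.4\n' "$m"
    printf 's%sb\t0.5\t0.6\t0.7\t0.8\n' "$m"
  } > "$workdir/batch$m.txt"
done

# Pool: keep the header of the first batch, drop it from the others.
cp "$workdir/batch1.txt" "$workdir/pooled.txt"
for m in 2 3; do
  tail -n +2 "$workdir/batch$m.txt" >> "$workdir/pooled.txt"
done

# Split by sites: column 1 plus a block of site columns per output file.
block=2
nsites=4
nfiles=$(( nsites / block ))
for n in $(seq 1 "$nfiles"); do
  l=$(( (n - 1) * block + 2 ))   # first site column of block n
  r=$(( n * block + 1 ))         # last site column of block n
  cut -f "1,$l-$r" "$workdir/pooled.txt" > "$workdir/part${n}of${nfiles}.txt"
done

wc -l < "$workdir/pooled.txt"        # line count: header + 6 samples
head -n 1 "$workdir/part2of2.txt"    # header line of the second block
```

On the real data the same arithmetic applies with a block of 50,000 columns: block n covers columns (n - 1) * 50000 + 2 through n * 50000 + 1, as in the csh script.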
