Using Unix Shell Scripts
to Manage Large Data
What is a Unix shell script?
• A collection of unix commands may be stored in a
file, and csh/bash can be invoked to execute the
commands in that file.
• Like other programming languages, it has
variables and flow control statements, e.g.,
– if-then-else;
– while;
– for;
– goto.
• You can start any shell simply by typing its name (e.g., csh or bash).
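The variables and flow control statements listed above can be illustrated with a minimal bash sketch. The variable names (n, total, verdict, countdown) are made up for this example, and the csh syntax differs (for instance, csh assigns with "set n = 3" and loops with foreach).

```shell
#!/bin/bash
# Variables: bash allows no spaces around "=" in an assignment.
n=3
total=0

# for loop: sum the integers 1..n
for i in $(seq 1 "$n"); do
    total=$((total + i))
done

# if-then-else: branch on the result
if [ "$total" -gt 5 ]; then
    verdict="large"
else
    verdict="small"
fi

# while loop: count down from n, collecting the values in a string
countdown=""
k=$n
while [ "$k" -gt 0 ]; do
    countdown="$countdown$k "
    k=$((k - 1))
done

echo "total=$total verdict=$verdict countdown=$countdown"
```

Saving this as a file and running "bash filename" executes the commands in order, exactly as if they had been typed at the prompt.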
Useful Unix commands
• grep: globally searches for regular expressions in files and prints all
lines that contain the expression
• cut: selects fields or characters from each line of a file
• head/tail: print the first/last n lines of a file
• wc: counts the characters/words/lines in a file
• split: reads a file and writes it in n-line pieces into a set of output
files
• cat/paste: join files by rows or columns
• join: merge two files by a common field
• awk: a POWERFUL pattern scanning and processing language
Use “man command_name” to see the help file
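As a rough illustration of how these commands combine, the sketch below builds a tiny tab-delimited file and queries it. The file name toy.txt, its column names, and its values are all invented for the example.

```shell
#!/bin/bash
# Build a toy tab-delimited file in a scratch directory (hypothetical data).
tmp=$(mktemp -d)
printf 'ID\tsite1\tsite2\n' >  "$tmp/toy.txt"
printf 's1\t0.10\t0.90\n'   >> "$tmp/toy.txt"
printf 's2\t0.20\t0.80\n'   >> "$tmp/toy.txt"
printf 's3\t0.30\t0.70\n'   >> "$tmp/toy.txt"

# wc: number of lines; reading from stdin keeps the file name out of the output
nlines=$(wc -l < "$tmp/toy.txt" | tr -d ' ')

# head + cut: names of columns 1 and 3
header=$(head -n 1 "$tmp/toy.txt" | cut -f 1,3)

# grep: the line that starts with "s2"
row=$(grep '^s2' "$tmp/toy.txt")

# tail + cut: second field of the last line
last=$(tail -n 1 "$tmp/toy.txt" | cut -f 2)

# awk: mean of column 2, skipping the header line
mean=$(awk -F '\t' 'NR > 1 { s += $2; n++ } END { printf "%.2f", s / n }' "$tmp/toy.txt")

echo "lines=$nlines last=$last mean=$mean"
rm -rf "$tmp"
```

Note that cut splits on tabs by default, which is why no -d option is needed for tab-delimited data.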
Motivating example
• Genome-wide DNA methylation data
– ~3000 samples (rows)
– ~485,000 sites (columns)
– Data came in batches (~300 samples per file, ~1 GB
each)
– For our analysis, we would like to:
• Pool all samples together
• Split the pooled data into ~50,000 sites per file
– Load into R? That would take ~14 GB of memory, and R takes hours
to read each file
– Using csh scripts, it takes only ~10 minutes
csh script: pool samples
#!/bin/csh
cd /dir
rm -f cpg.txt
# start from the first batch, keeping its header line
cp -f All_Beta_Values1.txt cpg.txt
foreach m (`seq 2 9`)
    # count the number of samples (total lines minus the header line);
    # reading from stdin keeps the file name out of the wc output
    @ l = `wc -l < All_Beta_Values${m}.txt` - 1
    echo "file = ${m}, nrow = $l"
    # drop the header and append the remaining rows
    tail -n $l All_Beta_Values${m}.txt >> cpg.txt
end
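The same pooling idea can be sketched in bash on toy data. This is an illustration under assumptions, not the script used for the analysis: the file names batch1.txt to batch3.txt and pooled.txt are hypothetical, and tail -n +2 (print from line 2 onward) replaces the explicit line count.

```shell
#!/bin/bash
# Toy stand-ins for the batch files (hypothetical names and values).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'ID\tcg1\tcg2\ns1\t0.1\t0.2\ns2\t0.3\t0.4\n' > batch1.txt
printf 'ID\tcg1\tcg2\ns3\t0.5\t0.6\n'               > batch2.txt
printf 'ID\tcg1\tcg2\ns4\t0.7\t0.8\ns5\t0.9\t1.0\n' > batch3.txt

# The first batch is copied whole, so the pooled file keeps one header.
cp -f batch1.txt pooled.txt

# Later batches are appended from line 2 onward, which drops their header.
for m in 2 3; do
    tail -n +2 "batch${m}.txt" >> pooled.txt
done

pooled_rows=$(wc -l < pooled.txt | tr -d ' ')
echo "pooled.txt has $pooled_rows lines (1 header + 5 samples)"
```

Because only whole lines are streamed from one file to another, nothing has to be parsed or held in memory, which is consistent with the short run time quoted in the motivating example.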
csh script: split by sites
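#!/bin/csh
cd /dir
# files 1-9: 50,000 sites each, plus the ID column (column 1)
foreach n (`seq 1 9`)
    rm -f beta2950_${n}of10.txt
    # first site column of this chunk
    @ l = ($n - 1) * 50000 + 2
    # last site column of this chunk
    @ r = $n * 50000 + 1
    zcat cpg.txt.gz | cut -f 1,${l}-${r} > beta2950_${n}of10.txt
end
# file 10: all remaining site columns
zcat cpg.txt.gz | cut -f 1,450002- > beta2950_10of10.txt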
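The column arithmetic is easier to check on a small case. The bash sketch below applies the same start/end formulas with a chunk size of 2 instead of 50,000; the file names (pooled.txt, part_NofM.txt) and the data are hypothetical, and the number of output files is computed from the header rather than hard-coded.

```shell
#!/bin/bash
# Toy pooled file: one ID column plus 5 site columns (hypothetical values).
tmp=$(mktemp -d)
cd "$tmp" || exit 1
printf 'ID\tc1\tc2\tc3\tc4\tc5\n' >  pooled.txt
printf 's1\t11\t12\t13\t14\t15\n' >> pooled.txt
printf 's2\t21\t22\t23\t24\t25\n' >> pooled.txt

chunk=2                                        # sites per output file
ncol=$(head -n 1 pooled.txt | awk -F '\t' '{ print NF }')
nsites=$((ncol - 1))                           # column 1 holds the sample ID
nfiles=$(( (nsites + chunk - 1) / chunk ))     # ceiling division

n=1
while [ "$n" -le "$nfiles" ]; do
    first=$(( (n - 1) * chunk + 2 ))           # first site column of this chunk
    last=$(( n * chunk + 1 ))                  # last site column of this chunk
    # cut ignores the part of a range past the last field,
    # so the final chunk may simply be shorter
    cut -f "1,${first}-${last}" pooled.txt > "part_${n}of${nfiles}.txt"
    n=$((n + 1))
done

ls part_*.txt
```

With 5 sites and a chunk size of 2, this produces three files holding columns 2-3, 4-5, and 6, each prefixed by the ID column.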
Some tips
• To check whether a data file contains a header, and
whether it is tab- or comma-delimited
> head -n 1 filename
• To check a selected variable/column (e.g., to see how
missing values were coded)
> head -n 10 filename | cut -f #,#
• To get a subset of samples by matching ID
> grep -f ID.txt filename
• To find a certain column
> zcat filename.txt.gz | head -n 1 | awk '/variable_name/{for(i=1;i<=NF;++i)if($i~/variable_name/)print NR,i,$i}'
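These tips can be chained together. The sketch below, on an invented file pheno.txt, looks up a column number by exact header name, extracts that column, counts missing values, and subsets rows by ID. Two choices here are assumptions beyond the slides: the awk test uses == for an exact match instead of a regular expression, and grep is given -w -F so that an ID such as s1 is treated as a whole word and would not also match s10.

```shell
#!/bin/bash
# Toy phenotype file and ID list (hypothetical names and values).
tmp=$(mktemp -d)
printf 'ID\tage\tsex\tbmi\n' >  "$tmp/pheno.txt"
printf 's1\t34\tF\t22.5\n'   >> "$tmp/pheno.txt"
printf 's2\tNA\tM\t27.1\n'   >> "$tmp/pheno.txt"
printf 's10\t51\tF\t30.2\n'  >> "$tmp/pheno.txt"
printf 's1\n' > "$tmp/ID.txt"

# Find the column whose header is exactly "sex".
col=$(head -n 1 "$tmp/pheno.txt" |
      awk -F '\t' -v name="sex" '{ for (i = 1; i <= NF; ++i) if ($i == name) print i }')

# Extract that column without its header.
values=$(cut -f "$col" "$tmp/pheno.txt" | tail -n +2 | tr '\n' ' ')

# Count how many entries of column 2 (age) are coded as NA.
nmiss=$(cut -f 2 "$tmp/pheno.txt" | grep -c '^NA$')

# Subset rows by ID: -F treats the IDs as fixed strings, -w matches whole words.
subset=$(grep -w -F -f "$tmp/ID.txt" "$tmp/pheno.txt")

echo "col=$col values=$values nmiss=$nmiss"
echo "$subset"
rm -rf "$tmp"
```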
Using scripts to generate scripts
#!/bin/bash -l
#PBS -l walltime=16:00:00,pmem=2800mb,nodes=13:ppn=8
#PBS -m abe
scripts=/path
proc=0
for i in `seq 0 12`
do
    for j in `seq 1 8`
    do
        job=$(($i*8+$j-1))
        # write the job script; single quotes keep "!" literal
        echo '#!/bin/bash -l' > $scripts/sim$job.sh
        echo "cd $scripts" >> $scripts/sim$job.sh
        echo "module load R" >> $scripts/sim$job.sh
        echo "R CMD BATCH --no-save --no-restore '--args job=$job' /path/assoc.R /path/log/sim$job.txt" >> $scripts/sim$job.sh
        chmod 770 $scripts/sim$job.sh
        # run the job script on the node with index $proc, in the background
        pbsdsh -n $proc $scripts/sim$job.sh &
        proc=$(($proc+1))
    done
done
# block until every background job has finished
wait
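The generate-then-launch pattern can be tried without a PBS cluster. The sketch below is a simplified stand-in, not the cluster script: it writes four small job scripts with a here-document into a scratch directory and runs them as local background processes with plain bash instead of pbsdsh. The directory, the job count, and the body of each generated script are assumptions made for the example.

```shell
#!/bin/bash
scripts=$(mktemp -d)     # scratch directory standing in for /path
njobs=4                  # the slides generate 13 * 8 = 104 jobs

job=0
while [ "$job" -lt "$njobs" ]; do
    # Unquoted EOF: $job is expanded now, while the script is being generated.
    cat > "$scripts/sim$job.sh" <<EOF
#!/bin/bash
echo "running job $job"
EOF
    chmod 770 "$scripts/sim$job.sh"
    job=$((job + 1))
done

# Run every generated script in the background, one log file per job.
for f in "$scripts"/sim*.sh; do
    bash "$f" > "$f.log" &
done

# Block until all background jobs have finished.
wait

cat "$scripts"/sim*.sh.log
```

A here-document keeps the generated script readable as a block, whereas the repeated echo lines in the slides make each redirection explicit; either approach yields the same files.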