Xtra exercises

Unix for Bioinformatics Environment
Exercise 1
Consider the file /users/l/lsterck/olive_genes.gff. This is a GFF file containing
the annotation for all genes in a genome. Such a file has several columns; the
most important ones are: column 1: sequence name; column 4: start coordinate;
column 5: stop coordinate; column 7: strand of the gene; column 9: gene IDs and other
info. As input for a different software tool we need to transform this into a new format.
What we need are several files, one per sequence (e.g. scaffold1.lst, scaffold2.lst, ...),
each containing a list of the gene IDs and their strand, sorted ascending on start
position. TIP: only the lines denoted as mRNA are needed.
The files should look like this:
$ head scaffold1.lst
Oeu000001.1 +
Oeu000002.1 -
Oeu000004.1 -
Oeu000005.1 +
Oeu000006.2 +
Oeu000007.1 -
Oeu000008.1 -
Oeu000009.1 +
Oeu000010.1 -
Oeu000011.1 +
A possible solution:
grep -P "\tmRNA\t" olive_genes.gff | cut -f1,4,7,9 |
cut -f1 -d ";" | tr -d "ID=" | sort -k1,1 -k2,2n | cut -f1,3,4 |
awk '{print $3,$2 > $1".lst"}'
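The same transformation can also be sketched as a small shell function that does the ID clean-up inside awk. This is only one hedged way to do it, not the course's reference solution: the function name gff_to_lst and its optional output-directory argument are made up for illustration, and it assumes that column 9 starts with ID=<geneID>, optionally followed by a semicolon and further attributes.

```shell
#!/usr/bin/env bash
# Sketch: write one <sequence>.lst file per sequence, listing the gene ID and
# strand of every mRNA line, sorted by ascending start coordinate.
# gff_to_lst is a hypothetical helper name.
# Usage: gff_to_lst <gff file> [output dir]
gff_to_lst() {
    local gff=$1 outdir=${2:-.}
    local tab
    tab=$(printf '\t')
    mkdir -p "$outdir"
    # Keep mRNA lines only, then sort by sequence name and numerically by start.
    awk -F'\t' '$3 == "mRNA"' "$gff" |
        sort -t "$tab" -k1,1 -k4,4n |
        awk -F'\t' -v dir="$outdir" '{
            id = $9
            sub(/;.*/, "", id)     # drop everything after the first attribute
            sub(/^ID=/, "", id)    # strip the literal ID= prefix (assumed to come first)
            out = dir "/" $1 ".lst"
            if (out != prev) {     # input is sorted, so each file is opened only once
                if (prev != "") close(prev)
                prev = out
            }
            print id, $7 > out
        }'
}
```

Stripping the prefix with sub(/^ID=/, ...) removes only the literal string ID=, whereas tr -d "ID=" deletes every I, D and = character anywhere in the line, which would also damage sequence names or gene IDs containing those letters.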
Exercise 2
You need to run a blast analysis for several files in a folder only containing fasta files,
always against the same blast database called fixedBlastDB. Write a shell script that
takes as argument the name of an input fasta file. The script should make a
tmp_<fasta file>/ directory, copy the necessary file to it and then execute the blast
analysis in that directory. Afterwards it should copy the result file back to the main
folder where the input file is located and clean up all the temporary files and folders.
(You can just write blast db fixedBlastDB input fasta input file -out resultFile, the
output file is thus called resultFile).
Now that you have your blast shell script, how would you execute it for each fasta
input file in the folder?
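A minimal sketch of such a script is given below, under a few assumptions that are not in the exercise text: the script would be saved as runBlast.sh (a made-up name), the blast call is a placeholder that uses NCBI BLAST+ blastp, the database is found through the BLAST_DB variable, and the result is copied back with the input name as a prefix so that results of different inputs do not overwrite each other.

```shell
#!/usr/bin/env bash
# Sketch of a per-file blast wrapper. Usage: ./runBlast.sh <input fasta file>
# (runBlast.sh is a hypothetical script name.)

# Assumption: give a full path here if the database is not in the BLAST search path.
BLAST_DB=${BLAST_DB:-fixedBlastDB}

# Placeholder for the real blast call; it must create a file called resultFile
# in the current directory. Swap blastp for blastn etc. as appropriate.
run_blast() {
    blastp -db "$BLAST_DB" -query "$1" -out resultFile
}

blast_in_tmpdir() {
    local fasta=$1
    local main_dir name tmp_dir status=0
    main_dir=$(cd "$(dirname "$fasta")" && pwd) || return 1
    name=$(basename "$fasta")
    tmp_dir="$main_dir/tmp_$name"

    mkdir -p "$tmp_dir" || return 1
    cp "$main_dir/$name" "$tmp_dir/" || return 1
    # Run in a subshell so the caller's working directory is left untouched.
    ( cd "$tmp_dir" && run_blast "$name" ) &&
        cp "$tmp_dir/resultFile" "$main_dir/$name.resultFile" || status=1
    # Clean up the temporary directory whether or not blast succeeded.
    rm -rf "$tmp_dir"
    return "$status"
}

# Run only when a file name was given on the command line.
if [ "$#" -ge 1 ]; then
    blast_in_tmpdir "$1"
fi
```

To execute it for each input file, a simple loop over the folder would do, for example (assuming the inputs end in .fasta, so that result files from earlier runs are not picked up as input):

    for f in *.fasta; do ./runBlast.sh "$f"; done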
Exercise 3
From a tabular output file of an all-vs-all blast analysis, count the number of
eucalyptus genes (EGR) that have non-self hits (= that have hits other than itself).
Use the file /users/l/lsterck/goodProteins.blastp.gz as input.
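One possible approach is sketched below. It assumes the usual tabular blast layout (column 1 is the query ID, column 2 the subject ID) and that eucalyptus gene IDs start with the prefix EGR; both are assumptions about the input file, and count_nonself is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: count the distinct query genes with a given prefix that have at least
# one hit to something other than themselves, in a gzipped tabular blast file.
# count_nonself is a hypothetical helper name.
# Usage: count_nonself <blast file .gz> [ID prefix, default EGR]
count_nonself() {
    local blast_gz=$1 prefix=${2:-EGR}
    gzip -dc "$blast_gz" |
        # keep queries starting with the prefix whose subject is a different ID
        awk -F'\t' -v p="$prefix" 'index($1, p) == 1 && $1 != $2 { print $1 }' |
        sort -u |     # each gene is counted once, however many hits it has
        wc -l
}

# Example call on the exercise file:
# count_nonself /users/l/lsterck/goodProteins.blastp.gz EGR
```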
Exercise 4
You want to download a rather large amount of data from somewhere and the best way
to do this is to use rsync. However, when this takes a long time it might happen that
the connection is lost at some point during the transfer. Since we all have better
things to do than to restart the command every so often, write a little shell script that
checks if the rsync command is still running and if not restarts it.
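The sketch below takes a slightly different route than polling the process list (which could be done with pgrep, for example): it waits for the command to exit and restarts it whenever the exit status is non-zero, which also lets the loop stop once the transfer has completed. The helper name retry_until_done, the retry limit and the delay are assumptions made for illustration, and the rsync source and destination in the example are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: rerun a command until it exits successfully.
# retry_until_done is a hypothetical helper name.
# Usage: retry_until_done <max tries> <seconds between tries> <command> [args...]
retry_until_done() {
    local max_tries=$1 delay=$2 try=1
    shift 2
    until "$@"; do
        if [ "$try" -ge "$max_tries" ]; then
            echo "giving up after $try tries" >&2
            return 1
        fi
        try=$((try + 1))
        sleep "$delay"      # give the connection some time to come back
    done
}

# Example call (source and destination are placeholders):
# retry_until_done 100 60 rsync -a --partial user@remote.host:/path/to/data/ ./data/
```

With --partial, rsync keeps partially transferred files, so a restarted run does not have to begin those files from scratch.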
Exercise 5
Assembling genomes usually requires as input several files with read data in them. For a
specific assembly software we need to pass in several dozen such files. Moreover, the
files need to be supplied in a specific manner.
The command line should look something like this:
abyss-pe -C ABySS_Ppin2 name=ppinV1 k=64 \
lib='C1J25ACXX_1 C1J25ACXX_2 C1J25ACXX_3 C1J25ACXX_4 C1J25ACXX_5 C1J25ACXX_6 C1J25ACXX_7 C1J25ACXX_8' \
C1J25ACXX_1="$inDIR/C1J25ACXX_1_0_1.cleanTrim.fq.gz $inDIR/C1J25ACXX_1_0_2.cleanTrim.fq.gz" C1J25ACXX_2="$
se="$inDIR/C1J25ACXX_1_0.singl.merged.fq.gz $inDIR/C1J25ACXX_2_0.singl.merged.fq.gz $inDIR/C1J25ACXX_3_0.s
Write a script that does this automatically, taking the files in the directory
/users/l/lsterck/Ex5 files as input.
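A sketch of how the argument list could be assembled is shown below. It assumes the file naming visible in the example command (<lib>_0_1.cleanTrim.fq.gz and <lib>_0_2.cleanTrim.fq.gz for the read pairs, <lib>_0.singl.merged.fq.gz for the single reads); the helper name build_abyss_args and the array ABYSS_ARGS are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: assemble the abyss-pe arguments from the files found in a directory.
# build_abyss_args is a hypothetical helper; it fills the global array ABYSS_ARGS.
# Usage: build_abyss_args <directory with read files>
build_abyss_args() {
    local inDIR=$1 f lib
    local libs=() pairs=() singles=()
    # One library per forward-read file; the reverse-read file is assumed to exist.
    for f in "$inDIR"/*_0_1.cleanTrim.fq.gz; do
        [ -e "$f" ] || continue          # the glob matched nothing
        lib=$(basename "$f" _0_1.cleanTrim.fq.gz)
        libs+=("$lib")
        pairs+=("$lib=$inDIR/${lib}_0_1.cleanTrim.fq.gz $inDIR/${lib}_0_2.cleanTrim.fq.gz")
    done
    # The single (unpaired) reads all go into one se= argument.
    for f in "$inDIR"/*_0.singl.merged.fq.gz; do
        [ -e "$f" ] || continue
        singles+=("$f")
    done
    # Each array element becomes exactly one command-line argument.
    ABYSS_ARGS=(-C ABySS_Ppin2 name=ppinV1 k=64
                "lib=${libs[*]}" "${pairs[@]}" "se=${singles[*]}")
}

# Example (inDIR is the directory holding the read files):
# build_abyss_args "$inDIR" && abyss-pe "${ABYSS_ARGS[@]}"
```

Keeping the arguments in an array avoids quoting problems: values such as the lib= list contain spaces but still have to reach abyss-pe as a single argument.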
Exercise 6
Typically, when you have ordered sequencing data you will receive it as one big tar
file. Nowadays it is also often the case that any sequencing experiment you order will
be run on several lanes of a sequencing machine. As such, the files you get back from
a sequencing provider will have an indication of this in the filename (e.g.
4C1_S30_L001_R1_001.fq.gz & 4C1_S30_L002_R1_001.fq.gz, where the "_L00X_" part
is the lane indicator). Often those files can also be grouped in their own folder. As
those are in theory technical replicates of the run, they should be merged together to
have all the data of one biological sample in a single file to be analysed later on.
Consider the file /users/l/lsterck/20220429 NovaSeq FCB-RawDataSUB-4152.tar.gz;
this is one such file you will typically get from a sequencing provider. "Download"
(== copy in this case) the file to one of your local folders and extract it. Write a
script or a command line to process the files from that archive as described above (==
merge all files from a single sample into one file per sample). Also provided is the md5
key of that file. Use that to check whether your "download" was successful.
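The merging step could be sketched as follows. The sketch assumes that the lane is encoded as _L001_, _L002_, ... in otherwise identical file names ending in .fq.gz, and it relies on the fact that gzip files can simply be concatenated; merge_lanes is a made-up helper name.

```shell
#!/usr/bin/env bash
# Sketch: merge the per-lane files of each sample into one file per sample.
# merge_lanes is a hypothetical helper name.
# Usage: merge_lanes <dir with extracted archive> <output dir>
# Note: start with an empty output dir, since files are appended to.
merge_lanes() {
    local indir=$1 outdir=$2 f base merged
    mkdir -p "$outdir"
    # Lane files may sit in sub-folders, so search recursively; sorting keeps
    # the order in which files are appended reproducible.
    find "$indir" -type f -name '*_L[0-9][0-9][0-9]_*.fq.gz' | sort |
        while IFS= read -r f; do
            base=$(basename "$f")
            # Drop the lane indicator: 4C1_S30_L001_R1_001.fq.gz -> 4C1_S30_R1_001.fq.gz
            merged=$(printf '%s\n' "$base" | sed -E 's/_L[0-9]{3}_/_/')
            # Concatenated gzip files form a valid gzip file, no need to decompress.
            cat "$f" >> "$outdir/$merged"
        done
}
```

For the download check, running md5sum on the copied archive and comparing the output with the provided key is enough; if the key comes as a checksum file in md5sum format, md5sum -c <that file> does the comparison for you.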
Exercise 7
Write a script to create an expression matrix of several mapping result files.