Xtra exercises

Unix for Bioinformatics Environment

Exercise 1

Consider the file /users/l/lsterck/olive_genes.gff. This is a GFF file containing the annotation for all genes in a genome. Such a file has several columns; the most important ones are:

column 1: sequence name
column 4: start coordinate
column 5: stop coordinate
column 7: strand of the gene
column 9: gene IDs and other info

As input for a different piece of software we need to transform this into a new format. What we need are several files, one per sequence (e.g. scaffold1.lst, scaffold2.lst, ...), each containing a list of the gene IDs and their strand, sorted in ascending order of start position.

TIP: only the lines denoted as mRNA are needed.

The files should look like this:

    $ head scaffold1.lst
    Oeu000001.1+
    Oeu000002.1-
    Oeu000004.1-
    Oeu000005.1+
    Oeu000006.2+
    Oeu000007.1-
    Oeu000008.1-
    Oeu000009.1+
    Oeu000010.1-
    Oeu000011.1+

One possible pipeline:

    grep -P "\tmRNA\t" olive_genes.gff | cut -f1,4,7,9 | cut -f1 -d ";" | sed 's/ID=//' | sort -k1,1 -k2,2n | cut -f1,3,4 | awk '{print $3,$2 > ($1 ".lst")}'

Exercise 2

You need to run a BLAST analysis for several files in a folder that contains only FASTA files, always against the same BLAST database, called fixedBlastDB. Write a shell script that takes the name of an input FASTA file as its argument. The script should make a tmp_<fasta_file>/ directory, copy the necessary file to it and then execute the BLAST analysis in that directory. Afterwards it should copy the result file back to the main folder where the input file is located and clean up all the temporary files and folders. For the BLAST step you can just write

    blast -db fixedBlastDB -input fasta_input_file -out resultFile

so the output file is called resultFile.

Now that you have your BLAST shell script, how would you execute it for each FASTA input file in the folder?

Exercise 3

From the tabular output file of an all-vs-all BLAST analysis, count the number of eucalyptus genes (EGR) that have non-self hits (i.e. hits to a gene other than themselves).
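A minimal sketch of one way to approach this count, not a reference solution. It assumes the usual BLAST tabular layout in which column 1 is the query and column 2 is the subject, and it assumes that eucalyptus gene IDs start with the string EGR. The demo rows and IDs below are made up for illustration.

```shell
# Build a tiny, made-up tabular file (query, subject, % identity) for demonstration.
demo=$(mktemp)
printf 'EGR_001\tEGR_001\t100\n' >  "$demo"   # self hit only so far
printf 'EGR_001\tEGR_002\t85\n'  >> "$demo"   # non-self hit for EGR_001
printf 'EGR_002\tEGR_002\t100\n' >> "$demo"   # EGR_002 only hits itself
printf 'EGR_003\tOeu_010\t70\n'  >> "$demo"   # non-self hit for EGR_003
printf 'Oeu_010\tEGR_003\t70\n'  >> "$demo"   # query is not a eucalyptus gene

# Keep rows whose query starts with EGR (assumed prefix) and whose subject
# differs from the query, then count every such query exactly once.
count=$(awk -F'\t' '
    $1 ~ /^EGR/ && $1 != $2 { seen[$1] = 1 }
    END { n = 0; for (q in seen) n++; print n }
' "$demo")

echo "$count"
rm -f "$demo"
```

For a gzipped input, the same awk program can read from a pipe instead of a file name, for example by placing zcat and the file name in front of it.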
Use the file /users/l/lsterck/goodProteins.blastp.gz as input.

Exercise 4

You want to download a rather large amount of data from somewhere, and the best way to do this is to use rsync. However, when the transfer takes a long time, the connection may be lost at some point. Since we all have better things to do than restart the command every so often, write a little shell script that checks whether the rsync command is still running and, if not, restarts it.

Exercise 5

Assembling genomes usually requires several files of read data as input. For a specific assembly program we need to pass in several dozen such files. Moreover, the files need to be supplied in a specific manner. The command line should look something like this:

    abyss-pe -C ABySS_Ppin2 name=ppinV1 k=64 \
    lib='C1J25ACXX_1 C1J25ACXX_2 C1J25ACXX_3 C1J25ACXX_4 C1J25ACXX_5 C1J25ACXX_6 C1J25ACXX_7 C1J25ACXX_8' \
    C1J25ACXX_1="$inDIR/C1J25ACXX_1_0_1.cleanTrim.fq.gz $inDIR/C1J25ACXX_1_0_2.cleanTrim.fq.gz" C1J25ACXX_2="$ ...
    se="$inDIR/C1J25ACXX_1_0.singl.merged.fq.gz $inDIR/C1J25ACXX_2_0.singl.merged.fq.gz $inDIR/C1J25ACXX_3_0.s ...

Write a script that does this automatically, taking the files in the directory /users/l/lsterck/Ex5_files as input.

Exercise 6

Typically, when you have ordered sequencing data, you will receive it as one big tar file. Nowadays it is also often the case that any sequencing experiment you order is run on several lanes of a sequencing machine. As a result, the files you get back from a sequencing provider carry an indication of this in their filenames (e.g. 4C1_S30_L001_R1_001.fq.gz and 4C1_S30_L002_R1_001.fq.gz, where the "_L00X_" part is the lane indicator). Often those files are also grouped in their own folder. As they are in theory technical replicates of the run, they should be merged so that all the data for one biological sample ends up in a single file to be analysed later on.
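The merging idea can be sketched as follows. Gzip files may simply be concatenated, and the result is again a valid gzip file, so the lanes do not need to be decompressed first. The sample names, the directory layout and the output naming (a "_merged_" tag in place of the lane tag) are assumptions made for this demo, not the layout of the real archive.

```shell
# Throw-away directory with made-up lane files for two hypothetical samples.
work=$(mktemp -d)
for sample in 4C1_S30 4C2_S31; do
    for lane in L001 L002; do
        for read in R1 R2; do
            printf '@%s_%s_%s\nACGT\n+\nIIII\n' "$sample" "$lane" "$read" |
                gzip > "$work/${sample}_${lane}_${read}_001.fq.gz"
        done
    done
done

# Use the first lane of every sample/read pair as the anchor, then
# concatenate all lanes of that pair in lane order.
for first in "$work"/*_L001_R?_001.fq.gz; do
    prefix=${first%_L001_*}      # path up to the lane tag, e.g. .../4C1_S30
    rest=${first##*_L001_}       # part after the lane tag, e.g. R1_001.fq.gz
    cat "$prefix"_L00?_"$rest" > "${prefix}_merged_${rest}"
done

# Sanity check: the merged R1 file of one sample should hold one read per lane.
reads=$(gzip -dc "$work/4C1_S30_merged_R1_001.fq.gz" | grep -c '^@')
echo "$reads"
rm -rf "$work"
```

The single-character wildcard in L00? only covers lanes L001 to L009, which is an assumption that holds for the demo data but may need widening for other runs.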
Consider the file /users/l/lsterck/20220429_NovaSeq_FCB-RawDataSUB-4152.tar.gz; this is one such file that you will typically get from a sequencing provider. "Download" (i.e. copy, in this case) the file to one of your local folders and extract it. Write a script or a command line to process the files from that archive as described above (i.e. merge all files from a single sample into one file per sample). The md5 key of that file is also provided. Use it to check whether your "download" was successful.

Exercise 7

Write a script to create an expression matrix from several mapping result files.
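The exercise does not say what the mapping result files look like, so the following is only a sketch under stated assumptions: each sample has one non-empty, tab-separated file named <sample>.counts (a hypothetical naming scheme) with a gene ID in column 1 and a count in column 2. Genes missing from a sample are given a count of 0, which is also an assumption.

```shell
# Two made-up per-sample count files for demonstration.
work=$(mktemp -d)
printf 'geneA\t10\ngeneB\t0\n' > "$work/sample1.counts"
printf 'geneA\t7\ngeneC\t3\n'  > "$work/sample2.counts"

matrix=$(
    cd "$work" || exit 1
    # Header row: one column per input file, in glob (alphabetical) order.
    printf 'gene'
    for f in *.counts; do printf '\t%s' "${f%.counts}"; done
    printf '\n'
    # Body: remember every (gene, file index) count, then print one row per gene.
    awk -F'\t' -v OFS='\t' '
        FNR == 1 { n++ }                       # first line of a file starts a new column
        { count[$1, n] = $2; genes[$1] = 1 }
        END {
            for (g in genes) {
                row = g
                for (i = 1; i <= n; i++)
                    row = row OFS (((g, i) in count) ? count[g, i] : 0)
                print row
            }
        }' *.counts | LC_ALL=C sort            # sort rows by gene ID
)

printf '%s\n' "$matrix"
rm -rf "$work"
```

Real mapping output usually needs a preprocessing step (for example extracting the ID and count columns) before it fits this two-column shape.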
