Exam Questions

1) Why do we use divergent crosses for QTL analyses, i.e. crosses between breeds or lines that are very different for our trait of interest? (2 points)

2) We have identified a QTL affecting aggression in dogs. We have sequenced a 1 Mb region underlying the QTL in 10 aggressive and 10 non-aggressive dogs. Sequencing of this region has identified hundreds of SNPs that differ between the QTL genotypes. How can we prioritize the SNPs? (4 points)

3) Briefly describe the steps to perform a genome-wide association analysis with PLINK. (4 points)

4) Is the consensus important in secondary structure prediction? Explain. (2 points)

5) Analyze the following protein sequence. What is the function? Do you have an idea of the structure? Give the evidence that you find (patterns, profiles, structure predictions…). (4 points)

>prot_A
YLSNHNYVHRDLAARNILVNQNLCCKVSDFGLTRLLDDFDGTYETQGGKIPIR
WTAPEAIAHRIFTTASDVWSFGIVMWE
VLSFGDKPYGEMSNQEVMKSIEDGYRLPPPVDCPAPLYELMKNCWAYDRARR
PHFQKLQAHLEQLLANPHSLRTIANFD

6) What does the '-r' option mean in the cp (copy) command?
   a) Which other command also uses this option? (1 point)
   b) Explain what the following command does: grep -l ^Bio > file.txt (1 point)

7) Write commands that do the following:
   a) Prints the first 5 lines of the file long.seq on the screen. (1 point)
   b) Does a case-insensitive search for the string "length" in all files in the current directory. (1 point)
   c) Puts the first 7 lines of the file long.seq into a file called first-and-last.txt. (1 point)
   d) Puts the last 7 lines of the file long.seq into a file called first-and-last.txt. (1 point)

8) After performing a sequencing run using an Illumina NGS instrument, you decide to assess the quality of the generated data using FastQC. This is the resulting plot for “quality scores across all bases”:
You need the read data to have as few errors as possible for a downstream analysis. What is the problem with the data as it is, and what would you do to fix this?
(2 points)

9) You have been tasked with mapping a set of NGS reads generated from genomic DNA against a reference sequence. The read data consists of a large number of entries like this:
   a) What is the file format of the read data? (1 point)
The reference sequence was from a different individual of the same species as the sequenced individual. Part of the mapping was visualized in Tablet:
   b) What term is commonly used for the total number of blue lines (i.e. reads, marker 1 in the figure) that are stacked under a given position in the reference sequence? (1 point)
   c) Suggest explanations for the phenomena observed at markers 2 and 3. (2 points)

10) Your PI wants you to sequence the genome of the parasite Babesia microti. Do a quick literature search, establish the genome size and familiarise yourself with the layout of the genome. Answer these four questions (0.5 points each):
   1) What is the genome size estimated to be?
   2) How is the genome arranged: number of chromosomes, number of genes, average gene length and G/C content?
   3) Has the genome been sequenced previously? If so, what is your proposed methodology for the sequencing experiment?
   4) Given your chosen methodology, briefly describe the initial pipeline for analysing the data (expect raw FASTQ data from the sequencing centre).

11) Study the picture below. It is an output graph from PRINSEQ, a sequence data QC software suite. Two datasets are shown. The input data should resemble each other (same technology and same library preparation). What is a possible cause of the difference between the datasets? (1.5 points)

12) Which type of job usually passes through the job queue on a cluster the fastest, and why?
   a) A job booked for 2 days on a whole node. (1 point)
   b) A job booked for 5 hours on a single core. (1 point)

13) Describe the concept and purpose of the CRAM format. (2 points)

14) Describe ways in which you can improve the statistical detection of differentially expressed (DE) genes in RNA-seq data. What is the most important thing?
What can you do when planning the experiment? What should you take into account when choosing the DE analysis algorithm? (4 points)

15) Use the following chromosomal sequence to answer the following questions (explain how you did it):
   a) Determine the species this sequence comes from. (1 point)
   b) Translate the coding part (note: think exons and introns). (3 points)
   c) Is any disease known, in any species, to be connected to this gene? (2 points)

>chromosomal DNA
GGCACTCTTCCCACCTAGAAGCGGCTCCTCGCGCTCCTTCTGGAACCTCTGTCAGGTT
CGGCCTCCTCGCCTCCACTCCAGCCTCCACCATGTCCATCAGGGTGACCCAGAAGTCC
TACAAGATGTCCACCTCCAGCCCCCGGGCCTTCAGCAGCCGCTCCTACACGAGCGGGC
CCAGCTCCCGCATCAGCTCCTCCGCCTTCTCCCGGGTGGGCAGCAGCAGCGGCAGCTT
CCGGGGTGGCCTGAACAGCAGCATGAGTGTGGTCGGGGGCTACGGCGGGCCCGGGGT
CGTGGGGAGCATCACGGCCGTCTCAGTGAACCAGAGCCTGCTGAACCCCCTGAAGCTG
GAGGTGGACCCCAACATCCAGGCGGTGCGCACCCAGGAGAAGGAGCAGATCAAGAGC
CTCAACAACAAGTTTGCCTCCTTCATCGACAAGGTGAGCCCCCCACCCTCCCCCGCGG
GGCGGGCAGTGCCTGGGG
CTGGCGAGGGGCTCCGCCTGTGTCTTGGTGGCC
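
Note for question 15 b): one way to take a first look at a genomic sequence is a three-frame translation of the forward strand. The following is a minimal sketch in Python, assuming the standard genetic code. The names `CODON_TABLE` and `translate` and the `frame` parameter are illustrative and not part of the course material. The sketch does not remove introns or pick the start codon, so the exon boundaries still have to be worked out separately.

```python
from itertools import product

# Standard genetic code, written in the usual T, C, A, G base order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate the forward strand of `dna`, starting at offset `frame` (0, 1 or 2).

    Whitespace is ignored, stop codons are shown as '*', and codons that
    contain anything other than A, C, G or T are shown as 'X'.
    A trailing incomplete codon is dropped.
    """
    dna = "".join(dna.split()).upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    toy = "ATGGCCTAA"  # toy input, not taken from the exam sequence
    for frame in range(3):
        print(frame, translate(toy, frame))
```

Comparing the three frames shows which one stays open (free of `*`) over a stretch of sequence; a frame that suddenly fills with stop codons can be a sign that an intron has been entered.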