BIOM262/BGGN237 Quantitative Methods in Genetics
Winter, 2014 Final Exam
Due Wednesday, March 19, by noon (PDT) to bah@ucsd.edu

Instructions: Each question relates to a specific module of the course and was designed by the faculty member in charge of that module. If the meaning of an exam question is unclear, you should first contact the faculty member in charge of the relevant module for clarification.

You may (and should) consult class notes, handouts and assigned readings for the relevant module. You may consult other sources, including books, published research articles, and web resources. You may NOT consult any person, known or anonymous, including by email, chat rooms, social media, discussion boards or any other medium. Violations of this restriction will be pursued as academic dishonesty per University policy. If there are questions regarding this policy, please address them to bah@ucsd.edu before going further.

Course grades will be based on the best eight of nine modules. Modules 5 and 9 will be graded by the project assigned by Dr. Yeo. The exam includes questions from seven modules. The grading basis will be the average of the top three student scores for their eight best modules, prior to awarding extra credit. This average will define a normalized score of 100. If you complete the project and all exam questions, half the value of your lowest score will be awarded as extra credit. Some exam questions offer additional opportunities for extra credit, though these may be more challenging than the primary exam questions. A final score ≥100 will be awarded an A+, up to the top 5 scores. 90 and above is an A, 80 and above is a B, etc. You can do well in the course even if you drop a module completely, but it is quantitatively in your favor to attempt each module. Extra credit questions are pure bonus; there is no penalty if you skip them.

Module 1 (Raffi Aroian)

Question 1.1: A microscope company claims their new confocal can scan an entire 18x18 coverslip at 400X magnification in 60 seconds with a standard deviation of 30 seconds. You decide to demo the system. You scan 10 coverslips at 400X magnification and find it took 750 seconds to collect the data from all 10 coverslips. Is this enough evidence to refute the company’s claim?

Question 1.2: It has been hypothesized that changes in microbiota can correlate with changes in learning capabilities. Germ-free mice were reconstituted with microbiota from C57BL/6 mice or microbiota from Swiss Webster mice (the two strains have different baseline microbiota). The mice were then subjected to a learning test, scored on a scale of 1-200, where 200 indicates high performance and 1 indicates low performance. Does microbiota influence learning? Be sure to justify the test you decide to use.

score (C57BL/6 microbiota): 154 109 137 115 152 140 154 178 101 103 126 137 165 165 129 200 148
score (Swiss Webster microbiota): 108 140 114 91 180 115 126 92 169 146 109 132 75 86 70 115 187 104

Module 2 (Bruce Hamilton)

Postdoc Peter Prettygood has picked a project he can pilot with PCR. To test the hypothesis that the marvelous mutation modifies the amount of Mfr2 mRNA in maxillary glands, he performs quantitative RT-PCR and calculates the relative quantity compared to the geometric mean of a robust set of reference genes.
Here are Peter’s data:

Sample      Genotype   Mfr2      Mfr1
mutant1     AA         0.42016   1.07500
mutant2     AA         0.47185   1.02880
mutant3     AA         0.53506   0.94277
mutant4     AA         0.49547   0.78694
mutant5     AA         0.42986   0.94166
mutant6     AA         0.56094   1.23203
mutant7     AA         1.45536   1.04192
mutant8     AA         0.89592   0.76590
mutant9     AA         0.39923   0.79212
mutant10    AA         0.35265   0.79979
mutant11    AA         0.46934   1.00656
control1    BB         0.44401   1.44315
control2    BB         0.64210   1.02843
control3    BB         0.47597   0.79809
control4    BB         0.67194   0.83985
control5    BB         0.51120   0.97134
control6    BB         0.61028   1.37147
control7    BB         1.60358   1.14948
control8    BB         1.12220   0.89417
control9    BB         0.41608   0.93955
control10   BB         0.54020   0.93507
control11   BB         0.42419   1.25809

Assuming these samples are unpaired, how would you analyze Peter’s data? Justify your choice of test, including tails. Perform the calculation in R. Under your test, what is the probability Peter would have obtained a difference this large by chance?

How much would your answer change (what is the relevant test and p-value) if the samples were paired (mutant1 with control1, etc.)?

Extra credit: If Peter’s PI, Betsy Bayes, had previously thought the likelihood of Peter’s hypothesis was 1,000,000 to 1 against, based on prior evidence in the field, how should she change her view in light of these data? What should her new view be? If Betsy Bayes had instead suggested Peter’s hypothesis as likely, based on prior data for Mfr1 provided in the table and on a known likelihood that these genes are co-regulated, should this matter for how she views the data on Mfr2?

Module 3 (Bruce Hamilton)

A friend asks you to help analyze a large set of genotyping data from a rat genetic linkage cross. The experimental design is an intercross set up to detect modifiers of a nasty mutation. As part of your initial analysis you notice that some markers have unusual distributions of genotypes, unusual enough to have significant chi-square tests.

3.1.
Provide three plausible explanations for deviations from the expected genotype distribution that you might find in genome-wide data.

3.2. Which of these explanations would you favor if you had densely spaced markers and the significant chi-square tests were for consecutive markers? Which explanations would you favor if the significant chi-square tests were for non-adjacent markers?

3.3. Using the qtl package in R and the n12_262.csv intercross data set we explored in class, plot the nonparametric linkage evidence (LOD scores) for the class trait (phenotype 2) as we did in class. (Be sure to follow the map estimation steps we used in class to generate a smooth curve of imputed values.) What LOD value would define significant linkage according to the Lander and Kruglyak (1995) guidelines (making no assumptions about inheritance mode, therefore 2 degrees of freedom)? Draw a red line at this threshold. How many linkage peaks are significant by this standard? Paste your plot below (screen shots are fine).

Extra credit: For the plot above, add a solid, salmon-colored line for the 95% Bayes credible interval.

Module 4 (Elizabeth Winzeler)

You have heard rumors that a family that you know at the University may have contributed DNA for the CEU trio whole genome sequencing effort. In this family the mother has blue eyes, the father brown, and the daughter blue. Based on your reading of the literature, you know that the rs12913832 SNP in HERC2, located at the center of the interval chr15:28,365,602-28,365,634 of the hg19 assembly, is associated with eye color, with the GG genotype being associated with blue eyes. Could your acquaintances be the donors? Explain your logic.

Extra credit history questions where Module 6 should be (all opportunity; it does not hurt you if you skip them)

6.1 Which of the following statisticians may be considered Bayesians?
a. Thomas Bayes
b. Pierre-Simon Laplace
c. David Blackwell
d. Ronald Aylmer Fisher
e.
Nate Silver

6.2 Who is credited with coining the term “Bayesian”?

6.3 Densitometry of x-ray film after exposure to 32P-labeled probes revolutionized quantitative assessment of DNA and RNA in blotting experiments. Which of the following are significant limitations of this anachronistic quantitative method?
a. Decay of 32P is non-linear.
b. Background may be non-uniform and difficult to subtract precisely.
c. Hybridization kinetics depend on salt concentration and reaction temperature relative to the Tm.
d. The clicking noise made by a standard Geiger counter sounds scary.
e. Conversion of silver grains in film is non-linear.
f. Performing this properly is labor-intensive and therefore low-throughput.

Module 7 (Scott Rifkin)

Under genetic drift, an allele at a frequency of 0.4 will persist in a population for an average of 2.7N generations, where N is the population size. Determine the standard deviation of the persistence time for such an allele in terms of N using simulation. Include plots from your simulation to support your answer. What is the standard deviation if the allele starts at a frequency of 0.15? As a reminder, you can import the simulation programs into R via:

source("http://labs.biology.ucsd.edu/rifkin/courses/BIOM262/allProgs.R")

Module 8 (Nik Schork)

Suppose you want to identify a genomic locus harboring variants that influence a particular phenotype. You have access to a very large family, with some members of the family exhibiting the phenotype and others not. Explain how a meiotic mapping approach leveraging linkage analysis to identify the locus would work. Include how concepts such as recombination, genetic maps, polymorphic markers, linkage disequilibrium, haplotypes, transmission functions, penetrance functions, and the Elston-Stewart algorithm fit into this strategy.

Module 10 (Jonathan Sebat)

You are a first year bioinformatics student, and you are doing a rotation in the laboratory of Professor Gertrude Gogglestein.
Gerty has just performed whole genome sequencing on a boy with Duchenne muscular dystrophy (DMD). Paired-end sequencing was performed to 30X coverage using a library of 500 bp fragments. DMD is a recessive X-linked muscle disease that affects only boys. DMD is caused by mutations in the gene dystrophin. Partial and full deletions and duplications of the gene dystrophin make up approximately 60% of the disease-causing mutations. Dr. Gogglestein naturally suspects that the boy could carry a deletion of dystrophin, and she asks you to analyze the sequence data for the presence of such a deletion.

You are provided with 2 pieces of data:
• Table A lists the depth of coverage in 100 bp intervals across the X chromosome
• Table B lists the estimated fragment (insert) sizes and mapping positions for all read pairs on the X chromosome.

You are welcome to use schematic diagrams to illustrate.

1. Describe how you would use Table A to identify a mutation in dystrophin.
2. Describe how you would use Table B to identify a mutation of dystrophin.
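
As an illustration only, the two kinds of signal in Tables A and B could be scanned for along the following lines. This is a minimal Python sketch under stated assumptions, not a prescribed solution: the exam does not give the table layouts, so Table A is assumed here to be a plain list of per-bin depths (bin 0 starting at position 0) and Table B a list of (mapping position, insert size) pairs. The function names, thresholds, and toy values are all hypothetical.

```python
from statistics import median


def find_low_coverage_runs(depths, bin_size=100, frac=0.1, min_bins=5):
    """Return (start_bp, end_bp) for runs of consecutive bins whose depth
    falls below frac * median depth. In a male (one X chromosome), a
    deletion is expected to leave depth near zero across the deleted bins.
    Assumes bin i covers positions [i * bin_size, (i + 1) * bin_size)."""
    cutoff = frac * median(depths)
    runs = []
    run_start = None
    for i, depth in enumerate(depths):
        if depth < cutoff:
            if run_start is None:
                run_start = i  # a low-coverage run begins here
        else:
            if run_start is not None and i - run_start >= min_bins:
                runs.append((run_start * bin_size, i * bin_size))
            run_start = None
    # Close a run that extends to the end of the data.
    if run_start is not None and len(depths) - run_start >= min_bins:
        runs.append((run_start * bin_size, len(depths) * bin_size))
    return runs


def find_long_insert_pairs(pairs, expected=500, tolerance=200):
    """Return the (position, insert_size) pairs whose apparent insert size
    is much larger than the library fragment size. Read pairs that span a
    deletion map farther apart on the reference than the true fragment."""
    return [(pos, size) for pos, size in pairs if size > expected + tolerance]


# Toy data (hypothetical): 30X coverage with a 1 kb gap, and two read
# pairs whose apparent insert size is inflated by roughly the gap length.
toy_depths = [30] * 20 + [0] * 10 + [30] * 20
toy_pairs = [(1500, 510), (1900, 1480), (1950, 1530), (3200, 495)]

print(find_low_coverage_runs(toy_depths))  # [(2000, 3000)]
print(find_long_insert_pairs(toy_pairs))   # [(1900, 1480), (1950, 1530)]
```

In this toy case the two signals agree: the low-depth run spans about 1 kb, and the flagged pairs have apparent inserts about 1 kb longer than the 500 bp library size. The cutoffs (10% of median depth, 200 bp tolerance, 5 consecutive bins) are arbitrary choices for the sketch and would need tuning against the real tables.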