Name:________________________________

GN415 Midterm Exam March 4, 2005

This exam is worth 30% of your final grade. There are 10 multiple choice and 5 True or False questions (1 point each), and you should answer 6 of the 8 short answer questions (6 points each), three from each of the two lists (problems and essays).

You have 60 minutes for the exam.

1. A unit of measurement on physical maps is: a) kilobases b) centimorgans c) cytological bands d) centimeters

2. ENCODE stands for: a) encyclopedia of database entries b) encrypted oracle database encyclopedia c) encyclopedia of DNA elements d) eukaryotic normalized collection of DNA elements

3. Which of the following does not host a commonly used genome browser: a) the National Center for Biotechnology Information (NCBI) b) the European Bioinformatics Institute (EBI) c) the University of California at Santa Cruz (UCSC) d) the United States National Science Foundation (NSF)

4. An E-value is: a) the expected number of codons in a sequence of nucleotides of length l b) an expression of the likelihood that a QTL exists in a genome interval c) the expected number of equally good sequence matches in a sequence database d) an expression of the goodness of fit of a gene prediction to a cDNA sequence

5. The average number of PHRED 20 quality bases in a typical sequence read is close to: a) 10 b) 100 c) 500 d) 2000

6. A haplotype is: a) the set of polymorphic nucleotides found together on a single chromosome b) a genotype that is unique to non-African populations c) a genotype that is only found in a single individual in a population d) a set of diploid genotypes at two or more loci in an individual

7. Alternative splicing refers to: a) a difference in the number of exons in two or more species b) the production of two or more mRNAs from a single gene c) regulation of two different genes by a single regulatory element d) post-translational modification of the cleavage site of receptor proteins

8. Which of the following may contribute to false positive case-control associations? a) differences in allele frequency between populations b) cultural differences between populations c) nutritional differences between populations d) all of the above

9. The F1 generation is: a) the Ferrari Fanclub b) the progeny of a cross between a father and his daughter c) the first filial generation from a cross between two inbred lines d) the J. Craig Venter Foundation

10. Transmission Disequilibrium refers to: a) a 1:1 ratio of the two alleles in a set of unrelated affected individuals b) an unexpectedly long haplotype block in a set of siblings c) linkage disequilibrium between markers on two different chromosomes d) an unequal ratio of two alleles in a set of unrelated affected individuals derived from heterozygous parents.

True or False (circle the appropriate option):

11. Synteny refers to conservation of the order of genes along a chromosome in two different species.

T F

12. 2X shotgun sequencing coverage is sufficient to assemble 99% of the genome of a multicellular eukaryote into fragments at least 1 Mb in length.

T F

13. The closer together two genetic polymorphisms are on a chromosome, the more rapidly linkage disequilibrium decays over time.

T F

14. Current estimates suggest that there are fewer than 30,000 genes in the human genome.

T F

15. Tagging SNPs are designed to capture the majority of the genetic variation in haplotype blocks, reducing the number of sites that must be tested in a genome scan.

T F

Short Answer Problem Questions (6 points each; answer 3 of the 4 questions)

16. Compute the best possible alignment for the following two sequences, assuming a gap penalty of -2, a mismatch penalty of -1, and a match score of +1. If the last nucleotide in the right hand sequence is a G instead of an A, does your answer change?

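GCGCATA and GCGCTAA

Ungapped alignment (best fit): 5 - 2 = 3

GCGCATA
||||..|
GCGCTAA

Gapped alignment: 6 - 4 = 2

GCGCATA-
||||-||-
GCGC-TAA

With G as the last nucleotide of the right hand sequence (GCGCTAG):

Ungapped alignment: 4 - 3 = 1

GCGCATA
||||...
GCGCTAG

Gapped alignment (best fit): 6 - 4 = 2

GCGCATA-
||||-||-
GCGC-TAG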
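
The scores in question 16 can be checked mechanically. The following is a minimal sketch of a global (Needleman-Wunsch style) alignment score using the scoring scheme from the question. The function name and its parameters are illustrative choices and are not part of the exam.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Hypothetical helper: scoring defaults follow question 16
    (match +1, mismatch -1, gap -2, with end gaps penalized).
    """
    # prev[j] is the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = prev[j] + gap       # a[i-1] aligned to a gap
            gap_in_a = curr[j - 1] + gap   # b[j-1] aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]


print(global_alignment_score("GCGCATA", "GCGCTAA"))  # ungapped alignment wins
print(global_alignment_score("GCGCATA", "GCGCTAG"))  # gapped alignment wins
```

The sketch reports only the optimal score. Recovering the alignment itself would require keeping the full matrix and tracing back from the last cell.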

17. Generate a hypothetical Gene Ontology (GO) classification for any gene, real or imaginary, that you may care to choose. Include two hierarchical levels in each of three GO categories that you list.

Antennapedia

Cellular Location
    Nucleus
        Chromatin

Molecular Function
    Transcription Factor
        Homeodomain Class

Biological Process
    Developmental Regulation
        Axial pattern formation

18. A BLAST search identifies two genes with significant matches and the following alignments (the query sequence is the one in the middle; identity is indicated by a vertical line, mismatch by a dot, and a gap by a dash):

1. ACCCGTA------------TATAATGCATTACGATGGGGATCGACTAC---------
   |||||||------------|||||||||||||||||||||||||||||---------
Q. ACCCGTATCGATGCCTAGCTATAATGCATTACGATGGGGATCGACTACGGATCCATC
   ||||.|||--||||||.|||||.|||||||.|||||.|.||||..|||||.||||.|
2. ACCCATAT--ATGCCTTGCTATGATGCATTGCGATGAGCATCGCATACGGCTCCAGC

a) Which alignment would you consider more likely to identify a homologous gene, and why?

The first alignment shows identity over a long stretch, and only two molecular changes differentiate the sequences. The large number of gaps may suggest that the function of the gene has changed, but these are still likely to be homologous (that is, derived from a common ancestor). At least 11 molecular changes separate the query and sequence 2, so they probably diverged in the more distant past.

b) How would you determine whether the sequences are orthologs or paralogs?

Paralogs are different copies of the gene in the same lineage, while orthologs are the same gene in different species (lineages). You would first look to see if there are multiple similar sequences in each of the species that have the gene, and if possible pull out the homologous sequence from an outgroup. One possible explanation for Sequence 1 is that it is a pseudogene, or a recently retrotransposed copy of the original gene, without small introns, either of which would make it a paralog.
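
The counts of molecular changes quoted in part a) can be tallied directly from the two BLAST alignments in question 18. This is a rough sketch that assumes each run of consecutive gap columns is a single insertion or deletion event. The function and variable names are hypothetical.

```python
def tally_alignment(top, bottom):
    """Count matches, mismatches, and gap events in a pairwise alignment.

    Assumption: a run of consecutive gapped columns is one indel event.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    matches = mismatches = gap_events = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            if not in_gap:
                gap_events += 1  # first column of a new gap run
            in_gap = True
        else:
            in_gap = False
            if x == y:
                matches += 1
            else:
                mismatches += 1
    return matches, mismatches, gap_events


# Aligned sequences copied from question 18.
SEQ1 = "ACCCGTA------------TATAATGCATTACGATGGGGATCGACTAC---------"
QUERY = "ACCCGTATCGATGCCTAGCTATAATGCATTACGATGGGGATCGACTACGGATCCATC"
SEQ2 = "ACCCATAT--ATGCCTTGCTATGATGCATTGCGATGAGCATCGCATACGGCTCCAGC"

for name, subject in (("sequence 1", SEQ1), ("sequence 2", SEQ2)):
    matches, mismatches, gap_events = tally_alignment(QUERY, subject)
    print(name, "changes:", mismatches + gap_events)
```

Under that assumption the tally gives two changes for sequence 1 (both indels) and eleven for sequence 2 (ten substitutions plus one indel).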

19. In a recent survey, 500 Americans were asked the question “Do you feel that the income tax rate in this country is too high”, and simultaneously genotyped for a G/T regulatory polymorphism in the promoter of the GspnA1 locus. The following genotype frequencies were observed:

      Yes group   No group
GG        10          25
GT       115         105
TT       125         120

a) What is the approximate allele frequency of G in the population?

There are 35 GG homozygotes and 220 heterozygotes in a total of 500 individuals. Therefore, the frequency of the minor allele (G) is (2x35 + 220)/(2x500) = 0.29. It is slightly lower in the Yes group (135/500 = 0.27) than in the No group (155/500 = 0.31).

b) Is there any suggestion of an association between GspnA1 and taxation policy?

Possibly: there appears to be a deficit of GG homozygotes in the Yes group (21 would be expected for an allele frequency of 0.29 in 250 individuals). The genotype frequencies match Hardy-Weinberg proportions in the No group. A statistical test would have to be performed to assess whether the difference is significant.
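
The arithmetic in parts a) and b) can be reproduced with a short script. This is an illustrative sketch only: the helper names are hypothetical, and the Pearson chi-square test on the 2 x 3 genotype table is one reasonable choice of statistical test, not necessarily the one the answer key has in mind.

```python
# Genotype counts from the table in question 19.
counts = {
    "Yes": {"GG": 10, "GT": 115, "TT": 125},
    "No": {"GG": 25, "GT": 105, "TT": 120},
}


def g_frequency(genotypes):
    """Frequency of the G allele: each GG carries two copies, each GT one."""
    n = sum(genotypes.values())
    return (2 * genotypes["GG"] + genotypes["GT"]) / (2 * n)


def hardy_weinberg_expected(p, n):
    """Expected genotype counts among n individuals for G frequency p."""
    q = 1 - p
    return {"GG": p * p * n, "GT": 2 * p * q * n, "TT": q * q * n}


def chi_square(table):
    """Pearson chi-square statistic for a group-by-genotype table."""
    rows = list(table)
    cols = list(table[rows[0]])
    row_totals = {r: sum(table[r].values()) for r in rows}
    col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
    total = sum(row_totals.values())
    statistic = 0.0
    for r in rows:
        for c in cols:
            expected = row_totals[r] * col_totals[c] / total
            statistic += (table[r][c] - expected) ** 2 / expected
    return statistic


pooled = {g: counts["Yes"][g] + counts["No"][g] for g in ("GG", "GT", "TT")}
print("G frequency, pooled:", g_frequency(pooled))
print("G frequency, Yes:", g_frequency(counts["Yes"]))
print("G frequency, No:", g_frequency(counts["No"]))
print("Expected GG in 250 at p = 0.29:", hardy_weinberg_expected(0.29, 250)["GG"])
print("Chi-square, 2 df:", chi_square(counts))
```

The statistic comes out near 7.0 on 2 degrees of freedom, just above the 5% critical value of 5.99, so the difference would be nominally significant. That alone says nothing about stratification, sampling, or replication.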

c) What does this say about the genetic basis of an economic belief?

Taken at face value, this may suggest that GG homozygotes are unlikely to believe that they have high taxation. However, even if the association is significant, there are many possible causes of a false positive result, including population stratification and sampling artifacts. The result would have to be replicated several times before it implied something genetic, and some mechanistic hypothesis for the function of the GREENSPAN protein would have to be developed.

Short Answer Essay Questions (6 points each; answer 3 of the 4 questions)

20. If you were CEO of a company that produces a drug that significantly reduces the memory loss associated with Alzheimer’s disease, but increases the risk of aneurysm in 5% of patients, how might you use genomics to increase the likelihood that your drug is approved by the FDA? Be as specific as you can in describing the pharmacogenetic approach and the potential drawbacks.

My objective would be to identify a genetic marker that predicts the adverse side-effect. In this case, I would conduct a case-control genome scan with the 100,000 human tagging SNPs from the HapMap project, where the cases are as large a sample as I can find (at least 200) of patients who took the drug and developed aneurysms, and the controls are patients who took the drug without developing aneurysms. I would use a statistical test of association to find genotypes that are more prevalent in either the cases or the controls.

Having found an association, it would be essential to replicate the study. With Alzheimer’s disease, it is unlikely that most affected individuals still have living parents, so it will probably not be possible to perform a parent-offspring transmission disequilibrium test, but it may be possible to perform a sibling-based transmission disequilibrium study. Also, since the drug has already been found to increase aneurysms, it may have been withdrawn prior to completion of the clinical study, and it may not be possible to find enough individuals to perform a replication. I would also have to check to see whether the susceptibility allele is at a different frequency in different populations.

Even if an association is found, it is unlikely that one marker will predict who will have the adverse side-effects. It is possible that the identity of the gene might help the drug company develop another drug to counteract the aneurysms. Alternatively, it might be used to identify the at-risk population and exclude them from taking the drug, even though many individuals who might benefit would also be excluded.

21. What are the three main methods used to identify genes in a genome sequence, and what attributes of the gene annotation are currently most unreliable?

1. Experimental evidence for gene expression. For example, a match to an EST or cDNA sequence in the database. Since genes are only transcribed in a subset of tissues, and many have very low transcript abundance, failure to identify a cDNA does not mean that the sequence is not part of a gene. Primers can be designed to specifically detect predicted transcripts by RT-PCR.

2. Ab initio gene detection. Most gene annotation starts with Hidden Markov Models (HMMs) that search for ordered strings of promoters, start sites, exons, introns, stop codons, and 3’ polyA sites. Different parameters in the probabilistic models lead to different predictions, and multiple algorithms should be employed.

3. Search for a sequence match in the database of all genomes, generally using the Basic Local Alignment Search Tool (BLAST). This looks for sequence conservation of at least 60 nucleotides (or 20 codons), and can be performed both with nucleotide and amino acid sequences. It is based on the idea that genes are likely to be conserved over evolutionary time.

The most unreliable aspect of gene annotation is the gene structure. Gene prediction is pretty good (perhaps 10% false prediction and failure to detect genes), but the identification of exon boundaries and start sites is much more error prone. It is expensive and time-consuming to determine mRNA structures completely, especially given alternative splicing. Another aspect of poor annotation is the prediction of regulatory micro-RNAs and non-coding RNA genes.

22. What is the difference between linkage mapping and linkage disequilibrium mapping? Describe a general strategy for using both methods to identify a gene that predisposes human children to autism.

Linkage mapping is performed in pedigrees, and is based on the idea that physically linked genes on a chromosome are likely to co-segregate. Consequently, markers within several centiMorgans tend to be linked and to give similar test statistics. Statistical methods are used to infer the most likely location of a gene based on the association between a set of adjacent markers and the phenotype.

Linkage disequilibrium mapping is performed in populations, and is based on the idea that after thousands of generations, recombination leaves the genome in small chunks of no more than a few hundred kilobases (less than 0.1 centiMorgans) that are in linkage disequilibrium. It has much higher resolution than linkage mapping, but requires two or three orders of magnitude more genetic markers.

In general, linkage mapping could be used to narrow down an interval to 5 cM (several Mb) in a dozen or so pedigrees, each with several autistic children. Next, I would examine the predicted genes in the interval, and develop a set of 100 or so tagging SNPs that are expected to capture most of the haplotype variation. I would then conduct a case-control association test on a large population of 500 autistic children compared with 500 age- and sex-matched control children from a single population. Subsequently, it would be necessary to replicate the study with an independent sample, or to perform a family-based transmission-disequilibrium test.
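
The family-based transmission-disequilibrium test mentioned above reduces to a simple statistic in its basic form: among heterozygous parents of affected children, compare how often the candidate allele is transmitted with how often it is not. The sketch below uses the standard McNemar-style formula. The function name and the transmission counts are hypothetical and chosen only for illustration.

```python
def tdt_statistic(transmitted, untransmitted):
    """Chi-square statistic (1 df) for the basic TDT.

    transmitted:   times heterozygous parents passed the candidate allele
                   to an affected child
    untransmitted: times they passed the other allele instead
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (transmitted - untransmitted) ** 2 / total


# Hypothetical counts: 100 informative transmissions, 60 of the candidate allele.
print(tdt_statistic(60, 40))
```

Under the null hypothesis each allele is transmitted half the time, so the statistic is zero when the two counts are equal and grows as transmission becomes more unequal.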

23. Write a short essay describing one ethical, one social, and one legal implication of the Human Genome Project.

An example of an ethical implication is whether parents should be encouraged to use genotypic information to make decisions about their family planning. In some cases, couples may be advised not to get married because their children are likely to have a particular disease.

An example of a social implication is ensuring equal access to the benefits of genomic research for all sectors of society: urban and rural, wealthy and poor, and all racial groups. There is already a concern that particular races are less likely to participate in genomic studies on the basis of past experience with eugenics and unfair treatment.

An example of a legal implication is the potential for stigmatization of people on the basis of their genetic predisposition to disease, leading to failure to provide health insurance or employment.
