Review Concepts for Bio 224 Midterm

1. Databases
a. Genbank, EMBL, Swiss-Prot, UniGene, Ensembl
b. accession numbers, and how to read information within entries
2. Structure of a typical eukaryotic gene and modifications
3. Degeneracy of genetic code (e.g. 4-fold degenerate)
4. conservative substitutions
5. nonsynonymous vs. synonymous mutations
6. orthologous vs. paralogous genes
(alleles found for the same gene are to be considered orthologous)
7. Homology vs percent identity
8. Difference between global and local pairwise alignments [designed to detect either
closely or distantly related proteins]
9. Substitution matrices (BLOSUM vs PAM) [high BLOSUM and low PAM used for
detection of closely related proteins]. How were the matrices made?
10. General concept of calculating an alignment score
a. score = sum of (identity and mismatch scores) – gap penalties
b. take into account conservation and the frequency of occurrence of particular residues
11. Mutation probability vs. log odds
S(a,b) = 10 log10(probability that the alignment is authentic / probability that the
alignment is random)
12. What does the “twilight zone” refer to with regard to sequence relatedness
predictions?
13. Why do the percent identities for divergent proteins under-represent the number of
actual nucleotide and amino acid substitutions? (multiple hits)
14. What is a protein family? Superfamily?
15. Why is it that proteins that serve similar functions do not have to be structurally
related?
16. Advantages vs disadvantages of using nucleotide vs protein sequences for alignment purposes
17. What is BLAST and what are the basic steps used?
(compiles a list of word pairs above the threshold value, then scans the database for
matches to list of these word pairs, extends hits in either direction, returns results for
those above the set E value)
18. What is an expect value and how is it used?
19. What are domains, motifs and signatures? Which does the CDD database catalog?
How about the PROSITE database? How do you read a consensus sequence?
([] = must be one of these, x = any of the 20 amino acids, {} = can’t be one of these)
20. Predictions of the physical properties for proteins
21. What techniques are used to examine gene expression (ESTs, cDNAs, SAGE,
Microarrays) and how do these work?
22. cDNA libraries (standard, normalized, subtraction)
23. What are ESTs? What is a UniGene entry?
24. What is Digital Differential Display and how does it work?
25. What are the pitfalls in analyzing gene expression? (representation in library,
quality, sequencing errors, mRNA levels that do not necessarily correspond to actual
protein expression, similarities to other mRNAs, differential splicing)
26. Complexity of a genome (protein-coding genes, pseudogenes, RNA genes, repeat
regions, viral sequences, and other miscellaneous)
27. Why are the start and end sites of transcription not the same as the start and end
sites of translation?
28. Why is it that genomes can vary so greatly in size and yet contain close to the
same number of genes?
29. How do you define a gene coding segment within the genome using both extrinsic
and intrinsic methods of detection?
30. Why is it difficult to predict the promoter elements for a gene?
31. How are genomes sequenced? (shotgun vs hierarchical methods)
32. What does it mean when a genome has been covered 3x? Is this considered
complete?
33. How are microarrays performed?
34. How is PCR performed? What are the considerations and pitfalls?
35. What can you do to optimize your PCR (strategies included)?
36. Why is it that Taq enzyme does not have the highest fidelity?
37. What do you look for in generating degenerate primers?
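
Worked example for items 10 and 11. This is a minimal sketch of how an alignment score and a log-odds substitution score can be computed. The match, mismatch and gap values are made-up illustration numbers, not BLOSUM or PAM entries, and the function names are hypothetical. The gap handling is simplified: any run of gap columns is treated as one gap, regardless of which sequence contains it.

```python
import math

# Toy scoring scheme (assumed values for illustration only).
MATCH = 2
MISMATCH = -1
GAP_OPEN = -5
GAP_EXTEND = -1


def log_odds_score(q_ab: float, p_a: float, p_b: float) -> float:
    """S(a,b) = 10 * log10(observed pair frequency / frequency expected by chance).

    q_ab is how often a and b are seen aligned in trusted alignments;
    p_a * p_b is how often they would pair up at random.
    """
    return 10 * math.log10(q_ab / (p_a * p_b))


def alignment_score(seq_a: str, seq_b: str) -> int:
    """Score two already-aligned sequences of equal length ('-' marks a gap)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            # Opening a gap costs more than extending an existing one.
            score += GAP_EXTEND if in_gap else GAP_OPEN
            in_gap = True
        else:
            score += MATCH if a == b else MISMATCH
            in_gap = False
    return score
```

A pair that occurs exactly as often as chance predicts scores 0, a pair seen more often than chance scores positive, and a pair seen less often scores negative. In a real search the MATCH and MISMATCH constants would be replaced by a lookup into a substitution matrix.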
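
Worked example for item 19. One way to practice reading a PROSITE-style consensus sequence is to translate it into a regular expression. The sketch below covers only the common notation ([] allowed residues, {} forbidden residues, x any residue, (n) or (n,m) repeats, < and > for the sequence ends); the function name is hypothetical and less common PROSITE features are not handled. The test pattern N-{P}-[ST]-{P} is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")  # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        # {P} means "anything except P"  -> [^P]
        element = element.replace("{", "[^").replace("}", "]")
        # x(2,4) means "2 to 4 residues" -> .{2,4}
        element = element.replace("(", "{").replace(")", "}")
        # x means "any amino acid"       -> .
        element = element.replace("x", ".")
        # < and > anchor the pattern to the N- or C-terminus
        element = element.replace("<", "^").replace(">", "$")
        parts.append(element)
    return "".join(parts)


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
print(glyco)                                 # N[^P][ST][^P]
print(bool(re.search(glyco, "AANGSAA")))     # True: N, G (not P), S, A (not P)
print(bool(re.search(glyco, "AANPSAA")))     # False: P follows the N
```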
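
Worked example for item 32. A rough way to reason about "3x coverage" is the idealized random-shotgun model, in which reads land uniformly at random and the chance that a given base is missed by every read is about e^(-coverage). The sketch below assumes that model; real genomes with repeats and cloning bias behave worse. The function names and the read numbers in the usage lines are made up for illustration.

```python
import math


def coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def fraction_uncovered(cov: float) -> float:
    """Expected fraction of bases hit by no read (Poisson approximation)."""
    return math.exp(-cov)


# Hypothetical project: 6,000 reads of 500 bp over a 1 Mb genome.
c = coverage(6000, 500, 1_000_000)
print(c)                                  # 3.0
print(round(fraction_uncovered(c), 3))    # 0.05
```

Under this model roughly 5% of the bases are still unsequenced at 3x, which is why 3x coverage is normally treated as a draft and not a complete genome.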
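
Worked example for items 3 and 37. When designing degenerate primers, one thing to check is how many distinct sequences the primer pool actually contains. The sketch below counts this in two ways: from a peptide (using the number of codons per amino acid in the standard genetic code) and from a primer written with IUPAC ambiguity codes. The function names and the example sequences are hypothetical.

```python
# Number of codons for each amino acid in the standard genetic code.
CODON_COUNT = {
    "M": 1, "W": 1,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "I": 3,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "R": 6, "S": 6,
}

# Number of bases each IUPAC nucleotide code stands for.
IUPAC_COUNT = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def peptide_degeneracy(peptide: str) -> int:
    """Number of different DNA sequences that could encode the peptide."""
    total = 1
    for residue in peptide.upper():
        total *= CODON_COUNT[residue]
    return total


def primer_degeneracy(primer: str) -> int:
    """Number of distinct sequences in a primer written with IUPAC codes."""
    total = 1
    for base in primer.upper():
        total *= IUPAC_COUNT[base]
    return total


print(peptide_degeneracy("WDMK"))          # 4   (1 * 2 * 1 * 2)
print(peptide_degeneracy("LSR"))           # 216 (6 * 6 * 6)
print(primer_degeneracy("ATGGAYTGGAAR"))   # 4   (Y and R are each 2-fold)
```

The comparison between "WDMK" and "LSR" shows why stretches rich in Met and Trp are preferred and stretches rich in Leu, Ser and Arg are avoided when picking a region to prime from.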