Review Concepts for Bio 224 Midterm

1. Databases
   a. GenBank, EMBL, Swiss-Prot, UniGene, Ensembl
   b. Accession numbers, and how to read information within entries
2. Structure of a typical eukaryotic gene and modifications
3. Degeneracy of the genetic code (e.g. 4-fold degenerate)
4. Conservative substitutions
5. Nonsynonymous vs. synonymous mutations
6. Orthologous vs. paralogous genes (alleles found for the same gene are to be considered orthologous)
7. Homology vs. percent identity
8. Difference between global and local pairwise alignments [designed to detect either closely or distantly related proteins]
9. Substitution matrices (BLOSUM vs. PAM) [high BLOSUM and low PAM used for detection of closely related proteins]. How were the matrices made?
10. General concept of calculating an alignment score
   a. score = sum of (identities, mismatches) - gap penalties
   b. Take into account conservation and the frequency of occurrence of particular residues
11. Mutation probability vs. log odds: S(a,b) = 10 log10(probability of alignment being authentic / probability of alignment being random)
12. What does the “twilight zone” refer to in regard to sequence relatedness predictions?
13. Why do the percent identities for divergent proteins under-represent the number of actual nucleotide and amino acid substitutions? (multiple hits)
14. What is a protein family? Superfamily?
15. Why is it that proteins that serve similar functions do not have to be structurally related?
16. Advantages vs. disadvantages of using nucleotide vs. protein sequences for alignment purposes
17. What is BLAST and what are the basic steps used? (compiles a list of word pairs above the threshold value, then scans the database for matches to this list of word pairs, extends hits in either direction, returns results for those above the set E value)
18. What is an expect value and how is it used?
19. What are domains, motifs and signatures? Which does the CDD database catalog? How about the PROSITE database? How do you read a consensus sequence?
   [] = must be one of these
   x = any of the 20 amino acids
   {} = can't be one of these
20. Predictions of the physical properties of proteins
21. What techniques are used to examine gene expression (ESTs, cDNAs, SAGE, microarrays) and how do these work?
22. cDNA libraries (standard, normalized, subtraction)
23. What are ESTs? What is a UniGene entry?
24. What is Digital Differential Display and how does it work?
25. What are the pitfalls in analyzing gene expression? Representation in the library, quality, sequencing errors, mRNA levels that do not necessarily correspond to actual protein expression, similarities to other mRNAs, differential splicing
26. Complexity of a genome (protein-coding genes, pseudogenes, RNA genes, repeat regions, viral sequences, and other miscellaneous sequences)
27. Why are the start and end sites of transcription not the start and end sites of translation?
28. Why can genomes vary so greatly in size yet contain close to the same number of genes?
29. How do you define a gene coding segment within the genome using both extrinsic and intrinsic methods of detection?
30. Why is it difficult to predict the promoter elements for a gene?
31. How are genomes sequenced? (shotgun vs. hierarchical methods)
32. What does it mean when a genome has been covered 3x? Is this considered complete?
33. How are microarrays performed?
34. How is PCR performed? What are the considerations and pitfalls?
35. What can you do to optimize your PCR (strategies included)?
36. Why is it that the Taq enzyme does not have the highest fidelity?
37. What do you look for in generating degenerate primers?
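For items 10 and 11, a small worked example may help when studying. The Python sketch below is only an illustration: the function names and the match, mismatch, and gap values are made up for this example and are not BLOSUM or PAM values. It scores an already-aligned pair of sequences as (identities and mismatches) minus gap penalties, and computes a log-odds score from two probabilities using the formula in item 11.

```python
import math

def log_odds_score(p_authentic, p_random):
    """Item 11: S = 10 * log10(P(authentic) / P(random))."""
    return 10 * math.log10(p_authentic / p_random)

def alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap_penalty=2):
    """Item 10a: sum of identity and mismatch scores minus gap penalties.

    seq_a and seq_b must already be aligned (same length, '-' marks a gap).
    The default scores are hypothetical; a real substitution matrix would
    replace the flat match/mismatch values (item 10b).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score -= gap_penalty   # each gap position is penalized
        elif a == b:
            score += match         # identity
        else:
            score += mismatch      # mismatch
    return score

# Three identities (+6), one mismatch (-1), one gap (-2)
print(alignment_score("ACG-T", "ACCTT"))
# An alignment ten times more likely to be authentic than random
print(log_odds_score(0.01, 0.001))
```

A positive log-odds score means the pairing is seen more often in authentic alignments than expected by chance; a negative score means it is seen less often.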
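The consensus notation in item 19 maps closely onto regular expressions, which can make it easier to remember. The sketch below assumes PROSITE-style patterns in which positions are separated by "-" and a repeat count is written as x(4) or x(2,4). The function name and the example pattern are illustrative only, and the conversion covers only these basic symbols.

```python
import re

def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern into a Python regex string."""
    regex = pattern.replace("-", "")                        # drop position separators
    regex = re.sub(r"\{([A-Z]+)\}", r"[^\1]", regex)        # {ED} -> [^ED] (can't be one of these)
    regex = re.sub(r"\((\d+(?:,\d+)?)\)", r"{\1}", regex)   # (4) -> {4}, (2,4) -> {2,4}
    # x -> any residue. Note that '.' matches any character, not only the
    # 20 amino acids, so this is a simplification.
    regex = regex.replace("x", ".")
    return regex                                            # [AC] stays as is (must be one of these)

pattern = "[AC]-x-V-x(4)-{ED}"   # illustrative pattern, not a specific PROSITE entry
regex = prosite_to_regex(pattern)
print(regex)
print(bool(re.search(regex, "AKVLLLLG")))   # last residue is not E or D
print(bool(re.search(regex, "AKVLLLLE")))   # last residue is E, which is excluded
```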
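For item 32, one common way to reason about coverage is a simple Poisson model. This sketch assumes reads land uniformly at random across the genome, which real sequencing only approximates. Under that assumption, the expected fraction of bases never sequenced at coverage c is e^(-c). The function name is hypothetical.

```python
import math

def expected_fraction_covered(coverage):
    """Expected fraction of bases sequenced at least once, assuming
    reads are placed uniformly at random (Poisson model)."""
    return 1.0 - math.exp(-coverage)

for c in (1, 3, 8):
    frac = expected_fraction_covered(c)
    print(f"{c}x coverage: about {frac:.1%} of bases expected to be sequenced at least once")
```

Under this model, 3x coverage still leaves roughly 5% of bases unsequenced, which is one way to approach the question of whether 3x counts as complete.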