Master's and Doctoral Program in Biochemical Sciences (Programa de Maestría y Doctorado en Ciencias Bioquímicas)

Name of the course: Genomics and Bioinformatics: A Practical Course in Sequence Data Analysis
Semester: 2011-1
Venue: Instituto de Biotecnología-UNAM

People in charge:
Dr. Robert Edwards, Ph.D., Departments of Computer Science and Biology, San Diego State University, San Diego, CA, USA, and Department of Mathematics and Computer Science, Argonne National Laboratory, Argonne, IL, USA.
Dra. Rosa María Gutiérrez Ríos, Departamento de Microbiología Molecular, Instituto de Biotecnología, UNAM.

Invited speakers
Dr. Ramy Aziz, San Diego State University (phage-encoded toxins and prophages).
Dr. Ross Overbeek, Argonne National Laboratory (bioinformatics and genomics).
Dr. Fabiano Thompson, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil (taxonomic classification).

Introduction and justification
Biology is becoming an information science, and DNA sequencing is a fundamental tool used to understand both prokaryotic and eukaryotic biological systems. Bioinformatics, the use of computers to understand biological data, covers many different areas of study, from sequencing complete genomes to using statistics to study gene expression results from microarrays. Understanding the tools that are available, and acquiring the programming skills needed to be a successful biologist, is a daunting task. This class will focus on the fundamentals of bioinformatics, covering the algorithms and approaches used to analyze biological data with computers.
Most biologists are familiar with common bioinformatics tools such as BLAST (the Basic Local Alignment Search Tool, available, for example, from NCBI). However, many do not understand the limitations of these tools, or how the parameters chosen by the user affect the results. Students in this course will learn to apply computational algorithms to biological questions, using post-genomic tools to explore, annotate, and compare complete microbial and eukaryotic genome and metagenome sequences. The course is designed to appeal both to biologists with limited computer programming expertise and to computer scientists with limited biological expertise. Teaming students with different backgrounds lets them learn from each other and achieve outcomes far beyond their individual capabilities. Edwards’ experience in both computer science and biological science allows him to communicate with students from both backgrounds and to cover the gaps in their respective training.

Aims
The three most important learning outcomes of this class are:
1. Ability to use existing tools to analyze genome sequences. Students will learn which tools are applicable to which computational analyses, and when and how each tool should be used.
2. Ability to choose and explain the correct bioinformatics tool for the task at hand. There are many tools available for analysis of DNA sequences, protein sequences, and derived data such as microarrays. However, each task requires a different bioinformatics tool and approach. Furthermore, the parameters that the user may set for each tool need to be defined and explained. This course will explain the algorithms underlying the tools, to enable the user to decide which parameters to adjust and to explain why.
3. Ability to develop new tools appropriate for sequence analysis. Although many bioinformatics tools are available to researchers, the right tool for a required analysis often does not exist. On these occasions, the researcher needs to design and implement a new tool for the particular analysis, or to combine the output from multiple tools into a new result. This course will demonstrate how to design software and write simple applications using the scripting languages (Perl, Python) that are the cornerstone of bioinformatics.

Each student will be provided with an uncharacterized sequence at the start of the semester, and will annotate and analyze that sequence over the course of the program. Both prokaryotic and eukaryotic genomes, and prokaryotic and viral metagenomes, are available from a companion course taught at SDSU in the spring semester of 2010 (that course has just concluded). During that course the students generated approximately 5-fold coverage of the California Sea Lion genome, 5 microbial genomes, and 8 metagenomes. These sequences are not yet available in GenBank or other public repositories, and therefore provide the perfect teaching tool.

Program: including dates and times
Week 1: The three domains of life; codon usage, genes and genomes, types of sequencing.
Week 2: Identifying protein-encoding sequences. Prokaryotic and eukaryotic ORF calling.
Week 3: Non-coding elements in sequences (rRNA, tRNA).
Week 4: Comparing sequences via global alignments.
Week 5: Local sequence alignments with BLAST, gapped BLAST, PSI-BLAST.
Week 6: Sequence databases (NCBI, DDBJ, EMBL, SEED).
Week 7: Protein families; PIR superfamilies.
Week 8: Building and using HMMs; Pfam, Rfam.
Week 9: Subsystems and annotating genomes; functional coupling.
Week 10: Phage and virus genomes.
Week 11: Standard and alternate phylogenetic trees (16S; phage proteomic tree).
Week 12: Comparative genomics I: DNA-level comparisons; Mauve, ACT, MUMmer.
Week 13: Comparative genomics II: Protein-level comparisons; signature genes; ORFans.
Week 14: Metagenomics I: Approaches and tools.
Week 15: Metagenomics II: MG-RAST, real time, CAMERA, etc.
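As an illustration of the kind of short script Aim 3 and the Week 2 topic (ORF calling) refer to, the following is a minimal sketch in Python. It is not taken from the course materials. The function name find_orfs and the min_codons parameter are hypothetical, and the sketch assumes a single start codon (ATG), the standard stop codons, and the forward strand only. Real gene callers such as GLIMMER use statistical models rather than this naive scan.

```python
# Illustrative sketch only: a naive forward-strand ORF scan.
# Assumptions (not from the syllabus): ATG is the only start codon,
# the standard stop codons apply, and the reverse strand is ignored.
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return a list of (start, end, frame) tuples for forward-strand ORFs.

    start and end are 0-based, end-exclusive, and the span includes the
    stop codon. min_codons counts the start codon plus the coding codons
    that follow it, excluding the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START:
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs
```

For example, find_orfs("CCATGAAATAGCC") returns [(2, 11, 2)], which corresponds to the subsequence ATGAAATAG in the third reading frame. Raising min_codons filters out short ORFs, which is one simple instance of a user-chosen parameter changing the result.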
Number of sessions and duration of each one
The course will have two sessions per week of 2.5 hours each.

Dynamics of the course and mechanisms of evaluation
This will be an interactive course delivered via video conference from San Diego State University (by Dr. Edwards), with local supervision and assistance (by Dr. Gutiérrez). In addition, the course will include an intensive tutorial period in which Edwards will spend 7-14 days in Cuernavaca giving daily tutorials that cover the material in more detail.

Students will be graded on both assignments and a final, semester-long project. The assignments will build up to the final project, in which the student will describe the sequence provided at the beginning of the course (see above). The student’s grade will be assigned from a combination of in-class participation (10%), three written assignments (15% each), and the final project write-up (45%).

Definition of grades
Grades used in reporting are as follows:
10: outstanding achievement; available for the highest accomplishment.
8 or 9: average; awarded for satisfactory performance.
6 or 7: minimally passing.
5: unacceptable; course must be repeated for credit.

Space will be limited according to the capacity of the video conference facility.

Bibliography:
1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
3. Badger JH, Olsen GJ: CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 1999, 16(4):512-524.
4. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics 2009, 10:421.
5. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999, 27(23):4636-4641.
6. Dowell RD, Eddy SR: Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics 2004, 5:71.
7. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755-763.
8. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL et al: The Pfam protein families database. Nucleic Acids Res 2008, 36(Database issue):D281-288.
9. Krammer EM, Sebban P, Ullmann GM: Profile hidden Markov models for analyzing similarities and dissimilarities in the bacterial reaction center and photosystem II. Biochemistry 2009, 48(6):1230-1243.
10. Krause L, McHardy AC, Nattkemper TW, Puhler A, Stoye J, Meyer F: GISMO-gene identification using a support vector machine for ORF classification. Nucleic Acids Res 2007, 35(2):540-549.
11. Lee ZM, Bussema C, 3rd, Schmidt TM: rrnDB: documenting the number of rRNA and tRNA genes in bacteria and archaea. Nucleic Acids Res 2009, 37(Database issue):D489-493.
12. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997, 25(5):955-964.
13. Osterman A, Overbeek R: Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol 2003, 7(2):238-251.
14. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691-5702.
15. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 1999, 96(6):2896-2901.
16. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195-197.
17. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R: Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 1998, 26(1):320-322.
18. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631-637.
19. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673-4680.
20. Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P et al: PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res 2004, 32(Database issue):D112-114.
21. Xiong J: Essential bioinformatics. New York: Cambridge University Press; 2006.