Programa de Maestría y Doctorado en Ciencias Bioquímicas
Course
Genomics and Bioinformatics: A Practical Course in Sequence
Data Analysis
Semester: 2011-1
Venue: Instituto de Biotecnología, UNAM
People in charge:
Dr. Robert Edwards
Departments of Computer Science and Biology, San Diego State University, San
Diego, CA, USA and Department of Mathematics and Computer Science,
Argonne National Laboratory, Argonne, IL, USA.
Dra. Rosa María Gutiérrez Ríos
Departamento de Microbiología Molecular, Instituto de Biotecnología, UNAM.
Invited speakers
Dr. Ramy Aziz, San Diego State University (Phage-encoded toxins and prophages).
Dr. Ross Overbeek, Argonne National Laboratory (Bioinformatics and genomics).
Dr. Fabiano Thompson, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
(taxonomic classification).
Introduction and justification
Biology is becoming an information science, and DNA sequencing is a fundamental tool
used to understand both prokaryotic and eukaryotic biological systems. Bioinformatics, using
computers to understand biological data, covers many different areas of study, from
sequencing complete genomes to using statistics to study gene expression results from
microarrays.
Understanding the available tools, and acquiring the programming skills needed to be a
successful biologist, is a daunting task. This class will focus on the fundamentals of
bioinformatics, covering the algorithms and approaches used to analyze biological data with
computers.
Most biologists are familiar with common bioinformatics tools such as BLAST (the Basic
Local Alignment Search Tool, available, for example, from NCBI). However, many do not
understand the limitations of these tools, or how the parameters chosen by the user
affect the results.
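To make the point about parameters concrete, here is a minimal sketch of how a single scoring parameter changes the result of a local alignment. It is not BLAST, which relies on heuristics. It is a plain Smith-Waterman score calculation with a linear gap penalty, and the function name, the default scores, and the two short sequences are assumptions chosen only for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score with a linear gap penalty."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1] (match or mismatch).
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may restart anywhere.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical sequences: the subject lacks one of the two central T bases.
query = "ACGTTGCA"
subject = "ACGTGCA"

for gap in (-1, -4, -7):
    print(f"gap penalty {gap}: best score {local_alignment_score(query, subject, gap=gap)}")
```

With a mild gap penalty the best alignment spans both sequences and contains one gap. With a harsh penalty the best alignment shrinks to an ungapped four-base match, so the same pair of sequences yields a different alignment and a different score.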
Students in this course will learn to apply computational algorithms to biological questions,
using post-genomic tools to explore, annotate, and compare complete microbial and
eukaryotic genome and metagenome sequences.
The course is designed to appeal both to biologists with limited computer programming
expertise and to computer scientists with limited biological expertise. Teaming students
with different backgrounds lets them learn from each other and achieve outcomes far beyond
their individual capabilities.
Dr. Edwards' experience in both computer science and biological science allows him to
communicate with students from both backgrounds, and to fill the gaps in their respective
knowledge.
Aims
The three most important learning outcomes of this class are:
1. Ability to use existing tools to analyze genome sequences.
Students will learn which tools are applicable to which computational analyses, and when
and how each tool should be used.
2. Ability to choose and explain the correct bioinformatics tool for the task at hand.
There are many tools available for analysis of DNA sequences, protein sequences, and
derived data like microarrays. However, each task requires a different bioinformatics tool
and approach. Furthermore, the parameters that may be set by the user for each tool
need to be defined and explained. This course will explain the underlying algorithms
behind the tools, to enable the user to decide which parameters to adjust and to explain
why.
3. Ability to develop new tools appropriate for sequence analysis.
Although many bioinformatics tools are available to researchers, there is often no tool
suited to the analysis that is required. On these occasions, the researcher needs
to design and implement a new tool for that particular analysis, or to combine the
output from several different tools into a new result. This course will demonstrate
how to design software and write simple applications using the scripting
languages (Perl, Python) that are the cornerstone of bioinformatics.
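As an illustration of the kind of simple application meant in the third aim, the sketch below reads FASTA-formatted sequences and reports the length and GC content of each one. It is only an example of a small scripted tool, not course material. The function names and the two records are hypothetical.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Fraction of G and C among the unambiguous A/C/G/T bases of seq."""
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


# Hypothetical input; real data would be read from a file instead.
example = """>contig_1 hypothetical example
ATGGCGTACGCTTAA
>contig_2
ATATATNNGC
"""

for name, seq in read_fasta(example.splitlines()):
    print(f"{name}\t{len(seq)}\t{gc_content(seq):.2f}")
```

Ambiguous bases such as N are left out of the GC calculation here. That is one possible design choice, and a real tool would need to state how it treats them.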
Each student will be provided with an uncharacterized sequence at the start of the semester,
and will annotate and analyze that sequence during the course. Both prokaryotic and
eukaryotic genomes, and prokaryotic and viral metagenomes, are available from a companion
course taught at SDSU in the spring semester of 2010 (that course has just concluded).
During that course the students generated approximately 5-fold coverage of the California
sea lion genome, 5 microbial genomes, and 8 metagenomes. These sequences are not yet
available in GenBank or other public repositories, and therefore provide an ideal
teaching tool.
Program: including dates and times
Week 1: The three domains of life; codon usage, genes and genomes, types of sequencing.
Week 2: Identifying protein-encoding sequences. Prokaryotic and eukaryotic ORF calling.
Week 3: Non-coding elements in sequences (rRNA, tRNA)
Week 4: Comparing sequences via global alignments
Week 5: Local sequence alignments with BLAST, gapped BLAST, PSI-BLAST
Week 6: Sequence Databases (NCBI, DDBJ, EMBL, SEED)
Week 7: Protein families; PIR superfamilies
Week 8: Building and using HMMs; Pfam, Rfam
Week 9: Subsystems and annotating genomes; Functional coupling
Week 10: Phage and virus genomes
Week 11: Standard and alternate phylogenetic trees (16S; phage proteomic tree)
Week 12: Comparative genomics I: DNA level comparisons; Mauve, ACT, MUMmer
Week 13: Comparative genomics II: Protein level comparisons; signature genes; ORFans
Week 14: Metagenomics I: Approaches and tools.
Week 15: Metagenomics II: MG-RAST, real time, CAMERA, etc.
Number of sessions and duration of each one.
The course will meet twice per week, for 2.5 hours per session.
Dynamics of the course and Mechanisms of evaluation.
This will be an interactive course taught via video conference from San Diego State
University (by Dr. Edwards), with local supervision and assistance (by Dr. Gutiérrez). In
addition, the course will include an intensive tutorial period in which Dr. Edwards will
spend 7-14 days in Cuernavaca giving daily tutorials to cover the material in more detail.
Students will be graded on both assignments and a final, semester-long project. The
assignments will build up to the final project, in which each student will describe the
sequence provided at the beginning of the course (see above). The student's grade will be
assigned from a combination of in-class participation (10%), three written assignments
(15% each), and the final project write-up (45%).
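The weighting above can be checked with a short calculation. The sketch below assumes that every component is scored on the same 0 to 10 scale as the final grade and that the final grade is a plain weighted average. Neither assumption, nor any rounding rule, is stated in this announcement, and the example scores are invented.

```python
# Weights taken from the grading scheme above; they should sum to 100%.
WEIGHTS = {
    "participation": 0.10,
    "assignment_1": 0.15,
    "assignment_2": 0.15,
    "assignment_3": 0.15,
    "final_project": 0.45,
}


def final_grade(scores):
    """Weighted average of component scores (assumed to be on a 0-10 scale)."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the graded components")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for one student.
example_scores = {
    "participation": 10,
    "assignment_1": 8,
    "assignment_2": 9,
    "assignment_3": 7,
    "final_project": 9,
}

print(f"weights sum to {sum(WEIGHTS.values()):.2f}")
print(f"final grade: {final_grade(example_scores):.2f}")
```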
Definition of Grades: Grades are reported as follows: 10 (outstanding achievement;
reserved for the highest accomplishment); 8 or 9 (average; awarded for
satisfactory performance); 6 or 7 (minimally passing); 5 (unacceptable; course must be
repeated for credit).
Space will be limited according to the capacity of the video conference facility.
Bibliography:
1. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403-410.
2. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389-3402.
3. Badger JH, Olsen GJ: CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 1999, 16(4):512-524.
4. Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL: BLAST+: architecture and applications. BMC Bioinformatics 2009, 10:421.
5. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL: Improved microbial gene identification with GLIMMER. Nucleic Acids Res 1999, 27(23):4636-4641.
6. Dowell RD, Eddy SR: Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics 2004, 5:71.
7. Eddy SR: Profile hidden Markov models. Bioinformatics 1998, 14(9):755-763.
8. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL et al: The Pfam protein families database. Nucleic Acids Res 2008, 36(Database issue):D281-288.
9. Krammer EM, Sebban P, Ullmann GM: Profile hidden Markov models for analyzing similarities and dissimilarities in the bacterial reaction center and photosystem II. Biochemistry 2009, 48(6):1230-1243.
10. Krause L, McHardy AC, Nattkemper TW, Puhler A, Stoye J, Meyer F: GISMO-gene identification using a support vector machine for ORF classification. Nucleic Acids Res 2007, 35(2):540-549.
11. Lee ZM, Bussema C, 3rd, Schmidt TM: rrnDB: documenting the number of rRNA and tRNA genes in bacteria and archaea. Nucleic Acids Res 2009, 37(Database issue):D489-493.
12. Lowe TM, Eddy SR: tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 1997, 25(5):955-964.
13. Osterman A, Overbeek R: Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol 2003, 7(2):238-251.
14. Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R et al: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33(17):5691-5702.
15. Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci U S A 1999, 96(6):2896-2901.
16. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195-197.
17. Sonnhammer EL, Eddy SR, Birney E, Bateman A, Durbin R: Pfam: multiple sequence alignments and HMM-profiles of protein domains. Nucleic Acids Res 1998, 26(1):320-322.
18. Tatusov RL, Koonin EV, Lipman DJ: A genomic perspective on protein families. Science 1997, 278(5338):631-637.
19. Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673-4680.
20. Wu CH, Nikolskaya A, Huang H, Yeh LS, Natale DA, Vinayaka CR, Hu ZZ, Mazumder R, Kumar S, Kourtesis P et al: PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res 2004, 32(Database issue):D112-114.
21. Xiong J: Essential bioinformatics. New York: Cambridge University Press; 2006.