Lecture 1

IBGP/BMI 730
Introduction to Bioinformatics
Director: Prof. Victor Jin
Basic Molecular Biology
 All living things are made of Cells
Prokaryote, Eukaryote
 Cell Signaling
 What is Inside the cell: From DNA, to RNA, to Proteins
Cells
 Fundamental working units of every living system.
 Every organism is composed of one of two
radically different types of cells:
prokaryotic cells or eukaryotic cells.
 Prokaryotes and Eukaryotes are descended from the
same primitive cell.
 All extant prokaryotic and eukaryotic cells are the
result of a total of 3.5 billion years of evolution.
Cell Structure
 A cell is the smallest structural unit of
an organism that is capable of
independent functioning
 All cells have some common features
Cell Cycle
 Born, eat, replicate, and die
The Tree of Life
According to the most recent evidence, the tree of life has three main branches: Archaea, Bacteria and Eukarya.
Prokaryotes include Archaea (“ancient ones”) and Bacteria.
Eukaryotes form the domain Eukarya, which includes plants, animals, fungi and certain algae.
Prokaryotes and Eukaryotes
Prokaryotes                                  Eukaryotes
Single cell                                  Single or multi cell
No nucleus                                   Nucleus
No organelles                                Organelles
One piece of circular DNA                    Chromosomes
No mRNA post-transcriptional modification    Exons/Introns splicing
Signaling Pathways: Control Gene Activity
 Instead of having brains, cells make decisions
through complex networks of chemical
reactions, called pathways
 Synthesize new materials
 Break other materials down for spare parts
 Signal to eat or die
An Example -- Cell Cycle Signaling
Cell Information and Machinery
 Cells store all the information needed to replicate themselves
 The human genome is around 3 billion base pairs long
 Almost every cell in the human body contains the same
set of genes
 But not all genes are used or expressed by those
cells
 Machinery:
 Collect and manufacture components
 Carry out replication
 Kick-start its new offspring
Terminology
 Genome: an organism’s genetic material
 Gene: a discrete unit of hereditary information located on
the chromosomes and consisting of DNA
 Genotype: the genetic makeup of an organism
 Phenotype: the physically expressed traits of an organism
 Nucleic acid: biological molecules (RNA and DNA) that allow
organisms to reproduce
 Amino acid: organic molecules that are the building blocks of proteins
 Protein: a large, complex molecule that is an essential part of
organisms, participates in every process within cells, and
carries out a particular function
Three critical molecules
 DNAs
 Hold information on how the cell works
 RNAs
 Act to transfer short pieces of information to
different parts of the cell
 Provide templates for protein synthesis
 Proteins
 Form enzymes that send signals to other cells and
regulate gene activity
 Form the body’s major components (e.g. hair, skin,
etc.)
Overview of DNA to RNA to Protein
 A gene is expressed in two steps
 Transcription: RNA synthesis
 Translation: Protein synthesis
DNA: the Genetic Makeup
 Genes are inherited and
are expressed
 genotype (genetic
makeup)
 phenotype (physical
expression)

On the left are the eye phenotypes
produced by the green and
black eye-color genes.
Central Dogmas of Molecular Biology
1) The concept of the gene was historically defined on the basis of
the genetic inheritance of a phenotype (Mendelian inheritance).
2) The DNA of an organism encodes its genetic information. It is
a double-stranded helix with a deoxyribose sugar-phosphate backbone
carrying four bases: Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
[Note that only 4 values need to be encoded (A, C, G, T), which can be
done using 2 bits. But to allow ambiguity letters
(like N, meaning any of the 4 nucleotides), one usually resorts to a 4-bit
alphabet.]
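The note on 2-bit versus 4-bit alphabets can be made concrete with a small sketch. The particular bit assignments below are assumptions chosen for illustration; real file formats may order the bits differently. The 4-bit code uses one bit per base, so an ambiguity letter is simply the OR of the bases it allows.

```python
# 2-bit codes: enough for the four unambiguous bases.
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# 4-bit codes: one bit per base, so any subset of bases can be represented.
# (This bit assignment is an assumption made for illustration.)
FOUR_BIT = {"A": 0b0001, "C": 0b0010, "G": 0b0100, "T": 0b1000}
FOUR_BIT["R"] = FOUR_BIT["A"] | FOUR_BIT["G"]  # purine: A or G
FOUR_BIT["Y"] = FOUR_BIT["C"] | FOUR_BIT["T"]  # pyrimidine: C or T
FOUR_BIT["N"] = 0b1111                         # any of the four bases


def pack_2bit(seq):
    """Pack an A/C/G/T string into one integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | TWO_BIT[base]  # raises KeyError on N, R, Y
    return value


def unpack_2bit(value, length):
    """Inverse of pack_2bit; the sequence length must be stored separately."""
    letters = "ACGT"
    bases = []
    for _ in range(length):
        bases.append(letters[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def matches(code, base):
    """True if the (possibly ambiguous) code allows the given base."""
    return bool(FOUR_BIT[code] & FOUR_BIT[base])
```

For example, `pack_2bit("ACGT")` stores four bases in a single byte, while `matches("N", "G")` is true and `matches("R", "C")` is false.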
3) Each base on one side of the double helix faces its complementary base:
A pairs with T, and G pairs with C.
4) Biochemical processes that copy DNA (replication and
transcription) always build the new strand from its 5′ end towards
its 3′ end, so sequences are read 5′ to 3′.
5) A gene can be located on either the "plus strand" or the
"minus strand". Rule 4 imposes the orientation of reading,
and rule 3 (complementarity) tells us to complement each base.
E.g.
If the sequence on the + strand is ACGTGATCGATGCTA, the –
strand must be read off by reading the complement of this
sequence going "backwards",
e.g. TAGCATCGATCACGT
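Rules 3 to 5 combine into the reverse-complement operation. A minimal sketch, assuming an upper-case sequence that contains only A, C, G and T (the function name is chosen here for illustration):

```python
# Translation table implementing rule 3: A<->T and C<->G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    # Complement every base (rule 3), then reverse the order (rule 4).
    return seq.translate(COMPLEMENT)[::-1]


plus_strand = "ACGTGATCGATGCTA"
print(reverse_complement(plus_strand))  # TAGCATCGATCACGT, as in the slide
```

Applying the function twice returns the original sequence, which is a quick sanity check on any implementation.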
6) DNA information is copied over to mRNA that acts
as a template to produce proteins.
We often concentrate on protein-coding genes, because
proteins are the building blocks of cells and the majority
of bio-active molecules (but let's not forget the various
RNA genes).
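The two steps, DNA to mRNA and mRNA to protein, can be sketched in a few lines. This is a simplified illustration that assumes the standard genetic code, a coding (+) strand as input, and translation starting at the first base; it ignores splicing, start-codon search and other details of real gene expression.

```python
from itertools import product

# Standard genetic code, listed in TCAG order; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand):
    """mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read codons from the first base until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("ATGGCCTAA")
print(mrna, translate(mrna))  # AUGGCCUAA MA
```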
Bioinformatics
Bioinformatics (computational biology) solves
biological problems on the molecular level with
the use of techniques including:
 applied mathematics
 statistics
 computer science
 artificial intelligence
Bioinformatics
Biological Data + Computer Calculations
Molecular Biology as an Information Science
Central Dogma of Molecular Biology
DNA -> RNA -> Protein -> Phenotype
Molecules: Sequence, Structure, Function
Processes: Mechanism, Specificity, Regulation
Central Paradigm for Bioinformatics
Genomic Sequence -> Transcript -> Protein Structure -> Protein Function
Large Amounts of
Information
Data Management
Computer Algorithms
Statistical Methods
Major research efforts
 Sequence alignment
 Gene finding
 Genome assembly
 RNA structure prediction
 Protein structure prediction
 Analysis of gene regulation
 Prediction of protein-protein interactions
 Modeling of evolution
Major research areas
Sequence analysis
Genome annotation
Computational evolutionary biology
Measuring biodiversity
Analysis of gene expression
Analysis of regulation
Analysis of protein expression
Analysis of mutations in cancer
Analysis of epigenetics in cancer
High-throughput in vivo binding analysis
Prediction of protein structure
Comparative genomics
Modeling biological systems
High-throughput image analysis
Protein-protein docking
Software and tools
Databases
Web services in bioinformatics
Data types
 DNA sequences
 RNA sequences
 Protein sequences
 Gene Expression
 cDNA, mRNA microarray data
 Now tiling array technology
 50 M data points to tile the human genome at ~50 bp resolution
 Can only sequence a genome once, but can do an endless variety of array
experiments
 Protein-DNA interactions
 ChIP-chip, ChIP-seq, ChIP-PET and so on
 Phenotype Experiments
 Knockouts (KOs)
 Protein Interactions
 Yeast two-hybrid
 Proteomics
Other Integrative Data
Information to
understand genomes
 Metabolic Pathways
 Regulatory Networks
 Signaling Networks
 Whole Organisms
Phylogeny
 The Literature
(MEDLINE)
GenBank Growth
GenBank Data
Year    Base Pairs    Sequences
1982        680338          606
1983       2274029         2427
1984       3368765         4175
1985       5204420         5700
1986       9615371         9978
1987      15514776        14584
1988      23800000        20579
1989      34762585        28791
1990      49179285        39533
1991      71947426        55627
1992     101008486        78608
1993     157152442       143492
1994     217102462       215273
1995     384939485       555694
1996     651972984      1021211
1997    1160300687      1765847
1998    2008761784      2837897
1999    3841163011      4864570
2000    8604221980      7077491
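One way to read the GenBank figures is as a doubling time. The sketch below assumes steady exponential growth between the first and last rows (1982 and 2000), which is a simplification; the function name is chosen for illustration.

```python
import math

# First and last rows of the GenBank base-pair counts listed above.
FIRST_YEAR, FIRST_BASE_PAIRS = 1982, 680_338
LAST_YEAR, LAST_BASE_PAIRS = 2000, 8_604_221_980


def doubling_time(year0, count0, year1, count1):
    """Years per doubling, assuming steady exponential growth."""
    doublings = math.log2(count1 / count0)
    return (year1 - year0) / doublings


years = doubling_time(FIRST_YEAR, FIRST_BASE_PAIRS, LAST_YEAR, LAST_BASE_PAIRS)
print(f"Base pairs doubled roughly every {years:.2f} years")
```

Under that assumption the base-pair count doubled roughly every 1.3 years over this period.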
Exponential Growth of Data Matched by
Development of Computer Technology
 CPU vs Disk & Internet
 Driving Force in
Bioinformatics
[Figure: Internet hosts, number of protein domain structures in PDB, and CPU instruction time (ns), plotted by year from 1979 to 1995]
Types of Relational databases
 The Internet can be thought of as one enormous
relational database.
 The “links”/URLs are the primary keys.
 SQL (Structured Query Language)
 Sybase, Oracle, Access (database systems);
Sybase is used at NCBI.
 SRS (Sequence Retrieval System, one type of database querying system in use
in Biology)
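To illustrate primary keys and SQL queries, here is a minimal sketch using Python's built-in sqlite3 module rather than Sybase or Oracle. The table names, column names and records are hypothetical and exist only for this example.

```python
import sqlite3

# In-memory database; the schema below is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE gene ("
    " gene_id INTEGER PRIMARY KEY,"
    " symbol TEXT NOT NULL,"
    " organism TEXT NOT NULL)"
)
conn.execute(
    "CREATE TABLE sequence ("
    " accession TEXT PRIMARY KEY,"
    " gene_id INTEGER NOT NULL REFERENCES gene(gene_id),"
    " bases TEXT NOT NULL)"
)

# Made-up records: each row is identified by its primary key.
conn.executemany(
    "INSERT INTO gene VALUES (?, ?, ?)",
    [(1, "geneA", "organism X"), (2, "geneB", "organism Y")],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?, ?)",
    [("ACC0001", 1, "ACGTGATCGATGCTA"), ("ACC0002", 2, "TTGACA")],
)

# Join the two tables through the gene_id key and filter by organism.
rows = conn.execute(
    "SELECT s.accession, g.symbol, LENGTH(s.bases)"
    " FROM sequence AS s JOIN gene AS g ON g.gene_id = s.gene_id"
    " WHERE g.organism = ? ORDER BY s.accession",
    ("organism X",),
).fetchall()
print(rows)  # [('ACC0001', 'geneA', 15)]
```

The join works because `gene_id` identifies exactly one row in `gene`, which is the role the slide assigns to links and URLs on the Internet.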
XML Database and vocabularies for life
science
 HTML: Hypertext Markup Language
 XML: a general-purpose specification for creating
custom markup languages. It is classified as an
extensible language, because it allows the user to
define the mark-up elements
 BSML: an extensible language specification and
container for bioinformatic data. BSML was
developed under a 1997 grant from the National
Human Genome Research Institute (NHGRI) as an
evolving public domain standard for the
bioinformatics community
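Because XML lets the user define the mark-up elements, a program only needs a generic parser plus knowledge of the chosen element names. The sketch below uses Python's standard xml.etree.ElementTree module; the `sequences` and `sequence` elements and their attributes are invented for illustration and are not BSML.

```python
import xml.etree.ElementTree as ET

# A made-up vocabulary: element and attribute names are hypothetical.
document = """<?xml version="1.0" encoding="UTF-8"?>
<sequences>
  <sequence id="seq1" molecule="DNA">ACGTGATCGATGCTA</sequence>
  <sequence id="seq2" molecule="RNA">AUGGCCUAA</sequence>
</sequences>"""

# Parse from bytes so the encoding declaration is honoured.
root = ET.fromstring(document.encode("utf-8"))

# Collect each record as id -> (molecule type, sequence text).
records = {
    element.get("id"): (element.get("molecule"), element.text)
    for element in root.iter("sequence")
}

for seq_id, (molecule, bases) in records.items():
    print(seq_id, molecule, len(bases))
```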
Examples of XML
 <?xml version="1.0" encoding="UTF-8"?>
 <element_name attribute_name="attribute_value">Element Content</element_name>
 <book>This is a book... </book>
Primary Databases
 A primary database is a repository of data
derived from experiments or from research
knowledge.
 GenBank (nucleotide repository)
 Protein DB, Swiss-Prot
 PDB (MMDB)
 PubMed (literature)
 Genome mapping databases
 KEGG database (pathways)
Secondary Databases
 A secondary database contains information derived
from other sources.
 RefSeq (curated collection of GenBank at NCBI)
 UniGene (clustering of ESTs at NCBI)
 GeneID (unique ID for each gene at NCBI)
 Organism-specific databases are often a mix
of primary and secondary.
Biological Databases
 Nucleotide databases:
 Genbank: International Collaboration
• NCBI (USA), EMBL (Europe), DDBJ (Japan and Asia)
• A “bank”: no curation. Submission to these
databases is required for publication in a journal.
 Organism specific databases
(Quick quiz: Find URLs using search engines)
• FlyBase
• ChickGBASE
• pigbase
• wormpep
• YPD (Yeast Protein Database)
• SGD(Saccharomyces Genome Database)
 Protein Databases:
 NCBI: More on next week
 Swiss Prot:(Free for academic use, otherwise
commercial. Licensing restrictions on discoveries made
using the DB. 1998 version free of any licensing)
• http://www.expasy.ch(latest pay version)
• NCBI has the latest free version.
• Translated Proteins from Genbank Submissions
 EMBL
• TrEMBL is a computer-annotated supplement of
SWISS-PROT that contains all the translations of
EMBL nucleotide sequence entries not yet integrated
in SWISS-PROT
 PIR
• Structure databases:
 PDB: Protein structure database.
• http://www.rcsb.org/pdb/
 MMDB: NCBI’s version of PDB with Entrez links.
• http://www.ncbi.nlm.nih.gov
• Genome mapping information:
 http://www.il-st-acad-sci.org/health/genebase.html
 NCBI (Human)
 Genome Centers:
Stanford, Washington University, UC Berkeley
 Research Centers and Universities
 Literature databases:
 NCBI: Pubmed: All biomedical literature.
• www.ncbi.nlm.nih.gov
• Abstracts and links to publisher sites for
 full text retrieval/ordering
 journal browsing.
 Publisher web sites.
 Biomednet: Commercial site for literature search.
 Pathways database:
 KEGG: Kyoto Encyclopedia of Genes and Genomes:
www.genome.ad.jp/kegg/kegg/html
 Genome Search and Visualization
database:
 UCSC Genome Browser (genome.ucsc.edu/)
Information techniques
 Databases
– Building, Querying
– Complex data
 Text String Comparison
– Text Search
– 1D Alignment
– Significance Statistics
– Alta Vista, grep
 Finding Patterns
– Machine Learning
– Clustering
– Data mining
 Geometry
– Robotics
– Graphics (Surfaces,
Volumes)
– Comparison and 3D
Matching
(Vision, recognition)
 Physical Simulation
– Newtonian Mechanics
– Electrostatics
– Numerical Algorithms
– Simulation
Bioinformatics as New Paradigm for
Scientific Computing
 Physics
 Prediction based on
physical principles
 EX: Exact Determination of
Rocket Trajectory
 Emphasizes:
Supercomputer, CPU
 Biology
 Classifying information and
discovering unexpected
relationships
 EX: Gene Expression Network
 Emphasizes: networks,
“federated” database
Topics -- Genome Sequence
 Finding Genes in Genomic DNA
 introns
 exons
 promoters
 Characterizing Repeats in Genomic DNA
 Statistics
 Patterns
 Duplications in the Genome
 Large scale genomic alignment
 Whole-Genome Comparisons
 Finding Structural RNAs
Topics -- Protein Sequence
 Sequence Alignment
 How to align two strings
optimally via Dynamic
Programming
 Local vs Global Alignment
 Suboptimal Alignment
 Hashing to increase speed
(BLAST, FASTA)
 Amino acid substitution
scoring matrices
 Multiple Alignment and
Consensus Patterns
 How to align more than one
sequence and then fuse the
result in a consensus
representation
 HMMs, Profiles
 Motifs
 Scoring schemes and Matching
statistics
 How to tell if a given alignment
or match is statistically
significant
 A P-value or an E-value?
 Score Distributions
 Low Complexity Sequences
 Evolutionary Issues
 Rates of mutation and change
Topics – Structures
 Secondary Structure
“Prediction”
 via Propensities
 Neural Networks,
Genetic Alg.
 Simple Statistics
 TM-helix finding
 Assessing Secondary
Structure Prediction
 Structure Prediction:
Protein vs RNA
 Tertiary Structure Prediction
 Fold Recognition
 Threading
 Ab initio
 Direct Function Prediction
 Active site identification
 Relation of Sequence Similarity to
Structural Similarity
Topics -- Structures
 Structure Comparison
 Structural Alignment
 Aligning sequences on the
basis of 3D structure.
 DP does not converge,
unlike sequences, what to
do?
 Other Approaches:
Distance Matrices, Hashing
 Fold Library
 Docking and Drug Design as
Surface Matching
 Basic Protein Geometry and
Least-Squares Fitting
 Distances, Angles, Axes, Rotations
 Calculating a helix axis in 3D via
fitting a line
 LSQ fit of 2 structures
 Molecular Graphics
 Calculation of Volume and Surface
 How to represent a plane
 How to represent a solid
 How to calculate an area
 Hinge prediction
 Packing Measurement
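The "Distances, Angles" item under basic protein geometry reduces to a little vector arithmetic on atomic coordinates. A minimal sketch, assuming atoms are given as (x, y, z) tuples; the helper names are chosen for illustration.

```python
import math


def subtract(p, q):
    """Vector from q to p."""
    return tuple(a - b for a, b in zip(p, q))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def distance(p, q):
    """Euclidean distance between two atoms."""
    return norm(subtract(p, q))


def angle(a, b, c):
    """Angle in degrees at atom b formed by the atoms a-b-c."""
    u, v = subtract(a, b), subtract(c, b)
    cos_theta = dot(u, v) / (norm(u) * norm(v))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))


print(distance((0, 0, 0), (3, 4, 0)))           # 5.0
print(angle((1, 0, 0), (0, 0, 0), (0, 1, 0)))   # approximately 90 degrees
```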
Topics – Function Genomics
 Expression Analysis
 Time Courses clustering
 Measuring differences
 Identifying Regulatory
Regions
 Large scale cross referencing
of information
 Function Classification and
Orthologs
 The Genomic vs. Single-molecule Perspective
 Genome Comparisons
 Ortholog Families, pathways
 Large-scale censuses
 Frequent Words Analysis
 Genome Annotation
 Identification of interacting
proteins
 Networks
 Global structure and local
motifs
 Structural Genomics
 Folds in Genomes, shared &
common folds
 Bulk Structure Prediction
 Genome Trees
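"Frequent Words Analysis" in the list above usually means counting k-mers, the overlapping words of length k in a sequence. A minimal sketch (function names chosen for illustration):

```python
from collections import Counter


def count_kmers(seq, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def most_frequent(seq, k):
    """Return all k-mers that share the highest count, sorted."""
    counts = count_kmers(seq, k)
    if not counts:
        return []
    top = max(counts.values())
    return sorted(kmer for kmer, n in counts.items() if n == top)


print(most_frequent("ACGTGATCGATGCTA", 2))  # ['AT', 'CG', 'GA', 'TG']
```

On a whole genome the same counting is typically done with more memory-efficient structures, but the idea is unchanged.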
Bioinformatics tools
 Sequence comparison (pairwise and multiple
alignments, e.g. ClustalW, Blastz)
 Phylogenetic reconstruction (e.g. Phylip, IQPNNI,
SplitsTree)
 Database search (e.g. BLAST, HMMer)
 Comparative sequence assembly (e.g. OSLay)
 Gene finding (e.g. genscan, FirstEF)
 Motif discovery (e.g. MEME, Weeder)
 Protein structure (e.g. CE)
Bioinformatics algorithms
Dynamic Programming
EM algorithms
Neural Networks
Hidden Markov Models
Support Vector Machine
Phylogenetic Trees
Clustering
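As an illustration of the first entry, dynamic programming, here is a sketch of global pairwise alignment in the style of Needleman-Wunsch. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for simplicity; real tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of three possible moves.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(global_alignment("ACGT", "AGT"))  # (2, 'ACGT', 'A-GT')
```

The table has (len(a)+1) x (len(b)+1) cells, so time and memory grow with the product of the sequence lengths, which is why the faster heuristic tools named in these slides, such as BLAST and FASTA, exist.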
Bioinformatics Topics?
 (YES?) Digital Libraries
 Automated Bibliographic Search and Textual Comparison
 Knowledge bases for biological literature
 (YES) Motif Discovery Using Gibbs Sampling
 (YES) Metabolic Pathway Simulation
 (YES) Gene identification by sequence inspection
 Prediction of splice sites
 (YES) Linkage Analysis
 Linking specific genes to various traits
 (YES) RNA structure prediction
 Identification in sequences
 (YES) Homology modeling