BIOINFORMATICS

REMINDERS
2nd Exam on Nov. 17
Coverage:
• Central Dogma of DNA
  – Replication
  – Transcription
  – Translation
• Cell structure and function
• Recombinant DNA technology and molecular biology
• Protein analysis

BIOINFORMATICS
Study of the structure of biological
information and biological systems
 Integrates theories and tools of
mathematics/statistics, computer
science and information technology
 Involves the use of hardware and
software to study vast amounts of
biological data

What is Bioinformatics?

the field of science in which biology,
computer science, and information
technology merge to form a single
discipline

application of information technology
to the storage, management and
analysis of biological information

facilitated by the use of computers
FUNCTIONS

• Data Management
  – Storage
  – Retrieval
• Data Analysis
*Literature/Bibliography, Sequence,
Structure, Taxonomy, Expression, etc.
BIOLOGICAL DATABASES
Systematic data storage/retrieval
 Maintained on a regular basis
 Can contain various types of data
(integration)

Sequence
 Structure
 Other pertinent information


Nucleotides and proteins are most
common
DATABASES

a large, organized body of persistent data,
usually associated with computerized
software designed to update, query, and
retrieve components of the data stored
within the system

Biological databases usually consist of the
nucleic acid sequences of the genetic
material of various organisms, as well as
protein sequences and structures
DATABASES


e.g. a nucleotide sequence database typically
contains information such as:
 contact name
 the input sequence with a description of the
type of molecule
 the scientific name of the source organism
from which it was isolated
Additional requirements:
 easy access to the information
 a method for extracting only that information
needed to answer a specific biological question
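
The fields and the retrieval requirement listed above can be sketched as a small record type with a filter. This is only an illustration: the class name `NucleotideRecord`, its field names, the helper `find_by_organism`, and the sample entries are all hypothetical and do not reflect the schema of any real database.

```python
from dataclasses import dataclass


@dataclass
class NucleotideRecord:
    """Hypothetical record holding the fields a sequence entry typically carries."""
    contact_name: str    # who submitted the entry
    sequence: str        # the input sequence
    molecule_type: str   # description of the molecule, e.g. "DNA" or "mRNA"
    organism: str        # scientific name of the source organism


def find_by_organism(records, organism):
    """Extract only the records needed for a question about one organism."""
    return [r for r in records if r.organism == organism]


# Made-up sample entries, for illustration only.
records = [
    NucleotideRecord("A. Researcher", "ATGGCC", "DNA", "Saccharomyces cerevisiae"),
    NucleotideRecord("B. Researcher", "AUGGCC", "mRNA", "Homo sapiens"),
]
yeast = find_by_organism(records, "Saccharomyces cerevisiae")
print(yeast)
```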
DATABASES
• Sequence
  – GenBank, European Nucleotide Archive (ENA) and
    DNA Data Bank of Japan (DDBJ); managed by the
    International Nucleotide Sequence Database
    Collaboration (INSDC)
  – UniGene
  – Saccharomyces Genome Database (SGD)
  – UniProtKB (UniProtKB/Swiss-Prot or
    UniProtKB/TrEMBL)
  – ExPASy
DATABASES

• Structure
  – Nucleic Acid Database (NDB)
  – Protein Data Bank (PDB)
  – Worldwide Protein Data Bank (wwPDB)
  – ExPASy

DATA MINING
Process by which testable hypotheses
are created regarding function/structure
of gene/protein of interest through
identifying similar sequences in “more
established” organisms
 Tools:

Text-term search
 Sequence similarity search

Machine Learning
Studies methods for designing computer
programs that improve their performance
based on past experience
 Why?

New methods are being introduced
 Old ones should be improved

“Units” of Information
DNA (genome)
 RNA (transcriptome)
 Protein (proteome)

What is Being Analyzed?
Sequence
 Structure
 Interactions
 Pathways
 Mutations/Evolution

Why?

• Increasing amount of biological
  information entails
  – Organization
  – Archiving
• Global unification/harmonization
• More biological discoveries
  – Functional/Structural similarities
  – Phylogenetic/Evolutionary patterns

Applications
Medicine
 Pharmaceuticals
 Biotechnology
 Agriculture

STRUCTURE DATABASES
Molecular Data
• When you draw a molecule,
  – You start with atoms
  – Then proceed with the structure
  – And the three-dimensional data
• What can be stored?
  – Coordinates
  – Sequences
  – Chemical graphs
    • Atoms and bonds
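
To make the idea of storing coordinates together with a chemical graph concrete, here is a minimal sketch using a water-like molecule. The coordinate values are approximate and purely illustrative, and the names `atoms`, `bonds`, `bond_length`, and `neighbors` are assumptions made for this example rather than part of any database format.

```python
import math

# Coordinates: atom label -> (x, y, z) in angstroms (illustrative values).
atoms = {
    "O": (0.000, 0.000, 0.000),
    "H1": (0.957, 0.000, 0.000),
    "H2": (-0.240, 0.927, 0.000),
}

# Chemical graph: atoms are the nodes, bonds are the edges.
bonds = [("O", "H1"), ("O", "H2")]


def bond_length(a, b):
    """Euclidean distance between two atoms, from their stored coordinates."""
    return math.dist(atoms[a], atoms[b])


def neighbors(atom):
    """Atoms directly bonded to the given atom, read off the graph."""
    result = []
    for a, b in bonds:
        if a == atom:
            result.append(b)
        elif b == atom:
            result.append(a)
    return result


for a, b in bonds:
    print(a, b, round(bond_length(a, b), 3))
```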
Databases
Protein Data Bank (PDB)
 Molecular Modeling Database (MMDB)

Techniques in the
Laboratory
X-ray Crystallography
 Nuclear Magnetic Resonance

Formats
PDB
 mmCIF
 MMDB

Structure Viewers
Cn3D
 RasMol
 WebMol
 Mage
 VRML
 CAD
 Swiss PDB Viewer

Promises of bioinformatics


Medicine
 Knowledge of protein structure facilitates drug
design
 Understanding of genomic variation allows the
tailoring of medical treatment to the individual’s
genetic make-up
 Genome analysis allows the targeting of
genetic diseases
 The effect of a disease or of a therapeutic on
RNA and protein levels can be elucidated
The same techniques can be applied to
biotechnology, crop and livestock improvement,
etc...
Challenges in bioinformatics


Explosion of information
 Need for faster, automated analysis to process
large amounts of data
 Need for integration between different types of
information (sequences, literature,
annotations, protein levels, RNA levels etc…)
 Need for “smarter” software to identify
interesting relationships in very large data sets
Lack of “bioinformaticians”
 Software needs to be easier to access, use
and understand
 Biologists need to learn about the software, its
limitations, and how to interpret its results
SEQUENCE ALIGNMENT
Two or More Sequences
Measure similarity
 Determine correspondences between
residues
 Find patterns of conservation
 Derive evolutionary relationships

Alignment

Correspondences of nucleotides/amino
acids in two sequences or more are
assigned
An assignment of correspondences that
preserves the order of the residues
within the sequences is an alignment
 Gaps are used to achieve this


Sequence alignment refers to the
identification of residue-residue
correspondences
Uses

• Homology
  – Similarities
  – “Ancestry”
• Genome annotation
  – Assigning structure and function to
    genes
• Database queries
  – For newly discovered/unknown
    sequences
Tools
• Dot Plots
  – Diagonal lines of dots showing similarities
    between two sequences
• Scoring Matrices
  – Score reflects quality of each possible
    alignment; best possible score is identified
  – Scoring scheme is crucial
  – PAM (Point Accepted Mutations) and
    BLOSUM (BLOCKS Substitution Matrix)
• Dynamic Programming
  – Algorithmic technique that reuses previous
    computations
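
As an illustration of the first tool, a dot plot can be sketched in a few lines: every cell where the two residues are identical gets a dot, so shared stretches show up as diagonal runs. The function names `dot_plot` and `render` and the example sequences are assumptions made for this sketch; practical dot-plot programs usually add windowing and thresholds to reduce noise.

```python
def dot_plot(seq_a, seq_b):
    """Return a grid where cell [i][j] is True when seq_a[i] equals seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the grid as text: '*' marks identical residues, '.' marks the rest."""
    matrix = dot_plot(seq_a, seq_b)
    lines = ["  " + " ".join(seq_b)]
    for residue, row in zip(seq_a, matrix):
        lines.append(residue + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Illustrative sequences; the shared regions appear as diagonal runs of '*'.
print(render("GATTACA", "GATTTACA"))
```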
Scoring

Penalties/Scores
Match (e.g. A – A)
 Mismatch (e.g. A – C)
 Gap (e.g. A – _)

• Linear Gap Penalty: uniform cost for every gap position
• Affine Gap Penalty: separate costs for gap existence
(opening) vs. gap extension
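
The difference between the two gap penalties can be shown with a short sketch. The numeric values are arbitrary assumptions, as are the function and parameter names; the affine form used here charges the opening cost for the first gap position and the extension cost for each further one, which is one common convention among several.

```python
def linear_gap_penalty(length, per_gap=-2):
    """Linear: every gap position costs the same (illustrative value)."""
    return per_gap * length


def affine_gap_penalty(length, gap_open=-5, gap_extend=-1):
    """Affine: opening a gap is costly, extending it is cheaper.

    Convention assumed here: the first position costs gap_open and each
    additional position costs gap_extend.
    """
    if length == 0:
        return 0
    return gap_open + gap_extend * (length - 1)


for n in (1, 3, 6):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

With these assumed values, one long gap is penalized less under the affine scheme than under the linear one once the gap grows past a few positions, which favors a single long gap over many scattered short ones.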
Local vs. Global Alignments

Global Alignment


Similarities across the entire length of two
sequences
Local Alignment

Similarities between specific parts of
two sequences
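
A local alignment score can be sketched with the Smith-Waterman recurrence, a technique the slides do not name but which is the standard dynamic-programming approach for this case. Cell scores are never allowed to drop below zero, so the best-scoring region can start and end anywhere. The scoring values and the function name `local_alignment_score` are assumptions for illustration, and only the score is computed, not the alignment itself.

```python
def local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score (Smith-Waterman recurrence, linear gaps)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                0,                            # restart: drop a poor prefix
                score[i - 1][j - 1] + pair,   # align the two residues
                score[i - 1][j] + gap,        # gap in seq_b
                score[i][j - 1] + gap,        # gap in seq_a
            )
            best = max(best, score[i][j])
    return best


# The two illustrative sequences share only the region GATTACA.
print(local_alignment_score("CCCCGATTACA", "GATTACATTTT"))
```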
Programs
Pairwise Sequence Alignment
 BLAST
 VAST
 FASTA
Multiple Sequence Alignment
 MAFFT
Needleman-Wunsch Algorithm
• Can be used for global alignments
• Maximum-value function
• A simple scoring scheme is assumed
• Three steps
  – Initialization
  – Matrix fill (scoring)
  – Traceback (alignment)
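
The three steps can be sketched as follows. This is a minimal version under a simple scoring scheme (match +1, mismatch -1, linear gap -1); those values and the function name `needleman_wunsch` are assumptions for illustration, and when several alignments tie for the best score the traceback returns just one of them.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1

    # Step 1: initialization. First row and column hold cumulative gap costs.
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Step 2: matrix fill. Each cell takes the maximum of its three options.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )

    # Step 3: traceback. Walk from the bottom-right corner back to the origin.
    aligned_a, aligned_b = [], []
    i, j = len(seq_a), len(seq_b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append("-")
            i -= 1
        else:
            aligned_a.append("-")
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(aligned_a)), "".join(reversed(aligned_b))


print(needleman_wunsch("ACGT", "AGT"))
```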