
Role of Computer and Information Science in
Biology
Presented By
Dr G. P. S. Raghava
Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India
&
Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea
Email: raghava@imtech.res.in
Web: http://www.imtech.res.in/raghava/
Major Applications & Challenges









Introduction to Biology
Genome Annotation: Gene Prediction
Analysis and Comparison of Sequences
Protein Structure Prediction
DNA Chip (Microarray) technology
Proteomics: Analysis of 2D gel
Fingerprinting Technique
Drug development
Computer-Aided Vaccine Design
Hierarchy in Biology
Atoms
Molecules
Macromolecules
Organelles
Cells
Tissues
Organs
Organ Systems
Individual Organisms
Populations
Communities
Ecosystems
Biosphere
Animal cell
Human Chromosomes
Genes are linearly arranged along chromosomes
Chromosomes and DNA
DNA can be simplified to a
string of four letters
GATTACA
(RT)
Sequence to Structure:
It’s a matter of dimensions!

1D Nucleic acid sequence
AGT-TTC-CCA-GGG…

1D Protein sequence
Met-Ala-Gly-Lys-His…
M – A – G – K – H…

3D Spatial arrangement of atoms
Genome Annotation
The Process of Adding Biological Information and
Predictions to a Sequenced Genome Framework
Importance of Sequence Comparison

Protein Structure Prediction
– Similar sequences have similar structure & function
– Phylogenetic Tree
– Homology based protein structure prediction

Genome Annotation
– Homology based gene prediction
– Function assignment & evolutionary studies

Searching drug targets
– Searching for sequences present or absent across
genomes
Protein Sequence Alignment and Database
Searching
Alignment
of Two Sequences (Pair-wise Alignment)
– The Scoring Schemes or Weight Matrices
– Techniques of Alignments
– DOTPLOT
Multiple Sequence Alignment (Alignment of > 2 Sequences)
–Extending Dynamic Programming to more sequences
–Progressive Alignment (Tree or Hierarchical Methods)
–Iterative Techniques

Stochastic Algorithms (SA, GA, HMM)
Non Stochastic Algorithms
Database Scanning
– FASTA, BLAST, PSIBLAST, ISS
 Alignment of Whole Genomes
– MUMmer (Maximal Unique Match)

Alignment of Two Sequences
Dealing with Gaps in Pair-wise Alignment
Sequence Comparison without Gaps
Sliding-window method to get the maximum score
ALGAWDE
ALATWDE
Total score = 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5*100)/7 ≈ 71.4
Sequences of different lengths should be aligned with dynamic programming
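The ungapped comparison above can be sketched in a few lines of Python. This is only an illustration: the function names `identity_score` and `best_ungapped_offset` are hypothetical, and the scoring (1 for an identical residue, 0 otherwise) is taken from the worked example on the slide.

```python
def identity_score(a: str, b: str) -> int:
    """Count identical positions in two ungapped, equal-length sequences."""
    return sum(x == y for x, y in zip(a, b))


def best_ungapped_offset(a: str, b: str) -> tuple[int, int]:
    """Slide b along a (no gaps) and return (best score, offset of b relative to a)."""
    best = (0, 0)
    for offset in range(-(len(b) - 1), len(a)):
        lo = max(0, offset)
        hi = min(len(a), offset + len(b))
        score = sum(a[i] == b[i - offset] for i in range(lo, hi))
        if score > best[0]:
            best = (score, offset)
    return best


score = identity_score("ALGAWDE", "ALATWDE")
pid = 100 * score / len("ALGAWDE")  # percent identity
print(score, round(pid, 1))
```

For sequences of different lengths, or when insertions and deletions are expected, this sliding comparison is not sufficient and dynamic programming is used instead.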
Sequence Comparison with Gaps
•Insertions and deletions are common
•Sliding-window method fails
•Generate all possible alignments
•An alignment of 100 residues requires examining > 10^75 possibilities
Alternate Dot Matrix Plot
Diagonal runs of * show aligned/identical regions
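A dot matrix plot is easy to build by hand. The sketch below is a minimal illustration, not the tool used in the talk; the function name `dotplot` and the use of "." for non-matching cells are assumptions.

```python
def dotplot(a: str, b: str) -> list[str]:
    """One row per residue of a, one column per residue of b; '*' marks identity."""
    return ["".join("*" if x == y else "." for y in b) for x in a]


rows = dotplot("ALGAWDE", "ALATWDE")
for residue, row in zip("ALGAWDE", rows):
    print(residue, row)

# Cells on the main diagonal: runs of '*' are the aligned/identical regions.
main_diagonal = "".join(rows[i][i] for i in range(len(rows)))
```

Here the main diagonal carries two runs of identical residues separated by two mismatches, which is the same picture as the ungapped comparison of these two sequences.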
Dynamic Programming





Dynamic programming allows optimal alignment
between two sequences
Allows insertions and deletions, i.e. alignment with gaps
Needleman and Wunsch algorithm (1970) for global
alignment
Smith & Waterman algorithm (1981) for local
alignment
Important Steps
– Create DOTPLOT between two sequences
– Compute SUM matrix
– Trace Optimal Path
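The steps above (compare residues, compute the SUM matrix, trace the optimal path) can be sketched as a small global alignment in the style of Needleman and Wunsch. The scoring values (match +1, mismatch -1, linear gap -1) and the function name are assumptions chosen for illustration; real tools use substitution matrices such as PAM or BLOSUM and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Dot-plot step: do residues a[i-1] and b[j-1] agree?
        return match if a[i - 1] == b[j - 1] else mismatch

    # SUM matrix: score[i][j] is the best score for a[:i] against b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace the optimal path back from the bottom-right corner.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


best, top, bottom = needleman_wunsch("ALGAWDE", "ALATWDE")
print(best)
print(top)
print(bottom)
```

With these assumed scores, opening two gaps lets six residues match, which beats the ungapped alignment of the same pair. Several alignments can share the optimal score; which one is returned depends on the tie-breaking order in the traceback.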
Alignment of Multiple Sequences
Extending Dynamic Programming to more sequences
–Dynamic programming can be extended to more than two sequences
–In practice it requires too much CPU time and memory (Murata et al., 1985)
–MSA: limited to 8-10 sequences (1989)
–DCA (Divide and Conquer; Stoye et al., 1997), 20-25 sequences
–OMA (Optimal Multiple Alignment; Reinert et al., 2000)
–COSA (Althaus et al., 2002)
Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
–Practical approach for multiple alignment
–Compare all sequences pair wise
–Perform cluster analysis
–Generate a hierarchy for alignment
–First align the most similar pair of sequences
–Then align that alignment with the next most similar alignment or sequence
Database scanning
Basic principles of Database searching
– Search the query sequence against all sequences in the database
– Calculate scores and select the top-scoring sequences
– Dynamic programming gives the best alignments
Approximation Algorithms
FASTA
Fast sequence search
Based on dotplot
Identify identical words (k-tuples)
Search significant diagonals
Use PAM 250 for further refinement
Dynamic programming for narrow region
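The first two FASTA steps (identify identical words, then look for diagonals with many hits) can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `ktuple_diagonals` is hypothetical, hits are simply counted per diagonal, and the PAM 250 rescoring and banded dynamic programming steps are not implemented.

```python
from collections import Counter, defaultdict


def ktuple_diagonals(query: str, target: str, k: int = 2) -> Counter:
    """Count identical k-tuples per diagonal (target position minus query position)."""
    # Index every word (k-tuple) of the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)

    # Each shared word at query position i and target position j lies on diagonal j - i.
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


hits = ktuple_diagonals("ALGAWDE", "ALATWDE", k=2)
shifted = ktuple_diagonals("WDE", "ALATWDE", k=2)
print(hits.most_common(1), shifted.most_common(1))
```

A diagonal with many word hits marks a region worth refining; restricting dynamic programming to a narrow band around it is what makes the search fast.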
Principles of FASTA Algorithms
Database Scanning or Fold Recognition

Concept of PSIBLAST
– Improve the sensitivity of BLAST
– Perform the BLAST search (gap handling)
– Generate the position-specific score matrix (PSSM)
– Use PSSM for next round of search
Intermediate Sequence Search
– Search query against protein database
– Generate multiple alignment or profile
– Use profile to search against PDB
Comparison of Whole Genomes

MUMmer (Salzberg group, 1999, 2002)
– Pair-wise sequence alignment of genomes
– Assumes that the sequences are closely related
– Allows detection of repeats, inverted repeats, and SNPs
– Domains inserted/deleted
– Identifies the exact matches
How it works
– Identify the maximal unique matches (MUMs) in two genomes
– Because the two genomes are similar, large MUMs will be present
– Sort the MUMs found and extract the longest set of matches that occur in the same order (ordered MUMs)
– A suffix tree is used to identify MUMs
– Close the gaps between MUMs (SNPs, large inserts)
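The "sort and extract the longest set of matches in the same order" step is a longest-increasing-subsequence problem. The sketch below assumes each MUM is given as a hypothetical tuple `(position in genome A, position in genome B, length)` and maximises the number of matches with a simple quadratic algorithm. It does not find the MUMs themselves (that is the suffix tree's job), and it is not MUMmer's own implementation.

```python
def ordered_mums(mums):
    """Longest chain of matches whose positions increase in both genomes."""
    mums = sorted(mums)              # sort by position in genome A
    n = len(mums)
    if n == 0:
        return []
    best = [1] * n                   # best[i]: longest chain ending at match i
    prev = [-1] * n                  # back-pointers for reconstruction
    for i in range(n):
        for j in range(i):
            if mums[j][1] < mums[i][1] and best[j] + 1 > best[i]:
                best[i] = best[j] + 1
                prev[i] = j
    i = max(range(n), key=best.__getitem__)
    chain = []
    while i != -1:
        chain.append(mums[i])
        i = prev[i]
    return chain[::-1]


# Made-up example: the second match sits far out of order in genome B.
matches = [(10, 12, 20), (40, 300, 15), (70, 75, 30), (120, 130, 25)]
chain = ordered_mums(matches)
print(chain)
```

Matches left out of the chain, such as the second one here, are candidates for rearrangements or repeats; the regions between consecutive chained matches are the gaps that are closed afterwards.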
Protein Structure Prediction

Experimental Techniques
– X-ray Crystallography
– NMR

Limitations of Current Experimental
Techniques
– Protein DataBank (PDB) -> 24000 protein structures
– SwissProt -> 100,000 proteins
– Non-Redundant (NR) -> 1,000,000 proteins

Importance of Structure Prediction
– Fill gap between known sequence and structures
– Protein engineering to alter the function of a protein
– Rational Drug Design
Protein Structures
Techniques of Structure Prediction

Computer simulation based on energy calculation
– Based on physico-chemical principles
– Thermodynamic equilibrium with a minimum free energy
– Global minimum of the protein's free energy surface

Knowledge Based approaches
– Homology Based Approach
– Threading Protein Sequence
– Hierarchical Methods
Energy Minimization Techniques
Energy minimization based methods, in their pure form, make no
a priori assumptions and attempt to locate the global minimum.
Static Minimization Methods
– Classical many-body potentials can be constructed
– Assume that the atoms in the protein are static
– Problems (large number of variables and minima; validity of
potentials)
Dynamical Minimization Methods
– Motions of atoms are also considered
– Monte Carlo simulation (stochastic in nature; time is not
considered)
– Molecular Dynamics (time-dependent; quantum mechanical or classical
equations)
Limitations
– Large number of degrees of freedom; CPU power not adequate
– Interaction potentials are not good enough to model
Knowledge Based Approaches


Homology Modelling
– Need homologues of known protein structure
– Backbone modelling
– Side chain modelling
– Fail in absence of homology
Threading Based Methods
– New way of fold recognition
– Sequence is tried to fit in known structures
– Motif recognition
– Loop & Side chain modelling
– Fail in absence of known example
Hierarchical Methods
Intermediate structures are predicted, instead of
predicting the tertiary structure of a protein directly from its amino
acid sequence
 Prediction of backbone structure
– Secondary structure (helix, sheet,coil)
– Beta Turn Prediction
– Super-secondary structure


Tertiary structure prediction
Limitation
Accuracy is only 75-80 %
Only three state prediction
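[Figure: cDNA microarray workflow. cDNA clones (probes) undergo PCR product amplification and purification and are printed (0.1 nl/spot) onto the microarray; the mRNA target is hybridised to the microarray; the slide is scanned with laser 1 and laser 2 (excitation, emission); the images are overlaid and normalised; analysis follows.]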
Major Applications



– Identification of differentially expressed genes in diseased tissues (in the presence of a drug)
– Classification of differentially expressed genes, or clustering/grouping of genes that behave similarly under different conditions
– Use of expression profiles of known diseases to diagnose disease and to classify unknown genes
Terms/Jargons
Stanford/cDNA chip
– one slide/experiment
– one spot
– 1 gene => one spot or few spots (replica)
– control: control spots
– control: two fluorescent dyes (Cy3/Cy5)
Affymetrix/oligo chip
– one chip/experiment
– one probe/feature/cell
– 1 gene => many probes (20~25 mers)
– control: match and mismatch cells
Images : examples
[Figure: Cy3 and Cy5 channel images and their pseudo-colour overlay]
Spot colour   Signal strength       Gene expression
yellow        Control = perturbed   unchanged
red           Control < perturbed   induced
green         Control > perturbed   repressed
Processing of images

Addressing or gridding
– Assigning coordinates to each of the spots

Segmentation
– Classification of pixels either as foreground or as
background

Intensity determination for each spot
– Foreground fluorescence intensity pairs (R, G)
– Background intensities
– Quality measures
Management of Microarray Data

Magnitude of Data
– Experiments






50,000 genes in human
320 cell types
2000 compounds
3 time points
2 concentrations
2 replicates
– Data Volume


4*10^11 data points
10^15 bytes = 1 petabyte of data
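The data-point count follows directly from multiplying the experimental factors listed above. The short calculation below reproduces it; the variable names are illustrative. The slide does not say how it gets from data points to a petabyte, so the bytes-per-data-point figure is only what the two numbers imply (presumably raw image data), not something stated in the source.

```python
genes = 50_000
cell_types = 320
compounds = 2_000
time_points = 3
concentrations = 2
replicates = 2

data_points = (genes * cell_types * compounds
               * time_points * concentrations * replicates)
print(f"{data_points:.2e} data points")   # on the order of 4*10^11

# Implied storage per data point if the total is 10^15 bytes (assumption, see above).
implied_bytes_per_point = 10**15 / data_points
print(round(implied_bytes_per_point), "bytes per data point")
```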
Management of Microarray
Data
Major Issues

Large volume of microarray data in last few years
– Storage and efficient access
– Comparison and integration of data

Problem of data access and exchange
– Data scattered around Internet
– Supplementary material of publications
– Difficult for users to access relevant data

Problems with existing databases
– Diverse purpose
– Developed for specific purpose
Management of Microarray
Data

Specific Database
– Platform (e.g. Stanford Microarray Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)

Problem with Supplement and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data
Pre-processed cDNA Gene
Expression Data
For p genes on n slides: p is O(10,000), n is O(10-100), but growing.
                    Slides
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1        0.46      0.30      0.80      1.51      0.90     ...
2       -0.10      0.49      0.24      0.06      0.46     ...
3        0.15      0.74      0.04      0.10      0.20     ...
4       -0.45     -1.03     -0.79     -0.56     -0.32     ...
5       -0.06      1.06      1.35      1.09     -1.09     ...
Gene expression level of gene 5 in slide 4 = Log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.
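The log-ratio and its colour coding can be written out directly. This is a minimal sketch: the function names and the tolerance parameter `tol` (how close to zero still counts as "unchanged") are assumptions for illustration, and the intensities in the example are made up.

```python
import math


def log_ratio(red: float, green: float) -> float:
    """Expression value of one spot: log2(red intensity / green intensity)."""
    return math.log2(red / green)


def spot_colour(m: float, tol: float = 0.0) -> str:
    """Map a log-ratio to the conventional display colour."""
    if m > tol:
        return "red"      # > 0
    if m < -tol:
        return "green"    # < 0
    return "yellow"       # 0


for red, green in [(800.0, 400.0), (100.0, 400.0), (500.0, 500.0)]:
    m = log_ratio(red, green)
    print(red, green, m, spot_colour(m))
```

Working on the log2 scale makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero.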
Analysis of Microarray Data



Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of Background Noise
– Global/local Normalization
– House keeping genes (or same gene)
– Expression in ratio (test/references) in log
Differential Gene expression
– Use replicates to calculate significance (t-test)
– Assess the significance of fold change with statistical methods


Clustering
– Supervised/Unsupervised (Hierarchical, K-means,
SOM)
Prediction or Supervised Machine Learning (SVM)
Normalization Techniques


Global normalization
– Divide channel value by means
Control spots
– Common spots in both channels
– House keeping genes
– Ratio of intensity of same gene in two channel is used for
correction


Iterative linear regression
Parametric nonlinear normalization
– log(CY3/CY5) vs log(CY5)
– Fitted log ratio – observed log ratio

General Non Linear Normalization
– LOESS
– Fit a curve of log(R/G) vs log(sqrt(R*G))
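Two of the quantities above are simple enough to compute by hand: global normalization (divide each channel by its mean) and the pair log(R/G), log(sqrt(R*G)) on which a LOESS curve is fitted. The sketch below uses only the standard library; the function names are hypothetical, base-2 logarithms are an assumption, and the LOESS fit itself is not implemented.

```python
import math
from statistics import mean


def global_normalize(channel: list[float]) -> list[float]:
    """Global normalization: divide every value in a channel by the channel mean."""
    m = mean(channel)
    return [v / m for v in channel]


def log_ratio_vs_intensity(red: list[float], green: list[float]) -> list[tuple[float, float]]:
    """Per spot: (log2(R/G), log2(sqrt(R*G))), the two axes of the LOESS plot."""
    return [(math.log2(r / g), 0.5 * math.log2(r * g)) for r, g in zip(red, green)]


normalized = global_normalize([1.0, 2.0, 3.0])
points = log_ratio_vs_intensity([8.0, 4.0], [2.0, 4.0])
print(normalized, points)
```

An intensity-dependent normalization would then fit a smooth curve through these points and subtract the fitted log ratio from the observed one for each spot.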
Classification
Task: assign objects to classes (groups) on the basis of measurements made on the objects
Unsupervised: classes are unknown; we want to discover them from the data (cluster analysis)
Supervised: classes are predefined; we want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Issues in Clustering

Pre-processing (Image analysis and
Normalization)

Which genes (variables) are used
Which samples are used
 Which distance measure is used
 Which algorithm is applied
 How to decide the number of clusters K

Unsupervised Learning







Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [It does not answer the question "how many gene clusters are there?"]
K-means clustering: assuming there are K clusters. [What if this assumption is incorrect?]
Model-based clustering: the number of clusters is determined dynamically. [Could be one of the most promising methods.]
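To make the "assume there are K clusters" idea concrete, here is a minimal K-means sketch in plain Python. It is an illustration under stated assumptions: squared Euclidean distance, the first K points as initial centroids (real implementations usually randomise this), a fixed number of iterations, and made-up two-slide expression profiles.

```python
def kmeans(points, k, iterations=20):
    """Return (labels, centroids); labels come from the last assignment step."""
    centroids = [list(p) for p in points[:k]]      # naive deterministic start
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for idx, p in enumerate(points):
            dists = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
            labels[idx] = dists.index(min(dists))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


profiles = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0)]
labels, centroids = kmeans(profiles, k=2)
print(labels, centroids)
```

The result depends on K and on the starting centroids, which is exactly the caveat raised above: the algorithm always returns K clusters whether or not the data contain that many groups.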
Supervised Analysis
Fisher’s linear discriminant
analysis
 Quadratic discriminant analysis
 Logistic regression (a linear
discriminant analysis)
 Neural networks
 Support vector machine

Traditional Proteomics
1D gel electrophoresis (SDS-PAGE)
 2D gel electrophoresis
 Protein Chips

– Chips coated with proteins/Antibodies
– large scale version of ELISA

Mass Spectrometry
– MALDI: Mass fingerprinting
– Electrospray and tandem mass
spectrometry


Sequencing of Peptides (N->C)
Matching in Genome/Proteome Databases
Overview of 2D Gel

SDS-PAGE + Isoelectric focusing (IEF)
– Gene Expression Studies
– Medical Applications
– Sample Experiments

Capturing and Analyzing Data
– Image Acquisition
– Image Sizing & Orientation
– Spot Identification
– Matching and Analysis
Comparison/Matching of Gel Images

Compare 2 gel images
– Set X and Y axes
– Overlap matching spots
– Compare intensity of spots

Scan against database
– Compare query gel with all gels
– Calculate similarity score
– Sort based on score
Proteomics: Fingerprints of Disease
[Figure: Normal Cells vs Disease Cells]
Phenotypic Changes
•Differential protein expression
•Protein nitration patterns
•Altered phosphorylation
•Altered glycosylation profiles
Utility
•Target discovery
•Disease pathways
•Disease biomarkers
Fingerprinting Technique

What is fingerprinting
– It is a technique to create a specific pattern for a given
organism/person
– To compare the patterns of query and target objects
– To create a phylogenetic tree/classification based on patterns

Type of Fingerprinting
– DNA Fingerprinting
– Mass/peptide fingerprinting
– Properties based (Toxicity, classification)
– Domain/conserved pattern fingerprinting
Common Applications
– Paternity and Maternity
– Criminal Identification and Forensics
– Personal Identification
– Classification/Identification of organisms
– Classification of cells
Fingerprinting Techniques: Principles & Applications



What is fingerprinting
Type of Fingerprinting
Common Applications
Role of Computer in DNA Fingerprinting
– Searching Restriction Enzymes
– Searching VNTRs
– Computation of size of DNA fragments
– Optimization of gels
– Comparison of patterns
– Creation of Phylogenetic tree
Drug Design
History of Drug/Vaccine development
– Plants or Natural Product




Plants and natural products were the source of medicinal substances
Example: foxglove was used to treat congestive heart failure
Foxglove (Digitalis) contains cardiotonic glycosides
Identification of the active component
– Accidental Observations






Penicillin is one good example
Alexander Fleming observed the effect of a mold
The mold (Penicillium) produces the substance penicillin
The discovery of penicillin led to large-scale screening
Soil microorganisms were grown and tested
Streptomycin, neomycin, gentamicin, tetracyclines etc.
Drug Design

Chemical Modification of Known Drugs
– Drug improvement by chemical modification
– Penicillin G -> Methicillin; morphine -> nalorphine

Receptor Based drug design
– Receptor is the target (usually a protein)
– Drug molecule binds to cause biological effects
– It is also called lock and key system
– Structure determination of receptor is important
Ligand-based drug design
– Search for a lead compound or active ligand
– Structure of the ligand guides the drug design process
Drug Design based on Bioinformatics Tools

Detect the Molecular Bases for Disease
– Detection of drug binding site
– Tailor drug to bind at that site
– Protein modeling techniques
– Traditional Method (brute force testing)

Rational drug design techniques
– Screen likely compounds built
– Modeling large number of compounds (automated)
– Application of Artificial intelligence
– Limitation of known structures
Important Points in Drug Design based on
Bioinformatics Tools

Application of Genome
– 3 billion base pairs
– 30,000 unique genes
– Any gene may be a potential drug target
– ~500 unique targets
– There may be 10 to 100 variants at each target gene
– 1.4 million SNPs
– 10^200 potential small molecules
Concept of Drug and Vaccine

Concept of Drug
– Kill invading foreign pathogens
– Inhibit the growth of pathogens

Concept of Vaccine
– Generate memory cells
– Train the immune system to face various
existing disease agents
VACCINES
A. SUCCESS STORY:
• COMPLETE ERADICATION OF SMALLPOX
• WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
• SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES: DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS
B. NEED OF THE HOUR
1) SEARCH FOR EFFECTIVE VACCINES, NOT YET AVAILABLE, FOR DISEASES LIKE MALARIA, TUBERCULOSIS AND AIDS
2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT VACCINES
3) LOW COST
4) EFFICIENT DELIVERY TO THE NEEDY
5) REDUCTION OF ADVERSE SIDE EFFECTS
Computer Aided Vaccine Design

Whole Organism of Pathogen
– Consists of more than 4000 genes and proteins
– Genomes have millions of base pairs

Target antigen to recognise pathogen
– Search vaccine target (essential and non-self)
– Consists of an amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T ……)

Search antigenic region (peptide of length
9 amino acids)
Major steps of endogenous antigen processing
Computer Aided Vaccine Design

Problem of Pattern Recognition
– ATGGTRDAR   Epitope
– LMRGTCAAY   Non-epitope
– RTTGTRAWR   Epitope
– EMGGTCAAY   Non-epitope
– ATGGTRKAR   Epitope
– GTCVGYATT   Epitope
Commonly used techniques
– Statistical (Motif and Matrix)
– AI Techniques
Why computational tools are required for prediction.
[Figure: a 200 aa protein is chopped into overlapping peptides of 9 amino acids (192 peptides); bioinformatics tools reduce these to 10-20 predicted peptides; in vitro or in vivo experiments then detect which snippets of protein will spark an immune response.]
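The peptide count in this workflow is simple arithmetic: a protein of length N contains N - 9 + 1 overlapping 9-mers. A minimal sketch, with a hypothetical function name and a placeholder sequence:

```python
def overlapping_peptides(sequence: str, length: int = 9) -> list[str]:
    """All overlapping peptides of the given length, shifted by one residue each."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


protein = "A" * 200          # placeholder 200-residue sequence, not a real protein
peptides = overlapping_peptides(protein)
print(len(peptides))         # 200 - 9 + 1 candidates to test

short = overlapping_peptides("ATGGTRDARK")
```

Testing every one of these candidates experimentally is the cost that prediction tools aim to cut down to the 10-20 most promising peptides.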
Thanks