Role of Computer and Information Science in Biology
Presented by Dr G. P. S. Raghava
Co-ordinator, Bioinformatic Centre, IMTECH, Chandigarh, India & Visiting Professor, Pohang Univ. of Science & Technology, Republic of Korea
Email: raghava@imtech.res.in
Web: http://www.imtech.res.in/raghava/

Major Applications & Challenges
Introduction to Biology
Genome Annotation: Gene Prediction
Analysis and Comparison of Sequences
Protein Structure Prediction
DNA Chip (Microarray) technology
Proteomics: Analysis of 2D gel
Fingerprinting Technique
Drug development
Computer-Aided Vaccine Design

Hierarchy in Biology
Atoms → Molecules → Macromolecules → Organelles → Cells → Tissues → Organs → Organ Systems → Individual Organisms → Populations → Communities → Ecosystems → Biosphere

Animal cell

Human Chromosomes
Genes are linearly arranged along chromosomes

Chromosomes and DNA

DNA can be simplified to a string of four letters
GATTACA (RT)

Sequence to Structure: It's a matter of dimensions!
1D Nucleic acid sequence: AGT-TTC-CCA-GGG…
1D Protein sequence: Met-Ala-Gly-Lys-His… (M – A – G – K – H…)
3D Spatial arrangement of atoms

Genome Annotation
The process of adding biological information and predictions to a sequenced genome framework

Importance of Sequence Comparison
Protein Structure Prediction
– Similar sequences have similar structure & function
– Phylogenetic tree
– Homology-based protein structure prediction
Genome Annotation
– Homology-based gene prediction
– Function assignment & evolutionary studies
Searching drug targets
– Searching for sequences present or absent across genomes

Protein Sequence Alignment and Database Searching
Alignment of Two Sequences (Pair-wise Alignment)
– The scoring schemes or weight matrices
– Techniques of alignment
– DOTPLOT
Multiple Sequence Alignment (Alignment of > 2 Sequences)
– Extending dynamic programming to more sequences
– Progressive alignment (tree or hierarchical methods)
– Iterative techniques
  Stochastic algorithms (SA, GA, HMM)
  Non-stochastic algorithms
Database Scanning –
FASTA, BLAST, PSIBLAST, ISS
Alignment of Whole Genomes
– MUMmer (Maximal Unique Match)

Alignment of Two Sequences
Dealing with Gaps in Pair-wise Alignment

Sequence Comparison without Gaps
Sliding window method to get the maximum score
ALGAWDE
ALATWDE
Total score = 1+1+0+0+1+1+1 = 5; percent identity (PID) = (5*100)/7 ≈ 71.4
Sequences of variable length should use dynamic programming

Sequence Comparison with Gaps
• Insertions and deletions are common
• Sliding window method fails
• Generate all possible alignments
• Two 100-residue sequences have > 10^75 possible alignments

Alternate: Dot Matrix Plot
Diagonal * shows aligned/identical regions

Dynamic Programming
Dynamic programming allows optimal alignment between two sequences
Allows insertions and deletions, i.e. alignment with gaps
Needleman and Wunsch algorithm (1970) for global alignment
Smith & Waterman algorithm (1981) for local alignment
Important Steps
– Create DOTPLOT between two sequences
– Compute SUM matrix
– Trace optimal path

Alignment of Multiple Sequences
Extending dynamic programming to more sequences
– Dynamic programming can be extended to more than two sequences
– In practice it is limited by CPU time and memory (Murata et al 1985)
– MSA, limited to 8-10 sequences (1989)
– DCA (Divide and Conquer; Stoye et al., 1997), 20-25 sequences
– OMA (Optimal Multiple Alignment; Reinert et al., 2000)
– COSA (Althaus et al., 2002)
Progressive or Tree or Hierarchical Methods (CLUSTAL-W)
– Practical approach for multiple alignment
– Compare all sequences pair-wise
– Perform cluster analysis
– Generate a hierarchy for alignment
– First align the most similar pair of sequences
– Align that alignment with the next most similar alignment or sequence

Database scanning
Basic principles of database searching
– Search query sequence against all sequences in the database
– Calculate score and select top sequences
– Dynamic programming is best
Approximation Algorithms
FASTA
– Fast sequence search
– Based on dotplot
– Identify identical words (k-tuples)
– Search significant diagonals
– Use PAM 250 for further refinement
– Dynamic programming for
narrow region

Principles of FASTA Algorithms

Database Scanning or Fold Recognition
Concept of PSIBLAST
– Improve the sensitivity of BLAST
– Perform the BLAST search (gap handling)
– Generate the position-specific score matrix (PSSM)
– Use PSSM for next round of search
Intermediate Sequence Search
– Search query against protein database
– Generate multiple alignment or profile
– Use profile to search against PDB

Comparison of Whole Genomes
MUMmer (Salzberg group, 1999, 2002)
– Pair-wise sequence alignment of genomes
– Assumes that sequences are closely related
– Allows detection of repeats, inverse repeats, SNPs
– Domain inserted/deleted
– Identify the exact matches
How it works
– Identify the maximal unique matches (MUMs) in two genomes
– Because the two genomes are similar, long MUMs will be present
– Sort the MUMs found and extract the longest set of matches that occur in the same order (ordered MUMs)
– A suffix tree is used to identify MUMs
– Close the gaps by SNPs, large inserts

Protein Structure Prediction
Experimental Techniques
– X-ray crystallography
– NMR
Limitations of Current Experimental Techniques
– Protein DataBank (PDB) -> 24000 protein structures
– SwissProt -> 100,000 proteins
– Non-Redundant (NR) -> 1,000,000 proteins
Importance of Structure Prediction
– Fill gap between known sequences and structures
– Protein engineering to alter function of a protein
– Rational drug design

Protein Structures

Techniques of Structure Prediction
Computer simulation based on energy calculation
– Based on physico-chemical principles
– Thermodynamic equilibrium with a minimum free energy
– Global minimum free energy of protein surface
Knowledge Based approaches
– Homology based approach
– Threading protein sequence
– Hierarchical methods

Energy Minimization Techniques
Energy minimization based methods, in their pure form, make no a priori assumptions and attempt to locate the global minimum.
Static Minimization Methods
– Many classical potentials can be constructed
– Assume that atoms in the protein are static
– Problems (large number of variables & minima, and validity of potentials)
Dynamical Minimization Methods
– Motions of atoms are also considered
– Monte Carlo simulation (stochastic in nature, time is not considered)
– Molecular dynamics (time, quantum mechanical, classical equations)
Limitations
– Large number of degrees of freedom; CPU power not adequate
– Interaction potentials are not good enough for modelling

Knowledge Based Approaches
Homology Modelling
– Needs homologues of known protein structure
– Backbone modelling
– Side chain modelling
– Fails in absence of homology
Threading Based Methods
– New way of fold recognition
– Sequence is fitted to known structures
– Motif recognition
– Loop & side chain modelling
– Fails in absence of known examples

Hierarchical Methods
Intermediate structures are predicted, instead of predicting the tertiary structure of a protein directly from its amino acid sequence
Prediction of backbone structure
– Secondary structure (helix, sheet, coil)
– Beta turn prediction
– Super-secondary structure
Tertiary structure prediction
Limitation
– Accuracy is only 75-80 %
– Only three-state prediction

[Figure: cDNA microarray workflow. cDNA clones (probes); PCR product amplification and purification; printing onto the microarray (0.1 nl/spot); hybridise mRNA target to microarray; scanning with laser 1 and laser 2 (excitation, emission); overlay images and normalise; analysis]

Major Applications
Identification of differentially expressed genes in diseased tissues (in presence of drug)
Classification of differentially expressed genes, or clustering/grouping of genes having similar behaviour in different conditions
Use expression profiles of known diseases to diagnose and classify unknown genes

Terms/Jargons
Stanford/cDNA chip
– one slide/experiment
– one spot
– 1 gene => one spot or few spots (replica)
– control: control spots
– control: two fluorescent dyes (Cy3/Cy5)
Affymetrix/oligo chip
– one chip/experiment
– one probe/feature/cell
– 1 gene => many probes (20~25 mers)
– control: match and mismatch cells

Images: examples
Pseudo-colour overlay of Cy3 and Cy5
Spot colour   Signal strength       Gene expression
yellow        Control = perturbed   unchanged
red           Control < perturbed   induced
green         Control > perturbed   repressed

Processing of images
Addressing or gridding
– Assigning coordinates to each of the spots
Segmentation
– Classification of pixels either as foreground or as background
Intensity determination for each spot
– Foreground fluorescence intensity pairs (R, G)
– Background intensities
– Quality measures

Management of Microarray Data
Magnitude of Data
– Experiments: 50 000 genes in human, 320 cell types, 2000 compounds, 3 time points, 2 concentrations, 2 replicates
– Data volume: 4*10^11 data points; 10^15 = 1 petabyte of data

Management of Microarray Data
Major Issues
Large volume of microarray data in last few years
– Storage and efficient access
– Comparison and integration of data
Problem of data access and exchange
– Data scattered around the Internet
– Supplementary material of publications
– Difficult for users to access relevant data
Problems with existing databases
– Diverse purposes
– Developed for specific purposes

Management of Microarray Data
Specific Databases
– Platform (e.g. Stanford MA Database; SMD)
– Organism (Yeast MA global viewer)
– Project (Life cycle database of Drosophila)
Problems with Supplements and MA databases
– Lack of direct access
– Quality not checked
– No standard format
– Incomplete data

Pre-processed cDNA Gene Expression Data
On p genes for n slides: p is O(10,000), n is O(10-100), but growing.
Genes   slide 1   slide 2   slide 3   slide 4   slide 5   …
1        0.46      0.30      0.80      1.51      0.90     ...
2       -0.10      0.49      0.24      0.06      0.46     ...
3        0.15      0.74      0.04      0.10      0.20     ...
4       -0.45     -1.03     -0.79     -0.56     -0.32     ...
5       -0.06      1.06      1.35      1.09     -1.09     ...
Gene expression level of gene 5 in slide 4 = log2(Red intensity / Green intensity)
These values are conventionally displayed on a red (>0), yellow (0), green (<0) scale.
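The log-ratio definition and colour convention above can be sketched in a few lines of Python. This is only an illustration: the function names, the tolerance `tol`, and the spot intensities are hypothetical and do not come from the slides.

```python
import math

def log_ratio(red: float, green: float) -> float:
    """Return log2(red / green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)

def display_colour(m: float, tol: float = 1e-9) -> str:
    """Map a log ratio to the conventional display colour (tol is an assumed cutoff)."""
    if m > tol:
        return "red"      # induced
    if m < -tol:
        return "green"    # repressed
    return "yellow"       # unchanged

# Hypothetical background-corrected (R, G) intensity pairs for three spots.
spots = [(2000.0, 1000.0), (500.0, 500.0), (250.0, 1000.0)]
ratios = [log_ratio(r, g) for r, g in spots]
colours = [display_colour(m) for m in ratios]

for (r, g), m, c in zip(spots, ratios, colours):
    print(f"R={r:7.1f} G={g:7.1f} log2(R/G)={m:+.2f} -> {c}")
```

Working on the log2 scale makes two-fold induction (+1) and two-fold repression (-1) symmetric around zero, which is why the ratio is reported in log form.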

Analysis of Microarray Data
Analysis of images
Preprocessing of gene expression data
Normalization of data
– Subtraction of background noise
– Global/local normalization
– House-keeping genes (or same gene)
– Expression as ratio (test/reference) in log
Differential gene expression
– Repeats and calculate significance (t-test)
– Significance of fold change using statistical methods
Clustering
– Supervised/Unsupervised (Hierarchical, K-means, SOM)
Prediction or supervised machine learning (SVM)

Normalization Techniques
Global normalization
– Divide channel value by means
Control spots
– Common spots in both channels
– House-keeping genes
– Ratio of intensity of same gene in two channels is used for correction
Iterative linear regression
Parametric nonlinear normalization
– log(Cy3/Cy5) vs log(Cy5)
– Fitted log ratio – observed log ratio
General nonlinear normalization
– LOESS
– Curve between log(R/G) vs log(sqrt(R·G))

Classification
Task: assign objects to classes (groups) on the basis of measurements made on the objects
Unsupervised: classes unknown, want to discover them from the data (cluster analysis)
Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations

Issues in Clustering
Pre-processing (image analysis and normalization)
Which genes (variables) are used
Which samples are used
Which distance measure is used
Which algorithm is applied
How to decide the number of clusters K

Unsupervised Learning
Hierarchical clustering: merging two branches at a time until all variables (genes) are in one tree. [It does not answer the question "how many gene clusters are there?"]
K-means clustering: assuming there are K clusters. [What if this assumption is incorrect?]
Model-based clustering: the number of clusters is determined dynamically. [Could be one of the most promising methods]

Supervised Analysis
Fisher's linear discriminant analysis
Quadratic discriminant analysis
Logistic regression (a linear discriminant analysis)
Neural networks
Support vector machine

Traditional Proteomics
1D gel electrophoresis (SDS-PAGE)
2D gel electrophoresis
Protein Chips
– Chips coated with proteins/antibodies
– Large-scale version of ELISA
Mass Spectrometry
– MALDI: mass fingerprinting
– Electrospray and tandem mass spectrometry
Sequencing of peptides (N->C)
Matching in genome/proteome databases

Overview of 2D Gel
SDS-PAGE + Isoelectric focusing (IEF)
– Gene expression studies
– Medical applications
– Sample experiments
Capturing and Analyzing Data
– Image acquisition
– Image sizing & orientation
– Spot identification
– Matching and analysis

Comparison/Matching of Gel Images
Compare 2 gel images
– Set x and y axes
– Overlap matching spots
– Compare intensity of spots
Scan against database
– Compare query gel with all gels
– Calculate similarity score
– Sort based on score

Proteomics: Fingerprints of Disease
Normal Cells, Disease Cells
Phenotypic Changes
• Differential protein expression
• Protein nitration patterns
• Altered phosphorylation
• Altered glycosylation profiles
Utility
• Target discovery
• Disease pathways
• Disease biomarkers

Fingerprinting Technique
What is fingerprinting
– It is a technique to create a specific pattern for a given organism/person
– To compare the patterns of query and target objects
– To create a phylogenetic tree/classification based on patterns
Types of Fingerprinting
– DNA fingerprinting
– Mass/peptide fingerprinting
– Properties based (toxicity, classification)
– Domain/conserved pattern fingerprinting
Common Applications
– Paternity and maternity
– Criminal identification and forensics
– Personal identification
– Classification/identification of organisms
– Classification of cells

Fingerprinting Techniques: Principles &
Applications
What is fingerprinting
Types of Fingerprinting
Common Applications
Role of Computer in DNA Fingerprinting
– Searching restriction enzymes
– Searching VNTRs
– Computation of size of DNA fragments
– Optimization of gels
– Comparison of patterns
– Creation of phylogenetic tree

Drug Design
History of Drug/Vaccine Development
– Plants or Natural Products
  Plants and natural products were sources of medicinal substances
  Example: foxglove used to treat congestive heart failure
  Foxglove contains digitalis and cardiotonic glycosides
  Identification of active component
– Accidental Observations
  Penicillin is one good example
  Alexander Fleming observed the effect of mold
  Mold (Penicillium) produces the substance penicillin
  Discovery of penicillin led to large-scale screening
  Soil microorganisms were grown and tested
  Streptomycin, neomycin, gentamicin, tetracyclines etc.

Drug Design
Chemical Modification of Known Drugs
– Drug improvement by chemical modification
– Penicillin G -> Methicillin; morphine -> nalorphine
Receptor Based Drug Design
– Receptor is the target (usually a protein)
– Drug molecule binds to cause biological effects
– It is also called the lock-and-key system
– Structure determination of receptor is important
Ligand-Based Drug Design
– Search for a lead compound or active ligand
– Structure of ligand guides the drug design process

Drug Design based on Bioinformatics Tools
Detect the Molecular Bases for Disease
– Detection of drug binding site
– Tailor drug to bind at that site
– Protein modeling techniques
– Traditional method (brute force testing)
Rational drug design techniques
– Screen likely compounds built
– Modeling large number of compounds (automated)
– Application of artificial intelligence
– Limitation of known structures

Important Points in Drug Design based on Bioinformatics Tools
Application of Genome
– 3 billion base pairs
– 30,000 unique genes
– Any gene may be a potential drug target
– ~500 unique targets
– There may be 10 to 100 variants at each target gene
– 1.4 million SNPs
– 10^200 potential small molecules

Concept of Drug and Vaccine
Concept of Drug
– Kill invading foreign pathogens
– Inhibit the growth of pathogens
Concept of Vaccine
– Generate memory cells
– Train the immune system to face various existing disease agents

VACCINES
A. SUCCESS STORY:
• COMPLETE ERADICATION OF SMALLPOX
• WHO PREDICTION: ERADICATION OF PARALYTIC POLIO THROUGHOUT THE WORLD BY YEAR 2003
• SIGNIFICANT REDUCTION OF INCIDENCE OF DISEASES: DIPHTHERIA, MEASLES, MUMPS, PERTUSSIS, RUBELLA, POLIOMYELITIS, TETANUS
B. NEED OF THE HOUR
1) SEARCH FOR EFFECTIVE VACCINES, NOT YET AVAILABLE, FOR DISEASES LIKE: MALARIA, TUBERCULOSIS AND AIDS
2) IMPROVEMENT IN SAFETY AND EFFICACY OF PRESENT VACCINES
3) LOW COST
4) EFFICIENT DELIVERY TO NEEDY
5) REDUCTION OF ADVERSE SIDE EFFECTS

Computer Aided Vaccine Design
Whole Organism of Pathogen
– Consists of more than 4000 genes and proteins
– Genomes have millions of base pairs
Target antigen to recognise pathogen
– Search vaccine target (essential and non-self)
– Consists of amino acid sequence (e.g. A-V-L-G-Y-R-G-C-T ……)
Search antigenic region (peptide of length 9 amino acids)

Major steps of endogenous antigen processing

Computer Aided Vaccine Design
Problem of Pattern Recognition
– ATGGTRDAR  Epitope
– LMRGTCAAY  Non-epitope
– RTTGTRAWR  Epitope
– EMGGTCAAY  Non-epitope
– ATGGTRKAR  Epitope
– GTCVGYATT  Epitope
Commonly used techniques
– Statistical (motif and matrix)
– AI techniques

Why computational tools are required for prediction
200 aa protein -> chopped into overlapping peptides of 9 amino acids -> 192 peptides -> bioinformatics tools -> 10-20 predicted peptides -> in vitro or in vivo experiments for detecting which snippets of protein will spark an immune response.

Thanks
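
The peptide-chopping step in the vaccine design slides (a 200-residue protein giving 192 overlapping 9-mers) can be reproduced with a short Python sketch. The function name and the test sequence are hypothetical; the sequence simply repeats the example fragment AVLGYRGCT to reach 200 residues and is not a real antigen.

```python
def overlapping_peptides(protein: str, length: int = 9) -> list[str]:
    """Return every window of `length` residues, shifting by one residue each time."""
    if length <= 0:
        raise ValueError("length must be positive")
    # A protein of n residues yields n - length + 1 windows (none if n < length).
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]

# Hypothetical 200-residue sequence built from the slide's example fragment.
protein = ("AVLGYRGCT" * 23)[:200]
peptides = overlapping_peptides(protein)

print(len(protein), "residues ->", len(peptides), "overlapping 9-mers")
print("first peptide:", peptides[0])
```

The count follows from 200 - 9 + 1 = 192. Each of these candidates would then be scored by a prediction method (motif, matrix, or AI based) so that only the top-ranked ones go to in vitro or in vivo testing.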