Introduction to Bioinformatics
Data analysis of bio-activities for drug discovery
Course 341
Department of Computing, Imperial College, London
Yike Guo, Vasa Curcin, Henry Morris

Recommended Texts
For this part of the course
– Lecture Notes
– Handouts
General overview of microarray data analysis
– “Microarray Gene Expression Data Analysis: A Beginner’s Guide” (Causton, Quackenbush and Brazma)
– “Microarray Bioinformatics” (Stekel)
Data Mining
– “Data Mining: Concepts and Techniques” (Han)

Goal:
– Understand the basic bioarray technology, including microarray technology for gene expression, NMR spectroscopy and other high-throughput devices
– Learn the basic analytical technology and its applications to bioarray information
– Learn the processes for processing and analysing bioarray data (e.g. gene expression analysis)

Lecture Overview
Lecture One: BioArray Informatics in Drug Discovery
Lecture Two: BioArray Technology
Lecture Three: Analysis Technology (1): Data Normalisation and Transformation
Lecture Four: Analysis Technology (2): Clustering
Lecture Five: Analysis Technology (3): Classification and Ontology
Lecture Six: Integrative Analysis of Gene Expression Data
Lecture Seven: Kernel Method
Lecture Eight: Analysis for NMR Metabonomics Data

The Drug Discovery Process
Figure labels: database/genes; protein targets; chemical diversity; identify ‘hit’; optimize ‘hit’ structure; test safety/efficacy (animals, humans)
The aim is to translate new information into new therapies.

Complexity of Drug Discovery
Finding a Molecule that Satisfies Multiple Criteria
Figure labels: 1 Drug Molecule; patentable; non-teratogenic; 10,000 Drug Candidates; Valid Biomedical Hypothesis?
Complexity of Drug Discovery
Finding a Molecule that Satisfies Multiple Criteria
Figure labels: 1 Drug Launch; Cost-effective manufacturing; Carcinogenicity studies; 10 Drug Molecules

Bioarray: High Throughput Measurement of Biological Activities
Figure labels: Gene Expression; Protein Expression; SNP; Metabonomic Expression; Chemical Hits

Dynamics in BioArray Informatics
Figure labels: Interactions; Environment; Metabolites; DNA; RNA; Protein; Growth rate; Expression

Gene Expression
– Cells are different because of differential gene expression.
– About 40% of human genes are expressed at one time.
– A gene is expressed by transcribing DNA into single-stranded mRNA.
– mRNA is later translated into a protein.
– Microarrays measure the level of mRNA expression.

Gene Expression Measurement
– mRNA expression represents dynamic aspects of the cell.
– mRNA expression can be measured with the latest technology.
– mRNA is isolated and labeled with fluorescent protein.
– mRNA is hybridized to the target; the level of hybridization corresponds to light emission, which is measured with a laser.

Gene Expression Microarrays
The main types of gene expression microarrays:
– Short oligonucleotide arrays (Affymetrix)
– cDNA or spotted arrays (Brown/Botstein)
– Long oligonucleotide arrays (Agilent Inkjet)
– Fiber-optic arrays
– ...
Affymetrix Microarrays
Figure labels: raw image; 1.28 cm; 50 µm
– ~10^7 oligonucleotides, half Perfectly Match mRNA (PM), half have one Mismatch (MM)
– Raw gene expression is the intensity difference: PM - MM

BioArray Informatics: BioArray is the data, everything else is Informatics
Figure labels: Data Engineering; Data Warehousing; Data Integration; Data Analysis; Knowledge Discovery; Discovery Integration; Discovery Validation; Knowledge Integration; Quantitative Analysis; Reproducibility; confidence intervals to find significant deviations

Data Warehousing
Data Sources
– External Data Sources: KEGG, Unigene, Genbank
– Operational Data Sources: Sample & Clinical Data, BioArray Data
Data Warehousing: Experimental/Sample Database; Expression Database; Function Annotations; Structure Annotations

Data Schema in Warehousing: A Gene Expression Example
Figure labels: Gene Expression Warehouse; OMIM; Disease; ExPASy SwissProt; PDB; ExPASy Enzyme; Protein; Enzyme; LocusLink; Affy Fragment; Known Gene; MGD; Sequence; Metabolite; SNP; SPAD; Sequence Cluster; NCBI dbSNP; Genbank; NMR; Pathway; UniGene; KEGG

A Workflow of Gene Expression Database
Figure labels: Data Reduction Queries; GXDW; User defined dataset; Warehousing Output; Profile Report; Data in analysis; Visualisation
– Comparisons between 2 samples: set fold change (e.g., > 2X)
– Comparisons between multiple samples: set higher avg difference value (e.g., > 200)
– A->P / P->A stringency (e.g., 80%)

Advanced Gene Expression Analysis
Queries, Queries…
Query to the data
– Which genes are linked?
– Which genes are expressed similarly to my gene XYZ?
– Which genes are co-expressed in differing conditions?
– classification (of tumors, diseased tissues etc.): which patterns are characteristic for a certain class of samples, which genes are involved?
– functional classification of genes: Are changes clustered in particular classes?
– metabolic pathway information: Is a certain pathway/route in a pathway affected?
– disease information & clinical follow up: correlation to expression patterns.
– phenotype information for mutants: Are there correlations between particular phenotypes and expression patterns?

Gene Expression Data Analysis Work Flow
Figure labels: Data in analysis; Interactive Analysis Procedures; Knowledge Deliverables
Interactive analysis procedures [examples, not exhaustive]
– Cluster by genes
– Study outliers
– Correlate clinical measurements
– Literature analysis
– Time course analysis
Defined subsets of genes
– Classic drug targets
– Known disease association
– Cross species indices

Microarray Data Analysis Types
Gene Selection
– find genes for therapeutic targets
– avoid false positives (FDA approval?)
Classification (Supervised)
– identify disease
– predict outcome / select best treatment
Clustering (Unsupervised)
– find new biological classes / refine existing ones
– exploration
…

Microarray Data Mining Challenges
– too few records (samples), usually < 100
– too many columns (genes), usually > 1,000
– too many columns are likely to lead to false positives
– for exploration, a large set of all relevant genes is desired
– for diagnostics or identification of therapeutic targets, the smallest set of genes is needed
– model needs to be explainable to biologists

Classification
Desired features:
– robust in presence of false positives
– understandable
– return confidence/probability
– fast enough
Simplest approaches are most robust; advanced approaches can be more accurate.

Microarray Data Classification
Figure labels: Microarray chips; Images scanned by laser; Datasets; Data Mining model; New sample; Prediction: ALL or AML

Gene             Value
D26528_at          193
D26561_cds1_at     -70
D26561_cds2_at     144
D26561_cds3_at      33
D26579_at          318
D26598_at         1764
D26599_at         1537
D26600_at         1204
D28114_at          707

Class  Sno  D26528  D63874  D63880  …
ALL      2     193    4157     556
ALL      3     129   11557     476
ALL      4      44   12125     498
ALL      5     218    8484    1211
AML     51     109    3537     131
AML     52     106    4578      94
AML     53     211    2431     209
…

FALSE POSITIVES PROBLEM
– Not enough records (samples), usually < 100
– Too many columns (genes), usually >> 1,000
– FALSE POSITIVES are very likely because of few records and many columns

Popular
Classification Methods
– Decision Trees/Rules: find smallest gene sets, but not robust to false positives
– Neural Nets: work well for reduced # of genes
– K-nearest neighbor: robust for small # genes
– TreeNet from authors of CART and MARS: networks of simple trees; very robust against outliers
– Support Vector Machines (SVM): good accuracy, does its own gene selection, but hard to understand
– ...

Microarrays: An Example
Leukemia: Acute Lymphoblastic (ALL) vs Acute Myeloid (AML), Golub et al., Science, v.286, 1999
– 72 examples (38 train, 34 test), about 7,000 genes
– well-studied (CAMDA-2000), good test example
Figure labels: ALL; AML
– Visually similar, but genetically very different

Results on the test data
Genes selected and model trained on Train set ONLY!
Best Clementine neural net model used 10 genes per class
Evaluation on test data (34 samples) gives
– 1 or 2 errors (94-97% accuracy)
– Note: all methods give an error on sample 66, believed to be mis-classified by a pathologist

Clustering Goals
– Find natural classes in the data
– Identify new classes / gene correlations
– Refine existing taxonomies
– Support biological analysis / discovery
Different Methods
– Hierarchical clustering, SOMs, etc.

Yeast SOM Clusters
Yeast Cell Cycle SOM. www.pnas.org/cgi/content/full/96/6/2907
(a) 6 × 5 SOM. The 828 genes that passed the variation filter were grouped into 30 clusters. Each cluster is represented by the centroid (average pattern) for genes in the cluster. Expression level of each gene was normalized to have mean = 0 and SD = 1 across time points. Expression levels are shown on the y-axis and time points on the x-axis. Error bars indicate the SD of average expression. n indicates the number of genes within each cluster. Note that multiple clusters exhibit periodic behavior and that adjacent clusters have similar behavior.
(b) Cluster 29 detail. Cluster 29 contains 76 genes exhibiting periodic behavior with peak expression in late G1.
Normalized expression patterns of the 30 genes nearest the centroid are shown.
(c) Centroids for SOM-derived clusters 29, 14, 1, and 5, corresponding to G1, S, G2 and M phases of the cell cycle, are shown.

BioArray Informatics: Data Analysis of Bioarray Data within the Biological Context
Figure labels: secondary structure; tertiary structure; polymorphism; patient records; epidemiology; expression patterns; physiology; sequences; alignments; ATGCAAGTCCCT AAGATTGCATAA GCTCGCTCAGTT; receptors; signals; pathways; linkage maps; cytogenetic maps; physical maps

An illustration of iterative analysis of Bio-activities
Figure nodes: Gene; Protein; Receptor; Ligand (edges are numbered by the relations below)
Relations
1- gene homologs
2- gene encodes a protein
3- protein can regulate the expression of a gene
4- protein phosphorylates another protein
5- protein binds to another protein
6- protein lyses another protein
7- proteins can sometimes be receptors
8- receptors bind a ligand
9- receptors (if bound) activate other proteins

Advanced Analysis
Discovery Annotation and Validation
– e.g. annotating a set of co-expressed genes with some conserved regulatory motifs
– e.g. scoring a co-expression pattern with pathways
– e.g. literature analysis to annotate biological meaning
Integrative Analysis
– e.g. multi-modality analysis
– e.g. cross annotation of discovered patterns
Modelling and Simulation
– e.g. pathway synthesis
– e.g. virtual cell modelling

Multi-Modality Analysis
“REAL WORLD”: “INPUTS” (noxious agent/stressor); “OUTPUTS” / “BIOLOGICAL END-POINTS” (pathology; altered physiology and metabolism)
“-OMICS WORLD”: gene profile, protein profile and metabolic profile, each plotted against time
A mathematical model: forwards-propagated correlations (figure labels: event; time; mRNA; protein; metabolites)

Integrated Analysis of Metabonomic (plasma) versus Muscle Gene Expression Data for Insulin Resistance (Prof. Jeremy Nicholson et al.)
Figure labels: zones showing high gene-metabolite correlations; strongly weighting variables WT > KO; minimally weighting variables; strongly weighting variables KO > WT
Red = mRNA expression levels for each gene/EST
Black = individual quantitative single pulse plasma NMR spectral descriptors
Integration of RNA and NMR results

Discovery of causal processes
– A long-term goal of Systems Biology is to discover the causal processes among genes, proteins, and other molecules in cells.
– Can this be done (in part) by using data from high-throughput experiments, such as microarrays?

Bayesian Causal Network Structure
P(GAL4)
P(GAL2 | GAL4)
P(Intracellular Galactose | GAL2)
Each variable is independent of its distant causes given all of its direct causes.
Thanks to Greg Cooper, U. Pitt

Bayesian Network Learned for Yeast
Hartemink et al., “Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models”, PSB 2002, psb.stanford.edu/psb-online

Integrate biological knowledge when analyzing microarray data (from Cheng Li, Harvard SPH)
Right picture: “Gene Ontology: tool for the unification of biology”, Nature Genetics, 25, p25

Enjoy the lecture so you can find a drug (or many jobs)
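
Worked illustration for the Bayesian Causal Network Structure slide
The three factors on that slide, P(GAL4), P(GAL2 | GAL4) and P(Intracellular Galactose | GAL2), multiply to give the joint distribution of a three-variable chain. The sketch below is only an illustration: the variables are treated as binary (on/off), and every probability value is a made-up assumption, not a number from the lecture or from yeast data. The function names (bernoulli, joint, prob, conditional) are likewise hypothetical helpers. It checks numerically that galactose is independent of its distant cause GAL4 once its direct cause GAL2 is known.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
p_gal4 = {True: 0.6, False: 0.4}                    # P(GAL4 = value)
p_gal2_on_given_gal4 = {True: 0.9, False: 0.1}      # P(GAL2 = on | GAL4)
p_galactose_high_given_gal2 = {True: 0.8, False: 0.05}  # P(galactose = high | GAL2)


def bernoulli(p_true, value):
    """Probability of a binary outcome, given the probability of True."""
    return p_true if value else 1.0 - p_true


def joint(gal4, gal2, galactose):
    """Joint probability as the product of the three factors on the slide."""
    return (p_gal4[gal4]
            * bernoulli(p_gal2_on_given_gal4[gal4], gal2)
            * bernoulli(p_galactose_high_given_gal2[gal2], galactose))


def prob(**fixed):
    """Marginal probability of the fixed assignments, by brute-force summation."""
    names = ("gal4", "gal2", "galactose")
    total = 0.0
    for values in product([True, False], repeat=3):
        assignment = dict(zip(names, values))
        if all(assignment[name] == value for name, value in fixed.items()):
            total += joint(**assignment)
    return total


def conditional(target, given):
    """P(target | given); both arguments are dicts with disjoint keys."""
    return prob(**target, **given) / prob(**given)


if __name__ == "__main__":
    # Conditioning on GAL2 alone, or on GAL2 together with GAL4,
    # gives the same probability for galactose.
    print(conditional({"galactose": True}, {"gal2": True}))
    print(conditional({"galactose": True}, {"gal2": True, "gal4": False}))
```

Brute-force enumeration is used here only because the network has three binary variables; learned networks such as the yeast example need dedicated inference and structure-learning methods.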