Bioinformatics Tools in Context
How Informatics Improves Research Every Day
Adelaide Fletcher, MLIS
Tzu L. Phang, Ph.D.
July 27, 2012

True or False?
You have to be a collaborator on someone’s clinical trial to make discoveries with their genetic data...

• Stanford School of Medicine's Atul Butte identified a new drug target for diabetes by downloading data from 130 gene-expression studies in mice, rats, and humans that were done by other researchers, then doing a meta-analysis to look for a common link
• Wet lab experiments are more for validating hypotheses than making discoveries

Meet Our Hero...
• Name: Hunter
• Research Interests: The role of mammary epithelial cells in breast cancer
• Goal: Develop a genetic drug target for breast cancer
• Post-grad experience: < 1 year
• Funding: $0

Where should he start?
• A. Ask for $$!
• B. Do a lit search
• C. Try to find free genetic data

Finding out what’s known
• Google Scholar - http://scholar.google.com
• Web Of Science - http://isiknowledge.com/WOS
– (http://hslezproxy.ucdenver.edu/login?url=http://isiknowledge.com/WOS)

Google Scholar
• http://scholar.google.com - search “mammary epithelial cells”

What’s this? Free data?

Follow the path of serendipity
“Data”

Now that we’ve found “Data” what are we going to “Tzu”?
http://cctsi.ucdenver.edu/RIIC

GEO (Gene Expression Omnibus)
http://www.ncbi.nlm.nih.gov/geo/
As of July 19, 2012

Using GEO as an example
• Naming schemes:
– GPL
– GSM
– GSE
– GDS

GPL (GEO PLatform)
• Describes the list of elements on the array (cDNAs, oligonucleotide probesets, ORFs, antibodies)
• Each platform is assigned a unique and stable GEO accession number (GPLxxx)
• Example:
– GPL570: Affymetrix GeneChip Human Genome U133 Plus 2.0 Array

GSM (GEO SaMple)
• Describes the conditions under which an individual Sample was handled, the manipulation it underwent, and the abundance measurement of each element derived from it
• A Sample entity must reference only one Platform and may be included in multiple Series
• Example: GSM300166 (remember HW 2??!)
– PostcentralGyrus_female_91yrs_indiv10

GSE (GEO SEries)
• Defines a set of related Samples considered to be part of a group
• Provides a focal point and description of the experiment as a whole
• Example:

Let’s look at an example
• Go to the GEO site
• Under “GEO accession”, type:
– GSE11882
• Find these terms:
– GPL
– GSM
– GSE

GDS (GEO DataSet)
• Curated sets of GEO Sample data
• Represents a collection of biologically and statistically comparable GEO Samples
– Same platform
– Shared common set of probe elements
– Samples’ intensities calculated in an equivalent manner (background correction, normalization, etc.)
• Example: GDS200 (see next page)

What can you do in GEO?
• Clustering Analysis
• Class Comparison Analysis
• Gene Expression Profile

Let’s import the dataset
• GDS2789

What’s wrong with the approach?
• Only shows one gene at a time
• Hard to select a gene set for downstream analysis such as clustering
• Hard to output a gene list

BRB-ArrayTools
http://linus.nci.nih.gov/BRB-ArrayTools.html
• Free, open-source software
• Microsoft Excel plug-in
• Only works on the Windows platform
• Subject to all of Excel's limitations

BRB-ArrayTools
• Biometric Research Branch (BRB)
– Statistical/biomathematical component
– Division of Cancer Treatment and Diagnosis (NCI)
• Richard Simon & BRB-ArrayTools Development Team
• BRB-ArrayTools
– Visualization and statistical analysis of DNA microarray gene expression data
– Developed by statisticians
– Excel add-in
– Analytic/visualization tools: R statistical system, C and Fortran programs, Java applications
– Visual Basic for Applications integrates components

Objectives
• “provide scientists with software … without requiring them to learn a programming language”
• “encapsulate into software the experience of professional statisticians”
• “facilitate education of scientists in statistical methods for the analysis of DNA microarray data”

Installing BRB-ArrayTools
• Windows 98/2000/NT/XP/Vista/7
• Loads package as add-in to Microsoft Excel
– Excel 2000 or later
– Creates ArrayTools menu on Excel menu bar
• Intensive computations performed in R or compiled programs

Installation
• Go to “http://linus.nci.nih.gov/BRB-ArrayTools.html”
• Click on “All required components in ONE file”

Installation
• Click on “Download Standard Version 3.7.1 (All in one file)”
• When prompted, enter User name and Password (these will be sent to you after your FREE registration)

Demonstration
Installation
• Follow the step-by-step procedures
• In the interest of time, the software has already been installed on your machine

Demonstration
Excel 2007: Security Setting

Now, a video demo …
http://david.abcc.ncifcrf.gov/home.jsp

A quick recap...
• List of 220 or so genes with potential indications for treatment or further understanding of Breast Cancer pathways
• List of 6 or so genes with a shared biological pathway (transcription factor activity)

Do these genes have a cancer (CA) connection?
• In NCBI GENE search: “(TBX6 OR ZNF423 OR NR4A3 OR SCAND2 OR CEBPE OR SIX2) AND Cancer”

NCBI Gene - a one-stop shop

All Roads Lead to GENE
• Summary
– Official Symbol, Aliases
• Context, Regions, Transcripts
• Related Articles and GeneRIFs
• Phenotypes
• General Info
– Homology, Pathways, Ontology
• Reference Sequences
• Internal Links
– MapViewer
– OMIM
– BLAST
• External Links
– Ensembl
– UCSC

Browsing Genes and Genomes
• NCBI
• Ensembl
• UCSC Genome Browser
– Which one to use?
• http://cctsi.ucdenver.edu/RIIC/Pages/TranslationalInformaticsVideos.aspx#GenomeBrowsers
– A full day of Ensembl training: http://hsl2.ucdenver.edu/ensembl/

BLASTing
• To what gene does this nucleotide sequence most likely belong?
• gggtgaacag ccgcacggga gtaggtacgc acctgacctc gctggcactg ccgggcaagg cagagggtgt ggcgtcgctc accagccagt gcagctacag cagcaccatc gtccatgtgg gagacaagaa gccgcagccg gagttagaga tggtggaaga tgctgcgagt gggccagaat
• http://blast.ncbi.nlm.nih.gov/Blast.cgi
• http://www.ensembl.org/Danio_rerio/blastview
• http://genome.ucsc.edu/cgi-bin/hgBlat?command=start

BLASTing
• What about this one?
• acatttgctt ctgacacaac tgtgttcact agcaacctca aacagacacc atggtgcacc tgactcctga ggagaagtct gcggttactg ccctgtgggg caaggtgaac gtggatgaag ttggtggtga ggccctgggc aggctgctgg tggtctaccc ttggacccag aggttctttg agtcctttgg ggatctgtcc actcctgatg cagttatggg caaccctaag gtgaaggctc atggcaagaa agtgctcggt gcctttagtg atggcctggc tcacctggac aacctcaagg gcacctttgc cacactgagt gagctgcact gtgacaagct gcacgtggat cctgagaact tcaggctcct gggcaacgtg ctggtctgtg tgctggccca tcactttggc aaagaattca ccccaccagt gcaggctgcc tatcagaaag tggtggctgg tgtggctaat gccctggccc acaagtatca ctaagctcgc tttcttgctg tccaatttct attaaaggtt cctttgttcc ctaagtccaa ctactaaact gggggatatt atgaagggcc ttgagcatct ggattctgcc taataaaaaa catttatttt

Genetics in Literature
• What does this sequence:
ATTAAAGATGATTTTTACAGTCAATGAGCCACGTCAGGGAGCGATGGCACCCGCAGGCGGTATCAACTGAT
GCAAGTGTTCAAGCGAATCTCAACTCGTTTTTTCCGGTGACTCATTCCCGGCCCTGCTTGGCAGCGCTGCA
CCCTTTAACTTAAACCTCGGCCGGCCGCCCGCCGGGGGCACAGAGTGTGCGCCGGGCCGCGCGGCAATT
GGTCCCCGCGCCGACCTCCGCCCGCGAGCGCCGCCGCTTCCCTTCCCCGCCCCGCGTCCCTCCCCCTCG
GCCCCGCGCGTCGCCTGTCCTCCGAGCCAGTCGCTGACAGCCGCGGCGCCGCGAGCTTCTCCTCTCCTC
ACGACCGAGGCAGGTAAACGCCCGGGGTGGGAGGAACGCGGGCGGGGGCAGGGGAGCCGCGGGGGCC
GAGTGAGGACCCCGGGCCTCGGGTCCCAGGCGCAAGGGTGCCCGGCCGGGCGGGGTCGGGACCCCAG
TGAGGAGGGGCCGGGGGCTGCCCCGCGGGCGCGTGACGCGTCTCGGGCCTGCCCGGCTGCGCTGGTCT
CCGCTCGGGTGAGGCGGCTTGGCTTCGCTTTTCAGGTTAGGAAAGCTCCCTTTACTGCGCGTTGGGGGGC
TGGGGGAGCTGGCGGAGCCCCGTTAGGGAGGTCGGTGGCGCCGGGGTGTCTCAGCGCCCCCTGCACCC
CGCGCGGGTCCGGCCCAGCGGGCGATCGCTGGCGCCCAGGGAACTCCGGGAGGGCCGCCAGCGGGCT
CCGCAGGGCGCGGGGCGGGGAGGGGCGCCTGGGGGCCGCGGGGCTCGCGCTCCCCGCCCGTTGGCCG
CCCCTCGGAGGCCGAGATCGGGGCCCAGAACGCCCCTTGGCAAGGCCTGGCGCTTCCGCGATGCCCAGA
GGGTGCTTGGGGGGATGGAGAGAGGGGCGCCCGCCGGGGGAGTTCCGGGAGCCTCGGTGCCTCCCGCC
GCAGCTGCAGCGTTCCTCCCGGGAGGCGGCCCAGCCCTTCATCCTCGCCGCCTGAGCTTCTCCGAGGGG
GGCTGCAGCCTTGCGGCCGTTGCCACCGCCTGGAGAAGCGGCCCACGCGGACTGACGGGCGGGGGCGG
GGCCTCGGGCCTCGGCGGGGGCGGGGTCCGGGGAGGCCCCACCCTCTGTTCTCCAGGGGCGGGGAGA
GAGGAGCTGCAGGTCTGCGGCCTGGC
• have to do with this book? http://www.amazon.com/The-Family-That-Couldnt-Sleep/dp/1400062454

Oh yeah, him

Phylogenetics
• Scientific procedure to reconstruct the evolutionary history of organisms or sequences
• Evolutionary theory: groups of similar organisms are descended from a common ancestor
• Cladistics:
– Developed by Willi Hennig, German entomologist (1950)
– Phylogenetic systematics: a mathematical approach
– Method of taxonomic classification of organisms based on their evolution
• So, why do we study phylogenetics?

What can phylogenetics tell you?
• Discovering the function of a gene
– Is your gene of interest orthologous to another well-characterized gene from another species?
• Retracing the origin of a gene
– Most genes travel together through evolutionary time.
– Determine if genes undergo genomic modification such as mutation, deletion, duplication, speciation, loss and gain of function, inactivation, etc.

DNA: a good measurement
• Advantages over morphological taxonomic characters:
– Character states are unambiguous
– A large number of characters can be used to perform the analysis
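
To make the point about unambiguous character states concrete, here is a minimal sketch in Python. It assumes the sequences have already been aligned to the same length, and it uses the p-distance (the proportion of compared sites that differ) as a simple stand-in for the distance measures that alignment and tree-building tools use internally; it is not what ClustalW itself computes. The function names and the toy sequences are made up for illustration and are not real data.

```python
from itertools import combinations

# Only these four states count as unambiguous DNA characters.
VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of compared aligned sites at which two sequences differ.

    Sites where either sequence has a gap ('-') or an ambiguous base
    are skipped, so only unambiguous A/C/G/T characters are compared.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                differing += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differing / compared


def distance_matrix(aligned):
    """Pairwise p-distances for a dict of name -> aligned sequence."""
    return {
        (x, y): p_distance(aligned[x], aligned[y])
        for x, y in combinations(sorted(aligned), 2)
    }


# Hypothetical toy alignment, invented for this example.
toy_alignment = {
    "species_A": "ACGTACGTAC",
    "species_B": "ACGTACGTTC",
    "species_C": "ACGAAC-TTC",
}

if __name__ == "__main__":
    for pair, dist in distance_matrix(toy_alignment).items():
        print(pair, round(dist, 3))
```

Every site is one character with a small, fixed set of states, so a longer alignment simply adds more characters to the comparison. A real analysis would start from a proper multiple sequence alignment and would usually apply a substitution model rather than raw p-distances.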
Using ClustalW: www.ebi.ac.uk/clustalw

Now, a video demo …

Find collaborators
• Colorado Profiles: http://profiles.ucdenver.edu/Search.aspx
– Search: “mammary epithelial cells”
• Colorado Translational Informatics Community on Facebook: http://www.facebook.com/pages/Colorado-TranslationalInformatics-Community/136023206424789

Get Informatics Help
• http://cctsi.ucdenver.edu/RIIC
– 5 x 5 Videos
– Find informatics experts
– Monthly podcast
– SeDLAC (Secondary Database Library and Analysis Center)
– Consultation and Data Analysis

Get $$
• NLM Professional Development Repository: http://cnx.org/content/m37008/latest/
• CCTSI Funding: http://cctsi.ucdenver.edu/Funding/Pages/default.aspx
• UC Denver Office of Grants and Contracts: http://www.ucdenver.edu/academics/research/AboutUs/GrantsContractsOffice/Pages/default.aspx

Find a Journal to Publish Findings
• http://www.biosemantics.org/jane/ - Example Search:
• “cDNA microarrays and a clustering algorithm were used to identify patterns of gene expression in human mammary epithelial cells growing in culture and in primary human breast tumors. Clusters of coexpressed genes identified through manipulations of mammary epithelial cells in vitro also showed consistent patterns of variation in expression among breast tumor samples. By using immunohistochemistry with antibodies against proteins encoded by a particular gene in a cluster, the identity of the cell type within the tumor specimen that contributed the observed gene expression pattern could be determined. Clusters of genes with coherent expression patterns in cultured cells and in the breast tumors samples could be related to specific features of biological variation among the samples. Two such clusters were found to have patterns that correlated with variation in cell proliferation rates and with activation of the IFN-regulated signal transduction pathway, respectively.
Clusters of genes expressed by stromal cells and lymphocytes in the breast tumors also were identified in this analysis. These results support the feasibility and usefulness of this systematic approach to studying variation in gene expression patterns in human cancers as a means to dissect and classify solid tumors.”

Get Informatics Help!
• http://cctsi.ucdenver.edu/RIIC
– 5 x 5 Videos
– Find informatics experts
– Monthly podcast
– SeDLAC (Secondary Database Library and Analysis Center)
– Consultation and Data Analysis

Thank You!
• Tzu Phang, Ph.D. – Tzu.Phang@UCDenver.EDU
• Addie Fletcher, MLIS – Adelaide.Fletcher@UCDenver.EDU