Short Introduction to EMBL-EBI
Vicky Schneider, EMBL-EBI Training Programme Project Leader
vicky@ebi.ac.uk

What is EMBL-EBI?
• Based on the Wellcome Trust Genome Campus near Cambridge, UK
• Part of the European Molecular Biology Laboratory
• Non-profit organisation

13.04.2015

The five branches of EMBL
• Heidelberg: basic research in molecular biology; administration; EMBO
• Hamburg: structural biology
• Grenoble: structural biology
• Hinxton: bioinformatics
• Monterotondo: mouse biology
• 1500 staff
• >60 nationalities

EMBL member states
Austria, Belgium, Croatia, Denmark, Finland, France, Germany, Greece, Iceland, Ireland, Israel, Italy, Luxembourg, the Netherlands, Norway, Portugal, Spain, Sweden, Switzerland and the United Kingdom
Associate member state: Australia

How is EMBL-EBI funded?
• In 2010 it cost €41 million to run EMBL-EBI.
• EU (€7.4 M)
• EMBL member states (€22.4 M)
• Charity (€4.1 M)
• US Govt (€2.9 M)
• UK Research Councils (€2.5 M)

What Is Bioinformatics?

What is bioinformatics?
• Storing, retrieving and analysing
• Interdisciplinary
• Heart of modern biology

Biology is changing
• Data explosion
• New types of data
• High-throughput biology
• Growth of applied biology
  • molecular medicine
  • agriculture
  • food
  • environmental sciences…
• Emphasis on systems, not reductionism
[Figure: Growth of raw storage at EMBL-EBI (in terabytes); axes: Disks (TB) against Year]

The molecules of life

Nature’s ingredients
Small molecules provide building blocks, messengers and helpers:
• Amino acids: the building blocks of proteins
• Nucleotides and sugars: the building blocks of DNA and RNA
• Co-enzymes: pigments such as chlorophyll and haem help important processes such as photosynthesis and respiration
• Hormones: small molecules such as adrenalin and testosterone send important messages from cell to cell

The ‘book of life’
DNA contains the information needed to build an organism

The interpreter
RNA translates the DNA code into protein

Molecular machines
Proteins carry out the functions of life:
• Catalysts: enzymes enable reactions to occur at body temperature
• Structural support: keratin and collagen give structure to our tissues
• Transport: carrier proteins move molecules into and out of cells
• Defense: antibodies protect us from disease-causing organisms
• Movement: myosin in muscles enables them to contract

Bioinformatics underpins life-science research
1 Genomes contain genes
2 Genes are transcribed
3 Transcripts are translated into protein sequences
4 Proteins form three-dimensional structures
5 Proteins interact with each other and with small molecules to form pathways
6 Pathways combine to build systems

From molecules to medicine
• Molecular components: Genomes, Nucleotides, Transcripts, Proteins, Domains, Structures, Small molecules
• Integration: Complexes, Pathways, Cells, Tissues and organs, Human individuals, Human populations
• Translation: Biobanks, Therapies, Disease prevention, Early diagnosis

Examples of the importance of biological information to all of us

Genome-wide analysis of crop plants
• Population growth and climate change are major challenges to food security.
• Traditional routes to crop improvement are too slow to keep up with this increase in demand.
• Understanding plant genomes helps us identify which species will be most tolerant to drought, salt and pests while still providing optimum nutrition.

Matching the treatment to the cancer
• One in ten women in the EU-27 will develop breast cancer before the age of 80.
• If we can identify patterns of genes that are active in different tumours, we can diagnose and treat cancers earlier.

Tracking the source of infectious disease
• Methicillin-resistant Staphylococcus aureus (MRSA) infection is a global problem.
• Transmission of individual clones can be tracked using small variations in DNA sequence.
• This technology can be used to identify the source of new outbreaks across continents and within wards.
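
As a rough illustration of tracking clones by small variations in DNA sequence, the sketch below counts the positions at which pairs of aligned sequences differ and reports the most similar pair. The isolate names and sequences are invented for illustration, and real outbreak analyses work on whole genomes with dedicated alignment and phylogenetics tools; this only shows the basic idea.

```python
from itertools import combinations


def snp_distance(a: str, b: str) -> int:
    """Count positions at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def pairwise_distances(seqs: dict) -> dict:
    """Return the SNP distance for every pair of named sequences."""
    return {
        (m, n): snp_distance(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }


# Hypothetical aligned fragments from four isolates (made up for this example).
isolates = {
    "ward_A_1": "ATGCCGTTAGCA",
    "ward_A_2": "ATGCCGTTAGCA",
    "ward_B_1": "ATGCCGTTAGTA",
    "overseas": "ATGTCGATAGTA",
}

distances = pairwise_distances(isolates)

# The pair with the fewest differences is the most likely recent transmission link.
closest = min(distances, key=distances.get)

for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, d)
```

Isolates separated by few or no differences are candidates for a shared, recent source; larger distances suggest more distant relatedness.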

Barcoding life
• DNA barcodes are short sections of DNA that we use to identify an organism.
• The Barcode of Life Initiative is developing DNA barcoding as a global standard for identifying species.
• Applications include:
  • Protection of endangered species
  • Sustaining natural resources through pest control
  • Food labelling

Repurposing drugs for neglected diseases
• Schistosomiasis is a parasitic infection that affects 210 million people in 76 countries.
• Resistance is developing to the one available drug.
• We look at the Schistosome genome to identify the targets of existing drugs.
• Candidates can be tested for anti-schistosomal activity or used as leads for further optimisation.

Lots of data and new types of data
• Literature
• Genomes
• Nucleotide sequence
• Gene expression
• Proteomes
• Protein sequence
• Protein structure
• Protein families, domains and motifs
• Protein-protein interactions
• Chemical entities
• Pathways
• Systems

EMBL-EBI’s mission statement
• To provide freely available data and bioinformatics services to all facets of the scientific community in ways that promote scientific progress
• To contribute to the advancement of biology through basic investigator-driven research in bioinformatics
• To provide advanced bioinformatics training to scientists at all levels, from PhD students to independent investigators
• To help disseminate cutting-edge technologies to industry
• To coordinate biological data provision across Europe

Services
www.ebi.ac.uk/services

Principles of service provision
• Accessibility
• Compatibility
• Portability
• Comprehensive
• Quality
(Image © Patrick Hoesly)

Databases: molecules to systems
• Genomes: Ensembl, Ensembl Genomes, EGA
• Nucleotide sequence: ENA
• Functional genomics: ArrayExpress, Expression Atlas
• Literature and ontologies: CiteXplore, GO
• Protein families, motifs and domains: InterPro
• Macromolecular: PDBe
• Protein activity: IntAct, PRIDE
• Pathways: Reactome
• Protein sequences: UniProt
• Chemical entities: ChEBI
• Chemogenomics: ChEMBL
• Systems:
BioModels, BioSamples

Database collaborations

Standards development: international collaborations
• Genomics Standards Consortium (GSC): http://gensc.org
• Genome annotation: www.geneontology.org
• Protein sequence: www.uniprot.org
• Nucleotide sequence: www.insdc.org
• Functional Genomics Data Society: www.fged.org
• Cheminformatics: www.ebi.ac.uk/chebi
• HUPO Proteomics Standards Initiative (PSI): www.psidev.info/
• Pathways: www.reactome.org, www.biopax.org
• Metabolomics Standards Initiative (MSI): www.metabolomicssociety.org
• Protein structure: www.wwpdb.org
• Systems modelling standards: www.sbml.org

[Figure: resources and the data types they cover]
• Data types: Nucleotide Sequences, Genomes, Protein Sequences, Protein Families (Diagnostic), Pattern & Motif Search (Diagnostic), Sequence Similarity & Analysis, Macromolecular Structures, Structure Analysis, Small Molecules, Gene Expression, Proteomics, Molecular Interactions, Reactions & Pathways, Enzymes, Literature, Ontologies
• Resources: ArrayExpress, Atlas, BioModels, BLAST, CATH, ChEBI, CiteXplore, DDBJ, ENA, Ensembl, FASTA, FlyBase, GenBank, Gene3D, GEO, GO, GOA, Gramene, IntAct, IntEnz, InterPro, InterProScan, MACiE, PDB, PDBsum, Pfam, PRIDE, PRINTS, ProFunc, PubChem, PubMed, Reactome, RefSeq, SCOP, STRING, UCSC Genome Browser, UniProt, VAST

New search service
• Access from the EBI’s homepage
• Species selector allows for easy comparison
• Data organised according to:
  • gene
  • expression
  • protein
  • structure
  • literature
• Explore data, return easily to
your results

Goals of the new EBI Search
• Relevant to ‘wet-lab’ biologists
• Organises information based around a single gene (or a small number of genes)
• User-expectation centric (not database centric)
• Smooth transition to the detailed information in many of EBI’s core databases
• NOT for bioinformaticians: does not provide programmatic access

Quick databases tour

Genomes 1: Ensembl
• Pick a genome
• Chromosomes
• Genes
• Genomic alignments
• Synteny
• Variations
• Variation Effect Predictor
• Gene trees
• Gene families
• User Upload

Genomes 2: Ensembl Genomes
• Genome portals for the five kingdoms of life
• Interface uses Ensembl technology
• Variation data for plant, metazoan and fungal species
• Multi-way comparison of whole bacterial chromosomes
• Pan-taxonomic comparative analysis

Nucleotides: European Nucleotide Archive (ENA)
The ENA has a three-tiered data architecture. It consolidates information from EMBL-Bank, the European Trace Archive (containing raw data from electrophoresis-based sequencing machines) and the Sequence Read Archive (containing raw data from next-generation sequencing platforms).
Figure adapted from: Cochrane, G. et al. Public Data Resources as the Foundation for a Worldwide Metagenomics Data Infrastructure. In: Metagenomics: Theory, Methods and Applications (Chapter 5), Caister Academic Press, Universidad Nacional de Cordoba, Argentina. Ed. D. Marco (2010).

Transcriptomes: ArrayExpress
• ArrayExpress Archive: browse experiments
• Search by keyword
• Expand results
• Spreadsheets describing the sample properties

Transcriptomes: Gene Expression Atlas
• Atlas: browse changes in gene expression
• Search by gene or biological condition
• Gene page
• Experiment page

Some data sources for annotation
Input sources for UniProtKB:
• GO: functional info
• PRIDE: protein identification data
• InterPro: protein families and domains
• IntAct: molecular interactions
• IntEnz: enzymes
• HAMAP: microbial protein families
• RESID: post-translational modifications
UniProt:
• Manual curation
• Literature-based annotation
• Sequence analysis
• Automated annotation
• InterPro classification, signal prediction, transmembrane prediction, other predictions

Protein classification

Protein families, motifs and domains: InterPro
• Powerful tool for protein classification, integrating several methods into one resource
• Compare methods of protein signature prediction
• Visualise the taxonomic range for a protein signature
• View architectures of proteins containing a signature

Proteomics services
• PRIDE: protein identifications from proteomics experiments
• IntAct: molecular interactions
• ChEBI: small molecules
• IntEnz: enzyme classification

Structures: PDBe

Chemogenomics: ChEMBL
• ChEMBL database
• Neglected Tropical Disease (NTD) archive
• Kinase SARfari
• GPCR SARfari
• Browse targets
• Target search
• Compound search
• Search results

Pathways: Reactome
• Compare events in different species
• View expression values overlaid on a pathway
• Link to source databases
• Interaction overlay on a pathway diagram
• Export pathway to your favourite modelling software

Data management
• Over 4M web requests per day (over 4.6M if Ensembl is included)
• Over 280,000 unique hosts served per month, excluding Ensembl
• Total disk space: 10 petabytes in 2010.
• Leased two new data centres (with €11.4M from UK Research Councils)
• Over 800 million cross-references in the databases we serve

User support
• E-mail support: www.ebi.ac.uk/support
• Online help pages: www.ebi.ac.uk/help
• 2Can bioinformatics user support: www.ebi.ac.uk/2Can
• eLearning Portal: coming soon (elearning@ebi.ac.uk)

Research
www.ebi.ac.uk/groups

Key facts about research
• The EBI provides a unique environment for bioinformatics research
• Eight dedicated research groups aim to understand biology through new approaches to interpreting biological data
• Services teams also carry out R&D to enhance existing services and develop new ones
• The research programme complements the services, and the two are mutually supportive

Curiosity-driven research
• Research areas: Genomes, Transcriptomes, Proteins, Text mining, Chemistry, Pathways and systems
• Group leaders: Ewan Birney, Alvis Brazma, Janet Thornton, Nicolas Le Novère, Paul Flicek, Anton Enright, Rolf Apweiler, Nick Luscombe, Nick Goldman, John Marioni, Gerard Kleywegt, Paul Bertone, Dietrich Rebholz-Schuhmann, Christoph Steinbeck, John Overington, Julio Saez-Rodriguez
• Disciplines: biology/medicine, chemistry/chem engineering, maths, physics

Training
www.ebi.ac.uk/training

Hands-on training for all levels of experience
• Interactive training in our purpose-built IT training suite at EMBL-EBI, Hinxton, Cambridge
• Learn from the EBI’s experts through a combination of talks and practical exercises
• Take a tour of all our core data resources, or focus in on specific data types
• Full programme at www.ebi.ac.uk/training/handson

Predoc and postdoc training
• Open Days for bioinformatics early-stage researchers: www.ebi.ac.uk/training/openday
• PhD studentships through EMBL International PhD Programme: www.ebi.ac.uk/training/Studentships
• EIPOD interdisciplinary post-doc fellowship programme: www.embl.de/training/postdocs/eipod
• EBI–Sanger postdoc programme: www.ebi.ac.uk/training/postdoc/ESPOD