Presentation of the CRG Bioinformatics Core facility
Jean-François Taly

People in the BioCore

Jean-François
• @CRG 2009
• @BioCore 2012
• Acting head
• Structural bioinformatics
• MSA
• NGS analyst
• Galaxy server
• Training

Luca
• @BioCore 2010
• NGS analyst
• Small ncRNA prediction
• Motif analysis
• Training

Toni
• @BioCore 2009
• Wikis
• Web/DB development
• DB mirrors
• Structural bioinformatics
• Training

Sarah
• @BioCore 2014
• Micro-arrays
• NGS analyst
• Galaxy
• Training

Our mission
• Expertise in bioinformatics
    • Service
    • Consultation
• Trainings
    • Internal and external
• Support in infrastructures
    • In collaboration with the SIT and TIC
• Part of the CRG bioinformaticians network
    • 83 @ bioinformatics retreat
    • Many more in PRBB/CNAG

Our services
Analysis
• Microarray
• ChIP-seq
• RNA-seq DE and assembly
• Genome assembly
• Variant calling
Informatics support
• Wiki
• WEB
• Server
• API
Trainings
• Galaxy, Perl, Linux, advanced bioinformatics

Fee per service

Item                                  PRBB fees      Public fees without VAT
Manual data analysis                  13.12 €/hour   39.36 €/hour
Automated data analysis (CPU time)    2.38 €/hour    7.16 €/hour

Our contribution to projects
Project stages:
• Project conception
• Bioinfo exp. design
• Bioinfo exp. realization
• Bioinfo output interpretation
• Project conclusions
Forms of contribution:
• Applying a defined procedure
• Customized analysis

CRG bioinformatics community
Big Data WG
• EGA initiative
• Data engineering
• NoSQL
• HPC
NGS Tech. Sem.
• RNA-seq
• Genome assembly
• Variant annotation
• Metagenomics
Other topics
• Integrated -omics
• Good practice in code development
• Galaxy development
• …

Micro-arrays
Gene expression array data analysis:
• Background correction and normalization
• Differential expression analysis
• Gene Ontology and pathway analysis
• Various graphics / plots
Additional array-based technologies the Bioinformatics unit supports include:
• qPCR arrays
• Comparative Genomic Hybridization arrays
Main tools are based on the R / Bioconductor environment
Image source: Creative Commons, Wikipedia

RNA-seq

DNA-seq
Figure source: Pevzner P A et al. PNAS 2001;98:9748-9753

ChIP-seq

Growing to the next level
• From gene DE to transcript DE
    • Users now have access to longer reads and deeper coverage
• Metagenomics
    • 16S ribosomal amplicon sequencing with MiSeq
• Data integration framework
    • Combining different data types into a single analysis
    • RNA-seq DE, histone marks, metabolomics data, proteomics
• Data analysis workflow on Galaxy
    • Leave the basic processing to users and focus on advanced analysis

Databases mirroring
Biological file sources
• ENSEMBL
• UCSC
• NCBI BLAST DBs
• UniProt
• PDB
• iGenomes (Illumina; only human for now, the rest is upcoming)
All indexed and formatted for
• NCBI BLAST+ (makeblastdb for proteins and nucleic acids)
• Bowtie & Bowtie2
• BWA
• fastaindex (Exonerate)
• GEM
• faToTwoBit
Where are they stored?
• In CRG common storage: /db
• More information: http://biocore.crg.cat/wiki/Category:Mirrors
• IMPORTANT: /db/seq (former /seq) IS DEPRECATED

WEB and Database services
Applications
• Data and project management
• Platforms for big data analysis and complex information querying
• Promotion and publication of scientific results

WEB and Database services
Example: Superfly, for Yogi Jaeger
• Visual catalogue of gene expression during embryo development in different fly species.

WEB and Database services
Example: PRGDB, with Walter Sanseverino
• Wiki-based database of plant resistance genes.
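Coming back to the mirrors slide, the "formatted for NCBI BLAST+ (makeblastdb for proteins and nucleic acids)" step can be pictured with a minimal Python sketch. It only assembles the `makeblastdb` command line and does not run it. The file names are hypothetical, since the slides do not describe the layout under /db, and the helper `makeblastdb_command` is an illustration, not the script used at the BioCore.

```python
import shlex


def makeblastdb_command(fasta_path, db_type, out_prefix):
    """Return the argument list that formats one FASTA file for BLAST+."""
    # makeblastdb distinguishes protein ("prot") and nucleotide ("nucl") databases.
    if db_type not in ("prot", "nucl"):
        raise ValueError("db_type must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta_path, "-dbtype", db_type, "-out", out_prefix]


# Hypothetical inputs: one protein set and one nucleotide set.
for fasta, kind, prefix in [
    ("proteins.fasta", "prot", "proteins_db"),
    ("genome.fasta", "nucl", "genome_db"),
]:
    # shlex.join renders the list as a copy-pasteable shell command.
    print(shlex.join(makeblastdb_command(fasta, kind, prefix)))
```

Building the command as a list rather than a single string keeps file names with spaces safe if the list is later passed to `subprocess.run`.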
[Figure: Activity per category in 2014]

Presentation of the Galaxy platform
Jean-François Taly
Bioinformatics Core Facility, CRG (Barcelona, Catalonia, Spain)
September 18th 2014
EMBO Global Exchange Course
Pasteur Institute of Tunis, Tunisia

Why Should I Use Galaxy?
Biologists:
• Linux-free data analysis with a graphical interface
Bioinformaticians:
• Ensure reproducibility when sharing analyses and workflows
• Teach their knowledge to a broad audience
• Get access to workflows for topics they are not familiar with
Software developers:
• Distribute their tools on a standardized platform

The Galaxy Team
Galaxy is developed by:
• The Nekrutenko lab in the Center for Comparative Genomics and Bioinformatics at Penn State University
• The Taylor lab at Johns Hopkins University
• The community
https://wiki.galaxyproject.org/GalaxyTeam

Rationale behind Galaxy
From Goecks et al. Genome Biol. 2010:
“Computation has become an essential tool in life science research. This is exemplified in genomics, where first microarrays and now massively parallel DNA sequencing have enabled a variety of genome-wide functional assays, such as ChIP-seq and RNA-seq (and many others), that require increasingly complex analysis tools. However, sudden reliance on computation has created an 'informatics crisis' for life science researchers: computational resources can be difficult to use, and ensuring that computational experiments are communicated well and hence reproducible is challenging. Galaxy helps to address this crisis by providing an open, web-based platform for performing accessible, reproducible, and transparent genomic science.”

Why Should I Use Galaxy?
Makes bioinformatics accessible
From a command line … to a graphical interface
• One step
• Multi-step protocol (steps 1 to 5)
• Workflow

Galaxy Tutorials
• https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
• https://wiki.galaxyproject.org/Learn

NGS on a laptop
• MinION brings NGS to your laptop
• http://youtu.be/UtXlr19xTh8

Why Should I Use Galaxy?

Reproducibility
Bioinformaticians suffer from this too!
• Results can change depending on
    • Library and software versions
    • Genome annotations
• Results are published without the code

Want to share your findings with everybody?
• Freeze an environment in a virtual machine
• Use an application container (Docker)
• Prepare a Galaxy workflow

Improve the visibility of a paper
Why not add this as well?
“A Galaxy workflow and the corresponding wrappers are available to download at https://mylab.com. A virtual machine containing a pre-set up server can be downloaded at the same address.”

Galaxy Workflows

Why Should I Use Galaxy?
Wrapping software
• XML file
• Software
• The wrapper prepares the command line

Simple wrapper example

venn_diagram.sh
• Wrappers can launch scripts

TopHat wrapper (1)
• XML file describing TopHat parameters

TopHat wrapper (2)
• XML file describing TopHat parameters

Community Tools/Wrappers

Should I install Galaxy?
Galaxy public servers
Good points
• Free
• No IT tasks
• Come with reference genomes and workflows
Bad points
• Limited resources (disk/CPUs)
• Data transfer may be long
• Only give access to the tools their maintainers want
• Data security may not be guaranteed

Galaxy Public Servers
https://wiki.galaxyproject.org/PublicGalaxyServers

Should I install Galaxy?
Galaxy local server
Good points
• Total control over data and tools
• Disk and CPU limits are your own
• Some companies sell a ready-to-use infrastructure
• The Tool Shed helps to install wrappers and software
Bad points
• Cost of installation and maintenance
• IT support is needed for an advanced multi-user setup

Get Galaxy
https://wiki.galaxyproject.org/Admin/GetGalaxy
Can be installed only on Linux or Mac

Galaxy server architecture (diagram labels): Galaxy server; Tools; User; Sequences; Indexes; DATA; Software; NFS:/software; Files, Back-up, tmp; NFS; Files > 2 GB; HPC; /scratch; FTP; 30 days max.
NFS:/db

Database engine
• The Galaxy team recommends PostgreSQL, but it can also be MySQL
• Stores user details and data information

Tools = wrappers
• File describing all possible parameters of a piece of software
• Script preparing the correct command line

Apache server

Shared file system
• NFS (2 PB)
• 10 €/TB/group/month
• Access to the shared biological resources
    • Ensembl, UCSC genomes and indexes
    • UniProt, Pfam, SMART, PDB
• Access to the shared software repository

High Performance Computing
• 7 nodes with 8 CPUs each (56 in total)
• 47 GB memory

FTP server
• ProFTPD on the server side
• I recommend FileZilla for the client (multiplatform)
• Upload from Galaxy
• Files are moved to the shared file system

Summary
Galaxy is an open, web-based platform for computational biomedical research.
• Accessible: users without programming experience can run tools and workflows
• Reproducible: Galaxy captures analysis details
• Transparent: users can share and publish analyses

WIKI: https://wiki.galaxyproject.org/FrontPage
Demo on Galaxy@CRG: http://galaxy.crg.es/
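As a rough illustration of the "Tools = wrappers" slide (a file describing the parameters, plus something that prepares the correct command line), here is a minimal Python sketch. It is not Galaxy's implementation: real wrappers use a richer XML format and Cheetah templating. The `TOOL_XML` description, the tool id `toy_head` and the function `build_command` are invented for this example.

```python
import shlex
import xml.etree.ElementTree as ET
from string import Template

# Made-up, heavily simplified tool description (not a complete Galaxy wrapper).
TOOL_XML = """
<tool id="toy_head" name="Toy head">
  <command>head -n $lines $input</command>
  <inputs>
    <param name="input" type="data"/>
    <param name="lines" type="integer" value="10"/>
  </inputs>
</tool>
"""


def build_command(tool_xml, user_values):
    """Fill the <command> template with XML defaults and the user's choices."""
    tool = ET.fromstring(tool_xml)
    params = list(tool.iter("param"))
    # Defaults come from the XML; values chosen in the form override them.
    values = {p.get("name"): p.get("value") for p in params if p.get("value") is not None}
    values.update(user_values)
    missing = {p.get("name") for p in params} - set(values)
    if missing:
        raise ValueError(f"missing parameters: {sorted(missing)}")
    # Quote every value so that file names with spaces stay a single argument.
    quoted = {name: shlex.quote(str(value)) for name, value in values.items()}
    return Template(tool.findtext("command").strip()).substitute(quoted)


print(build_command(TOOL_XML, {"input": "reads.fastq"}))
```

The user only ever fills in a form for `input` and `lines`; the command line itself stays hidden, which is what makes the graphical interface "Linux-free".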