2014-10-MasterCourse - Bioinformatics Core at CRG

Presentation of the
CRG Bioinformatics
Core facility
Jean-François Taly
People in the BioCore
Jean-François
• @CRG 2009
• @BioCore 2012
• Acting head
• Structural bioinformatics
• MSA
• NGS analyst
• Galaxy server
• Training
Luca
• @BioCore 2010
• NGS analyst
• Small ncRNA prediction
• Motif analysis
• Training
Toni
• @BioCore 2009
• Wikis
• Web/DB dev.
• DB mirrors
• Structural bioinformatics
• Training
Sarah
• @BioCore 2014
• Microarrays
• NGS analyst
• Galaxy
• Training
Our mission
• Expertise in bioinformatics
  • Service
  • Consultation
• Training
  • Internal and external
• Support in infrastructures
  • In collaboration with the SIT and TIC
• Part of the CRG bioinformaticians network
  • 83 @ bioinformatics retreat
  • Many more in PRBB/CNAG
Our services
• Analysis
  • Microarray
  • ChIP-seq
  • RNA-seq DE and assembly
  • Genome assembly
  • Variant calling
• Informatics support
  • Wiki
  • Web server
  • API
• Training
  • Galaxy, Perl, Linux, advanced bioinformatics
Fee per service
Item                                 PRBB fees      Public fees without VAT
Manual data analysis                 13.12 €/hour   39.36 €/hour
Automated data analysis (CPU time)   2.38 €/hour    7.16 €/hour
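As an illustration of how these hourly rates add up, here is a minimal Python sketch that estimates the cost of a project. The rates are copied from the fee table; the function name, the customer labels and the example hours are assumptions made for this illustration, not part of the facility's billing system.

```python
# Hourly rates in euros, copied from the fee table (public fees exclude VAT).
RATES = {
    "manual": {"prbb": 13.12, "public": 39.36},
    "automated": {"prbb": 2.38, "public": 7.16},
}


def estimate_cost(manual_hours, cpu_hours, customer="prbb"):
    """Return an estimated cost in euros, rounded to cents.

    `customer` is a hypothetical label: "prbb" or "public".
    """
    manual = manual_hours * RATES["manual"][customer]
    automated = cpu_hours * RATES["automated"][customer]
    return round(manual + automated, 2)


if __name__ == "__main__":
    # Hypothetical project: 10 hours of analyst time and 24 hours of CPU time.
    for customer in ("prbb", "public"):
        print(customer, estimate_cost(10, 24, customer))
```

Keeping the rates in a single table makes it easy to update the sketch if the fees change.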
Our contribution to projects
Project conception
Bioinfo exp. design
Bioinfo exp. realization
Bioinfo output interpretation
Project conclusions
Apply defined procedures
Customized analysis
CRG bioinformatics community
Big Data WG
• EGA initiative
• Data Engineering
• NoSQL
• HPC
NGS Tech. Sem.
• RNA-seq
• G. assembly
• Variant Annot.
• Metagenomics
Other topics
• Integrated -omics
• Good practice in code dev.
• Galaxy dev.
• …
Micro-arrays
Gene expression array data analysis:
• Background correction and normalization
• Differential expression analysis
• Gene Ontology and pathway analysis
• Various graphics / plots
Additional array-based technologies the Bioinformatics
unit supports include:
• qPCR arrays
• Comparative Genomics Hybridization arrays
Main tools are based on the R / Bioconductor environment
source: Creative Commons, Wikipedia
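To make the normalization step of the gene expression analysis more concrete, here is a small pure-Python sketch of quantile normalization, one common way to make arrays comparable. It is only an illustration: the slides state that the main tools are R / Bioconductor, and they do not say which normalization method is used. The function name and the toy values are assumptions.

```python
def quantile_normalize(samples):
    """Quantile-normalize a list of samples (equally long lists of numbers).

    Each value is replaced by the mean of the values that share its rank
    across all samples. Ties are broken by position, which is a
    simplification compared with dedicated packages.
    """
    n = len(samples[0])
    if any(len(s) != n for s in samples):
        raise ValueError("all samples must have the same number of probes")

    # Mean of the k-th smallest value over all samples, for every rank k.
    sorted_samples = [sorted(s) for s in samples]
    rank_means = [sum(s[k] for s in sorted_samples) / len(samples) for k in range(n)]

    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # probe indices, lowest value first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


if __name__ == "__main__":
    # Two toy arrays with three probes each (made-up intensities).
    print(quantile_normalize([[5, 2, 3], [4, 1, 6]]))
```

After normalization every sample has exactly the same distribution of values; only the order of the probes differs.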
RNA-seq
DNA-seq
Pevzner P A et al. PNAS 2001;98:9748-9753
ChIP-seq
Growing to the next level
• From gene DE to transcript DE
  • Users now have access to longer reads and deeper coverage
• Metagenomics
  • 16S ribosomal amplicon sequencing with MiSeq
• Data integration framework
  • Combining different data types into a single analysis
    • RNA-seq DE
    • Histone marks
    • Metabolomics data
    • Proteomics
• Data analysis workflows on Galaxy
  • Leave the basic processing to users and focus on advanced analysis
Databases mirroring
• Biological file sources
  • Ensembl
  • UCSC
  • NCBI BLAST DBs
  • UniProt
  • PDB
  • iGenomes (Illumina; human only for now, other species upcoming)
• All indexed and formatted for
  • NCBI BLAST+ (makeblastdb for proteins and nucleic acids)
  • Bowtie & Bowtie2
  • BWA
  • fastaindex (Exonerate)
  • GEM
  • faToTwoBit
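As a rough idea of what "indexed and formatted" means in practice, the following Python sketch assembles the indexing command lines for some of the tools listed above. It only builds and prints the commands; nothing is executed. The file name, the prefix and the function name are assumptions for illustration, the 2bit converter is assumed to be the UCSC faToTwoBit utility, and GEM and fastaindex are left out of the sketch. The actual mirroring scripts of the facility are not shown in the slides.

```python
import shlex


def index_commands(fasta, prefix):
    """Return one indexing command (as an argument list) per tool.

    `fasta` is a nucleotide FASTA file and `prefix` the output base name;
    both are hypothetical values chosen by the caller.
    """
    return {
        "blast": ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", prefix],
        "bowtie": ["bowtie-build", fasta, prefix],
        "bowtie2": ["bowtie2-build", fasta, prefix],
        "bwa": ["bwa", "index", "-p", prefix, fasta],
        "2bit": ["faToTwoBit", fasta, prefix + ".2bit"],
    }


if __name__ == "__main__":
    for tool, cmd in index_commands("genome.fa", "genome").items():
        # shlex.quote keeps paths with spaces safe when pasted into a shell.
        print(tool, "->", " ".join(shlex.quote(part) for part in cmd))
```

Each argument list could be passed to `subprocess.run` on a machine where the corresponding tool is installed.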
Where are they stored?
• In CRG common storage:
  • /db
• More information:
  • http://biocore.crg.cat/wiki/Category:Mirrors
• IMPORTANT:
  • /db/seq (former /seq) IS DEPRECATED
WEB and Database services
• Applications
  • Data and project management
  • Platforms for big data analysis and complex information querying
  • Promotion and publication of scientific results
• Example: SuperFly, for Yogi Jaeger
  • Visual catalogue of gene expression during embryo development in different fly species.
• Example: PRGDB, with Walter Sanseverino
  • Wiki-based database of plant resistance genes.
Activity per category in 2014
Presentation of the Galaxy platform
Jean-François Taly
Bioinformatics Core Facility
CRG (Barcelona, Catalonia, Spain)
September 18th 2014
EMBO Global Exchange Course
Pasteur Institute of Tunis, Tunisia
Why Should I Use Galaxy?
• Biologists:
  • Linux-free data analysis with a graphical interface
• Bioinformaticians:
  • Ensure reproducibility when sharing analyses and workflows
  • Teach their knowledge to a broad audience
  • Get access to workflows for topics they are not familiar with
• Software developers:
  • Distribute their tools on a standardized platform
The Galaxy Team
Galaxy is developed by:
• The Nekrutenko lab in the Center for Comparative Genomics and Bioinformatics at Penn State University
• The Taylor lab at Johns Hopkins University
• The community
https://wiki.galaxyproject.org/GalaxyTeam
Rationale behind Galaxy
From Goecks et al. Genome Biol. 2010.
“Computation has become an essential tool in life
science research. This is exemplified in genomics, where first
microarrays and now massively parallel DNA sequencing have
enabled a variety of genome-wide functional assays, such as ChIP-seq
and RNA-seq (and many others), that require increasingly complex
analysis tools. However, sudden reliance on computation has created
an 'informatics crisis' for life science researchers: computational
resources can be difficult to use, and ensuring that
computational experiments are communicated well and
hence reproducible is challenging. Galaxy helps to address
this crisis by providing an open, web-based platform for performing
accessible, reproducible, and transparent genomic science.”
Makes bioinformatics accessible
From a command line …
… to a graphical interface
One step
Multi-step protocol (steps 1 to 5)
Workflow
Galaxy Tutorials
• https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
• https://wiki.galaxyproject.org/Learn
NGS on a laptop
• MinION brings NGS to your laptop
• http://youtu.be/UtXlr19xTh8
Reproducibility
Bioinformaticians suffer from this too!
• Results can change depending on
  • Libraries and software versions
  • Genome annotations
• Results are published without the code
Want to share your findings with everybody?
• Freeze an environment in a virtual machine
• Use an application container (Docker)
• Prepare a Galaxy workflow
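Short of freezing a whole environment, a first step towards reproducibility is simply to record which versions produced a result. The Python sketch below writes such a record; it is an assumption-laden illustration (the function name, the package list and the annotation label are hypothetical), not a replacement for a virtual machine, a container or a Galaxy workflow.

```python
import json
import platform
from importlib import metadata


def snapshot_environment(packages, annotation=None):
    """Collect interpreter, platform and package versions in a dictionary.

    `packages` is a list of installed distribution names to look up;
    `annotation` is a free-text label for the genome annotation used.
    Packages that are not installed are recorded as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": versions,
        "annotation": annotation,
    }


if __name__ == "__main__":
    # Hypothetical package names and annotation label, for illustration only.
    snapshot = snapshot_environment(["numpy", "pysam"], annotation="Ensembl release 77")
    print(json.dumps(snapshot, indent=2))
```

Saving this record next to the results at least tells a reader which library versions and which annotation release were in use.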
Improve the visibility of a paper
Why not add a statement such as this one?
“A Galaxy workflow and the corresponding wrappers are
available to download at https://mylab.com. A virtual
machine containing a pre-set-up server can be downloaded
at the same address.”
Galaxy Workflows
Wrapping software
XML file
Software
The wrapper prepares the command line
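The idea of a wrapper preparing a command line can be sketched in a few lines of Python. Galaxy itself fills a template written in the tool's XML file; here Python's string.Template stands in for that mechanism, and the tool name `my_tool`, its options and the parameter names are all hypothetical.

```python
import shlex
from string import Template


def build_command(template, params):
    """Fill a command-line template with user-supplied parameter values.

    Every value is shell-quoted so that spaces or special characters in
    file names cannot break the command. A missing parameter raises KeyError.
    """
    quoted = {name: shlex.quote(str(value)) for name, value in params.items()}
    return Template(template).substitute(quoted)


if __name__ == "__main__":
    # Hypothetical tool and options, standing in for a real wrapped program.
    template = "my_tool --input $input --min-length $min_length --output $output"
    params = {"input": "reads 1.fastq", "min_length": 20, "output": "filtered.fastq"}
    print(build_command(template, params))
```

The values chosen in the graphical form play the role of `params`, and the resulting string is what is finally run on the server.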
Simple wrapper example
venn_diagram.sh
• Wrappers can launch scripts
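The wrapper shown on the slide is not reproduced in this text, so here is a hedged sketch of what a minimal Galaxy tool XML for a script such as venn_diagram.sh could look like, generated with Python's standard library. The element names (tool, description, command, inputs/param, outputs/data, help) follow the Galaxy tool XML format, and the `interpreter` attribute is the older convention for launching scripts; the tool id, the parameter names and the script arguments are assumptions.

```python
import xml.etree.ElementTree as ET


def build_wrapper():
    """Return a minimal Galaxy tool definition as an XML string."""
    tool = ET.Element("tool", {"id": "venn_diagram", "name": "Venn diagram", "version": "0.1.0"})
    ET.SubElement(tool, "description").text = "from two lists of identifiers"

    # Command template: $list_a, $list_b and $plot are replaced by file paths at run time.
    command = ET.SubElement(tool, "command", {"interpreter": "bash"})
    command.text = "venn_diagram.sh $list_a $list_b $plot"

    # Two input datasets (hypothetical parameter names).
    inputs = ET.SubElement(tool, "inputs")
    for name, label in (("list_a", "First list"), ("list_b", "Second list")):
        ET.SubElement(inputs, "param", {"name": name, "type": "data", "format": "txt", "label": label})

    # One output dataset holding the picture.
    outputs = ET.SubElement(tool, "outputs")
    ET.SubElement(outputs, "data", {"name": "plot", "format": "png"})

    ET.SubElement(tool, "help").text = "Draws a Venn diagram of two identifier lists."
    return ET.tostring(tool, encoding="unicode")


if __name__ == "__main__":
    print(build_wrapper())
```

In a real installation this XML would normally be written by hand and stored next to the script it launches.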
TopHat wrapper (1) and (2)
• XML file describing TopHat parameters
Community Tools/Wrappers
Should I install Galaxy?
Galaxy Public servers
• Good points
  • Free
  • No IT tasks
  • Come with reference genomes and workflows
• Bad points
  • Limited resources (disk/CPUs)
  • Data transfer may be slow
  • Only the tools selected by the server administrators are available
  • Data confidentiality may not be guaranteed
Galaxy Public Servers
• https://wiki.galaxyproject.org/PublicGalaxyServers
Should I install Galaxy?
Galaxy Local Server
• Good points
  • Total control over data and tools
  • Limited only by your own disk and CPUs
  • Some companies sell a ready-to-use infrastructure
  • The Tool Shed helps to install wrappers and software
• Bad points
  • Cost of installation and maintenance
  • IT support is needed for an advanced multi-user setup
Get Galaxy
• https://wiki.galaxyproject.org/Admin/GetGalaxy
• Can only be installed on Linux or Mac
Galaxy server
[Diagram: architecture of the Galaxy server. Labels: Tools; User; Sequences; Indexes; DATA; Software NFS:/software; Files, Back-up, tmp; NFS; Files > 2 GB; HPC; /scratch; FTP; 30 days max.; NFS:/db]
• Database engine
  • The Galaxy team recommends PostgreSQL, but MySQL can also be used
  • Stores user details and data information
• Tools = wrappers
  • File describing all possible parameters of a program
  • Script preparing the correct command line
• Apache server
• Shared file system
  • NFS (2 PB)
  • 10 €/TB/group/month
• Access to the shared biological resources
  • Ensembl, UCSC genomes and indexes
  • UniProt, Pfam, SMART, PDB
• Access to the shared software repository
• High Performance Computing
  • 7 nodes
  • 8 CPUs each (56 total)
  • 47 GB memory
• FTP server
  • ProFTPD for the server side
  • I recommend FileZilla for the client (multiplatform)
• Upload from Galaxy
  • Files are moved to the shared file system
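For the database engine item in the list above, Galaxy is pointed at its database through a connection URL given in the `database_connection` setting of its configuration file. The Python sketch below builds such a URL for PostgreSQL or MySQL; the user, password, host and database names are placeholders, and the helper function itself is an assumption made for this illustration.

```python
from urllib.parse import quote


def database_connection(user, password, host, dbname, engine="postgresql", port=None):
    """Build a database URL of the form engine://user:password@host[:port]/dbname.

    User and password are percent-encoded so that characters such as '@'
    or ':' do not break the URL.
    """
    netloc = f"{quote(user, safe='')}:{quote(password, safe='')}@{host}"
    if port is not None:
        netloc += f":{port}"
    return f"{engine}://{netloc}/{dbname}"


if __name__ == "__main__":
    # Placeholder credentials and host, for illustration only.
    print("database_connection =", database_connection("galaxy", "s3cret", "db.example.org", "galaxy"))
    print("database_connection =", database_connection("galaxy", "s3cret", "db.example.org", "galaxy", engine="mysql"))
```

Switching between the two engines only changes the prefix of the URL, which is why the choice can be left to the local IT constraints.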
Summary
• Galaxy is an open, web-based platform for computational biomedical research.
  • Accessible: users without programming experience can run tools and workflows
  • Reproducible: Galaxy captures analysis details
  • Transparent: users can share and publish analyses
• WIKI:
  • https://wiki.galaxyproject.org/FrontPage
Demo on Galaxy@CRG
• http://galaxy.crg.es/