CLC Genomics Workbench
Features & Benefits

Dr. John Greally, Director of the Einstein Center for Epigenomics at the Albert Einstein College of Medicine:

"CLC bio's tools are going to put sophisticated analytical ability into the hands of molecular biologists at Einstein, and will greatly enhance their ability to explore the massively-parallel sequencing data that we are generating. We see this as a way of lowering barriers for scientists who have not previously performed these high-throughput epigenomic assays, allowing them to explore their data and explore hypotheses."

For Windows, Mac OS X, and Linux

CLC bio © Copyright 2013 clcbio.com

Solving the data analysis challenges of High-Throughput Sequencing

High-Throughput Sequencing machines have made High-Throughput Sequencing accessible to a very large group of researchers. However, data analysis is a serious bottleneck in the NGS pipelines of most R&D departments, which in turn dramatically reduces the return on investment of current NGS assets.

CLC Genomics Workbench solves this problem and enables everyone to rapidly analyze and visualize the huge amounts of data generated by NGS machines. The user-friendly and intuitive interface essentially takes High-Throughput analysis away from hardcore bioinformatics programmers doing command-line scripts, and hands it to scientists searching for biological results. Furthermore, the versatile nature of CLC Genomics Workbench allows it to blend seamlessly into existing sequencing analysis workflows, easing implementation and maximizing return on investment.

Multi technology – multi platform

CLC Genomics Workbench includes High Performance Computing accelerated assembly of High-Throughput Sequencing data as well as a large number of downstream analysis tools. CLC Genomics Workbench is the first comprehensive analysis package which can analyze and visualize data from all major sequencing platforms, such as SOLiD, 454, Sanger, Illumina and Ion Torrent.
Collaboration with instrument manufacturers is a natural part of CLC bio's development process.

Like all other Workbenches from CLC bio, CLC Genomics Workbench runs on Mac OS X, Windows, and Linux platforms. You decide which computer to run your software on – not us.

Some features of CLC Genomics Workbench

Genomics
• Read mapping of Sanger, 454, Illumina Genome Analyzer and SOLiD sequencing data
• De novo assembly of genomes of any size (only limited by RAM available)
• Color space mapping
• Advanced visualization, scrolling, and zooming tools
• Resequencing tools (variant detection and downstream analysis)
• Support for multiplexing with DNA barcoding

Transcriptomics
• RNA-seq incl. support for paired data and transcript-level expression
• Small RNA analysis
• Expression profiling by tags
• EST library construction
• Advanced visualization, scrolling, and zooming tools
• Gene expression analysis

Epigenomics
• ChIP-seq analysis
• Peak finding and peak refinement
• Case/control analysis

Classical sequence analysis tools
• Primer design
• Molecular cloning
• BLAST
• Alignments
• Phylogenetic trees
• Advanced RNA structure prediction and editing
• Integrated 3D molecule analysis
• Secondary protein structure predictions
• And much more...

Genomics Features

CLC bio's world renowned scientists have designed completely new and innovative algorithms to power the features of CLC Genomics Workbench. These highly advanced, cutting-edge algorithms use SIMD processor acceleration to yield a significant speed-up of both the read mapping and the de novo assembly processes.

Read mapping

The read mapping functionality of CLC Genomics Workbench supports short and long reads, paired reads, gapped and ungapped alignments, complex genomes with many repeats, and Sanger, 454, Illumina Genome Analyzer and SOLiD sequencing data.

CLC Genomics Workbench maps reads to genomes of any size as long as the computer has the necessary RAM. A 10 fold human genome read mapping can be carried out on a standard computer with 24 GB of RAM.

Mapping of SOLiD data is carried out in native color space, using a high performance computing based algorithm. Up to 80% more hits have been found when assembling 35mer SOLiD data in color space, compared to assembling the same data in base space.

Fig. 1: A region of low coverage has been found in the assembly view, and the corresponding region of the contig sequence is automatically highlighted.

De novo assembly

The de novo assembly of CLC Genomics Workbench supports both short and long reads, paired reads, and Sanger, 454, Illumina Genome Analyzer, and SOLiD sequencing data.

The de novo assembly process has two stages. First, contig sequences are created by assembling all the reads. Second, all the reads are mapped using the contig sequences as reference.

A combination of paired data protocols can be used, mixing paired end and mate pair data with various insert sizes in the same assembly. Depending on the coverage and quality of the data, and limited only by the available RAM, CLC Genomics Workbench de novo assembles genomes of any size.

Support for analysis of hybrid data

Read mapping as well as de novo assembly support the analysis of different kinds of data at the same time. An example would be the de novo assembly of Sanger data, 454 single read data, and Illumina paired end data in the same analysis. This functionality dramatically reduces manual work for the scientists, letting them focus on deriving biological results from the data instead of doing tedious data-crunching.

Multiplexing

When doing batch sequencing of different samples, you can use multiplexing techniques to run different samples in the same run. The data analysis challenge is then to separate the sequencing reads, so that the reads from one sample are analyzed together. CLC Genomics Workbench supports a large number of multiplexing protocols, covering both multiplexing based on name and multiplexing based on tags or barcodes.

Resequencing

CLC Genomics Workbench supports a complete resequencing pipeline, from read mapping through variant detection to downstream analysis.

Some of the features of the resequencing pipeline in CLC Genomics Workbench are:
• Tracks for comparing and displaying genomics data
• Advanced variant detection, also well suited for genomes of higher ploidy
• Trio analysis comparing father-mother-child variants
• Easy download of genomics sequence and annotation data from public databases

In the workbench it is possible to build workflows that combine various tools from the Toolbox into one, e.g. several filtering and annotation steps. Workflows can be run in batch, which makes them a powerful way of analyzing a high number of samples through the same pipeline.

Identifying genomic rearrangements

Through the advanced graphical user interface, CLC Genomics Workbench supports the identification of a variety of genomic rearrangements like insertions, deletions, duplications and inversions.

Transcriptomics Features

CLC Genomics Workbench has tools to support a full workflow in the analysis of expression data. These include visual quality control tools, such as principal component plots and box plots, transformation and normalization tools, tools for statistical testing and false discovery rate control, clustering algorithms, heat-map visualization, and tests on gene annotations, such as hypergeometric tests and Gene Set Enrichment Analysis. Data supported for expression analysis are RNA-seq, small RNA, tag-based expression profiling, and single color microarray gene expression data. The interactivity of the multiple available views allows easy navigation and overview of data and analysis results.
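The text above does not say which false discovery rate procedure the Workbench applies. Purely as an illustration of what FDR control does to a list of per-gene p-values, here is a minimal sketch of the widely used Benjamini-Hochberg adjustment. The function name and the example p-values are made up for this sketch and are not taken from CLC Genomics Workbench.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order.

    Illustrative sketch only; not the Workbench's implementation.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values for four genes.
    example = [0.01, 0.04, 0.03, 0.005]
    for p, q in zip(example, benjamini_hochberg(example)):
        print(f"p = {p:.3f}  adjusted = {q:.3f}")
```

Genes whose adjusted value falls below a chosen threshold (0.05 is a common choice) would be reported as differentially expressed at that false discovery rate.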
The complete integration of the expression analysis in the workbench enables the user to carry out downstream analysis of genes of interest with the comprehensive set of sequence analysis tools provided, immediately and without the hassle of switching between software packages.

Digital Gene Expression

CLC Genomics Workbench includes mRNA-seq analysis based on the approach from Mortazavi A, et al., "Mapping and quantifying mammalian transcriptomes by RNA-Seq", Nat Methods. 2008 Jul;5(7):621-8. One of the advantages of this model is that the statistics are based on RPKM (Reads Per Kilobase of exon model per Million mapped reads), which is a good and easy way of normalizing values for the expression level of a gene when using Digital Gene Expression. That is, the number of reads mapped to a gene is divided by the total exon length in kilobases and by the number of mapped reads in millions.

Fig. 2: Heat-map visualization tool letting you depict the table of expression values.

Small RNA analysis

Small RNA sequenced on SOLiD, Illumina or 454 systems can be analyzed using CLC Genomics Workbench. Adapter trimming and, optionally, de-multiplexing are the first steps in the analysis, followed by tag counting and finally powerful tools for annotating the small RNAs using miRBase and other resources. The annotations can be grouped on the precursor or mature miRNA level. The final results can be visualized and analyzed using the expression analysis tools.

Fig. 3: A table view of an expression sample generated from a sequence file of NGS mRNA reads. (Columns shown for sample brain_sample1: Feature ID, Expression values, Transcripts, Unique gene reads, Unique exon reads, Total gene reads.)

Expression profiling by tags

CLC Genomics Workbench includes a powerful tag profiling functionality which is an extension to SAGE, using NGS technology. The full workflow of extracting tags from sequence reads, tag counting, creating a virtual tag list, and annotating tag counts with gene names is supported.

EST library construction

It is possible to construct an EST library using the de novo assembly algorithm, e.g. to be used as reference sequences for mRNA-seq or tag-based transcriptomics.

Epigenomics analyses

CLC Genomics Workbench includes a fully integrated ChIP-seq analysis solution which enables researchers to go from raw data, through reference alignment, to advanced visual and tabular output of ChIP-seq results. The analysis can be based on the information contained in a single sample subjected to immunoprecipitation (ChIP-sample) or on a comparison of a ChIP-sample to a control sample.

Classical Sequence Analysis

In addition to all the High-Throughput Sequencing analysis tools, CLC Genomics Workbench includes all of the more than 100 features of CLC Main Workbench for carrying out downstream analysis and for designing follow-up lab experiments. A few examples are primer design, molecular cloning, BLAST, 5 different types of alignments, a 3D molecule viewer, and phylogenetic analyses.

Server Integration

CLC Genomics Workbench integrates smoothly with CLC Genomics Server (Fig. 4). This enables the Genomics Workbench user to run heavy jobs like whole genome assemblies on one or more central, powerful computers while working with downstream analyses of other data on the local computer.

In addition to computational power, CLC Genomics Server offers a flexible job queueing system, easy integration with other applications, easy data sharing opportunities, and a range of other functionalities to ensure that your High-Throughput Sequencing analyses are carried out in a fast, secure, and flexible IT environment.

CLC Genomics Server can be fully integrated with CLC Bioinformatics Database, supporting Oracle, MySQL, PostgreSQL, H2, and Microsoft SQL Server.

Fig. 4: Overview of our three-tier solution, the CLC Genomics Server. People can access the server from their laptop computer and easily work on large projects.

Customization

A new and fast evolving technology, High-Throughput Sequencing constantly provides researchers with new scientific opportunities and new ways of analyzing the huge amounts of data. The problem is not lack of ideas or lack of data. The problem is lack of efficient software for carrying out the analyses or for removing manual bottlenecks in the workflow.

CLC bio eliminates these challenges by offering a free Java based Software Developer Kit (SDK) for CLC Genomics Workbench and for CLC Genomics Server. Using the SDK, you will be able to integrate your own algorithms with our products.

We also design and develop customized add-on modules for CLC Genomics Workbench and CLC Genomics Server based on specific customer requests. This is a quick and cost effective way of improving both the speed and the quality of your research.

CLC Genomics Workbench is available for Windows, Mac OS X, and Linux (Red Hat 5 or later, SUSE 10.2 or later). For detailed system requirements, please refer to clcbio.com/support/system-requirements.

Contact your local sales representative or send an e-mail to sales@clcbio.com if you would like to try CLC Genomics Workbench.

CLC bio · EMEA
Silkeborgvej 2 · Prismet
8000 Aarhus C
Denmark
Phone: +45 7022 5509

CLC bio · Americas
10 Rogers St # 101
Cambridge · MA 02142
USA
Phone: +1 (617) 945 0178

CLC bio · AsiaPac
69 · Lane 77 · Xin Ai Road · 7th fl.
Neihu District · Taipei · Taiwan 114
Taiwan
Phone: +886 2 2790 0799

8.09.2014

Table 1: Read mapping benchmarks. All benchmarks were run on a CLC Genomics Machine: 2x Intel X5650 @ 2.66 GHz, 12 cores (24 logical cores), 48 GB RAM, RAID5 of SATA disks. All data sets were mapped to Human GRCh37.

Data set                             Reads          Read length         Total bases     Fold coverage
Roche 454, single-end, NA12878       2.8 million    50-2.5 kilobases    1.6 gigabases   0.5
Ion Torrent, single-end, NS12911     11.7 million   200-300 bases       2.9 gigabases   0.9
Illumina, paired-end, NA18507        1.34 billion   101 bases           134 gigabases   43.7
PacBio, single-end                   1.7 million    50-2.5 kilobases    0.9 gigabases   0.3

Run time: 13 7 21 3 17 12 59 1 4 43 hrs hr mins secs mins secs mins secs mins secs
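The fold coverage figures in Table 1 can be approximated as the total number of sequenced bases divided by the length of the reference genome. The sketch below is illustrative only: the GRCh37 length of roughly 3.1 gigabases and the function name `fold_coverage` are assumptions made for this example, not details of the benchmark setup.

```python
# Assumption: approximate length of the human GRCh37 reference, in bases.
GRCH37_LENGTH_BASES = 3.1e9


def fold_coverage(total_bases, genome_length=GRCH37_LENGTH_BASES):
    """Average fold coverage = sequenced bases / reference length."""
    return total_bases / genome_length


# Total bases per data set, as listed in Table 1.
TABLE_1_BASES = {
    "Roche 454": 1.6e9,
    "Ion Torrent": 2.9e9,
    "Illumina": 134e9,
    "PacBio": 0.9e9,
}

if __name__ == "__main__":
    for platform, bases in TABLE_1_BASES.items():
        print(f"{platform}: {fold_coverage(bases):.1f} fold")
```

With this assumed genome length, the three low-coverage data sets round to the 0.5, 0.9, and 0.3 fold listed in Table 1, and the Illumina data set comes out near 43 fold. The exact figure depends on the reference length used.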