Managing Next Generation Sequencing and Multiplexed Genotyping Data Using Open Source LabKey Server

Adam Rauch
adam@labkey.com
© 2010 LabKey Software
www.labkey.com

LabKey Software Company Overview

• LabKey Software is a consulting company
  – Spun off from the McIntosh Lab (part owned by FHCRC)
  – Professional software engineers from Amazon, Microsoft, BEA, etc.
• Work in partnership with scientists
  – For-profit fee-for-service contracts
  – Non-profit grant sub-awards: co-investigators with a shared research agenda
  – All development approved by and relevant to FHCRC
• Development & support around LabKey Server
  – Extending the base LabKey Server platform
  – Creating customized lab-specific solutions
  – Hosting LabKey Server
  – Support

LabKey Software 2010

What Is LabKey Server?

• An open-source, web-based platform for organizing, analyzing & sharing scientific data
• Data integration analysis for assays
  – Proteomics, flow cytometry, plate-based assays, etc.
• Study Data Management
  – Combines demographic, clinical, assay & specimen data
• LabKey Server powers many deployments…
  – CPAS: FHCRC proteomics repository
  – Atlas Science Portal: SCHARP’s HIV vaccine studies
  – AdaptiveTCR: Customer analytics for ImmunoSEQ NGS
  – UW (Katze, Heinecke, et al.), USC, Markey, Harvard, IDRI, TGen, Wisconsin Primate EHR, UC Denver, etc.

Dave O’Connor Lab, University of Wisconsin

• Academic research lab
• Focus: understanding SIV using nonhuman primate models & applying NHP methods to human HIV disease research

O’Connor Lab SIV/HIV Research

[Figure: Host Immune Genetics. Source: modified from Yewdell et al., Nature Reviews Immunology 2003]
[Figure: Virus Genetics. Source: Korber et al., British Medical Bulletin 2001]

Importance of MHC Class I

[Figure: Host Immune Genetics. Source: modified from Yewdell et al., Nature Reviews Immunology 2003]

• MHC class I molecules dictate immunity to disease
• High degree of polymorphism within the MHC class I peptide-binding domain
• Specific MHC alleles associated with superior control of HIV infection

Importance of Viral Variability

• HIV has fast replication cycle, high mutation rate
• Evolution of the virus causes escape from immune responses
• Specific mutations are associated with resistance to antiretroviral drug therapy

[Figure: Virus Genetics. Source: Korber et al., British Medical Bulletin 2001]

Sequencing in the O’Connor Lab

• 2005 – 2009: Sanger sequencing
  – “Prohibitively expensive” for most experiments
• 2009: Roche/454 GS FLX at UIUC
• 2010: Roche/454 GS Junior in lab

Roche/454 GS Junior

• Long-read instrument, critical for genotyping
• Identical to GS FLX, but 1/8 throughput & lower cost
• ~100,000 reads per run (~1¢ per read), average ~560 bp read length
• 115 runs this year
• MID tagging
  – Allows pooling multiple samples (30-100) into a single run
• Galaxy server
  – Open-source sequence analysis tool (Giardine et al., Genome Res 2005)
  – Lab has built custom workflow to match sequences to known MHC alleles
  – Uses BLAT, transitioning to AGILE (Northwestern alignment tool)

Roche/454 MHC Workflow

• Total RNA isolation and cDNA synthesis
  – RNA isolation ~4 hrs; cDNA synthesis ~2 hrs
• Primary PCR amplification
  – plus SPRI purification, quantification, pooling ~3 hrs
• emPCR
  – set-up ~1 hr, run ~5.5 hrs
• Breaking and enrichment
  – ~3 hrs
• Roche/454 GS Junior run
  – set-up ~1.5 hrs; run time ~10 hrs
• Data processing and analysis
  – run processing ~2 hrs; analysis time varies

www.454.com

“There is a real disconnect between the ability to collect next-generation sequence data (easy) and the ability to analyze it meaningfully (hard)”
Dave O’Connor

PROBLEM: DATA MANAGEMENT!

Problem: Data Management

• As volume has increased, lab has found it difficult to manage all their sequencing data & metadata:
  – Run metadata
  – Run metrics
  – Sequencing reads and quality scores
  – Sample information and multiplex identifiers (MIDs)
  – Reference sequences for genotyping experiments
  – Genotyping matches
• O’Connor asked LabKey to build a system that can:
  – Store sequencing and genotyping data in a single database that links all the tables, allowing arbitrary queries and reports
  – Provide tools for analysis, querying, visualization and export
  – Automate data workflows for efficiency & consistency
  – Eventually, link sequencing results to their primate EHR system

LabKey Sequencing System

[Diagram labels: Reads; Quality Scores; Metrics; Sample Information; Reference Sequences; Galaxy Genotyping Workflow; Sequencing and Genotyping Database; Reporting; Analysis; Visualization; Export; External Tools]

Database Schema

[Schema diagram of the genotyping schema]

• Tables: Analyses, Sequences, Dictionaries, Matches, Runs, AnalysisSamples, Metrics, MetaData, Samples, Reads, AllelesJunction, ReadsJunction
• AllelesJunction: MatchId, SequenceId
• ReadsJunction: MatchId, ReadId
• Other column labels, in diagram order: RowId, Run, CreatedBy, Created, Description, RowId, RowId, Path, Dictionary, Container, FileName, Uid, CreatedBy, Status, AlleleName, Created, SequenceDictionary, Initials, SequencesView, GenbankId, ExptNumber, Comments, Locus, Species, Origin, Sequence, PreviousName, LastEdit, Version, ModifiedBy, Translation, Type, IpdAccession, Reference, RowId, Analysis, SampleId, Reads, RowId, [Percent], MetaDataId, AverageLength, Container, PosReads, NegReads, PosExtReads, Created, SampleId, Path, Region, Run, [...], CreatedBy, Analysis, NegExtReads, FileName, Run, Status, [...], Id, Variant, UploadId, SampleId, FullLength, [...], AlleleFamily, RowId, Run, Name, Mid, Sequence, Quality

Demo

Possible Future Directions

• Respond to O’Connor lab’s near-term needs
• Genomics-specific analytics
• Additional export formats
• Tighter integration with Galaxy
• Support for amplicon-designated reads
• Match combining
• Simplify configuration and operation
• Integrate with Wisconsin primate EHR
• Better integration with R / Bioconductor
• Visualization
• Other sequencing platforms: Illumina, PacBio…

Acknowledgements

• O’Connor Laboratory: David O’Connor, Simon Lank, Julie Karl, Benjamin Bimber
• LabKey Software: Mark Igra, Brian Connolly, Elizabeth Nelson, Josh Eckels, Matthew Bellew, et al.

Questions?
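
Backup: MID tagging sketch

MID tagging, as noted on the Roche/454 GS Junior slide, pools 30-100 samples into one run, so each read has to be assigned back to its sample before genotyping. The following is a minimal sketch of that step, not the lab's Galaxy workflow or LabKey Server code. It assumes every read starts with an exact copy of its MID. The barcodes, sample names and the `demultiplex` function are illustrative placeholders.

```python
from collections import defaultdict

# Illustrative MID barcodes mapped to placeholder sample names; real
# values would come from the run's sample information.
MID_TO_SAMPLE = {
    "ACGAGTGCGT": "sample_A",
    "ACGCTCGACA": "sample_B",
}


def demultiplex(reads, mid_to_sample):
    """Group reads by the MID found at the start of each read.

    Assumes the MID is an exact prefix of the read. Reads with no
    recognized MID are kept unmodified under the key None.
    """
    by_sample = defaultdict(list)
    for read in reads:
        sample = None
        insert = read
        for mid, name in mid_to_sample.items():
            if read.startswith(mid):
                sample = name
                insert = read[len(mid):]  # strip the tag before analysis
                break
        by_sample[sample].append(insert)
    return dict(by_sample)


reads = [
    "ACGAGTGCGTGGATTC",
    "ACGCTCGACATTGACC",
    "ACGAGTGCGTCCATGA",
    "TTTTTTTTTTGGGGGG",  # no known MID
]
pooled = demultiplex(reads, MID_TO_SAMPLE)
for sample, inserts in pooled.items():
    print(sample, inserts)
```

A production workflow would typically also tolerate sequencing errors in the tag and carry quality scores along with each read; both are omitted here to keep the sketch short.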