Working Document – Version D01 – January, 2014

Sanofi’s tranSMART version
Description of the enhancements implemented (Working Document)

For any questions on this document, please contact any of the following persons: Claire Virenque (claire.virenque@sanofi.com), Annick Peleraux (annick.peleraux@sanofi.com), Sherry Cao (Xiaohong.Cao@genzyme.com), Charlotte Raillere (charlotte.raillere@sanofi.com)

1. PURPOSE OF THIS DOCUMENT

This document describes the main tranSMART enhancements developed for Sanofi through two successive releases: RC1 (Release Candidate 1) and RC2.

tranSMART RC1
– Available. Code base posted on GitHub:
  o Private repository (TM Data Hub – owned by Recombinant)
  o Public repository (tranSMART > Sanofi RC1)
– Main improvements:
  o Search capabilities improved through new tagging functionalities
  o New ‘Browse’ tab for exploring data stored in tranSMART

tranSMART RC2
– Built on top of RC1. Developments ongoing.
– Code ‘in progress’ is posted on GitHub: public repository (thehyve / branch RC2_dev)
– Completion date: (late) February
– Main improvements:
  o New omics data types supported
  o Serial data supported
  o Existing analytics improved
  o Grid View improved
  o Export improved
  o Incremental data loading capability

A high-level description of the enhancements implemented in the RC1 and RC2 releases can be found in the Appendix of this document.

2. TABLE OF CONTENTS

1. Purpose of this document
2. Table of contents
3. New search and browse capabilities (RC1)
  3.1. New organization of tranSMART GUI
  3.2. New browsing capabilities (‘Browse’ tab)
  3.3. New tagging capabilities based on metadata (‘Browse’ tab)
  3.4. ‘Integrated’ search based on dictionaries + free text search
4. New data types supported (RC2)
  4.1. Refactoring code for high dimension data types – Generic API
  4.2. RBM data
  4.3. Mass spec proteomic data
  4.4. qPCR miRNA data
  4.5. miRNAseq data
  4.6. RNAseq data
  4.7. Metabolomics data
  4.8. Serial data
5. Modification of ETL (RC2)
  5.1. New ETLs developed
  5.2. Optimization of the clinical ETL
  5.3. Enable incremental data loading for a study
6. New capabilities for storing and retrieving unstructured data (files) (RC1)
7. Improvement of samples handling (RC2)
8. Improvement of current analytics (RC2)
  8.1. Adaptation of legacy analytics to new data types
  8.2. Refactoring of analysis code
  8.3. Adaptation of legacy analytics to serial data
  8.4. Improvement of Box Plot with ANOVA
  8.5. Improvement of Scatter Plot
  8.6. Improvement of Line Graph
9. Improvement of Grid View (RC2)
10. Improvement of export (RC2)
11. Appendix – High level description of RC1 and RC2 requirements

3. NEW SEARCH AND BROWSE CAPABILITIES (RC1)

3.1. New organization of tranSMART GUI

Two main tabs – synchronized with each other:

Purpose of the two tabs:

Note: The ‘Analyze’ tab represents the former ‘Dataset Explorer’. The ‘Browse’ tab is used in place of the former ‘Search’.

3.2. New browsing capabilities (‘Browse’ tab)

Users can search and browse data contained in tranSMART using the ‘Browse’ tab.

Data is organized in a hierarchical structure:

In Browse, a ‘Program Explorer’ panel lets users navigate within Programs, Studies, Assays, Analyses or File Folders:

Clicking on an object (its name) will display the description of:
– Its metadata (tags) (1)
– Its direct children (Studies, Assays, Analyses, File Folders) (2)

3.3. New tagging capabilities based on metadata (‘Browse’ tab)

Each object (Program, Study, Assay, etc.) is tagged with metadata (or tags).

Metadata:
– Provide information on the object
– Enable queries using search

Users can tag an object directly from the tranSMART GUI using predefined annotation templates:
– Most fields use a controlled vocabulary (CV) with pick-list or autocomplete functionalities.
– Examples of dictionaries used: MeSH, WhoDD, some branches of the Nextbio Ontology.
– The Description field captures free text.

Note: Enhancements have been made to add the ability to upload, search and export files from tranSMART – see section 6 below for details.

3.4. ‘Integrated’ search based on dictionaries + free text search

3.4.1. Overview of search functions

A new search function was created at the top of the tranSMART screen, which enables users to search for any data (levels 1-4) contained in tranSMART.

Search results display:
– Under the ‘Program Explorer’ panel for the Browse tab
– Under the ‘Navigate Terms’ panel for the Analyze tab

A new ‘Filter’ option can also be used for selections based on fields with a small set of possible values.

Note: Boolean operators ‘and’ / ‘or’ can be used to combine several search and filter criteria.

3.4.2. Free text search

With free-text searches, tranSMART looks for the text you type within metadata fields that do not contain controlled vocabulary; for example, titles and descriptions of studies and related objects.

tranSMART also looks for free-text keywords in:
– Text files that are attached to objects (Browse tab)
– Data nodes from the study tree (Analyze tab > Navigate Terms panel)

Note:
– With controlled-vocabulary searches, tranSMART includes in the autocomplete list only those filters that begin with the text you type.
– With free-text searches, tranSMART can find the text you type anywhere within the free-text metadata. For example, tranSMART would find Hiraoka in the title of the Pancreatic Carcinogenesis metadata above. It would also find pancreatic carcinogenesis in the description.
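The difference between the two matching rules in the note above can be illustrated with a minimal Python sketch. It is not tranSMART code: the function names, the filter values and the study title are hypothetical, and applying case-insensitive comparison to the autocomplete list as well as to keyword searches is an assumption made for the illustration.

```python
def autocomplete(filters, typed):
    """Controlled-vocabulary rule: keep only filters that BEGIN with the typed text."""
    prefix = typed.casefold()  # case-insensitive comparison (assumed for autocomplete)
    return [f for f in filters if f.casefold().startswith(prefix)]


def free_text_match(field_text, typed):
    """Free-text rule: the typed text may occur ANYWHERE in the metadata field."""
    return typed.casefold() in field_text.casefold()


# Hypothetical controlled-vocabulary filters and free-text title
filters = ["Pancreatic Neoplasms", "Pancreatitis", "Carcinoma, Pancreatic Ductal"]
title = "Hiraoka pancreatic carcinogenesis study"

prefix_hits = autocomplete(filters, "pancrea")      # matches at the start only
anywhere_hit = free_text_match(title, "carcinogenesis")  # matches mid-string
```

With these inputs, "pancrea" keeps only the two filters that start with it, while the free-text rule finds "carcinogenesis" in the middle of the title.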
Keyword searches are not case-sensitive.

4. NEW DATA TYPES SUPPORTED (RC2)

4.1. Refactoring code for high dimension data types – Generic API

Problem:
– The definition of the high dimension data types and the linked application behavior is currently scattered over the codebase (whenever the distinction between data types is applicable, it is [hopefully] made)
– If we add the new data types in that fashion, we create even more technical debt in the application
– When any change is needed afterwards, this entails a lot of work, since the change has to be applied everywhere and the chance that a spot is overlooked increases

Solution: Refactoring for data types

Approach: Create a generic high dimensional data API that can be extended for specific data types
– To make the code more adaptable and maintainable (all details pertaining to a data type in one place)
– To make it easier to add new data types without making changes all over the place in tranSMART code
– To maintain consistent behavior for a specific data type over the whole UI (e.g. in different analyses)
– To prevent bugs and problems appearing for certain data types

Steps to add a new HDD type (after refactoring):
1) Procure a comprehensive test data set
2) Design the database schema for the data type, add it to table definitions (DDL), and create database upgrade scripts
3) Define platform definitions (in de_gpl_info)
4) Create ETL pipeline for loading the data, platforms and dictionaries
5) Apply changes and load applicable dictionaries
6) Design data type specific core API classes if applicable
7) Write tests for data type backend (in core-db)
8) Implement data type backend [add data type implementing module] (in core-db)
9) Extend UI (e.g. high dimensional data popup) to implement any data type specific constraints
10) Perform end-to-end tests in all analyses

4.2. RBM data

Requirement: Ability to load and analyze RBM subject-level data, as high dimensional data

Approach:
– Dictionary: UniProt
– Platform file: Mapping between RBM analytes, UniProt ID and gene ID
– Steps for loading:
  o Input file has a constrained format with the following columns, but only Sample ID, Analyte and avalue are loaded:
    id | rid | sampid | plate | visit_code | Analyte (ana_unit) | LDD | avalue | analval | belowLDD | read_low | read_hi logtrans | outlier
  o Units are to be loaded along with the analyte name for display purposes, e.g. “Agouti-Related Protein (AGRP) (pg/mL)”
  o Ensure analytes loaded are part of the Platform tables. If not, flag a loading error.
  o Z scores to be calculated as usual.
– Requirements for analysis:
  o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by protein name or UniProt ID in the gene/pathway selection box.
  o UniProt protein names along with UniProt IDs will be loaded in the dictionary from the UniProt DB.

4.3. Mass spec proteomic data

Requirement: Ability to load and analyze mass spec proteomic subject-level data, as high dimensional data

Approach:
– Dictionary: UniProt
– Platform file: Mapping between peptide sequence and majority protein UniProt ID
– Steps for loading:
  o Identify the columns that are to be loaded from the sample file
  o UniProt IDs loaded need to be part of the dictionary already; if not, an error must be raised
– Requirements for analysis:
  o Dictionary needs to be loaded from the UniProt Human database with the following columns: Protein Id, Protein name and Gene Symbol. Gene Symbol will be available as Synonyms.
  o Search will be done based on the UniProt Protein ID, gene ID or pathway in the gene/pathway selection box of the HD pop-up in Advanced workflow

4.4. qPCR miRNA data

Requirement: Ability to load and analyze qPCR miRNA subject-level data, as high dimensional data

Approach:
– Dictionary: miRbase
– Platform file: Mapping between an identifier from the input file (can be a miRID) and miRID
– Steps for loading:
  o Load Platform file first
  o Load data file subsequently
– Requirements for analysis:
  o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by miRIDs in the gene/pathway selection box.

4.5. miRNAseq data

Requirement: Ability to load and analyze sequence-based miRNA subject-level data, as high dimensional data

Approach:
– Dictionary: miRbase
– Platform file: Mapping between an identifier from the input file (can be a miRID) and miRID
– Steps for loading:
  o Load Platform file first
  o Load data file subsequently
– Requirements for analysis:
  o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by miRIDs in the gene/pathway selection box.

4.6. RNAseq data

Requirement: Ability to load and analyze RNA sequencing subject-level data (transcript-level expression quantification)

Approach:
– Dictionary: Genes (Entrez), Pathways
– Platform file: Mapping between transcript ID and gene ID
– Steps for loading:
  o Load Platform file first
  o Load data file subsequently
– Requirements for analysis:
  o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by gene ID or pathway in the gene/pathway selection box.

4.7. Metabolomics data

Requirement: Ability to load and analyze metabolomics subject-level data, as high dimensional data

Approach:
– Primary identifier: Biochemical name
– Dictionary: HMDB
  o HMDB ID and HMDB Common Name will be loaded as a Dictionary. HMDB ID will be the main entity and Common Name will be the synonym.
– Pathway, Super pathway and HMDB ID will be loaded as part of a Platform mapping file for each data load. There can be multiple sub-pathways for each biochemical name.
– HMDB ID is not mandatory
– Search in the HDD selection pop-up can be done based on “HMDB Common name, Sub Pathway and Super Pathway” in the gene/pathway selection box.

4.8. Serial data

Requirement: Enable loading of ‘serial’ high and low dimensional data (time course, dose response, different sampling conditions, etc.)

Problem: tranSMART overloads modifier_cd in the i2b2 visit dimension, thereby violating the i2b2 data model (at least version 1.6 and up).

Approach: Restore this model and use modifiers to represent dosage, frequency, time points, etc. The link between visit and sample in the high dimensional domain can remain intact; visit then acts as a proxy for sample.

Affected components:
– Core API was changed. Core DB was updated to reflect modifiers (1-2).
– Modify ETL to cache unique modifiers and enable re-ordering of concepts.
– Update analyses to be able to handle modifiers

5. MODIFICATION OF ETL (RC2)

5.1. New ETLs developed

Requirement: Support the following new omic data types
– RBM
– Mass spec proteomic
– miRNA (qPCR)
– miRNAseq
– RNAseq
– Metabolomic

Approach: New ETLs have been developed.

5.2. Optimization of the clinical ETL

Requirement: Accelerate loading time for large clinical studies.

Approach: Replace the Kettle jobs and the data validation steps of the Oracle stored procedure with Groovy code.

Why: It makes the ETL procedures much more transparent and easier to debug and/or change. It also allows for parallel ETL imports, which are error-prone with the current setup; this is especially important considering the incremental data loading requirement (see below).

5.3. Enable incremental data loading for a study

Requirement: Ability to load data incrementally for a given study and to delete or overwrite some previously loaded data (for certain subjects only, or certain variables only) without having to reload the study entirely.
  o Add new variables to an existing study
    Low Dimensional Data only
  o Add new patients/samples to an existing study
    For Low Dimensional and High Dimensional data. For High Dimensional data, only new samples will be added; new patients will be added as part of Low Dimensional Data.
  o Overwrite values (re-load) for certain variables in an existing study
    Implemented for Low and High Dimensional Data
  o Change label names of variables previously loaded

Approach: A new ETL will be provided to perform Incremental Data Load.
– The ETL will provide options at run time to indicate what kind of incremental load needs to happen, based on the 4 points above
– The existing ETL will remain as is and can be used for loading a study for the first time or a data type for the first time
– Z scores for HD data will be calculated for all the samples during the ETL of a partial load for samples
– Note: Currently Z scores are not calculated on the fly as part of analysis

6. NEW CAPABILITIES FOR STORING AND RETRIEVING UNSTRUCTURED DATA (FILES) (RC1)

Enhancements are available in the Browse tab to import and search for unstructured data (i.e., files).
Overall process:

Export of files is possible thanks to a new Export Cart:

Note: Work in progress: move file storage from tranSMART servers to MongoDB servers.

7. IMPROVEMENT OF SAMPLES HANDLING (RC2)

Requirement:
– Ability to define several samples per patient
– Ability to associate high and low dimensional data to a single sample
– Ability to store, view and export external sample IDs

The problem with sample handling in tranSMART, from a data model perspective, is two-fold: (a) for clinical data, only visits/encounters are used, which does not give enough information depth to cover the given use cases; (b) for high dimensional data, there is no real sample dimension, only the assay dimension.

Solution: (To be confirmed with the Development Team)
– Use the modifier dimension in i2b2 (see i2b2 documentation).
– As for coupling samples, introduce a sample dimension in the high dimensional data model and use the encounter (with modifiers) as a proxy for that sample entity, thus coupling the two domains.

8. IMPROVEMENT OF CURRENT ANALYTICS (RC2)

8.1. Adaptation of legacy analytics to new data types

Requirement: Allow analysis of sequence-based miRNA and mRNA, qPCR-based miRNA, mass spectrometry metabolomic and proteomic, and RBM data.

The following analyses will be modified:
– Heat map
– Hierarchical clustering
– K-means clustering
– Marker selection
– PCA
– Box Plot
– Scatter plot
– Survival analysis
– Table with Fisher test
– Line Graph

8.2. Refactoring of analysis code

Problem:
– Much of the analysis logic is repeated and often copied verbatim for each analysis
– This makes adaptations cumbersome, for the same reasons as with the data types
– Adding support for new data types to all analyses in that fashion would only multiply this problem

Solution: Refactoring analyses
– How: Extract common patterns out of the various analyses into a generic class, but implement analysis-specific job details in extensions of that class
– Why:
  o To maintain consistent behavior across the advanced analyses (criteria handling, data generation, error messages etc.)
  o To make the code more adaptable and maintainable (all details pertaining to an analysis in one place)
  o To make it easier to add new advanced analyses without copying a lot of boilerplate code

Steps to add a new analysis script (after refactoring):
1. Create/update relevant R scripts
2. Create/update analysis job, extending from abstract class AnalysisJob
3. Create/update controller and view

8.3. Adaptation of legacy analytics to serial data

Requirement: Improve analysis of ‘serial’ high and low dimensional data using legacy analytics from the GPL version (Advanced Workflow)
– Ability to analyze a complete ‘serial’ high dimensional data matrix (e.g., gene expression at several time points using Heatmap)
– Ability to use the ‘serial’ dimension as a variable (e.g., plot gene expression of a gene as a function of time using Line Graph)

Problem:
– The only analysis designed for serial data in the tranSMART GPL version is Line Graph.
– HDD analytics (heatmap, hierarchical clustering, K-means clustering, marker selection, PCA) don’t allow selection of serial data (i.e. multiple high dim nodes in the variable selection box) in the GPL version.
– Most other analytics don’t allow serial data (multiple nodes in a variable selection box) either. The only exceptions are:
  o Correlation Analysis, in which nodes are taken as individual variables that are correlated pairwise
  o Box Plot with ANOVA, which pools values of all nodes into one box in the plot.

Solution:
– Improvement of Line Graph: see section 8.6
– Improvement of HDD analytics:
  o Enable multiple high dim nodes in a variable selection box
  o Run analyses and display results across series
  o Implement automatic numerical ordering of columns (samples) in the output for numerical series (for example time or dose)
  o Add the ability to order columns in the heat map output grouped either by subject or by node, at the user’s choice
– Improvement of Box Plot with ANOVA: see section 8.4

8.4. Improvement of Box Plot with ANOVA

Requirement:
– Make individual box plots and ANOVAs for each variable when dragging multiple nodes into the ‘Dependent Variable’ field
– Correction of a pre-existing bug (to be confirmed)
  o When a categorical variable is defined in the dependent variable, multiple nodes can be used in the independent variable. But when a categorical variable is defined in the independent variable, multiple nodes in the dependent variable generate an error.
– Forbid binning of a variable if several nodes are selected for that variable (to be confirmed)

8.5. Improvement of Scatter Plot

Requirement:
– Enable high dimensional data in both variable selection boxes (one node in each variable selection box).
– Scatter Plot will be done for a pair of variables, across any two data types
  Examples: Expression of a gene with copy number variation of the same gene, Expression of 2 genes, Age and expression of a gene

8.6. Improvement of Line Graph

Requirement:
– Enable Line Graph to use high dimensional data in both variable selection boxes
  o Same high dim pop-up as other workflows, with probe aggregation option
  o Ensure the data types used are all from the same Platform. If they are not, show an error when the user clicks on the HD pop-up
– In case several genes or probes are selected for the measurement concept, plot each gene or probe on a different plot, 1 line per group on each plot. T-tests are run for each plot and results are presented in a table below each plot.
– Better handle x axis
  o For categorical series (conditions), provide evenly spaced categories on the x axis
  o For numerical series (time course, dose response), provide a scaled x axis. Evenly spaced categories can be provided for numerical series at the user’s choice.
– Better label y axis
  o Improve the y axis label to indicate what data node is represented (e.g. Mean Body Weight)
– Add option to plot individual data in addition to group means or medians
  Example use case: Expression of a chosen gene in each subject of each category (1 line / subject, 1 color / category)
– Categorical variable
  o Make the categorical variable optional (when no categorical variable is used, the plot will show a single line for the whole cohort)
  o Enable selection of a single group
  o Enable the use of continuous variables (low and high dim) with binning
– Add statistical analyses
  o t-tests between 2 groups (selectable by the user among the ones defined by the categorical variable) at every time point/dose/condition. By default the first 2 groups defined by the categorical variable are selected to run the t-tests. Results are presented in a table below the plot
  o If there is only 1 group, no t-test is run

9. IMPROVEMENT OF GRID VIEW (RC2)

Requirement:
– Display Sample ID
  o Grid View will contain as many rows as there are samples for a subject. If there are no samples, there will be 1 row per subject, with a NULL value in the sample ID column.
    Example:
    In this example, samples 5 and 6 (from patient 1) are linked to high dim data only, and therefore have no value in columns F to I.
– Enable hiding of columns
– Enable export of Grid View to Excel (visible columns, all rows or selected rows)
– Enable categorical variables in a single column
  o Approach: All categorical values are displayed in a single column, as highlighted in green

    Subject Id | Sample ID | Age | Sex | Race
    101 | GSM123 | 54 | M | Caucasian
    102 | GSM123 | 55 | F | Hispanic

10. IMPROVEMENT OF EXPORT (RC2)

Requirement:
– Add advanced filters to allow users to limit the exported data to a subset of low dimensional nodes
– Add ability to restrict export of high dimensional data by gene, gene list, pathway, protein, etc.
– Add ability to better categorize the data types available for a study (clinical, gene expression, SNP, etc.)
– Improve performance (response time) when exporting large data volumes

Solution:
– Users will have the ability to limit the exported data to a subset of clinical fields by dragging the folders/nodes of interest from the tree into the left panel for Export
  o For HDD, users should be able to select a particular gene list, pathway, protein, etc. This will be possible thanks to a ‘HDD selection’ pop-up.
– For serial type data, users will be able to export data at particular time points (for example) by dragging and dropping the appropriate node from the tree.
– The data available for export should be categorized / sorted per data type (clinical, gene expression, SNP, etc.)
– Each type of data will be exported into a different tabulated file.
– Performance for exporting gene expression (processed) data will be improved in order to accelerate export time

11. APPENDIX – HIGH LEVEL DESCRIPTION OF RC1 AND RC2 REQUIREMENTS

Area: Data Loading / ETL Pipelines

ID and description of the requirement:
RC1-1, RC1-2:
– Set up processes and tools to facilitate curation and loading activities
– Allow scientists to gather and organize non-curated low dimensional clinical & lab data files in self service mode with CV tagging (Files Parking)
– Continue to develop the loading tool (ICE) to enable the Curators to load all data through a unified interface
– Develop more QC scripts to control data integrity and accuracy
RC2-1: Optimize the clinical ETL pipeline to accelerate loading time for large clinical studies.
RC2-2: Add ability to load data incrementally for a given study and to delete or overwrite some previously loaded data (for certain subjects only, or certain variables only) without having to reload the study entirely.
RC2-3: Allow loading of ‘serial’ high and low dimensional data (time course, dose response, different sampling conditions, etc.).
RC2-4: Improve samples handling:
  o Ability to define several samples per patient
  o Ability to associate high and low dimensional data to a single sample
  o Ability to use sample ID from a Biobank when applicable

Additional information:
– We have loaded a large clinical study: 15,000 patients, about 50 clinical parameters. Loading took about 3 days. The stored procedures need to be optimized to speed up the loading (loading duration of a similar dataset should not exceed 1 day).
– A solution developed by Millenium for handling serial high dimensional data is being tested. However, it seems to answer our needs only partially. Chronological/numerical ordering in the data tree and in analyses should be possible both for high and low dimensional data even when no date/time is available (i.e., even without using the date/time functionalities of the existing ETL).
– Today, samples in tranSMART are only available for high dimensional data. For low dimensional data, visits are used.
RC2-5: Add ability to load RBM subject-level data as high dimensional data. Manage protein/peptide identifiers and ensure unit display.
   Additional information: Protein/peptide dictionary is to be defined.

RC2-6: Add ability to load microarray miRNA subject-level data as high dimensional data.
   Additional information: This requirement may not require any change in the application if miRNAs are added to the gene dictionary. miRNA dictionary is to be defined. A specific miRNA platform annotation file (relating probes to genes) will be loaded.

RC2-7: Add ability to load qPCR mRNA and miRNA subject-level data.
   Additional information: Depending on the dataset, qPCR data should be loaded as either low or high dimensional data.

RC2-8: Add ability to load mass spec proteomic subject-level data as high dimensional data.
   Additional information: Protein dictionary is to be defined. Details will also be provided concerning platform(s), protein identifiers, values and dataset structure.

RC2-9: Enable loading of metabolomic subject-level data as high dimensional data.
   Additional information: Metabolite dictionary is to be defined. Details will also be provided concerning platform(s), metabolite identifiers, values and dataset structure.

Note for RC2-5 to RC2-9: Preprocessing and normalization procedures will be defined internally (stated for four of these five data types).

RC2-10: Improve SNP subject-level data handling:
   – Develop more automated procedures for processing and loading large sets of SNP data, and thus accelerate loading time.
   – Make sure the infrastructure is adequately sized to support large SNP data volumes.
   Additional information: Big SNP datasets are expected. Associated SNP platform(s) will be confirmed at a later date.

RC2-11: Add ability to load RNA sequencing subject-level data (gene-level expression quantification).

Area: Database, backend components

RC2-12: Optimize the management of annotation files (relating probes to genes) for omic data: the same information should be used for gene expression data, gene expression analyses and gene lists.
   Additional information: Currently, annotation files used in the Analyze and Browse tabs are stored in different tables. Need to check how annotation files are managed in the Gene List/Signature tab.

Area: Security

RC2-13: Set up user authentication through the company’s Active Directory.
   Additional information: The Kerberos protocol will be used to validate users with Active Directory.

RC2-14: Implement security rules and user permissions in the Browse tab that are consistent with the current security rules and permissions in the Analyze tab.
   Additional information: Security and permissions in Browse should be defined at study level, following the same ‘mechanism’ as in Analyze (GPL1.0 security features).

Area: Analyze – Advanced Workflow

RC2-15: Allow better analysis of ‘serial’ high and low dimensional data using existing analytics (*):
   – Ability to analyze a complete ‘serial’ data matrix (e.g., gene expression at all time points using Heatmap)
   – Ability to analyze individual columns (e.g., plot gene expression of a gene at one time point against gene expression at another time point using Scatter Plot)
   – Ability to use the ‘serial’ dimension as a variable (e.g., plot gene expression of a gene as a function of time using Line Graph)
   Additional information: (*) Advanced workflows from GPL1.0 release.

RC2-16: Improve the Line Graph analytics:
   – Enable Line Graph to use high dimensional data
   – Better handle the x axis variable (time, numerical or categorical), i.e. have a scaled x axis when a time or numerical variable is used. This should work even for time course data without date/time.
   – Add option to plot individual data in addition to group means or medians.
RC2-16 (Improve the Line Graph analytics), additional information: This requirement is related to the ‘serial’ data requirements described above. At a minimum, a new functionality needs to be developed in Line Graph to allow users to re-order the time points, doses, etc. in chronological/numerical order on the x axis rather than in alphabetical order. Today, the order of time points in Line Graph is correct when time course data has been loaded with the date/time ETL, but the axis is not scaled.

RC2-17: Improve subcategorization of high dimensional data (tissue, time points, etc.) in the high dimensional data node selection screen in Advanced Workflows.
   Additional information: This requirement is related to the ‘serial’ data requirements described above.

RC2-18: For the Boxplot analytics, make individual box plots for each variable when multiple nodes are dragged into the ‘Dependent Variable’ field, and present the output in table format.

RC2-19: Improve the Correlation Analysis analytics:
   – Combine Correlation Analysis and Scatter Plot Linear Regression into one workflow
   – Allow Correlation Analysis to run with high dimensional data
   – Allow correlation of one variable against many
   – Improve the table output when there are many variables.

RC2-20: Allow analysis of RBM data using existing analytics (*) for high dimensional data.
   Additional information: (*) Advanced workflows from GPL1.0 release querying gene expression data.

RC2-21: Allow analysis of microarray miRNA data using existing analytics (*) for high dimensional data.
   Additional information: (*) Advanced workflows from GPL1.0 release querying gene expression data.
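The re-ordering described for Line Graph (chronological/numerical order rather than alphabetical order, even without date/time) can be illustrated with a short Python sketch. It assumes that each time point or dose label contains a number (for example "Week 2"); the label format and the function names are hypothetical and are not part of tranSMART.

```python
import re
from typing import List, Tuple

_NUMBER = re.compile(r"\d+(?:\.\d+)?")    # first unsigned number in a label


def serial_sort_key(label: str) -> Tuple[int, float, str]:
    """Sort key: labels with a number come first, ordered by that number.

    Labels without any number keep alphabetical order after the numeric ones.
    """
    match = _NUMBER.search(label)
    if match is None:
        return (1, 0.0, label)
    return (0, float(match.group()), label)


def order_serial_labels(labels: List[str]) -> List[str]:
    """Return the labels in numerical rather than alphabetical order."""
    return sorted(labels, key=serial_sort_key)


labels = ["Week 10", "Week 2", "Week 1", "Week 24"]
alphabetical = sorted(labels)             # "Week 10" lands before "Week 2"
numerical = order_serial_labels(labels)   # Week 1, Week 2, Week 10, Week 24
```

The number extracted for the sort key could also serve as the x coordinate when a scaled axis is wanted, which is the second part of the Line Graph requirement.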
RC2-22: Allow analysis of qPCR miRNA and mRNA data using existing analytics (*) for high dimensional data.
   Additional information: This requirement may not require any change in the application if miRNAs are added to the gene dictionary. (*) Advanced workflows from GPL1.0 release querying gene expression data.

RC2-23: Allow analysis of mass spectrometry subject-level data using existing analytics (*) for high dimensional data.
   Additional information: (*) Advanced workflows from GPL1.0 release querying gene expression data. Protein dictionary is to be defined.

RC2-24: Allow analysis of metabolomic subject-level data using existing analytics (*) for high dimensional data.
   Additional information: (*) Advanced workflows from GPL1.0 release querying gene expression data. Metabolite dictionary is to be defined.

RC2-25: Allow analysis of RNA sequencing data using existing analytics (*) for high dimensional data.
   Additional information: (*) Advanced workflows from GPL1.0 release querying gene expression data. Gene-level expression quantification.

Area: Analyze – Grid View

RC2-26: Improve Grid View:
   – Enable categorical variables in a single column
   – Enable column deletion, row or column selection
   – Enable export of selection
   – Automatically include variables used in Advanced Workflows
   Additional information: Today, columns can be sorted, re-ordered and hidden (but not filtered). Search in Grid View is possible with Ctrl+F (this enables a subject-specific data view, for example by searching on subject ID).

RC2-27: Display sample ID related to patient ID in Grid View.
   Additional information: This requirement is linked to the requirement described above about better handling of samples.

Area: Analyze – Export

RC2-28: Improve export of subject-level data:
   – Improve performance (response time) when exporting large data volumes
   – Add advanced filters to allow users to limit the exported data to a subset of clinical fields, genes, gene lists, pathways, or combinations of the above
   – Add ability to better categorize the data available for a study (clinical, gene expression, SNP, etc. for each assay type, whether high or low dimensional data)
   – Harmonize with Grid View export capabilities

Area: Browse – Export

RC2-29: Add ability to preview a file in the browser (IE and Firefox), only for common file formats such as ppt, doc, text, excel, pdf.

RC1-3: Add ability to export files from the Browse UI (shopping cart concept).

Area: Gene Signature/List

RC2-31: Under “Gene Signature/List”:
   – When displaying a user-uploaded gene list, always add gene symbols, even if users only uploaded probeset IDs
   – Use the same annotation file as for subject-level data in Analyze and analyzed results in Browse.
   Additional information: If the relevant annotation file is not yet in tranSMART, it should not be listed in the drop-down in the Create Gene List/Signature window, and users should be prevented from loading this list of probes.

Area: Overall User Interface

RC1-4: Simplify the overall GUI: better integrate certain tabs, such as Search and Dataset Explorer.

RC2-32: Improve consistency and synchronization of data trees in ‘Browse’ (Program Explorer panel) and in ‘Analyze’ (Navigate Terms panel):
   – When clicking on the “Open in Analyze view” button from Browse, display a filter on the study ID in the Active Filter panel and restrict the data trees in both tabs to that study. Also highlight and open that study in the data trees of both tabs.
   – There should be an easier way to navigate from the Analyze tab back to the Browse tab.
   – Data trees and right panels should be kept as they were in a tab when switching to the other tab (unless a new filter is applied, in which case both data trees are refreshed).
   Additional information: Today, when selecting a study in Browse without using any filter and clicking on the “Open in Analyze view” button, the data tree in Navigate Terms contains only the selected study, highlighted and open. However, if a filter is activated and the “Open in Analyze view” button is clicked, the data tree in Navigate Terms is not restricted to the searched study, and the study is not highlighted and open. Also, to go back to the Browse tab today, one needs to use the menu at the top of the screen, which brings the user back to the welcome screen.

Area: Data Organization

RC1-5: Structure all data per project (program) > study > assay:
   – Levels 1, 2, 3 & 4 data
   – Curated clinical, low dimensional and high dimensional data
   – Non-curated data (-> link to "files parking")
   – Support documents (reports, informed consent, etc.)

RC1-6: Link files to the program/study/assay/subject they relate to ("files parking").

Areas: Browse – Tagging and Metadata; Searching and Filtering

RC1-7: Enhance tagging/annotation of program/study/assay based on metadata.

RC2-30: Add dictionaries for miRNA, proteins, metabolites.
   Additional information: miRNA could be included in the gene dictionary (to be discussed).

RC2-33: Secure file indexing.
   Additional information: Currently some Excel files (at least) are not properly indexed and, as a consequence, free text search does not work properly. Technology used: Solr.
RC2-34: After running a free text search in the Browse tab, when clicking on bold items in the Program Explorer, highlight in the right-hand side Browse panel:
   – The string found in metadata (including in file names)
   – Files containing that string

RC1-8: Improve searching capabilities at study level:
   – Allow users to search for studies based on metadata, data labels, data values and free text
   – Allow users to search for level 1 data (raw data files)
   – Handle / manage synonyms
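The synonym handling listed above could work along these lines: each search term is expanded to its known synonyms before study metadata is matched. This Python sketch is only an illustration; the synonym groups, study records and function names are invented for the example and do not reflect tranSMART's dictionaries or its search implementation.

```python
from typing import Dict, List, Set

# Hypothetical synonym groups; a real dictionary would be curated.
SYNONYMS: List[Set[str]] = [
    {"tnf", "tnf-alpha", "tnfa"},
    {"mirna", "microrna"},
]


def expand(term: str) -> Set[str]:
    """Return the term plus every synonym from the group that contains it."""
    term = term.lower()
    for group in SYNONYMS:
        if term in group:
            return set(group)
    return {term}


def search_studies(term: str, studies: Dict[str, str]) -> List[str]:
    """Return IDs of studies whose metadata mentions the term or a synonym.

    Matching is a simple case-insensitive substring test.
    """
    variants = expand(term)
    return sorted(
        study_id
        for study_id, metadata in studies.items()
        if any(variant in metadata.lower() for variant in variants)
    )


studies = {
    "STUDY_A": "Phase II trial in rheumatoid arthritis",
    "STUDY_B": "TNF-alpha expression in healthy volunteers",
    "STUDY_C": "Oncology biomarker study",
}
hits = search_studies("TNFa", studies)
```

Here a search for "TNFa" finds the study annotated with "TNF-alpha" even though the exact string never appears in its metadata. Substring matching is used only to keep the sketch short; very short synonyms would need whole-word matching to avoid false hits.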