Working Document – Version D01 – January, 2014
Sanofi’s tranSMART version
Description of the enhancements implemented (Working Document)
For any questions on this document, please contact any of the following persons: Claire Virenque
(claire.virenque@sanofi.com), Annick Peleraux (annick.peleraux@sanofi.com), Sherry Cao
(Xiaohong.Cao@genzyme.com), Charlotte Raillere (charlotte.raillere@sanofi.com)
1. PURPOSE OF THIS DOCUMENT
This document aims at describing the main tranSMART enhancements developed for Sanofi through
two successive releases: RC1 (Release Candidate 1) and RC2.
tranSMART RC1
Status: available.
Code base posted on GitHub:
– Private repository (TM Data Hub, owned by Recombinant)
– Public repository (tranSMART > Sanofi RC1)
Main improvements:
– Search capabilities improved through new tagging functionalities
– New ‘Browse’ tab for exploring data stored in tranSMART
tranSMART RC2
Status: built on top of RC1; developments ongoing.
Code ‘in progress’ is posted on GitHub: public repository (thehyve / branch RC2_dev)
Completion date: (late) February
Main improvements:
– New omics data types supported
– Serial data supported
– Existing analytics improved
– Grid View improved
– Export improved
– Incremental data loading capability
A high-level description of the enhancements implemented in the RC1 and RC2 releases can be found in
the Appendix of this document.
2. TABLE OF CONTENTS
1. Purpose of this document
2. Table of contents
3. New search and browse capabilities (RC1)
3.1. New organization of tranSMART GUI
3.2. New browsing capabilities (‘Browse’ tab)
3.3. New tagging capabilities based on metadata (‘Browse’ tab)
3.4. ‘Integrated’ search based on dictionaries + free text search
4. New data types supported (RC2)
4.1. Refactoring code for high dimension data types – Generic API
4.2. RBM data
4.3. Mass spec proteomic data
4.4. qPCR miRNA data
4.5. miRNAseq data
4.6. RNAseq data
4.7. Metabolomics data
4.8. Serial data
5. Modification of ETL (RC2)
5.1. New ETLs developed
5.2. Optimization of the clinical ETL
5.3. Enable incremental data loading for a study
6. New capabilities for storing and retrieving unstructured data (files) (RC1)
7. Improvement of samples handling (RC2)
8. Improvement of current analytics (RC2)
8.1. Adaptation of legacy analytics to new data types
8.2. Refactoring of analysis code
8.3. Adaptation of legacy analytics to serial data
8.4. Improvement of Box Plot with ANOVA
8.5. Improvement of Scatter Plot
8.6. Improvement of Line Graph
9. Improvement of Grid View (RC2)
10. Improvement of export (RC2)
11. Appendix – High level description of RC1 and RC2 requirements
3. NEW SEARCH AND BROWSE CAPABILITIES (RC1)
3.1. New organization of tranSMART GUI
Two main tabs – synchronized with each other:
Purpose of the two tabs:
Note:
The ‘Analyze’ tab represents the former ‘Dataset Explorer’.
The ‘Browse’ tab is used in place of the former ‘Search’.
3.2. New browsing capabilities (‘Browse’ tab)
Users can search and browse data contained in tranSMART using the ‘Browse’ tab.
Data is organized in a hierarchical structure:
In Browse, a ‘Program Explorer’ panel lets users navigate within Programs, Studies, Assays, Analyses or
File Folders:
Clicking on an object (its name) displays a description of:
– Its metadata (tags) (1)
– Its direct children (Studies, Assays, Analyses, File Folders) (2)
3.3. New tagging capabilities based on metadata (‘Browse’ tab)
Each object (Program, Study, Assay, etc.) is tagged with metadata (or tags). Metadata:
– Provide information on the object
– Enable queries using search
Users can tag an object directly from the tranSMART GUI using predefined annotation templates:
– Most fields use a controlled vocabulary (CV) with pick-list or autocomplete functionalities. Examples of dictionaries used: MeSH, WhoDD, some branches of the NextBio ontology.
– The Description field captures free text
Note: Files can now be uploaded to, searched in and exported from tranSMART; see section 6 below
for details.
3.4. ‘Integrated’ search based on dictionaries + free text search
3.4.1. Overview of search functions
A new search function was created at the top of the tranSMART screen, which enables users to
search for any data (levels 1-4) contained in tranSMART.
Search results display:
– Under the ‘Program Explorer’ panel for the Browse tab
– Under the ‘Navigate Terms’ panel for the Analyze tab
A new ‘Filter’ option can also be used for selections based on fields with a small set of possible
values.
Note: Boolean operators ‘and’ / ‘or’ can be used to combine several search and filter criteria.
3.4.2. Free text search
With free-text searches, tranSMART looks for the text you type within metadata fields that do not
contain controlled vocabulary; for example, titles and descriptions of studies and related objects.
tranSMART also looks for free-text keywords in:
– Text files that are attached to objects (Browse tab)
– Data nodes from the study tree (Analyze tab > Navigate Terms panel)
Note:
– With controlled-vocabulary searches, tranSMART includes in the autocomplete list only those filters that begin with the text you type.
– With free-text searches, tranSMART can find the text you type anywhere within the free-text metadata. For example, tranSMART would find Hiraoka in the title of the Pancreatic Carcinogenesis metadata above. It would also find pancreatic carcinogenesis in the description.
Keyword searches are not case-sensitive.
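The difference between the two matching rules can be illustrated with a minimal Python sketch. It is not tranSMART code: the function names and the sample vocabulary are hypothetical, and the sketch only assumes prefix matching for controlled-vocabulary autocomplete, substring matching for free text, and case-insensitive comparison in both cases, as described in the note above.

```python
def autocomplete(prefix, vocabulary):
    """Controlled-vocabulary lookup: keep only terms that begin with the typed text."""
    needle = prefix.casefold()
    return [term for term in vocabulary if term.casefold().startswith(needle)]


def free_text_match(query, text):
    """Free-text lookup: the typed text may appear anywhere in the field."""
    return query.casefold() in text.casefold()


# Hypothetical sample data, for illustration only.
vocabulary = ["Pancreatic Neoplasms", "Pancreatitis", "Carcinoma"]
title = "Pancreatic Carcinogenesis (Hiraoka)"

cv_hits = autocomplete("pan", vocabulary)             # prefix match, case-insensitive
cv_miss = autocomplete("carcinogenesis", vocabulary)  # no term starts with this text
title_hit = free_text_match("hiraoka", title)         # substring match, case-insensitive
```

With this rule, typing "carcinogenesis" returns nothing from the controlled vocabulary but would still match the free-text title.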
4. NEW DATA TYPES SUPPORTED (RC2)
4.1. Refactoring code for high dimension data types – Generic API
Problem:
– The definition of the high dimension data types and the linked application behavior is currently scattered over the codebase (wherever the distinction between data types is applicable, it is [hopefully] made)
– Adding the new data types in that fashion would create even more technical debt in the application
– Any later change would entail a lot of work, since it would have to be applied everywhere, and the chance that a spot is overlooked increases
Solution: Refactoring for data types
Approach: Create a generic high dimensional data API that can be extended for specific data types
– To make the code more adaptable and maintainable (all details pertaining to a data type in one place)
– To make it easier to add new data types without making changes all over the place in tranSMART code
– To maintain consistent behavior for a specific data type over the whole UI (e.g. in different analyses)
– To prevent bugs and problems appearing for certain data types
Steps to add a new HDD type (after refactoring):
1) Procure a comprehensive test data set
2) Design the database schema for the data type, add it to table definitions (DDL), and create
database upgrade scripts
3) Define platform definitions (in de_gpl_info)
4) Create ETL pipeline for loading the data, platforms and dictionaries
5) Apply changes and load applicable dictionaries
6) Design data type specific core API classes if applicable
7) Write tests for data type backend (in core-db)
8) Implement data type backend [add data type implementing module] (in core-db)
9) Extend UI (e.g. high dimensional data popup) to implement any data type specific constraints
10) Perform end-to-end tests in all analyses
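The idea of a generic API extended per data type can be sketched as follows. This is an illustration in Python under stated assumptions, not the actual core API or core-db code: the class names, the registry and the `search_keys` method are hypothetical, and the search keys are taken from the per-data-type requirements listed in the following sections.

```python
from abc import ABC, abstractmethod


class HighDimDataType(ABC):
    """Hypothetical base class: everything specific to one data type lives in one subclass."""

    name: str

    @abstractmethod
    def search_keys(self):
        """Identifier kinds accepted in the gene/pathway selection box."""

    def validate(self, row):
        """Shared default check; a subclass may override it."""
        return row.get("value") is not None


REGISTRY = {}


def register(cls):
    """Class decorator: adding a data type means adding one registered subclass."""
    REGISTRY[cls.name] = cls()
    return cls


@register
class RbmData(HighDimDataType):
    name = "rbm"

    def search_keys(self):
        return ["protein_name", "uniprot_id"]


@register
class MirnaQpcrData(HighDimDataType):
    name = "mirna_qpcr"

    def search_keys(self):
        return ["mir_id"]


def search_keys_for(data_type):
    """Callers (UI, analyses) go through the registry instead of branching on the type."""
    return REGISTRY[data_type].search_keys()
```

Because callers only use the registry, supporting another data type in this sketch requires no change outside the new subclass.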
4.2. RBM data
Requirement:
Ability to load and analyze RBM subject-level data, as high dimensional data
Approach:
– Dictionary: UniProt
– Platform file: Mapping between RBM analytes, UniProt ID and gene ID
– Steps for loading:
o Input file has a fixed format with the following columns, but only Sample ID, Analyte and avalue are loaded:
id | rid | sampid | plate | visit_code | Analyte (ana_unit) | LDD | avalue | analval | belowLDD | read_low | read_hi logtrans | outlier
o Units are loaded along with the analyte name for display purposes, e.g. “Agouti-Related Protein (AGRP) (pg/mL)”
o Ensure analytes loaded are part of the Platform tables. If not, flag a loading error.
o Z scores are calculated as usual.
– Requirements for analysis:
o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by protein name or UniProt ID in the gene/pathway selection box.
o UniProt protein names along with UniProt IDs will be loaded in the dictionary from the UniProt DB.
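The loading steps above can be sketched as follows. This is a hedged Python illustration, not the actual ETL: the function names are hypothetical, the input is reduced to four of the listed columns, and the z score is assumed to be computed per analyte across samples with the population standard deviation, which the source does not specify.

```python
from statistics import mean, pstdev

KEPT_COLUMNS = ("sampid", "Analyte (ana_unit)", "avalue")


def load_rbm(lines, platform_analytes):
    """Parse pipe-delimited rows, keeping only sample ID, analyte and avalue."""
    header = [cell.strip() for cell in lines[0].split("|")]
    index = {name: header.index(name) for name in KEPT_COLUMNS}
    records = []
    for line in lines[1:]:
        cells = [cell.strip() for cell in line.split("|")]
        analyte = cells[index["Analyte (ana_unit)"]]
        if analyte not in platform_analytes:
            # Analytes missing from the platform are flagged as a loading error.
            raise ValueError(f"analyte not in platform: {analyte}")
        records.append((cells[index["sampid"]], analyte, float(cells[index["avalue"]])))
    return records


def z_scores(records):
    """Assumed definition: (value - mean) / population SD, per analyte."""
    by_analyte = {}
    for _, analyte, value in records:
        by_analyte.setdefault(analyte, []).append(value)
    stats = {a: (mean(v), pstdev(v)) for a, v in by_analyte.items()}
    scored = []
    for sample, analyte, value in records:
        centre, spread = stats[analyte]
        scored.append((sample, analyte, (value - centre) / spread if spread else 0.0))
    return scored


# Hypothetical input, reduced to a subset of the columns listed above.
lines = [
    "id | sampid | Analyte (ana_unit) | avalue",
    "1 | S1 | AGRP (pg/mL) | 10.0",
    "2 | S2 | AGRP (pg/mL) | 14.0",
]
records = load_rbm(lines, platform_analytes={"AGRP (pg/mL)"})
scored = z_scores(records)
```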
4.3. Mass spec proteomic data
Requirement:
Ability to load and analyze mass spec proteomic subject-level data, as high dimensional data
Approach:
– Dictionary: UniProt
– Platform file: Mapping between peptide sequence and majority protein UniProt ID
– Steps for loading:
o Identify the columns that are to be loaded from the sample file
o UniProt IDs loaded need to be part of the dictionary already; if not, an error must be raised
– Requirements for analysis:
o Dictionary needs to be loaded from the UniProt Human database with the following columns: Protein Id, Protein name and Gene Symbol. Gene Symbol will be available as Synonyms.
o Search will be done based on the UniProt Protein ID, gene ID or pathway in the gene/pathway selection box of the HD pop-up in Advanced workflow
4.4. qPCR miRNA data
Requirement:
Ability to load and analyze qPCR miRNA subject-level data, as high dimensional data
Approach:
– Dictionary: miRBase
– Platform file: Mapping between an identifier from the input file (can be a miRID) and miRID
– Steps for loading:
o Load Platform file first
o Load data file subsequently
– Requirements for analysis:
o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by miRIDs in the gene/pathway selection box.
4.5. miRNAseq data
Requirement:
Ability to load and analyze sequence-based miRNA subject-level data, as high dimensional data
Approach:
– Dictionary: miRBase
– Platform file: Mapping between an identifier from the input file (can be a miRID) and miRID
– Steps for loading:
o Load Platform file first
o Load data file subsequently
– Requirements for analysis:
o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by miRIDs in the gene/pathway selection box.
4.6. RNAseq data
Requirement:
Ability to load and analyze RNA sequencing subject-level data (transcript-level expression quantification)
Approach:
– Dictionary: Genes (Entrez), Pathways
– Platform file: Mapping between transcript ID and gene ID
– Steps for loading:
o Load Platform file first
o Load data file subsequently
– Requirements for analysis:
o Within each Advanced workflow analysis, in the HDD selection pop-up, one should be able to search by gene ID or pathway in the gene/pathway selection box.
4.7. Metabolomics data
Requirement:
Ability to load and analyze metabolomics subject-level data, as high dimensional data
Approach:
– Primary identifier: Biochemical name
– Dictionary: HMDB
o HMDB ID and HMDB Common Name will be loaded as a Dictionary. HMDB ID will be the main entity and Common Name will be the synonym.
– Pathway, Super pathway and HMDB ID will be loaded as part of a Platform mapping file for each data load. There can be multiple sub-pathways for each biochemical name.
– HMDB ID is not mandatory
– Search in the HDD selection pop-up can be done based on “HMDB Common name, Sub Pathway and Super Pathway” in the gene/pathway selection box.
4.8. Serial data
Requirement:
Enable loading of ‘serial’ high and low dimensional data (time course, dose response,
different sampling conditions, etc.)
Problem:
tranSMART overloads modifier_cd in the i2b2 visit dimension, thereby violating the i2b2 data
model (at least version 1.6 and up).
Approach:
Restore this model and use the modifier to represent dosage, frequency, time points, etc. The link
between visit and sample in the high dimensional domain can remain intact; visit then acts as
a proxy for sample.
Affected components:
– Core API was changed. Core DB was updated to reflect modifiers (1-2).
– Modify the ETL to cache unique modifiers and enable re-ordering of concepts.
– Update the analyses so that they can handle modifiers
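To make the role of the modifier concrete, here is a minimal Python sketch. It is not the i2b2 schema nor tranSMART code: the `Observation` fields are a simplified stand-in, and the modifier code `TIME_H` is a hypothetical example of a serial dimension (time in hours).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    patient: str
    concept: str
    value: float
    modifier_cd: str       # names the serial dimension, e.g. "TIME_H" (hypothetical code)
    modifier_value: float  # position along that dimension, e.g. 24.0 hours


def series_for(observations, patient, concept, modifier_cd):
    """Return (modifier_value, value) pairs for one patient and concept, in serial order."""
    points = [
        (o.modifier_value, o.value)
        for o in observations
        if (o.patient, o.concept, o.modifier_cd) == (patient, concept, modifier_cd)
    ]
    return sorted(points)


# Hypothetical observations, for illustration only.
observations = [
    Observation("P1", "Body Weight", 71.5, "TIME_H", 24.0),
    Observation("P1", "Body Weight", 72.0, "TIME_H", 0.0),
    Observation("P2", "Body Weight", 80.1, "TIME_H", 0.0),
]
time_course = series_for(observations, "P1", "Body Weight", "TIME_H")
```

Keeping the serial position in the modifier leaves the visit free to act as the proxy for the sample, as described in the approach.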
5. MODIFICATION OF ETL (RC2)
5.1. New ETLs developed
Requirement: Support the following new omic data types
– RBM
– Mass spec proteomic
– miRNA (qPCR)
– miRNAseq
– RNAseq
– Metabolomic
Approach: New ETLs have been developed.
5.2. Optimization of the clinical ETL
Requirement: Accelerate loading time for large clinical studies.
Approach:
Replace the Kettle jobs and the data validation steps of the Oracle stored procedure with Groovy code.
Why:
It makes the ETL procedures much more transparent and easier to debug and/or change. It also
allows for parallel ETL imports, which are error-prone with the current setup; this is especially
important considering the incremental data loading requirement (see below).
5.3. Enable incremental data loading for a study
Requirement:
Ability to load data incrementally for a given study and to delete or overwrite some previously
loaded data (for certain subjects only, or certain variables only) without having to reload the
study entirely.
o Add new variables to an existing study
▪ Low Dimensional Data only
o Add new patients/samples to an existing study
▪ For Low Dimensional and High Dimensional Data
▪ For High Dimensional Data, only new samples are added. New patients are added as part of Low Dimensional Data.
o Overwrite values (re-load) for certain variables in an existing study
▪ Implemented for Low and High Dimensional Data
o Change label names of variables previously loaded
Approach:
A new ETL will be provided to perform Incremental Data Load.
– The ETL will provide options at run time to indicate which kind of incremental load needs to happen, based on the 4 points above
– The existing ETL will remain as is and can be used for loading a study for the first time or a data type for the first time
– Z scores for HD data will be calculated for all the samples during the ETL of a partial load for Samples
Note: Currently Z scores are not calculated on the fly as part of analysis
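The four kinds of incremental load can be sketched on a toy in-memory study. This Python illustration is not the ETL itself: the function names, the `mode` option and the representation of a study as a mapping from (subject, variable) to value are all assumptions made for the example.

```python
def incremental_load(study, new_values, mode):
    """Return a copy of study with new_values applied.

    study and new_values map (subject, variable) -> value.
    mode "add" covers new variables and new subjects; "overwrite" re-loads existing values.
    """
    if mode not in ("add", "overwrite"):
        raise ValueError(f"unknown mode: {mode}")
    updated = dict(study)
    for key, value in new_values.items():
        exists = key in study
        if mode == "add" and exists:
            raise ValueError(f"{key} already loaded; use mode='overwrite'")
        if mode == "overwrite" and not exists:
            raise ValueError(f"{key} not loaded yet; use mode='add'")
        updated[key] = value
    return updated


def rename_variable(study, old_label, new_label):
    """Change the label of a previously loaded variable without touching its values."""
    return {
        (subject, new_label if variable == old_label else variable): value
        for (subject, variable), value in study.items()
    }


# Hypothetical study content, for illustration only.
study = {("P1", "Age"): 54, ("P1", "Sex"): "M"}
with_bmi = incremental_load(study, {("P1", "BMI"): 23.1}, "add")          # new variable
with_p2 = incremental_load(with_bmi, {("P2", "Age"): 55}, "add")          # new patient
corrected = incremental_load(with_p2, {("P1", "Age"): 56}, "overwrite")   # re-load a value
relabelled = rename_variable(corrected, "BMI", "Body Mass Index")         # change a label
```

Rejecting an "add" that would silently overwrite data is a design choice of this sketch; it keeps the run-time option explicit about what the load is allowed to change.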
6. NEW CAPABILITIES FOR STORING AND RETRIEVING UNSTRUCTURED DATA (FILES) (RC1)
Enhancements are available in the Browse tab to import and search for unstructured data (i.e., files).
Overall process:
Export of files is possible thanks to a new Export Cart:
Note:
Work in progress: move file storage from tranSMART servers to MongoDB servers.
7. IMPROVEMENT OF SAMPLES HANDLING (RC2)
Requirement:
– Ability to define several samples per patient
– Ability to associate high and low dimensional data to a single sample
– Ability to store, view and export external sample IDs
From a data model perspective, the problem with sample handling in tranSMART is two-fold:
(a) for clinical data, only visits/encounters are used, which does not give enough information depth
to cover the given use cases,
(b) for high dimensional data, there is no real sample dimension, only the assay dimension.
Solution: (to be confirmed with the Development Team)
– Use the modifier dimension in i2b2 (see i2b2 documentation).
– As for coupling samples, introduce a sample dimension in the high dimensional data model and use the encounter (with modifiers) as a proxy for that sample entity, thus coupling the two domains.
8. IMPROVEMENT OF CURRENT ANALYTICS (RC2)
8.1. Adaptation of legacy analytics to new data types
Requirement:
Allow analysis of sequence-based miRNA and mRNA, qPCR-based miRNA, mass spectrometry
metabolomic and proteomic, and RBM data. The following analyses will be modified:
– Heat map
– Hierarchical clustering
– K-means clustering
– Marker selection
– PCA
– Box Plot
– Scatter plot
– Survival analysis
– Table with Fisher test
– Line Graph
8.2. Refactoring of analysis code
Problem:
– Much of the analysis logic is repeated and often copied verbatim for each analysis
– This makes adaptations cumbersome for the same reasons as with the data types
– Adding support for new data types to all analyses in that fashion would only multiply this problem
Solution: Refactoring analyses
– How: Extract common patterns out of the various analyses into a generic class, but implement analysis-specific job details in extensions of that class
– Why:
o To maintain consistent behavior across the advanced analyses (criteria handling, data generation, error messages, etc.)
o To make the code more adaptable and maintainable (all details pertaining to an analysis in one place)
o To make it easier to add new advanced analyses without copying a lot of boilerplate code
Steps to add a new analysis script (after refactoring):
1. Create/update relevant R scripts
2. Create/update analysis job, extending from abstract class AnalysisJob
3. Create/update controller and view
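The pattern of a generic class with analysis-specific extensions can be sketched as follows. Only the class name AnalysisJob comes from the text above; the methods, the parameter layout and the `MeanJob` example are hypothetical, the sketch is in Python for illustration, and a plain function stands in for the R script step.

```python
from abc import ABC, abstractmethod


class AnalysisJob(ABC):
    """Common workflow shared by all analyses; subclasses supply only the specific step."""

    def run(self, params):
        self.check(params)                        # shared criteria handling and errors
        data = self.fetch_data(params)            # shared data generation
        return self.run_analysis(data, params)    # analysis-specific step

    def check(self, params):
        if not params.get("variables"):
            raise ValueError("at least one variable must be selected")

    def fetch_data(self, params):
        # Stand-in for data retrieval: pick the selected variables from in-memory data.
        return {name: params["data"][name] for name in params["variables"]}

    @abstractmethod
    def run_analysis(self, data, params):
        """Implemented by each analysis (in the real system this would call its R script)."""


class MeanJob(AnalysisJob):
    """Hypothetical analysis: mean of each selected variable."""

    def run_analysis(self, data, params):
        return {name: sum(values) / len(values) for name, values in data.items()}


result = MeanJob().run({"variables": ["age"], "data": {"age": [50, 60], "bmi": [22.0]}})
```

Every job inherits the same validation and error message, which is the consistency benefit listed under "Why".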
8.3. Adaptation of legacy analytics to serial data
Requirement: Improve analysis of ‘serial’ high and low dimensional data using legacy analytics from
the GPL version (Advanced Workflow)
– Ability to analyze a complete ‘serial’ high dimensional data matrix (e.g., gene expression at several time points using Heatmap),
– Ability to use the ‘serial’ dimension as a variable (e.g., plot expression of a gene as a function of time using Line Graph).
Problem:
– The only analysis designed for serial data in the tranSMART GPL version is Line Graph.
– HDD analytics (heatmap, hierarchical clustering, K-means clustering, marker selection, PCA) don’t allow selection of serial data (i.e. multiple high dim nodes in the variable selection box) in the GPL version.
– Most other analytics don’t allow serial data (multiple nodes in a variable selection box) either. The only exceptions are:
o Correlation Analysis, in which nodes are taken as individual variables that are correlated pairwise
o Box Plot with ANOVA, which pools values of all nodes into one box in the plot.
Solution:
– Improvement of Line Graph: see section 8.6
– Improvement of HDD analytics:
o Enable multiple high dim nodes in a variable selection box
o Run analyses and display results across series
o Implement automatic numerical ordering of columns (samples) in the output when the series is numerical (for example time or dose).
o Add ability to order columns in heat map output grouped either by subject or by node, at the user’s choice
– Improvement of Box Plot with ANOVA: see section 8.4
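The column ordering rules for HDD output can be illustrated with a short Python sketch. The function names and the `group_by` option are hypothetical; the sketch assumes that a series counts as numerical when every node label parses as a number, and that non-numerical series keep their input order.

```python
def order_series(labels):
    """Sort labels numerically when every label parses as a number, else keep input order."""
    try:
        return sorted(labels, key=float)
    except ValueError:
        return list(labels)


def order_columns(columns, group_by="node"):
    """Order (subject, node_label) columns grouped by node or by subject."""
    distinct_nodes = list(dict.fromkeys(node for _, node in columns))
    rank = {label: position for position, label in enumerate(order_series(distinct_nodes))}
    if group_by == "node":
        return sorted(columns, key=lambda col: (rank[col[1]], col[0]))
    if group_by == "subject":
        return sorted(columns, key=lambda col: (col[0], rank[col[1]]))
    raise ValueError(f"unknown grouping: {group_by}")


# Hypothetical columns: two subjects measured at 6 and 24 hours.
columns = [("s2", "24"), ("s1", "24"), ("s1", "6"), ("s2", "6")]
by_node = order_columns(columns, "node")
by_subject = order_columns(columns, "subject")
```

Sorting with `key=float` places "6" before "24", which a plain string sort would not.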
8.4. Improvement of Box Plot with ANOVA
Requirement:
– Make individual box plots and ANOVAs for each variable when dragging multiple nodes in field ‘Dependent Variable’
– Correction of a pre-existing bug (to be confirmed)
o When a categorical variable is defined in the dependent variable, multiple nodes can be used in the independent variable. But when a categorical variable is defined in the independent variable, multiple nodes in the dependent variable generate an error.
– Forbid binning of a variable if several nodes are selected for that variable (to be confirmed)
8.5. Improvement of Scatter Plot
Requirement:
– Enable high dimensional data in both variable selection boxes (one node into each variable selection box).
– Scatter Plot will be done for a pair of variables, across any two data types
Examples:
o Expression of a gene with copy number variation of the same gene
o Expression of 2 genes
o Age and expression of a gene
8.6. Improvement of Line Graph
Requirement:
– Enable Line Graph to use high dimensional data in both variable selection boxes
o Same high dim pop-up as other workflows, with probe aggregation option
o Ensure the data types used are all from the same Platform. If not, display an error when the user clicks on the HD pop-up
– In case several genes or probes are selected for the measurement concept, plot each gene or probe on a different plot, 1 line per group on each plot. T-tests are run for each plot and results presented in a table below each plot.
– Better handle x axis
o For categorical series (conditions), provide evenly spaced categories on x axis
o For numerical series (time course, dose response), provide a scaled x axis. Evenly spaced categories can be provided for numerical series based on user’s choice.
– Better label y axis
o Improve Y axis label to indicate what data node is represented (e.g. Mean Body Weight)
– Add option to plot individual data in addition to group means or medians
o Example use case: expression of a chosen gene in each subject of each category (1 line / subject, 1 color / category)
– Categorical variable
o Make the categorical variable optional (when no categorical variable is used, the plot will show a single line for the whole cohort)
o Enable selection of a single group
o Enable the use of continuous variables (low and high dim) with binning
– Add statistical analyses
o t-tests between 2 groups (selectable by the user among the ones defined by the categorical variable) at every time point/dose/condition. By default the first 2 groups defined by the categorical variable are selected to run the t-tests. Results are presented in a table below the plot
o If there is 1 group, no t-test
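The per-point t-tests can be sketched as follows. This Python illustration makes assumptions the source does not state: it uses Welch's variant of the t-test, it returns only the t statistic (no p-value), and the function names and data layout are hypothetical. Each group needs at least two values with non-zero variance at every point.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (assumed variant)."""
    return (mean(a) - mean(b)) / sqrt(variance(a) / len(a) + variance(b) / len(b))


def t_tests_by_point(data, groups=None):
    """data maps group -> {point: [values]}; returns {point: t statistic}.

    By default the first two groups are compared; with a single group no test is run.
    """
    names = list(data)
    if len(names) < 2:
        return {}
    first, second = groups or names[:2]
    shared_points = [point for point in data[first] if point in data[second]]
    return {point: welch_t(data[first][point], data[second][point]) for point in shared_points}


# Hypothetical measurements at 0 h and 24 h, for illustration only.
data = {
    "treated": {0: [1.0, 2.0, 3.0], 24: [4.0, 5.0, 6.0]},
    "placebo": {0: [1.0, 2.0, 3.0], 24: [1.0, 2.0, 3.0]},
}
t_table = t_tests_by_point(data)
```

The resulting mapping has one entry per time point, which corresponds to one row of the results table shown below the plot.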
9. IMPROVEMENT OF GRID VIEW (RC2)
Requirement:
– Display Sample ID
o Grid View will contain as many rows as there are samples for a subject. If there are no samples, there will be 1 row per subject, and a NULL value in the Sample ID column.
Example:
In this example, samples 5 and 6 (from patient 1) are linked to high dim data only and
therefore have no value in columns F to I.
– Enable hiding of columns
– Enable export of Grid View to Excel (visible columns, all rows or selected rows)
– Enable categorical variables in a single column
o Approach: All categorical values are displayed in a single column, as highlighted in green
Subject Id | Sample ID | Age | Sex | Race
101 | GSM123 | 54 | M | Caucasian
102 | GSM123 | 55 | F | Hispanic
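The row rule for the Sample ID column (one row per sample, or a single row with an empty Sample ID when a subject has no samples) can be sketched in Python. The function name and the data layout are hypothetical, and the sample IDs are invented for the example.

```python
def grid_rows(subjects, samples_by_subject):
    """Build Grid View rows: one per sample, or one per subject when it has no samples."""
    rows = []
    for subject_id, clinical in subjects.items():
        # A subject without samples still gets one row, with no Sample ID (None stands for NULL).
        sample_ids = samples_by_subject.get(subject_id) or [None]
        for sample_id in sample_ids:
            rows.append({"Subject Id": subject_id, "Sample ID": sample_id, **clinical})
    return rows


# Hypothetical subjects and samples, for illustration only.
subjects = {
    "101": {"Age": 54, "Sex": "M"},
    "102": {"Age": 55, "Sex": "F"},
}
samples_by_subject = {"101": ["GSM123", "GSM124"]}
rows = grid_rows(subjects, samples_by_subject)
```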
10. IMPROVEMENT OF EXPORT (RC2)
Requirement:
– Add advanced filters to allow users to limit the exported data to a subset of low dimensional nodes
– Add ability to restrict export of high dimensional data by gene, gene list, pathway, protein, etc.
– Add ability to better categorize the data types available for a study (clinical, gene expression, SNP, etc.)
– Improve performance (response time) when exporting large data volumes
Solution:
– Users will have the ability to limit the exported data to a subset of clinical fields by dragging the folders/nodes of interest from the tree into the left panel for Export
o For HDD, users should be able to select a particular gene list, pathway, protein, etc. This will be possible thanks to a ‘HDD selection’ pop-up.
– For serial data, users will be able to export data at particular time points (for example) by dragging and dropping the appropriate node from the tree
– The data available for export should be categorized / sorted per data type (clinical, gene expression, SNP, etc.)
– Each type of data will be exported into a different tab-delimited file.
– Performance for exporting gene expression (processed) data will be improved in order to reduce export time
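The combination of node filtering and one file per data type can be sketched as follows. This Python illustration is not the export code: the function name, the row layout and the `.tsv` file names are assumptions, and the files are built in memory rather than written to disk.

```python
import csv
import io


def export_by_type(rows, selected_nodes):
    """Keep only the selected nodes and build one tab-delimited file per data type.

    rows are dicts with 'data_type', 'node', 'subject' and 'value' keys.
    Returns {file name: file content}.
    """
    buffers, writers = {}, {}
    for row in rows:
        if row["node"] not in selected_nodes:
            continue
        data_type = row["data_type"]
        if data_type not in writers:
            buffers[data_type] = io.StringIO()
            writers[data_type] = csv.writer(
                buffers[data_type], delimiter="\t", lineterminator="\n"
            )
            writers[data_type].writerow(["subject", "node", "value"])
        writers[data_type].writerow([row["subject"], row["node"], row["value"]])
    return {f"{data_type}.tsv": buffer.getvalue() for data_type, buffer in buffers.items()}


# Hypothetical rows, for illustration only.
rows = [
    {"data_type": "clinical", "node": "Age", "subject": "101", "value": 54},
    {"data_type": "clinical", "node": "Race", "subject": "101", "value": "Caucasian"},
    {"data_type": "gene_expression", "node": "TP53", "subject": "101", "value": 7.2},
]
files = export_by_type(rows, selected_nodes={"Age", "TP53"})
```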
11. APPENDIX – HIGH LEVEL DESCRIPTION OF RC1 AND RC2 REQUIREMENTS
Area: Data Loading / ETL Pipelines

RC1-1
Requirement: Set up processes and tools to facilitate curation and loading activities:
– Allow scientists to gather and organize non-curated low dimensional clinical & lab data files in self-service mode with CV tagging (Files Parking)
– Continue to develop the loading tool (ICE) to enable the Curators to load all data through a unified interface

RC1-2
Requirement: Develop more QC scripts to control data integrity and accuracy.

RC2-1
Requirement: Optimize the clinical ETL pipeline to accelerate loading time for large clinical studies.
Additional information: A large clinical study has been loaded (15,000 patients, about 50 clinical parameters); loading took about 3 days. The stored procedures need to be optimized to speed up the loading (loading a similar dataset should not exceed 1 day).

RC2-2
Requirement: Add ability to load data incrementally for a given study and to delete or overwrite some previously loaded data (for certain subjects only, or certain variables only) without having to reload the study entirely.

RC2-3
Requirement: Allow loading of ‘serial’ high and low dimensional data (time course, dose response, different sampling conditions, etc.).
Additional information: A solution developed by Millennium for handling serial high dimensional data is being tested. However, it seems to answer our needs only partially. Chronological/numerical ordering in the data tree and in analyses should be possible both for high and low dimensional data even when no date/time is available (i.e., even without using the date/time functionalities of the existing ETL).

RC2-4
Requirement: Improve samples handling:
– Ability to define several samples per patient
– Ability to associate high and low dimensional data to a single sample
– Ability to use sample ID from a Biobank when applicable
Additional information: Today, samples in tranSMART are only available for high dimensional data. For low dimensional data, visits are used.
Area
ID
RC2-5
RC2-6
Database, backend
components
Security
Analyze –
Description of the requirement
Additional information
Add ability to load RBM subject-level data as high dimensional data.
Manage protein/peptide identifiers and ensure unit display.
Add ability to load microarray miRNA subject-level data, as high
dimensional data.
Preprocessing and normalization procedures will be defined internally.
Protein/peptide dictionary is to be defined.
This requirement may not require any change in the application if
miRNAs are added in gene dictionary. miRNA dictionary is to be
defined. Specific miRNA platform annotation file (to relate probes to
genes) will be loaded.
Preprocessing and normalization procedures will be defined internally.
Depending on datasets, qPCR data should be loaded as either low or
high dimensional data.
Preprocessing and normalization procedures will be defined internally.
Protein dictionary is to be defined. Also, details will be provided
concerning platform(s), protein identifiers, values, dataset structure.
Preprocessing and normalization procedures will be defined internally.
Metabolite dictionary is to be defined. Also, details to be provided
concerning platform(s), metabolite identifiers, values, dataset
structure.
Big SNP datasets expected. Associated SNP platform(s) will be
confirmed at a later date.
RC2-7
Add ability to load qPCR mRNA and miRNA subject-level data
RC2-8
Add ability to load mass spec proteomic subject-level data, as high
dimensional data.
RC2-9
Enable metabolomic subject-level data loading, as high dimensional
data.
RC2-10
Improve SNP subject-level data handling:
- Develop more automated procedures for processing and loading
large sets of SNP data, and thus reduce loading time.
- Make sure the infrastructure is adequately sized to support large
SNP data volumes.
Add ability to load RNA sequencing subject-level data (gene-level
expression quantification)
Optimize the management of annotation files (relating probes to
genes) for omic data: same information should be used for gene
expression data and gene expression analyses, and gene lists.
Set up user authentication through the company’s Active Directory
RC2-11
RC2-12
RC2-13
RC2-14
RC2-15
Implement security rules and user permissions in the Browse tab that
are consistent with the current security rules and permissions in the
Analyze tab.
Currently, annotation files used in the Analyze and Browse tabs are stored
in different tables. Need to check how annotation files are managed in
the Gene List/Signature tab.
The Kerberos protocol will be used to validate users with Active
Directory.
Security and permissions in Browse should be defined at study level,
following the same ‘mechanism’ as in Analyze (GPL1.0 security
features).
(*) Advanced workflows from GPL1.0 release
Advanced Workflow
Allow better analysis of ‘serial’ high and low dimensional data using
existing analytics (*):
- Ability to analyze a complete ‘serial’ data matrix (e.g., gene
expression at all time points using Heatmap),
- Ability to analyze individual columns (e.g., plot gene expression of
a gene at one time point against gene expression at another time
point using Scatter Plot),
- Ability to use the ‘serial’ dimension as a variable (e.g., plot gene
expression of a gene as a function of time using Line Graph).
Improve the Line Graph analytics:
- Enable Line Graph to use high dimensional data
- Better handle the x axis variable (time, numerical or categorical), i.e.
have a scaled x axis when a time or numerical variable is used.
Should work even for time course data without date/time.
- Add option to plot individual data in addition to group means or
medians.
RC2-16
RC2-17
RC2-18
RC2-19
RC2-20
RC2-21
Improve subcategorization of high dimensional data (tissue,
timepoints, etc.) in the high dimensional data node selection screen in
Advanced Workflows
For the Boxplot analytics, make individual box plots for each variable
when dragging multiple nodes in field ‘Dependent Variable’, and
present output in table format
Improve the Correlation Analysis analytics:
- Combine Correlation Analysis and Scatter Plot Linear Regression
into one workflow
- Allow Correlation Analysis to run with high dimensional data
- Allow correlation of one variable against many
- Improve the table output when there are many variables.
Allow analysis of RBM data using existing analytics (*) for high
dimensional data
Allow analysis of microarray miRNA data using existing analytics (*) for
high dimensional data
This requirement is related to ‘serial’ data requirements described
above.
At a minimum a new functionality needs to be developed in Line
Graph to allow users to re-order the time points/doses/etc. in
chronological/numerical order on the x axis rather than in alphabetical
order.
Today, the order of time points in Line Graph is correct when time course
data has been loaded with the date/time ETL, but the x axis is not scaled.
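As an illustration of the difference between alphabetical and numerical ordering of serial labels, here is a minimal Python sketch. It is not taken from the tranSMART code base: the function name `serial_sort_key` and the example labels are hypothetical, and it assumes each label contains a number that encodes its position in the series (time point, dose, etc.).

```python
import re

def serial_sort_key(label):
    """Sort key ordering labels by the first number they contain.

    Hypothetical helper for illustration only. Labels without any
    number are placed after the numbered ones, in alphabetical order.
    """
    match = re.search(r"\d+(?:\.\d+)?", label)
    if match is None:
        return (1, 0.0, label)
    return (0, float(match.group()), label)

# Example labels (invented for this sketch)
labels = ["Day 14", "Day 2", "Day 1", "Day 7"]

alphabetical = sorted(labels)                     # "Day 14" comes before "Day 2"
numerical = sorted(labels, key=serial_sort_key)   # "Day 2" comes before "Day 14"

print(alphabetical)
print(numerical)
```

Labels that carry no number (for example a baseline visit) would need an explicit, user-defined order, which this sketch does not cover. The numeric value extracted by the key could also serve as the position on a scaled x axis.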
This requirement is related to ‘serial’ data requirements described
above.
(*) Advanced workflows from GPL1.0 release querying gene
expression data
(*) Advanced workflows from GPL1.0 release querying gene
expression data.
Analyze – Grid
View
RC2-22
RC2-23
RC2-24
RC2-25
RC2-26
Analyze –
Export
RC2-27
RC2-28
Browse –
Export
Gene
Signature/List
RC2-29
RC1-3
RC2-31
Allow analysis of qPCR miRNA and mRNA data using existing analytics
(*) for high dimensional data
Allow analysis of mass spectrometry subject-level data using existing
analytics (*) for high dimensional data
Allow analysis of metabolomic subject-level data using existing
analytics (*) for high dimensional data
Allow analysis of RNA sequencing data using existing analytics (*) for
high dimensional data
Improve Grid View:
- Enable categorical variables in a single column
- Enable column deletion, row or column selection
- Enable export of selection
- Automatically include variables used in Advanced Workflows
Display sample ID related to patient ID in Grid View.
This requirement may not require any change in the application if
miRNAs are added in gene dictionary.
(*) Advanced workflows from GPL1.0 release querying gene
expression data.
(*) Advanced workflows from GPL1.0 release querying gene
expression data. Protein dictionary is to be defined.
(*) Advanced workflows from GPL1.0 release querying gene
expression data. Metabolite dictionary is to be defined.
(*) Advanced workflows from GPL1.0 release querying gene
expression data. Gene-level expression quantification.
Today, columns can be sorted, re-ordered and hidden (but not
filtered). Search in Grid View is possible with Ctrl+F (this enables a subject-specific data view, for example by searching on subject ID).
This requirement is linked to the requirement described above about
better handling of samples.
Improve export of subject-level data:
- Improve performance (response time) when exporting large data
volumes
- Add advanced filters to allow users to limit the exported data to a
subset of clinical fields, genes, gene lists, pathways, or
combinations of the above
- Add ability to better categorize the data available for a study
(clinical, gene expression, SNP, etc. for each assay type whether
high or low dimensional data)
- Harmonize with Grid View export capabilities
Add ability to preview a file in browser (IE and Firefox)
Only for common file formats such as ppt, doc, text, excel, pdf
Add ability to export files from the Browse UI (shopping cart concept)
Under “Gene Signature/List”:
- When displaying a user-uploaded gene list, always add gene
symbols even if users only uploaded probeset IDs
Use same annotation file as for subject-level data in Analyze and
analyzed results in Browse.
If the relevant annotation file is not yet in tranSMART, it should not be
listed in the drop-down in the Create Gene List/Signature window, and
users should be prevented from loading this list of probes.
Simplify the overall GUI: better integrate certain tabs, such as Search
and Dataset Explorer
RC2-32
Improve consistency and synchronization of data trees in ‘Browse’
(Program Explorer panel) and in ‘Analyze’ (Navigate Terms panel):
- When clicking on the “Open in Analyze view” button from Browse,
display a filter on the study ID in the Active Filter panel and
restrict the data trees in both tabs to that study. Also highlight and
open that study in the data trees of both tabs.
- There should be an easier way to navigate from the Analyze tab
back to the Browse tab.
- Data trees and right panels should be kept as they were in a tab
when switching to the other tab (unless a new filter is applied, in
which case both data trees are refreshed).
Today, when selecting a study in Browse without using any filter and
clicking on the “Open in Analyze view” button, the data tree in
Navigate Terms contains only the selected study, highlighted and open.
However, if a filter is activated and the “Open in Analyze view” button
is clicked, the data tree in Navigate Terms is not restricted to the
searched study, and the study is not highlighted and open.
Also, to go back to the Browse tab today, one needs to use the menu
at the top of the screen, which brings the user back to the welcome
screen.
RC1-5
Structure all data per project (program) > study > assay
- Levels 1, 2, 3 & 4 data
- Curated clinical, low dimensional and high dimensional data
- Non-curated data (-> link to "files parking")
- Support documents (reports, informed consent, etc.)
Link files to the program/study/assay/subject they relate to ("files
parking")
Enhance tagging/annotation of program/study/assay based on
metadata
Overall User
Interface
Data
Organization
RC1-4
RC1-6
Browse –
Tagging and
Metadata
Searching and
Filtering
RC1-7
RC2-30
RC2-33
Add dictionaries for miRNA, proteins, metabolites
miRNA could be included in gene dictionary – to be discussed.
Secure file indexing
Currently some Excel files (at least) are not properly indexed and as a
consequence, free text search does not work properly. Technology
used: SOLR.
RC2-34
RC1-8
After running a free text search in the Browse tab, when clicking on bold
items in the Program Explorer, highlight in the right-hand side Browse
panel:
- String found in metadata (including in file names)
- Files containing that string
Improve searching capabilities at study level:
- Allow users to search for studies based on metadata, data labels,
data values and free text
- Allow users to search for level 1 data (raw data files)
- Handle / manage synonyms