ArrayExpress and Gene Expression Atlas: Mining Functional

advertisement
ArrayExpress and Gene Expression Atlas:
Mining Functional Genomics data
Gabriella Rustici, PhD
Functional Genomics Team
EBI-EMBL
gabry@ebi.ac.uk
Talk structure
 Why do we need a database for functional genomics
data?
 ArrayExpress database
• Archive
• Gene Expression Atlas
 ArrayExpress content
 How to query the database
 How to download data
 How to submit data
2
ArrayExpress
What is functional genomics (FG)?
• The aim of FG is to understand the function of genes and
other parts of the genome
• FG experiments typically utilize genome-wide assays to
measure and track many genes (or proteins) in parallel
under different conditions
• High-throughput technologies such as microarrays and
high-throughput sequencing (HTS) are frequently used in
this field to interrogate the transcriptome
3
ArrayExpress
What biological questions is FG
addressing?
• When and where are genes expressed?
• How do gene expression levels differ in various cell types
and states?
• What are the functional roles of different genes and in
what cellular processes do they participate?
• How are genes regulated?
• How do genes and gene products interact?
• How is gene expression changed in various diseases or
following a treatment?
4
ArrayExpress
Components of a FG experiment
5
ArrayExpress
ArrayExpress
www.ebi.ac.uk/arrayexpress/
 Is a public repository for FG data, which provides easy access to
well annotated data in a structured and standardized format
 Serves the scientific community as an archive for data supporting
publications, together with GEO at NCBI and CIBEX at DDBJ
 Facilitates the sharing of experimental information associated with
the data such as microarray designs, experimental protocols,……
 Based on community standards: MIAME guidelines & MAGE-TAB
format for microarray, MINSEQE guidelines for HTS data
(http://www.mged.org/minseqe/)
6
ArrayExpress
Reporting standards for microarrays
MIAME checklist
 Minimal Information About a Microarray Experiment
 The 6 most critical elements contributing towards MIAME are:
1. Essential sample annotation including experimental factors and their
values (e.g. compound and dose)
2. Experimental design including sample data relationships (e.g. which
raw data file relates to which sample, ….)
3. Sufficient array annotation (e.g. gene identifiers, genomic
coordinates, probe sequences or array catalog number)
4. Essential laboratory and data processing protocols
(e.g. normalization method used)
5. Raw data for each hybridization (e.g. CEL or GPR files)
6. Final normalized data for the set of hybridizations in the experiment
7
ArrayExpress
Reporting standards for sequencing
MINSEQE checklist
 Minimal Information about a high-throughput Nucleotide
SEQuencing Experiment
 The proposed guidelines for MINSEQE are (still work in progress):
1. General information about the experiment
2. Essential sample annotation including experimental factors and their
values (e.g. compound and dose)
3. Experimental design including sample data relationships (e.g. which
raw data file relates to which sample, ….)
4. Essential experimental and data processing protocols
5. Sequence read data with quality scores, raw intensities and
processing parameters for the instrument
6. Final processed data for the set of assays in the experiment
8
ArrayExpress
Reporting standards for microarrays
MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of
different files to capture information about a microarray experiment. We
now adapted it to handle HTS data:
9
IDF
Investigation Description Format file, contains top-level information about the
experiment including title, description, submitter contact details and protocols.
SDRF
Sample and Data Relationship Format file contains the relationships between
samples and arrays, as well as sample properties and experimental factors, as
provided by the data submitter.
ADF (for array
data only)
Array Design Format file, describes the design of an array, i. e. the sequence
located at each feature on the array and its annotation.
Data files
Raw and processed data files. The ‘raw’ data files are the files produced by the
microarray image analysis software (e.g. CEL files for Affymetrix or GPR files from
GenePix) or the trace data files (fastq, .srf or .sff) for HTS data.
The processed data file is a ‘data matrix’ file containing processed values (e.g. files
in which the expression values are linked to gene IDs or genome coordinates)
ArrayExpress
ArrayExpress – two databases
10
ArrayExpress
What is the difference between them?
ArrayExpress Archive
• Central object: experiment
• Query to retrieve experimental information and
associated data
Expression Atlas
• Central object: gene/condition
• Query for gene expression changes across
experiments and across platforms
11
ArrayExpress
ArrayExpress – two databases
12
ArrayExpress
ArrayExpress Archive – when to use it?
• Find FG experiments that might be relevant to your
research
• Download data and re-analyze it.
Often data deposited in public repositories can be used to answer
different biological questions from the one asked in the original
experiments.
• Submit microarray or HTS data that you want to publish.
Major journals will require data to be submitted to a public
repository like ArrayExpress as part of the peer-review process.
13
ArrayExpress
How much data in AE Archive?
14
ArrayExpress
HTS data in AE Archive
HTS vs other technologies
RNA-seq vs DNA-seq
3%
7%
RNA seq
HTS experiments
DNA seq
38%
Other
55%
97%
RNA-seq: coding vs non coding
32%
coding
non coding
68%
RNA seq, DNA-seq
Browsing the AE Archive
16
ArrayExpress
Browsing the AE Archive
AE unique
experiment ID
Curated title of
experiment
Number of
assays
The date when the
data were loaded in
the Archive
Species
investigated
loaded in
Atlas flag
Raw sequencing
data available in
ENA
The list of experiments retrieved
can be printed, saved as Tabdelimited format or exported to
Excel or as RSS feed
17
ArrayExpress
The total number of experiments
and assay retrieved
The direct link to raw and
processed data. An icon
indicates that this type of data
is available.
Browsing the AE Archive
18
ArrayExpress
Experimental factor ontology (EFO)
http://www.ebi.ac.uk/efo
 Application focused ontology modeling experimental factors (EFs) in
AE – selected by default
 Developed to:
• increase the richness of annotations that are currently made
in AE Archive
• to promote consistency
• to facilitate automatic annotation and integrate external data
 EFs are transformed into an ontological representation,
forming classes and relationships between those classes
 EFO terms map to multiple existing domain specific ontologies,
such as the Disease Ontology and Cell Type Ontology
19
ArrayExpress
Building EFO
An example
Take all experimental
factors
sarcoma
Find the logical connection between them
Organize them in an ontology
disease
disease is the parent term
[-]
cancer
neoplasm
is a type of
disease
neoplasm
[-]
neoplasm
cancer
is synonym of
cancer
neoplasm
[-]
disease
sarcoma
is a type of
sarcoma
cancer
[-]
Kaposi’s sarcoma
20
ArrayExpress
Kaposi’s sarcoma
is a type of
sarcoma
Kaposi’s sarcoma
Exploring EFO
An example
21
ArrayExpress
Searching AE Archive
Simple query
22
ArrayExpress
Searching AE Archive
Simple query
 Search across all fields:
• AE accession number e.g. E-MEXP-568
• Secondary accession numbers e.g. GEO series accession
GSE5389
• Experiment name
• Submitter's experiment description
• Sample attributes, experimental factor and values, including
species (e.g. GeneticModification, Mus musculus, DREB2C
over-expression)
• Publication title, authors and journal name, PubMed ID
 Synonyms for terms are always included in searches e.g. 'human'
and 'Homo sapiens’
23
ArrayExpress
AE Archive query output
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
AE Archive – experiment view
MIAME or MINSEQE scores show how much
the experiment is standard compliant
Link to files available.
This varies between sequencing
and microarray data. For
microarray experiments you also
have array design file and you
can view a graph of the
experimental design
Experimental factor(s) and its
values
25
ArrayExpress
AE Archive – SDRF file
26
ArrayExpress
SDRF file – sample & data relationship
27
ArrayExpress
Searching AE Archive
Advanced query
 Combine search terms
• Enter two or more keywords in the search box with the operators AND,
OR or NOT. AND is the default search term; a search for kidney cancer'
will return hits with a match to ‘kidney' AND ‘cancer’
• Search terms of more than one word must be entered inside quotes
otherwise only the first word will be searched for. E.g. “kidney cancer”
 Specify fields for searches
• Particular fields for searching can also be specified in the format
of fieldname:value
For more info see http://www.ebi.ac.uk/fg/doc/help/ae_help.html
28
ArrayExpress
Searching AE Archive
Advanced query – examples
29
ArrayExpress
ArrayExpress – two databases
30
ArrayExpress
Expression Atlas – when to use it?
• Find out if the expression of a gene (or a group of genes
with a common gene attribute, e.g. GO term) change(s)
across all the experiments available in the Expression
Atlas;
• Discover which genes are differentially expressed in a
particular biological condition that you are interested in.
31
ArrayExpress
Expression Atlas construction
Experiment selection criteria
 The criteria we use for selecting experiments for inclusion in the Atlas
are as follows:
• Array designs relating to experiment must be provided to
enable re-annotation using Ensembl or Uniprot (or have the
potential for this to be done)
• High MIAME scores
• Experiment must have 6 or more hybridizations
• Sufficient replication and large sample size
• EF and EFV must be well annotated
• Adequate sample annotation must be provided
• Processed data must be provided or raw data which can be
renormalized must be available
32
ArrayExpress
Expression Atlas construction
Analysis pipeline
 New meta-analytical tool for searching gene expression profiles
across experiments in AE
 Data is taken as normalized by the submitter
 Gene-wise linear models (limma) and t-statistics are applied to
calculate the strength of genes’ differential expression across
conditions across experiments
 The result is a two-dimensional matrix where rows correspond to
genes and columns correspond to conditions, rather than samples.
 The matrix entries are p-values together with a sign, indicating the
significance and direction of differential expression
33
ArrayExpress
Expression Atlas construction
34
ArrayExpress
Expression Atlas construction
35
ArrayExpress
Expression Atlas
36
ArrayExpress
Atlas home page
http://www.ebi.ac.uk/gxa/
Query for
genes
Restrict query by
direction of differential
expression
Query for conditions
The ‘advanced
query’ option
allows building
more complex
queries
37
ArrayExpress
Atlas home page
The ‘Genes’ and ‘Conditions’ search boxes
38
ArrayExpress
Atlas home page
A single gene query
39
ArrayExpress
Atlas gene summary page
40
ArrayExpress
Atlas experiment page
41
ArrayExpress
Atlas experiment page – HTS data
42
ArrayExpress
Atlas home page
A ‘Conditions’ only query
43
ArrayExpress
Atlas heatmap view
44
ArrayExpress
Atlas gene-condition query
45
ArrayExpress
Atlas query refining
46
ArrayExpress
Atlas gene-condition query
47
ArrayExpress
Atlas query refining
48
ArrayExpress
Atlas query refining
49
ArrayExpress
Data submission to AE
50
ArrayExpress
Data submission to AE
www.ebi.ac.uk/microarray/submissions.html
51
ArrayExpress
Submission of HTS gene expression data
• Submit via MAGE-TAB submission route
• Submit:
• MAGE-TAB spreadsheet containing details of the samples and
protocols used.
• Trace data files for each sample (in SRF, FASTQ or SFF format )
• Processed data files
• For non-human species we will supply your SRF or FASTQ files to
the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data you need to submit
to the The European Genome-phenome Archive and not
ArrayExpress. They will supply you with a suitable template for
submission and store human identifiable data securely.
52
ArrayExpress
Types of data that can be submitted
53
ArrayExpress
What happens after submission?
• Email confirmation
• Curation
• The curation team will review your submission and will
email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the
required information has been provided.
• We will load your experiment into ArrayExpress and
provide you with a reviewer login for viewing the data
before it is made public.
54
ArrayExpress
Find out more
• Visit our eLearning portal, Train online, at
http://www.ebi.ac.uk/training/online/ for courses on
ArrayExpress and Atlas
• Email us at: miamexpress@ebi.ac.uk
• Atlas mailing list: arrayexpress-atlas@ebi.ac.uk
55
ArrayExpress
Download