Gabriella Rustici, PhD
Functional Genomics Team
EBI-EMBL gabry@ebi.ac.uk
Why do we need a database for functional genomics data?
ArrayExpress database
• Archive
• Gene Expression Atlas
Database content
Query the database
Data download
Data submission
2 ArrayExpress
Components of a functional genomics experiment
• Sample source
• Sample treatments
• RNA extraction protocol
• Labelling protocol
Sample Sample
Library
• Sample source
• Sample treatments
• Template preparation
• Library preparation
• Array design information
• Location of each element
• Description of each element
• Hybridization protocol
Array
Chip
• Cluster amplification
• Sequencing and imaging • Image
• Scanning protocol
• Software specifications
• Quantification matrix
• Software specifications
• Control array elements
• Normalization method
Raw data
Normalized data
Data analysis
Data analysis
• From images to sequences
• Quality Control
• Sequence alignment
• Assembly
• Specific steps depending on the application
www.ebi.ac.uk/arrayexpress/
Is a public repository for functional genomics data, mostly generated using microarray or high throughput sequencing (HTS) assays
Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ
Provides easy access to well annotated microarray data in a structured and standardized format
Facilitates the sharing of microarray designs and experimental protocols
Based on FGED standards: MIAME checklist, MAGE-TAB format and MO Ontology.
MINSEQE checklist for HTS data (http://www.mged.org/minseqe/)
4 ArrayExpress
Reporting standards for microarrays
MIAME checklist
Minimal Information About a Microarray Experiment
The 6 most critical elements contributing towards MIAME are:
1.
Essential sample annotation including experimental factors and their values (e.g. compound and dose)
2.
Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….)
3.
Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number)
4.
Essential laboratory and data processing protocols
(e.g. normalization method used)
5.
Raw data for each hybridization (e.g. CEL or GPR files)
6.
Final normalized data for the set of hybridizations in the experiment
5 ArrayExpress
Reporting standards for sequencing
MINSEQE checklist
Minimal Information about a high-throughput Nucleotide
SEQuencing Experiment
The proposed guidelines for MINSEQE are (still work in progress):
1.
General information about the experiment
2.
Essential sample annotation including experimental factors and their values (e.g. compound and dose)
3.
Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….)
4.
Essential experimental and data processing protocols
5.
Sequence read data with quality scores, raw intensities and processing parameters for the instrument
6.
Final processed data for the set of assays in the experiment
6 ArrayExpress
Reporting standards for microarrays
MAGE-TAB format
MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment:
IDF
SDRF
ADF
Data files
Investigation Description Format file, contains top-level information about the experiment including title, description, submitter contact details and protocols.
Sample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.
Array Design Format file, describes the design of an array, i. e. the sequence located at each feature on the array and annotation of the sequences.
Raw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix.
The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter.
7 ArrayExpress
Reporting standards
What semantics (or ontology) should we use to best describe its annotation?
Ontology , which is a formal specification of terms in a particular subject area and the relations among them.
Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain.
Thus far, Gene Ontology (GO) has been the most successful ontology initiative. GO is a controlled vocabulary used to describe the biology of a gene product in any organism.
8 ArrayExpress
Reporting standards for microarrays
MGED ontology (MO)
The MO provides terms for annotating all aspects of a microarray experiment from the design of the experiment and array layout, through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data
The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME
Also check Open Biomedical Ontologies (OBO) initiative
( www.obofoundry.org
) for the development of life-science ontologies
9 ArrayExpress
10 ArrayExpress
AE Archive
• Query by experiment, sample and experimental factor annotations
• Filter on species, array platform, molecule assayed and technology used
Gene Expression Atlas
• Gene and/or condition queries
• Query across experiments and across platforms
11 ArrayExpress
12 ArrayExpress
13 ArrayExpress
14 ArrayExpress
15 ArrayExpress
AE unique experiment ID
Curated title of experiment
Number of assays
Species investigated
The date when the data were loaded in the Archive loaded in
Atlas flag
The list of experiments retrieved can be printed, saved as Tabdelimited format or exported to
16
Excel or as RSS feed
ArrayExpress
The total number of experiments and assay retrieved
The direct link to raw and processed data. An icon indicates that this type of data is available.
Raw sequencing data available in
ENA
17 ArrayExpress
http://www.ebi.ac.uk/efo
Application focused ontology modeling experimental factors (EFs) in
AE
Developed to:
• increase the richness of annotations that are currently made in AE Archive
• to promote consistency
• to facilitate automatic annotation and integrate external data
EFs are transformed into an ontological representation, forming classes and relationships between those classes
EFO terms map to multiple existing domain specific ontologies, such as the Disease Ontology and Cell Type Ontology
18 ArrayExpress
An example
19 ArrayExpress & Atlas
Simple query - EFO
20 ArrayExpress
Simple query
Search across all fields:
• AE accession number e.g. E-MEXP-568
• Secondary accession numbers e.g. GEO series accession
GSE5389
• Experiment name
• Submitter's experiment description
• Sample attributes, experimental factor and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)
• Publication title, authors and journal name, PubMed ID
Synonyms for terms are always included in searches e.g. 'human' and ' Homo sapiens ’
21 ArrayExpress
• Matches to exact terms are highlighted in yellow
• Matches to synonyms are highlighted in green
• Matches to child terms in the EFO are highlighted in pink
23 ArrayExpress
AE Archive – experiment view
Samples Sample annotation
Gene annotations
24 ArrayExpress
Gene expression levels or count level data
25 ArrayExpress
26 ArrayExpress
27 ArrayExpress
28 ArrayExpress
29 ArrayExpress
30 ArrayExpress
Advanced query
Combine search terms
• Enter two or more keywords in the search box with the operators AND ,
OR or NOT . AND is the default search term; a search for kidney cancer' will return hits with a match to ‘kidney' AND ‘cancer’
• Search terms of more than one word must be entered inside quotes otherwise only the first word will be searched for. E.g. “kidney cancer”
Specify fields for searches
• Particular fields for searching can also be specified in the format of fieldname:value
31 ArrayExpress
Advanced query - fieldnames
Field name Searches accession Experiment primary or secondary accession
Example accession:E-MEXP-568 array Array design accession or name array:AFFY-2 OR array:Agilent* ef efv expdesign
Experimental factor, the name of the main variables in an experiment.
Experimental factor value. Has EFO expansion.
ef:celltype OR ef:compound efv:fibroblast
Experiment design type expdesign:”dose response”
Experiment type. Has EFO expansion.
exptype:RNA-seq exptype gxa ef:compound AND gxa:true Presence in the Gene Expression Atlas. Only value is gxa:true.
PubMed identifier pmid:16553887 pmid sa species
Sample attribute values. Has EFO expansion.
Species of the samples. Has EFO expansion.
sa:wild_type species:”homo sapiens” AND ef:cellline
32 ArrayExpress
Advanced query
Filtering experiments by counts of a particular attribute
• Experiments fulfilling certain count criteria can also be searched for e.g. having more than 10 assays (hybridizations)
33
Filter assaycount:[x TO y]
What is filtered filter on the number of of assays where x <= y and both values are between 0 and 99,999
( inclusive ) . To count excluding the values given use curly brackets e.g. assaycount:{1 TO
5} will find experiments with 2-4 assays. Single numbers may also be given e.g. assaycount:10 will find experiments with 10 assays.
filter on the number of experimental factors efcount:[x TO y] samplecount:[x TO y] filter on the number of samples sacount:[x TO y] rawcount:[x TO y]
ArrayExpress filter on the number of sample attribute categories filter on the number of raw files fgemcount:[x TO y] filter on the number of final gene expression matrix (processed data) files miamescore:[x TO y] filter on the MIAME compliance score (maximum score is 5) date:yyyy-mm-dd filter by release date
•date:2009-12-01 - will search for experiments released on 1st of Dec 2009
•date:2009* - will search for experiments released in 2009
•date:[2008-01-01 2008-05-31] - will search for experiments released between 1st of Jan and end of May 2008
Advanced query – an example
34 ArrayExpress & Atlas
35 ArrayExpress
Exercise 1
36 ArrayExpress
Experiment selection criteria
The criteria we use for selecting experiments for inclusion in the Atlas are as follows:
• Array designs relating to experiment must be provided to enable re-annotation using Ensembl or Uniprot (or have the potential for this to be done)
• High MIAME scores
• Experiment must have 6 or more hybridizations
• Sufficient replication and large sample size
• EF and EFV must be well annotated
• Adequate sample annotation must be provided
• Processed data must be provided or raw data which can be renormalized must be available
37 ArrayExpress
Atlas construction
New meta-analytical tool for searching gene expression profiles across experiments in AE
Data is taken as normalized by the submitter
Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments
The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples.
The matrix entries are p -values together with a sign, indicating the significance and direction of differential expression
38 ArrayExpress
Atlas construction
39 ArrayExpress
Atlas construction
up-regulated
down-regulated
no change
41 ArrayExpress
Query for gene(s)
http://www.ebi.ac.uk/gxa/
Restrict search by direction of differential expression
Query for condition(s)
The ‘advanced search’ option allows building more complex queries
42 ArrayExpress
The ‘Genes’ search box & auto-complete function
43 ArrayExpress
The ‘Conditions’ search box & ontology browsing
44 ArrayExpress
A single gene query
45 ArrayExpress
46 ArrayExpress
47 ArrayExpress
Experimental factors list
Expression plot
Table containing gene information and drop down menus for searching within the experiment
48 ArrayExpress & Atlas
A ‘Conditions’ only query
49 ArrayExpress & Atlas
50 ArrayExpress
51 ArrayExpress
Click the ‘ expression profile ’ link to view the experiment page
52 ArrayExpress
53 ArrayExpress
54 ArrayExpress
55 ArrayExpress
56 ArrayExpress
57 ArrayExpress
58 ArrayExpress
Exercises 2, 3 & 4
59 ArrayExpress
www.ebi.ac.uk/microarray/submissions.html
60 ArrayExpress
• Submit via MAGE-TAB submission route
• Submit:
• MAGE-TAB spreadsheet containing details of the samples and protocols used.
• Trace data files for each sample (in SRF, FASTQ or SFF format )
• Processed data files
• For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).
• If you have human identifiable sequencing data you need to submit to the The European Genome-phenome Archive and not
ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.
61 ArrayExpress & Atlas
62 ArrayExpress & Atlas
• Email confirmation
• Curation
• The curation team will review your submission and will email you with any questions.
• Possible reopening for editing
• We will send you an accession number when all the required information has been provided.
• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.
63 ArrayExpress & Atlas