Array Express

advertisement

ArrayExpress and Gene Expression Atlas:

Mining Functional Genomics data

Gabriella Rustici, PhD

Functional Genomics Team

EBI-EMBL gabry@ebi.ac.uk

Talk structure

 Why do we need a database for functional genomics data?

 ArrayExpress database

• Archive

• Gene Expression Atlas

 Database content

 Query the database

 Data download

 Data submission

2 ArrayExpress

Components of a functional genomics experiment

• Sample source

• Sample treatments

• RNA extraction protocol

• Labelling protocol

Sample Sample

Library

• Sample source

• Sample treatments

• Template preparation

• Library preparation

• Array design information

• Location of each element

• Description of each element

• Hybridization protocol

Array

Chip

• Cluster amplification

• Sequencing and imaging • Image

• Scanning protocol

• Software specifications

• Quantification matrix

• Software specifications

• Control array elements

• Normalization method

Raw data

Normalized data

Data analysis

Data analysis

• From images to sequences

• Quality Control

• Sequence alignment

• Assembly

• Specific steps depending on the application

ArrayExpress

www.ebi.ac.uk/arrayexpress/

 Is a public repository for functional genomics data, mostly generated using microarray or high throughput sequencing (HTS) assays

 Serves the scientific community as an archive for data supporting publications, together with GEO at NCBI and CIBEX at DDBJ

 Provides easy access to well annotated microarray data in a structured and standardized format

 Facilitates the sharing of microarray designs and experimental protocols

 Based on FGED standards: MIAME checklist, MAGE-TAB format and MO Ontology.

 MINSEQE checklist for HTS data (http://www.mged.org/minseqe/)

4 ArrayExpress

Reporting standards for microarrays

MIAME checklist

 Minimal Information About a Microarray Experiment

 The 6 most critical elements contributing towards MIAME are:

1.

Essential sample annotation including experimental factors and their values (e.g. compound and dose)

2.

Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….)

3.

Sufficient array annotation (e.g. gene identifiers, genomic coordinates, probe sequences or array catalog number)

4.

Essential laboratory and data processing protocols

(e.g. normalization method used)

5.

Raw data for each hybridization (e.g. CEL or GPR files)

6.

Final normalized data for the set of hybridizations in the experiment

5 ArrayExpress

Reporting standards for sequencing

MINSEQE checklist

 Minimal Information about a high-throughput Nucleotide

SEQuencing Experiment

 The proposed guidelines for MINSEQE are (still work in progress):

1.

General information about the experiment

2.

Essential sample annotation including experimental factors and their values (e.g. compound and dose)

3.

Experimental design including sample data relationships (e.g. which raw data file relates to which sample, ….)

4.

Essential experimental and data processing protocols

5.

Sequence read data with quality scores, raw intensities and processing parameters for the instrument

6.

Final processed data for the set of assays in the experiment

6 ArrayExpress

Reporting standards for microarrays

MAGE-TAB format

MAGE-TAB is a simple spreadsheet format that uses a number of different files to capture information about a microarray experiment:

IDF

SDRF

ADF

Data files

Investigation Description Format file, contains top-level information about the experiment including title, description, submitter contact details and protocols.

Sample and Data Relationship Format file contains the relationships between samples and arrays, as well as sample properties and experimental factors, as provided by the data submitter.

Array Design Format file, describes the design of an array, i. e. the sequence located at each feature on the array and annotation of the sequences.

Raw and processed data files. The ‘raw’ data files are the files produced by the microarray image analysis software, such as CEL files for Affymetrix or GPR files from GenePix.

The processed data file is a ‘data matrix’ file containing processed values, as provided by the data submitter.

7 ArrayExpress

Reporting standards

What semantics (or ontology) should we use to best describe its annotation?

 Ontology , which is a formal specification of terms in a particular subject area and the relations among them.

 Its purpose is to provide a basic, stable and unambiguous description of such terms and relations in order to avoid improper and inconsistent use of the terminology pertaining to a given domain.

 Thus far, Gene Ontology (GO) has been the most successful ontology initiative. GO is a controlled vocabulary used to describe the biology of a gene product in any organism.

8 ArrayExpress

Reporting standards for microarrays

MGED ontology (MO)

 The MO provides terms for annotating all aspects of a microarray experiment from the design of the experiment and array layout, through to the preparation of the biological sample and the protocols used to hybridize the RNA and analyze the data

 The MO was developed to provide terms for annotating experiments in line with the MIAME guidelines, i.e. to provide the semantics to describe a microarray experiment according to the concepts specified in MIAME

 Also check Open Biomedical Ontologies (OBO) initiative

( www.obofoundry.org

) for the development of life-science ontologies

9 ArrayExpress

10 ArrayExpress

ArrayExpress – two databases

How to query AE and Atlas?

AE Archive

• Query by experiment, sample and experimental factor annotations

• Filter on species, array platform, molecule assayed and technology used

Gene Expression Atlas

• Gene and/or condition queries

• Query across experiments and across platforms

11 ArrayExpress

12 ArrayExpress

ArrayExpress – two databases

How much data in AE Archive?

13 ArrayExpress

Archive by species

14 ArrayExpress

15 ArrayExpress

Browsing the AE Archive

AE unique experiment ID

Browsing the AE Archive

Curated title of experiment

Number of assays

Species investigated

The date when the data were loaded in the Archive loaded in

Atlas flag

The list of experiments retrieved can be printed, saved as Tabdelimited format or exported to

16

Excel or as RSS feed

ArrayExpress

The total number of experiments and assay retrieved

The direct link to raw and processed data. An icon indicates that this type of data is available.

Raw sequencing data available in

ENA

17 ArrayExpress

Browsing the AE Archive

Experimental factor ontology (EFO)

http://www.ebi.ac.uk/efo

 Application focused ontology modeling experimental factors (EFs) in

AE

 Developed to:

• increase the richness of annotations that are currently made in AE Archive

• to promote consistency

• to facilitate automatic annotation and integrate external data

EFs are transformed into an ontological representation, forming classes and relationships between those classes

 EFO terms map to multiple existing domain specific ontologies, such as the Disease Ontology and Cell Type Ontology

18 ArrayExpress

Experimental factor ontology (EFO)

An example

19 ArrayExpress & Atlas

Searching AE Archive

Simple query - EFO

20 ArrayExpress

Searching AE Archive

Simple query

 Search across all fields:

• AE accession number e.g. E-MEXP-568

• Secondary accession numbers e.g. GEO series accession

GSE5389

• Experiment name

• Submitter's experiment description

• Sample attributes, experimental factor and values, including species (e.g. GeneticModification, Mus musculus, DREB2C over-expression)

• Publication title, authors and journal name, PubMed ID

 Synonyms for terms are always included in searches e.g. 'human' and ' Homo sapiens ’

21 ArrayExpress

AE Archive query output

• Matches to exact terms are highlighted in yellow

• Matches to synonyms are highlighted in green

• Matches to child terms in the EFO are highlighted in pink

23 ArrayExpress

AE Archive – experiment view

How does processed data look?

Samples Sample annotation

Gene annotations

24 ArrayExpress

Gene expression levels or count level data

25 ArrayExpress

AE Archive – SDRF file

SDRF file – sample & data relationship

26 ArrayExpress

27 ArrayExpress

AE Archive – ADF file

28 ArrayExpress

AE Archive – Old interface

29 ArrayExpress

AE Archive – all files

30 ArrayExpress

AE Archive – all files

Searching AE Archive

Advanced query

 Combine search terms

• Enter two or more keywords in the search box with the operators AND ,

OR or NOT . AND is the default search term; a search for kidney cancer' will return hits with a match to ‘kidney' AND ‘cancer’

• Search terms of more than one word must be entered inside quotes otherwise only the first word will be searched for. E.g. “kidney cancer”

 Specify fields for searches

• Particular fields for searching can also be specified in the format of fieldname:value

31 ArrayExpress

Searching AE Archive

Advanced query - fieldnames

Field name Searches accession Experiment primary or secondary accession

Example accession:E-MEXP-568 array Array design accession or name array:AFFY-2 OR array:Agilent* ef efv expdesign

Experimental factor, the name of the main variables in an experiment.

Experimental factor value. Has EFO expansion.

ef:celltype OR ef:compound efv:fibroblast

Experiment design type expdesign:”dose response”

Experiment type. Has EFO expansion.

exptype:RNA-seq exptype gxa ef:compound AND gxa:true Presence in the Gene Expression Atlas. Only value is gxa:true.

PubMed identifier pmid:16553887 pmid sa species

Sample attribute values. Has EFO expansion.

Species of the samples. Has EFO expansion.

sa:wild_type species:”homo sapiens” AND ef:cellline

32 ArrayExpress

Searching AE Archive

Advanced query

 Filtering experiments by counts of a particular attribute

• Experiments fulfilling certain count criteria can also be searched for e.g. having more than 10 assays (hybridizations)

33

Filter assaycount:[x TO y]

What is filtered filter on the number of of assays where x <= y and both values are between 0 and 99,999

( inclusive ) . To count excluding the values given use curly brackets e.g. assaycount:{1 TO

5} will find experiments with 2-4 assays. Single numbers may also be given e.g. assaycount:10 will find experiments with 10 assays.

filter on the number of experimental factors efcount:[x TO y] samplecount:[x TO y] filter on the number of samples sacount:[x TO y] rawcount:[x TO y]

ArrayExpress filter on the number of sample attribute categories filter on the number of raw files fgemcount:[x TO y] filter on the number of final gene expression matrix (processed data) files miamescore:[x TO y] filter on the MIAME compliance score (maximum score is 5) date:yyyy-mm-dd filter by release date

•date:2009-12-01 - will search for experiments released on 1st of Dec 2009

•date:2009* - will search for experiments released in 2009

•date:[2008-01-01 2008-05-31] - will search for experiments released between 1st of Jan and end of May 2008

Searching AE Archive

Advanced query – an example

34 ArrayExpress & Atlas

35 ArrayExpress

Exercise 1

36 ArrayExpress

ArrayExpress – two databases

Gene Expression Atlas

Experiment selection criteria

 The criteria we use for selecting experiments for inclusion in the Atlas are as follows:

• Array designs relating to experiment must be provided to enable re-annotation using Ensembl or Uniprot (or have the potential for this to be done)

• High MIAME scores

• Experiment must have 6 or more hybridizations

• Sufficient replication and large sample size

• EF and EFV must be well annotated

• Adequate sample annotation must be provided

• Processed data must be provided or raw data which can be renormalized must be available

37 ArrayExpress

Gene Expression Atlas

Atlas construction

 New meta-analytical tool for searching gene expression profiles across experiments in AE

 Data is taken as normalized by the submitter

 Gene-wise linear models (limma) and t-statistics are applied to calculate the strength of genes’ differential expression across conditions across experiments

 The result is a two-dimensional matrix where rows correspond to genes and columns correspond to conditions, rather than samples.

 The matrix entries are p -values together with a sign, indicating the significance and direction of differential expression

38 ArrayExpress

Gene Expression Atlas

Atlas construction

39 ArrayExpress

Gene Expression Atlas

Atlas construction

 up-regulated

 down-regulated

 no change

41 ArrayExpress

Gene Expression Atlas

Query for gene(s)

Atlas home page

http://www.ebi.ac.uk/gxa/

Restrict search by direction of differential expression

Query for condition(s)

The ‘advanced search’ option allows building more complex queries

42 ArrayExpress

Atlas home page

The ‘Genes’ search box & auto-complete function

43 ArrayExpress

Atlas home page

The ‘Conditions’ search box & ontology browsing

44 ArrayExpress

Atlas home page

A single gene query

45 ArrayExpress

46 ArrayExpress

Atlas gene summary page

47 ArrayExpress

Atlas experiment page

Experimental factors list

Expression plot

Table containing gene information and drop down menus for searching within the experiment

Atlas experiment page – HTS data

48 ArrayExpress & Atlas

Atlas home page

A ‘Conditions’ only query

49 ArrayExpress & Atlas

50 ArrayExpress

Atlas heatmap view

51 ArrayExpress

Atlas list view

Click the ‘ expression profile ’ link to view the experiment page

52 ArrayExpress

Atlas data download

53 ArrayExpress

Atlas gene-condition query

54 ArrayExpress

Atlas query refining

55 ArrayExpress

Atlas gene-condition query

56 ArrayExpress

Atlas query refining

57 ArrayExpress

Atlas query refining

58 ArrayExpress

Exercises 2, 3 & 4

59 ArrayExpress

Data submission to AE

Data submission to AE

www.ebi.ac.uk/microarray/submissions.html

60 ArrayExpress

Submission of HTS gene expression data

• Submit via MAGE-TAB submission route

• Submit:

• MAGE-TAB spreadsheet containing details of the samples and protocols used.

• Trace data files for each sample (in SRF, FASTQ or SFF format )

• Processed data files

• For non-human species we will supply your SRF or FASTQ files to the European Nucleotide Archive (ENA).

• If you have human identifiable sequencing data you need to submit to the The European Genome-phenome Archive and not

ArrayExpress. They will supply you with a suitable template for submission and store human identifiable data securely.

61 ArrayExpress & Atlas

Types of data that can be submitted

62 ArrayExpress & Atlas

What happens after submission?

• Email confirmation

• Curation

• The curation team will review your submission and will email you with any questions.

• Possible reopening for editing

• We will send you an accession number when all the required information has been provided.

• We will load your experiment into ArrayExpress and provide you with a reviewer login for viewing the data before it is made public.

63 ArrayExpress & Atlas

Download