Curation practices at the EMAGE gene expression database Jeff Christiansen PhD

Curation practices at the
EMAGE gene expression
database
Jeff Christiansen PhD
EMAGE Senior Curator
MRC Human Genetics Unit
Edinburgh
Overview
What is EMAGE?
Data types
Data sources
Annotation methods and searching methods
Curation Aspects
Genes, sequences, text descriptions and links
General biocuration efforts
Access to images (data access)
Towards standardised experimental reporting
Maintaining and developing the framework housing EMAGE data
Data Preservation
Gene Expression
Every cell in the body contains
copies of all genes (~25,000) in
the nucleus.
However, different cell types
‘express’ different sets of these
genes.
Gene expression =
DNA → mRNA → protein
Detection of a specific mRNA or
protein is performed to profile the
expression of a gene.
Gene Expression profiling
Gene expression profiling can be done:
en masse
(in a dissociated tissue)
individually
(in an intact tissue (in situ))
EMAGE: Data Types
EMAGE holds in situ expression data (mRNA and protein) in mouse embryos
Whole embryo photos
Photos of sections
3D images
EMAGE: Data Sources
Literature
155 main journals contain data concerning mouse embryos
36 Mouse Genome Informatics (USA) curators manually read these journals
and create a basic index of papers containing relevant data
(since 1993: 14,375 papers containing 59,906 images showing in situ
expression results for 10,052 genes)
4 full-time MGI/GXD staff fully curate a proportion of these papers (~25%)
EMAGE: Data Sources
Large scale screening projects
GUDMAP - mouse urogenital system: ~34,000 images
- curated data (5 year project)
GenSat - mouse embryo E12.5: ~5,000 images
- non-curated
FaceBase - mouse craniofacial: ~4,000 2D + ~4,000 3D images
- curated data (2 year pilot project)
EURExpressII - mouse embryo E14.5: ~500,000 images
- curated data (4 year project)
EMAGE: Data Sources
Data submissions from individual labs
~ 2,500 images from numerous labs, all stages of development
EMAGE: Data Annotation
Source
- journal/screen/direct submission, submitter contact details
Detection reagent
- Defining the reagent that was used to detect expression
Experimental Conditions
- Defining the full experimental conditions used
Links
- Addition of specific relevant links to data in other databases
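The four annotation headings above can be thought of as fields of a record that a curator fills in. The following is a minimal sketch of a completeness check over such a record; the class name, field names and example values are illustrative assumptions and do not reflect the actual EMAGE schema.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class AnnotationRecord:
    """Hypothetical record mirroring the annotation headings on the slide."""
    source: str = ""                   # journal / screen / direct submission
    submitter_contact: str = ""        # submitter contact details
    detection_reagent: str = ""        # reagent used to detect expression
    experimental_conditions: str = ""  # full experimental conditions
    links: List[str] = field(default_factory=list)  # links to other databases


def missing_fields(record: AnnotationRecord) -> List[str]:
    """Return the names of fields that are still empty."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Example: a partially curated entry (values are placeholders).
record = AnnotationRecord(source="direct submission",
                          detection_reagent="MGI:2447759")
print(missing_fields(record))
```

A check like this only reports what is absent; deciding whether a supplied value is adequate remains a curator's judgement.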
EMAGE: Data Annotation
Sites of gene expression - annotation to EMAP embryo Atlas
Text annotation (to anatomy ontology)
Detected in:
central nervous system: ganglion: cranial: acoustic ganglion VIII
central nervous system: ganglion: cranial: facial ganglion VII
central nervous system: ganglion: cranial: glossopharyngeal IX
central nervous system: ganglion: cranial: trigeminal V
central nervous system: ganglion: cranial: vagus X
peripheral nervous system: spinal: ganglion: dorsal root ganglion
Text annotation is based on the author/submitter description
The challenge is to accurately reflect the meaning of the author's description
within the constraints of the ontology
This process often highlights shortcomings of the ontology
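The colon-separated terms listed under "Detected in" read as paths through the anatomy ontology. As a rough sketch, assuming the paths are handled as plain strings exactly as displayed on the slide (the ontology itself uses identifiers, and the helper names here are hypothetical), they can be split into components and filtered by any enclosing structure:

```python
from typing import List, Tuple

# The six text annotations shown on the slide.
ANNOTATIONS = [
    "central nervous system: ganglion: cranial: acoustic ganglion VIII",
    "central nervous system: ganglion: cranial: facial ganglion VII",
    "central nervous system: ganglion: cranial: glossopharyngeal IX",
    "central nervous system: ganglion: cranial: trigeminal V",
    "central nervous system: ganglion: cranial: vagus X",
    "peripheral nervous system: spinal: ganglion: dorsal root ganglion",
]


def parse_path(path: str) -> Tuple[str, ...]:
    """Split a displayed ontology path into its components."""
    return tuple(part.strip() for part in path.split(":"))


def within(paths: List[str], structure: str) -> List[str]:
    """Return the paths that pass through the named structure."""
    return [p for p in paths if structure in parse_path(p)]


print(len(within(ANNOTATIONS, "cranial")))               # 5
print(within(ANNOTATIONS, "peripheral nervous system"))  # the dorsal root ganglion path
```

String matching of this kind is only as good as the displayed names; it cannot express a description that falls between two ontology terms.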
EMAGE: Data Annotation
Sites of gene expression - annotation to EMAP embryo Atlas
Spatial annotation (to virtual embryo model)
Expression strength key: strong, moderate, not detected
EMAGE: Data Interrogation
Text based:
WHAT gene expression is detected in the 1st branchial arch from TS14-TS14?
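A text-based question of this kind amounts to filtering annotations by anatomical structure and Theiler stage range. The sketch below illustrates the idea on a handful of invented records; the record layout, the function name and the placeholder gene names (geneA to geneD) are assumptions for illustration only and are not EMAGE data or the EMAGE query interface.

```python
from typing import List, NamedTuple


class Expression(NamedTuple):
    gene: str        # placeholder gene name
    structure: str   # anatomy term as text
    stage: int       # Theiler stage number, e.g. 14 for TS14
    detected: bool   # True if expression was detected


# Invented example records, not real results.
RECORDS = [
    Expression("geneA", "1st branchial arch", 14, True),
    Expression("geneB", "1st branchial arch", 14, False),
    Expression("geneC", "1st branchial arch", 16, True),
    Expression("geneD", "telencephalon", 14, True),
]


def genes_detected(records: List[Expression], structure: str,
                   first_stage: int, last_stage: int) -> List[str]:
    """Genes detected in `structure` between two Theiler stages (inclusive)."""
    return sorted({r.gene for r in records
                   if r.detected
                   and r.structure == structure
                   and first_stage <= r.stage <= last_stage})


print(genes_detected(RECORDS, "1st branchial arch", 14, 14))  # ['geneA']
```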
EMAGE: Data Interrogation
Spatial based:
EMAGE: Data Interrogation
Spatial-based data mining:
EMAGE: Data Curation
Biocuration staff from 106 databases…
1st International Biocurator Meeting
December 2006
Monterey, USA.
2nd International Biocurator Meeting
October 2007
San Jose, USA.
EMAGE: Data Curation
International Society for Biological Curation (ISBC)
- discussion began at the 2nd International Biocuration Meeting
- currently being formed
- for professional Biocurators and those who develop Biological
Curation Tools and Databases
- to provide a forum for interactions between Biocurators
- to present a unified voice for the Biological Curation effort
- to facilitate better communication and interactions between
Biocurators and both Researchers and Journals
EMAGE: Data Curation
Author Description vs EMAGE Database Description
Gene Assayed:
- Author: Gsh2
- EMAGE: MGI:94843 (current symbol Gsx2, aka Gsh2, Gsh-2)
Stage:
- Author: E9.5
- EMAGE: TS16
Probe:
- Author: "as used by Hsieh-Li et al"
- EMAGE: MGI:2447759 (nt 1038-1498 of S79041.1)
- Hsieh-Li: "a 460-bp BamHI-SmaI fragment of the Gsh-2 cDNA, which
  does not contain homeobox sequences".
- Probe sequence:
    1 ctgcctcggc taacgaagac aaggagattt cccccttgta aaggcagagg ctccttctgc
   61 tacctgctcc cggtaccctg ccctcctcct ccccatcagc acagggacca aagttctagt
  121 ccattgccct cattccacct ggaaaagaaa ctctgaaaag tccggggaat tcaatgccgg
  181 gcccgacttt gaagctagct cctctttatc tgggattcca ctcagttacg gattggtttt
  241 ttgtttgctt ttttgttgtt tttaatgtaa atatctagaa ttctaaccag tctcatatat
  301 caggaaaaac cagggttgat taaagtttaa cactgtatgg ggggaggggt tggttgaaga
  361 cgcatgccat ttgcccccct gtcttttcag aacttgatga gaagaggggt ttctttattg
  421 tttttgttgt tgttttaaaa tgaaatcatt gaagttgcca t
Detected in:
- Author: Ventro-lateral forebrain
- EMAGE: Telencephalon (EMAP:1705)
  Pattern: Restricted
  Note: Expression restricted to ventro-lateral telencephalon
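The probe in this example is recorded as a coordinate range on a sequence accession (nt 1038-1498 of S79041.1). As a small worked sketch, assuming the coordinates are 1-based and inclusive (the usual convention for such ranges, but an assumption here), the length of the region and the corresponding slice of a sequence string can be computed as follows; the function names and the toy sequence are hypothetical.

```python
def region_length(start: int, end: int) -> int:
    """Length of a 1-based, inclusive nucleotide range."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return end - start + 1


def extract_region(sequence: str, start: int, end: int) -> str:
    """Return the 1-based, inclusive range start..end of a sequence string."""
    if start < 1 or end < start or end > len(sequence):
        raise ValueError("range lies outside the sequence")
    return sequence[start - 1:end]


# Coordinates quoted in the EMAGE description of the probe.
print(region_length(1038, 1498))            # 461

# Toy sequence, only to show the slicing convention.
print(extract_region("acgtacgtac", 3, 6))   # gtac
```

Counted inclusively, the recorded range spans 461 nucleotides, alongside the "460-bp" fragment size quoted from Hsieh-Li.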
EMAGE: Data Curation
Quality Assurance: always performed by a second curator
EMAGE: Data Curation
For the same Gsh2 example, parts of the description are objective and parts are 'subjective':
Objective
- gene assayed, stage, probe
'Subjective' (screen vs non-screen)
- detected in, pattern
Curator Confidence of:
- Pattern clarity
- Morphology match between data and model
EMAGE: Data Curation
This standardised, stable and accessible
description places the data in a wider
biological context for data analysis:
EMAGE: Data Curation
e.g. GENE IDENTIFIER: MGI:94843
EMAGE: Data Curation
e.g. PROBE SEQUENCE:
Nkx2-1 0.531kb probe
Why the difference?
Nkx2-1 0.702kb probe
EMAGE: Data Curation
Automation
Sequence-based bioinformatics
Text mining
Automated image analysis
- Raw data
- Signal extraction
- Registration to atlas model
Access to raw data
OPEN ACCESS - True open access allows authors to retain ownership of the copyright for
their article, but authors allow anyone to download, reuse, reprint, modify, distribute, and/or copy
articles, so long as the original authors and source are cited. No permission is required from the
authors or the publishers.
Of 155 journals containing mouse gene
expression data, only BMC, PLoS and the
Biochemical Journal are truly
open access, with a CC attribution (BMC,
PLoS) or CC attribution non-commercial
(Biochem J) licence, and xml data access.
Access to raw data
Open Access
Fig 7B. Copyright: This image is from [doi:10.1186/1471-213X-6-56]
Andrieu D; Meziane H; Marly F; Angelats C; Fernandez PA; Muscatelli F,
BMC Dev Biol 2006:8, an open-access article, licensee BioMed Central Ltd.
[PMID:17116257]
All others still require separate agreements
between the publisher and an individual for
any use apart from viewing individual papers
online and printing copies during the “open
access” period.
This severely inhibits computational access
to the data (e.g. text mining,
re-use of images)
Access to raw data
[Chart: number of suitable images per journal, open access vs non open access]
Copyright agreement negotiated with publishers (16 journals)
Access to raw data
Copyright agreement negotiated with publisher
Figure 5E. Reprinted with permission from
Elsevier from
[doi:10.1016/j.ydbio.2006.02.002] Dev Biol
2006 May 15;293(2):370-81, Andersson O;
Reissmann E; Jornvall H; Ibanez CF,
"Synergistic interaction between Gdf1 and
Nodal during anterior axis development."
Copyright 2006 [PMID:7556909].
Access to raw data
What about the rest?
[Chart: number of suitable images per journal]
Access to raw data
Many offer permissions requests via Copyright Licensing Agency
(Europe), Copyright Clearance Center (USA) or an equivalent
Individual permissions:
~ £10 per image
Annual Global permissions:
- intended for Organisation level i.e. MRC
~ ££££££ even for non-profit organisations
Access to raw data
MRC - no policy regarding electronic reproduction
Stéphane Goldstein - Research Information Network (London)
- images should be treated as data
- issues guidance to encourage good practice for
researchers, funding agencies, publishers, government
http://www.rin.ac.uk/data-principles
Access to raw data
MRC - no policy regarding electronic reproduction
Leads from Stéphane to take it further:
David Shotton - Reader in Image Bioinformatics, Oxford
Scientific Technical & Medical Publishers Association
Access to raw data
No agreement for reproduction
Figure 2C. [doi:10.1007/s00441-005-0036-9]
Amrein L; Barraud P; Daniel JY; Perel Y;
Landry M, "Expression patterns of nm23
genes during mouse organogenesis." Cell
Tissue Res 2005 Dec;322(3):365-78.
[PMID:16082520].
Proposed Data Standards
In situ gene expression is generally reported at a level of detail too low
to allow the experiment to be repeated.
Inspired by MIAME - Minimum Information About a Microarray Experiment
MIACA: Minimum Information About a Cellular Assay, and the Cellular Assay Object Model
MIGS: Minimum Information about a Genome Sequence
MIAPE: Minimum Information About a Proteomics Experiment (Mass Spec, Gel Electrophoresis)
MIAPA : Minimum Information about a Phylogenetic Analysis
MIARE: Minimum Information about an RNAi experiment
MI-FACE: Minimum Information about a Fluorescence Activated Cell Experiment
PSI-MOD: a community standard for representation of Protein Modification Data
MISFISHIE: Minimum Information Specification For In Situ Hybridization and Immunohistochemistry Experiments
MIMIx: Minimum Information required for reporting a Molecular Interaction Experiment
The aim is to have these adopted by the community and made a requirement for published data
Creators of different standards must communicate with each other for standardisation!
Proposed Data Standards
MISFISHIE - Minimum Information Specification For In Situ
Hybridisation and Immunohistochemistry
Experiments
EMAGE: other curation tasks
Scan images from paper copies
and make them available
via EMAGE
pdf
Maintaining the EMAGE data framework
17 3D models
~26,000 terms over 26 stages
Maintaining the EMAGE data framework
EMAP ontology management
~26,000 terms over 26 stages
- used by EMAGE, GXD, GUDMAP and others
- available through OBO-Foundry
(hosted at Berkeley)
Maintaining the EMAGE data framework
EMAP ontology management
The EMAP ontology can be edited using OBO-edit
Maintaining the EMAGE data framework
EMAP ontology management
Distributed editing via OBO-edit
Challenge is to develop the ontology whilst maintaining the integrity of
annotations already made using older versions.
The curation processes required to safeguard this are
currently being devised…
Checkout/check-in process with changes moderated via a group of experts
EMAGE: data preservation
EMAP ontology management
EMAGE data back-up / archiving
Data = xml + images (.jpeg .gif .mov .mpeg .wlz)
Working versions + live database + ‘backup’ copies
Daily back-ups to network and tape (on-site and off-site)
Short term protection
No archive system currently in place - required!
MRC policy - scientific data should be "kept" for 10 years,
clinical data for 20 years.
Al Brown - MRC were to set up an archiving/preservation
centre - not to be.
MRC currently aim to use a Data Support Service
(one bidder is EU/EDINA/DCC) - outcome unknown as yet.
Acknowledgements
Shanmugasundaram Venkataraman
Lorna Richardson
Malcolm Fisher
Jackie Finger, Terry Hayamizu, Connie Smith, Ingeborg McCright
Martin Ringwald
Peter Stevenson
Nick Burton
Yiya Yang
Jiangao Rao
Attilla Gyenesei
Duncan Davidson
Richard Baldock