Databases for Microarrays

advertisement
Databases for Microarrays
Vidhya Jagannathan
SIB, Lausanne
Overview








Microarray data in a nutshell
Why databases?
What data to represent?
What is a database?
Different data models
E-R modelling
Microarray Databases
Standards being developed
Vidhya Jagannathan, SIB, Lausanne
Microarray Experiment
Vidhya Jagannathan, SIB, Lausanne
Microarray Data in a Nutshell


Lots of data to be managed before and after the
experiment.
Data to be stored before the experiment .



Data to be stored after the experiment



Description of the array and the sample.
Direct access to all the cDNA and gene sequences,
annotations, and physical DNA resources.
Raw Data - scanned images.
Gene Expression Matrix - Relative expression levels observed
on various sites on the array.
Hence we can see that database software capable of
dealing with larger volumes of numeric and image
data is required.
Vidhya Jagannathan, SIB, Lausanne
Why Databases?



Tailored to datatype
Tailored to the Scientists
Intuitive ways to query the data


Diagrams, forms, point and click, text etc.
Support for efficient answering of queries.

Query optimisation, indexes, compact physical storage.
Vidhya Jagannathan, SIB, Lausanne
Data Representation

Goal: Represent data in an intuitive and
convenient manner

Without unnecessary replication of information

Making it easy to write queries to find required
information

Supporting efficient retrieval of required information
Vidhya Jagannathan, SIB, Lausanne
What is a Database?

A database is an organised collection of pieces of
structured electronic information.





Example 1: Libraires use a database system to keep track of
library inventory and loans.
Example 2: All airlines use database system to manage their
flights and reservations.
The collection of records kept for a common purpose such as
these is known as a database.
The records of the database normally reside on a
hard disk and the records are retrieved into computer
memory only when they are accessed.
So the reasons are obvious why we need to discuss
about a Microarray database.
Vidhya Jagannathan, SIB, Lausanne
Data Models




Describes a container for data and
methods to store and retrieve data from
that container.
Abstract math algorithms and concepts.
Cannot touch a data model.
Very useful
Vidhya Jagannathan, SIB, Lausanne
Types of Data Models





Ad-hoc file formats (not really data models!)
Relational data model
Object-relational data model
Object-oriented data model
XML (Extensible Markup Language)
Vidhya Jagannathan, SIB, Lausanne
Ad-hoc File Formats

The various 'ad-hoc' file formats in use for
microarray data are:






Flat file formats.
Spread sheet formats.
Not the least - Even MS-Word documents !!!
Very rudimentary method to store data .
Sometimes contains redundant information.
Extremely inefficient for retrieval of particular
subsets of the results.
Vidhya Jagannathan, SIB, Lausanne
Relational Data Model






Most prevalent and used in many databases
developed today.
The collection of related information is represented as
a set of tables.
Data value is stored in the intersection of row and
column
Column values are of the same kind. A Simple data
validation.
Rows are unique. So no data redundancy and every
row is meaningful and can be identified by the unique
key.
Utilises Structured Query Language (SQL) for data
storage, retrival and manipulation.
Vidhya Jagannathan, SIB, Lausanne
Example
Table
GENE
GENE_ID
CONTIG_ID
CONTIG_START
CONTIG_END
CONTIG_STRAND
GB2VN
NT_0106058.3
2354807
2360778
Complement
GB2VN32
NT_0106051.3
2308745
2321072
Complement
Row or Record
Field or Column
Vidhya Jagannathan, SIB, Lausanne
Example
GENE
GENE_ID
CONTIG_ID
CONTIG_BEGIN
CONTIG_END
CONTIG_STRAND
GB2VN
NT_0106058.3
2354807
2360778
Complement
GB2VN32
NT_0106051.3
2308745
2321072
Complement
CLASSIFICATION
CLASS_ID
GENE_ID
DESCRIPTION
TYPE
MSX2
GB2VN
MSH(Drosophila)
Gene
GO:0003677
GB2VN32
DNA binding
Molecular function
Vidhya Jagannathan, SIB, Lausanne
Advantages of Relational
Model




Allows information to be broken up into
logical units and stored in tables.
Allows combining data from different tables in
different ways to derive useful information.
Great for queries involving information from
multiple original sources.
Can easily gather related information.

e.g. information about a particular gene from multiple
datasets/experiments
Vidhya Jagannathan, SIB, Lausanne
Object Oriented Model





Object Oriented Model allows real world data
to be represented as objects.
Objects encapsulate the data and provide
methods to access or manipulate it.
Objects with specific structure and set of
methods are said to belong to the object
class.
Allows new classes to be created by
extending the description of the parent class.
Child classes inherit the data and methods of
the parent class.
Vidhya Jagannathan, SIB, Lausanne
Example
OODBMS
Biomolecule
. String seq
Inherits
DNA
DNA
DNA
get_bio_seq()
Inherits
Protein
Protein
Protein
Vidhya Jagannathan, SIB, Lausanne
Example - ArrayExpress
ArrayExpress
Sample
ExpressionValue
Hybridization
Array
Experiment
External links
Ontology
e.g., organism
taxonomy
Reference
e.g., publication, web
resource
Vidhya Jagannathan, SIB, Lausanne
Database
e.g., gene in
SWISS-prot
Database Design
Entity-Relationship Concept
Entity A
Relationship
Examples
Vidhya Jagannathan, SIB, Lausanne
Entity B
Entities

are real world objects


contain attributes


ex: gene
ex: gene_id, sequence
are drawn as rectangle boxes that holds the
name of the entity and attribute in two
different notations as there is no standard!
Gene
sequence
gene_id
Gene
gene_id
sequence
notation 2
notation 1
Vidhya Jagannathan, SIB, Lausanne
Relationship

Relationships provide connections between two or
more entities





ex: Which genes were used in which experiment
When two entities are involved in a relationship, it is
known as binary relationship.
When three entities are invoved in a relationship, it is
called as ternary relationship.
When more than three entities are involved in a
relationship, it is usually broken in to one or more
binary or ternary relationships.
are drawn as a line linking the involved entities as:
Gene
used_in
Vidhya Jagannathan, SIB, Lausanne
Experiment
Example E-R Diagram
Gene
gene-id
sequence
Expression-value
value
Expt-Exptr
Expt-Sample
Experiment
Experiment-Id
Date
Image
Notation
*
Many-to-one
1
Expt-Array
Experimenter
Experimenter-Id
Name
E-mail
Dept.
Institution
Sample
Sample-Id
Organism
Cell-type
{Drug-Ids}
Array
Array-Id
Manufacturer
Type
Batch
Vidhya Jagannathan, SIB, Lausanne
¥Multivalued
attribute
Object relational data
model



Improved relational model by adding some
features from object data models.
Information is represented as in relational
models but column values not restricted to one
mutliple values are allowed.
Example (sample table in previous slides):
sampe-id
organism
celltype
s001
ecoli
c123
s002
ecoli
c123
Vidhya Jagannathan, SIB, Lausanne
drug_id
d1
d2
ac1
nm
ac3
nm
Queries, queries, queries!!

Given a collection of microarray
generated gene expression data, what
kind of questions the users wish to
pose.

Constructing an extensive list of
possible interesting queries and data
mining problems that has to be
supported by the database will facilitate
the design process.
Vidhya Jagannathan, SIB, Lausanne
Queries, queries, queries!!

Query to the data





Which genes are linked ?
Which genes are expressed similarly to my gene
XYZ?
Which genes have a changed the expression in a
second condition ?
Which genes are co-expressed in differing
conditions ?
classification (of tumors, diseased tissues etc.):
which patterns are characteristic for a certain
class of samples, which genes are involved?
Vidhya Jagannathan, SIB, Lausanne
More Queries !!!

Queries that add a link in additional
knowledge




functional classification of genes: Are changes
clustered in particular classes?
metabolic pathway information: Is a certain
pathway/route in a pathway affected?
disease information & clinical follow up: correlation
to expression patterns.
phenotype information for mutants: Are there
correlations between particular phenotypes and
expression patterns?
Vidhya Jagannathan, SIB, Lausanne
More Queries !!!



in what region is the interesting gene located
in the genome?
is there synteny in this region with other
species?
is there a known trait that maps to this
region?
Vidhya Jagannathan, SIB, Lausanne
Query Language

Language in which user requests information
from the database.


SQL
• Data definition helps you implement your model
and data manipulation helps you modify and
retrive data
Advantages:
• Can specify query declaratively and let database
system figure out best way of finding answers
• Supports queries of medium complexity
• Specialized languages
• SQL language statements are not abstract but
very close to spoken language.
Vidhya Jagannathan, SIB, Lausanne
Basic SQL Queries


Find the image for experiment number 1345
 select image from experiment
where experiment-id = 1345;
Find the experiment-id and image of all
experiments involving e-coli
select experiment-id, image from experiment,
sample where experiment.sample-id =
sample.sample-id and sample.organism =
`e.coli‘;
All combinations of rows from the relations in the
from clause are considered, and those that satisfy
the where conditions are output


Vidhya Jagannathan, SIB, Lausanne
Interfacing
 SQL queries are carried out on terminal screen which
is not very useful and user friendly for an end user,
so applications are created to interface more friendly
staments with the SQL statements
 A web form is a typical example of interface for SQL
 Applications for data loading.
 More complex queries (e.g. data mining such as

classification and clustering) are very imporatant part
of the Microarray Analyis Protocol
It is very important to interface the various
applications we use to analyse the retrieved data
with database.
Vidhya Jagannathan, SIB, Lausanne
Vidhya Jagannathan, SIB, Lausanne
Gene Expression Databases
Require Integration

There are many different types of data presenting numerous
relationships.

There are a number of Databases with lots of information.

Experiments need to be compared because the experiments
are very difficult to perform and very expensive.

Solution: Make all the databases talk the same language.

XML was the choice of data interchange format.
Vidhya Jagannathan, SIB, Lausanne
Why XML?

Why XML ?: XML provides the method for defining
the meaning or semantics of data.
 Example : A XML file of the earlier table we defined
<gene_features>
<gene_id>GBVN32</gene_id>
<contig_id>NT_010651</contig_id>
<contig_start>2354807</contig_start>
<contig_end>2360778</contig_end>
<contig_strand>Complement</contig_strand>
</gene_features>
Vidhya Jagannathan, SIB, Lausanne
Mapping XML to Relational
Database

The Data Structure in XML is defined in Document Type
Descrciptor as follows
<!ELEMENT
<!ELEMENT
<!ELEMENT
<!ELEMENT
<!ELEMENT


gene_id (#PCDATA)>
contig_id (#PCDATA)>
contig_start (#PCDATA)>
contig_end (#PCDATA)>
contig_sequence (#PCDATA)>
This kind of DTD also helps us to have control over the
vocabulary used.
SQL:
create table gene (
gene_id varchar(5) primary key,
contig_id varchar(10) not null,
contig_start integer not null,
contig_end integer not null,
contig_sequence text not null);

So the DTD can be directly mapped into a relational database.
Vidhya Jagannathan, SIB, Lausanne
MAGE-ML As Data
Interchage Format
Expression Data
Converter (program)
MAGE-ML
Databases
Vidhya Jagannathan, SIB, Lausanne
Existing Microarray Databases



Several gene expression databases exist:Both
commercial and non-commercial.
Most focus on either a particular technolgy or a
particular organism or both.
Commercial databases:



Rosetta Inpharmatics and Genelogic, the specifics of their
internal structure is not available for internal scrutiny due to
their proprietary nature.
Some non-commercial efforts to design more general
databases merit particular mention.
We will discuss few of the most promising ones





ArrayExpress - EBI
The Gene expression Omnibus (GEO) - NLM
The Standford microarray Database
ExpressDB - Harvard
Genex - NCGR Vidhya Jagannathan, SIB, Lausanne
Vidhya Jagannathan, SIB, Lausanne
ArrayExpress



Public repository of microarray based gene
expression data.
Implemented in Oracle at EBI.
Contains:



Accepts submissions in MAGE-ML format via a webbased data annotation/submission tool called
MIAMExpress.
A demo version of MIAMExpress is available at:
http://industry.ebi.ac.uk/~parkinso/subtool/subtype.html
Provides a simple web-based query interface and is directly
linked to the Expression Profiler data analysis tool which allows
expression data clustering and other types of data exploration
directly through theVidhya
web.
Jagannathan, SIB, Lausanne


several curated gene expression datasets
possible introduction of an image server to archive raw
image data associated with the experiments.
Gene Express Omnibus


The Gene Expression Omnibus ia a gene expression
database hosted at the National library of Medicine
It supports four basic data elements







Platform ( the physical reagents used to generate the data)
Sample (information about the mRNA being used)
Submitter ( the person and organisation submitting the
data)
Series ( the relationship among the samples).
It allows download of entire datasets, it has not
ability to query the relationships
Data are entered as tab delimited ASCII records,with
a number of columns that depend on the kind of
array selected.
Supports Serial Analysis of Gene Expression (SAGE)
data.
Vidhya Jagannathan, SIB, Lausanne
Stanford Microarray Database





Contains the largest amount of data.
Uses relational database to answer queries.
Associated with numerious clustering and
analysis features.
Users can access the data in SMD from the
web interface of the package.
Disadvantage :



It supports only Cy3/Cy5 glass slide data
It is designed to exclusively use an oracle
database
Has been recently released outside without
anykind of support !!
Vidhya Jagannathan, SIB, Lausanne
MaxdSQL




Minor changes to the ArrayExpress object data
model allowed it to be instantiated as a relational
database, and MaxdSQL is the resulting
implementation.
MaxdSQL supports both Spotted and Affymetrix data
and not SAGE data.
MaxdSQL is associated with the maxdView, a java
suite of analysis and visualisation tools.This tool also
provides an environment for developing tools and
intergrating existing software.
MaxdLoad is the data-loading application software.
Vidhya Jagannathan, SIB, Lausanne
Vidhya Jagannathan, SIB, Lausanne
GeneX




Open source database and integrated tool set
released by NCGR http://www.ncgr.org.
Open source - provides a basic infrastructure upon
which others can build.
Stores numeric values for a spot measurement
(primary or raw data), ratio and averaged data
across array measurements.
Includes a web interface to the database that allow
users to retrieve:



Entire datasets, subsets
Guided queries for processing by a particular analysis routine
Download data in both tab delimited form and GeneXML
format ( more descriptions later)
Vidhya Jagannathan, SIB, Lausanne
Vidhya Jagannathan, SIB, Lausanne
ExpressDB





ExpressDB is a relational database containing yeast
and E.coli RNA expression data.
It has been conceived as an example on how to
manage that kind of data.
It allows web-querying or SQL-querying.
It is linked to an integrated database for functional
genomics called Biomolecule Interaction Growth and
Expression Database (BIGED).
BIGED is intended to support and integrate RNA
expression data with other kinds of functional
genomics data
Vidhya Jagannathan, SIB, Lausanne
Survey of existing
microarray systems
This survey is based on the article published in
BRIEFINGS IN BIOINFORMATICS, Vol 2, No 2,
pp 143-158, May 2001:
®A comparison of microarray databases
Vidhya Jagannathan, SIB, Lausanne
The Microarray Gene Expression
Database Group (MGED)
History and Future:



Founded at a meeting in November, 1999 in Cambridge, UK.
In May 2000 and March 2001: development of
recommendations for microarray data annotations (MAIME,
MAML).
MGED 2nd meeting:


establishment of a steering committee consisting of representatives
of many of the worlds leading microarray laboratories and
companies
MGED 4th meeting in 2002:
MAIME 1.0 will be published
 MAML/GEML and object models will be accepted by the OMG
 concrete ontology and data normalization recommendations will be
published.
information can be obtained from
http://www.mged.org


Vidhya Jagannathan, SIB, Lausanne
The Microarray Gene Expression
Database Group (MGED)
Goals:

Facilitate the adoption of standards for DNA-array
experiment annotation and data representation.

Introduce standard experimental controls and data
normalization methods.

Establish gene expression data repositories.

Allow comparision of gene expression data from
different sources.
Vidhya Jagannathan, SIB, Lausanne
MGED Working Groups
Goals:

MIAME: Experiment description and data representation
standards - Alvis Brazma

MAGE: Introduce standard experimental controls and data
normalization methods - Paul Spellman. This group includes the
MAGE-OM and MAGE-ML development.

OWG: Microarray data standards, annotations, ontologies and
databases - Chris Stoeckert

NWG: Standards for normalization of microarray data and
cross-platform comparison - Gavin Sherlock
Vidhya Jagannathan, SIB, Lausanne
References

URL:
Tutorial on Information Management for Genome
Level Bioinformatics, Paton and Goble, at VLDB
2001: http://www.dia.uniroma3.it/~vldbproc/#tutEuropea
 European Molecular Biology Network
http://www.embnet.org/
 Univ. Manchester site (with relational version of
Microarray data representation, and links to other
sites)
• http://www.bioinf.man.ac.uk
Database textbook with absolutely no bioinformatics
coverage
For Microarray Data
 http://linkage.rockefeller.edu/wli/microarray/



Vidhya Jagannathan, SIB, Lausanne
Download