Macromolecular Structure Database Project
EMSD
Infrastructure Services for Europe
To develop an autonomous structural database capability in Europe
http://www.ebi.ac.uk/msd
Temblor
Advanced search
[Diagram: EBI-MSD collaborations and funding]
Funders shown: EMBL, Wellcome Trust, EU, BBSRC, MRC
Projects and activities: Spine, CCP4, Structural Genomics harvesting, Autostruct, EHTPX, Validation, E-science, Integration, NMRQual, SCOP / CATH / Pfam, IIMS, CCPN, Electron Microscopy, Data Exchange
Partner sites: Oxford, CLRC, York, Daresbury, Utrecht, Sanger Inst, Cambridge
USA: RCSB, BMRB; Data Exchange
Legend: Grant & co-ordinator; Grant Funding; Core Funding
E-MSD provides
- clean biological data
- integrated data
- a single web access point
- query interfaces for different users
- interconnected views of the data relating structure, sequence, text and experimental details
For biologists, chemists, structural biologists and teachers
[Diagram: web interface]
Search query by: keyword, sequence, structure, active site, ligand
Query results: sorted hit list, PDB atlas page, interactive viewer
Atlas page views: structure, secondary structure, sequence, active site, experimental data
Data: PDB, secondary structure, folds (SCOP/Dali), ligands, active sites, Medline, SwissProt
Methods: FastA, SSM

[Diagram: web services]
Data APIs: PDB, secondary structure, folds (SCOP/Dali), ligands, active sites, Medline, SwissProt
Methods as web services: FastA, SSM
Web-based pages: search interfaces, interactive visualisation
DATA INTEGRATION
A database for all?
MSD SEARCH DATABASE

Data integration

We want to include all types of biological data:
- Structure, sequence, textual
- Observed biochemistry (BRENDA)
- Sequence annotation (PRINTS)
- DNA: ORFs, SNPs

[Diagram: current data sources: PDB, secondary structure, folds (SCOP/Dali), ligands, active sites, Medline, SwissProt]

But we can’t do everything!

So can the Grid allow the integration of data from other sources?
Problems for Grid (1 - Provenance)

- We are a funded institute.
- We have to be seen to be useful or we do not get funded!
- Industry needs to be seen to be useful too, by its shareholders.
- Origin of the distributed information: users and funding bodies need to see who provided the information.
- How do we retain and present the detail of this?
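
One way to picture the provenance requirement is to attach a source label to every value before results from different providers are merged, so the label survives into whatever is shown to the user. The sketch below is only an illustration under that assumption: the class names, the `merge_results` and `credits` helpers, and the sample entries are all hypothetical and are not part of any MSD or Grid interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provenance:
    provider: str  # who supplied the value (hypothetical label)
    resource: str  # the database the value was taken from


@dataclass(frozen=True)
class Annotated:
    value: str
    source: Provenance


def merge_results(*result_sets):
    """Combine hit lists from several providers, keeping every source label."""
    merged = {}
    for results in result_sets:
        for key, item in results.items():
            merged.setdefault(key, []).append(item)
    return merged


def credits(merged):
    """Return the distinct providers that contributed to a merged result."""
    return sorted(
        {item.source.provider for items in merged.values() for item in items}
    )


# Illustrative data only; "entry-1" is a made-up identifier.
structure_hits = {
    "entry-1": Annotated("structure summary",
                         Provenance("EBI-MSD", "MSD search database")),
}
sequence_hits = {
    "entry-1": Annotated("sequence annotation",
                         Provenance("SwissProt", "SwissProt")),
}

merged = merge_results(structure_hits, sequence_hits)
print(credits(merged))
```

Because nothing is discarded during the merge, a results page built from `merged` can still show each user and funding body which provider supplied which item.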
Problems for Grid (2)

- We do not know “best practice” in much of biology.
- There will be conflicting information:
  - Methods: structure alignment, secondary structure...
  - Data: multiple coordinates, multiple sequence data...
- Data and methods have associated validity information; different data or methods may be inconsistent only in part.
- How is conflicting information going to be presented to, and filtered for, a user?
- Who is going to assign data validity?
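
The point that sources may be inconsistent only in part can be made concrete with a small sketch. It assumes each claim arrives as a tuple of entry, attribute, value, source and a validity score between 0 and 1; that tuple layout, the `find_conflicts` helper, the method names and the scores are all hypothetical. The sketch deliberately keeps every conflicting claim, ordered by validity, rather than choosing a winner, since who assigns validity is exactly the open question.

```python
from collections import defaultdict


def find_conflicts(assertions):
    """Report (entry, attribute) pairs for which the sources disagree.

    Each assertion is (entry, attribute, value, source, validity).
    The validity score is a hypothetical 0..1 number supplied by the caller.
    """
    grouped = defaultdict(list)
    for entry, attribute, value, source, validity in assertions:
        grouped[(entry, attribute)].append((value, source, validity))

    conflicts = {}
    for key, claims in grouped.items():
        distinct_values = {value for value, _, _ in claims}
        if len(distinct_values) > 1:
            # Keep all claims, highest validity first, instead of picking one.
            conflicts[key] = sorted(claims, key=lambda c: c[2], reverse=True)
    return conflicts


# Illustrative data only: two secondary-structure methods that agree on one
# residue and disagree on another.
assertions = [
    ("entry-1", "residue 42 secondary structure", "helix", "method A", 0.9),
    ("entry-1", "residue 42 secondary structure", "coil", "method B", 0.6),
    ("entry-1", "residue 43 secondary structure", "helix", "method A", 0.9),
    ("entry-1", "residue 43 secondary structure", "helix", "method B", 0.7),
]

conflicts = find_conflicts(assertions)
for key, claims in conflicts.items():
    print(key, claims)
```

Only residue 42 is reported, so a user interface could show agreement by default and expand the disagreement on request.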
Problems for Grid (3 - Data access control)

- Bioinformatics is fashionable at the moment.
- There is a “problem” when something is perceived to be useful, e.g. there are about 60,000 patents in the US for the ~30,000 human genes. Not a problem yet, but...
- This is more than data security:
  - Will Grid employ some good lawyers?
  - Will Grid hide information on request? (cf. the PDB “hold” status)
  - Will Grid “modify” information on request? (cf. Google search result order having been “updated”)
Summary

- We want to be able to provide a scientific service: web pages and web services.
- We would like to be able to expand the results to include information from other data resources.
- The three issues raised are only a few of many, but they represent fundamental problems.
CLEAN DATA: Quaternary structure
[Diagram labels: Assembly, Sub-Assembly, Chains, Biology, Xray Experiment, Atoms, Residues]
CLEAN DATA: Example of experimental result
Authors would know the structure; we have to derive it at submission.
Asymmetric unit: M. Bochtler et al., Nature, 403, 800 (2000)
Contains 3 separate molecules: 2 copies of a dodecamer and 1 hexamer
[Figure labels: Dodecamer, Hexamer, Assembly]
http://pqs.ebi.ac.uk
CLEAN DATA: Resolution
Sliding scale for rules

Electron density for phenylalanine at different resolutions:
- Correctly placed into the 1.2 Å data.
- This can still be done with confidence in the 2 Å case.
- But at 3 Å we already observe a deviation of the centroid of the ring from the correct model.
Z-score = (Fit - <Fit>) / sigma
A large positive spike indicates a residue that fits worse than the average for that residue type in structures of similar resolution.
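
The Z-score formula can be worked through in a few lines. This is a minimal sketch that assumes "Fit" is a per-residue measure where larger values mean a worse fit, and that <Fit> and sigma are the mean and standard deviation of that measure for the same residue type in structures of similar resolution. The function name and the reference numbers are made up for illustration; the slides do not say how the reference set is built or which standard deviation is used (population standard deviation is assumed here).

```python
from statistics import mean, pstdev


def z_score(fit, reference_fits):
    """Z-score = (Fit - <Fit>) / sigma against a reference distribution.

    reference_fits: fit values for the same residue type in structures of
    similar resolution (an assumed input; how it is collected is not shown).
    """
    average = mean(reference_fits)
    sigma = pstdev(reference_fits)  # population standard deviation (assumed)
    if sigma == 0:
        raise ValueError("reference fits have zero spread; Z-score undefined")
    return (fit - average) / sigma


# Made-up reference values for one residue type at one resolution range.
reference = [0.10, 0.12, 0.08, 0.10]

typical = z_score(0.10, reference)  # close to zero: an average residue
outlier = z_score(0.16, reference)  # large and positive: a poorly fitting residue
print(round(typical, 3), round(outlier, 3))
```

Plotting such values along the chain gives the spikes described above: residues near zero are typical, and large positive values flag residues worth inspecting.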
[Figure labels: 1qi3, 1f83, 1rmg; scale from Terrible to Good]
Geometric outliers: phenylalanine
LIGAND DB
[Diagram: Ligand DB, Loader, Site environment DB]
Interaction types: covalent bonds, coordinate bonds, hydrogen bonds, planes, non-bonding, electrostatics, di-sulphide bonds
[Figure labels: N, ASP, PHE, O, S, PHE, VAL]