Proteomics databases for comparative studies: Transactional and Data Warehouse approaches
Patricia Rodriguez-Tomé, Nicolas Pinaud, Thomas Kowall
GeneProt, Switzerland

What is Proteomics?
[Workflow diagram: samples are separated (CEX, RP), the resulting peptides are analysed by MS and MS/MS, and bioinformatics processes plus manual analysis match the spectra against protein, EST and genomic data before loading into the DB.]

About DBs for Proteomics at GeneProt: needs, data, transactional DB, data warehouse, data mining.

Data Management Challenges
- A high-throughput environment requires near real time processing.
- Quick response to evolving laboratory procedures and evolving user needs.
- Accommodate heterogeneous data types.
- Manage a constantly rising flood of data.
- Need for convenient data access at all levels of granularity, via analysis software and web front ends.
- Adapt to the demand for global queries across all proteomics studies.
- Adapt and innovate to offer new tools: statistics, data mining.

Data Flow
[Diagram: experimental data from the LIMS arrive as XML and are loaded into the DB; identification of peptides and proteins, annotation, external data sources and data export all connect to the DB.]

Data details
- Experimental data: store MS and MS/MS peak lists; store all meta data.
- Identification: load peptide matches, identified proteins, scores.
- Automatic annotation and analysis: give access to data, store results.
- Expert annotation: give interactive access to data using a Web interface; store manual validation and annotation.
- External data sources: import information from external data sources: taxonomy, ontologies, bibliography…
- Export data: export all or a subset of data, as flat file or database dump.
- Misc: access control, security and confidentiality; data consistency/integrity checks; error checks and corrections; run statistics; backup and archive.

Data production per project
- Raw data (spectra): 330 000 -> 1 500 000
- Identified peptides: 45 000 -> 145 000
- Identified sequences: 10 000 -> 120 000
- Database size: 15 GB -> 140 GB
- Number of projects: 16, for 1 TB of database files.

Implementation: transactional
- Intended to capture all relevant information from the proteomics experiments: protein identification, automatic and manual annotation and validation.
- Each proteome is isolated in its own ProtDB (16 at present).
- Complex and generic data model for efficient data storage.
- Built-in data consistency and error checks.
- A layer of « views » provides fast query access.
- Web front end: interactive means to visualize, update and validate data.

Limitations
We have 16 projects on-line:
- High cost of maintenance to keep all database schemas compatible.
- Space: could we archive some of the projects? New spectrometers produce more data.
- Inter-database queries: the technique « exists », but the implementation is often awkward and there is no efficient solution in our case.
What about overcoming these limitations and taking advantage of this wealth of data?
- Decide what data are actually important in the long term.
- Merge the data from all the projects.
- Clean and consolidate the data.
- Implement an update procedure to keep this « merged data system » up to date (archive old projects).

Data Warehouse?
This looks very much like the definition of a data warehouse!
- Data consolidation and integration
- Non-instantaneous accuracy, non-volatility
- Comprehensive data structure
- Query throughput

ProtWare: proteomics data warehouse
1. Stores consolidated and final analysis results; centralises data common to proteins in all proteome studies.
2. Is read-only, not real time; asynchronous updates are run weekly.
3. Data model is focused on proteome-to-proteome comparisons (see the query sketch below).
4. Comprehensive data structure which enhances the performance of analysis queries.
5. Ideally suited for statistical analysis and data mining tools.
6. Provides a decision support system.
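To make the proteome-to-proteome comparison idea concrete, here is a minimal ANSI SQL sketch of the kind of global query ProtWare is designed to serve. The table and column names (identification, protein_ac, proteome_id, best_score) are hypothetical illustrations, not the actual ProtWare schema.

```sql
-- Illustrative only: assumes a hypothetical consolidated table
--   identification(protein_ac, proteome_id, best_score)
-- listing each protein identified in each proteome study.
-- Proteins identified in proteome 3 but never seen in proteome 7.
SELECT DISTINCT i.protein_ac
FROM   identification i
WHERE  i.proteome_id = 3
  AND  NOT EXISTS (
         SELECT 1
         FROM   identification j
         WHERE  j.proteome_id = 7
           AND  j.protein_ac  = i.protein_ac
       );
```

In the transactional setting the same question would require a cross-database query over two ProtDB instances, exactly the case described above as awkward; in the warehouse it becomes a single query over one consolidated table.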
ProtDB and ProtWare data flow
[Diagram: XML input data (~10^11 bytes) are loaded into the ProtDB instances P1 … Pn; an Extraction, Transformation and Loading step, together with classification and taxonomy data, feeds ProtWare (~10^8 bytes), which in turn serves analyses and statistical queries, the website, flat file exports and database dumps (~10^5 bytes).]

ProtDB vs ProtWare
ProtDB: transactional system
- Data input, real-time access to data.
- Data updates, annotation, validation.
- Error and consistency checks.
- Stores experimental data.
- Stores all steps of data annotation and validation (keeps history).
- In-depth queries on a given proteome.
ProtWare: data warehouse
- Read-only, asynchronous updates from ProtDB.
- Consolidated data and final results of annotation and validation (no history).
- No experimental data.
- Queries oriented to proteome comparisons, statistics, data mining.
- Decision support system.

The needle in a haystack
Of course we are looking for the Holy Grail! Find the interesting proteins in all our data that:
- can be used for diagnostics,
- can explain a disease,
- can be used to cure a disease.

KDD and Data Mining
Knowledge Discovery in Databases is « the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data ». Data Mining is the discovery stage of KDD. Data mining tools provide additional possibilities to explore a database.

Data Mining tools
- ProtWare: the data warehouse model is protein-query oriented.
- R package: statistics and clustering tools.
- Oracle 10g: new data mining functions.

Database infrastructure
- Data input files use XML.
- RDBMS: Oracle 9i, moving to Oracle 10g on Linux.
- ProtWare uses ANSI SQL, portable to other ANSI SQL compliant systems (PostgreSQL); see the sketch below.
- Web interface built with standard technologies: PERL, CGI, DBI, HTML, Javascript, SVG.
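As a second illustration of the global, statistics-oriented queries the warehouse is meant to answer, and of the ANSI SQL portability claim, the sketch below summarises identifications per proteome. It reuses the same hypothetical identification table as above (not the actual ProtWare schema) and should run unchanged on Oracle or PostgreSQL.

```sql
-- Illustrative only: per-proteome summary over the hypothetical
-- identification(protein_ac, proteome_id, best_score) table.
SELECT proteome_id,
       COUNT(DISTINCT protein_ac) AS identified_proteins,
       AVG(best_score)            AS mean_best_score
FROM   identification
GROUP BY proteome_id
ORDER BY proteome_id;
```

Summaries of this kind are a natural starting point for the R-based statistics and clustering tools and the Oracle data mining functions mentioned above.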