e-Science Data Information and Knowledge Transformation

BinX
An edikt Project Testbed
Ted Wen, Robert Carroll, Denise Ecklund, Bob Gibbins, Davy Virdee, Rob Baxter

Presentation outline
edikt project
A data problem
BinX – today
  – language
  – library
  – applications
BinX – future

www.edikt.org

What is edikt?
e-Science Data, Information and Knowledge Transformation
  – a research and development activity designed to bridge the gap between application science and computer science in the realm of Grid-scale data
take prototypes from CS and Grid research…
…engineer them into robust tools…
…for real application science problems…
…test them under extreme science conditions…
…and keep an eye on the commercial possibilities
Team of 8 professional engineers, management & staff
Funded by SHEFC; project started May 2002

Current activities
edikt::Eldas
  – proving GGF’s GDSS for virtual organisations
  – developing scalable data access technologies
edikt::BinX
  – data interchange for astronomy & particle physics (PP)
edikt::Giggle and RLS
  – evaluation of data replication technology for PP
Bioinformatics
  – data mediation to integrate multiple data sources
  – data versioning to manage changing schemas

“eScience Data”: Real-World and In Silico Experiments
[Diagram: research and discovery workflow. Labels: Real-world Experiments, In silico Experiments, Data, Analysis, Abstract Model, Result Data, Results, App areas 1 to 4]
Workflow support tools
  – Format converter
  – Model builder
Generic Tools
  – Existing tools: XML processors
  – New tools: Perl script generators, model description generators

Data integration & mediation
[Diagram labels: Real-world Experiments, Data, Integrator/Mediator, Integrated Data]
Distributed Geo-sensors
  – One sensor type with overlapping observation regions
  – Resolve conflicting values in the overlap
  – Compute “total space”: min or max?
    If max, define missing values
Public Biochemical Signalling DBs
  – Match the input records
  – Build integrated records
  – Detect data value conflicts
  – Resolve data value conflicts
[Diagram labels: sources S1 to S6; Reaction 1, Reaction 2, ..., Reaction n, each with data items D1, D2, D3]

Data subsets
[Diagram labels: Real-world Experiments (1953), Legacy Data, Data Analysis, Results; Real-world Experiments (today), New Data, New Analysis, New Results]
Legacy data was not organized for the new analysis
  – Extract a data subset
  – Define the subset by queries
Structural metadata query: “What is the minimum geo-space data coverage?”
Simple semantic query: “What reactions require 2 or more inhibitor agents to prevent the reaction?”
Complex semantic query: “What objects are contained in a 3-dimensional image?”

BinX for binary data
BinX is a foundation tool for these problems when the data is a structured binary file.
[Diagram: three panels titled Workflow – format conversion, Data Subsets, and Data Integration. Labels: Binary data1, Binary data2, BinX XML1, BinX XML2, BinX-based format conversion, R-W Exper, BinX XML description, Exp1 to Exp3, S1 to S3, D1 to D3, Integrated binary data (I-D)]
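
Illustration: describing a binary file with XML
The slide above rests on one idea: an XML description of a binary layout lets a generic tool read the data and re-emit it in another form. The sketch below shows that idea in Python under loud assumptions. The description format (`record`, `field`, `byteOrder`, and the type names) is invented for this example and is not the BinX language, and `build_struct` and `binary_to_xml` are hypothetical helpers, not part of the BinX library.

```python
import struct
import xml.etree.ElementTree as ET

# Hypothetical, simplified description format. This is NOT the BinX schema;
# the element names, attributes and type names are assumptions for illustration.
DESCRIPTION = """
<record byteOrder="big">
  <field name="sensor_id" type="int32"/>
  <field name="temperature" type="float64"/>
  <field name="flags" type="uint16"/>
</record>
"""

# Map the illustrative type names onto struct format codes.
TYPE_CODES = {"int32": "i", "uint16": "H", "float32": "f", "float64": "d"}


def build_struct(description_xml):
    """Turn the XML description into field names and a struct layout."""
    root = ET.fromstring(description_xml)
    # ">" and "<" select big/little endian with standard sizes and no padding.
    prefix = ">" if root.get("byteOrder") == "big" else "<"
    fields = root.findall("field")
    names = [f.get("name") for f in fields]
    codes = "".join(TYPE_CODES[f.get("type")] for f in fields)
    return names, struct.Struct(prefix + codes)


def binary_to_xml(description_xml, data):
    """Decode fixed-size binary records and re-emit them as an XML string."""
    names, layout = build_struct(description_xml)
    root = ET.Element("records")
    # iter_unpack requires len(data) to be a multiple of the record size.
    for values in layout.iter_unpack(data):
        record = ET.SubElement(root, "record")
        for name, value in zip(names, values):
            ET.SubElement(record, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Two made-up records written in the layout the description declares.
    sample = struct.pack(">idH", 7, 21.5, 3) + struct.pack(">idH", 8, -1.25, 0)
    print(binary_to_xml(DESCRIPTION, sample))
```

Because the layout lives in the description rather than in the code, the same reader could serve the format conversion, data subset and data integration cases by swapping the description, which is the role the slide assigns to BinX.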