TABLE OF CONTENTS

CHAPTER  TITLE  PAGE

DECLARATION  ii
DEDICATION  iii
ACKNOWLEDGEMENT  iv
ABSTRACT  v
ABSTRAK  vi
TABLE OF CONTENTS  vii
LIST OF TABLES  x
LIST OF FIGURES  xi
LIST OF ABBREVIATIONS  xiii
LIST OF APPENDICES  xiv

1  PROJECT INTRODUCTION  1
    1.1 Introduction  1
    1.2 Background Problem  2
    1.3 Problem Statement  3
    1.4 Project Aim  4
    1.5 Objective  4
    1.6 Scope  5
    1.7 Significance of the Study  5
    1.8 Organization of the Report  6

2  LITERATURE REVIEW  7
    2.1 Introduction  7
    2.2 Data Format  9
    2.3 Microarray Analysis Process  10
        2.3.1 Sharing of Microarray Data  11
        2.3.2 Microarray Data Standardization  12
        2.3.3 The End Product of Microarray Data Analysis  13
    2.4 Enterprise Information Approach  13
    2.5 Metadata  14
        2.5.1 Function of Metadata  16
        2.5.2 Structuring Metadata  16
        2.5.3 Metadata Schema and Element Set  17
        2.5.4 Metadata for Dataset  18
        2.5.5 Creating Metadata  18
    2.6 XML  19
        2.6.1 DTD to Validate XML Data  19
    2.7 Database Management System (DBMS)  21
    2.8 Data Model Used Based on Data Flow  21
        2.8.1 Relational Data Model  22
            2.8.1.1 Notation of Relational Data Model  24
            2.8.1.2 Limitation of Relational Data Model  26
        2.8.2 XML Data in Relational Database  27
            2.8.2.1 Create XML Tree  28
            2.8.2.2 The Storage of XML Data  29
    2.9 Data Repository  31
        2.9.1 Data Warehouse  31
            2.9.1.1 Drawback of Data Warehousing  32
            2.9.1.2 Data Warehouse Metadata  32
        2.9.2 Data Marts  33
        2.9.3 Data Federation  33
            2.9.3.1 Issues in Database Federation  34
    2.10 Summary  36

3  METHODOLOGY  37
    3.1 Introduction  37
    3.2 Operational Framework  37
    3.3 Metadata for Data Integration  41
    3.4 Metadata Framework for Biological Data  42
    3.5 Accurate Measure Query of Protein Secondary Structure Prediction Process  44
    3.6 Summary  44

4  EXPERIMENTAL RESULT AND DISCUSSION  45
    4.1 Introduction  45
    4.2 Current Process for Protein Secondary Structure Prediction  46
    4.3 Websites Used Based on the Query Flow Process  48
        4.3.1 Motif Website  48
        4.3.2 Prosite Database  50
        4.3.3 BLAST NCBI  52
        4.3.4 PRINTS (DbBrowser)  55
        4.3.5 PDB  57
    4.4 Enterprise Based Data Model  61
    4.5 Metadata Framework for Integrated Data Model  61
        4.5.1 System Overview  63
            4.5.1.1 Convert Data to XML  65
            4.5.1.2 Create XML Schema  68
            4.5.1.3 Create Relational Database on XML Schema  68
            4.5.1.4 The Relations among Tables in the Relational Database  70
            4.5.1.5 Query Result  75
        4.5.2 XML Query  76
        4.5.3 Data Store  79
    4.6 Summary  80

5  CONCLUSION AND FUTURE WORK  81
    5.1 Introduction  81
    5.2 Summary of Work  82
    5.3 Achievement  83
    5.4 Limitation  83
    5.5 Future Work and Recommendation  84
    5.6 Conclusion  85

REFERENCES  86
APPENDICES A-C  91-95


LIST OF TABLES

TABLE NO  TITLE  PAGE

2.1  Schematic representation of different elements of the relation table  23
4.1  The data block table with data block name  72
4.2  The entity poly table  72
4.3  Entity poly category  72
4.4  Entity poly sequence  72
4.5  Atom site table  73
4.6  Entity poly sequence table  73
4.7  Structure reference table  74
4.8  Structure reference sequence table  74
4.9  Structure reference category  74
4.10  Structure reference sequence category  74
4.11  Atom site category  74


LIST OF FIGURES

FIGURE NO  TITLE  PAGE

2.1  Example of flat file format  9
2.2  A portion of a sample protein in the database  9
2.3  Document type definition (DTD), which can have an external portion, an internal portion, or both  20
2.4  Basic components in the notation of the relational model  24
2.5  Diagram of the different tables and the overall relational microarray database structure of ArrayDB  25
2.6  Flow chart from XML element to node of XML tree  28
2.7  Flow chart of the insert data algorithm  30
3.1  Project operational framework  40
3.2  Metadata framework for biological data  43
4.1  The query flow of the PSSP process  46
4.2  The process for searching the “PEEL” motif sequence  48
4.3  Example of the selected database  49
4.4  The result showing the number of motifs found in the selected library databases  49
4.5  Example of the FASTA format  50
4.6  Example of the motif sequence search in the Prosite database  51
4.7  The result of the motif sequence search  51
4.8  Example of the motif sequence search on the BLAST website  52
4.9  The query ID, description, molecule type, query, and database name for the motif sequence “PEEL”  53
4.10  The sequences producing significant alignments  53
4.11  The alignments of the motif sequence with secondary assignment  54
4.12  The search sequence query process  55
4.13  The result of the sequence fragment search  55
4.14a)  Seed alignment with 4 sequences  56
4.14b)  The alignment view of the query display sequence “PEEL”  56
4.15  The structure view of the query display sequence “PEEL”  56
4.16  The interface for searching motifs on the PDB website  57
4.17  The motif query structure hits from the PDB website  58
4.18  Example of a detailed description of motif query hits  58
4.19  The motif sequence identified as “PEEL” found in the currently displayed seqres sequence  59
4.20  The carbonic anhydrase 3D structure  59
4.21a)  The PDB file in textual file format  60
4.21b)  Example of a PDB/XML file  60
4.22  Metadata framework for biological data  62
4.23  Overall system overview  63
4.24  Workflow to transform XML into a relational database  64
4.25a)  Flat file format from the BLAST database  65
4.25b)  XML file format after transformation from the flat file  66
4.25c)  DTD for validating the XML file  66
4.25d)  Code file for validating the XML file with the DTD  67
4.25e)  The XSLT view of the XML file, showing only the data that scientists need from the XML file  67
4.26a)  The XML schema for the PDB XML file  69
4.26b)  The SQL statement to create the table corresponding to the XML data  69
4.27  The relations among tables in the relational database  71
4.28a)  The connection string statement and query statement  75
4.28b)  The table view of the search result for the motif sequence  76
4.28c)  The detailed result view for datablock name 1BGC  76
4.29a)  The “motiftable.xquery” file  77
4.29b)  The motiftable.aspx code file  78
4.29c)  The “motiftable.aspx.cs” code file  78
4.29d)  The output of the XQuery result  79


LIST OF SYMBOLS

<  -  The beginning of the tag element in the DTD
>  -  The end of the tag element in the DTD
?  -  Zero or one in DTD
+  -  One or many in DTD
*  -  Zero or many in DTD


LIST OF ABBREVIATIONS

DBMS  -  Database Management System
DOM  -  Document Object Model
DTD  -  Document Type Definition
EDF  -  Extensible Data Format
FBB  -  Faculty of Bioscience and Bioengineering
PSSP  -  Protein Secondary Structure Prediction
RDB  -  Relational Database
XML  -  Extensible Markup Language


LIST OF APPENDICES

APPENDIX  TITLE  PAGE

A  Gantt Chart for Project 1  91
B  Gantt Chart for Project 2  93
C  Motif Query Research Tool  95