A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT DATABASE Maulik Vyas B.E., C.I.T.C, India, 2007 PROJECT Submitted in partial satisfaction of the requirements for the degree of MASTER OF SCIENCE in COMPUTER SCIENCE at CALIFORNIA STATE UNIVERSITY, SACRAMENTO FALL 2011 A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT DATABASE A Project By Maulik Vyas Approved by: __________________________________, Committee Chair Meiliu Lu, Ph.D. __________________________________, Second Reader Ying Jin, Ph. D. ____________________________ Date ii Student: Maulik Vyas I certify that this student has met the requirements for format contained in the University format manual, and that this project is suitable for shelving in the Library and credit is to be awarded for the Project. __________________________, Graduate Coordinator Nikrouz Faroughi, Ph.D. Department of Computer Science iii ________________ Date Abstract of A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM UNIPROT DATABASE by Maulik Vyas Data Warehouses are used by various organizations to organize, understand and use the data with the help of provided tools and architectures to make strategic decisions. Biological data warehouse such as the annotated protein sequence database is subject oriented, volatile collection of data related to protein synthesis used in bioinformatics. Data mart contains a subset of enterprise data from data warehouse that is of value to a specific group of users. I implemented a data mart based on data warehouse design principles and techniques on protein sequence database using data provided by Swiss Institute of Bioinformatics. While the data warehouse contains information about many protein sequence areas, data mart focuses on one or more subject area. It brings together experimental results, computed features and scientific conclusions by implementing star schema and data cube that supports the data warehouse to make it easier for organizations to distribute data within a unit. This enables them to deploy the data, manipulate it and develop the protein sequence data any way they see fit. The main goal of this project is to provide consistent, accurate annotated protein sequence data to group of researchers working on protein sequence. I took a chunk of this data to extract it from warehouse, transform it and loaded it in staging area. I used HJSplit to split the XML protein sequence data into equal parts and iv extract information using XML editor. I populated the database tables in Microsoft Access 2010 from XML file. Once the database was set up, I used MySQL Workbench 5.2 CE to generate queries related to star schema. Finally, I implemented star schema, OLAP operations, and data cube and drill up-down operations for strategic analysis of protein sequence database based on SQL queries. This ensured explicit support for dimension, aggregation and long-range analysis. _______________________, Committee Chair Meiliu Lu, Ph.D. _______________________ Date v DEDICATION This project is dedicated to my beloved parents Kirankumar Vyas and Jayshree Vyas for their never-ending sacrifice, love and support and understanding. I would also like to dedicate this to my loving wife Tanvi Desai for encouraging me to pursue Master in Computer Science and for being a pillar of support for me throughout. vi ACKNOWLEDGMENTS I would like to extend my gratitude to my project advisor Dr. 
Meiliu Lu, Professor, Computer Science for guiding me throughout this project and helping me in completing this project successfully. I am also thankful to Dr. Ying Jin, Professor, Computer Science, for reviewing my report. I am grateful to Dr. Nikrouz Faroughi, Graduate Coordinator, Department of Computer Science, for reviewing my report and providing valuable feedbacks. In addition, I would like to thank The Department of Computer Science at California State University for extending this opportunity for me to pursue this program and guiding me all the way to become a successful student. Lastly, I would like to thank my parents Kirankumar Vyas and Jayshree Vyas and my loving wife Tanvi Desai for providing me the moral support and encouragement throughout my life. vii TABLE OF CONTENTS Page Dedication…………………………………………………………………………………... vi Acknowledgments…………………………………………………………………………... vii List of Figures………………………………………………………………......................... x List of Abbreviations…………………………………………………………...................... xi Chapter 1. INTRODUCTION……………………………………………………………………….. 1 1.1 Introduction to Data Warehousing………………………………………………… 1 1.2 Introduction to Annotated Protein Sequence and UniProt..…….............................. 2 1.3 Goal of the Project……...………………………………………............................. 3 2. COLLECTION AND ANALYSIS OF UNIPROT IN PROTEIN SEQUENCE………… 4 2.1 Collecting UniProt in Protein Sequence.……………………...…………………... 4 2.2 Extract, Transform, Load (ETL)……….……………………...…………………... 5 3. DESIGNING STAR SCHEMA FOR UNIPROT…………………………………......…. 10 3.1 Introduction to Star Schema….…………………………………............................ 10 3.2 Designing a Star Schema……………….………………………………………… 10 3.2.1 Mapping Dimensions into Tables……………………………........................ 11 3.2.2 Dimensional Hierarchy…...……….…………………………........................ 12 4. OLAP OPERATIONS IMPLEMENTED ON UNIPROT…………………….................. 21 4.1 Introduction to Online Analytical Processing…..…………………………..……... 21 4.2 Type of OLAP Operations…………………….…………………………………... 22 viii 4.3 Data Cube………………………………………………………………………….. 23 4.4 OLAP Operations………………………………………………………………….. 26 5. TESTING………………………………………………………………………………… 29 5.1 Test Cases…………………………………………………………………………. 29 6. CONCLUSIONS…………………………………………………………………………. 31 6.1 Summary……………………………………………………………....................... 31 6.2 Strengths and Weaknesses.………………………………………………………... 33 6.2.1 Strengths ………………………………………………….………………… 33 6.2.2 Weakness …………………………………………………………………… 35 6.3 Future Work.…………………………………………………................................. 36 Bibliography………………………………………………………………………………... 37 ix LIST OF FIGURES Page Figure 1-1 Data Warehouse Architecture………………………………………………….. 2 Figure 2-1 Structure of ETL and Data Warehouse……………………………………….... 6 Figure 2-2 Sample XML File During Extraction………………………………………….. 7 Figure 2-3 Implementing Transformation on UniProt…………………………………….. 8 Figure 3-1 Dimension Table of Source………….……………………………………….... 12 Figure 3-2 Sample Data from Source Table……………………………………………...... 13 Figure 3-3 Sample Data from Gene Table…………………………………………………. 14 Figure 3-4 Sample Data from Isoform Table………………………………………………. 15 Figure 3-5 Star Schema Example 1...…………………………………………………........ 16 Figure 3-6 Sample Output of the SQL Query 1…………………………………………..... 18 Figure 3-7 Sample Data of Entry Table……………………………………………………. 19 Figure 3-8 Star Schema Example 2………………………………………………………… 20 Figure 4-1 Front View of Data Cube……..………………………………………………... 24 Figure 4-2 Data Cube………………………………………………………………………. 
26
Figure 4-3 Sample Output of the SQL Query 2……….…………………………………… 28

LIST OF ABBREVIATIONS
DW: Data Warehouse
OLAP: Online Analytical Processing
ETL: Extract, Transform and Load
XML: Extensible Markup Language
OLTP: Online Transaction Processing

Chapter 1 INTRODUCTION

1.1 Introduction to Data Warehousing
A data warehouse (DW) is a methodological approach to organizing and managing a database used for reporting and analysis, while providing the organization with trustworthy, consistent data for the applications running in the organization. The data stored in the warehouse is uploaded from the operational systems. The data may pass through an operational data store for additional operations before it is used in the DW for reporting. A data warehouse depicts data and its relationships by drawing a distinction between data and information [1]. The data is cleaned to remove redundancy, transformed into compatible data and then made available to the managers and professionals handling data mining, online analytical processing (OLAP), market research and decision support. Essential components of data warehousing include analyzing data, extracting data from the database, transforming data and managing the data dictionary [2].

Figure 1-1: Data Warehouse Architecture

1.2 Introduction to Annotated Protein Sequence and UniProt
The protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function. Bioinformatics has revolutionized the biological industry by applying computer technology to the management of biological information. Today's computers are able to gather, store, analyze and integrate biological information that can be applied to protein-based or gene-based drug discovery [3]. UniProt is a high-quality, freely available resource of protein sequence and functional information maintained by a scientific consortium. I used the manually annotated protein sequence database of the Swiss Institute of Bioinformatics [4].

1.3 Goal of the Project
The goal of this project is to understand the interconnection of data warehousing with UniProt, which is part of bioinformatics, as well as to carry out research on the current and potential applications of data warehousing with the available database. This project covers the star schema that can be applied to the database. Research issues that still need to be explored are discussed at the end of the project report. This report is structured as follows: Chapter 2 discusses how the UniProt protein sequence data was collected and analyzed. Chapter 3 discusses the design of a star schema for UniProt. Chapter 4 discusses the OLAP operations implemented on UniProt to obtain stable, non-redundant data. Chapter 5 discusses the testing of the data mart. Chapter 6 gives a summary of implementing data mart and data warehouse concepts on the annotated protein sequence data, the strengths and weaknesses of using the star schema and OLAP operations on that data, and future work.

Chapter 2 COLLECTION AND ANALYSIS OF UNIPROT IN PROTEIN SEQUENCE
This chapter discusses the procedure for collecting the UniProt data from the Swiss Institute of Bioinformatics website. Furthermore, it discusses the analysis done on the protein sequence data to extract data from its XML files.
2.1 Collecting UniProt in Protein Sequence
The Universal Protein Resource (UniProt) is the central repository for protein sequence and annotation data. It is accessible at www.uniprot.org and is a collaboration among the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource. I chose the annotated protein sequence data from the Swiss Institute of Bioinformatics. UniProt is updated every four weeks. I opted for UniProtKB and came across three data sets in three different file formats: XML, FASTA and text. To learn more about FASTA, I downloaded the database with the .fasta extension and then looked for supporting software, finding a program named fasta 1.0.1. FASTA is a scientific data format used to store nucleic acid sequences (such as DNA sequences) or protein sequences. The format may contain multiple sequences and is therefore sometimes referred to as the FASTA database format. FASTA files often start with a header line that may contain comments or other information. The rest of the file contains sequence data. Each sequence starts with a ">" symbol followed by the name of the sequence. The rest of that line describes the sequence, and the remaining lines contain the sequence itself. In order to work with the FASTA data, one needs a machine with a higher hardware configuration. Since it was not feasible to get a high-performance machine, I decided to go with XML, as it is a machine- and platform-friendly format for most large databases.

Once I downloaded the data set in XML format, I extracted it to get a complete XML document. This XML file is over 2 GB in size, so I truncated it in order to use the data from the document. For this I used HJSplit to split the file into predefined parts of 10 MB each. This enabled me to use the data more effectively, since I could generate more fields along with individual tables. Once the XML file was open, I analyzed the data in the document by creating the missing links in the document, and I sorted the document for more clarity about the data flow. Each kind of data was categorized into a different type of table. For example, there was location information for sub-cellular data as well as for genes, so it was put into a set of tables belonging to location. The same was done for gene names, isoforms and organisms. I followed the Extract, Transform and Load (ETL) procedure, which is explained in detail below.

2.2 Extract, Transform and Load (ETL)
ETL is a process that extracts data from different types of systems, transforms it into a structure that can be used for analysis and reporting, and then loads it into a database and/or cube.

Figure 2-1: Structure of ETL and Data Warehouse

Extract: I extracted data from an external source, UniProt's Swiss-Prot database website. This data is a mix of structured and unstructured data. Since the data was in an XML document, it was hard to query directly due to incompatibility, so I put the data in a staging area that is structured in the same way as the original data from the website. I therefore had to extract individual fields from the XML file, which I had split into multiple parts in order to obtain the database fields. Below is one of the many XML files generated after extraction from the Universal Protein Resource website.

Figure 2-2: Sample XML File During Extraction

Transform: Once the data was available in the staging area, I ensured that it was on one platform and in one database. This ensures that we can execute basic operations such as sorting, filtering and joining the tables.
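As a hedged illustration of this staging step, the split XML parts could also be bulk-loaded into a MySQL staging table with the LOAD XML statement (available from MySQL 5.5 onwards). In the project the tables were actually populated through Microsoft Access 2010, so the table and column names below are illustrative assumptions, chosen to match attributes and child elements of the UniProt <entry> element.

-- Hypothetical staging table; LOAD XML fills columns whose names match
-- attributes or child elements of each <entry> row in the split file.
CREATE TABLE stg_uniprot_entry (
  dataset   VARCHAR(20),   -- <entry dataset="Swiss-Prot" ...> attribute
  accession VARCHAR(20),   -- <accession> child element
  name      VARCHAR(40),   -- <name> child element
  sequence  TEXT           -- <sequence> child element
);

LOAD XML LOCAL INFILE 'uniprot_sprot.part01.xml'
INTO TABLE stg_uniprot_entry
ROWS IDENTIFIED BY '<entry>';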
I also checked the data and cleaned it by adding or modifying values as required. As shown in the figure below, the Source table had plenty of inconsistent and incomplete data. I ensured the corresponding data was completely filled in and unwanted data was cleared. After all the data was prepared, I implemented slowly changing dimensions, which are needed to keep track of attributes that change over time and which help in analysis and reporting.

Figure 2-3: Implementing Transformation on UniProt

Load: Finally, the above data is loaded into the data warehouse, usually into fact and dimension tables, so that we can combine the data, aggregate it and load it into data marts to generate star schemas and/or cubes as necessary. What really happens when generating a star schema is that the primary keys of the participating tables, or dimensions, are extracted and listed in the fact table; the fact table is given its own primary key and a name that distinguishes it from the dimension tables. Once this is done, the dimension tables are linked to the fact table through the corresponding keys.

Chapter 3 DESIGNING STAR SCHEMA FOR UNIPROT
This chapter discusses the fundamentals of the star schema. We start by introducing the concept of a star schema and how it is useful to bioinformatics, and then present our implementation on annotated protein sequence data, mapping dimension and fact tables to generate a star schema that can be used for analysis.

3.1 Introduction to Star Schema
A star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema, containing one or more dimension tables and a fact table. It is called a star schema because the entity-relationship diagram between the dimension and fact tables resembles a star, with one fact table connected to multiple dimensions. The center of the star schema consists of a large fact table, which points towards the dimension tables.

3.2 Designing a Star Schema
We start by raising a real-life question: how do we view the logical data stored in the database? For example, we can ask questions like: Which protein sequence was affected by the organism host? Where are the proteins and genes located? In what gene location did the lineage and isoform affect which gene? Some of the above questions are common in biotechnology. In order to answer them, we first need to know the design procedure for a star schema. To analyze the protein sequence data we first identify the business process, then identify the facts or measures, then identify the dimensions for the facts and list the columns that describe each dimension. We conclude by determining the lowest level of summary in the fact table. Most of the above questions ask for aggregated data, such as counts or sums, rather than individual transactions. These questions are then qualified by 'by' conditions, which restrict the data using some condition. Figuring out the aggregated values to be shown, such as protein sequence, gene location and gene, and then figuring out the 'by' conditions is what drives the design of the star schema. It is important to note that in a star schema every dimension has a primary key and a dimension table has no parent table. The hierarchies for a dimension are stored in the dimension table itself. When we examine data, we usually want to see some sort of aggregate. These aggregates are called measures. Measures are numeric values that are measurable and additive; an example is the accession in the Entry table.
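To make the mapping of dimensions and facts concrete, here is a minimal sketch of how the dimension and fact tables used in the examples of this chapter could be declared. The column lists are abbreviated and the exact attribute names (for example host_location for the HostLocation measure used in the examples that follow) are assumptions based on the figures, not the project's actual table definitions.

-- Each dimension table has a single-field surrogate primary key plus
-- descriptive attributes.
CREATE TABLE GeneDimension (
  gene_id    INT PRIMARY KEY,
  gene_name  VARCHAR(40)
);

CREATE TABLE IsoformDimension (
  ID               INT PRIMARY KEY,
  isoform_id       VARCHAR(20),    -- e.g. 'P11171-1'
  isoform_name     VARCHAR(60),
  lineage_id       INT,
  subcell_location VARCHAR(60)
);

-- The fact table holds its own key, the dimension keys and the measure.
CREATE TABLE LocationFact (
  fact_id       INT PRIMARY KEY,
  gene_id       INT,
  ID            INT,             -- key of IsoformDimension
  host_location VARCHAR(60),     -- the HostLocation measure
  FOREIGN KEY (gene_id) REFERENCES GeneDimension (gene_id),
  FOREIGN KEY (ID)      REFERENCES IsoformDimension (ID)
);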
We also need to look at measures using 'by' conditions, which are called dimensions. In order to examine accessions, most scientists or analysts want to see what entry keywords and sequences are obtained periodically [6].

3.2.1 Mapping Dimensions into Tables
A dimension table should have a single-field primary key. This is typically a surrogate key, often just an identity column with an auto-incrementing number. The real information is stored in the other fields, since the primary key's value is meaningless. The other fields are called attributes and contain a full description of the dimension record. Dimension tables often contain large fields. One of the greatest challenges in a star schema is the problem of changing dimensional data [6].

3.2.2 Dimensional Hierarchy
We build dimension tables by storing the OLAP hierarchy within a single dimension table, which allows for the easiest browsing of dimensional data. For example, we have a Source table. If we create a dimension table for it, it looks something like what is shown below:

SourceDimension
Source_Id
source_strain
Source_tissue
Subcelllocation_id
Subcelllocation
Subcell_topology

Figure 3-1: Dimension Table of Source

The Source table consists of tissue and strain. Basically, it shows the source that the protein's gene will affect, whether that source is a tissue or a cell, and which strain the source comes from. A typical Source table is shown below:

Figure 3-2: Sample Data from Source Table

In the above example, users can easily choose a category and list all sub-cellular locations for the required data. The example shows how a dimension hierarchy is built in a star schema. It is browsed with the OLAP drill-down operation, choosing an individual location from within the same table; there is no need to join to an external table for any of the hierarchical information. The two dimension tables joined to the fact table in the simplified example below are generated as follows. The Gene dimension table is generated from the Gene table, which is shown below:

Figure 3-3: Sample Data from Gene Table

The Isoform dimension table is generated from the Isoform table, which contains several different forms of the same protein produced from related genes. The sample data from the Isoform table is shown below with its structure.

Figure 3-4: Sample Data from Isoform Table

In the overly simplified example 1, there are two dimension tables joined to the fact table. For now, the examples will use only one measure: HostLocation.

Figure 3-5: Star Schema Example 1

In order to see the location of a particular isoform for a particular lineage, a SQL query would look something like this:

SELECT subcell_location, isoform_id, isoform_name
FROM IsoformDimension INNER JOIN
  (GeneDimension INNER JOIN LocationFact
   ON GeneDimension.gene_id = LocationFact.gene_id)
  ON IsoformDimension.ID = LocationFact.ID
WHERE GeneDimension.gene_name = 'HLAA'
  AND IsoformDimension.isoform_id = 'P11171'
  AND IsoformDimension.lineage_id = 6

The sample output of the above query is shown below:

isoform_id   isoform_name      subcell_location
P11171-1     1                 Membrane
P11171-2     2                 Cytoplasm
P11171-3     3                 Nucleus
P11171-4     Erythroid         lamellipodium
P11171-5     Non-erythroid A   filopodium
P11171-6     Non-erythroid B   growth cone
P11171-7     7                 synaptosome

Figure 3-6: Sample Output of the SQL Query 1

The fact table contains measures, often simply called facts.
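A hedged example of querying such measures is shown below, rolling the fact rows up to the gene level against the LocationFact sketch given earlier; using COUNT(*) as the measure is an illustrative assumption, not taken from the project.

-- Roll the fact table up to the gene level: one row per gene with the
-- number of location facts recorded for it.
SELECT g.gene_name,
       COUNT(*) AS location_count
FROM LocationFact AS f
INNER JOIN GeneDimension AS g ON g.gene_id = f.gene_id
GROUP BY g.gene_name
ORDER BY location_count DESC;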
The facts are numeric and additive across some or all the dimensions. Fact tables are generally long and skinny while dimension tables are fat. Fact tables hold a number of records represented by the product of the counts in all the dimension tables. When building a star schema, we must decide the granularity of the fact table. The granularity, or frequency, of the data is determined by the lowest level of granularity of each dimension table. Lower the granularity, more the records existing in the fact table. The granularity also determines how far users can drill down without returning to the base, transaction-level data. The Entry table that we will use in next star schema example consists of gene sequence, keyword and organism host name and its location. An entry table data is shown below: 19 Figure 3-7: Sample Data of Entry Table Now let’s look at another 3-dimension, 1 fact table star schema. The measure here again is host location but uses Entry dimension table to gain access to organism where the host is located. 20 Figure 3-8: Star Schema Example 2 21 Chapter 4 OLAP OPERATIONS IMPLEMENTED ON UNIPROT This chapter discusses OLAP operations implemented on UniProt. The chapter introduces concept of OLAP operations, its type. The chapter also briefly discusses data cube along with examples and query. 4.1 Introduction to Online Analytical Processing OLAP (online analytical processing) is computer processing that enables a user to easily and selectively extract and view data from different points of view. For example, a user can request that data be analyzed to display a spreadsheet showing all of a company's beach ball products sold in Florida in July, compare revenue figures with those for the same products in September, and then see a comparison of other product sales in Florida in the same time period. To facilitate this kind of analysis, OLAP data is stored in a multidimensional database. Whereas a relational database can be thought of as two-dimensional, a multidimensional database considers each data attribute (such as product, geographic sales region, and time period) as a separate "dimension." OLAP software can locate the intersection of dimensions (all products sold in the Eastern region above a certain price during a certain time period) and display them. Attributes such as time periods can be broken down into sub-attributes. Main goal of OLAP was to support ad-hoc but complex querying performed by business analysis. Since data is explored and aggregated in various ways, it was important to introduce an interactive process of creating, managing, 22 analyzing and reporting on data that included spreadsheet-like analysis to work with a huge amount of data in the data warehouse. 4.2 Types of OLAP Operations OLAP systems use the following taxonomy. Multidimensional OLAP (MOLAP) is the 'classic' form of OLAP. MOLAP stores data in an optimized multi-dimensional array, rather than in a relational database. Thus, it requires the precomputation and storage of information in the cube - the operation known as processing. Relational OLAP (ROLAP) usually is directly related to relational database. The base data and the dimension tables are stored as relational tables. In order to hold the aggregated information, we use new tables depending on a specialized schema design. The above method is used to manipulate the data stored in the relational database in order to give it a traditional OLAP view by using, slicing and dicing functionality. 
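As a hedged illustration, a ROLAP-style slice over the star schema from Chapter 3 is just an ordinary join with a filter; the table and column names follow the earlier examples and are assumptions rather than the project's exact definitions.

-- Slice: restrict the view of the cube to a single member of the gene dimension.
SELECT g.gene_name,
       i.isoform_name,
       f.host_location
FROM LocationFact AS f
INNER JOIN GeneDimension    AS g ON g.gene_id = f.gene_id
INNER JOIN IsoformDimension AS i ON i.ID      = f.ID
WHERE g.gene_name = 'HLAA';   -- the slicing condition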
In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause in the SQL statement. Comparing the two OLAPs we can distinguish that each type has certain benefits, although there is disagreement about the specifics of the benefits between providers. Some MOLAP implementations are prone to the database explosion, a phenomenon causing vast amounts of storage space to be used by MOLAP databases when certain common conditions are met: high number of dimensions, pre-calculated results and sparse multidimensional data. 23 MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space compared to ROLAP because the specialized storage typically includes compression techniques. ROLAP is generally more scalable. However, large-volume pre-processing is difficult to implement efficiently so it is frequently skipped. ROLAP query performance can therefore suffer tremendously. Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use. 4.3 Data Cube A Data Cube (OLAP Cube or Multi-dimensional Cube) is a data structure that allows faster analysis of data. It also has the capability to manipulate and analyze data from multiple perspectives. The cube consists of numeric facts called Measures, which are categorized by Dimensions. The cube structure may be created from star schema or snowflake schema of tables in the database. Measures are derived from records in fact table and dimensions are derived from dimension tables. For the current project of UniProt, we will consider cube metadata created from star schema. 24 Figure 4-1 Front View of Data Cube The above cube is used to represent data along some measure of interest. Although it is called a ‘cube’, it can be 2-dimensional, 3-dimensional or higher. Each dimension represents some attribute in database and cells in data cube represent a measure of interest. For example, we can count the number of times that attribute combination occurs in the database or the minimum, maximum, sum of some attribute. Queries are performed on the cube to retrieve decision support information. 25 In above example, we have three tables that are related to gene, organism where it resides and source tissue. So the data cube formed from this is the 3-dimensional representation with each table cell of the cube representing a combination of values shown from organism, source and gene. The content of each cell is counted number of times that specific combination of value occurs together in database. Cells that appear blank, in fact, have a value of zero. The cube can then be used to retrieve information within the database about which gene affects which organism and which specific source tissue is affected. Now let us consider another data cube example in which we show the maximum value of three attributes isoform, lineage and interactant. What this shows is how many times isoform interacts with lineage and uses interactant as label. So basically, we have isoform names which are listed that interact with taxon through the listed names and uses interactant labels to find out the maximum number of interaction between taxon and isoform names. Because of too many cells in cube is filled with no data it takes up valuable processing time by effectively adding up zeros, which are in empty cells. This condition is called Sparsity and to overcome this, we have to use Linking cubes. 
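In SQL terms, the cells of such a cube can be materialized with a GROUP BY over the dimension attributes; value combinations that never occur simply produce no row, which is the sparsity just described. The sketch below is hedged: it assumes that, as in Star Schema Example 2, the fact table also carries organism and source keys, and the OrganismDimension and SourceDimension names are illustrative.

-- One row per (organism, source tissue, gene) combination; the cell value
-- is the number of times that combination occurs in the database.
SELECT o.organism_name,
       s.source_tissue,
       g.gene_name,
       COUNT(*) AS cell_count
FROM LocationFact AS f
INNER JOIN OrganismDimension AS o ON o.organism_id = f.organism_id
INNER JOIN SourceDimension   AS s ON s.source_id   = f.source_id
INNER JOIN GeneDimension     AS g ON g.gene_id     = f.gene_id
GROUP BY o.organism_name, s.source_tissue, g.gene_name
WITH ROLLUP;   -- adds the drilled-up subtotal rows along the grouping hierarchy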
For example, a gene may be available for all organisms and sources, but the location may not be available at this level of analysis. So instead of creating a sparse cube, it is sometimes better to create a separate but linked cube in which a subset of the data can be analyzed in great detail. The linking ensures that the data in the cubes remain consistent.

Figure 4-2: Data Cube

4.4 OLAP Operations
Common operations include slice and dice, drill down, roll up, and pivot. With OLAP, we can analyze multidimensional data from multiple perspectives. OLAP consists of three basic analytical operations: consolidation, drill-down, and slicing and dicing. In consolidation, we aggregate data so that it can be accumulated and computed along one or more dimensions. Slicing and dicing is where users take a specific subset of data out of the cube and view the slices from different viewpoints.

OLAP usually uses a multidimensional data model so that complex analytical and ad-hoc queries can be executed rapidly. The core of any OLAP system is an OLAP cube (also called a 'multidimensional cube' or a hypercube). The cube consists of numeric facts called measures, which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table, and dimensions are derived from the dimension tables. Each measure can be thought of as having a set of labels, or metadata, associated with it. A dimension is what describes these labels; it provides information about the measure.

OLAP Slicing: A slice is a subset of a multi-dimensional array corresponding to a single value for one or more members of the dimensions not in the subset. For example, if the member Actuals is selected from the Scenario dimension, then the sub-cube of all the remaining dimensions is the slice that is specified. The data omitted from this slice would be any data associated with the non-selected members of the Scenario dimension, for example budget, variance, forecast, etc. From an end-user perspective, the term slice most often refers to a two-dimensional page selected from the cube [7].

OLAP Drill-up and drill-down: Drilling down or up is a specific analytical technique whereby the user navigates among levels of data ranging from the most summarized (up) to the most detailed (down). In example 1, if we have to drill down to a subcategory, the SQL would change to look like this:

SELECT subcellularlocation_id, isoform_id, isoform_name
FROM IsoformDimension INNER JOIN
  (GeneDimension INNER JOIN LocationFact
   ON GeneDimension.gene_id = LocationFact.gene_id)
  ON IsoformDimension.ID = LocationFact.ID
WHERE GeneDimension.gene_name = 'HIBADH'
  AND IsoformDimension.subcellloc_id = 37
  AND IsoformDimension.lineage_id = 6

Sample output of the above SQL query would be as shown below:

isoform_id   isoform_name   subcelllocation_id   subcell_location
P53353-1     FSA-Acr.1      37                   Secreted

Figure 4-3: Sample Output of the SQL Query 2

Chapter 5 TESTING
In this chapter, we discuss some test cases implemented on the data mart and the procedure that was followed while testing.

5.1 Test Cases
It is very important to test a project effectively for successful implementation. For testing, unit testing and black-box testing methodologies are recommended for most projects. The test data set is generated by working with end users.
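Two hedged examples of the kind of consistency checks such test cases can run once the tables are loaded; the table and column names follow the sketches from the earlier chapters and are assumptions.

-- Row counts: a quick check that the load populated every table.
SELECT 'GeneDimension' AS table_name, COUNT(*) AS row_count FROM GeneDimension
UNION ALL
SELECT 'IsoformDimension', COUNT(*) FROM IsoformDimension
UNION ALL
SELECT 'LocationFact', COUNT(*) FROM LocationFact;

-- Orphan check: fact rows whose gene key has no matching dimension row.
SELECT f.fact_id, f.gene_id
FROM LocationFact AS f
LEFT JOIN GeneDimension AS g ON g.gene_id = f.gene_id
WHERE g.gene_id IS NULL;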
The test files were used to check that data was being populated correctly and that extraction was done exactly as desired. The following information defines the test cases and how the results were documented.

Test # Test Case
1. Test the internet connection for continuity to ensure a successful download of the XML file from the Universal Protein Resource website.
2. Use HJSplit to split the 2.26 GB XML document. Check the split parts for loose ends and add the appropriate markup at the beginning and end of each part.
3. Extract data from the first XML split file and subsequently from the others.
4. Since 2.26 GB of data is huge, I decided to extract a limited amount of data and truncated the file after obtaining 2,500 samples of protein sequence.
5. Check the data format of the XML file. This is important because, when extracting data, anything not in the required XML format could lead to improperly distributed data.
6. Once the data is transformed and loaded into the database, check that the data is consistent.
7. Check for redundancy, clean the data, and perform database operations to optimize the data for the queries and load generated from the schemas.
8. Create new SQL queries to insert, update and select data from the database to generate the star schema and data cube. Specify the primary keys, foreign keys and relationships.
9. Run the SQL queries that populate the data used in the star schema and data cube.

Sample runs for the above test plans are covered by the star schema examples in Section 3.2 and, for the data cube, are discussed in detail in Section 4.3. The star schema discussion has two examples of how a query is used to work with the star schema, showing what data is generated as well as what measures are used to identify the relationship. Similarly, the data cube discussion has an example of how three tables in the database are used to count the number of occurrences of a combination of values in the database, as well as the maximum number of times an occurrence takes place with an organism and gene.

Chapter 6 CONCLUSIONS

6.1 Summary
The main purpose of this project was to understand the working of data warehouses in bioinformatics, especially as related to protein sequences. In this project, we learned that research groups, analysts and lab technicians in organizations can benefit greatly from applying techniques like the star schema, the data cube and OLAP operations to the data in the data warehouse to obtain cohesive, analytical results. The test cases led us to approximations for the missing or biased aggregates of those cells that have missing or low support. The method we implemented adapts to sudden changes in data distribution, called discontinuities, that inevitably occur in real-life data collected for the purpose of analysis. Since most of these data are collected to support ongoing research, they are usually called operational data. The data warehouse is used to collect and organize data for analysis, which can also be referred to as informational data, and uses OLAP. I integrated the protein sequence data with gene and source data. The reason is that integration plays a vital role in the data warehouse, since data gathered from a variety of sources is merged into a coherent whole. This adds stability to the data stored in the data warehouse and makes it useful for users. This project was developed to accumulate experimental knowledge of protein function, taking advantage of the easy availability of protein sequence data. This enabled me to model the protein sequence data according to research group requirements and to trace the evolution of protein sequence function.
The protein sequence data can be used in a protein classification service: proteins can be classified from their protein sequences at the family and subfamily levels. The second application is an expression data analysis service, where functional classification information can help find biological patterns in data obtained from genome-wide experiments. The third application of this project is coding SQL queries for a single-nucleotide polymorphism scoring service. In this case, information about proteins is used to assess the likelihood of a deleterious effect from substituting a taxon or lineage at a specific position. The technologies used to implement the data warehouse, such as the star schema and data cube, can be very beneficial for the above applications.

The coursework in data warehousing and data mining that I took under the expert guidance of Dr. Meiliu Lu was an enlightening and enriching experience. It helped me understand the goals and techniques used in data warehousing. The course helped me construct a data warehouse and understand design techniques for relational databases, such as star schema design and online analytical processing. I also learned how to implement a data cube in three dimensions using multidimensional databases, by creating and maintaining it. Through this coursework, I was motivated to implement a data mart on annotated protein sequence data and to use design techniques like the star schema, the data cube and OLAP operations on the protein sequence data.

A data cube contains cells, each of which is associated with some summary information, or aggregate, on which decisions are to be based. However, in protein sequence databases, due to the nature of their contents, the data distribution tends to be clustered and sparse. The sparsity situation gets worse as the number of cells increases. It is necessary to acquire support for those cells that have support levels below a certain threshold by combining them with adjacent cells. Otherwise, incomplete or biased results could be derived due to lack of sufficient support.

The data often comes from OLTP systems but may also come from spreadsheets, flat files and other sources; in this case, the database came from an XML file. The data is formatted in such a way that it provides fast responses to queries. Star schemas provide fast responses by denormalizing dimension tables and potentially by providing many indexes. In the protein sequence database, 'Db Reference id' was used as an index to accelerate the fetching of data. We implemented OLAP operations on the star schema to get more result-oriented data by implementing the data cube, also known as an OLAP cube. Once we have the query that generates data for the star schema, we can retrieve the factual information stored in the database.

6.2 Strengths and Weaknesses
We briefly discuss the strengths and weaknesses of the star schema, OLAP operations and the data warehouse here. Most importantly, how a DW and OLAP operations are utilized to effectively monitor and analyze data is specific to each organization.

6.2.1 Strengths
The simplicity with which users can write queries and work with the database is a very important benefit of the star schema, since queries are written with simple inner joins between the fact table and a small number of dimensions. Star joins are simpler than is possible in the snowflake schema. WHERE conditions can be used to filter on the desired attributes, and aggregation is quite fast. Additionally, the star schema provides a direct and intuitive mapping between the business entities being analyzed by end users and the schema design.
For typical star queries, it provides highly optimized performance. Furthermore, star schema is widely supported by a large number of business intelligence tools that can anticipate or require that data-warehouse schema contain dimension tables. One of the major benefits of the star schema is that the low-level transactions may be summarized to the fact table grain. This greatly speeds the queries performed as part of the decision support process. However, the aggregation or summarization of the fact table is not always done if cubes are being built [8]. OLAP allows for the minimization of data entry. For each detail record, only the primary key value from the Source table is stored, along with the primary key of the gene table, and then the sub cellular location is added. This greatly reduces the amount of data entry necessary to add a product to an order. Not only does this approach reduce the data entry required, it greatly reduces the size of a Source record. The records take up much less space with a normalized table structure. This means that the table is smaller, which helps speed inserts, updates, and deletes. In addition to keeping the table smaller, most of the fields that link to other tables are numeric. Queries generally perform much better against numeric fields than they do against text fields. 35 Therefore, replacing a series of text fields with a numeric field can help speed queries. Numeric fields also index faster and more efficiently. With normalization, there are frequently fewer indexes per table. Each transaction requires the maintenance of affected indexes. With fewer indexes to maintain, inserts, updates, and deletes run faster [9]. 6.2.2 Weakness In a star schema there is no relationship between two relational tables. All dimensions are denormalized and query performance degrades. A star schema is hard to design. It is easier on users but very hard for developers and modelers. Dimensional table is in denormalized form so it can decrease performance and increase queries response time. There are some disadvantages to an OLAP when trying to analyze queries. Queries must utilize joins across multiple tables to get all the data, which make it to read slowly. When normalization is implemented, developers have no choice but to query from multiple tables to get the detail necessary for a report. Fewer indexes per table which is an advantage of OLAP sometimes are a disadvantage too. Fewer indexes per table speed up insert, update, and delete. However, if we use fewer indexes in the database, then the select queries will run slower. For data retrieval, a higher number of correct indexes help speed retrieval. Since one need to speed transactions by minimizing the number of indexes, OLAP databases trade faster transactions at the cost of slowing data retrieval. Last but 36 not least, the data in an OLAP system is not user friendly. If an analyst wants to spend more time performing analysis by looking at the data, the IT group should support their desire for fast, easy queries, so it is important that we solve the problem since the data retrieval may be slow as a trade off. We can solve this by having a second copy of the data in a structure reserved for analysis. This is heavily indexed allowing analyst and customers to perform large queries against the data without impacting modification on the main data [9]. 
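As a hedged sketch, indexing that analysis copy can be as simple as adding indexes on the fact table's dimension keys and on the attributes that are filtered on most often; the table and column names follow the Chapter 3 examples and are assumptions.

-- Indexes on the star-join keys of the analysis copy.
CREATE INDEX idx_fact_gene    ON LocationFact (gene_id);
CREATE INDEX idx_fact_isoform ON LocationFact (ID);
-- Index on a frequently filtered dimension attribute.
CREATE INDEX idx_gene_name ON GeneDimension (gene_name);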
6.3 Future Work
As part of future work, I would like to develop a tool or application that can extract data directly from the data warehouse and generate a star schema and data cube, providing the desired data according to user requirements. I would also like to implement other OLAP operations, such as pivot, dice and slice, in a more detailed data warehouse where they are implemented across multiple tables. It would be a great learning experience to prepare a general data warehouse that can benefit multiple organizations in bioinformatics rather than being specific to protein sequences. Time permitting, I would also implement more thorough testing of the star schema to see how well it holds up against various user requirements when it is changed dynamically.

BIBLIOGRAPHY
[1] Oracle9i Data Warehousing Guide, Release 2 (9.2), Part Number A96520-01 [Online] http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/concept.htm
[2] National Library of Medicine [Online] http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
[3] Federal Energy Management Program, "Using Distributed Energy Resources" [Online] http://www.ebi.ac.uk/uniprot/index.html
[4] Howard Hamilton, Ergun Gurak, Leah Findlater, Wayne Olive, and James Ranson, "Knowledge Discovery in Databases" [Online] http://www2.cs.uregina.ca/~hamilton/courses/831/notes/dcubes/dcubes.html
[5] Passionned Tools, "ETL Tools Comparison" [Online] http://www.etltool.com/what-is-etl.htm
[6] Craig Utley, "Designing the Star Schema Database" [Online] http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx
[7] OLAP Council, "OLAP and OLAP Server Definitions", January 1995 [Online] http://altaplana.com/olap/glossary.html#SLICE
[8] Oracle9i Data Warehousing Guide, Release 2 (9.2), Part Number A96520-01 [Online] http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm
[9] Katherine Drewek, "Data Warehousing: Similarities and Differences of Inmon and Kimball" [Online] http://www.b-eye-network.com/view/743
[10] "Business Intelligence and Data Warehousing" [Online] http://www.sdgcomputing.com/glossary.htm