A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM
UNIPROT DATABASE
Maulik Vyas
B.E., C.I.T.C, India, 2007
PROJECT
Submitted in partial satisfaction of
the requirements for the degree of
MASTER OF SCIENCE
in
COMPUTER SCIENCE
at
CALIFORNIA STATE UNIVERSITY, SACRAMENTO
FALL
2011
A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM
UNIPROT DATABASE
A Project
By
Maulik Vyas
Approved by:
__________________________________, Committee Chair
Meiliu Lu, Ph.D.
__________________________________, Second Reader
Ying Jin, Ph.D.
____________________________
Date
Student: Maulik Vyas
I certify that this student has met the requirements for format contained in the University format
manual, and that this project is suitable for shelving in the Library and credit is to be awarded for
the Project.
__________________________, Graduate Coordinator
Nikrouz Faroughi, Ph.D.
Department of Computer Science
________________
Date
Abstract
of
A DATA MART FOR ANNOTATED PROTEIN SEQUENCE EXTRACTED FROM
UNIPROT DATABASE
by
Maulik Vyas
Data warehouses are used by various organizations to organize, understand and use their data, with the help of the provided tools and architectures, to make strategic decisions. A biological data warehouse such as the annotated protein sequence database is a subject-oriented, non-volatile collection of data related to protein synthesis used in bioinformatics. A data mart contains a subset of enterprise data from the data warehouse that is of value to a specific group of users. I implemented a data mart based on data warehouse design principles and techniques on a protein sequence database, using data provided by the Swiss Institute of Bioinformatics. While the data warehouse contains information about many protein sequence areas, the data mart focuses on one or more subject areas. It brings together experimental results, computed features and scientific conclusions by implementing a star schema and a data cube that support the data warehouse, making it easier for organizations to distribute data within a unit. This enables them to deploy the data, manipulate it and develop the protein sequence data any way they see fit. The main goal of this project is to provide consistent, accurate annotated protein sequence data to a group of researchers working on protein sequence. I took a chunk of this data, extracted it from the warehouse, transformed it and loaded it into a staging area. I used HJSplit to split the XML protein sequence data into equal parts and extracted information using an XML editor. I populated the database tables in Microsoft Access 2010 from the XML file. Once the database was set up, I used MySQL Workbench 5.2 CE to generate queries related to the star schema. Finally, I implemented the star schema, OLAP operations, the data cube and drill-up and drill-down operations for strategic analysis of the protein sequence database based on SQL queries. This ensured explicit support for dimensions, aggregation and long-range analysis.
_______________________, Committee Chair
Meiliu Lu, Ph.D.
_______________________
Date
DEDICATION
This project is dedicated to my beloved parents Kirankumar Vyas and Jayshree Vyas for their never-ending sacrifice, love, support and understanding. I would also like to dedicate this to my loving wife Tanvi Desai for encouraging me to pursue a Master's in Computer Science and for being a pillar of support for me throughout.
ACKNOWLEDGMENTS
I would like to extend my gratitude to my project advisor Dr. Meiliu Lu, Professor, Computer Science, for guiding me throughout this project and helping me complete it successfully. I am also thankful to Dr. Ying Jin, Professor, Computer Science, for reviewing my report. I am grateful to Dr. Nikrouz Faroughi, Graduate Coordinator, Department of Computer Science, for reviewing my report and providing valuable feedback. In addition, I would like to thank the Department of Computer Science at California State University, Sacramento for extending to me the opportunity to pursue this program and guiding me all the way to become a successful student.
Lastly, I would like to thank my parents Kirankumar Vyas and Jayshree Vyas and my loving wife Tanvi Desai for providing me moral support and encouragement throughout my life.
TABLE OF CONTENTS

Dedication
Acknowledgments
List of Figures
List of Abbreviations

Chapter
1. INTRODUCTION
1.1 Introduction to Data Warehousing
1.2 Introduction to Annotated Protein Sequence and UniProt
1.3 Goal of the Project
2. COLLECTION AND ANALYSIS OF UNIPROT IN PROTEIN SEQUENCE
2.1 Collecting UniProt in Protein Sequence
2.2 Extract, Transform, Load (ETL)
3. DESIGNING STAR SCHEMA FOR UNIPROT
3.1 Introduction to Star Schema
3.2 Designing a Star Schema
3.2.1 Mapping Dimensions into Tables
3.2.2 Dimensional Hierarchy
4. OLAP OPERATIONS IMPLEMENTED ON UNIPROT
4.1 Introduction to Online Analytical Processing
4.2 Types of OLAP Operations
4.3 Data Cube
4.4 OLAP Operations
5. TESTING
5.1 Test Cases
6. CONCLUSIONS
6.1 Summary
6.2 Strengths and Weaknesses
6.2.1 Strengths
6.2.2 Weaknesses
6.3 Future Work
Bibliography
LIST OF FIGURES

Figure 1-1 Data Warehouse Architecture
Figure 2-1 Structure of ETL and Data Warehouse
Figure 2-2 Sample XML File During Extraction
Figure 2-3 Implementing Transformation on UniProt
Figure 3-1 Dimension Table of Source
Figure 3-2 Sample Data from Source Table
Figure 3-3 Sample Data from Gene Table
Figure 3-4 Sample Data from Isoform Table
Figure 3-5 Star Schema Example 1
Figure 3-6 Sample Output of the SQL Query 1
Figure 3-7 Sample Data of Entry Table
Figure 3-8 Star Schema Example 2
Figure 4-1 Front View of Data Cube
Figure 4-2 Data Cube
Figure 4-3 Sample Output of the SQL Query 2
LIST OF ABBREVIATIONS

DW: Data Warehouse
OLAP: Online Analytical Processing
ETL: Extract, Transform and Load
XML: Extensible Markup Language
OLTP: Online Transaction Processing
Chapter 1
INTRODUCTION
1.1 Introduction to Data Warehousing
A data warehouse (DW) is a methodological approach to organizing and managing a database used for reporting and analysis, while providing an organization with trustworthy, consistent data for the applications it runs. The data stored in the warehouse is uploaded from the operational systems, and it may pass through an operational data store for additional operations before it is used in the DW for reporting. A data warehouse depicts data and its relationships by drawing a distinction between data and information [1].
The data is cleaned to remove redundancy, transformed into a compatible form and then made available to managers and professionals handling data mining, online analytical processing (OLAP), market research and decision support.
Essential components of data warehousing include analyzing data, extracting data from the database, transforming data and managing the data dictionary [2].
Figure 1-1: Data Warehouse Architecture
1.2 Introduction to Annotated Protein Sequence and UniProt
The protein database is a collection of sequences from several sources, including translations
from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt,
PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure
and function.
Bioinformatics has revolutionized the biological industry by applying computer technology to the management of biological information. Today's computers are able to gather, store, analyze and integrate biological information that can be applied to protein-based or gene-based drug discovery [3].
UniProt is a scientific consortium that provides high-quality, freely available resources of protein sequence and functional information. I used the manually annotated protein sequence database of the Swiss Institute of Bioinformatics [4].
1.3 Goal of the Project
The goal of this project is to understand the interconnection of data warehousing with UniProt, which is part of bioinformatics, and to carry out research on the current and potential applications of data warehousing to the available database. This project covers a star schema that can be applied to the database. Research issues that still need to be explored are discussed at the end of the project report. The report is structured as follows: Chapter 2 discusses how the UniProt protein sequence data was collected and analyzed. Chapter 3 discusses the design of a star schema for UniProt. Chapter 4 discusses the OLAP operations implemented on UniProt to get stable, non-redundant data. Chapter 5 discusses testing. Chapter 6 gives a summary of implementing data mart and data warehouse concepts on annotated protein sequence data, the strengths and weaknesses of using a star schema and OLAP operations on this data, and future work.
Chapter 2
COLLECTION AND ANALYSIS OF UNIPROT IN PROTEIN SEQUENCE
This chapter discusses the procedure for collecting UniProt data from the Swiss data bank website. It furthermore discusses the analysis done on the protein sequence to get data from its XML files.
2.1 Collecting UniProt in Protein Sequence
The Universal Protein Resource (UniProt) is the bank for protein sequence and annotation data. It is accessible at www.uniprot.org and is a collaboration among the European Bioinformatics Institute, the Swiss Institute of Bioinformatics and the Protein Information Resource. I settled on the annotated protein sequence data from the Swiss Institute of Bioinformatics. UniProt is updated every four weeks. I opted for UniProtKB and came across three data sets in three different file formats: XML, FASTA and text.
To learn more about FASTA, I downloaded the database with the .fasta extension and then looked for supporting software, finding a program named fasta 1.0.1. FASTA is a scientific data format used to store nucleic acid sequences (such as DNA sequences) or protein sequences. The format may contain multiple sequences and is therefore sometimes referred to as the FASTA database format. FASTA files often start with a header line that may contain comments or other information. The rest of the file contains sequence data. Each sequence starts with a ">" symbol followed by the name of the sequence; the rest of that line describes the sequence, and the remaining lines contain the sequence itself. Working with data in FASTA form requires a machine with a high configuration.
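To illustrate the format, a minimal FASTA record looks like the following (the accession and sequence here are hypothetical and shortened for display):

>sp|PXXXXX|EXAMPLE_HUMAN Hypothetical example protein OS=Homo sapiens
MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ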
Since it was not feasible to get a high-performance machine, I decided to go with XML, as it is a machine- and platform-friendly file format for most large databases. Once I downloaded the data set in XML format, I extracted it to get a complete XML document. This XML file is over 2 GB in size, so I had to truncate it in order to use the data from the document; I used HJSplit to divide it into pre-defined parts of 10 MB each. This enabled me to use the data more effectively, since I could generate more fields along with individual tables. Once the XML was open, I analyzed the data by creating the missing links in the document, and I sorted the document for more clarity about the data flow. Each kind of data was categorized into a different type of table. For example, there was location data for both subcellular entries and genes, so it was put into a set of tables belonging to a location group; the same was done for the names of genes, isoforms and organisms. I followed the Extract, Transform and Load (ETL) procedure, which is explained in detail below.
2.2 Extract, Transform and Load (ETL)
ETL is a process that extracts data from different types of systems, transforms it into a structure that can be used for analysis and reporting, and then loads it into a database and/or cube.
Figure 2-1: Structure of ETL and Data Warehouse
Extract: I extracted data from an external source, UniProt's Swiss-Prot database website. This data is partly structured and partly unstructured. Since the data was in an XML document, it was hard to query directly because of incompatibility, so I placed the data in a staging area structured in the same way as the original data from the website. I then had to extract individual fields from the XML file, which I had split into multiple parts, in order to obtain database fields. Below is one of the many XML files generated after extraction from the Universal Protein website.
Figure 2-2: Sample XML File During Extraction.
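For reference, MySQL also provides a LOAD XML statement that can map repeating XML elements directly onto table rows. The sketch below is a hypothetical alternative (the file and table names are assumptions), not the procedure actually used in this project, which relied on an XML editor and Microsoft Access:

-- Hypothetical sketch: bulk-load the repeating <entry> elements of a
-- split UniProt file into a staging table (MySQL 5.5 and later).
LOAD XML LOCAL INFILE 'uniprot_sprot_part01.xml'
INTO TABLE entry_staging
ROWS IDENTIFIED BY '<entry>';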
Transform: Once the data was available in the staging area, I ensured that it was on one platform and in one database. This makes it possible to execute basic queries such as sorting, filtering and joining tables. I also checked the data and cleaned it by adding or modifying data as required. As shown in the figure below, the source table had plenty of inconsistent and incomplete data; I ensured the corresponding data was completely filled in and unwanted data was cleared. After all the data was prepared, I implemented slowly changing dimensions, which are needed to keep track of attributes that change over time and which help in analysis and reporting.
Figure 2-3: Implementing Transformation on UniProt
Load: Finally, the above data is loaded into the data warehouse, usually into fact and dimension tables, so that we can combine the data, aggregate it and load it into data marts to generate star schemas and/or cubes as necessary. When generating a star schema, we take the primary keys of the participating tables, or dimensions, put them in the fact table, give the fact table its own primary key, and name it to distinguish it from the dimension tables. Once this is done, the dimension tables are linked to the fact table through the corresponding primary keys.
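A minimal sketch of this step in SQL, using the table and column names of the star schema in Chapter 3 (the column types and the surrogate key are assumptions):

-- Sketch: the fact table holds only the dimension keys plus the measure,
-- with its own surrogate primary key and foreign keys to the dimensions.
CREATE TABLE LocationFact (
  fact_id      INT AUTO_INCREMENT PRIMARY KEY,  -- fact table's own key
  gene_id      INT NOT NULL,                    -- key of GeneDimension
  ID           VARCHAR(20) NOT NULL,            -- key of IsoformDimension
  HostLocation VARCHAR(100),                    -- the measure
  FOREIGN KEY (gene_id) REFERENCES GeneDimension (gene_id),
  FOREIGN KEY (ID) REFERENCES IsoformDimension (ID)
);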
Chapter 3
DESIGNING STAR SCHEMA FOR UNIPROT
This chapter discusses the fundamentals of the star schema. We start by introducing the concept of a star schema and how it is useful to bioinformatics, and then present our implementation on annotated protein sequence data, mapping dimension and fact tables to generate a star schema that can be used for analysis.
3.1 Introduction to Star Schema
A star schema is a relational database schema for representing multidimensional data. It is the simplest form of data warehouse schema, containing one or more dimension tables and a fact table. It is called a star schema because the entity-relationship diagram between the dimensions and the fact table resembles a star in which one fact table is connected to multiple dimensions. The center of the star schema consists of a large fact table that points towards the dimension tables.
3.2 Designing a Star Schema
We start by raising a real-life question: how do we view the logical data stored in the database? For example, we can ask questions like:
- Which protein sequence was affected by an organism host?
- Where are the proteins and genes located?
- In what gene location did the lineage and isoform affect which gene?
Some of the above questions are common in biotechnology. In order to answer them, we first need to know the design procedure for a star schema. To analyze the protein sequence data, we first identify the business process, then identify the facts or measures, then identify the dimensions for the facts and list the columns that describe each dimension. We conclude by determining the lowest level of summary in the fact table.
Most of the above questions ask for aggregated data, such as counts or sums, rather than individual transactions. Finally, these questions are qualified by 'by' conditions, which refer to the data using some condition. Figuring out the aggregated values to be shown, such as protein sequence, gene location and gene, and then figuring out the 'by' conditions, drives the design of the star schema.
It is important to note that in a star schema every dimension has a primary key and a dimension table has no parent table. The hierarchies for a dimension are stored in the dimension table itself. When we examine data, we usually want to see some sort of aggregated data; these aggregates are called measures. Measures are numeric values that are measurable and additive; an example is accession in the entry table. We also look at measures through 'by' conditions, which are called dimensions. In order to examine accession, most scientists or analysts want to see what entry keywords and sequences are obtained periodically [6].
3.2.1 Mapping Dimensions into Tables
A dimension table should have a single-field primary key. This is typically a surrogate key, often just an identity column with an auto-incrementing number. The real information is stored in the other fields, since the primary key's value is meaningless. These other fields are called attributes and contain a full description of the dimension record. Dimension tables often contain large fields. One of the greatest challenges in a star schema is the problem of changing dimensional data [6].
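A sketch of such a dimension table in SQL, using the fields of the Source dimension shown in the next section (the column types are assumptions):

-- Sketch: a single-field, auto-incrementing surrogate key; all real
-- information lives in the attribute fields that follow it.
CREATE TABLE SourceDimension (
  Source_Id          INT AUTO_INCREMENT PRIMARY KEY,
  source_strain      VARCHAR(100),
  Source_tissue      VARCHAR(100),
  Subcelllocation_id INT,
  Subcelllocation    VARCHAR(100),
  Subcell_topology   VARCHAR(100)
);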
3.2.2 Dimensional Hierarchy
We build dimension tables by implementing an OLAP hierarchy, usually within a single dimension table. Storing the hierarchy in a dimension table allows for the easiest browsing of dimensional data.
For example, we have a Source table. If we create a dimension table for it, it will look something like what is shown below:
SourceDimension
Source_Id
source_strain
Source_tissue
Subcelllocation_id
Subcelllocation
Subcell_topology
Figure 3-1: Dimension Table of Source
The Source table consists of tissue and strain. Basically, it shows the source that the protein gene will affect, whether that source is a tissue or a cell, and what the strain on the source is. A typical Source table looks as shown below:
Figure 3-2: Sample Data from Source Table
Storing the hierarchy in a dimension table allows for the easiest browsing of dimensional data. In the above example, users can easily choose a category and list all subcellular locations for the required data. This shows how a hierarchy is built within a dimension table in a star schema. Browsing it uses the OLAP drill-down operation to choose an individual location from within the same table; there is no need to join to an external table for any of the hierarchical information.
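As a sketch, browsing this hierarchy needs nothing more than a query against the one dimension table (the filter value below is hypothetical):

-- List the subcellular locations recorded for one tissue category,
-- entirely within the dimension table; no external join is needed.
SELECT DISTINCT Subcelllocation, Subcell_topology
FROM SourceDimension
WHERE Source_tissue = 'Liver'
ORDER BY Subcelllocation;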
The Gene dimension table is generated from the Gene table, which is shown below:
Figure 3-3: Sample Data from Gene Table
The Isoform dimension table is generated from the Isoform table, which contains several different forms of the same protein produced from related genes. Sample data from the Isoform table is shown below along with its structure.
Figure 3-4: Sample Data from Isoform Table
In the overly-simplified example 1, there are two dimension tables joined to the fact table. For
now, examples will use only one measure: HostLocation.
Figure 3-5: Star Schema Example 1
In order to see the location for a particular isoform for a particular lineage, a SQL query would
look something like this:
SELECT subcell_location, isoform_id, isoform_name
FROM IsoformDimension INNER JOIN (GeneDimension INNER JOIN
LocationFact ON GeneDimension.gene_id = LocationFact.gene_id)
ON IsoformDimension.ID = LocationFact.ID
WHERE GeneDimension.gene_name = 'HLAA' AND IsoformDimension.isoform_id = 'P11171'
AND IsoformDimension.lineage_id = 6;
The sample output of the above query is shown below:

isoform_id   isoform_name      subcell_location
P11171-1     1                 Membrane
P11171-2     2                 Cytoplasm
P11171-3     3                 Nucleus
P11171-4     Erythroid         lamellipodium
P11171-5     Non-erythroid A   filopodium
P11171-6     Non-erythroid B   growth cone
P11171-7     7                 synaptosome

Figure 3-6: Sample Output of the SQL Query 1
The fact table contains measures, often called facts. The facts are numeric and additive across some or all of the dimensions. Fact tables are generally long and skinny, while dimension tables are fat. A fact table can hold up to as many records as the product of the record counts of all the dimension tables. When building a star schema, we must decide the granularity of the fact table. The granularity, or frequency, of the data is determined by the lowest level of granularity of each dimension table. The lower the granularity, the more records exist in the fact table. The granularity also determines how far users can drill down without returning to the base, transaction-level data.
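For example, a fact table kept at the isoform-level grain can still answer coarser questions by aggregating upward; a sketch against the tables above:

-- Summarize the isoform-level facts up to the gene grain:
-- one row per gene with the number of distinct isoforms recorded.
SELECT GeneDimension.gene_name, COUNT(DISTINCT LocationFact.ID) AS isoform_count
FROM LocationFact INNER JOIN GeneDimension
ON GeneDimension.gene_id = LocationFact.gene_id
GROUP BY GeneDimension.gene_name;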
The Entry table that we will use in the next star schema example consists of the gene sequence, keyword, and the organism host's name and location. Entry table data is shown below:
Figure 3-7: Sample Data of Entry Table
Now let us look at another star schema, with three dimension tables and one fact table. The measure here is again the host location, but this schema uses the Entry dimension table to gain access to the organism where the host is located.
Figure 3-8: Star Schema Example 2
Chapter 4
OLAP OPERATIONS IMPLEMENTED ON UNIPROT
This chapter discusses the OLAP operations implemented on UniProt. It introduces the concept of OLAP operations and their types, and briefly discusses the data cube along with examples and a query.
4.1 Introduction to Online Analytical Processing
OLAP (online analytical processing) is computer processing that enables a user to easily and
selectively extract and view data from different points of view. For example, a user can request
that data be analyzed to display a spreadsheet showing all of a company's beach ball products
sold in Florida in July, compare revenue figures with those for the same products in September,
and then see a comparison of other product sales in Florida in the same time period. To facilitate
this kind of analysis, OLAP data is stored in a multidimensional database. Whereas a relational
database can be thought of as two-dimensional, a multidimensional database considers each data
attribute (such as product, geographic sales region, and time period) as a separate "dimension."
OLAP software can locate the intersection of dimensions (all products sold in the Eastern region
above a certain price during a certain time period) and display them. Attributes such as time periods can be broken down into sub-attributes. The main goal of OLAP is to support ad-hoc but complex querying performed by business analysts. Since data is explored and aggregated in various ways, it was important to introduce an interactive process of creating, managing, analyzing and reporting on data, including spreadsheet-like analysis, to work with the huge amount of data in the data warehouse.
4.2 Types of OLAP Operations
OLAP systems use the following taxonomy.
Multidimensional OLAP (MOLAP) is the 'classic' form of OLAP. MOLAP stores data in an optimized multidimensional array rather than in a relational database. It therefore requires the pre-computation and storage of information in the cube, an operation known as processing.
Relational OLAP (ROLAP) works directly with a relational database. The base data and the dimension tables are stored as relational tables, and new tables are used to hold the aggregated information, depending on a specialized schema design. This method manipulates the data stored in the relational database to give it a traditional OLAP view through slicing and dicing functionality. In essence, each action of slicing and dicing is equivalent to adding a "WHERE" clause to the SQL statement.
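For instance, slicing the UniProt data down to a single lineage is, in ROLAP terms, just one added predicate (the lineage value below is hypothetical):

-- A ROLAP "slice": the same star join with one extra WHERE condition.
SELECT subcell_location, isoform_name
FROM IsoformDimension INNER JOIN LocationFact
ON IsoformDimension.ID = LocationFact.ID
WHERE IsoformDimension.lineage_id = 6;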
Comparing the two OLAP types, each has certain benefits, although providers disagree about the specifics:
- Some MOLAP implementations are prone to database explosion, a phenomenon in which vast amounts of storage space are consumed when certain common conditions are met: a high number of dimensions, pre-calculated results and sparse multidimensional data.
- MOLAP generally delivers better performance due to specialized indexing and storage optimizations. MOLAP also needs less storage space than ROLAP because its specialized storage typically includes compression techniques.
- ROLAP is generally more scalable. However, large-volume pre-processing is difficult to implement efficiently and so is frequently skipped; ROLAP query performance can therefore suffer tremendously.
- Since ROLAP relies more on the database to perform calculations, it has more limitations in the specialized functions it can use.
4.3 Data Cube
A data cube (OLAP cube or multidimensional cube) is a data structure that allows faster analysis of data and the ability to manipulate and analyze data from multiple perspectives. The cube consists of numeric facts called measures, which are categorized by dimensions. The cube structure may be created from a star schema or snowflake schema of tables in the database. Measures are derived from records in the fact table, and dimensions are derived from the dimension tables. For the current UniProt project, we will consider cube metadata created from the star schema.
Figure 4-1: Front View of Data Cube
The above cube represents data along some measure of interest. Although it is called a 'cube', it can be 2-dimensional, 3-dimensional or of higher dimension. Each dimension represents some attribute in the database, and each cell in the data cube represents a measure of interest. For example, a cell can hold the number of times an attribute combination occurs in the database, or the minimum, maximum or sum of some attribute. Queries are performed on the cube to retrieve decision support information.
In the above example, we have three tables, related to the gene, the organism where it resides, and the source tissue. The data cube formed from them is a 3-dimensional representation, with each cell representing a combination of values from organism, source and gene. The content of each cell is the number of times that specific combination of values occurs together in the database; cells that appear blank in fact have a value of zero. The cube can then be used to retrieve information from the database about which gene affects which organism and which specific source tissue is affected.
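A sketch of how these cube cells can be computed in SQL; the Entry dimension names and the fact table's entry and source keys are assumptions, following star schema example 2:

-- Each output row is one cube cell: an (organism, source, gene)
-- combination together with the number of times it occurs.
SELECT EntryDimension.organism_name, SourceDimension.Source_tissue,
       GeneDimension.gene_name, COUNT(*) AS occurrences
FROM LocationFact
INNER JOIN EntryDimension ON EntryDimension.entry_id = LocationFact.entry_id
INNER JOIN SourceDimension ON SourceDimension.Source_Id = LocationFact.source_id
INNER JOIN GeneDimension ON GeneDimension.gene_id = LocationFact.gene_id
GROUP BY EntryDimension.organism_name, SourceDimension.Source_tissue,
         GeneDimension.gene_name;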
Now let us consider another data cube example, in which we show the maximum value over three attributes: isoform, lineage and interactant. This shows how many times an isoform interacts with a lineage, using the interactant as a label. Basically, we have a list of isoform names that interact with taxa through the listed names, and we use the interactant labels to find the maximum number of interactions between a taxon and an isoform name. Because too many cells in the cube contain no data, processing wastes valuable time effectively adding up the zeros in the empty cells. This condition is called sparsity, and to overcome it we use linked cubes. For example, the gene may be available for all organisms and sources, but the location may not be available at this level of analysis. Instead of creating a sparse cube, it is sometimes better to create a separate but linked cube in which a subset of the data can be analyzed in great detail; the linking ensures that the data in the cubes remain consistent.
Figure 4-2: Data Cube
4.4 OLAP Operations
Common operations include slice and dice, drill down, roll up, and pivot. With OLAP, we can analyze multidimensional data from multiple perspectives. OLAP consists of three basic analytical operations: consolidation, drill-down, and slicing and dicing.
In consolidation, we aggregate data so that it can be accumulated and computed in one or more dimensions. Slicing and dicing is where users take a specific set of data out of the cube and view the slices from different viewpoints.
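Consolidation can be sketched in MySQL with WITH ROLLUP, which appends drilled-up aggregate rows to the detailed ones; MySQL 5.x has no full CUBE operator, so this yields partial aggregates only:

-- Counts per (gene, subcellular location), plus roll-up rows per gene
-- and a grand total; NULLs mark the consolidated rows.
SELECT GeneDimension.gene_name, IsoformDimension.subcell_location,
       COUNT(*) AS n
FROM LocationFact
INNER JOIN GeneDimension ON GeneDimension.gene_id = LocationFact.gene_id
INNER JOIN IsoformDimension ON IsoformDimension.ID = LocationFact.ID
GROUP BY GeneDimension.gene_name, IsoformDimension.subcell_location
WITH ROLLUP;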
OLAP usually uses a multidimensional data model, so that complex analytical and ad-hoc queries can be executed rapidly. The core of any OLAP system is the OLAP cube (also called a 'multidimensional cube' or a hypercube). The cube consists of numeric facts called measures, which are categorized by dimensions. The cube metadata is typically created from a star schema or snowflake schema of tables in a relational database. Measures are derived from the records in the fact table, and dimensions are derived from the dimension tables. Each measure can be thought of as having a set of labels, or metadata, associated with it. A dimension is what describes these labels; it provides information about the measure.
OLAP SLICING
A slice is a subset of a multidimensional array corresponding to a single value for one or more members of the dimensions not in the subset. For example, if the member Actuals is selected from the Scenario dimension, then the sub-cube of all the remaining dimensions is the slice that is specified. The data omitted from this slice would be any data associated with the non-selected members of the Scenario dimension, for example budget, variance or forecast. From an end-user perspective, the term slice most often refers to a two-dimensional page selected from the cube [7].
OLAP Drill-up and drill-down:
Drilling down or up is a specific analytical technique whereby the user navigates among levels of data, ranging from the most summarized (up) to the most detailed (down). In example 1, if we drill down to a subcategory, the SQL changes to look like this:
SELECT subcellularlocation_id, isoform_id, isoform_name
FROM IsoformDimension INNER JOIN (GeneDimension INNER JOIN
LocationFact ON GeneDimension.gene_id = LocationFact.gene_id)
ON IsoformDimension.ID = LocationFact.ID
WHERE GeneDimension.gene_name = 'HIBADH' AND IsoformDimension.subcellloc_id = 37
AND IsoformDimension.lineage_id = 6;
Sample output of the above SQL query would be as shown below:
isoform_id   isoform_name   subcelllocation_id   subcell_location
P53353-1     FSA-Acr.1      37                   Secreted

Figure 4-3: Sample Output of the SQL Query 2
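Drilling up is the reverse direction: the subcategory predicate is dropped and the result is aggregated to a coarser level. A sketch against the same tables:

-- Drill-up: summarize to one row per gene instead of listing
-- individual isoform locations.
SELECT GeneDimension.gene_name, COUNT(*) AS location_count
FROM GeneDimension INNER JOIN LocationFact
ON GeneDimension.gene_id = LocationFact.gene_id
GROUP BY GeneDimension.gene_name;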
Chapter 5
TESTING
In this chapter, we discuss some test cases implemented on the data mart and the procedure that is
followed while testing.
5.1 Test Cases
It is very important to test a project effectively for successful implementation. For testing, unit testing and black-box testing methodologies are recommended for most projects. The test data set was generated by working with end users. The test files were used to check that the data was being populated correctly and that extraction was done exactly as desired. The following table defines the test cases and how results were documented.
Test #  Test Case
1  Test the continuous internet connection to ensure a successful download of the XML file from the Universal Protein website.
2  Use HJSplit to split the 2.26 GB XML document. Check each split for loose ends; add the appropriate code to the beginning and end of each part.
3  Extract data from the first XML split file and subsequently from the others.
4  Since 2.26 GB of data is huge, I decided to extract a limited amount of data and truncated the file after getting 2,500 samples of protein sequence.
5  Check the data format of the XML file. This is important because, when extracting data, anything not in the required XML format could lead to improperly distributed data.
6  Once the data is transformed and loaded into the database, check for consistent data.
7  Check for redundancy, clean the data, and perform database operations to optimize the data to handle the queries and load generated from the schemas.
8  Create new SQL queries to insert, update and select data from the database to generate the star schema and data cube. Specify primary keys, foreign keys and relationships.
9  Run SQL queries to populate the data used in the star schema and data cube.
Sample runs for each of the above test plans are covered with the star schema examples in section 3.2; for the data cube, they are discussed in detail in section 4.3.
The star schema section has two examples of how a query is used to generate the star schema, what data is generated, and what measures are used to identify the relationships. Similarly, the data cube section has an example of how three tables in the database are used to count the number of occurrences in the database, as well as the maximum number of times an occurrence takes place with organism and gene.
Chapter 6
CONCLUSIONS
6.1 Summary
The main purpose of this project was to understand the working of data warehouses in bioinformatics, especially as related to protein sequence. In this project, we learned that research groups, analysts and lab technicians in organizations can benefit greatly from applying technologies like the star schema, the data cube and OLAP operations to the data in the data warehouse to obtain cohesive, analytical results. The test cases led us to approximations for the missing or biased aggregates of those cells that have missing or low support. The method we implemented is adaptive to sudden changes of data distribution, called discontinuities, that inevitably occur in real-life data collected for the purpose of analysis. Since most of these data are collected to support ongoing research, they are usually called operational data. The data warehouse is used to collect and organize data for analysis, which can also be referred to as informational data, and to use OLAP on it. I integrated protein sequence data with gene and source data because integration plays a vital role in the data warehouse: data gathered from a variety of sources is merged into a coherent whole. This adds stability to the data stored in the data warehouse and makes it useful for users.
This project was developed to accumulate experimental knowledge of protein function, taking advantage of the easy availability of protein sequence data. This enabled me to model protein sequence per research group requirements and to trace the evolution of protein sequence function. The protein sequence data can be used in a protein classification service: proteins can be classified using protein sequences at the family and subfamily levels. The second application is an expression data analysis service, where functional classification information can help find biological patterns in data obtained from genome-wide experiments. The third application of this project is coding SQL queries for a single-nucleotide polymorphism scoring service; in this case, information about proteins is used to assess the likelihood of a deleterious effect from substituting a taxon or lineage at a specific position. The technology used to implement the data warehouse, like the star schema and the data cube, can be very beneficial for the above applications.
The coursework in data warehousing and data mining that I took under the expert guidance of Dr. Meiliu Lu was an enlightening and enriching experience. It helped me understand the goals and techniques used in data warehousing, construct a data warehouse, and learn design techniques for relational databases such as star schema design and online analytical processing. I also learned how to create and maintain a 3-dimensional data cube using multidimensional databases. Through this coursework, I was motivated to implement a data mart on annotated protein sequence data, using design techniques like the star schema, the data cube and OLAP operations on the protein sequence data.
A data cube contains cells, each of which is associated with some summary information, or aggregate, on which decisions are to be based. However, in protein sequence databases, due to the nature of their contents, the data distribution tends to be clustered and sparse. The sparsity situation gets worse as the number of cells increases. It is necessary to acquire support for those cells whose support levels are below a certain threshold by combining them with adjacent cells; otherwise, incomplete or biased results could be derived from the lack of sufficient support.
The data often comes from OLTP systems but may also come from spreadsheets, flat files and other sources; in this case, the database came from an XML file. The data is formatted in such a way that it provides fast responses to queries. Star schemas provide fast responses by denormalizing dimension tables and potentially by providing many indexes. In the protein sequence database, 'Db Reference id' was used as an index to accelerate the fetching of data.
We implemented OLAP operations on the star schema to get more result-oriented data by implementing the data cube, also known as the OLAP cube. Once we have the query to generate data for the star schema, we can get the factual information stored in the database.
6.2 Strengths and Weaknesses
We briefly discuss the strengths and weaknesses of the star schema, OLAP operations and the data warehouse here. Most importantly, how an organization utilizes the DW and OLAP operations to effectively monitor and analyze its data is organization specific.
6.2.1 Strengths
The simplicity with which users can write queries and process the database is a very important benefit of the star schema, since queries are written with simple inner joins between the fact table and a small number of dimensions. Star joins are simpler than the joins possible in the snowflake schema, and WHERE conditions can be used to filter on the desired attributes. Aggregation is also quite fast.
Additionally, the star schema provides a direct and intuitive mapping between the business entities being analyzed by end users and the schema design, and for typical star queries it provides highly optimized performance.
Furthermore, the star schema is widely supported by a large number of business intelligence tools, which may anticipate or even require that the data warehouse schema contain dimension tables.
One of the major benefits of the star schema is that low-level transactions may be summarized to the fact table grain, which greatly speeds the queries performed as part of the decision support process. However, the aggregation or summarization of the fact table is not always done if cubes are being built [8].
This design also allows for the minimization of data entry. For each detail record, only the primary key value from the Source table is stored, along with the primary key of the Gene table, and then the subcellular location is added. This greatly reduces the amount of data entry necessary to add a record.
Not only does this approach reduce the data entry required, it greatly reduces the size of a Source record. The records take up much less space with a normalized table structure, which means the table is smaller; this helps speed inserts, updates and deletes.
In addition to keeping the table smaller, most of the fields that link to other tables are numeric. Queries generally perform much better against numeric fields than against text fields; therefore, replacing a series of text fields with a numeric field can help speed queries. Numeric fields also index faster and more efficiently.
With normalization, there are frequently fewer indexes per table. Each transaction requires the maintenance of the affected indexes, so with fewer indexes to maintain, inserts, updates and deletes run faster [9].
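As a sketch, adding such an index on a numeric join key is a single statement (the index name is hypothetical):

-- Index the numeric foreign key used in the star joins to speed
-- retrieval, at some cost to inserts, updates and deletes.
CREATE INDEX idx_fact_gene ON LocationFact (gene_id);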
6.2.2 Weaknesses
In a star schema, there is no relationship between two dimension tables. All dimensions are denormalized, and because the dimension tables are in denormalized form, this can decrease performance and increase query response time. A star schema is also hard to design: it is easier on users but very hard for developers and modelers.
There are also some disadvantages on the normalized (OLTP) side when trying to analyze queries. Queries must use joins across multiple tables to get all the data, which makes them slow to read. When normalization is implemented, developers have no choice but to query multiple tables to get the detail necessary for a report.
Having fewer indexes per table, noted above as an advantage, is sometimes a disadvantage too. Fewer indexes per table speed up inserts, updates and deletes, but with fewer indexes in the database, select queries run slower; for data retrieval, a higher number of correct indexes helps speed retrieval. Since transactions are sped up by minimizing the number of indexes, such databases trade faster transactions at the cost of slower data retrieval. Last but not least, the data in an OLTP system is not user friendly. If an analyst wants to spend more time performing analysis by looking at the data, the IT group should support the desire for fast, easy queries, so it is important to address the slow data retrieval that comes with this trade-off. We can solve this by keeping a second copy of the data in a structure reserved for analysis. This copy is heavily indexed, allowing analysts and customers to perform large queries against the data without impacting modifications on the main data [9].
6.3 Future Work
As part of future work, I would like to develop a tool or application that can extract data directly from the data warehouse to generate a star schema and data cube, providing the desired data according to user requirements. I would like to implement other OLAP operations, like pivot, dicing and slicing, in a more detailed data warehouse where they are implemented across multiple tables. It would be a great learning experience to prepare a general data warehouse that can benefit multiple organizations in bioinformatics rather than being specific to protein sequence. Time permitting, I would also like to test the star schema thoroughly to see how well it holds up against various user requirements when it is changed dynamically.
BIBLIOGRAPHY
[1] Oracle9i Data Warehousing Guide, Release 2 (9.2), Part Number A96520-01 [online]
http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/concept.htm
[2] National Library of Medicine [online]
http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html
[3] UniProt, European Bioinformatics Institute [online]
http://www.ebi.ac.uk/uniprot/index.html
[4] Howard Hamilton, Ergun Gurak, Leah Findlater, Wayne Olive, and James Ranson, "Knowledge Discovery in Databases" [online]
http://www2.cs.uregina.ca/~hamilton/courses/831/notes/dcubes/dcubes.html
[5] Passionned Tools, "ETL Tools Comparison" [online]
http://www.etltool.com/what-is-etl.htm
[6] Craig Utley, "Designing the Star Schema Database" [online]
http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx
[7] OLAP Council, "OLAP and OLAP Server Definitions", January 1995 [online]
http://altaplana.com/olap/glossary.html#SLICE
[8] Oracle9i Data Warehousing Guide, Release 2 (9.2), Part Number A96520-01 [online]
http://download.oracle.com/docs/cd/B10500_01/server.920/a96520/schemas.htm
[9] Katherine Drewek, "Data Warehousing: Similarities and Differences of Inmon and Kimball" [online]
http://www.b-eye-network.com/view/743
[10] Business Intelligence and Data Warehousing [online]
http://www.sdgcomputing.com/glossary.htm