
CHAPTER 1
INTRODUCTION
1.1 Overview
Data compression is important in making maximal use of limited information
storage and transmission capabilities. One might think that as such capabilities
increase, data compression would become less relevant. But so far this has not been
the case, since the volume of data always seems to increase more rapidly than
capabilities for storing and transmitting it. Wolfram (2002) noted that compression is likely to remain relevant in the future wherever there are physical constraints, such as transmission by electromagnetic radiation that is not spatially localized.
There are many types of data to which specialized compression is applied, such as text, images, and sound. In this research, DNA sequences, which consist only of a specific kind of text, are the subject of the experiments. Deoxyribonucleic acid (DNA) constitutes the physical medium in which all properties of living organisms are encoded. Biological databases such as EMBL, GenBank, and DDBJ were developed around the world to store nucleotide sequences (DNA, RNA) and the amino-acid sequences of proteins, and the number and size of their entries now grow exponentially fast (Grumbach and Tahi, 1994). Although not as big as some other scientific databases, their size is in the hundreds of gigabytes.
One of the earliest compression schemes, Morse code, was invented in 1838 for use in telegraphy. It compresses data by assigning shorter codewords to letters such as "e" and "t" that are more common in English. In 1949 Claude Shannon and Robert Fano developed a systematic way to assign codewords based on the probabilities of blocks (Wolfram, 2002). In the mid-1970s the idea emerged of dynamically updating the codewords of Huffman coding based on the actual data encountered (Huffman, 1952). In the late 1970s, with online storage of text files becoming common, software compression programs began to be developed, almost all based on adaptive Huffman coding. In 1977 Abraham Lempel and Jacob Ziv (1977, 1978) suggested the basic idea of pointer-based encoding. In the mid-1980s, following work by Terry Welch (1984), the so-called LZW algorithm rapidly became the method of choice for most general-purpose compression systems. It was used in programs such as PKZIP, as well as in hardware devices such as modems (Nevill, Witten and Olson, 1996).
This research focuses on enhancing currently used character-based compression to handle large-scale DNA sequences. Selected large-scale genes will be tested using the proposed scheme. The next section discusses the background of DNA sequencing, which leads to an understanding of why compression of large-scale DNA sequences must be done, and then to the motivation of the research. The statement of the problem is given in Section 1.4, the objectives of the study in Section 1.5, and the scope of the study in Section 1.6. The thesis outline is presented in Section 1.7.
1.2 Background on DNA Sequencing
Finding a single gene amid the vast stretches of DNA that make up the human genome - three billion base pairs' worth - requires a set of powerful tools. The Human Genome Project (HGP) was devoted to developing new and better tools to make gene hunts faster, cheaper, and practicable for almost any scientist to accomplish (Watson, 1990; Francis et al., 1998).
These tools include genetic maps, physical maps and DNA sequence - which
is a detailed description of the order of the chemical building blocks, or bases, in a
given stretch of DNA. Indeed, the monumental achievement of the HGP was its
successful sequencing of the entire length of human DNA, also called the human
genome (Adams et al., 1991).
Scientists need to know the sequence of bases because it tells them the kind
of genetic information that is carried in a particular segment of DNA. For example,
they can use sequence information to determine which stretches of DNA contain
genes, as well as to analyze those genes for changes in sequence, called mutations,
that may cause disease.
DNA sequencing involves the process of the Polymerase Chain Reaction, or PCR. The purpose of sequencing is to determine the order of the nucleotides of a gene. This order is the key to understanding the human genome. Frederick Sanger is credited with the invention of DNA sequencing techniques (Roberts, 1987).
Sanger's approach involved copying DNA strands in a way that would show the location of the nucleotides in the strands through the use of X-ray machines. This technique is very slow and tedious, usually taking many years to sequence only a few million letters in a string of DNA that often contains hundreds of millions or even billions of letters. Modern techniques make use of fluorescent tags instead of X-rays, which significantly reduces the time required to process a given batch of DNA.
In 1991, working with Nobel laureate Hamilton Smith, Venter's genomic research project (TIGR) created a bold new sequencing process coined 'shotgunning' (Weber and Myers, 1997).
"Using an ordinary kitchen blender, they would shatter the organism's DNA into millions of small fragments, run them through the sequencers (which can read 500 letters at a time), then reassemble them into full genomes using a high-speed computer and novel software written in-house" (Weber and Myers, 1997).
This new method not only uses super-fast automated machines but also the fluorescent detection process and the PCR DNA copying procedure. It is very fast and accurate compared to older techniques.
1.2.1 DNA Sequence Identification
DNA sequencing is a complex nucleotide-sequencing technique including
three identifiable steps:
1. Polymerase Chain Reaction (PCR)
2. Sequencing Reaction
3. Gel Electrophoresis & Computer Processing
Chromosomes (Roberts, 1987), which range from 50 million to 250 million
bases, must first be broken into much shorter pieces (PCR step). Each short piece is
used as a template to generate a set of fragments that differ in length from each other
by a single base that will be identified in a later step (template preparation and
sequencing reaction steps).
The fragments in a set are separated by gel electrophoresis (separation step).
New fluorescent dyes allow separation of all four fragments in a single lane on the
gel.
Figure 1.1: The Separation of the Molecules with Electrophoresis
The final base at the end of each fragment is identified (base-calling step).
This process recreates the original sequence of As, Ts, Cs, and Gs for each short
piece generated in the first step. Current electrophoresis limits are about 500 to 700
bases sequenced per read. Automated sequencers analyze the resulting
electropherograms and the output is a four-color chromatogram showing peaks that
represent each of the four DNA bases, as shown in Figure 1.1.
The fluorescently labeled fragments that migrate through the gel are passed
through a laser beam at the bottom of the gel. The laser excites the fluorescent molecule, which sends out light of a distinct color. That light is collected and focused by lenses into a spectrograph. Based on the wavelength, the spectrograph separates the light across a CCD (charge-coupled device) camera. Each base has its
own color, so the sequencer can detect the order of the bases in the sequenced gene
as shown in Figure 1.2.
Figure 1.2: The Scanning and Detection System on the ABI Prism 377 Sequencer
After the bases are "read", computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics. The sequencing process described here uses the ABI Prism 377 sequencer shown in Figure 1.2.
Figure 1.3: A Snapshot of the Detection of the Molecules on the Sequencer
After the sequencer completes its job, a window similar to Figure 1.3 is shown. Each dot and its color represent one of the A, C, T, and G codes. This image is then analyzed to produce a DNA sequence. Finally, the DNA data is made publicly available for use. Figure 1.4 summarizes the DNA sequencing workflow.
Figure 1.4: DNA Sequencing Work Flow Summary
1.2.2 Large-scale DNA Sequencing
The evolution of the Human Genome Project (HGP) promises that the cells of any organism can be mapped. The human genome is about three billion (3,000,000,000) base pairs long (Collins et al., 2003); if the average fragment length is 500 bases, it would take a minimum of 6 million reads (3 billion / 500) to sequence the human genome (not allowing for overlap, i.e. 1-fold coverage). Keeping track of such a high number of sequences presents significant challenges, which can only be contained by developing and coordinating several procedural and computational solutions, such as efficient database development and management.
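The read-count arithmetic above can be reproduced with a short sketch; the genome length and read length are the figures quoted in the text, while the function itself is only an illustration and not part of the proposed method.

    import math

    def minimum_reads(genome_length, read_length, coverage=1):
        """Minimum number of reads needed to cover a genome at the given fold
        coverage, ignoring overlaps and assembly losses."""
        return math.ceil(genome_length * coverage / read_length)

    # Figures from the text: ~3 billion bases, 500-base reads, 1-fold coverage.
    print(minimum_reads(3_000_000_000, 500))       # 6,000,000 reads
    # Higher redundancy (e.g. a hypothetical 8-fold coverage) scales linearly.
    print(minimum_reads(3_000_000_000, 500, 8))    # 48,000,000 reads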
Advancement of this knowledge will motivate further research towards completing other genome projects. Therefore, a huge database coupled with a good algorithm will make large-scale DNA sequencing reliable and feasible without such limitations.
1.2.3 Benefits of Genome Research
Rapid progress in genome science and a glimpse into its potential
applications have spurred observers to predict that biology will be the foremost
science of the 21st century. Technology and resources generated by the Human
Genome Project and other genomics research are already having a major impact on
research across the life sciences. The potential for commercial development of
genomics research presents U.S. industry with a wealth of opportunities, and sales of
DNA-based products and technologies in the biotechnology industry are projected to
exceed $45 billion by 2009 (Consulting Resources Corporation Newsletter, Spring
1999).
Technology and resources promoted by the HGP are starting to have
profound impacts on biomedical research and promise to revolutionize the wider
spectrum of biological research and clinical medicine. Increasingly detailed genome
maps have aided researchers seeking genes associated with dozens of genetic
conditions, including myotonic dystrophy, fragile X syndrome, neurofibromatosis
types 1 and 2, inherited colon cancer, Alzheimer's disease, and familial breast
cancer.
On the horizon is a new era of molecular medicine characterized less by
treating symptoms and more by looking to the most fundamental causes of disease.
Rapid and more specific diagnostic tests will make possible earlier treatment of
countless maladies. Medical researchers also will be able to devise novel therapeutic
regimens based on new classes of drugs, immunotherapy techniques, avoidance of
environmental conditions that may trigger disease, and possible augmentation or
even replacement of defective genes through gene therapy.
Other benefits include:

• Decoding of microbes
• Finding out about our potential weaknesses and problems
• Finding out about evolution and our links with other forms of life
• Helping to solve crimes
• Agricultural benefits

1.3 Motivation of the Research
The rapid advancement of next-generation DNA sequencers has been made possible by vast improvements in computer technology, specifically in speed and size. These new systems produce enormous amounts of data - one run can generate close to one terabyte of data - and bioinformatics and data management tools have had to play catch-up to enable the analysis and storage of these data.
Data management and storage will always be an issue for the life science and
medical research industries, and is something that vendors will constantly have to
improve to appease the research world. Luckily, there is hope for software vendors.
Researchers will only begin to warm to the idea that next-generation technologies
produce better data, and will provide time- and cost-savings, if there are adequate
software applications to analyze the data.
However, no matter how much researchers spend on storage devices, the transmission problem remains. Even transferring a 30-gigabyte file between computers can take several hours; how long would terabytes of data take?
Therefore, several compression techniques specific to DNA have been invented recently. Most of them build on the LZ77 idea, because its dictionary mechanism makes repetitive sequence data easy to compress. Many of these algorithms focus on shortening the process, improving the compression ratio, and speeding up compression. From Biocompress to the newest algorithm, Graph Compression, this body of research is concerned with the compression of sequence data. Logically, if a sequence of nucleotides (GTACCTATG…) is compressed using any such technique, its size will be reduced. For example, using the Biocompress algorithm on the CHNTXX sequence gives a compression rate of 16.26% (Susan, 1998). For more details about existing DNA sequence compression methods, please refer to Chapter 2.
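For reference, compression results of this kind can be reported either as a space saving relative to the original file or as bits per base (a plain 2-bit code for A, C, G, T gives exactly 2.0 bits per base). The Python sketch below computes both metrics from placeholder sizes; the numbers are not results from the cited work, and the exact definition used by Biocompress may differ.

    def space_saving(original_bytes, compressed_bytes):
        """Fraction of the original size that was removed, as a percentage."""
        return 100.0 * (1 - compressed_bytes / original_bytes)

    def bits_per_base(compressed_bytes, sequence_length):
        """Average number of output bits spent per nucleotide."""
        return 8.0 * compressed_bytes / sequence_length

    # Placeholder example: a 1,000,000-base sequence stored as plain text
    # (1 byte per base) and a hypothetical compressed size of 210,000 bytes.
    n_bases = 1_000_000
    compressed = 210_000
    print(f"space saving : {space_saving(n_bases, compressed):.2f} %")
    print(f"bits per base: {bits_per_base(compressed, n_bases):.3f}")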
Based on the issues raised in the previous section, terabytes of DNA sequence data such as those produced by the Human Genome Project are not well served by non-specific DNA compression. Mostly, only small sequences have been tested, and although existing techniques have been shown to solve those cases, their experiments show that the compression ratio becomes worse as the data grows larger.
1.4 Statement of the Problem
i. The size of DNA sequence databases, and of the sequences themselves, rises drastically in line with the advancement of sequencing technology. A storage problem will occur soon.

ii. Advancement of LZ77 (LZSS) has always focused on general data, and for DNA sequences many researchers keep testing and experimenting on popular, small sequences instead of applying it to large-scale sequences.

iii. Huge data sets are not suitable for mobile device usage and data transfer (Kwong and Ho, 2001). A good compression scheme for large-scale data must be implemented to support mobile technology.

iv. The transfer rate between research centres (e.g. the National Center for Biotechnology Information in the United States and the Institute of Medical Research in Malaysia) must be improved to cater for knowledge transfer; large-scale data mostly takes a long time to transfer.
1.5 Objectives of the Study
i. To find the best solution for large-scale DNA sequence compression. This research will be the first to focus on large-scale DNA sequences.

ii. To enhance LZ77 (LZSS) from a universal data compression scheme into one suited to the large-scale DNA sequence problem. This is feasible because the characteristics of LZ77 match the repetitive nature of DNA sequences.

iii. To study a hash table approach, which has been applied to many types of data (e.g. sequence data and images such as JPEG), and to implement it in the LZ77 environment; a minimal sketch of this idea is given after this list. This approach keeps the DNA sequence in computer memory while it is being compressed or decompressed.

iv. To optimize the hash table, using a suitable method, to suit large-scale DNA sequence data. A hash table cannot achieve optimum performance if the data environment is not suitable for the hashing scheme.
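As a rough illustration of objective iii, the sketch below indexes the k-mers of a sequence in a Python dictionary (a hash table) so that LZ77-style match candidates can be looked up without scanning the whole window; the k-mer length, match limit, and sample string are arbitrary assumptions, not the design adopted later in this thesis.

    def hash_matches(data, k=4):
        """Index every k-base substring (k-mer) by its positions, so LZ77-style
        match candidates can be looked up instead of scanning the window."""
        table = {}
        for pos in range(len(data) - k + 1):
            table.setdefault(data[pos:pos + k], []).append(pos)
        return table

    def longest_match(data, i, table, k=4, max_match=32):
        """Longest match for position i among earlier positions sharing its k-mer."""
        best_off, best_len = 0, 0
        for j in table.get(data[i:i + k], []):
            if j >= i:
                break  # positions are stored in increasing order
            length = 0
            while (length < max_match and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_off, best_len = i - j, length
        return best_off, best_len

    if __name__ == "__main__":
        seq = "ACGTACGTTTACGTACGA"           # made-up nucleotide string
        table = hash_matches(seq)
        print(longest_match(seq, 10, table))  # (10, 7): copy 7 bases from 10 back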
1.6 Scope of the Study
Compression of DNA sequences is a large area within bioinformatics. Several weighted factors have been identified for compressing the sequences. Some approaches use the uniqueness of DNA sequences or the palindromes within a sequence; this research will focus only on similarities among the characters. However, the latest compression schemes based on dynamic programming do not use any of these factors.
There are two types of sequence representation stored in the NCBI databases: the FASTA format and the binary format. Bioinformatics applications sometimes need one or both formats. Unfortunately, all DNA-specific compression work focuses only on FASTA, because it makes it easy for researchers to identify which DNA belongs to which organism, whereas binary does not. On the other hand, binary is very good for data transmission: it minimizes size and therefore eases transfer. This research will use and analyze DNA sequence data in both FASTA and binary (text sequence) formats.
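To make the difference between the two representations concrete, the following sketch reads the first record of a FASTA file and packs its bases into a 2-bit-per-base binary form; the file name, the handling of non-ACGT symbols, and the packing layout are assumptions made only for illustration.

    def read_fasta(path):
        """Return (header, sequence) for the first record of a FASTA file."""
        header, parts = None, []
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line.startswith(">"):
                    if header is not None:
                        break  # only the first record is needed here
                    header = line[1:]
                elif line:
                    parts.append(line.upper())
        return header, "".join(parts)

    CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

    def pack_2bit(sequence):
        """Pack A/C/G/T into 2 bits per base; other symbols (e.g. N) are simply
        skipped here, although a real tool would have to encode them separately."""
        out = bytearray()
        buf, nbits = 0, 0
        for base in sequence:
            if base not in CODE:
                continue
            buf = (buf << 2) | CODE[base]
            nbits += 2
            if nbits == 8:
                out.append(buf)
                buf, nbits = 0, 0
        if nbits:
            out.append(buf << (8 - nbits))  # left-align the final partial byte
        return bytes(out)

    if __name__ == "__main__":
        header, seq = read_fasta("example.fa")  # hypothetical input file
        packed = pack_2bit(seq)
        print(header, len(seq), "bases ->", len(packed), "bytes")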
Today, many DNA databases (servers) are maintained at research centres. Some of them focus only on certain groups of organisms: for example, the bioinformatics database in Japan focuses on bacteria, while databases in Malaysia tend to store crop DNA such as Jatropha and oil palm. Only one universal server supports all databases around the world: the National Center for Biotechnology Information (NCBI) in the United States. Bioinformaticians trust this server because of its capability to store many types of organisms, including human. Following this trend, this research uses NCBI as its primary data source.
The special characteristic highlighted in this research is the similarity among sequences. In computer science, several compression algorithms have been introduced, and the algorithm that best suits the needs of this research is LZ77. It uses a sliding window and builds a dictionary of previously seen text against which future characters are compared. Using the sliding-window technique, compression is performed without any loss: the original data can be recovered exactly from the more compact representation.
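A minimal sketch of this sliding-window matching, in the spirit of LZ77/LZSS, is given below; the window size, minimum and maximum match lengths, and the sample string are illustrative values only and do not reflect the parameters chosen later in this thesis.

    def lzss_tokens(data, window=4096, min_match=3, max_match=18):
        """Greedy LZSS-style factorisation: emit either a literal character or a
        (offset, length) pointer into the sliding window of already-seen text."""
        i, tokens = 0, []
        while i < len(data):
            start = max(0, i - window)
            best_len, best_off = 0, 0
            # Search the window for the longest match with the upcoming text.
            for j in range(start, i):
                length = 0
                while (length < max_match and i + length < len(data)
                       and data[j + length] == data[i + length]):
                    length += 1
                if length > best_len:
                    best_len, best_off = length, i - j
            if best_len >= min_match:
                tokens.append(("ptr", best_off, best_len))
                i += best_len
            else:
                tokens.append(("lit", data[i]))
                i += 1
        return tokens

    if __name__ == "__main__":
        seq = "GTACCTATGGTACCTATGGTACC"   # made-up nucleotide string
        for tok in lzss_tokens(seq):
            print(tok)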
1.7 Thesis Outline
This section gives a general description of the contents of subsequent
chapters in this thesis. Chapter 2 gives a review of the various techniques to solve
the DNA sequence compression problem. Chapter 3 describes the methodology adopted to achieve the objectives of this research. Chapter 4 discusses the construction of the algorithm, focusing on the enhancement of the hash table to suit large DNA sequences. Chapter 5 presents various experiments using several types of data and environments. Chapter 6 summarizes the findings of the research and future work.