CHAPTER 1

INTRODUCTION

1.1 Overview

Data compression is important in making maximal use of limited information storage and transmission capabilities. One might think that as such capabilities increase, data compression would become less relevant. So far this has not been the case, since the volume of data always seems to grow more rapidly than the capacity for storing and transmitting it. Wolfram (2002) noted that compression is likely to remain relevant in the future whenever physical constraints apply, such as transmission by electromagnetic radiation that is not spatially localized.

There are many specialized types of compression, for example for text, images and sound. In this research, DNA sequences are the subject of the experiments; they consist of a specific kind of text only. Deoxyribonucleic acid (DNA) constitutes the physical medium in which all properties of living organisms are encoded. Biological databases such as EMBL, GenBank and DDBJ were developed around the world to store nucleotide sequences (DNA, RNA) and the amino-acid sequences of proteins, and the number and size of their entries are now growing exponentially (Grumbach and Tahi, 1994). Although not as large as some other scientific databases, their size is in the hundreds of gigabytes.

The first data compression scheme, the Morse code, was invented in 1838 for use in telegraphy. It compresses data by assigning shorter codewords to letters such as "e" and "t" that are more common in English. In 1949 Claude Shannon and Robert Fano developed a systematic way to assign codewords based on the probabilities of blocks (Wolfram, 2002). In the mid-1970s the idea emerged of dynamically updating the codewords of Huffman coding based on the actual data encountered (Huffman, 1952), and in the late 1970s, with online storage of text files becoming common, software compression programs began to be developed, almost all based on adaptive Huffman coding. In 1977 Abraham Lempel and Jacob Ziv (1977, 1978) suggested the basic idea of pointer-based encoding. In the mid-1980s, following work by Terry Welch (1984), the so-called LZW algorithm rapidly became the method of choice for most general-purpose compression systems. It was used in programs such as PKZIP, as well as in hardware devices such as modems (Nevill, Witten and Olson, 1996).

This research focuses on enhancing an existing character-based compression scheme to handle large-scale DNA sequences. Selected large-scale genes will be tested using the proposed scheme. The next section discusses the background of DNA sequencing, which leads to an understanding of why compression of large-scale DNA sequences is needed; the motivation of the research and the statement of the problem follow in Sections 1.3 and 1.4. The objectives of the research are presented in Section 1.5, the scope of the research in Section 1.6, and the thesis outline in Section 1.7.

1.2 Background on DNA Sequencing

Finding a single gene amid the vast stretches of DNA that make up the human genome - three billion base pairs' worth - requires a set of powerful tools. The Human Genome Project (HGP) was devoted to developing new and better tools to make gene hunts faster, cheaper and practicable for almost any scientist to accomplish (Watson, 1990; Francis et al., 1998). These tools include genetic maps, physical maps and DNA sequence - a detailed description of the order of the chemical building blocks, or bases, in a given stretch of DNA.
Indeed, the monumental achievement of the HGP was its successful sequencing of the entire length of human DNA, also called the human genome (Adams et al., 1991). Scientists need to know the sequence of bases because it tells them the kind of genetic information that is carried in a particular segment of DNA. For example, they can use sequence information to determine which stretches of DNA contain genes, as well as to analyze those genes for changes in sequence, called mutations, that may cause disease.

DNA sequencing involves a process called the Polymerase Chain Reaction (PCR). The purpose of sequencing is to determine the order of the nucleotides of a gene; this order is the key to understanding the human genome. Frederick Sanger is credited with the invention of DNA sequencing techniques (Roberts, 1987). Sanger's approach involved copying DNA strands in a way that would show the location of the nucleotides in the strands through the use of X-ray imaging. This technique is very slow and tedious, usually taking many years to sequence only a few million letters in a string of DNA that often contains hundreds of millions or even billions of letters. Modern techniques make use of fluorescent tags instead of X-rays, which significantly reduces the time required to process a given batch of DNA. In 1991, working with Nobel laureate Hamilton Smith, Venter's genomic research institute (TIGR) created a bold new sequencing process coined "shotgunning" (Weber and Myers, 1997): using an ordinary kitchen blender, they would shatter the organism's DNA into millions of small fragments, run them through the sequencers (which can read 500 letters at a time), and then reassemble them into full genomes using a high-speed computer and novel software written in-house. This new method not only uses very fast automated machines, but also the fluorescent detection process and the PCR DNA copying procedure. It is very fast and accurate compared to the older techniques.

1.2.1 DNA Sequence Identification

DNA sequencing is a complex nucleotide-sequencing technique comprising three identifiable steps:

1. Polymerase Chain Reaction (PCR)
2. Sequencing Reaction
3. Gel Electrophoresis and Computer Processing

Chromosomes (Roberts, 1987), which range from 50 million to 250 million bases, must first be broken into much shorter pieces (PCR step). Each short piece is used as a template to generate a set of fragments that differ in length from each other by a single base that will be identified in a later step (template preparation and sequencing reaction steps). The fragments in a set are separated by gel electrophoresis (separation step). New fluorescent dyes allow separation of all four fragments in a single lane on the gel.

Figure 1.1: The Separation of the Molecules with Electrophoresis

The final base at the end of each fragment is identified (base-calling step). This process recreates the original sequence of As, Ts, Cs, and Gs for each short piece generated in the first step. Current electrophoresis limits are about 500 to 700 bases sequenced per read. Automated sequencers analyze the resulting electropherograms, and the output is a four-color chromatogram showing peaks that represent each of the four DNA bases, as shown in Figure 1.1. The fluorescently labeled fragments that migrate through the gel are passed through a laser beam at the bottom of the gel. The laser excites the fluorescent molecule, which sends out light of a distinct color.
That light is collected and focused by lenses into a spectrograph. Based on the wavelength, the spectrograph separates the light across a CCD (charge-coupled device) camera. Each base has its own color, so the sequencer can detect the order of the bases in the sequenced gene, as shown in Figure 1.2.

Figure 1.2: The Scanning and Detection System on the ABI Prism 377 Sequencer

After the bases are "read", computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics. The process described here uses the ABI Prism 377 sequencer shown in Figure 1.2.

Figure 1.3: A Snapshot of the Detection of the Molecules on the Sequencer

After the sequencer has finished its job, a window similar to Figure 1.3 is shown. Each dot and its color represent one of the A, C, T and G codes. This image is then analyzed to produce a DNA sequence. Finally, the DNA data are made publicly available for further use. Figure 1.4 summarizes the DNA sequencing steps.

Figure 1.4: DNA Sequencing Work Flow Summary

1.2.2 Large-scale DNA Sequencing

The evolution of the Human Genome Project (HGP) promises that the genome of virtually any organism can be mapped. The human genome is about three billion (3,000,000,000) base pairs long (Collins et al., 2003); if the average fragment length is 500 bases, a minimum of 6 million reads (3 billion / 500) would be required to sequence the human genome (not allowing for overlap, i.e. 1-fold coverage). Keeping track of such a large number of sequences presents significant challenges that can only be kept under control by developing and coordinating procedural and computational methods, such as efficient database development and management. Advances in this area will motivate further research towards completing other genome projects. A large database supported by good algorithms is therefore needed to make large-scale DNA sequencing reliable and practical.
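To make the arithmetic above concrete, the short Python sketch below computes the number of reads and the raw storage volume involved. It is illustrative only: the genome length and the 500-base read length are the figures quoted above, while the higher coverage values and the assumption of one byte of text per base are examples introduced here.

```python
# Illustrative back-of-the-envelope calculation using the figures quoted above:
# a 3-billion-base genome and an average read length of 500 bases.
GENOME_LENGTH = 3_000_000_000   # bases (human genome, approx.)
READ_LENGTH = 500               # bases per read (electrophoresis read length)

def reads_required(coverage: float) -> int:
    """Minimum number of reads needed to cover the genome `coverage` times."""
    return int(GENOME_LENGTH * coverage / READ_LENGTH)

def raw_text_size_gb(coverage: float) -> float:
    """Approximate size of the raw reads stored as one byte of text per base."""
    return GENOME_LENGTH * coverage / 1e9

for c in (1, 5, 10):  # 1-fold as in the text; the higher coverages are examples
    print(f"{c}-fold coverage: {reads_required(c):,} reads, "
          f"~{raw_text_size_gb(c):.0f} GB of raw sequence text")
```

At the 1-fold coverage used in the text this already amounts to 6 million reads and roughly 3 GB of characters; real projects sequence the genome several times over, so the raw text quickly reaches the tens of gigabytes that make storage, transfer and compression pressing concerns.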
1.2.3 Benefits of Genome Research

Rapid progress in genome science and a glimpse into its potential applications have spurred observers to predict that biology will be the foremost science of the 21st century. Technology and resources generated by the Human Genome Project and other genomics research are already having a major impact on research across the life sciences. The potential for commercial development of genomics research presents U.S. industry with a wealth of opportunities, and sales of DNA-based products and technologies in the biotechnology industry are projected to exceed $45 billion by 2009 (Consulting Resources Corporation Newsletter, Spring 1999).

Technology and resources promoted by the HGP are starting to have profound impacts on biomedical research and promise to revolutionize the wider spectrum of biological research and clinical medicine. Increasingly detailed genome maps have aided researchers seeking genes associated with dozens of genetic conditions, including myotonic dystrophy, fragile X syndrome, neurofibromatosis types 1 and 2, inherited colon cancer, Alzheimer's disease, and familial breast cancer. On the horizon is a new era of molecular medicine characterized less by treating symptoms and more by looking to the most fundamental causes of disease. Rapid and more specific diagnostic tests will make possible earlier treatment of countless maladies. Medical researchers will also be able to devise novel therapeutic regimens based on new classes of drugs, immunotherapy techniques, avoidance of environmental conditions that may trigger disease, and possible augmentation or even replacement of defective genes through gene therapy. Other benefits include:

- decoding of microbes
- finding out about our potential weaknesses and problems
- finding out about evolution and our links with other life
- helping to solve crimes
- agricultural benefits

1.3 Motivation of the Research

The rapid advancement of next-generation DNA sequencers has been possible due to vast improvements in computer technology, specifically in speed and size. These new systems produce enormous amounts of data - one run can generate close to one terabyte - and bioinformatics and data management tools have to play catch-up to enable the analysis and storage of these data. Data management and storage will always be an issue for the life science and medical research industries, and are something that vendors will constantly have to improve to satisfy the research world. Fortunately, there is hope for software vendors: researchers will only warm to the idea that next-generation technologies produce better data, and provide time and cost savings, if there are adequate software applications to analyze the data.

However much researchers spend on storage devices, the transmission problem remains. Even transferring data between computers can take several hours for a 30-gigabyte file; transferring terabytes of data is far worse. Therefore, compression techniques specific to DNA have been proposed in recent years. Most of them use the LZ77 idea, because its dictionary mechanism makes sequential data easy to compress. Many of these algorithms focus on shortening and speeding up the process and on improving the compression ratio. From Biocompress to the newest algorithm, Graph Compression, this line of research is concerned with the compression of sequence data. Logically, if a nucleotide sequence (GTACCTATG…) is compressed using any such technique, its size is reduced; for example, using the Biocompress algorithm on the CHNTXX sequence gives a compression rate of 16.26% (Susan, 1998). For more details on existing DNA sequence compression techniques, please refer to Chapter 2.

Based on the issues raised above concerning the Human Genome Project, terabytes of DNA sequence data are not well served by non-specific DNA compression. Mostly, small sequences have been tested, and the existing algorithms have been shown to handle them; in those experiments, however, the compression ratio becomes worse as the data grow larger.
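To make the dictionary idea behind LZ77/LZSS concrete, the following minimal Python sketch shows a generic, textbook-style encoder applied to a toy DNA string. The window size, minimum match length, token format and example sequence are arbitrary choices of this illustration; it is not the algorithm developed later in this thesis.

```python
# A minimal, illustrative LZSS-style encoder (generic textbook sketch, not the
# scheme developed in this thesis). It scans the input with a sliding window
# and replaces repeated substrings with (offset, length) references back into
# the already-seen text; unmatched symbols are emitted as literals.

WINDOW = 1024      # how far back the encoder may look (illustrative value)
MIN_MATCH = 3      # shorter matches are cheaper to keep as literals

def lzss_encode(text: str):
    i, tokens = 0, []
    while i < len(text):
        start = max(0, i - WINDOW)
        best_len, best_off = 0, 0
        # Search the window for the longest match starting at position i.
        for j in range(start, i):
            length = 0
            while (i + length < len(text)
                   and j + length < i
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_off = length, i - j
        if best_len >= MIN_MATCH:
            tokens.append(("match", best_off, best_len))   # back-reference
            i += best_len
        else:
            tokens.append(("literal", text[i]))            # single symbol
            i += 1
    return tokens

if __name__ == "__main__":
    seq = "GTACCTATGGTACCTATGGTACCTATG"   # toy DNA string with obvious repeats
    for token in lzss_encode(seq):
        print(token)
```

On this toy string the second and third copies of GTACCTATG are each replaced by a single (offset, length) back-reference, which is exactly the kind of redundancy a dictionary coder can exploit in repeat-rich DNA.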
1.4 Statement of the Problem

i. The size of DNA sequence databases, and of the sequences themselves, is growing drastically in line with advances in sequencing technology; a storage problem will soon arise.
ii. Work on LZ77 (LZSS) has always focused on general-purpose data, and for DNA sequences many researchers keep testing and experimenting on popular, small sequences instead of applying it to large-scale sequences.
iii. Huge data sets are not suitable for mobile device usage and data transfer (Kwong and Ho, 2001). Good compression for large-scale data must be implemented to support mobile technology.
iv. The transfer rate between research centers (e.g. the National Center for Biotechnology Information in the United States and the Institute of Medical Research in Malaysia) must be improved to support knowledge transfer; large-scale data usually take a long time to transfer.

1.5 Objectives of the Study

i. To find the best solution for large-scale DNA sequence compression. This research will be the first to focus on large-scale DNA sequences.
ii. To enhance LZ77 (LZSS) from a universal data compression scheme into one suited to the large-scale DNA sequence problem. This is a reliable choice because the characteristics of LZ77 are well matched to DNA sequences.
iii. To study a hash table approach, which has been applied successfully to many types of data (e.g. sequence data and images such as JPEG), and to implement it in the LZ77 environment. This approach keeps the DNA sequence in computer memory while it is being compressed or decompressed.
iv. To optimize the hash table with a suitable method so that it suits large-scale DNA sequence data. A hash table cannot achieve optimum performance if the data environment does not suit the hashing scheme.

1.6 Scope of the Study

Compression of DNA sequences is a large area within bioinformatics. Several weighting factors have been identified for compressing the sequences: some approaches exploit the uniqueness of a DNA sequence, others the palindromes within a sequence; this research focuses only on the similarities among the characters. The latest compression schemes based on dynamic programming, however, do not use any of these factors.

There are two types of sequence stored in the NCBI databases: the FASTA format and the binary format. Bioinformatics applications sometimes need one or both formats. Unfortunately, DNA-specific compression has so far focused only on FASTA, which makes it easier for researchers to identify which DNA belongs to which organism than binary does. Binary, on the other hand, is very good for data transmission, since it minimizes size and eases transfer. This research will use and analyze DNA sequence data in both FASTA and binary (text sequence) formats.

Today, many DNA databases (servers) are hosted at research centers, and some focus only on certain groups of organisms. For example, a bioinformatics database in Japan focuses on bacteria, while centers in Malaysia tend to store the DNA of crops such as Jatropha and oil palm. Only one universal server supports databases from around the world: the National Center for Biotechnology Information (NCBI) in the United States. Bioinformaticians trust this server because of its capability of storing many types of organism, including human. This research follows that trend and uses NCBI as its primary data source.

The special characteristic highlighted in this research is the similarity among sequences. In computer science, several compression algorithms have been introduced, and the algorithm that best suits the needs of this research is LZ77. It uses a sliding window and builds a dictionary against which the upcoming characters are compared. Using the sliding-window technique, the compression is lossless: the original data can still be recovered from the compact representation. A small illustrative sketch of how such matches can be located with a hash table is given at the end of this chapter.

1.7 Thesis Outline

This section gives a general description of the contents of the subsequent chapters of this thesis. Chapter 2 reviews the various techniques proposed to solve the DNA sequence compression problem. Chapter 3 describes the methodology adopted to achieve the objectives of this research. Chapter 4 discusses the construction of the algorithm, focusing on the enhancement of the hash table to suit large DNA sequences. Chapter 5 presents various experiments using several types of data and environments. Chapter 6 summarizes the findings of the research and outlines future work.
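As a closing illustration of the hash table idea referred to in objective iii and in the scope above, the following generic Python sketch shows how a hash table keyed on short substrings can replace a linear scan of the LZ77 sliding window when looking for matches. It is an illustrative toy, not the algorithm developed in the later chapters; the substring length k = 4, the function names and the example string are arbitrary choices of this example.

```python
# Illustrative sketch (generic, not the method developed in this thesis) of how a
# hash table can speed up LZ77/LZSS match finding: the positions of every
# k-length substring seen so far are stored in a dictionary keyed by that
# substring, so candidate match positions are found without scanning the window.

from collections import defaultdict

K = 4  # length of the hashed substring (an arbitrary illustrative choice)

def find_longest_match(text: str, i: int, index) -> tuple:
    """Return (offset, length) of the longest match for text[i:] among the
    previously indexed positions, or (0, 0) if none is found."""
    best_off, best_len = 0, 0
    for j in index.get(text[i:i + K], ()):   # only positions sharing this k-mer
        length = 0
        while (i + length < len(text)
               and j + length < i
               and text[j + length] == text[i + length]):
            length += 1
        if length > best_len:
            best_off, best_len = i - j, length
    return best_off, best_len

def index_text(text: str):
    """Single left-to-right pass: report repeats while indexing each k-mer."""
    index = defaultdict(list)
    i = 0
    while i + K <= len(text):
        off, length = find_longest_match(text, i, index)
        if length >= K:
            print(f"position {i}: repeat of length {length} at offset {off}")
        index[text[i:i + K]].append(i)       # remember this k-mer's position
        i += 1

index_text("GTACCTATGGTACCTATGACGT")         # toy DNA string with a repeat
```

Because only positions that share the same k-length prefix are examined, the cost of finding a match no longer grows with the size of the window, which is what makes a hash-backed dictionary attractive for very long sequences.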