Bioinformatics Course Handouts: for the third-year B.Sc. (Biotechnology) class, North Maharashtra University, Jalgaon.
Dhananjay Bhole (M.Sc. Bioinformatics, M.B.A. HR), Coordinator, Centre for Disability Studies, Department of Education and Extension, University of Pune. Email: dhananjay.bhole@gmail.com Cell: 9850123212

Preface: With the needs of the students in mind, these notes are drawn up according to the syllabus of T.Y. B.Sc. Biotechnology, North Maharashtra University, Jalgaon. Several websites, standard databases, and books were consulted in preparing the notes; in several places, pages from websites are copied directly. These introductory notes consist of six chapters. The first chapter gives an overview of bioinformatics. The second chapter provides basic information about computers. The third chapter covers basic database concepts. The fourth chapter elaborates on the types of biological databases. The fifth chapter gives brief information about comparative genomics and proteomics. The last chapter provides information about the Human Genome Project. At the end of the notes, a bioinformatics glossary is given. Teachers may distribute the notes freely among the students in the colleges where my lectures are conducted. Dhananjay Bhole

Contents
1. Bioinformatics: definition, history, scope, and importance
2. Computer fundamentals and applications in biology
3. Basic database concepts
4. Biological databases
5. Genomics and proteomics
6. Human Genome Project

1. Bioinformatics: definition, history, scope, and importance
1.0 Introduction
Quantitative tools are indispensable in modern biology. Most biological research involves the application of some type of mathematical, statistical, or computational tool to help synthesize recorded data and integrate various types of information in the process of answering a particular biological question. For example, enumeration and statistics are required for assessing everyday laboratory experiments, such as making serial dilutions of a solution or counting bacterial colonies, phage plaques, or trees and animals in the natural environment. A classic example from the history of genetics is the work of Gregor Mendel and Thomas Morgan who, by simply counting genetic variations of plants and fruit flies, were able to discover the principles of genetic inheritance. More dedicated use of quantitative tools may involve using calculus to predict the growth rate of a human population or to establish a kinetic model for enzyme catalysis. Among the most sophisticated uses of quantitative tools are the application of game theory to model animal behavior and evolution, and the use of millions of nonlinear partial differential equations to model cardiac blood flow. Whether the application is simple or complex, subtle or explicit, it is clear that mathematical and computational tools have become an integral part of modern-day biological research. However, none of these examples of quantitative tool use in biology could be considered part of bioinformatics, which is also quantitative in nature. To help the reader understand the difference between bioinformatics and other elements of quantitative biology, a detailed explanation of what bioinformatics is follows in the sections below.

1.1 Definition
WHAT IS BIOINFORMATICS? Bioinformatics is an interdisciplinary research area at the interface between computer science and biological science. A variety of definitions exist in the literature and on the World Wide Web; some are more inclusive than others.
Luscombe et al. define bioinformatics as a union of biology and informatics: bioinformatics involves the technology that uses computers for storage, retrieval, manipulation, and distribution of information related to biological macromolecules such as DNA, RNA, and proteins. The emphasis here is on the use of computers, because most of the tasks in genomic data analysis are highly repetitive or mathematically complex. The use of computers is absolutely indispensable in mining genomes for information gathering and knowledge building. Bioinformatics differs from a related field known as computational biology. Bioinformatics is limited to sequence, structural, and functional analysis of genes and genomes and their corresponding products, and is often considered computational molecular biology. Computational biology, by contrast, encompasses all biological areas that involve computation. For example, mathematical modeling of ecosystems, population dynamics, application of game theory in behavioral studies, and phylogenetic construction using fossil records all employ computational tools, but do not necessarily involve biological macromolecules. Besides this distinction, it is worth noting that there are other views of how the two terms relate. For example, one version defines bioinformatics as the development and application of computational tools for managing all kinds of biological data, whereas computational biology is more confined to the theoretical development of the algorithms used for bioinformatics. The present confusion over definitions may partly reflect the nature of this vibrant and quickly evolving new field.

Explanation: Bioinformatics and computational biology are rooted in life sciences as well as computer and information sciences and technologies. Both of these interdisciplinary approaches draw from specific disciplines such as mathematics, physics, computer science and engineering, biology, and behavioral science. Bioinformatics and computational biology each maintain close interactions with the life sciences to realize their full potential. Bioinformatics applies principles of information sciences and technologies to make the vast, diverse, and complex life sciences data more understandable and useful. Computational biology uses mathematical and computational approaches to address theoretical and experimental questions in biology. Although bioinformatics and computational biology are distinct, there is also significant overlap and activity at their interface.

Definition: The NIH Biomedical Information Science and Technology Initiative Consortium agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.
Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral, or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.
Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling, and computational simulation techniques to the study of biological, behavioral, and social systems.

1.2 History
As defined above, bioinformatics is the discipline of quantitative analysis of information relating to biological macromolecules with the aid of computers.
The development of bioinformatics as a field is the result of advances in both molecular biology and computer science over the past 30-40 years. Although these developments are not described in detail here, understanding the history of this discipline is helpful in obtaining a broader insight into current bioinformatics research. A succinct chronological summary of the landmark events that have had major impacts on the development of bioinformatics is presented here to provide context. The earliest bioinformatics efforts can be traced back to the 1960s, although the word bioinformatics did not exist then. Probably the first major bioinformatics project was undertaken by Margaret Dayhoff in 1965, who developed the first protein sequence database, called the Atlas of Protein Sequence and Structure. Subsequently, in the early 1970s, the Brookhaven National Laboratory established the Protein Data Bank for archiving three-dimensional protein structures. At its onset, the database stored fewer than a dozen protein structures, compared to more than 30,000 structures today. The first sequence alignment algorithm was developed by Needleman and Wunsch in 1970. This was a fundamental step in the development of the field, which paved the way for the routine sequence comparisons and database searching practiced by modern biologists. The first protein structure prediction algorithm was developed by Chou and Fasman in 1974. Though rather rudimentary by today's standards, it pioneered a series of developments in protein structure prediction. The 1980s saw the establishment of GenBank and the development of fast database searching algorithms such as FASTA by William Pearson and BLAST by Stephen Altschul and coworkers. The start of the Human Genome Project in the late 1980s provided a major boost for the development of bioinformatics. The development and increasingly widespread use of the Internet in the 1990s made instant access to, and exchange and dissemination of, biological data possible. These are only the major milestones in the establishment of this new field. The fundamental reason that bioinformatics gained prominence as a discipline was the advancement of genome studies that produced unprecedented amounts of biological data. The explosion of genomic sequence information generated a sudden demand for efficient computational tools to manage and analyze the data. The development of these computational tools depended on knowledge generated from a wide range of disciplines including mathematics, statistics, computer science, information technology, and molecular biology. The merger of these disciplines created an information-oriented field in biology, which is now known as bioinformatics.

1.3 Scope
Bioinformatics consists of two subfields: the development of computational tools and databases, and the application of these tools and databases in generating biological knowledge to better understand living systems. These two subfields are complementary. Tool development includes writing software for sequence, structural, and functional analysis, as well as the construction and curation of biological databases. These tools are used in three areas of genomic and molecular biological research: molecular sequence analysis, molecular structural analysis, and molecular functional analysis. The analyses of biological data often generate new problems and challenges that in turn spur the development of new and better computational tools.
The areas of sequence analysis include sequence alignment, sequence database searching, motif and pattern discovery, gene and promoter finding, reconstruction of evolutionary relationships, and genome assembly and comparison. Structural analyses include protein and nucleic acid structure analysis, comparison, classification, and prediction. Functional analyses include gene expression profiling, protein-protein interaction prediction, protein subcellular localization prediction, metabolic pathway reconstruction, and simulation (Fig. 1.1).
Figure 1.1: Overview of the various subfields of bioinformatics. Biocomputing tool development is at the foundation of all bioinformatics analysis. Applications of the tools fall into three areas: sequence analysis, structure analysis, and function analysis. There are intrinsic connections between the different areas of analysis, represented by bars between the boxes.
The three aspects of bioinformatics analysis are not isolated but often interact to produce integrated results (see Fig. 1.1). For example, protein structure prediction depends on sequence alignment data; clustering of gene expression profiles requires the use of phylogenetic tree construction methods derived from sequence analysis. Sequence-based promoter prediction is related to functional analysis of coexpressed genes. Gene annotation involves a number of activities, including distinguishing between coding and noncoding sequences, identifying translated protein sequences, and determining a gene's evolutionary relationships with other known genes; predicting its cellular functions employs tools from all three groups of analyses.

1.4 Goals
The ultimate goal of bioinformatics is to better understand a living cell and how it functions at the molecular level. By analyzing raw molecular sequence and structural data, bioinformatics research can generate new insights and provide a "global" perspective of the cell. The reason that the functions of a cell can be better understood by analyzing sequence data is ultimately that the flow of genetic information is dictated by the "central dogma" of biology, in which DNA is transcribed to RNA, which is translated to proteins. Cellular functions are mainly performed by proteins, whose capabilities are ultimately determined by their sequences. Therefore, solving functional problems using sequence and sometimes structural approaches has proved to be a fruitful endeavor.

1.5 Applications
Bioinformatics has not only become essential for basic genomic and molecular biology research, but is having a major impact on many areas of biotechnology and biomedical sciences. It has applications, for example, in knowledge-based drug design, forensic DNA analysis, and agricultural biotechnology. Computational studies of protein-ligand interactions provide a rational basis for the rapid identification of novel leads for synthetic drugs. Knowledge of the three-dimensional structures of proteins allows the design of molecules capable of binding to the receptor site of a target protein with high affinity and specificity. This informatics-based approach significantly reduces the time and cost necessary to develop drugs with higher potency, fewer side effects, and less toxicity than the traditional trial-and-error approach. In forensics, results from molecular phylogenetic analysis have been accepted as evidence in criminal courts.
Some sophisticated Bayesian statistics and likelihood-based methods for DNA analysis have been applied in forensic identification. It is worth mentioning that genomics and bioinformatics are now poised to revolutionize our healthcare system through the development of personalized and customized medicine. High-speed genomic sequencing coupled with sophisticated informatics technology will allow a doctor in a clinic to quickly sequence a patient's genome, readily detect potentially harmful mutations, and engage in early diagnosis and effective treatment of disease. Bioinformatics tools are being used in agriculture as well. Plant genome databases and gene expression profile analyses have played an important role in the development of new crop varieties that have higher productivity and more resistance to disease.

1.6 Limitations
Having recognized the power of bioinformatics, it is also important to realize its limitations and to avoid over-reliance on, and over-expectation of, bioinformatics output. In fact, bioinformatics has a number of inherent limitations. In many ways, the role of bioinformatics in genomics and molecular biology research can be likened to the role of intelligence gathering on the battlefield. Intelligence is clearly very important in winning a battle; fighting a battle without intelligence is inefficient and dangerous. Having superior information and correct intelligence helps to identify the enemy's weaknesses and reveal the enemy's strategy and intentions. The gathered information can then be used in directing the forces to engage the enemy and win the battle. However, relying completely on intelligence can also be dangerous if the intelligence is of limited accuracy; overreliance on poor-quality intelligence can yield costly mistakes, if not complete failures. It is no stretch of the analogy to say that fighting diseases or other biological problems using bioinformatics is like fighting battles with intelligence. Bioinformatics and experimental biology are independent but complementary activities. Bioinformatics depends on experimental science to produce raw data for analysis; it, in turn, provides useful interpretation of experimental data and important leads for further experimental research. Bioinformatics predictions are not formal proofs of any concept. They do not replace the traditional experimental research methods of actually testing hypotheses. In addition, the quality of bioinformatics predictions depends on the quality of the data and the sophistication of the algorithms being used. Sequence data from high-throughput analysis often contain errors. If the sequences are wrong or the annotations incorrect, the results from downstream analysis will be misleading as well. That is why it is so important to maintain a realistic perspective of the role of bioinformatics. Bioinformatics is by no means a mature field. Most algorithms lack the capability and sophistication to truly reflect reality. They often make incorrect predictions that make no sense when placed in a biological context. Errors in sequence alignment, for example, can affect the outcome of structural or phylogenetic analysis. The outcome of computation also depends on the computing power available. Many accurate but exhaustive algorithms cannot be used because of their slow rate of computation; instead, less accurate but faster algorithms have to be used. This is a necessary trade-off between accuracy and computational feasibility.
Therefore, it is important to keep in mind the potential for errors produced by bioinformatics programs. Caution should always be exercised when interpreting prediction results. It is good practice to use multiple programs, if they are available, and to perform multiple evaluations. A more accurate prediction can often be obtained by drawing a consensus from the results of different algorithms.

1.7 Future of bioinformatics
Despite the pitfalls, there is no doubt that bioinformatics is a field that holds great potential for revolutionizing biological research in the coming decades. Currently, the field is undergoing major expansion. In addition to providing more reliable and more rigorous computational tools for sequence, structural, and functional analysis, the major challenge for future bioinformatics development is to develop tools for elucidating the functions and interactions of all gene products in a cell. This presents a tremendous challenge because it requires the integration of disparate fields of biological knowledge and a variety of complex mathematical and statistical tools. To gain a deeper understanding of cellular functions, mathematical models are needed to simulate a wide variety of intracellular reactions and interactions at the whole-cell level. This molecular simulation of all cellular processes is termed systems biology. Achieving this goal would represent a major leap toward fully understanding a living system. That is why system-level simulation and integration are considered the future of bioinformatics. Modeling such complex networks and making predictions about their behavior present tremendous challenges and opportunities for bioinformaticians. The ultimate goal of this endeavor is to transform biology from a qualitative science to a quantitative and predictive science. This is truly an exciting time for bioinformatics.

2. Computer fundamentals and applications in biology
What is a computer?
A computer is an automatic electronic device used to perform arithmetic and logical operations.
Historical evolution of the computer: The development of the computer passed through a number of stages before it reached its present state. In fact, the development of the first calculating device, the ABACUS, dates back to 3000 B.C. From the abacus to the microcomputer, computing systems have undergone tremendous changes. The main historical stages are listed below.
1) ABACUS
2) Analog machines and Napier's bones
3) Odometer
4) Blaise Pascal and his mechanical calculator
5) Charles Babbage's difference engine
6) Herman Hollerith: the punched card
7) Electrical machines
8) Howard Aiken and the IBM Mark I

TYPES OF COMPUTER
Technically, computers are of three types:
1) Digital computer
2) Analog computer
3) Hybrid computer
1) Digital computer: In this type of computer, mathematical expressions are ultimately represented as the binary digits 0 and 1, and all operations are performed on these digits at a very high rate (see the short illustration after this list).
2) Analog computer: Here similarities are established in the form of current or voltage signals. An analog computer operates by measuring rather than counting: the variable is an electrical signal produced analogous to the variable of a physical system.
3) Hybrid computer: This type of computer is a combination machine, using the good qualities of both analog and digital computers.
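To make the binary idea concrete, here is a small illustration in Python (Python is used for all illustrative code added to these notes; the numbers are arbitrary examples):

    # Every value a digital computer handles is ultimately a pattern of bits (0s and 1s).
    n = 369
    print(format(n, 'b'))       # '101110001' -- the binary representation of 369
    print(int('101110001', 2))  # 369 -- converting the bit pattern back to decimal
    print(369 * 514)            # 189666 -- arithmetic the hardware carries out on such bit patterns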
CHARACTERISTICS OF COMPUTER
1. SPEED: Computers make calculations at a very fast rate. But how fast? Multiplying 369 by 514 takes an expert human calculator about 50 to 60 seconds, but a modern computer can do millions of such calculations in a minute, and without mistakes.
2. ACCURACY: Computers themselves do not make mistakes. If we insert faulty data we will get a faulty result, but there is no chance of the computer itself making a computational error.
3. STORAGE: A computer can store a large amount of data or information, which can be retrieved within microseconds.
4. VERSATILITY: A computer can perform many different jobs automatically, including audio, video, and graphics functions.
5. DILIGENCE: A computer can go on working endlessly and, unlike a human being, does not get tired. It can process a large amount of information at a time.
6. INTEGRITY: This is the ability to take in and store a sequence of instructions for execution. Such a sequence of instructions is called a PROGRAM. It also includes the ability to obey the sequence of program instructions automatically.

Block diagram of a computer
INPUT UNIT: We give information to the computer through the input unit, e.g. mouse, keyboard.
Keyboard: With the help of the keyboard we provide information or commands to the computer. It is similar to a typewriter, but with additional keys.
CPU (Central Processing Unit): The CPU is called the heart of the computer. The basic function performed by a computer is program execution; the CPU controls the operation of the computer and performs its data-processing function. The three major components of the CPU are:
A) ALU (Arithmetic Logic Unit): All calculations and comparisons are made in this unit; the actual processing of calculations is done in the ALU.
B) CU (Control Unit): The control unit acts like a supervisor. It controls the transfer of data and information between the various units, and it initiates the appropriate functions of the arithmetic unit.
C) MEMORY: The computer can store a large amount of data; this storage capacity is called memory. There are two main types of memory.
1. Primary memory: This is of two types.
a) RAM: RAM stands for random access memory. It is volatile, or temporary: as soon as the power is off, any information stored in RAM is lost. In this memory we can read the contents as well as write new information.
b) ROM: ROM stands for read only memory. It is permanent, or non-volatile, memory. In this memory we cannot change the data or write new information; we can only read the information.
2. Secondary memory: Secondary memory is also called external memory. An example of a secondary storage device:
a) Floppy disk: Floppy disks come in various sizes; information is written to or read from the floppy along concentric circles called tracks. The floppy disk is comparatively cheap and can be taken from one place to another.
OUTPUT DEVICES: These give us the final result in the desired form.
VDU (Visual Display Unit): This is the most common output device; it displays the output from the computer. The display is temporary: when the computer is switched off, the display is lost.
Printer: Printers are divided into two main types.
1) Impact printer: These printers produce an impact on the piece of paper on which information is to be typed, i.e. the head of the printer physically touches the paper.
2) Non-impact printer: These are printers in which no impact printing mechanism is utilised and there is no physical contact between the head and the paper.
Printers are also classified by their printing mechanism, as below:
a) Character printer: prints one character at a time.
b) Line printer: prints one line at a time.
c) Page printer: prints one page at a time.

PROGRAMMING LANGUAGES
A language is a system of communication; a programming language consists of all the symbols and characters that permit people to communicate with computers.
A) Machine language: A language which uses numeric codes to represent operations and numeric addresses of operands, and which is the only one directly understood by the computer. A sequence of such instructions is called a machine language program.
B) Assembly language: In assembly language, mnemonics are used to represent operation codes, and strings of characters are used to represent addresses. An assembly language is designed mainly to replace each machine code with an understandable mnemonic and each address with a simple alphanumeric string. An assembly language program must first be translated into its equivalent machine language program.
C) High-level languages: The development of techniques such as micro-instructions led to the development of high-level languages. A number of languages have been developed to process scientific and mathematical problems. A high-level language is an English-like language. Some commonly used high-level languages are PASCAL, COBOL, FORTRAN, etc.

Software
A set of instructions given to the computer to operate and control its activities is called software. Software is the part of the computer that enables the hardware to be used; as a car cannot run without fuel, a computer cannot work without software. Software can be classified as follows:
1) System software
2) Application software
1) System software: These are programs which contribute to the control and performance of the computer system. System software consists of:
a) Operating system
b) Utility software
c) Translator
a) Operating system: An integrated set of programs that manages the resources of the computer and schedules its operations is called an operating system. The operating system acts as an interface between the hardware and the programs, e.g. DOS (single user), UNIX (multiuser).
Uses of an operating system:
1. Control and coordination of peripheral devices such as printers, display screens, and disk drives.
2. Monitoring the use of machine resources.
3. Helping the user develop programs.
4. Dealing with any faults that may occur in the computer and informing the operator.
b) Utility software: Many tasks are common to a variety of applications, such as sorting data and copying files; utility programs carry out these routine tasks.

2.2 Types of computers according to size
1. Microcomputer
2. Minicomputer
3. Mainframe computer
Microcomputer: a common small computer used for personal purposes, e.g. personal desktop or laptop computers.
Minicomputer: a larger computer or workstation used for commercial purposes, e.g. a server in a small computer laboratory. Many minicomputer operating systems and architectures arose in the 1970s and 1980s, but minicomputers are generally not considered mainframes.
Mainframe computer: Mainframes (often colloquially referred to as "big iron") are computers used mainly by large organizations for critical applications, typically bulk data processing such as censuses, industry and consumer statistics, ERP, and financial transaction processing. Most large-scale computer system architectures were firmly established in the 1960s.
2.3 Applications of computers in biology
1. To store vast, diverse, and complex life sciences data.
2. To provide fast and easy access to biological data.
3. To make biological information more understandable and useful by using various visualization tools.
4. To analyze biological data, using mathematical and computational approaches, in order to address theoretical and experimental questions in biology.

3. Basic database concepts
3.1 What is data?
Technically, raw facts and figures, such as orders and payments, which are processed into information, such as balance due and quantity on hand, are considered data. In practice the terms data and information are often used synonymously. Strictly, data is the plural of "datum," one item of data, but datum is rarely used, and data serves as both singular and plural. Data is any form of information, whether on paper or in electronic form. Data may refer to any electronic file, no matter what the format: database data, text, images, audio, and video. Everything read and written by the computer can be considered data, except for the instructions in a program that are executed (software). A common misconception is that software is also data: software is executed, or run, by the computer, while data are "processed." Thus, software causes the computer to process data. The amount of data versus information kept in the computer is a trade-off: data can be processed into different forms of information, but it takes time to sort and sum transactions, whereas up-to-date information can provide instant answers.

3.2 Basic database concepts
WHAT IS A DATABASE? A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria. Databases are composed of computer hardware and software for data management. The chief objective of the development of a database is to organize data in a set of structured records to enable easy retrieval of information. Each record, also called an entry, contains a number of fields that hold the actual data items, for example fields for names, phone numbers, addresses, and dates. To retrieve a particular record from the database, a user specifies a particular piece of information, called a value, to be found in a particular field, and expects the computer to retrieve the whole data record. This process is called making a query.
Database management systems (DBMS) are collections of tools used to manage databases. Four basic functions performed by all DBMS are:
• Create, modify, and delete data structures, e.g. tables
• Add, modify, and delete data
• Retrieve data selectively
• Generate reports based on data
A short list of database applications would include: inventory, payroll, membership, orders, shipping, reservations, invoicing, accounting, security, catalogues, mailing, medical records, etc.

3.3 Database components
Databases are composed of related tables, while tables are composed of fields and records.
Field: A field is an area (within a record) reserved for a specific piece of data. Examples: customer number, customer name, street address, city, state, phone, current balance.
Fields are defined by:
• Field name
• Data type
  • Character: text, including such things as telephone numbers and zip codes
  • Numeric: numbers which can be manipulated using math operators
  • Date: calendar dates which can be manipulated mathematically
  • Logical: True or False, Yes or No
• Field size: the amount of space reserved for storing data
Record: A record is the collection of values for all the fields pertaining to one entity, i.e. a person, product, company, transaction, etc.
Table: A table is a collection of related records, for example an employee table, a product table, or customer and orders tables. In a table, records are represented by rows and fields are represented as columns.
Relationships: There are three types of relationships which can exist between tables:
• One-to-One
• One-to-Many
• Many-to-Many
The most common relationships in relational databases are One-to-Many and Many-to-Many. An example of a One-to-Many relationship would be a Customer table and an Orders table: each order has only one customer, but a customer can make many orders. One-to-Many relationships consist of two tables, the "one" table and the "many" table. An example of a Many-to-Many relationship would be an Orders table and a Products table: an order can contain many products, and a product can be on many orders. A Many-to-Many relationship consists of three tables: two "one" tables, both in a One-to-Many relationship with a third table. The third table is sometimes referred to as the linking (or junction) table.
Key fields: In order for two tables to be related, they must share a common field. The common field (key field) in the "one" table of a One-to-Many relationship needs to be a primary key. The same field in the "many" table of a One-to-Many relationship is called the foreign key.
Primary key: A primary key is a field, or a combination of two or more fields, whose value uniquely identifies each record. In the example above, customer number is the primary key for the Customer table: a customer number identifies one and only one customer in the Customer table. The primary key for the Orders table would be a field for the order number.
Foreign key: When a "one" table's primary key field is added to a related "many" table in order to create the common field which relates the two tables, it is called a foreign key in the "many" table. In the example above, the primary key (customer number) from the Customer table (the "one" table) is a foreign key in the Orders table (the "many" table). For the "many" records of the Orders table, the foreign key identifies the unique record in the Customer table with which they are associated.

3.4 Rationalization and redundancy
Grouping logically related fields into distinct tables, determining key fields, and then relating the distinct tables using common key fields is called rationalizing (normalizing) a database. There are two major reasons for designing a database this way:
• To avoid wasting storage space on redundant data
• To eliminate the complication of updating duplicate copies of data
For example, in the Customers/Orders database we want to be able to identify the customer name, address, and phone number for each order, but we want to avoid repeating that information for each order. To do so would take up storage space needlessly and make the job of updating multiple customer addresses difficult and time-consuming. To avoid redundancy:
1. Place all the fields related to customers (name, address, etc.) into a Customer table and create a primary key field which uniquely identifies each customer: Customer ID.
2. Put all the fields related to orders (date, salesperson, total, etc.) into the Orders table.
3. Include the primary key field (Customer ID) from the Customer table in the Orders table.
The One-to-Many relationship between Customer and Orders is defined by the common field Customer ID. In the Customer table (the "one" table) Customer ID is a primary key, while in the Orders table (the "many" table) it is a foreign key.
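As an illustration, the Customer/Orders design just described can be expressed directly in a relational DBMS. The following is a minimal sketch using Python's built-in sqlite3 module; the table and field names follow the example in the text, and the sample data are made up:

    import sqlite3

    conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
    cur = conn.cursor()

    # The "one" table: CustomerID is the primary key.
    cur.execute("CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, Name TEXT, City TEXT)")
    # The "many" table: CustomerID reappears here as a foreign key.
    cur.execute("CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, OrderDate TEXT, "
                "CustomerID INTEGER REFERENCES Customer(CustomerID))")

    # Customer details are stored once, never repeated for each order.
    cur.execute("INSERT INTO Customer VALUES (1, 'A. Patil', 'Jalgaon')")
    cur.execute("INSERT INTO Orders VALUES (101, '2007-01-15', 1)")
    cur.execute("INSERT INTO Orders VALUES (102, '2007-02-03', 1)")

    # A query (report) joining the two tables through the common key field.
    rows = cur.execute("SELECT Customer.Name, Orders.OrderID, Orders.OrderDate "
                       "FROM Customer JOIN Orders "
                       "ON Customer.CustomerID = Orders.CustomerID")
    for row in rows:
        print(row)  # e.g. ('A. Patil', 101, '2007-01-15')
    conn.close()

Note how the four basic DBMS functions listed above all appear: creating data structures (CREATE TABLE), adding data (INSERT), retrieving data selectively (SELECT with a JOIN on the key field), and generating a simple report (the printed rows).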
Reference: Bruce Miller (2005), Database Management Systems.

4. Introduction to Biological Databases
4.1 Need
One of the hallmarks of modern genomic research is the generation of enormous amounts of raw sequence data. As the volume of genomic data grows, sophisticated computational methodologies are required to manage the data deluge. Thus, the very first challenge in the genomics era is to store and handle the staggering volume of information through the establishment and use of computer databases. The development of databases to handle the vast amount of molecular biological data is thus a fundamental task of bioinformatics. This chapter introduces some basic concepts related to databases, in particular the types, designs, and architectures of biological databases. Emphasis is on retrieving data from the main biological databases such as GenBank. Although data retrieval is the main purpose of all databases, biological databases often have a higher level of requirement, known as knowledge discovery, which refers to the identification of connections between pieces of information that were not known when the information was first entered. For example, databases containing raw sequence information can perform extra computational tasks to identify sequence homology or conserved motifs. These features facilitate the discovery of new biological insights from raw data.

4.2 Types of databases
Originally, databases all used a flat file format, which is a long text file containing many entries separated by a delimiter, a special character such as a vertical bar (|). Within each entry are a number of fields separated by tabs or commas. Except for the raw values in each field, the text file contains no hidden instructions for computers to search for specific information or to create reports based on certain fields of each record. The text file can be considered a single table. Thus, to search a flat file for a particular piece of information, a computer has to read through the entire file, an obviously inefficient process. This is manageable for a small database, but as database size increases or data types become more complex, this style of database can become very difficult for information retrieval. Indeed, searches through such files often crash the entire computer system because of the memory-intensive nature of the operation.
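To see why searching a flat file requires a linear scan, consider this minimal Python sketch. The record layout here (one record per line, fields separated by a vertical bar) and the accession-style identifiers are invented for the example, not any particular database's actual format:

    # Searching a flat file: the program must read every record to find a match.
    records = [
        "AB000001|Homo sapiens|beta-globin mRNA",
        "AB000002|Mus musculus|insulin gene",
    ]  # in a real database these lines would be read from a very large text file

    for line in records:
        accession, organism, description = line.split("|")
        if organism == "Mus musculus":  # the query: one value sought in one field
            print(accession, description)

Indexed database systems avoid this full scan, which is exactly the advantage of the relational design described in Chapter 3.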
There are over 1,000 public and commercial biological databases. These biological databases usually contain genomics and proteomics data, but databases are also used in taxonomy. The data are nucleotide sequences of genes or amino acid sequences of proteins. Furthermore, information about function, structure, localisation on the chromosome, and clinical effects of mutations, as well as similarities between biological sequences, can be found.

The most important public databases for molecular biology fall into the following categories:
1. Primary sequence databases
2. Meta-databases
3. Genome browsers
4. Specialized databases
5. Expression, regulation & pathway databases
6. Protein sequence databases
7. Protein structure databases
8. Microarray databases
9. Protein-protein interaction databases

Overview: Biological databases have become an important tool in assisting scientists to understand and explain a host of biological phenomena, from the structure of biomolecules and their interactions, to the whole metabolism of organisms, to understanding the evolution of species. This knowledge helps in the fight against diseases, assists in the development of medications, and helps in discovering basic relationships amongst species in the history of life. The knowledge in biological databases is usually distributed amongst many different specialized databases. This makes it difficult to ensure the consistency of information, which sometimes leads to low data quality. By far the most important resource for biological databases is a special yearly issue of the journal Nucleic Acids Research (NAR). The Database Issue is freely available and categorizes all the publicly available online databases related to computational biology (bioinformatics).

Most important public databases for molecular biology (from http://www.kokocinski.net/bioinformatics/databases.php):

Primary sequence databases
The International Nucleotide Sequence Database (INSD) consists of the following databases:
1. DDBJ (DNA Data Bank of Japan)
2. EMBL Nucleotide DB (European Molecular Biology Laboratory)
3. GenBank (National Center for Biotechnology Information)
These databanks represent the current knowledge about the sequences of all organisms. They interchange the stored information and are the source for many other databases.

Meta-databases
Strictly speaking, a meta-database can be considered a database of databases, rather than any one integration project or technology. It collects information from different sources and usually makes it available in a new, more convenient form.
1. MetaDB (a metadatabase for the biological sciences), containing links and descriptions for over 1,200 biological databases
2. Entrez (National Center for Biotechnology Information)
3. euGenes (Indiana University)
4. GeneCards (Weizmann Inst.)
5. SOURCE (Stanford University)
6. mGen, containing four of the world's biggest databases (GenBank, RefSeq, EMBL, and DDBJ); easy, simple, program-friendly gene extraction
7. Harvester (EMBL Heidelberg), the Bioinformatic Harvester, integrating 16 major protein resources

Genome browsers
Genome browsers enable researchers to visualize and browse entire genomes (most have many complete genomes) with annotated data including gene prediction and structure, proteins, expression, regulation, variation, comparative analysis, etc. The annotated data usually come from multiple diverse sources.
1. UCSC Genome Bioinformatics Genome Browser and Tools (UCSC)
2. Ensembl Genome Browser (Sanger Institute and EBI)
3. Integrated Microbial Genomes Microbial Genome Browser (Joint Genome Institute, Department of Energy)
4. GBrowse (the GMOD GBrowse project)

Specialized databases
1. CGAP Cancer Genes (National Cancer Institute)
2. Clone Registry Clone Collections (National Center for Biotechnology Information)
3. DBGET H. sapiens (Univ. of Kyoto)
4. GDB Human Genome Db (Human Genome Organisation)
5. I.M.A.G.E. Clone Collections (IMAGE Consortium)
6. MGI Mouse Genome (Jackson Lab.)
7. SHMPD, the Singapore Human Mutation and Polymorphism Database
8. NCBI-UniGene (National Center for Biotechnology Information)
9. OMIM Inherited Diseases (Online Mendelian Inheritance in Man)
10. Official Human Genome Db (HUGO Gene Nomenclature Committee)
11. Lists of SNP databases
12. p53, the p53 Knowledgebase

Expression, regulation & pathway databases
1. KEGG PATHWAY Database (Univ. of Kyoto)
2. Reactome (Cold Spring Harbor Laboratory, EBI, Gene Ontology Consortium)

Protein sequence databases
1. UniProt, the Universal Protein Resource (UniProt Consortium: EBI, Expasy, PIR)
2. PIR, the Protein Information Resource (Georgetown University Medical Center (GUMC))
3. Swiss-Prot Protein Knowledgebase (Swiss Institute of Bioinformatics)
4. PEDANT Protein Extraction, Description and ANalysis Tool (Forschungszentrum f. Umwelt & Gesundheit)
5. PROSITE Database of Protein Families and Domains
6. DIP Database of Interacting Proteins (Univ. of California)
7. Pfam Protein families database of alignments and HMMs (Sanger Institute)
8. ProDom Comprehensive set of Protein Domain Families (INRA/CNRS)
9. SignalP Server for signal peptide prediction

Protein structure databases
1. Protein Data Bank (PDB) (Research Collaboratory for Structural Bioinformatics (RCSB))
2. CATH Protein Structure Classification
3. SCOP Structural Classification of Proteins
4. SWISS-MODEL Server and Repository for Protein Structure Models
5. ModBase Database of Comparative Protein Structure Models (Sali Lab, UCSF)

Microarray databases
1. ArrayExpress (European Bioinformatics Institute)
2. Gene Expression Omnibus (National Center for Biotechnology Information)
3. maxd (Univ. of Manchester)
4. SMD (Stanford University)
5. GPX (Scottish Centre for Genomic Technology and Informatics)

Protein-protein interactions
1. BioGRID, a General Repository for Interaction Datasets (Samuel Lunenfeld Research Institute)
2. STRING, a database of known and predicted protein-protein interactions (EMBL)
3. Interactome

4.3 DNA sequence databases
NCBI: National Center for Biotechnology Information. Established in 1988 as a national resource for molecular biology information, NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information, all for the better understanding of the molecular processes affecting human health and disease.
EMBL: the EMBL Nucleotide Sequence Database, maintained by the European Bioinformatics Institute, an outstation of the European Molecular Biology Laboratory (Heidelberg, Germany). It archives up-to-date and detailed information about biological macromolecules such as nucleotide sequences and protein sequences.
DDBJ: the DNA Data Bank of Japan began DNA data bank activities in earnest in 1986 at the National Institute of Genetics (NIG), with the endorsement of the Ministry of Education, Science, Sports and Culture. From the beginning, DDBJ has functioned as one of the International DNA Databases, together with EBI (the European Bioinformatics Institute, responsible for the EMBL database) in Europe and NCBI (the National Center for Biotechnology Information, responsible for the GenBank database) in the USA as the two other members.
Consequently, DDBJ collaborates with the two other data banks by exchanging data and information over the Internet and by regularly holding two meetings, the International DNA Data Banks Advisory Meeting and the International DNA Data Banks Collaborative Meeting. The Center for Information Biology at NIG was reorganized as the Center for Information Biology and DNA Data Bank of Japan (CIB-DDBJ) in 2001. The new center plays a major role in carrying out research in information biology and in running the DDBJ operation worldwide. It is generally accepted that research in biology today requires computers and experimental equipment equally; in particular, we must rely on computers to analyze DNA sequence data accumulating at a remarkably rapid rate. This is what triggered the birth and development of information biology. DDBJ is the sole DNA data bank in Japan officially certified to collect DNA sequences from researchers and to issue internationally recognized accession numbers to data submitters. It collects data mainly from Japanese researchers, but accepts data from, and issues accession numbers to, researchers in any other country. Since the collected data are exchanged with EMBL/EBI and GenBank/NCBI on a daily basis, the three data banks share virtually the same data at any given time. DDBJ also provides many tools for data retrieval and analysis, developed at DDBJ and elsewhere, to users worldwide.
Database collaboration: NCBI, EMBL, and DDBJ collaborate internationally, exchanging data and information over the Internet and regularly holding the two meetings named above; the three data banks share virtually the same data at any given time.

4.4 Protein sequence databases
Swiss-Prot: the protein sequence database of the Swiss Institute of Bioinformatics. Swiss-Prot strives to provide reliable protein sequences associated with a high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.), a minimal level of redundancy, and a high level of integration with other databases. In 2002 the UniProt consortium was created: it is a collaboration between the Swiss Institute of Bioinformatics, the European Bioinformatics Institute, and the Protein Information Resource (PIR), funded by the National Institutes of Health. Swiss-Prot and its automatically curated supplement TrEMBL have joined with the Protein Information Resource protein database to produce the UniProt Knowledgebase, the world's most comprehensive catalogue of information on proteins. As of 3 April 2007, UniProtKB/Swiss-Prot release 52.2 contained 263,525 entries, and UniProtKB/TrEMBL release 35.2 contained 4,232,122 entries. The UniProt consortium produces three database components, each optimised for different uses: the UniProt Knowledgebase (UniProtKB = Swiss-Prot + TrEMBL); the UniProt Non-redundant Reference (UniRef) databases, which combine closely related sequences into a single record to speed similarity searches; and the UniProt Archive (UniParc), a comprehensive repository of protein sequences reflecting the history of all protein sequences.
TrEMBL: the translated nucleotide sequence database of the European Molecular Biology Laboratory. It archives the same kind of information as Swiss-Prot.
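Records in these public databases can also be retrieved programmatically rather than through a web browser. Below is a minimal sketch using the Biopython library's Entrez interface (Biopython is an added assumption here, not something covered in these notes); the accession number NM_000518 (a human beta-globin mRNA) is just an example, and NCBI asks that users supply their real e-mail address:

    from Bio import Entrez, SeqIO

    Entrez.email = "your.name@example.com"  # NCBI requires a contact address

    # Fetch one GenBank nucleotide record by accession number.
    handle = Entrez.efetch(db="nucleotide", id="NM_000518",
                           rettype="gb", retmode="text")
    record = SeqIO.read(handle, "genbank")
    handle.close()

    print(record.id, record.description)
    print(len(record.seq), "bases")

Because GenBank, EMBL, and DDBJ exchange data daily, essentially the same record could be obtained from any of the three data banks.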
5. Genomics and proteomics
5.1 What are genes, genomes, and genomics?
Gene: A gene is a segment of DNA or of a chromosome responsible for coding one or more functional proteins.
Genome: The genome is the gene complement of an organism. A genome sequence comprises the information of the entire genetic material of an organism.
Genomics: the science that deals with the study of the entire genome and of gene organization, such as gene order, gene arrangement, gene ontology, etc. The goal of genomics is to determine the complete DNA sequence of all the genetic material contained in an organism's genome.

5.2 Structural genomics
Structural genomics is the branch of genomics that determines the three-dimensional structures of proteins. Structural genomics, or structural bioinformatics, refers to the analysis of macromolecular structure, particularly of proteins, using computational tools and theoretical frameworks. One of the goals of structural genomics is the extension of the idea of genomics: to obtain accurate three-dimensional structural models for all known protein families, protein domains, or protein folds. Structural alignment is a tool of structural genomics.

5.3 Functional genomics
Understanding the function of genes and other parts of the genome is known as functional genomics. Functional genomics is a field of molecular biology that attempts to make use of the vast wealth of data produced by genomic projects (such as genome sequencing projects) to describe gene (and protein) functions and interactions. Unlike genomics and proteomics, functional genomics focuses on dynamic aspects such as gene transcription, translation, and protein-protein interactions, as opposed to the static aspects of genomic information such as DNA sequence or structure.
Fields of application: Functional genomics includes function-related aspects of the genome itself, such as mutation and polymorphism (e.g. SNP) analysis, as well as the measurement of molecular activities. The latter comprise a number of "-omics" such as transcriptomics (gene expression), proteomics (protein expression), phosphoproteomics, and metabolomics. Together these measurement modalities quantify the various biological processes and power the understanding of gene and protein functions and interactions.
Frequently used techniques: Functional genomics mostly uses high-throughput techniques to characterize the abundance of gene products such as mRNAs and proteins. Some typical technology platforms are DNA microarrays and SAGE for mRNA, and two-dimensional gel electrophoresis and mass spectrometry for proteins. Because of the large quantity of data produced by these techniques and the desire to find biologically meaningful patterns, bioinformatics is crucial to this type of analysis. Examples of bioinformatics techniques used here are data clustering or principal component analysis for unsupervised machine learning (class detection), as well as artificial neural networks or support vector machines for supervised machine learning (class prediction, classification).
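As a small illustration of the unsupervised techniques just mentioned, the sketch below applies principal component analysis followed by k-means clustering to a toy gene-expression matrix. It assumes the NumPy and scikit-learn libraries; both the libraries and the made-up expression values are illustrative additions, not part of the original syllabus material:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    # Toy expression matrix: 6 genes (rows) x 4 conditions (columns), invented values.
    expression = np.array([
        [8.1, 7.9, 0.5, 0.3],  # genes 1-3: high in conditions 1 and 2
        [7.5, 8.2, 0.4, 0.6],
        [8.0, 7.7, 0.2, 0.5],
        [0.4, 0.6, 7.9, 8.3],  # genes 4-6: high in conditions 3 and 4
        [0.3, 0.2, 8.1, 7.8],
        [0.5, 0.4, 7.6, 8.0],
    ])

    # Unsupervised "class detection": reduce to 2 principal components, then cluster.
    pcs = PCA(n_components=2).fit_transform(expression)
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(pcs)
    print(labels)  # e.g. [0 0 0 1 1 1]: the two co-expressed gene groups are recovered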
5.4 Proteome and proteomics
Proteome: The proteome is the protein complement expressed by a genome. While the genome is static, the proteome continually changes in response to external and internal events.
Proteomics: the study of how the entire set of proteins produced by a particular organism interact. It encompasses the identification and quantification of proteins and the effects of their modifications, interactions, activities, and functions, during disease states and treatment. It is also the study of an organism's proteins themselves, including their molecular structure, since protein structure often determines the roles that proteins play in physiology. The term is applied to anyone working with proteins, which is almost everyone in the post-genomic age; in short, proteomics is the science of determining protein structure and function.

5.5 What is comparative genomics? How does it relate to functional genomics?
Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome. Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse. Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans. Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them. Some of these sequence-similarity tools are accessible to the public over the Internet. One of the most widely used is BLAST, which is available from the National Center for Biotechnology Information. BLAST is a set of programs designed to perform similarity searches on all available sequence data. For instructions on how to use BLAST, see the tutorial "Sequence similarity searching using NCBI BLAST," available through Gene Gateway, an online guide for learning about genes, proteins, and genetic disorders.
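The similarity percentages quoted in the answer below, and tools such as BLAST, ultimately rest on counting matching positions between aligned sequences. Here is a toy Python sketch of that core idea, using two short invented sequences that are assumed to be already aligned (real tools must first compute the alignment itself):

    # Percent identity between two pre-aligned, equal-length sequences (invented data).
    human = "ATGGTGCACCTGACTCCTGAGGAG"
    mouse = "ATGGTGCACCTAACTGATGCTGAG"

    matches = sum(1 for a, b in zip(human, mouse) if a == b)
    identity = 100.0 * matches / len(human)
    print(f"{matches}/{len(human)} positions identical = {identity:.1f}%")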
Why is model organism research important? Why do we care what diseases mice get? Functional genomics research is conducted using model organisms such as mice. Model organisms offer a cost-effective way to follow the inheritance of genes (that are very similar to human genes) through many generations in a relatively short time. Some model organisms studied in the HGP were the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the roundworm Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and the laboratory mouse. Additionally, HGP spinoffs have led to the genetic analysis of other environmentally and industrially important organisms in the United States and abroad (see "Microbial Genomes Sequenced").

How closely related are mice and humans? How many genes are the same? Answer provided by Lisa Stubbs of Lawrence Livermore National Laboratory, Livermore, California.
Mice and humans (indeed, most or all mammals, including dogs, cats, rabbits, monkeys, and apes) have roughly the same number of nucleotides in their genomes: about 3 billion base pairs. This comparable DNA content implies that all mammals contain more or less the same number of genes, and indeed our work and the work of many others have provided evidence to confirm that notion. I know of only a few cases in which no mouse counterpart can be found for a particular human gene, and for the most part we see essentially a one-to-one correspondence between genes in the two species. The exceptions generally appear to be of a particular type: genes that arise when an existing sequence is duplicated. Gene duplication occurs frequently in complex genomes; sometimes the duplicated copies degenerate to the point where they are no longer capable of encoding a protein. However, many duplicated genes remain active and over time may change enough to perform a new function. Since gene duplication is an ongoing process, mice may have active duplicates that humans do not possess, and vice versa. These appear to make up a small percentage of the total genes. I believe the number of human genes without a clear mouse counterpart, and vice versa, won't be significantly larger than 1% of the total. Nevertheless, these novel genes may play an important role in determining species-specific traits and functions.
However, the most significant differences between mice and humans are not in the number of genes each carries but in the structure of genes and the activities of their protein products. Gene for gene, we are very similar to mice. What really matters is that subtle changes accumulated in each of the approximately 30,000 genes add together to make quite different organisms. Further, genes and proteins interact in complex ways that multiply the functions of each. In addition, a gene can produce more than one protein product through alternative splicing or post-translational modification; these events do not always occur in an identical way in the two species. A gene can produce more or less protein in different cells at various times in response to developmental or environmental cues, and many proteins can express disparate functions in various biological contexts. Thus, subtle distinctions are multiplied by the more than 30,000 estimated genes.
The often-quoted statement that we share over 98% of our genes with apes (chimpanzees, gorillas, and orangutans) actually should be put another way: there is more than 95% to 98% similarity between related genes in humans and apes in general. (Just as in the mouse, quite a few genes probably are not common to humans and apes, and these may influence uniquely human or ape traits.) Similarities between mouse and human genes range from about 70% to 90%, with an average of 85% similarity but a lot of variation from gene to gene (e.g., some mouse and human gene products are almost identical, while others are nearly unrecognizable as close relatives). Some nucleotide changes are "neutral" and do not yield a significantly altered protein. Others, but probably only a relatively small percentage, introduce changes that could substantially alter what the protein does. Put these alterations in the context of known inherited human diseases: a single nucleotide change can lead to inheritance of sickle cell disease, cystic fibrosis, or breast cancer. A single nucleotide difference can alter protein function in such a way that it causes a terrible tissue malfunction. Single-nucleotide changes have been linked to hereditary differences in height, brain development, facial structure, pigmentation, and many other striking morphological differences; due to single nucleotide changes, hands can develop structures that look like toes instead of fingers, and a mouse's tail can disappear completely. Single-nucleotide changes in the same genes but at different positions in the coding sequence might do nothing harmful at all. Evolutionary changes are the same as the sequence differences that are linked to person-to-person variation: many of the average 15% nucleotide changes that distinguish human and mouse genes are neutral, some lead to subtle changes, and others are associated with dramatic differences.
Add them all together, and they can make quite an impact, as evidenced by the huge range of metabolic, morphological, and behavioral differences we see among organisms.

Why are mice used in this research?

Mice are genetically very similar to humans. They also reproduce rapidly, have short life spans, are inexpensive and easy to handle, and can be genetically manipulated at the molecular level.

5.6 What genomes have been sequenced completely?

In addition to the human genome, numerous other genomes have been sequenced. These include the mouse Mus musculus, the fruit fly Drosophila melanogaster, the worm Caenorhabditis elegans, the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the plant Arabidopsis thaliana, and several microbes. For a complete listing, see "A Quick Guide to Sequenced Genomes" from the Genome News Network. Other resources for information on sequenced genomes:
• GOLD - Genomes Online Database: provides comprehensive access to information regarding complete and ongoing genome projects around the world.
• Comprehensive Microbial Resource: a tool that allows the researcher to access all of the bacterial genome sequences completed to date. From The Institute for Genomic Research (TIGR).
• Entrez Genome: a resource from the National Center for Biotechnology Information (NCBI) for accessing information about completed and in-progress genomes.

5.7 Overview of Comparative Genomic Analysis

Sequencing the genomes of the human, the mouse and a wide variety of other organisms - from yeast to chimpanzees - is driving the development of an exciting new field of biological research called comparative genomics. By comparing the human genome with the genomes of different organisms, researchers can better understand the structure and function of human genes and thereby develop new strategies in the battle against human disease. In addition, comparative genomics provides a powerful new tool for studying evolutionary changes among organisms, helping to identify the genes that are conserved among species along with the genes that give each organism its own unique characteristics. Using computer-based analysis to zero in on the genomic features that have been preserved in multiple organisms over millions of years, researchers will be able to pinpoint the signals that control gene function, which in turn should translate into innovative approaches for treating human disease and improving human health. In addition, the evolutionary perspective may prove extremely helpful in understanding disease susceptibility. For example, chimpanzees do not suffer from some of the diseases that strike humans, such as malaria and AIDS. A comparison of the sequence of genes involved in disease susceptibility may reveal the reasons for this species barrier, thereby suggesting new pathways for prevention of human disease. Although living creatures look and behave in many different ways, all of their genomes consist of DNA, the chemical chain that makes up the genes that code for thousands of different kinds of proteins. Precisely which protein is produced by a given gene is determined by the sequence in which four chemical building blocks - adenine (A), thymine (T), cytosine (C) and guanine (G) - are laid out along DNA's double-helix structure. In order for researchers to most efficiently use an organism's genome in comparative studies, data about its DNA must be in large, contiguous segments, anchored to chromosomes and, ideally, fully sequenced.
Furthermore, the data needs to be organized for easy access and high-speed analysis by sophisticated computer software. The successful sequencing of the human genome, which is scheduled to be finished in April 2003, and the recent draft assemblies of the mouse and rat genomes have demonstrated that large-scale sequencing projects can generate high-quality data at a reasonable cost. As a result, interest in sequencing the genomes of many other organisms has risen dramatically. The fledgling field of comparative genomics has already yielded some dramatic results. For example, a March 2000 study comparing the fruit fly genome with the human genome discovered that about 60 percent of genes are conserved between fly and human; to put it more simply, the two organisms appear to share a core set of genes. Researchers have found that two-thirds of human cancer genes have counterparts in the fruit fly. Even more surprisingly, when scientists inserted a human gene associated with early-onset Parkinson's disease into fruit flies, the flies displayed symptoms similar to those seen in humans with the disorder, raising the possibility that the tiny insects could serve as a new model for testing therapies aimed at Parkinson's. In September 2002, the cow (Bos taurus), the dog (Canis familiaris) and the ciliate Oxytricha (Oxytricha trifallax) joined the "high priority" list of organisms that the National Human Genome Research Institute (NHGRI) decided to consider for genome sequencing as capacity becomes available. Other high-priority animals include the chimpanzee (Pan troglodytes), the chicken (Gallus gallus), the honey bee (Apis mellifera) and even a sea urchin (Strongylocentrotus purpuratus). With sequencing projects on the human, mouse and rat genomes progressing rapidly and nearing completion, NHGRI-supported sequencing capacity is expected to be available soon for work on other organisms. NHGRI created a priority-setting process in 2001 to make rational decisions about the many requests being brought forward by various communities of scientists, each championing the animals used in its own research. The priority-setting process, which does not result in new grants for sequencing the organisms, is based on the medical, agricultural and biological opportunities expected to be created by sequencing a given organism. In addition to its implications for human health and well-being, comparative genomics may benefit the animal world as well. As sequencing technology grows easier and less expensive, it will likely find wide applications in zoology as a tool to tease apart the often-subtle differences among animal species. Such efforts might possibly lead to the rearrangement of some branches on the evolutionary tree, as well as point to new strategies for conserving or expanding rare and endangered species.

6. Human Genome Project

What was the Human Genome Project?

The Human Genome Project (HGP) was the international, collaborative research program whose goal was the complete mapping and understanding of all the genes of human beings. All our genes together are known as our "genome." The HGP was the natural culmination of the history of genetics research. In 1911, Alfred Sturtevant, then an undergraduate researcher in the laboratory of Thomas Hunt Morgan, realized that he could - and had to, in order to manage his data - map the locations of the fruit fly (Drosophila melanogaster) genes whose mutations the Morgan laboratory was tracking over generations.
Sturtevant's very first gene map can be likened to the Wright brothers' first flight at Kitty Hawk. In turn, the Human Genome Project can be compared to the Apollo program bringing humanity to the moon. The hereditary material of all multi-cellular organisms is the famous double helix of deoxyribonucleic acid (DNA), which contains all of our genes. DNA, in turn, is made up of four chemical bases, pairs of which form the "rungs" of the twisted, ladder-shaped DNA molecules. All genes are made up of stretches of these four bases, arranged in different ways and in different lengths.

HGP researchers have deciphered the human genome in three major ways: determining the order, or "sequence," of all the bases in our genome's DNA; making maps that show the locations of genes for major sections of all our chromosomes; and producing what are called linkage maps, complex versions of the type originated in early Drosophila research, through which inherited traits (such as those for genetic disease) can be tracked over generations. The HGP has revealed that there are probably somewhere between 30,000 and 40,000 human genes. The completed human sequence can now identify their locations. This ultimate product of the HGP has given the world a resource of detailed information about the structure, organization and function of the complete set of human genes. This information can be thought of as the basic set of inheritable "instructions" for the development and function of a human being. The International Human Genome Sequencing Consortium published the first draft of the human genome in the journal Nature in February 2001, with the sequence of the entire genome's three billion base pairs some 90 percent complete. A startling finding of this first draft was that the number of human genes appeared to be significantly fewer than previous estimates, which ranged from 50,000 genes to as many as 140,000. The full sequence was completed and published in April 2003.

Upon publication of the majority of the genome in February 2001, Francis Collins, the director of NHGRI, noted that the genome could be thought of in terms of a book with multiple uses: "It's a history book - a narrative of the journey of our species through time. It's a shop manual, with an incredibly detailed blueprint for building every human cell. And it's a transformative textbook of medicine, with insights that will give health care providers immense new powers to treat, prevent and cure disease." The tools created through the HGP also continue to inform efforts to characterize the entire genomes of several other organisms used extensively in biological research, such as mice, fruit flies and flatworms. These efforts support each other, because most organisms have many similar, or "homologous," genes with similar functions. Therefore, the identification of the sequence or function of a gene in a model organism, for example the roundworm C. elegans, has the potential to explain a homologous gene in human beings, or in one of the other model organisms. These ambitious goals required, and will continue to demand, a variety of new technologies that have made it possible to construct a first draft of the human genome relatively rapidly and to continue to refine that draft.
These techniques include:
• DNA Sequencing
• The Employment of Restriction Fragment-Length Polymorphisms (RFLPs)
• Yeast Artificial Chromosomes (YACs)
• Bacterial Artificial Chromosomes (BACs)
• The Polymerase Chain Reaction (PCR)
• Electrophoresis

Of course, information is only as good as the ability to use it. Therefore, advanced methods for widely disseminating the information generated by the HGP to scientists, physicians and others are necessary in order to ensure the most rapid application of research results for the benefit of humanity. Biomedical technology and research are particular beneficiaries of the HGP. However, the momentous implications for individuals and society of possessing the detailed genetic information made possible by the HGP were recognized from the outset. Another major component of the HGP - and an ongoing component of NHGRI - is therefore devoted to the analysis of the ethical, legal and social implications (ELSI) of our newfound genetic knowledge, and the subsequent development of policy options for public consideration.

Glossary

Keyword: Analogous protein
Definition: Two proteins with related folds but unrelated sequences are called analogous. During evolution, analogous proteins independently developed the same fold.

Keyword: Databank
Definition: In the biosciences, a databank (or data bank) is a structured set of raw data, most notably DNA sequences from sequencing projects (e.g. the EMBL and GenBank databases).

Keyword: Database
Definition: A database (or data base) is a collection of data that is organized so that its contents can easily be accessed, managed, and modified by a computer. The most prevalent type of database is the relational database, which organizes the data in tables; multiple relations can be mathematically defined between the rows and columns of each table to yield the desired information. An object-oriented database stores data in the form of objects, which are organized in hierarchical classes that may inherit properties from classes higher in the tree structure. In the biosciences, a database is a curated repository of raw data containing annotations, further analysis, and links to other databases. Examples of databases are the SWISS-PROT database of annotated protein sequences and the FlyBase database of genetic and molecular data for Drosophila melanogaster.

Keyword: Dynamic Programming
Definition: In general, dynamic programming is an algorithmic scheme for solving discrete optimization problems that have overlapping subproblems. In a dynamic programming algorithm, the definition of the function that is optimized is extended as the computation proceeds. The solution is constructed by progressing from simpler to more complex cases, thereby solving each subproblem before it is needed by any other subproblem. In particular, the algorithm for finding optimal sequence alignments is an example of dynamic programming (a minimal sketch appears below).

Keyword: Force Field
Definition: In molecular dynamics and molecular mechanics calculations, the intra- and intermolecular interactions of a molecule are calculated from a simplified empirical parametrization called a force field. These parameters include atom masses, charges, dihedral angles, improper angles, van der Waals and electrostatic interactions, etc.

Keyword: Genome
Definition: The genome is the gene complement of an organism. A genome sequence comprises the information of the entire genetic material of an organism.
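To make the Dynamic Programming entry above concrete, here is a minimal sketch of the Needleman-Wunsch global alignment algorithm in Python. The scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not the parameters of any particular tool; real aligners use substitution matrices and affine gap penalties.

# Needleman-Wunsch global alignment score by dynamic programming.
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    # F[i][j] = best score for aligning the prefixes a[:i] and b[:j].
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # Each cell depends only on already-solved subproblems.
            F[i][j] = max(F[i - 1][j - 1] + s,  # align a[i-1] with b[j-1]
                          F[i - 1][j] + gap,    # gap in b
                          F[i][j - 1] + gap)    # gap in a
    return F[n][m]

print(needleman_wunsch("GATTACA", "GCATGCU"))  # prints 0 with these scores

The simpler-to-more-complex progression the definition describes is visible in the table F: every entry is computed from three neighbors that were filled in earlier, so each subproblem is solved exactly once.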
Keyword: Genomics, Functional Genomics, Structural Genomics
Definition: The goal of genomics is to determine the complete DNA sequence of all the genetic material contained in an organism's genome. Functional genomics (sometimes referred to as functional proteomics) aims at determining the function of the proteome (the protein complement encoded by an organism's entire genome). It expands the scope of biological investigation from studying single genes or proteins to studying all genes or proteins at once in a systematic fashion, using large-scale experimental methodologies combined with statistical analysis of the results. Structural genomics is the systematic effort to gain a complete structural description of a defined set of molecules, ultimately for an organism's entire proteome. Structural genomics projects apply X-ray crystallography and NMR spectroscopy in a high-throughput manner.

Keyword: Hidden Markov Model
Definition: A Hidden Markov Model (HMM) is a general probabilistic model for sequences of symbols. In a Markov chain, the probability of each symbol depends only on the preceding one. Hidden Markov models are widely used in bioinformatics, most notably to replace sequence profiles in the calculation of sequence alignments.

Keyword: Homologous Proteins
Definition: Two proteins with related folds and related sequences are called homologous. Commonly, homologous proteins are further divided into orthologous and paralogous proteins. While orthologous proteins evolved from a common ancestral gene, paralogous proteins were created by gene duplication.

Keyword: Neural Network
Definition: A neural network is a computer algorithm for solving non-linear optimization problems. The algorithm was derived in analogy to the way the densely interconnected, parallel structure of the brain processes information.

Keyword: Ontology
Definition: The word ontology has a long history in philosophy, in which it refers to the study of being as such. In information science, an ontology is an explicit formal specification of how to represent the objects, concepts and other entities that are assumed to exist in some area of interest and the relationships among them.

Keyword: Open Reading Frame (ORF)
Definition: An open reading frame contains a series of codons (base triplets) coding for amino acids without any termination codons. There are six potential reading frames for an unidentified sequence (a minimal scan over all six frames is sketched below).

Keyword: Protein Folding Problem
Definition: Proteins fold on a time scale from milliseconds to seconds. Starting from a random-coil conformation, proteins find their stable fold quickly even though the number of possible conformations is astronomically high. The Protein Folding Problem is to predict the folding process and the final structure of a protein solely from its sequence. The Protein Structure Prediction Problem refers to the combinatorial problem of calculating the three-dimensional structure of a protein from its sequence alone. It is one of the biggest challenges in structural bioinformatics.

Keyword: Proteome
Definition: The proteome is the protein complement expressed by a genome. While the genome is static, the proteome continually changes in response to external and internal events.

Keyword: Proteomics
Definition: Proteomics aims at quantifying the expression levels of the complete protein complement (the proteome) in a cell at any given time. While proteomics research was initially focused on two-dimensional gel electrophoresis for protein separation and identification, proteomics now refers to any procedure that characterizes the function of large sets of proteins. It is thus often used as a synonym for functional genomics.
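To illustrate the Open Reading Frame entry above, the following minimal Python sketch scans all six reading frames of a DNA sequence (three per strand) for runs of codons that end at a stop codon. The minimum-length cutoff and the requirement that a run be terminated by a stop codon are simplifying assumptions; real gene-finding tools are considerably more elaborate.

# A minimal ORF scan over all six reading frames (three on each strand).
STOPS = {"TAA", "TAG", "TGA"}

def revcomp(seq):
    comp = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(comp[b] for b in reversed(seq))

def orfs(seq, min_codons=2):
    found = []
    for strand in (seq, revcomp(seq)):   # both strands
        for frame in range(3):           # three frames per strand
            codons = [strand[i:i+3] for i in range(frame, len(strand) - 2, 3)]
            run = []
            for c in codons:
                if c in STOPS:
                    # a stop-free codon run ending at a stop is a candidate ORF
                    if len(run) >= min_codons:
                        found.append("".join(run))
                    run = []
                else:
                    run.append(c)
    return found

print(orfs("ATGGCCTAACGTATGAAATAG"))
# ['ATGGCC', 'CGTATGAAA', 'TGGCCTAACGTA', 'TATTTCATACGT']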
Keyword: Sequence Contig
Definition: A contig consists of a set of gel readings from a sequencing project that are related to one another by overlap of their sequences. The gel readings of a contig can be combined to form a contiguous consensus sequence, whose length is called the length of the contig.

Keyword: Sequence Profile
Definition: A sequence profile represents certain features in a set of aligned sequences. In particular, it gives position-dependent weights for all 20 amino acids, as well as for insertion and deletion events, at every sequence position.

Keyword: Single Nucleotide Polymorphism
Definition: Single Nucleotide Polymorphisms (SNPs) are single base pair positions in genomic DNA at which normal individuals in a given population show different sequence alternatives (alleles), with the least frequent allele having an abundance of 1% or greater. SNPs occur once every 100 to 300 bases and are hence the most common genetic variations.

Keyword: Threading
Definition: Threading techniques try to match a target sequence against a library of known three-dimensional structures by "threading" the target sequence over the known coordinates. In this manner, threading tries to predict the three-dimensional structure starting from a given protein sequence. It is sometimes successful when comparisons based on sequences or sequence profiles alone fail because the sequence similarity is too low.

Keyword: Turing Machine
Definition: The Turing machine is one of the key abstractions used in modern computability theory. It is a mathematical model of a device that changes its internal state and reads from, writes on, and moves a potentially infinite tape, all in accordance with its present state. The model of the Turing machine played an important role in the conception of the modern digital computer.
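As a small illustration of the Turing Machine entry above, here is a minimal simulator sketched in Python. The machine itself (one that walks right over its input, flipping 0s and 1s until it reaches a blank) is an arbitrary toy example chosen for illustration, not part of the definition.

# A minimal Turing machine simulator: state + tape + transition table.
# Each rule maps (state, symbol) -> (symbol to write, head move, new state).
def run(tape, rules, state="start", blank="_"):
    cells = dict(enumerate(tape))  # sparse tape, indexed by integer position
    pos = 0
    while state != "halt":
        symbol = cells.get(pos, blank)
        write, move, state = rules[(state, symbol)]
        cells[pos] = write
        pos += 1 if move == "R" else -1
    return "".join(cells[i] for i in sorted(cells))

# Toy machine: move right, flipping 0s and 1s, and halt at the first blank.
flip = {
    ("start", "0"): ("1", "R", "start"),
    ("start", "1"): ("0", "R", "start"),
    ("start", "_"): ("_", "R", "halt"),
}
print(run("1011", flip))  # prints 0100_ (the trailing blank is where it halted)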