What is bioinformatics?

(Adapted from the Frequently Asked Questions page at bioinformatics.org)

In a broad sense, bioinformatics describes any use of computers to handle biological information. In practice, the definition used by most people is narrower: bioinformatics is a synonym for "computational molecular biology", the use of computers to characterize the molecular components of living things.

"Classical" bioinformatics

When most biologists talk about bioinformatics they are referring to the practice of using computers to store, compare, retrieve, and analyze information about biomolecules, or to predict their function. Biomolecules include your genetic material (DNA and RNA) and the products of your genes: proteins. These are the concerns of "classical" bioinformatics, which deals primarily with sequence analysis.

Fredj Tekaia at the Institut Pasteur offers this definition of bioinformatics:

"The mathematical, statistical and computing methods that aim to solve biological problems using DNA and amino acid sequences and related information."

Most large biological molecules are polymers, or ordered chains of simpler molecular modules called monomers. Think of the monomers as beads or building blocks which, despite having different colors and shapes, all have the same thickness and the same way of connecting to one another. Monomers that can combine in a chain are of the same general class, but each kind of monomer in that class has its own well-defined set of characteristics. Many monomer molecules can be joined together to form a single, far larger, macromolecule. Macromolecules can have exquisitely specific informational content and/or chemical properties. According to this scheme, the monomers in a given macromolecule of DNA or protein can be treated computationally as letters of an alphabet, put together in pre-programmed arrangements to carry messages or do work in a cell.
Bioinformatics uses computational methods to try to figure out the information contained in these strings of letters.

"New" bioinformatics

The greatest achievement of bioinformatics to date, the Human Genome Project, is now completed. While sequencing of various genomes is still ongoing, the nature and priorities of bioinformatics research and applications are changing to focus on deciphering what the information contained in these genomes tells us. Here are some of the ways in which people are attacking this problem:

Now that we have sequenced multiple whole genomes we can look for differences and similarities between all the genes of multiple species. From such studies we can draw particular conclusions about species and general ones about evolution. This kind of science is often referred to as comparative genomics.

There are now technologies designed to measure the relative number of copies of a genetic message (levels of gene expression) at different stages in development or disease or in different tissues. These technologies, which include DNA microarrays, are often referred to collectively as functional genomics and are producing large amounts of data that must be stored and analyzed using computational methods.

Large-scale methods of investigating the functions and associations of proteins (for example yeast two-hybrid methods) are frequently referred to as proteomics (a combination of protein and genomics).

Medical informatics is the more direct application of genomic and proteomic technologies to the investigation of disease, and usually includes the incorporation of traditional clinical data. The merging of individual clinical data with newer molecular data provides exciting opportunities for finding the causes of complex diseases like diabetes, heart disease, or cancer, but it also poses some unique problems in data management, integration, and analysis.

How old is the discipline?

"How old is bioinformatics?"
The answer to this depends on which source you choose to read. Bioinformatics is a relatively young field compared to physics, chemistry, biochemistry, molecular biology, or computer science, but people have been engaging in some of its practices since the dawn of molecular biology 40 years ago.

From Attwood and Parry-Smith's "Introduction to Bioinformatics" [1]:

"The term bioinformatics is used to encompass almost all computer applications in biological sciences, but was originally coined in the mid-1980s for the analysis of biological sequence data."

From Mark S. Boguski's article in the "Trends Guide to Bioinformatics" [2]:

"The term 'bioinformatics' is a relatively recent invention, not appearing in the literature until 1991 and then only in the context of the emergence of electronic publishing... ...However, some of my role models when I was a graduate student (Margaret O. Dayhoff, Russell F. Doolittle, Walter M. Fitch and Andrew D. McLachlan) had been building databases, developing algorithms and making biological discoveries by sequence analysis since the 1960s, long before anyone thought to label this activity with a special term (if anything it was called ‘molecular evolution’). Even a relatively new kid on the block, the National Center for Biotechnology Information (NCBI), is celebrating its 10th anniversary this year, having been written into existence by US Congressman Claude Pepper and President Ronald Reagan in 1988. So bioinformatics has, in fact, been in existence for more than 30 years and is now middle-aged."

[1] Prentice-Hall 1999 (Longman Higher Education; ISBN 0582327881)
[2] Elsevier, Trends Supplement 1998, p. 1.

Other Fields Related to Bioinformatics

Biophysics

Molecular biology itself grew out of biophysics. The British Biophysical Society defines biophysics as "an interdisciplinary field which applies techniques from the physical sciences to understanding biological structure and function".

Computational Biology

Computational Biology is a broader term than bioinformatics, and encompasses various computational approaches to modeling biological problems ranging in scope from ecosystems, to blood flow in the heart, to a cell, to the dynamics of individual protein molecules.

Genomics

Genomics is the intersection of genetics and bioinformatics, and involves the analysis or comparison of genomes or subsets of genomes.

Mathematical Biology

Mathematical biology is less tied to the collection and analysis of sequence data than bioinformatics, and generally entails developing mathematical models (or applying existing models) to explain various features of biological systems.

Proteomics

Proteomics involves characterizing the many tens of thousands of proteins expressed in a given cell type at a given time (e.g. measuring their biochemical properties, identifying what other proteins and smaller molecules they interact with, determining the spatial location within the cell where they are found, or determining their three-dimensional structures) and involves the storage and analysis of vast amounts of data.

Medical Informatics

Medical informatics traditionally deals with the collection and management of patient data in a health care setting. Biomedical informatics involves the combination of this traditional patient data with newer molecular data to investigate questions of disease at the patient level.

Where to go for more information

The information in this document was adapted from: http://bioinformatics.org/faq/

Wikipedia has a fairly comprehensive page on bioinformatics: http://en.wikipedia.org/wiki/Bioinformatics

The European Bioinformatics Institute (EBI) has an extensive information and tutorial section on its website titled “2can”: http://www.ebi.ac.uk/2can/bioinformatics/bioinf_what_1.html

The National Center for Biotechnology Information (part of the National Institutes of Health, and the home of the BLAST sequence alignment program) has an education site: http://www.ncbi.nlm.nih.gov/About/primer/bioinformatics.html

The Howard Hughes Medical Institute (HHMI) has a variety of educational materials relating to modern biomedical science, both online and available for free on DVD, at: http://www.hhmi.org/biointeractive/index.html

Careers in Bioinformatics

http://www.biohealthmatics.com/careers/biocareerpaths.aspx

Bioinformatician

This role involves acquiring and analyzing data from collaborators, public databases, genome projects, and scientific publications.

Biomedical Computer Scientist

This role involves the design and development of programs and/or databases to be used in the biological field. Strong programming skills are usually a requirement for this role.

Geneticist

Geneticists usually fall into three categories: Research Geneticists, Laboratory Geneticists and Genetic Counselors. The first two usually require bioinformatics degrees at graduate level, and all require a strong understanding of genetics.

Computational Biologist

Computational Biologists develop computational tools and methods to solve complex theoretical and mathematical problems as they relate to interpreting genomic information. The role requires knowledge of how to correlate statistical and mathematical analyses with genetic and biological information.
A bioinformatics biologist collaborates with other researchers and departments. Duties include developing tools that support research objectives, compiling and analyzing data, and writing and editing reports for journal publication as needed.

Biostatistician

A biostatistician's responsibilities include reviewing potential bioinformatics publications for statistical accuracy, writing reports for in-house team members and collaborators, and gathering information on various studies. At a higher level, they would ensure the consistent application of statistical analysis across different studies. Experience with a wide range of statistical methods, such as ANOVA, logistic regression analysis, survival analysis, linkage analysis, and multivariate analysis, would be necessary. Proficiency in S-PLUS or SAS (computer-based statistical analysis tools) might be necessary in some positions.

Biomedical Chemist

Biomedical Chemists analyze pharmaceutical materials for quality, purity and strength. They use approved methodology and observe safety practices. They produce sample batches of a drug for troubleshooting and help design the scaling-up process that takes drug manufacture up to factory proportions. Advanced positions require extensive record keeping and the supervision and integration of a lab team.

Clinical Data Manager

The Clinical Data Manager (CDM) uses complex computer systems within bioinformatics environments and needs the analytical skill to detect and resolve data problems in clinical research studies. A good understanding of the data generated in a clinical research study, methodologies for data storage, reviewing data, database design and testing, and the ability to extract information are all skills that a CDM must have.

Molecular Microbiologist

The Microbiologist will support efforts to characterize pathogenic bacteria.
The Microbiologist will determine bacterial/spore resistance to standard and novel antimicrobials and decontaminants under various conditions. A Bachelor of Science in Microbiology with experience in general microbiology is sometimes required.

Software/Database Programmer

The bioinformatics programmer is responsible for performing analyses on data from genomic and other biological databases, clinical trials and other sources, including listings, tabulations, graphical summaries and formal statistical estimates and tests. The role requires the ability to assess the quality of analysis data, perform cross-study analyses, and write and use SAS macros to automate all of the above functions. Additionally, the person in this role will design and create analysis databases. A thorough knowledge of study design and protocol requirements is fundamental.

Medical Writer/Technical Writer

The duties of the medical writer comprise assisting departments in the preparation and writing of documents required for regulatory submissions, and writing study protocols, other documents needed for clinical studies, and clinical study reports in accordance with regulatory guidelines. Other tasks include drafting and coordinating the preparation of manuscripts for publication. A Master’s or PhD degree is usually required for this position.

Research Associates and Research Scientists

An advanced degree is needed. Research Associates participate in and contribute to a scientific objective. They must be conversant with laboratory equipment and software as well as safety practices and protocols. Research Associates also monitor and collect clinical trial data; coordinate designed trials; and prepare written reports, protocols, and study tracking documents. They are responsible for overall site management, including conducting initiation, interim, and close-out visits. A high level of interaction with physicians and pharmaceutical companies is required.
Reviewing study documentation and ensuring compliance with clinical objectives and procedures are also required for this role.

How does BLAST work?

Overview

BLAST stands for Basic Local Alignment Search Tool. It takes as input a sequence (either DNA or protein), and returns a list of sequences from the database that are ranked according to how similar they are to the input sequence. The underlying premise for this technique is that sequences that are similar to the input, or query, sequence are homologous to it; that is, they are derived from a common precursor sequence. Over time the two sequences have accumulated mutations and are no longer identical, but we can calculate statistically the likelihood that they are as similar as they are due to pure chance. Once the sequences diverge sufficiently from each other, we are no longer able to tell with certainty whether they are homologous or not: they are no more similar than sequences from genes that are not derived from the same common precursor molecule (i.e. unrelated sequences). This is why the ranking according to similarity is important. We want to find homologous sequences, because then we can assume that the genes we find in the database have similar biological functions.

[Figure: The BLAST page for doing a nucleotide database search (blastn).]

E-values

Each database sequence (or database “hit”) has an E-value listed after it. The “E” stands for “expectation”. This number indicates how likely it is that the similarity between the database and query sequences arose by chance rather than through homology, so smaller values mean stronger evidence of homology. This is sort of backwards from what you might expect, but the number is telling us how many times we should “expect” (hence the name, “expectation value”) to see that amount of similarity between two sequences simply by chance (i.e. by searching a database of randomly generated DNA sequences). An E-value of 1 means we would expect to see that level of similarity about once, on average, each time we search the database with our query sequence.
An E-value of 0.1 (or 10^-1 in scientific notation) means we would expect to see that level of similarity by chance about once in every 10 times we searched the database. Scientists often use an E-value of 0.001 (or 10^-3) as a cutoff: if the E-value is smaller than this, the “hit” is assumed to be a homologous gene (since we would expect to see that level of similarity by chance only about once in every thousand searches), while if it is larger than this we assume that we cannot tell for sure whether the “hit” is homologous or not, because the two sequences are not sufficiently similar for us to make that determination.

[Figure: These sequences in the database can be considered homologs to the query sequence, since there is only 1 chance in 3 x 10^28 that the level of similarity between the query and database sequences is due to chance.]

[Figure: These sequences cannot be assumed to be homologs to the query sequence, since there is roughly a 1 out of 1 chance that we would see this level of similarity between two sequences simply by chance.]

Alignments

The way BLAST comes up with the similarity measure between two sequences is to align them. It does this by going through a process of matching up the appropriate letters (A, G, T, and C, representing the 4 bases, or building blocks, of DNA), maximizing the number of alignments between like letters (e.g. an A aligned to an A), and minimizing the number of mismatches (e.g. an A aligned to a G). It does this by giving positive scores to matches and negative scores to mismatches, then adding up the score for the whole alignment. You can view the alignment between the query sequence and any of the top-scoring database sequences in the BLAST output.

[Figure: A BLAST alignment.]

Gaps

One of the ways that sequences diverge from each other over time is by gaining or losing nucleotides (i.e. the letters) in one sequence or the other.
This is accounted for by putting gaps in the alignment. These are represented in the BLAST output by aligning a letter in one sequence with a dash (-) in the other sequence. Gaps are also given negative scores, just like mismatched letters.

[Figure: A BLAST alignment of protein sequences with gaps (lower right-hand corner). Note that gaps (shown as dashes) can appear in either the query sequence or the database sequence.]

Organisms

One of the useful things about the BLAST site is that one can quickly find out which organism a sequence belongs to. This can be done for single sequences by clicking on the sequence description itself, which takes you to a page giving information about that particular sequence, or for all the hits at once by clicking on the “Taxonomy Reports” link. Sometimes it may take a bit of sleuthing to figure out what the scientific name means in everyday language (e.g. that Arabidopsis thaliana is a plant and Drosophila melanogaster is a fruit fly), but this is very useful information to have when evaluating the list of hits to your query sequence.

[Figure: Click on the “Taxonomy Reports” link to see more detailed information about the organisms that the sequences found by BLAST come from.]
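
The scoring idea described under Alignments and Gaps can be sketched in a few lines of Python. This is only an illustration: the function name and the score values (+1 for a match, -1 for a mismatch, -2 for a gap) are assumptions chosen for the example, not the parameters BLAST actually uses. BLAST also searches for the best-scoring alignment, whereas this sketch only adds up the score of an alignment that has already been made.

```python
def score_alignment(query, subject, match=1, mismatch=-1, gap=-2):
    """Add up the score of two already-aligned sequences of equal length.

    A dash (-) marks a gap. The default score values are illustrative
    assumptions, not the values BLAST uses.
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")

    total = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            total += gap        # a letter aligned with a dash
        elif q == s:
            total += match      # like letters, e.g. an A aligned to an A
        else:
            total += mismatch   # unlike letters, e.g. an A aligned to a G
    return total


# Four matches, one gap and one mismatch.
print(score_alignment("AGT-CA", "AGTTCG"))
```

With these example values, a longer run of matching letters raises the score, while every mismatch or gap lowers it, which is why closely related sequences end up near the top of the list of hits.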