BIOINFORMATICS
BIO 208 Genetics
revised s10

Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights and to create a global perspective from which unifying principles in biology can be discerned.

The simplest tasks in bioinformatics concern the creation and maintenance of databases of biological information. Nucleic acid sequences (and the protein sequences derived from them) make up the majority of such databases.

Bioinformatics includes the development of new algorithms and statistics with which to assess relationships among members of these large databases, and the analysis and interpretation of various data, including nucleotide and amino acid sequences, protein domains, and protein structures (computational biology).

Computational molecular biology includes:
o finding the genes in the DNA sequences of various organisms
o developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences
o clustering protein sequences into families of related sequences and developing protein models
o aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships

NCBI = National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information, the NCBI creates public databases, conducts research in computational biology, develops software tools for analyzing genome data, and disseminates biomedical information.

BLAST = Basic Local Alignment Search Tool
BLAST programs can be used to search both DNA and protein sequences on the NCBI server. The program you will use is BLASTp (p for protein).

GenBank contains over 6,000 whole genome sequences and over 100,000,000 DNA sequences. Over 30,000 people per day access GenBank online.

OMIM
Online Mendelian Inheritance in Man is a comprehensive compendium of human genes and genetic phenotypes. OMIM contains information on all known Mendelian disorders and over 12,000 genes.

PubMed
PubMed is a service of the U.S. National Library of Medicine that includes over 18 million citations from MEDLINE and other life science journals for biomedical articles back to 1948.

NCBI resources

Literature
o Download a large, custom set of records from NCBI
o Obtain the full text of an article
o Find articles about a topic similar to that in a given article
o Find published information on a gene or sequence

DNA & RNA
o Download a large, custom set of records from NCBI
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Obtain a genomic DNA clone for a gene
o Retrieve all sequences for an organism or taxon
o Find a curated version of a sequence record (NCBI Reference Sequence)
o Find transcript sequences for a gene
o Design PCR primers and check them for specificity
o Save a text search and/or receive regular search results by e-mail
o Find published information on a gene or sequence

Proteins
o Download a large, custom set of records from NCBI
o View a mutation site in a 3D structure
o View the 3D structure of a protein
o Align two or more 3D structures to a given structure
o Find the function of a gene or gene product
o Link from an object on a map to another resource
o Retrieve all sequences for an organism or taxon
o Find a curated version of a sequence record
o Find transcript sequences for a gene
o Save a text search and/or receive regular search results by e-mail
o Find published information on a gene or sequence

Sequence Analysis
o Run BLAST software on a local computer
o Design PCR primers and check them for specificity
o Automate BLAST searches performed on NCBI servers
o Run BLAST searches against custom, local databases
o Submit multiple query sequences in a single BLAST search
o Obtain genomic sequence for/near a gene, marker, transcript or protein

Genes & Expression
o View all SNPs associated with a gene
o View genotype frequency data for a gene, disease or SNP
o Find genes associated with a phenotype or disease
o Find human variants associated with a phenotype or disease as reported in the literature
o Download a large, custom set of records from NCBI
o Find the function of a gene or gene product
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Find human variants with a clinical association in the SNP database
o Find syntenic regions between the genomes of two organisms
o Obtain a genomic DNA clone for a gene
o Find transcript sequences for a gene
o Save a text search and/or receive regular search results by e-mail
o Find published information on a gene or sequence
o Find a homolog for a gene in another organism
o Obtain genomic sequence for/near a gene, marker, transcript or protein

Genomes
o Download the complete genome for an organism
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Find syntenic regions between the genomes of two organisms
o Obtain a genomic DNA clone for a gene
o Check the status of genome sequencing for an organism

Maps & Markers
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Find syntenic regions between the genomes of two organisms
o Obtain a genomic DNA clone for a gene

Domains & Structures
o View a mutation site in a 3D structure
o View the 3D structure of a protein
o Align two or more 3D structures to a given structure
o Find the function of a gene or gene product
o Save a text search and/or receive regular search results by e-mail

Genetics & Medicine
o View genotype frequency data for a gene, disease or SNP
o Find genes associated with a phenotype or disease
o Find human variants associated with a phenotype or disease as reported in the literature
o Find human variants with a clinical association in the SNP database

Taxonomy
o Retrieve all sequences for an organism or taxon
o Find the complete taxonomic lineage for an organism
o Generate a Common Tree for a set of taxa

Data & Software
o Download the complete genome for an organism
o Download a large, custom set of records from NCBI
o NCBI Software

Training & Tutorials
o Learn about the basics of molecular biology and bioinformatics
o Learn about an NCBI resource
o Complete an NCBI tutorial
o Find out what's new at NCBI

Homology
o Find syntenic regions between the genomes of two organisms
o Find a homolog for a gene in another organism

Small Molecules
o Find bioassays in which a given drug is active
o Find bioassays that test a particular disease or protein target

Variation
o View all SNPs associated with a gene
o View genotype frequency data for a gene, disease or SNP
o Find genes associated with a phenotype or disease
o Find human variants associated with a phenotype or disease as reported in the literature
o Download a large, custom set of records from NCBI
o View a mutation site in a 3D structure
o Find human variants with a clinical association in the SNP database

PROBLEM CONTEXT (adapted from National Science Foundation)

You and your research partners attended a presentation at which you learned of an effective folk remedy used for the prevention of fungal disease in humans. The pasty nature of the remedy is provided by the structural protein keratin. During the presentation, evidence was presented suggesting that there is a lower incidence of breast cancer and heart disease among those who take the folk remedy.

The folk remedy contains the following:
o Water
o Salt
o Pigeon feather extract
o Muskmelon seeds
o Southern copperhead snake venom

Your start-up biotech company is interested in the therapeutic effects of these agents and needs to identify the specific protein responsible for the observed anti-cancer and anti-heart disease effects. After identifying the active protein, your company will isolate the gene encoding the protein. The gene will be engineered and cloned into bacteria. By growing the bacteria in culture, you will be able to purify large quantities of the protein, which can be used in FDA-regulated clinical trials.

Three amino acid sequences have been isolated from the folklore remedy. You will use the NCBI's BLAST program to search the protein database to identify the protein from which these amino acid sequences were obtained. You will hypothesize as to which protein might possess the desired anti-cancer, anti-heart disease function.

Objectives:
o To fully understand the purpose of the laboratory exercise, including the problem context
o To view some of the organism databases offered in the BLAST assembled genomes
o To convert amino acid sequences into FASTA format
o To utilize the BLAST searching tool to identify proteins using short amino acid sequences
o To evaluate the contents of the folklore remedy with respect to their use in medicine
o To identify a journal article relevant to the anti-cancer or anti-heart disease effects of the component(s) in the folk remedy
o To explain why the selected protein is likely to have anti-cancer and anti-heart disease activity

I. IDENTITY OF THE UNKNOWN PROTEINS

All sequences for the BLAST alignment programs are entered in FASTA format. View the table on the last page of this handout to see the one-letter FASTA code for each of the amino acids.

Q1. What is the 1 letter code for the following amino acid sequence?

Methionine-Lysine-Leucine-Tyrosine-Serine-Leucine-Leucine-Serine-Leucine-Leucine-
Phenylalanine-Leucine-Glycine-Valine-Leucine-Tryptophan-Arginine-Serine-
Glutamic acid-Glycine-Valine-Alanine-Serine-Serine-Serine-Asparagine-Aspartic acid-
Aspartic acid-Valine-Glycine

1 letter code (FASTA format) ________________________________________________

The three amino acid sequences isolated from the folklore remedy are:
1: The above sequence from question 1
2: MSCYNPCLPC QPCGPTPLAN
3: DAPANPCCDA ATCKLTTGSQ CADGLCCDQC

Searching the BLAST database
o Access the NCBI home page at http://www.ncbi.nlm.nih.gov/
o Select BLAST (right column)
o Select list all genomic species

Q2. Explore the species in GenBank by clicking on the organism groups (primates, rodents, monotremes, marsupials, invertebrates, protozoa, plants, fungi, etc.). List the scientific name (genus and species) of the following organisms.
Human
Chimpanzee
Mouse
Zebrafish
Fruit fly
Maize (corn)
Baker's yeast

o Return and select protein BLAST (searches the protein database with an amino acid query)
o Enter the amino acid sequence of protein 1 in FASTA format (see above)
o Select the nr database (nr = non-redundant protein sequences)
o Scroll down and click on BLAST
o View the color key for alignment scores. A score above 50 will be considered relevant in this exercise.
o Scroll down to the sequences producing significant alignments and view the sequence that produced the strongest hit

Q3. The sequence that has the best match has the highest score. What is the probable identity of the protein?

Q4. Select the link to the left of the sequence description (begins with sp or ref). What is the source of the protein? List genus and species AND common name.

Q5. Scroll down to examine the amino acid sequence alignment. How many amino acids long is the protein?

Q6. Examine the amino acid sequence carefully to locate your original query sequence within it.
Which amino acids (from what position to what position in the protein) correspond to your query sequence? ____________ to _____________

Determination of protein function
Search Wikipedia to identify the function of protein 1 in the organism you identified. Wikipedia contains accurate information with respect to the bioinformatics exercise.

Q7. Examine the applications and ingredients of the folklore remedy. This will assist you in determining the function of the protein you have identified. What is the function of this protein in the folklore remedy? Based on the problem context, would your company pursue this protein?

Identification of Additional Proteins in the Folklore Medicine
Examine the FASTA sequence of protein 2. Use the table on the last page of the handout to determine the amino acid sequence of this protein.

Q8. What is the amino acid sequence of protein 2 (use 3 letter amino acid abbreviations)?

Q9. Use BLAST to determine the identity of this protein. What is the identity of protein 2?

Q10. What is the genus and species of the organism that protein 2 was isolated from?

Q11. What is the common name of the organism?

Q12. Which one of the folklore ingredients did you identify?

Q13. Reread the problem context that describes the uses of the folk remedy. The protein you identified has a structural, or binding, function. Explain (Wikipedia can be used).

Q14. Use protein BLAST to determine the identity of protein 3. Although you may obtain a number of hits with a score of 100, observe only those that correspond to full sequences (not partial sequences or protein chains). What is the identity of protein 3?

Q15. What is the genus and species from which protein 3 was isolated?

Q16. What is the common name of the organism?

Q17. Which one of the folklore ingredients did you identify?

Q18.
What is (are) the function(s) of the protein with respect to your interests in the biotech startup company described in the problem context? (Search Wikipedia and read at least 3 article links.)

Q19. List 4 journal names in which research biologists have published papers on this protein. Journal names can be found at the end of each article.

II. SUMMARY

Provide an analysis of your investigation. Of the 3 proteins analyzed, which do you think is responsible for reducing the risk of breast cancer and heart disease? Which element of the folklore remedy would you purify and pursue as a drug? What is the basis for your opinion? What are the roles in the folklore remedy of the 2 proteins not selected for further study?

Amino acid symbols

Amino acid        One letter symbol    Three letter symbol
alanine           A                    Ala
arginine          R                    Arg
asparagine        N                    Asn
aspartic acid     D                    Asp
cysteine          C                    Cys
glutamic acid     E                    Glu
glutamine         Q                    Gln
glycine           G                    Gly
histidine         H                    His
isoleucine        I                    Ile
leucine           L                    Leu
lysine            K                    Lys
methionine        M                    Met
phenylalanine     F                    Phe
proline           P                    Pro
serine            S                    Ser
threonine         T                    Thr
tryptophan        W                    Trp
tyrosine          Y                    Tyr
valine            V                    Val
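
The table lookup used to write a sequence in one-letter code can also be done with a short script. The Python sketch below is only an illustration and is not part of the original exercise: the function names (to_one_letter, to_fasta), the header text, and the four-residue example peptide are all made up for demonstration. It assumes the residues are written as full names separated by hyphens, as in Q1, and it adds the ">" description line that FASTA files conventionally begin with.

```python
# One-letter codes keyed by lowercase amino acid name (same values as the table).
ONE_LETTER = {
    "alanine": "A", "arginine": "R", "asparagine": "N", "aspartic acid": "D",
    "cysteine": "C", "glutamic acid": "E", "glutamine": "Q", "glycine": "G",
    "histidine": "H", "isoleucine": "I", "leucine": "L", "lysine": "K",
    "methionine": "M", "phenylalanine": "F", "proline": "P", "serine": "S",
    "threonine": "T", "tryptophan": "W", "tyrosine": "Y", "valine": "V",
}


def to_one_letter(names):
    """Convert hyphen-separated full amino acid names to a one-letter string.

    Hypothetical helper: raises KeyError if a name is misspelled or unknown.
    """
    # Treat en dashes like hyphens so mixed separators are handled the same way.
    parts = names.replace("\u2013", "-").split("-")
    residues = [part.strip().lower() for part in parts if part.strip()]
    return "".join(ONE_LETTER[residue] for residue in residues)


def to_fasta(header, sequence, width=60):
    """Return a FASTA record: a '>' description line, then wrapped sequence lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines)


# Made-up example peptide, deliberately not one of the handout sequences.
example = "Glycine - Alanine - Tryptophan - Lysine"
print(to_fasta("example_peptide", to_one_letter(example)))
```

For the made-up peptide above, the script prints the description line >example_peptide followed by GAWK. A misspelled residue name stops the script with a KeyError, which is a quick way to catch transcription mistakes before pasting a sequence into BLAST.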