BIOINFORMATICS

BIO 208 Genetics
revised s10
Bioinformatics is the field of science in which biology, computer science, and information
technology merge into a single discipline. The ultimate goal of the field is to enable the
discovery of new biological insights and to create a global perspective from which
unifying principles in biology can be discerned. The simplest tasks in bioinformatics
concern the creation and maintenance of databases of biological information. Nucleic acid
sequences (and the protein sequences derived from them) make up the majority of such
databases. Bioinformatics also includes the development of new algorithms and statistics with
which to assess relationships among members of these large databases, and the analysis and
interpretation of various data including nucleotide and amino acid sequences, protein domains,
and protein structures (computational biology).
Computational molecular biology includes:
o finding the genes in the DNA sequences of various organisms
o developing methods to predict the structure and/or function of newly discovered
proteins and structural RNA sequences
o clustering protein sequences into families of related sequences and the
development of protein models
o aligning similar proteins and generating phylogenetic trees to examine
evolutionary relationships.
NCBI = National Center for Biotechnology Information
Established in 1988 as a national resource for molecular biology information, the NCBI creates
public databases, conducts research in computational biology, develops software tools for
analyzing genome data, and disseminates biomedical information.
BLAST = Basic Local Alignment Search Tool
BLAST programs can be used to search both DNA and protein sequences on the NCBI server.
The program you will use is BLASTp (p for protein).
GenBank contains over 6,000 whole genome sequences and over 100,000,000 DNA
sequences. Over 30,000 people per day access GenBank online.
OMIM = Online Mendelian Inheritance in Man
OMIM is a comprehensive compendium of human genes and genetic phenotypes. It contains
information on all known Mendelian disorders and over 12,000 genes.
PubMed is a service of the U.S. National Library of Medicine that includes over 18
million citations from MEDLINE and other life science journals for biomedical articles back to
1948.
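To make the "local alignment" in BLAST's name concrete, the sketch below scores the best-matching stretch shared by two sequences. It uses the classic Smith-Waterman dynamic programming method with a toy scoring scheme (match, mismatch and gap values are assumptions chosen for illustration). BLAST itself uses a faster heuristic and, for proteins, a substitution matrix such as BLOSUM62, so this is not BLAST's algorithm and its scores are not comparable to BLAST scores.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b.

    Smith-Waterman with a linear gap penalty. The scoring values are
    illustrative assumptions, not the ones BLAST uses.
    """
    prev = [0] * (len(b) + 1)   # scores for the previous row of the matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Extend the alignment diagonally (match or mismatch) ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or open a gap in one sequence; never drop below zero,
            # which is what makes the alignment "local".
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Two toy sequences that share only the stretch "GGG".
print(local_alignment_score("AAAGGG", "TTGGGT"))
```

A higher score means a longer or better-matching shared region, which is the same idea behind ranking BLAST hits by score.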
NCBI resources
Literature
o Download a large, custom set of records from NCBI
o Obtain the full text of an article
o Find articles about a topic similar to that in a given article
o Find published information on a gene or sequence
DNA & RNA
o Download a large, custom set of records from NCBI
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Obtain a genomic DNA clone for a gene
o Retrieve all sequences for an organism or taxon
o Find a curated version of a sequence record (NCBI Reference Sequence)
o Find transcript sequences for a gene
o Design PCR primers and check them for specificity
o Save a text search and receive regular search results by e-mail
o Find published information on a gene or sequence
Proteins
o Download a large, custom set of records from NCBI
o View a mutation site in a 3D structure
o View the 3D structure of a protein
o Align two or more 3D structures to a given structure
o Find the function of a gene or gene product
o Link from an object on a map to another resource
o Retrieve all sequences for an organism or taxon
o Find a curated version of a sequence record
o Find transcript sequences for a gene
o Save a text search and/or receive regular search results by e-mail
o Find published information on a gene or sequence
Sequence Analysis
o Run BLAST software on a local computer
o Design PCR primers and check them for specificity
o Automate BLAST searches performed on NCBI servers
o Run BLAST searches against custom, local databases
o Submit multiple query sequences in a single BLAST search
o Obtain genomic sequence for/near a gene, marker, transcript or protein
Genes & Expression
o View all SNPs associated with a gene
o View genotype frequency data for a gene, disease or SNP
o Find genes associated with a phenotype or disease
o Find human variants associated with a phenotype or disease as reported in the literature
o Download a large, custom set of records from NCBI
o Find the function of a gene or gene product
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Find human variants with a clinical association in the SNP database
o Find syntenic regions between the genomes of two organisms
o Obtain a genomic DNA clone for a gene
o Find transcript sequences for a gene
o Save a text search and/or receive regular search results by e-mail
o Find published information on a gene or sequence
o Find a homolog for a gene in another organism
o Obtain genomic sequence for/near a gene, marker, transcript or protein
Genomes
o Download the complete genome for an organism
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Find syntenic regions between the genomes of two organisms
o Obtain a genomic DNA clone for a gene
o Check the status of genome sequencing for an organism
Maps & Markers
o View/download features around an object or between two objects on a chromosome
o Link from an object on a map to another resource
o Find syntenic regions between the genomes of two organisms
o Obtain a genomic DNA clone for a gene
Domains & Structures
o View a mutation site in a 3D structure
o View the 3D structure of a protein
o Align two or more 3D structures to a given structure
o Find the function of a gene or gene product
o Save a text search and/or receive regular search results by e-mail
Genetics & Medicine
o View genotype frequency data for a gene, disease or SNP
o Find genes associated with a phenotype or disease
o Find human variants associated with a phenotype or disease as reported in the literature
o Find human variants with a clinical association in the SNP database
Taxonomy
o Retrieve all sequences for an organism or taxon
o Find the complete taxonomic lineage for an organism
o Generate a Common Tree for a set of taxa
Data & Software
o Download the complete genome for an organism
o Download a large, custom set of records from NCBI
o Download NCBI software
Training & Tutorials
o Learn about the basics of molecular biology and bioinformatics
o Learn about an NCBI resource
o Complete an NCBI tutorial
o Find out what's new at NCBI
Homology
o Find syntenic regions between the genomes of two organisms
o Find a homolog for a gene in another organism
Small Molecules
o Find bioassays in which a given drug is active
o Find bioassays that test a particular disease or protein target
Variation
o View all SNPs associated with a gene
o View genotype frequency data for a gene, disease or SNP
o Find genes associated with a phenotype or disease
o Find human variants associated with a phenotype or disease as reported in the literature
o Download a large, custom set of records from NCBI
o View a mutation site in a 3D structure
o Find human variants with a clinical association in the SNP database
PROBLEM CONTEXT (adapted from National Science Foundation)
You and your research partners attended a presentation at which you learned of an effective folk
remedy used for the prevention of fungal disease in humans. The pasty nature of the remedy is
provided by the structural protein, keratin. During the presentation, evidence was presented
suggesting that there is a lower incidence of breast cancer and heart disease among those who
take the folk remedy.
The folk remedy contains the following:
Water
Salt
Pigeon feather extract
Muskmelon seeds
Southern copperhead snake venom
Your start-up biotech company is interested in the therapeutic effects of these agents and needs
to identify the specific protein responsible for the observed anti-cancer and anti-heart disease
effects.
After identifying the active protein, your company will isolate the gene encoding the protein. The
gene will be engineered and cloned into bacteria. By growing the bacteria in culture, you will be
able to purify large quantities of the protein, which can be used in FDA-regulated clinical trials.
Three amino acid sequences have been isolated from the folklore remedy. You will use the
NCBI’s BLAST program to search the protein database to identify the protein from which these
amino acid sequences were obtained. You will hypothesize as to which protein might possess the
anti-cancer, anti-heart disease function desired.
Objectives:
o To fully understand the purpose of the laboratory exercise including the problem context
o To view some of the organism databases offered in the BLAST assembled genomes
o To convert amino acid sequences into FASTA format
o To utilize the BLAST searching tool to identify proteins using short amino acid sequences
o To evaluate the contents of the folklore remedy with respect to their use in medicine
o To identify a journal article relevant to the anti-cancer or anti-heart disease effects of the
component(s) in the folk remedy
o To explain why the selected protein is likely to have anti-cancer and anti-heart disease
activity
I. IDENTITY OF THE UNKNOWN PROTEINS
All sequences for the BLAST alignment programs are entered in FASTA format. View the table
on the last page of this handout to see the one letter FASTA code for each of the amino acids.
Q1. What is the one-letter code for the following amino acid sequence?
Methionine-Lysine-Leucine-Tyrosine-Serine-Leucine-Leucine-Serine-Leucine-Leucine-Phenylalanine-Leucine-Glycine-Valine-Leucine-Tryptophan-Arginine-Serine-Glutamic acid-Glycine-Valine-Alanine-Serine-Serine-Serine-Asparagine-Aspartic acid-Aspartic acid-Valine-Glycine
One-letter code (FASTA format): ________________________________________________
The three amino acid sequences isolated from the folklore remedy are:
1: The sequence from Q1 above
2: MSCYNPCLPC QPCGPTPLAN
3: DAPANPCCDA ATCKLTTGSQ CADGLCCDQC
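The conversion asked for in Q1 can also be sketched in code. The example below maps full amino acid names to one-letter codes using the table on the last page, and writes a sequence as a FASTA record (a header line starting with ">" followed by the sequence). The function names and the label "protein_2" are hypothetical, chosen only for illustration; the BLAST web form also accepts the bare sequence without a header line.

```python
# One-letter codes, taken from the amino acid table in this handout.
ONE_LETTER = {
    "alanine": "A", "arginine": "R", "asparagine": "N", "aspartic acid": "D",
    "cysteine": "C", "glutamic acid": "E", "glutamine": "Q", "glycine": "G",
    "histidine": "H", "isoleucine": "I", "leucine": "L", "lysine": "K",
    "methionine": "M", "phenylalanine": "F", "proline": "P", "serine": "S",
    "threonine": "T", "tryptophan": "W", "tyrosine": "Y", "valine": "V",
}


def names_to_one_letter(names: str) -> str:
    """Convert hyphen-separated amino acid names to a one-letter string."""
    # Treat en-dashes like hyphens, then normalise case and spacing.
    parts = [p.strip().lower() for p in names.replace("–", "-").split("-")]
    return "".join(ONE_LETTER[p] for p in parts if p)


def to_fasta(label: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with a '>' header line."""
    seq = "".join(sequence.split()).upper()   # drop spaces between blocks
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + label + "\n" + "\n".join(lines)


print(names_to_one_letter("Methionine - Lysine - Leucine"))  # MKL
print(to_fasta("protein_2", "MSCYNPCLPC QPCGPTPLAN"))
```

The spaces that break sequences 2 and 3 into blocks of ten are only there for readability; they are removed before the sequence is submitted.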
Searching the BLAST database
Access the NCBI home page at http://www.ncbi.nlm.nih.gov/
o select BLAST (right column)
o select list all genomic species
Q2. Explore the species in GenBank by clicking on the organism groups (primates, rodents,
monotremes, marsupials, invertebrates, protozoa, plants, fungi, etc). List the scientific name
(Genus and species) of the following organisms.
Human
Chimpanzee
Mouse
Zebrafish
Fruit fly
Maize (corn)
Baker's yeast
o Return and select protein BLAST (searches the protein database with an amino acid query)
o Enter the amino acid sequence of protein 1 in FASTA format (see above)
o Select the nr database (non-redundant protein sequences). Scroll down and click on BLAST.
View the color key for alignment scores. A score above 50 will be considered relevant in this
exercise. Scroll down to sequences producing significant alignments and view the sequence
that produced the strongest hit.
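The same protein BLAST search can in principle be automated rather than entered through the web form. The sketch below only builds the submission URL and does not send it. The endpoint and the parameter names (CMD, PROGRAM, DATABASE, QUERY) are given here as assumptions based on NCBI's BLAST URL API and should be checked against the current NCBI documentation before use; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint of the NCBI BLAST URL API; verify before relying on it.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blastp_submission(sequence: str, database: str = "nr") -> str:
    """Build (but do not send) a URL that submits a blastp search."""
    params = {
        "CMD": "Put",             # submit a new search
        "PROGRAM": "blastp",      # protein query against a protein database
        "DATABASE": database,     # nr = non-redundant protein sequences
        "QUERY": "".join(sequence.split()),  # sequence without spaces
    }
    return BLAST_URL + "?" + urlencode(params)


print(build_blastp_submission("MSCYNPCLPC QPCGPTPLAN"))
```

For this exercise the web form is all you need; the sketch is only meant to show that the form fields correspond to a program (blastp), a database (nr) and a query sequence.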
Q3. The sequence that has the best match has the highest score. What is the probable identity of
the protein?
Q4. Select the link to the left of the sequence description (begins with sp or ref).
What is the source of the protein? List genus and species AND common name.
Q5. Scroll down to examine the amino acid sequence alignment. How many amino acids long is
the protein?
Q6. Examine the amino acid sequence carefully to locate your original query sequence within
it. Which amino acids (from what position to what position in the protein) correspond to your
query sequence? ____________ to _____________
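For Q6, positions in a protein are counted from 1, starting at the first amino acid. The sketch below shows how the start and end positions of an exact match can be found. The function name and the toy sequence are hypothetical and not taken from any real protein record.

```python
def locate_query(protein: str, query: str):
    """Return the 1-based (start, end) positions of query within protein.

    Returns None when the query does not occur as an exact substring.
    """
    start = protein.find(query)      # 0-based index, or -1 if absent
    if start == -1:
        return None
    return start + 1, start + len(query)


# Toy example: the query "MSCY" sits at positions 5 to 8 of a made-up sequence.
print(locate_query("AAAAMSCYGG", "MSCY"))  # (5, 8)
```

An exact search like this only works when the query matches the protein perfectly; BLAST alignments can also report matches that contain mismatches or gaps.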
Determination of protein function
Search Wikipedia to identify the function of protein 1 in the organism you identified. Wikipedia
is sufficiently accurate for the purposes of this bioinformatics exercise.
Q7. Examine the applications and ingredients of the folklore remedy. This will assist you in
determining the function of the protein you have identified. What is the function of this protein
in the folklore remedy? Based on the problem context, would your company pursue this protein?
Identification of Additional Proteins in the Folklore Medicine
Examine the FASTA sequence of protein 2. Use the table on the last page of the handout to
determine the amino acid sequence of this protein.
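The reverse lookup, from one-letter codes to three-letter abbreviations, follows the same table. A minimal sketch is shown below, using a short made-up example rather than protein 2 itself; the function name is hypothetical.

```python
# Three-letter abbreviations, taken from the amino acid table in this handout.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "E": "Glu", "Q": "Gln", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence: str) -> str:
    """Convert a one-letter amino acid string to hyphenated three-letter codes."""
    seq = "".join(sequence.split()).upper()   # ignore spaces between blocks
    return "-".join(THREE_LETTER[aa] for aa in seq)


print(to_three_letter("MKL"))  # Met-Lys-Leu
```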
Q8. What is the amino acid sequence of protein 2 (use 3 letter amino acid abbreviations):
Q9. Use BLAST to determine the identity of this protein. What is the identity of protein 2?
Q10. What is the genus and species of the organism that protein 2 was isolated from?
Q11. What is the common name of the organism?
Q12. Which one of the folklore ingredients did you identify?
Q13. Reread the problem context that describes the uses of the folk remedy. The protein you
identified has a structural, or binding, function. Explain. (Wikipedia can be used.)
Q14. Use protein BLAST to determine the identity of protein 3. Although you may obtain a
number of hits with a score of 100, observe only those that correspond to full sequences (not
partial, or protein chains). What is the identity of protein 3?
Q15. What is the genus and species from which protein 3 was isolated?
Q16. What is the common name of the organism?
Q17. Which one of the folklore ingredients did you identify?
Q18. What is (are) the function(s) of the protein with respect to your interests in the biotech
startup company described in the problem context? (Search Wikipedia and read at least 3 article
links.)
Q19. List 4 journal names in which research biologists have published papers on this protein.
Journal names can be found at the end of each article.
II. SUMMARY
Provide an analysis of your investigation. Of the 3 proteins analyzed, which do you think is
responsible for reducing the risk of breast cancer and heart disease? Which element of the
folklore remedy would you purify and pursue as a drug? What is the basis for your opinion?
What are the roles of the 2 proteins not selected for further study in the folklore remedy?
Amino acid symbols

Amino acid       One letter symbol   Three letter symbol
alanine          A                   Ala
arginine         R                   Arg
asparagine       N                   Asn
aspartic acid    D                   Asp
cysteine         C                   Cys
glutamic acid    E                   Glu
glutamine        Q                   Gln
glycine          G                   Gly
histidine        H                   His
isoleucine       I                   Ile
leucine          L                   Leu
lysine           K                   Lys
methionine       M                   Met
phenylalanine    F                   Phe
proline          P                   Pro
serine           S                   Ser
threonine        T                   Thr
tryptophan       W                   Trp
tyrosine         Y                   Tyr
valine           V                   Val