Lect 5: Bioinformatics software

Essential Bioinformatics and Biocomputing
(LSM2104: Section I)
Biological Databases and
Bioinformatics Software
Prof. Chen Yu Zong
Tel: 6874-6877
Email: csccyz@nus.edu.sg
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
January 2003
Lecture 5: Bioinformatics software
Outline:
– Types of bioinformatics software
• Sequence, pattern and domain
• Evolutionary analysis
• Visualization
• Modeling and prediction (sequence, structure and function)
• Data mining (bibliographic and text searches)
– Examples
Types of Bioinformatics software
1. Analysis of biological data/systems and characterization of
molecules and sequences.
2. Analysis and interpretation of experimental results
3. Simulation of laboratory experiments, important for tackling
large scale problems
4. Predictions that lead to the design of experiments
5. Bioinformatics software can be accessed via WWW, or
through integrated software packages (such as Emboss,
GCG, Staden, DNAstar, …). It may be coupled with
databases, or may stand alone.
Bioinformatics software
Major sources
• Software package at ExPASy Molecular Biology
Server http://www.expasy.org ; http://au.expasy.org
• Software at PBIL Bio-Informatique Lyonnais
http://pbil.univ-lyon1.fr/
• Toolbox at EBI European Bioinformatics Institute
http://www.ebi.ac.uk/Tools/index.html
Bioinformatics software
• Major types of bioinformatics tools
• Sequence analysis tools
• Sequence comparison
• Pattern and domain search
• Evolutionary analysis
• Prediction of sequence structure and function
• Visualization of molecular structures
• Structure modeling
• Bibliographic and text searches
• Specialized and other tools
Bioinformatics software
Sequence analysis tools
This kind of software focuses on extraction and
comparison of properties in DNA and protein sequences
– Sequence analysis helps identify domains, structure,
function, and other properties
– The analysis of individual sequences helps with sequence
comparison
• Textbook chapter 5, pages 81-93
Bioinformatics software
Sequence analysis tools
This kind of software focuses on extraction and
comparison of DNA and protein sequence
properties such as
– composition of nucleotide or protein sequences
– codon usage in DNA
– translation and backtranslation
Textbook chapter 5, pages 81-93
Bioinformatics software
Composition of nucleotide or protein sequences
• Composition (frequency of occurrence of a nucleotide or
of an amino acid) is the most basic analysis. It can give us
important functional and structural clues.
• For example, CG-rich regions called CpG islands are
often found in promoters. A short region just before the
splice site at the end of introns often has high C+T
content.
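The composition analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any of the tools named in this lecture; the function names and the example sequence are made up for the purpose.

```python
from collections import Counter


def composition(seq):
    """Return the relative frequency of each symbol in a DNA or protein sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {symbol: count / len(seq) for symbol, count in counts.items()}


def gc_content(dna):
    """Return the fraction of G and C nucleotides in a DNA sequence."""
    dna = dna.upper()
    if not dna:
        raise ValueError("sequence must not be empty")
    return (dna.count("G") + dna.count("C")) / len(dna)


# Hypothetical example sequence (not taken from any database).
example = "ATGCGCGCTA"
print(composition(example))
print(gc_content(example))
```

A high G+C fraction over a window of a genomic sequence is one simple hint that the window may lie in a CG-rich region; real CpG island finders use additional criteria.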
Bioinformatics software
Composition of protein and DNA sequences
• Web:
– NPS@ Network Protein Sequence @nalysis
http://npsa-pbil.ibcp.fr/ (Amino-acid composition)
– AA Composition
http://molbiol.soton.ac.uk/compute/aacomp.html
• JEMBOSS (in our own laboratory)
– http://srs1.bic.nus.edu.sg/jnlp/ (nucleic, composition,
compseq)
Bioinformatics software
Codon usage in DNA
• Web:
– Count-codon program in Codon Usage Database
http://www.kazusa.or.jp/codon/countcodon.html
(needs start and stop codons at the start and the end
of the sequence)
– Tool for Gene to Codon Usage Table
http://www.entelechon.com/eng/genetocut.html
(does not care about start and stop codons)
• JEMBOSS (in the laboratory)
– http://srs1.bic.nus.edu.sg/jnlp/ (nucleic, codon usage,
cusp)
A DNA coding region should contain only one stop codon, at its end.
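Counting codons can be sketched as follows. This is a simplified illustration under stated assumptions: the input is a coding sequence read in frame from its first base, the standard stop codons (TAA, TAG, TGA) apply, and the function names and example sequences are hypothetical. It does not reproduce the behaviour of the count-codon tool or of cusp.

```python
from collections import Counter

STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def split_codons(cds):
    """Split an in-frame coding sequence into a list of codons."""
    cds = cds.upper()
    if len(cds) % 3 != 0:
        raise ValueError("coding sequence length must be a multiple of 3")
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]


def codon_usage(cds):
    """Count how often each codon occurs in the coding sequence."""
    return Counter(split_codons(cds))


def has_single_terminal_stop(cds):
    """True if the only stop codon is the last codon of the sequence."""
    codons = split_codons(cds)
    stops = [i for i, codon in enumerate(codons) if codon in STOP_CODONS]
    return stops == [len(codons) - 1]


# Hypothetical examples.
print(codon_usage("ATGGCTGCTAAATAA"))
print(has_single_terminal_stop("ATGGCTGCTAAATAA"))  # True
print(has_single_terminal_stop("ATGTAAGCTTAA"))     # False: internal stop codon
```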
Bioinformatics software
Translation (DNA to protein) and back translation
(protein to DNA)
• Web:
– Translate tool at ExPASy http://au.expasy.org/tools/dna.html
(DNA to protein)
• JEMBOSS (in the laboratory)
– http://srs1.bic.nus.edu.sg/jnlp/ (DNA to protein and reverse)
(nucleic, translation, transeq; nucleic, translation, backtranseq)
If we translate a DNA sequence and then back translate the resulting
protein, we will typically not recover the starting DNA sequence, because
most amino acids are encoded by more than one codon.
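The loss of information in a translation and back translation round trip can be illustrated with a short sketch. It assumes the standard genetic code and, for back translation, simply picks one arbitrary codon per amino acid; programs such as backtranseq choose codons differently (for example from a codon usage table), so this is only an illustration. The example sequence is made up.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Assumption for illustration: back translate with the first codon listed
# for each amino acid. Any other codon for that amino acid is equally valid.
REVERSE_TABLE = {}
for codon, aa in CODON_TABLE.items():
    REVERSE_TABLE.setdefault(aa, codon)


def translate(dna):
    """Translate an in-frame DNA sequence into a protein ('*' marks a stop)."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def back_translate(protein):
    """Return one of the many DNA sequences that encode the protein."""
    return "".join(REVERSE_TABLE[aa] for aa in protein.upper())


original = "ATGGCCCTGAAGTGA"          # hypothetical example
protein = translate(original)         # "MALK*"
recovered = back_translate(protein)   # "ATGGCTTTAAAATAA"
print(protein, recovered, recovered == original)
```

The recovered DNA differs from the original, yet it translates to the same protein.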
Bioinformatics Software
Sequence comparison (the most important software)
This will be taught next month by A/P Tan Tin Wee.
Web:
• Local alignment (BLAST, FASTA)
– http://www.ebi.ac.uk/fasta33/
– http://www.ncbi.nlm.nih.gov/BLAST/
– http://www.ebi.ac.uk/blast2/
• Multiple alignment (Clustal W)
– http://www.ebi.ac.uk/clustalw/index.html
• JEMBOSS (in the laboratory)
– http://srs1.bic.nus.edu.sg/jnlp/
Local alignment: Smith-Waterman (alignment, local, water)
Global alignment: Needleman-Wunsch (alignment, global, needle)
Bioinformatics software
Evolutionary analysis
• Multiple sequence alignments can be used to estimate the
evolutionary distance between proteins.
Phylogenetic trees are used to represent the
evolutionary distances between sequences.
• WebPhylip
• http://sdmc.krdl.org.sg:8080/~lxzhang/phylip/
• GeneBee
• http://www.genebee.msu.su/services/phtree_reduced.html
Read textbook, page 83.
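One simple way to turn an alignment into distances is the proportion of differing positions (often called the p-distance). The sketch below uses this measure only as an illustration; it is an assumption, not the method used by WebPhylip or GeneBee, which offer more elaborate distance models. The aligned sequences are hypothetical.

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(x != y for x, y in pairs) / len(pairs)


# Hypothetical aligned protein fragments.
alignment = {
    "seqA": "MKTAYIAK",
    "seqB": "MKTAYIGK",
    "seqC": "MRT-YLGK",
}

names = sorted(alignment)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        d = p_distance(alignment[first], alignment[second])
        print(first, second, round(d, 3))
```

A tree-building program would take such a table of pairwise distances as its input.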
Bioinformatics software
Prediction of sequence structure and function
• Sequences that have similar structure often have similar
function. For many sequences we can extract secondary
and tertiary structure from the PDB database.
• What if our sequence is not in the PDB? We can predict
structure of a biological sequence using appropriate
software.
• There are several programs for prediction of secondary
structure. For prediction of tertiary structure we can do
modelling.
• http://npsa-pbil.ibcp.fr (PHD method for secondary
structure prediction)
Bioinformatics software
• Secondary structure prediction:
– The PHD program predicted four alpha helices in
the human IL-2 (red). The number of helices is
correct, but their lengths and boundaries are not
correct (purple).
– When we make a prediction in bioinformatics, we
must have an idea about the accuracy of prediction
programs.
– To assess the accuracy of a program, we can test it
with known data. Our test must have sufficient
examples, so that we can make reasonable
conclusions.
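Testing a predictor against known data can be sketched as a per-residue comparison of the predicted and the experimentally observed secondary structure. The strings below are hypothetical (H = helix, E = strand, C = coil) and are not the IL-2 results shown in the lecture; the scoring function is one common choice, assumed here for illustration.

```python
def per_residue_accuracy(observed, predicted):
    """Fraction of residues whose predicted state matches the observed one."""
    if len(observed) != len(predicted):
        raise ValueError("both strings must cover the same residues")
    if not observed:
        raise ValueError("strings must not be empty")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return matches / len(observed)


# Hypothetical example: the helix and the strand are found, but their
# boundaries are shifted, as in the IL-2 case described above.
observed = "CCHHHHHHCCEEEECC"
predicted = "CHHHHHCCCCEEECCC"
print(per_residue_accuracy(observed, predicted))  # 0.75
```

A single protein gives only one number; averaging over a sufficiently large set of proteins with known structure is what makes the assessment meaningful.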
Bioinformatics software
Secondary structure prediction
• alpha-Lactalbumin, PDB 1A4V
• http://npsa-pbil.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_server.html
Bioinformatics software
• We used nine different programs for prediction of secondary
structure of alpha–Lactalbumin (PDB 1A4V).
• The results show that the best predictions for this molecule
came from “Predator”, while DSC performed worst.
• This test does not mean that Predator is the best of the tested
programs, nor that DSC is the worst. To draw such
conclusions we must first build a test set. The test set should
contain examples from the family of proteins that our
query protein belongs to.
• The learning point – none of the prediction programs (and this
applies across all bioinformatics software, not only secondary
structure prediction) is 100% accurate. The users must be
cautious when interpreting results from the predictive
software.
Bioinformatics software
• Common measures (other measures also exist)
• Sensitivity SE=TP/(TP+FN)
• Specificity SP=TN/(TN+FP)
For example, prediction of binding peptides to a particular receptor:
            Experimental   Predicted     Class
Example 1   Binder         Binder        True positive (TP)
Example 2   Non-binder     Non-binder    True negative (TN)
Example 3   Binder         Non-binder    False negative (FN)
Example 4   Non-binder     Binder        False positive (FP)
• A prediction system that has SE=0.8 and SP=0.9 will correctly predict
8 of 10 experimental positives, and will make one false positive
prediction for every 10 experimental negatives. This accuracy
may be very good for prediction of peptide binding, but is not very
good for some other predictions, for example gene prediction.
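The two measures can be computed directly from the four outcome classes. The sketch below is a minimal illustration; the function names are made up, and the input lists encode the four binder/non-binder examples from this slide (True = binder).

```python
def confusion_counts(experimental, predicted):
    """Count TP, TN, FP and FN from paired experimental and predicted labels."""
    if len(experimental) != len(predicted):
        raise ValueError("label lists must have the same length")
    tp = sum(e and p for e, p in zip(experimental, predicted))
    tn = sum(not e and not p for e, p in zip(experimental, predicted))
    fp = sum(not e and p for e, p in zip(experimental, predicted))
    fn = sum(e and not p for e, p in zip(experimental, predicted))
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """SE = TP / (TP + FN): fraction of experimental positives found."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """SP = TN / (TN + FP): fraction of experimental negatives rejected."""
    return tn / (tn + fp)


# The four examples from the slide, in order.
experimental = [True, False, True, False]
predicted = [True, False, False, True]
tp, tn, fp, fn = confusion_counts(experimental, predicted)
print(sensitivity(tp, fn), specificity(tn, fp))  # 0.5 0.5

# The SE=0.8, SP=0.9 system: 8 of 10 positives found, 1 of 10 negatives missed.
print(sensitivity(tp=8, fn=2), specificity(tn=9, fp=1))  # 0.8 0.9
```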
Bioinformatics software
• Prediction of 3-D structure
• Various modelling programs
– comparative modelling, using known structures as templates
– ab initio modelling, using atomic simulation, residue statistics, etc.
• These methods will be covered later in the course
• An example of comparative modelling software is SWISS-MODEL
http://www.expasy.org/swissmod/SWISS-MODEL.html
• The resulting model is returned by email.
• This tool has a facility for assessing the quality of
predictions
Bioinformatics software
• Software for visualisation of 3-D structures.
It provides different views of a 3-D molecular
structure. This topic will be taught by A/P Shoba.
– Chime, RasMol (they use files in PDB format)
– Scorpion database uses Chime. Chime can be
downloaded from:
http://www.mdli.com/downloads/downloads.html?uid=&key=&id=1
Bioinformatics software
• Text searches
• Text searching software is used in association with
databases. Most commonly we search by keywords or
combinations of keywords.
• Examples of PubMed searches:
– Diabetes: 181,672 matches
– Diabetes AND IDDM: 35,841
– Diabetes AND IDDM AND autoimmunity: 1,109
– Diabetes OR autoimmunity: 190,674
– Diabetes[Title/Abstract]: 114,624
• The last example uses a more advanced PubMed option,
“preview/index”
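The effect of AND and OR on the number of matches can be illustrated with a toy keyword index. This sketch is not how PubMed is implemented; the four "records" are invented, and the functions are hypothetical. It only shows that AND corresponds to a set intersection (fewer matches) and OR to a set union (more matches).

```python
from collections import defaultdict


def build_index(documents):
    """Map each lower-cased word to the set of record ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


# Invented toy records, keyed by a made-up id.
documents = {
    1: "diabetes insulin therapy",
    2: "diabetes IDDM autoimmunity",
    3: "diabetes IDDM genetics",
    4: "autoimmunity thyroid",
}
index = build_index(documents)

diabetes = index["diabetes"]
iddm = index["iddm"]
autoimmunity = index["autoimmunity"]

print(len(diabetes))                        # Diabetes
print(len(diabetes & iddm))                 # Diabetes AND IDDM
print(len(diabetes & iddm & autoimmunity))  # Diabetes AND IDDM AND autoimmunity
print(len(diabetes | autoimmunity))         # Diabetes OR autoimmunity
```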
Bioinformatics software
Summary of Today’s lecture
• Why bioinformatics software?
• Types of software: sequence, motif, evolution, visualization,
structural modeling, simulation, text search.
• Examples of selected software:
– Sequence composition
– DNA-protein sequence translation
– Evolutionary analysis
– Protein secondary structure prediction
– Comparative modeling
– Text search
• To be taught later: Sequence comparison, visualization etc.
Summary of the Section:
Biological databases and bioinformatics software
• We first focused on biological databases. We covered
topics:
– discussed types of biological databases
– briefly described popular databases
– structure of the GenBank and SWISS-PROT entries
– searching biological databases
– types of questions that can be answered by searching
databases
– completeness and errors in the databases
Summary of the Section:
Biological databases and bioinformatics software
• The second topic was bioinformatics software. We
covered:
– why do we need bioinformatics software?
– briefly described major types of bioinformatics software
– described software for sequence composition, codon usage,
translation and backtranslation
– introduced the concept of sequence alignment, evolutionary
analysis
– secondary and tertiary structure prediction, molecular
visualization
– accuracy of prediction software
– text searching