Bioinformatics Resources and Tools on the Web: A Primer

Joel H. Graber

Center for Advanced Biotechnology

Boston University

Outline

• Introduction: What is bioinformatics?

• The basics

– The five sites that all biologists should know

• Some examples

– Using the tools in a somewhat less-than-naïve manner

• Questions/comments are welcome at all points

• Much of this material comes from the Boston University course: BF527 Bioinformatic Applications ( http://matrix.bu.edu/BF527/ )

What is bioinformatics?

Examples of Bioinformatics

• Database interfaces

– GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, …

• Sequence alignment

– BLAST, FASTA

• Multiple sequence alignment

– Clustal, MultAlin, DiAlign

• Gene finding

– Genscan, GenomeScan, GeneMark, GRAIL

• Protein Domain analysis and identification

– Pfam, BLOCKS, ProDom, …

• Pattern Identification/Characterization

– Gibbs Sampler, AlignACE, MEME

• Protein Folding prediction

– PredictProtein, SWISS-MODEL

Things to know and remember about using web server-based tools

• You are using someone else’s computer

• You are (probably) getting a reduced set of options or capacity

• Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally

Five websites that all biologists should know

• NCBI (The National Center for Biotechnology Information)

– http://www.ncbi.nlm.nih.gov/

• EBI (The European Bioinformatics Institute)

– http://www.ebi.ac.uk/

• The Canadian Bioinformatics Resource

– http://www.cbr.nrc.ca/

• SwissProt/ExPASy (Swiss Bioinformatics Resource)

– http://expasy.cbr.nrc.ca/sprot/

• PDB (The Protein Databank)

– http://www.rcsb.org/PDB/

NCBI ( http://www.ncbi.nlm.nih.gov/ )

• Entrez interface to databases

– Medline/OMIM

– GenBank/GenPept/Structures

• BLAST server(s)

– Five-plus flavors of blast

• Draft Human Genome

• Much, much more…

EBI ( http://www.ebi.ac.uk/ )

• SRS database interface

– EMBL, SwissProt, and many more

• Many server-based tools

– ClustalW, DALI, …

SwissProt ( http://expasy.cbr.nrc.ca/sprot/ )

• Curation!!!

– Error rate in the information is greatly reduced in comparison to most other databases.

• Extensive cross-linking to other data sources

• SwissProt is the ‘gold-standard’ by which other databases can be measured, and is the best place to start if you have a specific protein to investigate

A few more resources to be aware of

• Human Genome Working Draft

– http://genome.ucsc.edu/

• TIGR (The Institute for Genomic Research)

– http://www.tigr.org/

• Celera

– http://www.celera.com/

• (Model) Organism specific information:

– Yeast : http://genome-www.stanford.edu/Saccharomyces/

– Arabidopsis : http://www.tair.org/

– Mouse : http://www.jax.org/

– Fruitfly : http://www.fruitfly.org/

– Nematode: http://www.wormbase.org/

• Nucleic Acids Research Database Issue

– http://nar.oupjournals.org/ (First issue every year)

Example 1: Searching a new genome for a specific protein

• Specific problem: We want to find the closest match in C. elegans of the D. melanogaster protein NTF1, a transcription factor

• First: understand the different forms of BLAST

The different versions of BLAST
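The five classic BLAST programs differ only in the type of the query and the type of the database, and in which of the two is translated. As a minimal sketch, that mapping can be written down as a lookup table; the helper `choose_blast` is a hypothetical name introduced here for illustration and is not part of any BLAST distribution.

```python
# Query and database types for the five classic BLAST programs.
# "translated nucleotide" means the sequence is compared as protein
# after translation in all six reading frames.
BLAST_PROGRAMS = {
    "blastn":  {"query": "nucleotide", "database": "nucleotide"},
    "blastp":  {"query": "protein", "database": "protein"},
    "blastx":  {"query": "translated nucleotide", "database": "protein"},
    "tblastn": {"query": "protein", "database": "translated nucleotide"},
    "tblastx": {"query": "translated nucleotide",
                "database": "translated nucleotide"},
}


def choose_blast(query_is_protein, database_is_protein, translate=True):
    """Hypothetical helper: pick a program for a query/database pairing.

    `translate` only matters when both sides are nucleotide: True compares
    them at the protein level (tblastx), False at the DNA level (blastn).
    """
    if query_is_protein and database_is_protein:
        return "blastp"
    if query_is_protein:
        return "tblastn"
    if database_is_protein:
        return "blastx"
    return "tblastx" if translate else "blastn"


if __name__ == "__main__":
    # Protein query against predicted proteins, then against raw genome.
    print(choose_blast(True, True))
    print(choose_blast(True, False))
```

In the example that follows, the first call corresponds to the protein search and the second to the nucleotide search.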

1st Step: Search the proteins

• blastp is used to search for C. elegans proteins that are similar to NTF1

• Two reasonable hits are found, but the hits have suspicious characteristics

– besides the fact that they weren’t included in the complete genome!

2nd Step: Search the nucleotides

• tblastn is used to search for translations of C. elegans nucleotide sequences that are similar to NTF1

• Now we have only one hit

– How are they related?
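A tblastn search compares a protein query against the database nucleotide sequences translated in all six reading frames, so it does not depend on anyone's gene predictions. The sketch below shows only that translation step, using the standard genetic code; it is an illustration of the idea, not the BLAST implementation, and the function names are chosen here for convenience.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Map frame labels (+1..+3, -1..-3) to their translations."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for label, peptide in six_frame_translations("ATGGCCATTGTAATG").items():
        print(label, peptide)
```

Because every frame is searched, a region that was never annotated as a gene, or was split between two annotations, can still produce a single continuous hit.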

Conclusion: Incorrect gene prediction/annotation

• The two predicted proteins have essentially identical annotation

• The protein-protein alignments are disjoint and consecutive on the query protein (NTF1)

• The protein-nucleotide alignment includes both protein-protein alignments in the proper order

• Why/how does this happen?
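The reasoning in the bullets above can be sketched as a simple interval check: two protein hits that cover adjacent, non-overlapping parts of the query, both contained in one translated-nucleotide hit, suggest a single gene annotated as two. The function name, the `max_gap` tolerance, and the coordinates in the usage lines are all assumptions made up for illustration; they are not the actual NTF1 results.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share any position."""
    return a[0] <= b[1] and b[0] <= a[1]


def looks_like_split_gene(protein_hits, translated_hit, max_gap=20):
    """Heuristic check for one gene annotated as two predicted proteins.

    protein_hits:   two (start, end) intervals on the query protein,
                    one per blastp hit.
    translated_hit: (start, end) interval on the query covered by the
                    single tblastn hit.
    max_gap:        assumed tolerance, in residues, for the uncovered
                    stretch between the two blastp hits.
    """
    first, second = sorted(protein_hits)
    disjoint = not overlaps(first, second)
    consecutive = 0 <= second[0] - first[1] - 1 <= max_gap
    spanned = (translated_hit[0] <= first[0]
               and second[1] <= translated_hit[1])
    return disjoint and consecutive and spanned


if __name__ == "__main__":
    # Hypothetical query coordinates, for illustration only.
    print(looks_like_split_gene([(40, 210), (215, 480)], (38, 482)))
    print(looks_like_split_gene([(40, 300), (250, 480)], (38, 482)))
```

A positive result is only a hint; the gene prediction check on the next slide is still needed to support the conclusion.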

Final(?) Check: Gene prediction

• Genscan is the best available ab initio gene predictor

– http://genes.mit.edu/GENSCAN.html

• Genscan’s prediction spans both protein-protein alignments, reinforcing our conclusion of a bad prediction

Ab initio vs. similarity vs. hybrid models for gene finding

• Ab initio : The gene looks like the average of many genes

– Genscan, GeneMark, GRAIL…

• Similarity: The gene looks like a specific known gene

– Procrustes,…

• Hybrid: A combination of both

– GenomeScan ( http://genes.mit.edu/genomescan/ )

A similar example: Fruitfly homolog of mRNA localization protein VERA

• Similar procedure as just described

– A tblastn search with BLOSUM45 produces an unexpected exon

• Conclusion: Incomplete (as opposed to incorrect) annotation

– We have verified the existence of the rare isoform through RT-PCR

Another example: Find all genes with PDZ domains

• Multiple methods are possible

• The ‘best’ method will depend on many things

– How much do you know about the domain?

– Do you know the exact extent of the domain?

– How many examples do you expect to find?

Some possible methods if the domain is a known domain:

• SwissProt

– text search capabilities

– good annotation of known domains

– crosslinks to other databases (domains)

• Databases of known domains:

– BLOCKS ( http://blocks.fhcrc.org/ )

– Pfam ( http://pfam.wustl.edu/ )

– Others (ProDom, ProSite, DOMO,…)

Determination of the nature of conservation in a domain

• For new domains, multiple alignment is your best option

– Global: ClustalW

– Local: DiAlign

– Hidden Markov Model: HMMER

• For known domains, this work has largely been done for you

– BLOCKS

– Pfam

If you have a protein and want to search it against known domains

• Search/Analysis tools

– Pfam

– BLOCKS

– PredictProtein ( http://cubic.bioc.columbia.edu/predictprotein/predictprotein.html )

Different representations of conserved domains

• BLOCKS

– Gapless regions

– Often multiple blocks for one domain

• Pfam

– Statistical model, based on HMM

– Since gaps are allowed, most domains have only one Pfam model
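To make the gapless representation concrete, the sketch below turns a small block of equal-length segments into a position-specific scoring matrix and slides it along a sequence. It is a simplified illustration under stated assumptions: a uniform background, a flat pseudocount, and toy segments invented for this example. The BLOCKS database uses more careful sequence weighting, and Pfam models are profile HMMs that also score insertions and deletions, which this sketch does not attempt.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def block_to_pssm(block, pseudocount=1.0):
    """Build log-odds scores per column from a gapless block.

    block: list of equal-length segments (no gap characters).
    Assumes a uniform amino acid background, for simplicity.
    """
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("a gapless block needs segments of equal length")
    background = 1.0 / len(ALPHABET)
    total = len(block) + pseudocount * len(ALPHABET)
    pssm = []
    for col in range(width):
        counts = Counter(segment[col] for segment in block)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm


def best_window(pssm, sequence):
    """Return (score, start) of the best-scoring ungapped window.

    The sequence must be at least as long as the block.
    """
    width = len(pssm)
    scored = [
        (sum(pssm[i].get(sequence[start + i], 0.0) for i in range(width)),
         start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


if __name__ == "__main__":
    toy_block = ["GLGF", "GLGF", "GYGF", "GFGF"]  # made-up segments
    score, start = best_window(block_to_pssm(toy_block), "MKTAGLGFSIV")
    print(start, round(score, 2))
```

Since a window must match the block column for column, an insertion inside the domain breaks the match, which is why a gapped domain is often described by several blocks but by a single HMM.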

Conclusions

• We have only touched small parts of the elephant

• Intelligent trial and error is often your best tool

• Keep up with the main five sites, and you’ll have a pretty good idea of what is happening and available
