Tree Construction - Cell Wall Genomics


Tutorial on bioinformatics and tree generation at the Cell Wall Genomics website

Bryan Penning

*Supported by the NSF Plant Genome Research and REU Programs

Bioinformatics Goals

• The website already holds a wealth of Arabidopsis thaliana cell wall gene information. We wanted to:

– Add family information about rice and maize Type II cell walls to compare to A. thaliana Type I cell walls

– Add links to outside information on rice genes like we have for A. thaliana

– Include annotated composite trees of A. thaliana, rice and maize gene families

– Add links to sites used to generate the data

– Add the source protein sequences used for our family trees so other researchers can make their own trees that include their genes of interest

– Generate a tutorial on how researchers can make use of the bioinformatics data on our site


Diagram of our bioinformatics approach

[Flowchart: Genes from A. thaliana → BLAST against TIGR → Homologous rice genes → Choose genes → A. thaliana & rice genes → Make tree → Good tree? If yes: Draw rice dendrogram → Publish to website. If no: with too few genes, BLAST other sites; with too many genes, tighten criteria.]

Diagram of the process used to find the genes and draw family trees for cell wall related rice genes. The same approach is used for maize.

Diagram of our bioinformatics approach

[Flowchart: A. thaliana genes, Rice genes, and Maize genes → Draw tree with all family members → Annotate → Publish to web]

Diagram of the process used to integrate cell wall related genes from all three family trees into a composite tree.

BLASTing genes

• To be considerate of the bioinformatics community, given the number of BLASTs to be performed, and to speed the process, we downloaded the text or “flat file” of the TIGR rice protein sequences (available at: http://www.tigr.org/tdb/e2k1/osa1/data_download.shtml) and performed local BLASTs using blastall from NCBI (available at: http://www.ncbi.nlm.nih.gov/BLAST/download.shtml)

• Directions for using these tools are available at the above sites and are beyond the scope of this tutorial

• For a small number of BLASTs, you do not need a programming language such as Perl to automate the process: web-based methods, common programs such as Word and Excel, and any of a number of downloadable tree drawing programs are enough to make these kinds of trees on your own. Web searches take more time, but they work just as well for a few sequences


Web BLASTing

• For smaller numbers of BLASTs to the rice genome, TIGR provides an excellent Web BLAST at: http://tigrblast.tigr.org/eukblast/index.cgi?project=osa1

• You can also use the new BLAST tool at Gramene: http://www.gramene.org/multi/blastview for most cereal sequences

• Note: gene model versions sometimes differ between Gramene and TIGR as one site may update to the latest model before the other


Web BLASTing

• After downloading the protein sequence for Arabidopsis SUD1 (At3g46440) from TIGR, you can BLAST it against the TIGR Rice Pseudomolecules – Protein database using BLASTp


Web BLASTing

• You get a series of “hits” to the gene of interest

• A higher score and a smaller probability indicate a better match to the original gene

• Follow this procedure for every gene in a family to gather the best possible hits, then sort the hits to remove duplicates and choose the best rice matches to the Arabidopsis families

• You can use NCBI’s blastall tool to run multiple BLASTs at once, as we do for this step
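For readers who do want to script the local route, the batch search might be set up roughly as follows. This is only a sketch, assuming the legacy NCBI BLAST tools (formatdb and blastall) are installed and on your path; the file names and the expectation cutoff are placeholders, not the values we used.

```python
import subprocess


def formatdb_command(fasta_path):
    """Command that indexes a protein FASTA file for legacy BLAST."""
    # -p T tells formatdb that the input is protein sequence.
    return ["formatdb", "-i", fasta_path, "-p", "T"]


def blastall_command(query_path, db_path, out_path, evalue="1e-10"):
    """Command for a protein-vs-protein search with tabular output."""
    return [
        "blastall",
        "-p", "blastp",   # protein query against a protein database
        "-d", db_path,    # database built by formatdb
        "-i", query_path, # FASTA file holding all query sequences
        "-o", out_path,
        "-e", evalue,     # expectation cutoff (placeholder value)
        "-m", "8",        # tabular output, one hit per line
    ]


def run_local_blast(query_path, db_path, out_path):
    """Index the database, then search every query in one run."""
    subprocess.run(formatdb_command(db_path), check=True)
    subprocess.run(blastall_command(query_path, db_path, out_path), check=True)
```

Calling run_local_blast("family_queries.fasta", "rice_proteins.fasta", "hits.txt") with your own file names would search all query sequences in one pass.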


Organizing BLASTs

• This is a Word document generated by BLASTing SUD1 and SUD2 of Arabidopsis against the TIGR Rice Protein database

• The hits were copied into Word, set to the font Courier New, 9 pt, and saved as a text-only document (to remove the HTML code)

• The file was reloaded in Word and converted to a table (Table menu) using Other and the character | (Shift + \) to separate the columns


Organizing BLASTs

• The Word file is copied into Excel and the Data – Sort menu is used to sort by the first column

• This brings all of the same named genes together (the two highlighted lines for example)

• Duplicate genes are removed from the spreadsheet, and only the far-right column of (LOC_Osxxgxxxxxx) tags is copied back to Word
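The same sort-and-remove-duplicates step can be sketched in a few lines of Python for anyone who prefers a script to Word and Excel. The sketch assumes each hit line uses | between columns and carries the LOC_Os tag in its last column; the sample lines in the usage note are invented for illustration and are not real BLAST output.

```python
import re

# TIGR rice locus tags look like LOC_Os05g29990.
LOCUS_TAG = re.compile(r"LOC_Os\d{2}g\d{5}")


def unique_locus_tags(hit_lines):
    """Return the sorted, de-duplicated locus tags found in hit lines."""
    tags = set()
    for line in hit_lines:
        # Mirror the Word table step: split the line into columns on "|".
        columns = line.split("|")
        match = LOCUS_TAG.search(columns[-1])
        if match:
            tags.add(match.group(0))
    return sorted(tags)
```

For example, passing the three made-up lines "a|protein|desc|LOC_Os05g29990 500", "b|protein|desc|LOC_Os01g00001 300", and a repeat of the first would return the two distinct tags in sorted order.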


Organizing BLASTs

• You can use the Table menu to convert the table to text (Paragraph Marks) to generate a list of genes

• These genes can be searched through a downloaded database using the NCBI fastacmd (included in the BLAST download tools), or you can search them one at a time using a web-based database such as the locus name search on TIGR: ( http://www.tigr.org/tdb/e2k1/osa1/LocusNameSearch.shtml )
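If fastacmd is not available, pulling the chosen sequences out of the downloaded flat file can also be done with a short script. The following is a minimal sketch, assuming the flat file is in FASTA format and that each header line contains the locus tag somewhere in its text; the function names are our own for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def select_records(lines, wanted_tags):
    """Keep the records whose header mentions any wanted locus tag."""
    wanted = set(wanted_tags)
    return [
        (header, sequence)
        for header, sequence in read_fasta(lines)
        if any(tag in header for tag in wanted)
    ]
```

With an open file handle, select_records(handle, tags) returns the matching records in the order they appear in the file.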


Generating a tree

• Once you have found all of your sequences, check that each sequence name has a > in front of it (denoting a new sequence name) and that the sequence starts on a new line

• Copy and paste all of your sequences into an alignment program like ClustalW (we use: http://align.genome.jp/ from the Kyoto University Bioinformatics Center, but any ClustalW program will work)
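Before pasting, the formatting check described above can be automated. This is a rough sketch under the assumption that your sequences are in FASTA format, where each name line begins with > and the sequence follows on the next line; line numbers in the messages count non-blank lines only.

```python
def fasta_problems(text):
    """Return a list of formatting problems found in FASTA text."""
    problems = []
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        problems.append("first non-blank line is not a '>' name line")
    previous_was_name = False
    for number, line in enumerate(lines, start=1):
        if line.startswith(">"):
            if previous_was_name:
                problems.append(f"line {number}: name line follows a name with no sequence")
            previous_was_name = True
        else:
            if ">" in line:
                # A '>' inside sequence text means a name did not start its own line.
                problems.append(f"line {number}: '>' found inside a sequence line")
            previous_was_name = False
    if previous_was_name:
        problems.append("last name line has no sequence")
    return problems
```

An empty list means the text passed these checks; it does not guarantee that the alignment program will accept every character in the sequences.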


Generating a tree

• For our trees we use: Slow/Accurate pair-wise comparisons and Gonnet for our Weight Matrix (two spots on the website)

• Click execute alignment to get your sequence alignment

• At the end of the alignment page will be the information needed for tree drawing programs

• You can click on clustal.dnd for a quick tree, or take the information after it (a Newick format tree), copy it into a new Word file, and save it as a text file (include all parentheses)
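Because a missing parenthesis is an easy copy-and-paste mistake, a quick check of the saved tree text can help before loading it into a tree drawing program. The sketch below assumes a plain Newick string with unquoted names; the branch lengths in the usage note are invented.

```python
import re


def newick_is_balanced(text):
    """True if parentheses pair up and the tree ends with a semicolon."""
    depth = 0
    for character in text:
        if character == "(":
            depth += 1
        elif character == ")":
            depth -= 1
            if depth < 0:
                return False  # a ")" appeared before its "("
    return depth == 0 and text.strip().endswith(";")


def leaf_names(text):
    """List the sequence names, which follow "(" or "," in Newick text."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", text)
```

For instance, "((At3g46440:0.1,LOC_Os05g29990:0.2):0.05,CV523101_wheat:0.3);" passes the check and yields its three names, while the same text with the first parenthesis dropped fails.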


Creating a tree

• We use the program TreeDyn to generate our trees (available at: http://www.treedyn.org/ )

• This is an example of the Arabidopsis and rice 1.1 family

• The tree text file was loaded into TreeDyn and the frame enlarged

• The Arabidopsis sequences were colored red by changing the font color to red and using the find panel to find all At* sequences (which turn red)

• The scale at the bottom was added by right clicking on that space and choosing the tree name, annotation, and scale sub-menus

• This square tree is useful for seeing associations among genes from different species


Square tree example

• This is part of the family 1.1 square dendrogram of Arabidopsis, rice and maize from our website

• The red names are Arabidopsis sequences, the black names are rice, and the green names are maize

• Regions alternate between grey shaded and white backgrounds (added with Photoshop) to indicate clades of genes with similar sequences, which may be related in function (such as AUD/SUD or GME, etc.)


Radial dendrograms

• TreeDyn can also draw radial dendrograms such as the one shown for rice family 1.1

• This can be done by right clicking on the tree area to bring up the grey box in TreeDyn, choosing your tree, then Conformation-Radial

• TreeDyn allows you to resize, rotate, and flip clades around (see http://www.treedyn.org/ for detailed tutorials on these processes)

• For our site, we export the radial trees as jpeg images

Finishing a radial dendrogram

The TreeDyn tree jpeg is finished as a FLASH file where the ovals and family names are added (Rice family 1.1 shown)


Each individual clade of a family tree is also prepared in TreeDyn and link buttons added later in FLASH (AUD/SUD-like shown)


Viewing your gene of interest

• We provide protein sequence files that you can download and combine with your own sequence of interest for comparison with these three species

• Under each tree (family 1.1 shown) is the link “View the protein sequence file”

• Right click and choose Save Target as… to download the sequence with a filename and location you will remember

• You can do this for each Arabidopsis, rice, and maize family


Viewing your gene of interest

• You may have a sequence you think is related to a particular family, such as the nucleotide interconversion pathway (family 1.1)

• For example, the wheat EST CV523101 from GenBank: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=CV523101 might be related to the TIGR rice gene Os05g29990 in the AUD/SUD clade of family 1.1, according to information from Gramene


Viewing your gene of interest

• You can take the nucleotide sequence and convert it to protein sequence using a program such as GeneMark: ( http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi )

• Protein sequence returned:

>CV523101_wheat
IARIFNTYGPRMCIDDGRVVSNFVAQALR
KEPLTVYGDGKQTRSFQYVSDLVEGLMRL
MEGDHIGPFNLGNPGEFTMLELAKVVQDT
IDPNARIEFRENTQDDPHKRKPDITKAKE
QLGWEPKIALRDGLPLMVTDFRKRIFGDQ
DSAATATEG


Viewing your gene of interest

• Paste all of the sequences for family 1.1 (Arabidopsis, rice, and maize) plus the wheat EST converted to protein, CV523101_wheat, into a ClustalW program such as: http://align.genome.jp/ from the Kyoto University Bioinformatics Center

• Perform the multiple alignment, copy the Newick tree data generated into a new Word file, and save it as a text file as previously shown
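If you would rather build the combined input in a script than by pasting, appending your own sequence to a downloaded family file might look like the sketch below. It assumes the family file is FASTA text; the function names and the 60-character line width are our own choices for illustration.

```python
import textwrap


def fasta_record(name, sequence, width=60):
    """Format one sequence as a FASTA record with wrapped lines."""
    body = "\n".join(textwrap.wrap(sequence, width))
    return ">" + name + "\n" + body + "\n"


def add_sequence(family_fasta, name, sequence):
    """Append a new record to existing FASTA text."""
    # Make sure the new name line starts on its own line.
    if family_fasta and not family_fasta.endswith("\n"):
        family_fasta += "\n"
    return family_fasta + fasta_record(name, sequence)
```

The returned text can then be pasted into the ClustalW form or written to a file for a local ClustalW run.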

Viewing your gene of interest

The AUD/SUD clade of the family 1.1 tree for Arabidopsis (red), rice (black), maize (green), and a wheat EST (blue), added to demonstrate how you can visualize the relatedness of your own genes using our protein sequences

• Taking the Newick tree from ClustalW into TreeDyn as previously shown will allow you to visualize the tree

• The AUD/SUD clade of the tree generated by TreeDyn shows that the wheat EST (in blue) is most closely related to the rice gene Os05g29990 in the AUD clade
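The same question, which genes sit closest to your sequence, can also be asked of the Newick text directly. The sketch below is one possible approach, not what TreeDyn does internally: it reads only the branching order (branch lengths are ignored), assumes a well-formed tree with unquoted names, and reports the other members of the smallest clade that contains your sequence.

```python
def parse_newick(text):
    """Parse Newick text into nested lists of leaf names."""
    text = "".join(text.split()).rstrip(";")
    position = 0

    def skip_label():
        # Skip an internal label and/or ":length" after a node.
        nonlocal position
        while position < len(text) and text[position] not in ",()":
            position += 1

    def node():
        nonlocal position
        if text[position] == "(":
            position += 1
            children = [node()]
            while text[position] == ",":
                position += 1
                children.append(node())
            position += 1  # step over the closing ")"
            skip_label()
            return children
        start = position
        while position < len(text) and text[position] not in ",():":
            position += 1
        name = text[start:position]
        skip_label()
        return name

    return node()


def leaves(tree):
    """Flatten a parsed tree into its leaf names."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in leaves(child)]


def closest_clade(tree, target):
    """Other leaves in the smallest clade that contains target."""
    if isinstance(tree, str):
        return None
    for child in tree:
        if target in leaves(child):
            inner = closest_clade(child, target)
            if inner is not None:
                return inner
            return [name for name in leaves(tree) if name != target]
    return None
```

On a tree where CV523101_wheat and Os05g29990 are paired, closest_clade(parse_newick(tree_text), "CV523101_wheat") would return a list holding only the rice gene.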


Bioinformatics sites used

• General

– Multiple alignment for trees, ClustalW ( http://align.genome.jp/ )

– Making trees, TreeDyn ( http://www.treedyn.org/ )

– BLASTing, NCBI ( http://www.ncbi.nlm.nih.gov/BLAST/ )

– Proteins translated by GeneMark ( http://opal.biology.gatech.edu/GeneMark/eukhmm.cgi )

• Rice

– Sequence BLAST using TIGR ( http://www.tigr.org/tdb/e2k1/osa1/ )

– Downloading rice protein sequences from TIGR ( http://www.tigr.org/tdb/e2k1/osa1/LocusNameSearch.shtml )

• Maize

– Sequence BLAST using TIGR ZmGI ( http://www.tigr.org/tigrscripts/tgi/T_index.cgi?species=maize )

– Sequence BLAST using Gramene ( http://www.gramene.org/multi/blastview )
