Structural phylogenomic inference of protein function


Kimmen Sjölander

Extend function prediction through inclusion of structure prediction and analysis

Anti-fungal defensin (Radish)

Drosomycin (Drosophila)

Scorpion toxin

Predict active site & subfamily specificity positions

VirB4

Annotation transfer by homology

• Status quo approach to protein function prediction

– Given a gene (or protein) of unknown function

• Run BLAST to find homologs

• Identify the top BLAST hit(s)

• If the score is significant, transfer the annotation

– If resources permit, predict domains using PFAM or CDD

• Problems:

– Approach fails completely for ~30% of genes

– Of those with annotations, only 3% have any supporting experimental evidence

• 97% have had functions predicted by homology alone*

– High error rate

* Based on analysis of >300K proteins in the UniProt database
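The status quo procedure above can be sketched in a few lines, which also makes its weakness visible: the annotation of the top hit is copied with no check of domain architecture or orthology. This is a minimal sketch, not a real BLAST wrapper; the `Hit` record, the subject identifiers, and the `max_evalue` cutoff of 1e-5 are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One database hit (hypothetical record; fields chosen for illustration)."""
    subject_id: str
    evalue: float
    annotation: str


def transfer_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Copy the annotation of the best-scoring hit if it is significant.

    Returns None when there are no hits or the best hit is not significant.
    The cutoff is an assumed value, not one stated in the slides.
    """
    if not hits:
        return None
    best = min(hits, key=lambda h: h.evalue)  # lowest E-value = top hit
    return best.annotation if best.evalue <= max_evalue else None


# Toy hit list: the top hit wins even if it shares only one domain with the query.
hits = [
    Hit("hypothetical_subject_1", 1e-40, "receptor-like kinase"),
    Hit("hypothetical_subject_2", 1e-12, "LRR disease resistance protein"),
]
print(transfer_annotation(hits))
```

Nothing in this procedure asks whether the top hit matches the query over its full length or is an ortholog, which is the failure mode the following slides illustrate.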

Tomato Cf-2 Bioinformatics Analysis

Domain fusion and fission events complicate function prediction by homology, particularly for very common domains (e.g., LRR regions).

Domain structure analysis (e.g., PFAM) is often critical.

Tomato Cf-2 (GI:1587673)

Dixon, Jones, Keddie, Thomas, Harrison and Jones JDG, Cell (1996)

BLAST against Arabidopsis

Top BLAST hit in Arabidopsis is an RLK!

Panther

PFAM results

Errors due to domain shuffling

(sic)

Error presumably due to non-orthology of database hits used for annotation

Phylogenetic analysis suggests it’s more likely a Biogenic Amine GPCR

Human neutral sphingomyelinase or bacterial isochorismate synthase?

Database annotation errors

Main sources of annotation errors:

1. Domain shuffling

2. Gene duplication (failure to discriminate between orthologs and paralogs)

3. Existing database annotation errors

Propagation of existing database annotation errors

Errors in gene structure

Contamination

Other…

Galperin and Koonin, “Sources of systematic error in functional annotation of genomes: domain rearrangement, nonorthologous gene displacement and operon disruption.” In Silico Biol. 1998

Phylogenomic inference

Eisen, “Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis,” Genome Research 1998

Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges,” Bioinformatics 2004

Piet Hein, Grooks


There is nothing more difficult to take in hand, more perilous to conduct, or more uncertain in its success, than to take the lead in the introduction of a new order of things.

Because the innovator has for enemies all those who have done well under the old conditions, and lukewarm defenders in those who may do well under the new.

This coolness arises partly from the incredulity of men, who do not readily believe in new things until they have had a long experience of them.

Construction of genome-scale phylogenomic libraries

Cluster genome into global homology groups

Include homologs from other species

Construct multiple sequence alignment

Predict cellular localization.

Predict protein structure

Predict key residues

Deposit book in library

Construct HMMs for the family and for individual subfamilies.

Construct phylogenetic trees.

Overlay with annotation data.

Identify subfamilies.

Retrieve key literature

Berkeley Universal Proteome

Phylogenomic Explorer

9,707 protein family “books” and 708K HMMs, expanding daily: http://phylogenomics.berkeley.edu/UniversalProteome

Protein fold prediction

12% identity

VirB4 TrwB structure (1E9RA)

Active site

Example Book: Voltage-gated K+ channels

SCI-PHY subfamilies supported by ML tree, and also consistent with subtype and phylogenetic distribution (only one branch of ML tree displayed)

GO annotations for Shal subfamily

Database queries

Look up protein family “books” based on the annotations associated with any sequence.

Queries can be based on GO biological process, PFAM domains, UniProt accession numbers, etc.

Key algorithms in PhyloFacts library construction

What clustering methods are appropriate for inference of protein function?

What alignment methods are accurate?

How to mask?

What tree methods to use?

How to root a tree?

Can we define functional subfamilies automatically?

Fraction superposable positions drops with evolutionary divergence

Column groups: pairwise alignment (BLAST, ClustalW, Tcoffee); MSA-pw (ClustalW, MAFFT, MUSCLE); sequence-profile methods (HMM, TreeHMM, TreeHMM-Opt).

%ID     #pair   %Superpos   BLAST   ClustalW   Tcoffee   ClustalW (MSA-pw)   MAFFT   MUSCLE
>70     107     90.6        0.954   0.955      0.955     0.955               0.954   0.954
50-70   63      87.2        0.862   0.903      0.894     0.901               0.919   0.911
40-50   46      83.4        0.824   0.872      0.855     0.856               0.862   0.846
30-40   65      85.4        0.811   0.874      0.867     0.87                0.892   0.925
25-30   41      82.1        0.779   0.782      0.788     0.795               0.837   0.836
20-25   53      77.9        0.612   0.599      0.627     0.633               0.678   0.661
15-20   84      73
10-15   151     64.4
5-10    204     50.4
0-5     122     39.5

Scores for the 15-20, 10-15, 5-10 and 0-5 bins (pairwise and MSA-pw methods), in source order: 0.381, 0.451, 0.457, 0.16, 0.186, 0.234, 0.49, 0.302, 0.496, 0.35, -0.007, -0.014, 0, -0.047, 0.098, -0.033, -0.049, -0.051, -0.034, -0.024

Scores for HMM, TreeHMM and TreeHMM-Opt, in source order: 0.951, 0.954, 0.96, 0.903, 0.855, 0.899, 0.868, 0.727, 0.904, 0.855, 0.892, 0.866, 0.728, 0.554, 0.351, 0.578, 0.387, 0.572, 0.363, 0.075, 0.096, 0.085, -0.022, -0.026, -0.025

Unlabeled values following the table, in source order: 0.929, 0.934, 0.953, 0.91, 0.813, 0.72, 0.551, 0.29, 0.127

FlowerPower

Clustering global (or glocal) homologs

Minimize profile drift

Improved alignment accuracy

Nandini Krishnamurthy, Ph.D.

Step 1: Construct SearchDB (Q = query)

Construct SearchDB using PSI-BLAST against target database

Step 2: Select and align core set.

Inclusion criteria:

E-value 1e-10

Bi-directional coverage

MUSCLE multiple alignment (Edgar, 2003)

Step 3: Run SCI-PHY to identify subfamilies and build subfamily HMMs (SHMMs)

BETE subfamily identification: Sjölander 1998

SHMM construction: Brown et al, 2004

Step 4: SHMMs compete for sequences from SearchDB. Sequences meeting criteria are aligned to their closest SHMM.

Step 5: Run SCI-PHY on extended alignment to identify new subfamilies and construct SHMMs.

Iterate until convergence
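The Step 2 inclusion criteria (an E-value cutoff plus bi-directional coverage) can be illustrated with a small filter. This is a sketch under stated assumptions: the slides give the E-value cutoff of 1e-10 but no coverage threshold, so `min_coverage=0.7` and the function name `passes_inclusion` are hypothetical.

```python
def passes_inclusion(evalue: float, aligned_len: int, query_len: int, hit_len: int,
                     max_evalue: float = 1e-10, min_coverage: float = 0.7) -> bool:
    """Return True if a hit meets the E-value cutoff and covers both sequences.

    Bi-directional coverage: the aligned region must span a large fraction of
    the query AND of the hit, which rejects hits sharing only one domain.
    min_coverage is an assumed value, not taken from the slides.
    """
    query_coverage = aligned_len / query_len
    hit_coverage = aligned_len / hit_len
    return (evalue <= max_evalue
            and query_coverage >= min_coverage
            and hit_coverage >= min_coverage)


# A global homolog: the alignment spans most of both sequences.
print(passes_inclusion(1e-30, aligned_len=280, query_len=300, hit_len=310))
# A local (single-domain) match: significant, but covers a third of the hit.
print(passes_inclusion(1e-30, aligned_len=280, query_len=300, hit_len=900))
```

Requiring coverage in both directions is what restricts the cluster to global homologs rather than to any protein that happens to share a domain with the query.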

Comparing FlowerPower, BLAST, PSI-BLAST and UCSC T2K

Test: Clustering global homologs

Agreement at domain structure determined by PFAM. SCOP used to cluster PFAM domains into structural equivalence classes.

Subfamily Classification In PHYlogenomics (SCI-PHY)

Seq1 LE R Y-K

Seq2 LD R FPR

Seq3 IE R YGK

Seq4 MD R F-K

Seq5 VE R YGK

Nandini Krishnamurthy, Ph.D.

Duncan Brown

Multiple sequence alignment

Phylogenetic tree & subfamily decomposition (leaf order in figure: 5, 3, 1, 4, 2)

Agglomerative clustering

Input: MSA

Initialize: construct a profile [1] for each row in the MSA

While (#clusters > 1) {

Join the closest [2] pair of clusters

Re-estimate the profile [1]

Compute the encoding cost for this stage

} /* cut tree using minimum encoding cost */

[1] Use Dirichlet mixture densities

[2] Distance function: relative entropy

Sjölander, K., “Phylogenetic inference in protein superfamilies: Analysis of SH2 domains,” Proceedings of the Conference on Intelligent Systems for Molecular Biology (ISMB) 1998
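The join loop of the agglomerative clustering procedure can be sketched as follows, using the five-sequence toy alignment from the SCI-PHY slide. Two simplifications are assumptions of this sketch and not the published method: profiles use a flat add-one pseudocount in place of Dirichlet mixture densities, and the distance is the symmetrized relative entropy summed over columns. The encoding-cost cut is omitted.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(residues, pseudocount=1.0):
    """Amino acid distribution for one column (flat pseudocount, gaps ignored)."""
    counts = Counter(r for r in residues if r in ALPHABET)
    total = sum(counts.values()) + pseudocount * len(ALPHABET)
    return [(counts[a] + pseudocount) / total for a in ALPHABET]


def cluster_profile(rows):
    """One distribution per alignment column for a cluster of aligned rows."""
    return [column_profile(col) for col in zip(*rows)]


def relative_entropy(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def distance(rows_a, rows_b):
    """Symmetrized relative entropy between two cluster profiles."""
    pa, pb = cluster_profile(rows_a), cluster_profile(rows_b)
    return sum(relative_entropy(x, y) + relative_entropy(y, x)
               for x, y in zip(pa, pb))


def agglomerate(msa):
    """Join the closest pair of clusters until one remains; return merge order."""
    clusters = [((name,), [seq]) for name, seq in msa.items()]
    merges = []
    while len(clusters) > 1:
        pairs = ((a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters)))
        i, j = min(pairs, key=lambda ab: distance(clusters[ab[0]][1],
                                                  clusters[ab[1]][1]))
        names = clusters[i][0] + clusters[j][0]
        rows = clusters[i][1] + clusters[j][1]  # profile is re-estimated from rows
        merges.append(names)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((names, rows))
    return merges


msa = {"Seq1": "LERY-K", "Seq2": "LDRFPR", "Seq3": "IERYGK",
       "Seq4": "MDRF-K", "Seq5": "VERYGK"}
merges = agglomerate(msa)
for step, names in enumerate(merges, start=1):
    print(step, names)
```

Each entry in `merges` corresponds to one internal node of the tree; a full implementation would also record the encoding cost at every stage and cut the tree where that cost is lowest.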

Detection of critical positions

Subfamilies identified using minimum encoding cost principles

• Each stage of the algorithm defines a different set of alignments, one for each cluster (“subfamily”).

• Find the point during the clustering where the encoding cost of the alignments is minimal. This defines the subfamily decomposition.
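The equation itself does not survive in this text. As a hedged reconstruction, assuming the form used in the published SCI-PHY description rather than anything recoverable from the slide, the encoding cost at a stage with S subfamilies can be written as the cost of labeling each of the N sequences with its subfamily plus the cost of encoding each subfamily's column counts under the Dirichlet mixture prior Θ:

```latex
\text{Encoding cost} \;=\; N \log_2 S \;-\; \sum_{c} \sum_{s=1}^{S} \log_2 P\!\left(n_{c,s} \mid \Theta\right)
```

Here $n_{c,s}$ denotes the amino acids aligned by subfamily $s$ at column $c$. The first term grows as the tree is cut into more subfamilies, while the second term shrinks as subfamilies become more homogeneous, so the minimum balances the two.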

[Plot: encoding cost (y-axis) against number of classes (x-axis, between 1 and N)]

N = number of sequences; S = number of subfamilies; n_{c,1} … n_{c,S} are the amino acids aligned by subfamilies 1 through S at column c. Θ represents the Dirichlet mixture prior.

SCI-PHY analysis of selected GPCRs

Venter et al., “The sequence of the human genome” (2001) Science.

Sjölander, “Phylogenomic inference of protein molecular function: advances and challenges” (2004) Bioinformatics

Key residue prediction using subfamily and family-wide conservation analysis

Residues labeled in the figures: D558, R627, D628, G629, Y221, W222, H745, Y743, A744

Elizabeth Hua-Mei Kellogg

Ryan Ritterson

Nandini Krishnamurthy

D RD E YA H

Parker JS, Roe SM, Barford D., EMBO J., 2004

Tanaka Hall, T., Structure 2005

Rivas et al, 2005

Function Prediction Using HMMs

Family: 7TM GPCR; ABC Transporter; Amidohydrolase; ATPase

Subfamily: 3.5.2.2 Dihydropyrimidinase; 3.5.4.1 Cytosine deaminase; 3.5.2.3 Dihydroorotase; 3.5.1.5 Urease

Subfamily HMM construction

1. At completely conserved positions and at subfamily gapped positions: use match state distributions estimated for the general (family) HMM.

2. At other positions:

   1. Estimate the Dirichlet mixture density posterior for each subfamily at each position separately.

   2. Use the Dirichlet density posteriors to weight contributions from other subfamilies.

   3. Compute the amino acid distribution using weighted counts and the standard Dirichlet procedure.

Brown et al., “Subfamily HMMs in functional genomics” (2005) Pacific Symposium on Biocomputing

Subfamily HMMs increase the separation between true and false positives

• 515 unique SCOP folds

• PFAM full MSAs

• Scored against Astral PDB90

1.5% error rate in subfamily classification using top-scoring SHMM

SATCHMO: Simultaneous Alignment and Tree Construction using Hidden Markov mOdels

Xia Jiang

Nandini Krishnamurthy

Duncan Brown

Michael Tung

Jake Gunn-Glanville

Bob Edgar

Edgar, R., and Sjölander, K., "SATCHMO: Sequence Alignment and Tree Construction using Hidden Markov models," Bioinformatics. 2003 Jul 22;19(11):1404-11

SATCHMO motivation

• Structural divergence within a superfamily means that…

– Multiple sequence alignment (MSA) is hard

– The set of alignable positions varies according to the degree of divergence

• Current MSA methods are not designed to handle this variability

– Assume globally alignable, all columns (e.g. ClustalW)…

• Over-aligns, i.e. aligns regions that are not superposable

– …or identify and align only highly conserved positions (e.g., SAM software with HMM “surgery”)

• Challenge

– Different degrees of alignability in different sequence pairs, different regions

– Masking protocols are lossy: loop regions may be variable across the family but may be critical for function!

SATCHMO algorithm

• Input : unaligned sequences

• Initialize : a profile HMM is constructed for each sequence.

• While (#clusters > 1) {

– Use profile-profile scoring to select clusters to join

– Align clusters to each other, keeping columns fixed

– Analyze joint MSA to predict which positions appear to be structurally similar; these are retained, the remainder are masked.

– Construct a profile HMM for the new masked MSA

}

• Output : Tree and MSA

Alignment of proteins with different overall folds


Assessing sequence alignment with respect to structural alignment

Xia Jiang Duncan Brown Nandini Krishnamurthy

Alignment accuracy as a function of % ID (including homologs, full-length sequences)

[Plot: x-axis, percent ID in bins 10-15%, 15-20%, 20-25%, 25-30%, 30-35%, 35-40%; series: CLUSTALW, MUSCLE, MAFFT, SATCHMO]

Future work: Interactive specificity position identification

Catalytic residues colored red

• Enable users to select subtrees for analysis

• Identify positions conserved within each subtree, but which differentiate the two**

• Plot over MSA and on structure (if available)
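The second bullet, finding positions conserved within each subtree but different between the two, reduces to a per-column test on the alignment. The sketch below is one simple reading of that idea, assuming strict conservation (a single residue per group) and ignoring gapped columns; the function name and the toy groups, taken from the five-sequence example alignment, are illustrative.

```python
def specificity_positions(subtree_a, subtree_b):
    """Return 0-based columns that are invariant within each group
    but carry a different residue in the two groups."""
    positions = []
    columns = zip(zip(*subtree_a), zip(*subtree_b))
    for col, (col_a, col_b) in enumerate(columns):
        res_a, res_b = set(col_a), set(col_b)
        conserved = len(res_a) == 1 and len(res_b) == 1
        gapped = "-" in res_a | res_b
        if conserved and not gapped and res_a != res_b:
            positions.append(col)
    return positions


# Aligned rows for two user-selected subtrees (same alignment columns in both).
group_a = ["LERY-K", "IERYGK", "VERYGK"]
group_b = ["LDRFPR", "MDRF-K"]
print(specificity_positions(group_a, group_b))  # [1, 3]
```

Column 2 (R in every sequence) is conserved family-wide and is therefore not reported; only the E/D and Y/F columns separate the two subtrees. A practical tool would likely relax strict conservation to a scoring threshold.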

Donald and Shakhnovich, NAR 2005

Major challenge: Phylogenetic uncertainty

Given: gene A (unknown function) and genes B and C (characterized function)

Predict function for A.

[Figure: three alternative tree topologies relating A, B and C]

Problem: use three phylogenetic tree methods, get 3 or more trees! Change the MSA, you also change the tree…

Need: Better simulation studies, benchmark datasets
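The problem can be made concrete by asking, for each candidate tree, which characterized gene is sister to A. The sketch below assumes trees encoded as nested tuples of leaf names, a representation chosen here for illustration only; the three topologies are the possible rooted arrangements of A, B and C.

```python
from collections import Counter


def leaves(tree):
    """Set of leaf names under a node (a leaf is a string, a clade is a tuple)."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def sister_of(tree, target):
    """Leaf set of the group sister to `target`, or None if target is absent."""
    if isinstance(tree, str):
        return None
    for child in tree:
        if child == target:
            siblings = [c for c in tree if c is not child]
            return frozenset().union(*(leaves(s) for s in siblings))
        found = sister_of(child, target)
        if found is not None:
            return found
    return None


# Three tree-building methods, three different answers.
trees = [
    (("A", "B"), "C"),   # A groups with B
    (("A", "C"), "B"),   # A groups with C
    ("A", ("B", "C")),   # A is outgroup to both
]
votes = Counter(sister_of(t, "A") for t in trees)
for sister, n in votes.items():
    print(sorted(sister), n)
```

With one vote for each answer, transferring the function of either B or C to A is unsupported, which is why the choice of alignment and tree method matters so much for phylogenomic inference.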

http://phylogenomics.berkeley.edu

Berkeley Phylogenomics Group

PI: Kimmen Sjölander

Nandini Krishnamurthy, Ph.D.

Duncan Brown

Sriram Sankararaman

Xia Jiang

Jake Gunn-Glanville

Lead programmer and web administrator:

Dan Kirshner

This work is supported in part by a Presidential Early Career Award for Scientists and Engineers from the NSF, and by an R01 from the NHGRI (NIH).
