pro2724-sup-0001-suppinfo01

Supporting Information Figure 1. Sequence-, structure-, and signature-based networks for the
enolase superfamily. Networks for the enolase superfamily were created using three different edge
metrics as the comparison tool. The sequence-, structure-, and signature-based networks utilized pairwise
BLAST, TM-Align, and ASP scores as their edge metrics, respectively (see Methods). From left to right,
the edge threshold is increased, removing the weakest relationships, creating smaller and smaller clusters
that are more and more closely related. A color key indicating SFLD subgroup and family annotation is
shown on the right. Edge thresholds were chosen to demonstrate cluster development and ended when
one or more subgroups had devolved into mostly singlets/doublets. Stars above the networks indicate the
edge threshold with the highest count of SFLD subgroups identified distinctly with all members in the
same cluster (number of subgroups in parentheses next to stars). Arrows and boxes in the networks
correspond to annotation on figures found in the text.
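The threshold sweep described in the caption above can be illustrated with a minimal sketch. It is not the authors' pipeline: it uses plain connected components as a stand-in for MCL clustering, the protein names, scores, and subgroup labels are hypothetical, and it assumes that a subgroup counts as "identified distinctly" when one cluster contains exactly that subgroup's members and nothing else.

```python
from collections import defaultdict

def clusters_at_threshold(nodes, edges, threshold):
    """Keep edges scoring >= threshold and return the connected components.
    Connected components are a simple stand-in here, not MCL."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, compressing the path as we go.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in edges:
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

def count_distinct_subgroups(clusters, subgroup_of):
    """Count subgroups whose full membership forms exactly one cluster
    (assumed reading of 'identified distinctly')."""
    members = defaultdict(set)
    for protein, subgroup in subgroup_of.items():
        members[subgroup].add(protein)
    return sum(1 for m in members.values() if m in clusters)

# Hypothetical toy network: five proteins in three subgroups.
nodes = ["a1", "a2", "b1", "b2", "c1"]
subgroup_of = {"a1": "A", "a2": "A", "b1": "B", "b2": "B", "c1": "C"}
edges = [("a1", "a2", 0.9), ("b1", "b2", 0.6),
         ("a2", "b1", 0.3), ("b2", "c1", 0.2)]

# Sweep thresholds from weakest to strictest and record the count at each.
thresholds = [0.0, 0.25, 0.5, 0.8]
counts = {t: count_distinct_subgroups(clusters_at_threshold(nodes, edges, t),
                                      subgroup_of)
          for t in thresholds}
best = max(thresholds, key=lambda t: counts[t])  # the "starred" threshold
```

In this toy sweep the count rises as weak edges are removed and falls again once a subgroup breaks into singlets, which mirrors the behavior the stars are meant to mark.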
Supporting Information Figure 2. Sequence-, structure-, and signature-based networks for the
peroxiredoxin superfamily. Networks for the peroxiredoxin superfamily were created using three
different edge metrics as the comparison tool. The sequence-, structure-, and signature-based networks
utilized pairwise BLAST, TM-Align, and ASP scores as their edge metrics, respectively (see Methods).
From left to right, the edge threshold is increased, removing the weakest relationships, creating smaller
and smaller clusters that are more and more closely related. A color key indicating SFLD subgroup
annotation is shown on the right. Edge thresholds were chosen to demonstrate cluster development and
ended when one or more subgroups had devolved into mostly singlets/doublets. Stars above the networks
indicate the edge threshold with the highest count of SFLD subgroups identified distinctly with all
members in the same cluster (number of subgroups in parentheses next to stars). Circles in the networks
correspond to annotation on figures found in the text.
Supporting Information Figure 3. Sequence-, structure-, and signature-based networks for the
glutathione transferase superfamily. Networks for the glutathione transferase superfamily were created
using three different edge metrics as the comparison tool. The sequence-, structure-, and signature-based
networks utilized pairwise BLAST, TM-Align, and ASP scores as their edge metrics, respectively (see
Methods). From left to right, the edge threshold is increased, removing the weakest relationships,
creating smaller and smaller clusters that are more and more closely related. A color key indicating
SFLD subgroup annotation is shown on the right. Edge thresholds were chosen to demonstrate cluster
development and ended when one or more subgroups had devolved into mostly singlets/doublets. Stars
above the networks indicate the edge threshold with the highest count of SFLD subgroups identified
distinctly with all members in the same cluster (number of subgroups in parentheses next to stars). Circles
in the networks correspond to annotation on figures found in the text.
Supporting Information Figure 4. Sequence-, structure-, and signature-based networks for the
crotonase superfamily. Networks for the crotonase superfamily were created using three different edge
metrics as the comparison tool. The sequence-, structure-, and signature-based networks utilized pairwise
BLAST, TM-Align, and ASP scores as their edge metrics, respectively (see Methods). From left to right,
the edge threshold is increased, removing the weakest relationships, creating smaller and smaller clusters
that are more and more closely related. A color key indicating SFLD subgroup and family annotation is
shown on the right. Edge thresholds were chosen to demonstrate cluster development and ended when
one or more families had devolved into mostly singlets/doublets. Stars above the networks indicate the
edge threshold with the highest count of SFLD families identified distinctly with all members in the same
cluster (number of families in parentheses next to stars). Circles in the networks correspond to annotation
on figures found in the text.
Supporting Information Figure 5. Signature logos for the peroxiredoxin (Prx) and crotonase superfamilies. A.
Signature logos were created for the entire peroxiredoxin superfamily (top) and the four largest clusters at
the 0.35 filter threshold in the signature-based network. B. Signature logos were created for the three
largest crotonase clusters at the 0.25 filter threshold in the signature-based network. The number of
proteins in each cluster, as well as the dominant subgroup, is shown above the cluster.
Supporting Information Figure 6. Key residues are identified using structural overlays with a
representative protein. A. 1UIY (blue) and 2Q2X (purple) are structurally aligned with a representative
protein, 1DUB (grey), in which key residues have been experimentally identified. B. The key residues
defined for the representative protein (black) are used as guides to identify the structurally analogous
residues in the other proteins (dark blue and dark purple).
Supporting Information Figure 7. Pairwise score distributions for the ASP, BLAST, and TM-Align
enolase networks demonstrate why multiple clusters are identified at the sequence-based “no
filter” threshold. The count of enolase pairwise scores is shown in each bin of size 0.05 between 0 and 1. Bars of
blue, green, and black represent the scores from the ASP, TM-Align, and BLAST scoring metrics,
respectively. BLAST scores are shown on a log scale in the inset, with a bin size of 1E-5.
Though the pairwise edge scores for all three metrics are in similar ranges, the distribution of these scores
is not consistent among the three networks. The ASP scores and the TM-Align scores are left- and right-shifted, respectively, and their centers are different, with means of 0.22 and 0.78, respectively. BLAST
scores, on the other hand, exhibit a bimodal distribution and are mostly found in the first bin (0 – 0.05)
and the last bin (>1). The median of the BLAST scores is 3E-7 while the mean is 42 due to skewing from
the large values in the >1 bin (maximum is 4765). As a result of this score distribution, the edges >1 are
removed during MCL clustering, causing the no filter threshold network to contain multiple distinct
groups. Conversely, both the ASP and TM-Align networks show one large group at the no filter threshold because no
edges are extremely different from the median edge. The BLAST scores that should be defining relevant
protein clusters are the scores between 0 and 0.05; the distribution of these scores is skewed left (inset).
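The binning and the mean/median gap described above can be reproduced on toy data. The following is a minimal sketch under stated assumptions: the scores are hypothetical placeholders, not the enolase data, and the function name and its overflow handling are illustrative choices rather than the authors' code.

```python
import statistics

def bin_scores(scores, bin_size=0.05, upper=1.0):
    """Count scores in fixed-width bins on [0, upper]; scores above
    `upper` are tallied separately in an overflow (">1") bin."""
    n_bins = round(upper / bin_size)
    counts = [0] * n_bins
    overflow = 0
    for s in scores:
        if s > upper:
            overflow += 1
        else:
            # Clamp so that a score equal to `upper` lands in the last bin.
            idx = min(int(s / bin_size), n_bins - 1)
            counts[idx] += 1
    return counts, overflow

# Hypothetical BLAST-like scores: most are tiny, a few are very large.
scores = [1e-9, 1e-8, 1e-7, 3e-7, 2e-6, 0.01, 0.42, 150.0, 4765.0]

counts, overflow = bin_scores(scores)
median = statistics.median(scores)
mean = statistics.mean(scores)

print("first bin (0 - 0.05):", counts[0])
print("overflow bin (>1):", overflow)
print("median:", median, "mean:", mean)
```

With a distribution of this shape, a handful of large values pull the mean far above the median, which is the skew the caption attributes to the >1 bin.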