Poly-syllabic Discovery of Specificity Estimates

Supplementary Methods for MacIsaac et al.
A Hypothesis-Based Approach for Identifying
the Binding Specificity of Regulatory
Proteins from Chromatin Immunoprecipitation Data
Hypothesis Generation
Overview
We derived profiles from unaligned binding sites in the TRANSFAC v7.2
database (Matys et al. 2003). To identify the DNA-binding domains of the proteins
represented in TRANSFAC, we used hidden-Markov models from Pfam (Bateman et al.
2004) (“Pfam_ls”, release 10, E-value threshold 0.01). In 22 cases where the sequences
of transcriptional regulators were missing from TRANSFAC we obtained the sequences
from SwissProt (Boeckmann et al. 2003). We identified 37 families of DNA-binding
domains in TRANSFAC that each contained at least 4 proteins and 30 sites. To derive
profiles for a family, we pooled all the binding sites reported for its members in
TRANSFAC, and submitted these sequences to two motif discovery programs:
AlignACE (Roth et al. 1998), as described in the main text, and DimerFinder, which is
described in detail below.
DimerFinder Algorithm
DimerFinder is designed to identify Family Binding Profiles containing short
direct repeats or inverted repeats that may represent binding sites for dimeric proteins.
The algorithm consists of four steps:
(1) Identifying statistically over-represented direct and inverted sequence repeats
(2) Clustering
(3) Assembling consensus strings
(4) Refining consensus strings
We applied DimerFinder separately to the set of binding sites for each domain
family in the TRANSFAC database that contained at least 4 proteins and 30 sites.
Source code for DimerFinder is available from the authors’ website.
1. Identifying over-represented repeats
We tabulated the frequency of all possible DNA sequences containing a direct or
inverted repeat of any word 3-8 bases in length with gaps ranging from 0 to 14 base pairs.
We identified the most significant of these repeats using two criteria that are explained in
detail below: gap bias and specificity.
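The tabulation described above can be sketched in Python as follows. This is a minimal illustration, not the DimerFinder source: the function names are hypothetical, and counting each (word, orientation, gap) combination at most once per binding site is an assumption, since the text does not say whether occurrences or sites are counted.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of an upper-case DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def tabulate_repeats(sites, min_word=3, max_word=8, max_gap=14):
    """Count sites containing W..W (direct) or W..revcomp(W) (inverted).

    Returns {(word, kind): {gap: number_of_sites}}.
    Assumption: a site contributes at most once per (word, kind, gap).
    """
    counts = defaultdict(lambda: defaultdict(int))
    for site in sites:
        seq = site.upper()
        seen = set()
        for k in range(min_word, max_word + 1):
            for start in range(len(seq) - k + 1):
                word = seq[start:start + k]
                if set(word) - set("ACGT"):
                    continue  # skip words with ambiguous letters
                inverted = reverse_complement(word)
                for gap in range(max_gap + 1):
                    second = start + k + gap
                    partner = seq[second:second + k]
                    if len(partner) < k:
                        break  # the second word no longer fits in the site
                    if partner == word:
                        seen.add((word, "direct", gap))
                    if partner == inverted:
                        seen.add((word, "inverted", gap))
        for word, kind, gap in seen:
            counts[(word, kind)][gap] += 1
    return counts


counts = tabulate_repeats(["TGACGTCA", "AATGACGTCATT", "GACGAC"])
print(dict(counts[("TGA", "inverted")]))
```

A word that is its own reverse complement (for example ACGT) is recorded under both orientations in this sketch.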
a. Gap bias
Suppose we found exactly n instances of a word pair (W,W'), where W' equals
either W (when searching for direct repeats) or its reverse complement (when searching
for inverted repeats), and the words are separated by between zero and g-1 bases. We
define the gap bias, Pn,g(c), to be the probability that the most frequent gap size between
W and W' would occur c or more times, under the null hypothesis that all gap sizes from
zero to g-1 are equally probable. We derive an equation for the gap bias below.
If the most common gap size has a significant gap bias, we identify additional
significant gaps using the following iterative strategy. First, we remove from the input
set those sites containing the word pair separated by the most frequent gap size G,
that is, W followed by G arbitrary bases and then W'. Then we recompute the gap bias
for the next most frequent gap size occurring with the word pair W,W'. This process is
repeated until, for some next gap size, the gap bias is no longer significant, that is,
Pn,g(c) exceeds our threshold. We chose a very stringent gap bias threshold of 5·10^-9
by comparing the discovered repeats with the expected results for the
well-studied dimeric families SRF-TF, HLH, HSF, Zn_clus and bZIP.
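As a rough sketch of this iterative procedure, the Python below scores the most frequent gap size with the upper bound on the gap bias derived in section c (g times a binomial tail with success probability 1/g) and keeps removing significant gaps until the bound exceeds the threshold. The function names are hypothetical, and keeping g fixed after a gap size has been removed is an assumption that the text does not settle.

```python
from math import comb


def gap_bias_bound(n, g, c):
    """Upper bound on the gap bias: g * P[Binomial(n, 1/g) >= c], capped at 1.

    Integer arithmetic keeps the tail exact until the final division.
    """
    tail = sum(comb(n, j) * (g - 1) ** (n - j) for j in range(c, n + 1)) / g ** n
    return min(1.0, g * tail)


def significant_gaps(gap_counts, g=15, threshold=5e-9):
    """Return gap sizes judged significant, most frequent first.

    gap_counts maps gap size -> number of sites with the word pair at that gap.
    Assumption: g (the number of possible gap sizes) stays fixed between rounds.
    """
    remaining = dict(gap_counts)
    selected = []
    while remaining:
        n = sum(remaining.values())
        best_gap, c = max(remaining.items(), key=lambda item: item[1])
        if gap_bias_bound(n, g, c) > threshold:
            break  # the most frequent remaining gap is not significant
        selected.append(best_gap)
        del remaining[best_gap]  # drop the sites explained by this gap
    return selected


print(significant_gaps({2: 30, 0: 1, 7: 2}))
```

With gaps of 0 to 14 bases there are g = 15 possible gap sizes, which is the default used here.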
b. Specificity
We evaluate the repeats with a significant gap bias using several criteria designed
to ensure that the repeats are specific to a particular domain family. The first criterion is
the enrichment score which measures the over-representation of matching sites in the
family of interest compared to all other transcription factor families:


$$E = -\log_{10} \sum_{i=b}^{\min(B,g)} \frac{\binom{B}{i}\binom{G-B}{g-i}}{\binom{G}{g}} \qquad (1)$$
where B is the number of binding sites associated with the domain family, and G is the total
number of binding sites in the database for all families. The quantities b and g represent
the number of binding sites matching the motif within B and G respectively. We use a
threshold of 2.0 for this score. The second criterion tests whether the set of proteins with
at least one match to the string is enriched in the family of interest. We require an
enrichment score of at least 5.0 for this metric.
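The enrichment score can be computed directly as a hypergeometric tail. The sketch below assumes that reading of the score (the negative base-10 logarithm of the probability of seeing b or more matching sites in the family by chance); the function name is hypothetical. Applying the same function to counts of proteins rather than sites for the second criterion is likewise an assumption.

```python
from math import comb, log10


def enrichment_score(B, G, b, g):
    """-log10 of the hypergeometric tail P[X >= b].

    B: sites (or proteins) in the family of interest
    G: sites (or proteins) in the whole database
    b: matches to the motif within the family
    g: matches to the motif within the whole database
    """
    upper = min(B, g)
    if not 0 <= b <= upper:
        raise ValueError("b must lie between 0 and min(B, g)")
    tail = sum(comb(B, i) * comb(G - B, g - i) for i in range(b, upper + 1))
    if tail == 0:
        raise ValueError("counts are inconsistent: b matches cannot occur")
    # math.log10 accepts arbitrarily large Python integers.
    return log10(comb(G, g)) - log10(tail)


score = enrichment_score(B=50, G=1000, b=10, g=10)
print(score >= 2.0)
```

Working with exact integer binomial coefficients avoids floating-point underflow for very small tail probabilities.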
As a final requirement, we accept the repeats based on a word pair (W,W') that
satisfy all other criteria only if these repeats occur in a total of at least ten binding sites.
For families with fewer than 200 binding sites in the database, we relax this criterion by 1
dimer occurrence for every 50 binding sites. This choice of criteria and thresholds gave
the best agreement between the discovered Family Binding Profiles and the literature.
c. Estimating the gap bias
Each word pair (W,W') is associated with a distribution of gap sizes. Consider an
integer-valued frequency vector for this distribution:
$$V = \{v_0, \ldots, v_{g-1}\} \quad \text{such that} \quad \sum_{i=0}^{g-1} v_i = n \ \text{and all}\ v_i \ge 0$$
Here vi is the number of word pairs with a gap of length i in a set of n motifs
based on a word W. Clearly, different events can correspond to the same frequency
vector. The probability distribution for all frequency vectors associated with n
occurrences of the word pair (W,W') is multinomial:
$$\mathrm{Prob}\left(V = \{v_0, \ldots, v_{g-1}\} \,\middle|\, \sum_{i=0}^{g-1} v_i = n \ \text{and all}\ v_i \ge 0\right) = \frac{n!}{v_0!\, v_1! \cdots v_{g-1}!} \cdot \frac{1}{g^n} \qquad (1)$$
Gap bias can now be expressed in terms of frequency vectors as follows:
$$P_{n,g}(c) = \mathrm{Prob}\left(\left\{V \,\middle|\, \sum_{i=0}^{g-1} v_i = n,\ \text{all}\ v_i \ge 0 \ \text{and}\ \max_i v_i \ge c\right\}\right) \qquad (2)$$
To compute this value, we first compute the probability Xk that a gap of a
particular size (say of size k) will occur at least c times. It equals the tail of the binomial
distribution with parameters (n, 1/g):
$$X_k = \mathrm{Prob}\left(\left\{V \,\middle|\, \sum_{i=0}^{g-1} v_i = n,\ \text{all}\ v_i \ge 0 \ \text{and}\ v_k \ge c\right\}\right) = \sum_{j=c}^{n} \binom{n}{j} \left(\frac{1}{g}\right)^{j} \left(1 - \frac{1}{g}\right)^{n-j} \qquad (3)$$
The value of Xk does not depend on k.
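Noting that
$$\mathrm{Prob}\left(\{V \mid \max_i v_i \ge c\}\right) = \mathrm{Prob}\left(\bigcup_{k=0}^{g-1} \{V \mid v_k \ge c\}\right),$$
and using Boole's inequality for probabilities,
$$\mathrm{Prob}\left(\bigcup_i A_i\right) \le \sum_i \mathrm{Prob}(A_i),$$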
we obtain from equation (3) an upper bound estimate on Pn,g(c):
$$P_{n,g}(c) \le \sum_{k=0}^{g-1} X_k = g \cdot X_1 = g \sum_{j=c}^{n} \binom{n}{j} \left(\frac{1}{g}\right)^{j} \left(1 - \frac{1}{g}\right)^{n-j} \qquad (4)$$
2. Clustering
The set of repeats identified by the above criteria will include many variants of the same
underlying profile, since the search is done exhaustively for different word and gap sizes.
For example, the motif TGACGTCA can be detected as TGA-N2-TCA, TGAC-N0-GTCA
and GAC-N0-GTC (Nk denoting a gap of k bases). To cluster these repeats while preserving symmetry and preventing
connections between overly dissimilar strings, we found it necessary to use a very
specific definition of similarity. A pair of strings W...W' and U...U' are similar if they
can be aligned as follows:
1. At each position in the alignment, either (a) both letters are identical, or (b)
one of the letters is overhanging. An overhanging base in W...W' is one that
does not align with a base in U or U', but rather lies either in the gap region or
outside the string.
2. Overlapping (i.e. not overhanging) parts of W and U (and of W' and U') are at
least 2 letters long.
3. The shorter overhang between two aligned words is at most one letter long.
The reverse complement of a string is used when it produces a better alignment.
We consider all repeats as nodes of an undirected graph with edges connecting
similar strings. We define a cluster as a connected component in this graph. In cases
where repeats with half-sites (W or U) of different lengths are connected, the node
belonging to the string with the shorter half-site is discarded if either (i) it is completely
contained in the other string, or (ii) the two strings have different orientations (i.e. one
tandem and one palindromic). These deletions prevent clustering of unrelated strings.
The resulting connected components of the graph are taken as clusters.
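The connected-component step can be illustrated with a small union-find sketch in Python. The similarity test is passed in as a function. The toy predicate used in the example is hypothetical and far cruder than the three alignment rules listed above, and the sketch does not implement the removal of nodes with shorter half-sites.

```python
def cluster_repeats(repeats, similar):
    """Group repeats into connected components of the similarity graph.

    similar(a, b) must return True when an edge joins a and b.
    """
    parent = list(range(len(repeats)))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(repeats)):
        for j in range(i + 1, len(repeats)):
            if similar(repeats[i], repeats[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, repeat in enumerate(repeats):
        clusters.setdefault(find(i), []).append(repeat)
    return list(clusters.values())


def toy_similar(a, b):
    """Hypothetical stand-in: the shorter string fits inside the longer one,
    with N (a gap position) matching any base."""
    short, long_ = sorted((a, b), key=len)
    for offset in range(len(long_) - len(short) + 1):
        window = long_[offset:]
        if all(x == y or "N" in (x, y) for x, y in zip(short, window)):
            return True
    return False


repeats = ["TGANNTCA", "TGACGTCA", "GACGTC", "CCGNNNCGG"]
print(cluster_repeats(repeats, toy_similar))
```

Here the three variants of TGACGTCA from the example in the text fall into one cluster and the unrelated repeat forms its own.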
3. Assembling Consensus Strings
We derive a multiple sequence alignment for all strings in a cluster using the pair-wise
alignments represented by each edge in the graph. From this alignment, we assemble a
consensus string by replacing all letters in each column by one representative base or
ambiguity code. In rare cases, the consensus string no longer contains either a direct or
an inverted repeat. For these cases we further subdivide the connected component into
refined clusters by requiring that an edge only connect two nodes if both are direct
repeats or both are inverted repeats.
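One way to collapse alignment columns into a consensus string is sketched below in Python. The text does not state the rule used to choose a representative base or ambiguity code, so this example assumes the simplest one: each column is replaced by the IUPAC code for the set of bases observed in it, while gap positions (N) and overhang padding (written here as "-") are ignored. The function name and the padding convention are hypothetical.

```python
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y",
    frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def consensus(aligned):
    """Collapse equal-length aligned strings into one IUPAC consensus string.

    Characters other than A, C, G and T (such as N or '-') carry no
    information; a column made up only of such characters becomes N.
    """
    width = len(aligned[0])
    if any(len(s) != width for s in aligned):
        raise ValueError("aligned strings must have equal length")
    letters = []
    for column in zip(*aligned):
        bases = frozenset(ch for ch in column if ch in "ACGT")
        letters.append(IUPAC.get(bases, "N"))
    return "".join(letters)


print(consensus(["TGANNTCA", "TGACGTCA", "-GACGTC-"]))
```

A frequency-based rule (for example, keeping only bases above some minimum column frequency) would be an equally plausible reading of the text.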
4. Refining Consensus Strings
A position-weight matrix for the Family Binding Profile is obtained by refining each
consensus string using a single round of the expectation maximization (EM) algorithm
implemented in TAMO (Gordon et al. 2005), using the complete set of TRANSFAC
binding sites for this family as the input sequences. Each consensus string is converted
into a probability matrix used directly in the “E” step of EM, and the final profile is
obtained from the subsequent “M” step.
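To make the single EM round concrete, the following Python sketch converts a consensus string into a probability matrix, uses it in an E step to weight every window of every binding site, and then re-estimates the matrix in an M step. It is a simplified stand-in rather than the TAMO implementation: it assumes exactly one motif occurrence per site, scans the forward strand only, and uses a uniform background and a fixed pseudo-count. All function names are hypothetical.

```python
BASES = "ACGT"
IUPAC_SETS = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_matrix(consensus, pseudo=0.1):
    """One probability column per letter; allowed bases share the mass."""
    matrix = []
    for letter in consensus:
        allowed = IUPAC_SETS[letter]
        raw = {b: (1.0 / len(allowed) if b in allowed else 0.0) + pseudo
               for b in BASES}
        total = sum(raw.values())
        matrix.append({b: raw[b] / total for b in BASES})
    return matrix


def em_round(matrix, sequences, background=0.25, pseudo=0.1):
    """Run one E step and one M step; return the refined matrix."""
    width = len(matrix)
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for seq in sequences:
        windows = [seq[i:i + width] for i in range(len(seq) - width + 1)]
        windows = [w for w in windows if all(ch in BASES for ch in w)]
        if not windows:
            continue
        # E step: likelihood ratio of each window, normalised per sequence.
        weights = []
        for window in windows:
            ratio = 1.0
            for column, ch in zip(matrix, window):
                ratio *= column[ch] / background
            weights.append(ratio)
        total = sum(weights)
        # M step (accumulation): add each window weighted by its posterior.
        for window, weight in zip(windows, weights):
            for pos, ch in enumerate(window):
                counts[pos][ch] += weight / total
    return [{b: col[b] / sum(col.values()) for b in BASES} for col in counts]


start = consensus_to_matrix("TGACGTCA")
refined = em_round(start, ["AATGACGTCATT", "CCTGACGTCAGG", "TTTGACGTCAAA"])
print("".join(max(col, key=col.get) for col in refined))
```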
Grid Search
In a single run of hypothesis testing by the THEME algorithm there are two parameters to
be specified. The first parameter is β, which is the weight we assign to the initial
hypothesis during the Expectation Maximization refinement. β can take on values
between 0.0 and 1.0. At β = 0.0 EM is not restrained and the hypothesis is simply an
initialization point for the EM algorithm. As β is increased we restrain the optimization
using pseudo-counts added to the motif PSSM during the M-step (Bailey and Elkan
1995). At β = 1.0, EM is completely restrained and no refinement occurs. The second
parameter, C, is used during training of the SVM classifier. When data is not linearly
separable, one may introduce an error penalization term to regularize the soft-margin
SVM optimization problem. C is a parameter that scales this regularization term.
The THEME algorithm performs a grid search over a range of settings for these two
parameters in order to find the particular setting that yields the lowest mean cross-validation error on the dataset. We tested β values of 0.05, 0.1, 0.33, 0.5, 0.67, and 1.0. We
tested C values of 1.0E-10, 1.0E-4, 1.0E-3, 1.0E-2, 0.05, 0.1, 1.0, 10.0, and 100.0. At
each setting of β, we perform optimization of the hypothesis on the training set using EM.
After scoring the sequences using the log-likelihood ratio criterion described in the main
text, we train SVM classifiers at each setting of the parameter C and assess their
performance on the held-out test data. We repeat this procedure for each partition of test
and training data and store the mean test error for each setting of C at this particular
setting of β. We then repeat the EM optimization of the hypothesis, using the same
training and test set partitions, using a new β value. By repeating this procedure over all
β values, we are able to identify the setting of [β,C] that yields the lowest cross-validation error, and we report this error for the hypothesis.
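The loop structure of this grid search can be sketched in Python as below. The EM refinement and the SVM training and scoring are represented by two caller-supplied functions, refine and test_error, which are hypothetical placeholders; the toy versions in the usage example exist only to make the sketch runnable and do not model THEME.

```python
from math import log10


def grid_search(betas, c_values, folds, refine, test_error):
    """Return (lowest mean test error, beta, C) over the beta x C grid.

    folds: list of (train, test) partitions, reused for every beta.
    refine(train, beta): refined hypothesis (EM is run once per beta and fold).
    test_error(model, train, test, c): held-out error of a classifier
        trained with regularisation parameter c.
    """
    best = None
    for beta in betas:
        totals = {c: 0.0 for c in c_values}
        for train, test in folds:
            model = refine(train, beta)
            for c in c_values:
                totals[c] += test_error(model, train, test, c)
        for c in c_values:
            mean_error = totals[c] / len(folds)
            if best is None or mean_error < best[0]:
                best = (mean_error, beta, c)
    return best


# Toy stand-ins (hypothetical): the error is smallest at beta = 0.33, C = 1.0.
def toy_refine(train, beta):
    return beta


def toy_test_error(model, train, test, c):
    return (model - 0.33) ** 2 + 0.01 * abs(log10(c))


betas = [0.05, 0.1, 0.33, 0.5, 0.67, 1.0]
c_values = [1.0e-10, 1.0e-4, 1.0e-3, 1.0e-2, 0.05, 0.1, 1.0, 10.0, 100.0]
folds = [(None, None)] * 5
print(grid_search(betas, c_values, folds, toy_refine, toy_test_error))
```

Running EM once per β and fold, and then reusing the refined hypothesis for every C, mirrors the order of operations described in the text.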
ChIP Experiments
HepG2 cells (ATCC) and MIN6 cells (originally a gift of J. Miyazaki, Osaka
Univ. (Ishihara et al. 1993)) were grown by the National Center for Cell Culture
(University of Minnesota) under standard conditions in DMEM-10% FBS. At 80%
confluence, the cell cultures were crosslinked in situ for 10 minutes with 1% (final)
formaldehyde. After neutralization with 2.5 M glycine, followed by rinsing with 1x
PBS, the cells were frozen in aliquots of 1 x 10^8 cells at –80°C until needed. Human hepatocytes
were isolated from livers obtained from deceased donors at the Liver Tissue Procurement
and Distribution System (S. Strom, U-Pittsburgh) after liver digestion and Percoll
gradient isolation of the hepatocyte fraction. Crosslinking, aliquoting, and freezing were
performed as above.
After treatment with formaldehyde to covalently link transcriptional regulators to
DNA sites of interaction, chromatin in cell lysates was sheared by sonication. The
regulator-DNA complexes were enriched by chromatin-IP with specific antibodies, the
crosslinks were reversed, and enriched DNA fragments and control genomic DNA
fragments were amplified using ligation-mediated PCR. The amplified DNA
preparations, labeled with distinct fluorophores, were mixed and hybridized onto a
promoter array. We used the following antibodies: HNF3, gift of Robert Costa; E2F4,
sc-1082; NeuroD1, sc-1084. The data have been submitted to ArrayExpress under
accession number E-WMIT-8.
References
Bailey, T. L. and C. Elkan (1995). "The value of prior knowledge in discovering motifs
with MEME." Proc Int Conf Intell Syst Mol Biol 3: 21-9.
Bateman, A., L. Coin, et al. (2004). "The Pfam protein families database." Nucleic Acids
Res 32 Database issue: D138-41.
Boeckmann, B., A. Bairoch, et al. (2003). "The SWISS-PROT protein knowledgebase
and its supplement TrEMBL in 2003." Nucleic Acids Res 31(1): 365-70.
Gordon, D. B., L. Nekludova, et al. (2005). "TAMO: a flexible, object-oriented
framework for analyzing transcriptional regulation using DNA-sequence motifs."
Bioinformatics 21(14): 3164-5.
Ishihara, H., T. Asano, et al. (1993). "Pancreatic beta cell line MIN6 exhibits
characteristics of glucose metabolism and glucose-stimulated insulin secretion
similar to those of normal islets." Diabetologia 36(11): 1139-45.
Matys, V., E. Fricke, et al. (2003). "TRANSFAC: transcriptional regulation, from
patterns to profiles." Nucleic Acids Res 31(1): 374-8.
Roth, F. P., J. D. Hughes, et al. (1998). "Finding DNA regulatory motifs within unaligned
noncoding sequences clustered by whole-genome mRNA quantitation." Nat
Biotechnol 16(10): 939-45.