Supplementary Methods for MacIsaac et al.

A Hypothesis-Based Approach for Identifying the Binding Specificity of Regulatory Proteins from Chromatin Immunoprecipitation Data

Hypothesis Generation

Overview

We derived profiles from unaligned binding sites in the TRANSFAC v7.2 database (Matys et al. 2003). To identify the DNA-binding domains of the proteins represented in TRANSFAC, we used hidden Markov models from Pfam (Bateman et al. 2004) (“Pfam_ls”, release 10, E-value threshold 0.01). In 22 cases where the sequences of transcriptional regulators were missing from TRANSFAC, we obtained the sequences from SwissProt (Boeckmann et al. 2003). We identified 37 families of DNA-binding domains in TRANSFAC that each contained at least 4 proteins and 30 sites. To derive profiles for a family, we pooled all the binding sites reported for its members in TRANSFAC and submitted these sequences to two motif discovery programs: AlignACE (Roth et al. 1998), as described in the main text, and DimerFinder, which is described in detail below.

DimerFinder Algorithm

DimerFinder is designed to identify Family Binding Profiles containing short direct repeats or inverted repeats that may represent binding sites for dimeric proteins. The algorithm consists of four steps:

(1) Identifying statistically over-represented direct and inverted sequence repeats
(2) Clustering
(3) Assembling consensus strings
(4) Refining consensus strings

We applied DimerFinder separately to the set of binding sites for each domain family in the TRANSFAC database that contained at least 4 proteins and 30 sites. Source code for DimerFinder is available from the authors’ website.

1. Identifying over-represented repeats.

We tabulated the frequency of all possible DNA sequences containing a direct or inverted repeat of any word 3-8 bases in length with gaps ranging from 0 to 14 base pairs.
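As a rough illustration of this tabulation step, the following Python sketch counts, for every word of 3-8 bases and every gap of 0-14 bases, how many binding sites contain a direct or inverted repeat of that word. It is not the DimerFinder source code. The function names (`tabulate_repeats`, `reverse_complement`), the decision to count each site at most once per repeat, and the restriction to the alphabet A, C, G, T are assumptions made for illustration only.

```python
from collections import Counter

# Translation table for complementing DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Return the reverse complement of a DNA word."""
    return word.translate(COMPLEMENT)[::-1]


def tabulate_repeats(sites, min_word=3, max_word=8, max_gap=14):
    """Count the sites containing each (word, orientation, gap) repeat.

    Assumption: a site contributes at most once to each key, so the
    returned counts are numbers of sites, not numbers of occurrences.
    """
    counts = Counter()
    for site in sites:
        site = site.upper()
        seen = set()
        for k in range(min_word, max_word + 1):
            for gap in range(0, max_gap + 1):
                span = 2 * k + gap  # total length of W + gap + W'
                for start in range(len(site) - span + 1):
                    w = site[start:start + k]
                    if set(w) - set("ACGT"):
                        continue  # skip words with ambiguous bases
                    w2 = site[start + k + gap:start + span]
                    if w2 == w:
                        seen.add((w, "direct", gap))
                    if w2 == reverse_complement(w):
                        seen.add((w, "inverted", gap))
        counts.update(seen)
    return counts


# Hypothetical example sites, not taken from TRANSFAC.
example_counts = tabulate_repeats(["TGACGTCA", "ATGACGTCAT", "AAGCTTAAG"])
```

Note that a word equal to its own reverse complement is recorded under both orientations in this sketch; whether to treat such words specially is a design choice left open here.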
We identified the most significant of these repeats using two criteria that are explained in detail below: gap bias and specificity.

a. Gap bias

Suppose we found exactly n instances of a word pair (W,W'), where W' equals either W (when searching for direct repeats) or its reverse complement (when searching for inverted repeats), and the words are separated by between zero and g-1 bases. We define the gap bias, $P_{n,g}(c)$, to be the probability that the most frequent gap size between W and W' would occur c or more times under the null hypothesis that all gap sizes from zero to g-1 are equally probable. We derive an equation for the gap bias below.

If the most common gap size has a significant gap bias, we identify additional significant gaps using the following iterative strategy. First, we remove from the input set those sites containing the word pair separated by the most frequent gap size G (schematically, W nnnn W', with G bases between the two words). Then we recompute the gap bias for the next most frequent gap size occurring with the word pair (W,W'). This process is repeated until the gap bias of the next most frequent gap size no longer meets our significance threshold. We chose a very stringent gap bias threshold of $5 \times 10^{-9}$ by comparing the discovered repeats with the expected results for the well-studied dimeric families SRF-TF, HLH, HSF, Zn_clus and bZIP.

b. Specificity

We evaluate the repeats with a significant gap bias using several criteria designed to ensure that the repeats are specific to a particular domain family. The first criterion is the enrichment score, which measures the over-representation of matching sites in the family of interest compared to all other transcription factor families:

$$E = -\log_{10} \sum_{i=b}^{\min(B,g)} \frac{\binom{B}{i}\binom{G-B}{g-i}}{\binom{G}{g}} \qquad (1)$$

where B is the number of binding sites associated with the domain family and G is the total number of binding sites in the database for all families. The quantities b and g represent the number of binding sites matching the motif within B and G, respectively.
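The enrichment score can be sketched in a few lines of Python. The sketch assumes that the score is the negative base-10 logarithm of a hypergeometric tail probability (the probability of seeing b or more matching sites in the family by chance), which is consistent with the positive thresholds used for it. The function name `enrichment_score` and the example numbers are hypothetical.

```python
from math import comb, log10, inf


def enrichment_score(b, B, g, G):
    """Return -log10 of the hypergeometric tail P(matches in family >= b).

    b: sites in the family that match the motif
    B: sites in the family
    g: sites in the whole database that match the motif
    G: sites in the whole database
    """
    # Exact integer arithmetic avoids underflow for very small tails.
    tail = sum(comb(B, i) * comb(G - B, g - i)
               for i in range(b, min(B, g) + 1))
    if tail == 0:
        return inf  # b exceeds the largest possible number of matches
    return -(log10(tail) - log10(comb(G, g)))


# Hypothetical numbers: 12 of 40 family sites match, 15 of 2000 overall.
score = enrichment_score(b=12, B=40, g=15, G=2000)
passes = score >= 2.0
```

Working with the integer binomial coefficients and taking logarithms only at the end keeps the computation exact until the final step, which matters when the tail probability is far below floating-point range.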
We use a threshold of 2.0 for this score. The second criterion tests whether the set of proteins with at least one match to the string is enriched in the family of interest. We require an enrichment score of at least 5.0 for this metric. As a final requirement, we accept the repeats based on a word pair (W,W') that satisfy all other criteria only if these repeats occur in a total of at least ten binding sites. For families with fewer than 200 binding sites in the database, we relax this criterion by one dimer occurrence for every 50 binding sites. This choice of criteria and thresholds gave the best agreement between the discovered Family Binding Profiles and the literature.

c. Estimating the gap bias

Each word pair (W,W') is associated with a distribution of gap sizes. Consider an integer-valued frequency vector for this distribution:

$$V = \{v_0, \ldots, v_{g-1}\} \quad \text{such that} \quad \sum_{i=0}^{g-1} v_i = n \ \text{and all } v_i \ge 0$$

Here $v_i$ is the number of word pairs with a gap of length i in a set of n motifs based on a word W. Clearly, different events can correspond to the same frequency vector. The probability distribution for all frequency vectors associated with n occurrences of the word pair (W,W') is multinomial:

$$\mathrm{Prob}\Big(\Big\{V = \{v_0, \ldots, v_{g-1}\} \,\Big|\, \sum_{i=0}^{g-1} v_i = n \ \text{and all } v_i \ge 0\Big\}\Big) = \frac{n!}{v_0!\, v_1! \cdots v_{g-1}!} \left(\frac{1}{g}\right)^{n} \qquad (1)$$

Gap bias can now be expressed in terms of frequency vectors as follows:

$$P_{n,g}(c) = \mathrm{Prob}\Big(\Big\{V \,\Big|\, \sum_{i=0}^{g-1} v_i = n,\ \text{all } v_i \ge 0 \ \text{and} \ \max_i v_i \ge c\Big\}\Big) \qquad (2)$$

To compute this value, we first compute the probability $X_k$ that a gap of a particular size (say, size k) will occur at least c times. It equals the tail of the binomial distribution with parameters (n, 1/g):

$$X_k = \mathrm{Prob}\Big(\Big\{V \,\Big|\, \sum_{i=0}^{g-1} v_i = n,\ \text{all } v_i \ge 0 \ \text{and} \ v_k \ge c\Big\}\Big) = \sum_{j=c}^{n} \binom{n}{j} \left(\frac{1}{g}\right)^{j} \left(1 - \frac{1}{g}\right)^{n-j} \qquad (3)$$

The value of $X_k$ does not depend on k.
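To make these quantities concrete, the following Python sketch evaluates the binomial tail $X_k$ directly and, for small n and g only, computes the gap bias $P_{n,g}(c)$ exactly by summing multinomial probabilities over all frequency vectors whose largest entry is at least c. The exhaustive enumeration is an illustrative check, not a method described in this supplement, and the function names are hypothetical.

```python
from math import comb, factorial


def binomial_tail(n, g, c):
    """X_k: probability that one particular gap size occurs at least c times."""
    p = 1.0 / g
    return sum(comb(n, j) * p**j * (1.0 - p)**(n - j)
               for j in range(c, n + 1))


def frequency_vectors(n, g):
    """Yield every tuple (v_0, ..., v_{g-1}) of non-negative ints summing to n."""
    if g == 1:
        yield (n,)
        return
    for first in range(n + 1):
        for rest in frequency_vectors(n - first, g - 1):
            yield (first,) + rest


def gap_bias_exact(n, g, c):
    """P_{n,g}(c) by brute-force enumeration (feasible for small n and g only)."""
    total = 0.0
    for v in frequency_vectors(n, g):
        if max(v) >= c:
            # Multinomial coefficient n! / (v_0! v_1! ... v_{g-1}!)
            coef = factorial(n)
            for vi in v:
                coef //= factorial(vi)
            total += coef / g**n
    return total


# Small hypothetical case: 6 occurrences, 3 possible gap sizes, c = 5.
x_k = binomial_tail(6, 3, 5)
p_exact = gap_bias_exact(6, 3, 5)
```

In this small case no two gap sizes can both occur five or more times among six occurrences, so the exact gap bias is simply three times the single-gap tail.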
Noting that

$$\mathrm{Prob}\big(\{V \mid \max_i v_i \ge c\}\big) = \mathrm{Prob}\Big(\bigcup_{k=0}^{g-1} \{V \mid v_k \ge c\}\Big),$$

and using Boole's inequality for probabilities,

$$\mathrm{Prob}\Big(\bigcup_{i} A_i\Big) \le \sum_{i} \mathrm{Prob}(A_i),$$

we obtain from equation (3) an upper bound estimate on $P_{n,g}(c)$:

$$P_{n,g}(c) \le \sum_{k=0}^{g-1} X_k = g X_1 = g \sum_{j=c}^{n} \binom{n}{j} \left(\frac{1}{g}\right)^{j} \left(1 - \frac{1}{g}\right)^{n-j} \qquad (4)$$

2. Clustering

The set of repeats identified by the above criteria will include many variants of the same underlying profile, since the search is done exhaustively for different word and gap sizes. For example, the motif TGACGTCA can be detected as TGAN2TCA, TGACN0GTCA and GACN0GTC. To cluster these repeats while preserving symmetry and preventing connections between overly dissimilar strings, we found it necessary to use a very specific definition of similarity. A pair of strings W...W' and U...U' are similar if they can be aligned as follows:

1. At each position in the alignment, either (a) both letters are identical, or (b) one of the letters is overhanging. An overhanging base in W...W' is one that does not align with a base in U or U', but rather lies either in the gap region or outside the string.
2. Overlapping (i.e. not overhanging) parts of W and U (and of W' and U') are at least 2 letters long.
3. The shorter overhang between two aligned words is at most one letter long.

The reverse complement of a string is used when it produces a better alignment.

We consider all repeats as nodes of an undirected graph with edges connecting similar strings. We define a cluster as a connected component in this graph. In cases where repeats with half-sites (W or U) of different lengths are connected, the node belonging to the string with the shorter half-site is discarded if either (i) it is completely contained in the other string, or (ii) the two strings have different orientations (i.e. one tandem and one palindromic). These deletions prevent clustering of unrelated strings. The resulting connected components of the graph are taken as clusters.

3.
Assembling Consensus Strings

We derive a multiple sequence alignment for all strings in a cluster using the pair-wise alignments represented by each edge in the graph. From this alignment, we assemble a consensus string by replacing all letters in each column by one representative base or ambiguity code. In rare cases, the consensus string no longer contains either a direct or an inverted repeat. For these cases we further subdivide the connected component into refined clusters by requiring that an edge only connect two nodes if both are direct repeats or both are inverted repeats.

4. Refining Consensus Strings

A position-weight matrix for the Family Binding Profile is obtained by refining each consensus string using a single round of the expectation maximization (EM) algorithm implemented in TAMO (Gordon et al. 2005), using the complete set of TRANSFAC binding sites for this family as the input sequences. Each consensus string is converted into a probability matrix used directly in the “E” step of EM, and the final profile is obtained from the subsequent “M” step.

Grid Search

In a single run of hypothesis testing by the THEME algorithm there are two parameters to be specified. The first parameter is β, which is the weight we assign to the initial hypothesis during the Expectation Maximization refinement. β can take on values between 0.0 and 1.0. At β = 0.0, EM is not restrained and the hypothesis is simply an initialization point for the EM algorithm. As β is increased, we restrain the optimization using pseudo-counts added to the motif PSSM during the M-step (Bailey and Elkan 1995). At β = 1.0, EM is completely restrained and no refinement occurs.

The second parameter, C, is used during training of the SVM classifier. When the data are not linearly separable, one may introduce an error penalization term to regularize the soft-margin SVM optimization problem. C is a parameter that scales this regularization term.
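The effect of β at its two extremes can be illustrated with a small Python sketch. The exact pseudo-count scheme is not given in this supplement, so the sketch assumes a simple convex combination of the EM-estimated column frequencies and the hypothesis column, chosen only because it reproduces the two limiting cases described above (β = 0.0 leaves EM unrestrained, β = 1.0 returns the hypothesis unchanged). The function name `restrained_m_step` and the data layout are hypothetical and should not be read as THEME's actual update rule.

```python
def restrained_m_step(expected_counts, hypothesis, beta):
    """Blend EM expected counts with the hypothesis motif, column by column.

    expected_counts: list of {base: expected count} dicts from the E-step
    hypothesis:      list of {base: probability} dicts (the initial motif)
    beta:            weight on the hypothesis, between 0.0 and 1.0

    Assumed form (illustrative only):
        new = (1 - beta) * normalized_counts + beta * hypothesis
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie between 0.0 and 1.0")
    updated = []
    for counts, prior in zip(expected_counts, hypothesis):
        total = sum(counts[base] for base in "ACGT")
        if total <= 0:
            raise ValueError("each column needs positive total counts")
        column = {}
        for base in "ACGT":
            freq = counts[base] / total
            column[base] = (1.0 - beta) * freq + beta * prior[base]
        updated.append(column)
    return updated


# Hypothetical one-column motif and expected counts.
hypothesis = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}]
counts = [{"A": 2.0, "C": 6.0, "G": 1.0, "T": 1.0}]
half_restrained = restrained_m_step(counts, hypothesis, beta=0.5)
```

Because both inputs are probability distributions over the four bases, every blended column still sums to one for any β in the allowed range.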
The THEME algorithm performs a grid search over a range of settings for these two parameters in order to find the particular setting that yields the lowest mean cross-validation error on the dataset. We tested β values of 0.05, 0.1, 0.33, 0.5, 0.67, and 1.0. We tested C values of 1.0E-10, 1.0E-4, 1.0E-3, 1.0E-2, 0.05, 0.1, 1.0, 10.0, and 100.0. At each setting of β, we perform optimization of the hypothesis on the training set using EM. After scoring the sequences using the log-likelihood ratio criterion described in the main text, we train SVM classifiers at each setting of the parameter C and assess their performance on the held-out test data. We repeat this procedure for each partition of test and training data and store the mean test error for each setting of C at this particular setting of β. We then repeat the EM optimization of the hypothesis, using the same training and test set partitions, with a new β value. By repeating this procedure over all β values we are able to identify the particular setting of [β,C] that yields the lowest cross-validation error, and we report this error for the hypothesis.

ChIP Experiments

HepG2 cells (ATCC) and MIN6 cells (originally a gift of J. Miyazaki, Osaka Univ. (Ishihara et al. 1993)) were grown by the National Center for Cell Culture (University of Minnesota) under standard conditions in DMEM-10% FBS. At 80% confluence, the cell cultures were crosslinked in situ for 10 minutes with 1% (final) formaldehyde. After neutralization with 2.5 M glycine, followed by rinsing with 1x PBS, the cells were frozen in aliquots of 1 × 10^8 cells at -80°C until needed.

Human hepatocytes were isolated from livers obtained from deceased donors at the Liver Tissue Procurement and Distribution System (S. Strom, U-Pittsburgh) after liver digestion and Percoll gradient isolation of the hepatocyte fraction. Crosslinking, aliquoting, and freezing were performed as above.
After treatment with formaldehyde to covalently link transcriptional regulators to DNA sites of interaction, chromatin in cell lysates was sheared by sonication. The regulator-DNA complexes were enriched by chromatin-IP with specific antibodies, the crosslinks were reversed, and enriched DNA fragments and control genomic DNA fragments were amplified using ligation-mediated PCR. The amplified DNA preparations, labeled with distinct fluorophores, were mixed and hybridized onto a promoter array. We used the following antibodies: HNF3, gift of Robert Costa; E2F4, sc-1082; NeuroD1, sc-1084.

The data have been submitted to ArrayExpress under accession number E-WMIT-8.

References

Bailey, T. L. and C. Elkan (1995). "The value of prior knowledge in discovering motifs with MEME." Proc Int Conf Intell Syst Mol Biol 3: 21-9.

Bateman, A., L. Coin, et al. (2004). "The Pfam protein families database." Nucleic Acids Res 32 Database issue: D138-41.

Boeckmann, B., A. Bairoch, et al. (2003). "The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003." Nucleic Acids Res 31(1): 365-70.

Gordon, D. B., L. Nekludova, et al. (2005). "TAMO: a flexible, object-oriented framework for analyzing transcriptional regulation using DNA-sequence motifs." Bioinformatics 21(14): 3164-5.

Ishihara, H., T. Asano, et al. (1993). "Pancreatic beta cell line MIN6 exhibits characteristics of glucose metabolism and glucose-stimulated insulin secretion similar to those of normal islets." Diabetologia 36(11): 1139-45.

Matys, V., E. Fricke, et al. (2003). "TRANSFAC: transcriptional regulation, from patterns to profiles." Nucleic Acids Res 31(1): 374-8.

Roth, F. P., J. D. Hughes, et al. (1998). "Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation." Nat Biotechnol 16(10): 939-45.