SUPPLEMENTAL TEXT

Diversified gene regulation of the vertebrate-specific Pcdh clusters

The clustered Pcdh genes are absent from the Drosophila melanogaster and Caenorhabditis elegans genomes (Rubin et al. 2000; Hill et al. 2001; Noonan et al. 2004). I found that the constant protein sequences of the α and γ clusters are highly conserved among mammals, birds, amphibians, and fish (Supplemental Figure S2). The lengths of the α constant region polypeptide sequences are almost identical (Figure S2A); however, the constant polypeptide sequences of the zebrafish γ proteins are longer than those of the non-fish vertebrates (Figure S2B). The two draft genomes of sea squirts (invertebrate chordates: Ciona intestinalis and Ciona savignyi) do not seem to contain the Pcdh clusters (Dehal et al. 2002) (www.broad.mit.edu/annotation/ciona). Therefore, the Pcdh clusters seem to be an evolutionary novelty specific to vertebrates.

Although tandemly duplicated genes tend to have conserved promoter motifs, the motifs in different groups may have distinct characteristics. For example, the variable exons of the UGT1 gene clusters can be divided into phenol and bilirubin groups. No motif is common to the promoter regions of all of the UGT1 variable exons. However, there is a highly conserved motif upstream of each variable exon in the bilirubin group and a distinct motif in the phenol group (Zhang et al. 2004).

Each Pcdh variable exon is preceded by a distinct promoter (Tasic et al. 2002), and each Pcdh promoter contains a highly conserved "CGCT" core sequence motif in humans and mice (Wu et al. 2001; Noonan et al. 2004). I reasoned that the motifs for different Pcdh groups may have distinct characteristics. Using the Gibbs Motif Sampler program (Thompson et al. 2003) (bayesweb.wadsworth.org/gibbs), I searched the 350-bp regions upstream of the translation start codon for each member of 15 distinct groups of the chimpanzee, rat, and zebrafish Pcdh genes (Figure S3).
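The preparation step for such a search can be sketched in a few lines of Python. This is only an illustration, not the Gibbs Motif Sampler itself: it extracts the region upstream of a start codon and reports literal forward-strand matches to the "CGCT" core. The function names `upstream_region` and `find_core`, and the toy sequence in the usage lines, are hypothetical.

```python
def upstream_region(sequence, start_codon_index, length=350):
    """Return up to `length` bases immediately 5' of the start codon.

    `start_codon_index` is the 0-based position of the A in the ATG.
    Fewer than `length` bases are returned if the sequence is too short.
    """
    if start_codon_index < 0 or start_codon_index > len(sequence):
        raise ValueError("start codon index outside the sequence")
    begin = max(0, start_codon_index - length)
    return sequence[begin:start_codon_index].upper()


def find_core(region, core="CGCT"):
    """Return 0-based offsets of every (possibly overlapping) literal match."""
    hits = []
    pos = region.find(core)
    while pos != -1:
        hits.append(pos)
        pos = region.find(core, pos + 1)
    return hits


# Toy example (made-up sequence, not real Pcdh data).
toy = "AAACGCTAAATG"           # start codon ATG begins at index 9
region = upstream_region(toy, 9)
print(region, find_core(region))
```

A literal scan like this only checks for the core on the given strand. Discovering the flanking positions that distinguish the groups still requires a motif-finding program such as the one cited above.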
The motifs for chimpanzee and rat α (Figure S3, A and E), β (Figure S3, B and F), γa (Figure S3, C and G), and γb (Figure S3, D and H) are conserved. All contain a highly conserved "CGCT" core motif. However, the flanking sequences differ between the mammalian γa and γb genes (Figure S3, C, D, G, and H). This observation suggests that the regulation of the mammalian γa and γb groups may be different. The mammalian c-type Pcdh genes have a motif distinct from those of the other mammalian groups (Figure S3I).

The motifs of the zebrafish α and γ group 1 genes are similar to each other (Figure S3, J and M) and resemble those of the mammalian genes, which have a "CGCT" core sequence. The motifs for the zebrafish α group 2 and 3 genes have a weak "CAGT" sequence instead of the "CGCT" core (Figure S3, K and L). The zebrafish γ group 2 genes (Figure S3N) have a motif related to those of the mammalian γa group (Figure S3, C and G). The zebrafish γ group 3 genes have a distinct motif (Figure S3O). These results suggest that each zebrafish variable exon is preceded by a promoter that is related to a mammalian promoter, but that its regulation has diverged considerably from that of the mammalian Pcdh genes.

SUPPLEMENTAL LITERATURE CITED

Dehal, P., Y. Satou, R. K. Campbell, J. Chapman, B. Degnan et al., 2002 The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins. Science 298: 2157-2167.

Hill, E., I. D. Broadbent, C. Chothia and J. Pettitt, 2001 Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster. J. Mol. Biol. 305: 1011-1024.

Noonan, J. P., J. Grimwood, J. Schmutz, M. Dickson and R. M. Myers, 2004 Gene conversion and the evolution of protocadherin gene cluster diversity. Genome Res. 14: 354-366.

Rubin, G. M., M. D. Yandell, J. R. Wortman, G. L. Gabor Miklos, C. R. Nelson et al., 2000 Comparative genomics of the eukaryotes. Science 287: 2204-2215.

Tasic, B., C. E. Nabholz, K. K. Baldwin, Y. Kim, E. H.
Rueckert et al., 2002 Promoter choice determines splice site selection in protocadherin alpha and gamma pre-mRNA splicing. Mol. Cell 10: 21-33.

Thompson, W., E. C. Rouchka and C. E. Lawrence, 2003 Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res. 31: 3580-3585.

Wu, Q., T. Zhang, J. F. Cheng, Y. Kim, J. Grimwood et al., 2001 Comparative DNA sequence analysis of mouse and human protocadherin gene clusters. Genome Res. 11: 389-404.

Zhang, T., P. Haws and Q. Wu, 2004 Multiple variable first exons: a mechanism for cell- and tissue-specific gene regulation. Genome Res. 14: 79-89.

SUPPLEMENTAL FIGURE LEGENDS

Figure S1. RT-PCR of members of two clusters from the zebrafish brain RNA preparations. The amplified full-length coding sequences are indicated by the 3-kb bands. The smaller bands are alternatively spliced products. M, marker.

Figure S2. An alignment of vertebrate Pcdh α (A) and γ (B) constant protein sequences with conserved residues highlighted. The high degree of conservation between the two zebrafish α or γ constant regions demonstrates that these clusters are duplicated. HS, Homo sapiens; PT, Pan troglodytes; MM, Mus musculus; RN, Rattus norvegicus; GG, Gallus gallus; XT, Xenopus tropicalis; DR, Danio rerio.

Figure S3. Characteristics of the conserved promoter sequence motifs in vertebrate clustered Pcdh genes. Shown are graphic logo representations of the chimpanzee α (A), β (B), γa (C), and γb (D); rat α (E), β (F), γa (G), and γb (H); chimpanzee and rat c-type (I); zebrafish α groups 1 (J), 2 (K), and 3 (L); and zebrafish γ groups 1 (M), 2 (N), and 3 (O) motifs. The height of each symbol indicates the relative frequency of that nucleotide at that position.

Figure S4. An alignment of the human (A), chimpanzee (B), mouse (C), and rat (D) Pcdh α, β, and γ EC1-3 sequences with those of C-cadherin (C-cdh).
The ω+ codons predicted to be subject to positive selection with a posterior probability >0.90 by one model and >0.50 by at least one other model are highlighted in red for members of the α cluster, in green for β, in blue for γa, and in violet for γb. The corresponding positions in C-cadherin are highlighted accordingly. Positions predicted to be under positive selection in two or more groups are indicated by an asterisk.

SUPPLEMENTAL TABLE LEGENDS

Table S1. List of oligonucleotides used

Table S2. Log-likelihood values and parameter estimates for 22 human, chimpanzee, mouse, rat, and zebrafish Pcdh groups

¹Model: Maximum likelihood models implemented in the codeml program of the PAML package. M0, one-ratio; M1, neutral; M2, selection; M3, discrete; M7, β; M8, β+ω.

²ℓ: Log-likelihood values estimated by the codeml program.

³κ: Transition/transversion rate ratio estimated by the codeml program.

⁴Estimation of Parameters: ω = KA/KS, the nonsynonymous/synonymous rate ratio; p = proportion of sites in each site class. M0: one ω estimated for all sites. M1: estimate p0 = proportion of sites with ω = 0; p1 = 1 - p0, the proportion of sites with ω = 1. M2: estimate p0 (ω = 0), p1 (ω = 1), and ω2; p2 = 1 - p0 - p1. M3: estimate p0, p1, ω0, ω1, and ω2; p2 = 1 - p0 - p1. M7: estimate p and q (parameters of the β distribution of ω between 0 and 1). M8: same as M7, except for an additional site class in which an estimated ω is allowed.

⁵LRT (2Δℓ): Likelihood ratio test, comparing the test statistic (2Δℓ) calculated from paired codeml models (M1 vs. M2; M0 vs. M3; and M7 vs. M8) with the critical value of the asymptotic chi-square distribution with the appropriate degrees of freedom (i.e., 2 d.f., 4 d.f., and 2 d.f., respectively). The test statistic and its level of significance are shown for the M2, M3, and M8 models. Note that for none of the six zebrafish Pcdh groups are positively selected sites predicted by at least two pairs of codeml models. N/A, not applicable.
⁶Positively Selected Sites: Codon positions predicted to be under positive selection with a posterior probability >0.90 by one codeml model (M2, M3, or M8) and >0.50 by at least one other model. Note that residues are numbered comparably among Pcdh groups and between different species.

Table S3. Summary information for 22 Pcdh groups analyzed

¹Tree length: Measured by the codeml program as the number of nucleotide substitutions per codon along the tree.

²The ω+ sites: Codon positions predicted to be under positive selection with a posterior probability >0.90 by one codeml model (M2, M3, or M8) and >0.50 by at least one other model. Note that residues are numbered comparably among Pcdh groups and between different species.

Table S4. Log-likelihood values and parameter estimates for 8 primate and rodent Pcdh groups

¹Model: Maximum likelihood models implemented in the codeml program of the PAML package. M0, one-ratio; M1, neutral; M2, selection; M3, discrete; M7, β; M8, β+ω.

²ℓ: Log-likelihood values estimated by the codeml program.

³κ: Transition/transversion rate ratio estimated by the codeml program.

⁴Estimation of Parameters: ω = KA/KS, the nonsynonymous/synonymous rate ratio; p = proportion of sites in each site class. M0: one ω estimated for all sites. M1: estimate p0 = proportion of sites with ω = 0; p1 = 1 - p0, the proportion of sites with ω = 1. M2: estimate p0 (ω = 0), p1 (ω = 1), and ω2; p2 = 1 - p0 - p1. M3: estimate p0, p1, ω0, ω1, and ω2; p2 = 1 - p0 - p1. M7: estimate p and q (parameters of the β distribution of ω between 0 and 1). M8: same as M7, except for an additional site class in which an estimated ω is allowed.

⁵LRT (2Δℓ): Likelihood ratio test, comparing the test statistic (2Δℓ) calculated from paired codeml models (M1 vs. M2; M0 vs. M3; and M7 vs. M8) with the critical value of the asymptotic chi-square distribution with the appropriate degrees of freedom (i.e., 2 d.f., 4 d.f., and 2 d.f., respectively). The test statistic and its level of significance are shown for the M2, M3, and M8 models.
No positively selected sites are predicted for zebrafish Pcdh genes (data not shown). N/A, not applicable.

⁶Positively Selected Sites: Codon positions predicted to be under positive selection with a posterior probability >0.90 by one codeml model (M2, M3, or M8) and >0.50 by at least one other model. Note that residues are numbered comparably among Pcdh groups and between different species.
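The likelihood ratio tests and the site-reporting rule described in the legends to Tables S2 and S4 can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the codeml implementation: the function names are hypothetical, the log-likelihoods and posterior probabilities in the usage lines are invented, and the chi-square tail probability uses a closed form that holds only for even degrees of freedom (which covers the 2 and 4 d.f. used here).

```python
import math

# Degrees of freedom given in the table legends for each model pair.
LRT_DF = {("M1", "M2"): 2, ("M0", "M3"): 4, ("M7", "M8"): 2}


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form covers even, positive df only")
    if x <= 0:
        return 1.0
    half = x / 2.0
    terms = (half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * sum(terms)


def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return (2 * delta lnL, p-value) for a nested pair of models."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf_even_df(stat, df)


def positively_selected_sites(posteriors, high=0.90, low=0.50):
    """Codons with posterior > high in one model and > low in another.

    `posteriors` maps model name -> {codon position: posterior probability}.
    """
    positions = set()
    for probs in posteriors.values():
        positions.update(probs)
    selected = []
    for pos in sorted(positions):
        values = [probs.get(pos, 0.0) for probs in posteriors.values()]
        for i, value in enumerate(values):
            others = values[:i] + values[i + 1:]
            if value > high and any(o > low for o in others):
                selected.append(pos)
                break
    return selected


# Invented numbers, for illustration only.
stat, p = likelihood_ratio_test(-1000.0, -995.0, LRT_DF[("M7", "M8")])
print(round(stat, 2), round(p, 4))
print(positively_selected_sites({
    "M2": {10: 0.95, 20: 0.95},
    "M3": {10: 0.60, 20: 0.40},
    "M8": {10: 0.30, 20: 0.20},
}))
```

In the invented example, codon 10 is reported because a second model supports it above 0.50, whereas codon 20 is not, despite its high posterior under a single model.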