Supplementary Information Text


Alternative Splicing

An additional 296 genes have one or more EST sequences that overlap non-annotated exons and are flanked by canonical splice sites, which supports potential alternative splicing for a minimum of 712 of the genes (77%), with 212 loci showing no evidence of alternative splicing. As most of our conclusions are based on existing transcribed sequence data, the depth of the EST database is the limiting factor in positively identifying alternatively spliced loci. Unlike chromosome 19, where there is ample evidence for multiple instances of low-level alternative splicing, chromosome 5 has only 15 loci where two or more ESTs confirm a low-level alternative splice site.

This disparity may indicate that more attempts have been made to isolate low-level alternative transcripts for the loci on chromosome 19, or it may suggest that the genes on chromosome 5 as a whole are less likely to have rare splice variants.

Pseudogenes

Of the 577 pseudogenes, 445 have at least one frameshift or stop codon mutation relative to their parent genes. 43 of the 479 processed pseudogenes, which lack the introns present at the parent locus and display poly-A stretches in adjacent genomic sequence, were identified by manual validation of the collection of human processed pseudogenes determined by Zhang et al. 1. An additional 62 processed and 28 nonprocessed pseudogenes were identified via further manual inspection of the candidate gene loci. The accumulation of insults to the open reading frames and the intron loss in these genomic sequences confirm them as pseudogenes and suggest that they have lost their original function. However, 47 of these pseudogenes have at least one overlapping EST sequence, indicating that they are occasionally transcribed and may have adopted a functional role different from that of the parent gene.

Protocadherin Gene Family

The largest gene family on chromosome 5 is the protocadherin (PCDH) gene cluster, which consists of 53 tandemly arrayed, single-exon paralogous genes organized into three subclusters, designated α, β and γ 2. Each protocadherin exon encodes an extracellular domain consisting of six cadherin-like ectodomain repeats, a transmembrane domain and a short cytoplasmic tail. At the 3’ end of both the α and γ subclusters are an additional three short exons that are alternatively cis-spliced to each α and γ exon, providing a “constant” cytoplasmic region 2-4. Each protocadherin gene is transcribed from its own promoter, and all protocadherin cluster promoters share a highly conserved core motif 5, 6. Promoter choice appears to determine the splicing of a particular α or γ variable exon to the first constant region exon, in that the splice donor site of the transcribed variable exon is used in cis-splicing 3. Each neuron appears to express a distinct combination of protocadherin genes 7.

Protocadherin proteins are thought to form homophilic interactions at synapses, providing a molecular means to distinguish subsets of neurons based on the combinations of protocadherins they express 7, 8. Protocadherin clusters are present in many vertebrate species, although the sequence content differs greatly between mammals and other vertebrates. Protocadherin cluster genes in humans and other species also undergo frequent gene conversion events. These events are restricted to specific ectodomains, resulting in some ectodomains becoming nearly identical among paralogs while other ectodomains remain diverse. This process also generates allelic variants of human protocadherin cluster genes.

Comparative Methods

Regions of evolutionary conservation are detected using the program PEAK-VISTA (S. Prabhakar, unpublished work), which takes MLAGAN 9 alignments as input. PEAK-VISTA identifies statistically significant slowly evolving regions in three steps. First, noncoding regions in the input alignment are used to estimate the approximate local neutral mutation rates between all pairs of aligned sequences; the method is adapted from Cooper et al. 10. Second, the estimated rates are used to derive a log-likelihood score for slow vs. neutral evolution at each aligned position, similar to the strategy of Boffelli et al. 11. Third, conserved regions show up as high-scoring segments, which are assigned p-values relative to random permutations of the alignment columns. The statistical formalism for computing p-values is identical to that of the NCBI BLAST algorithm.
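PEAK-VISTA itself is unpublished, so the following is only a minimal Python sketch of the general idea described above, not the program's actual implementation. It assumes a pairwise alignment, a Jukes-Cantor match probability, a hypothetical `slow_factor` that defines "slow" evolution as a fraction of the neutral rate, and a simple empirical permutation p-value in place of the BLAST-style statistics mentioned in the text. All function and parameter names are illustrative.

```python
import math
import random


def jc_match_prob(d):
    """Probability that two aligned bases are identical under Jukes-Cantor
    at distance d (substitutions per site)."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)


def column_score(is_match, neutral_d, slow_factor=0.5):
    """Log-likelihood ratio of slow vs. neutral evolution for one column.

    slow_factor is an assumed parameter: the slow model evolves at
    slow_factor * neutral_d.
    """
    p_neutral = jc_match_prob(neutral_d)
    p_slow = jc_match_prob(slow_factor * neutral_d)
    if is_match:
        return math.log(p_slow / p_neutral)
    return math.log((1.0 - p_slow) / (1.0 - p_neutral))


def best_segment(scores):
    """Return (start, end_exclusive, total) of the maximal-scoring
    contiguous segment, found with Kadane's algorithm."""
    best = (0, 0, 0.0)
    run_start, run_total = 0, 0.0
    for i, s in enumerate(scores):
        if run_total <= 0.0:
            # A non-positive running sum can never help; restart here.
            run_start, run_total = i, 0.0
        run_total += s
        if run_total > best[2]:
            best = (run_start, i + 1, run_total)
    return best


def permutation_p_value(scores, n_perm=1000, seed=0):
    """Empirical p-value: fraction of column permutations whose best
    segment scores at least as high as the observed best segment."""
    rng = random.Random(seed)
    observed = best_segment(scores)[2]
    shuffled = list(scores)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if best_segment(shuffled)[2] >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy example with made-up sequences and an assumed neutral distance.
human = "ACGTACGTTTGACCA"
mouse = "ATGAACGTTTGACGA"
scores = [column_score(h == m, neutral_d=0.5) for h, m in zip(human, mouse)]
start, end, total = best_segment(scores)
p_value = permutation_p_value(scores)
```

Matching columns receive positive scores and mismatches negative ones, so a run of conserved columns accumulates into a high-scoring segment; shuffling the columns destroys that clustering while preserving the overall match frequency.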

To generate substitution rates for the chimp/human comparison, we constructed four-way alignments of human, mouse, chimpanzee and rat using MLAGAN and limited our analysis to conserved regions with p-values less than a cutoff number using PEAK-VISTA (S. Prabhakar, unpublished work). Simple Perl scripts calculated the substitution rates of the sequence falling in each category, based on our internal annotation of chromosome 5. To ensure comparison of high-quality alignments and truly orthologous sequences, we limited our analysis to aligned segments with reasonable levels of nucleotide diversity (0.5 between primates and rodents, 0.05 between primates, and 0.25 between rodents), encompassing approximately 130 Mb or 70% of the finished chromosome. It should be noted that the observed non-coding/non-conserved substitution rate was consistent among the simple repeats, repetitive elements and nonrepeat sequences encompassed by this category. Finally, because of the extremely high sequence similarity between humans and chimpanzees, all calculations employed the Jukes-Cantor substitution model without corrections.
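As an illustration of why a simple model suffices at this level of similarity, the sketch below computes the observed fraction of differing sites in a pairwise alignment and the corresponding Jukes-Cantor distance, d = -3/4 ln(1 - 4p/3). It is not the Perl code used for the analysis; the function names and the example divergence values are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Fraction of differing sites among aligned columns where both
    sequences have an unambiguous base (gaps and N are skipped)."""
    valid = set("ACGT")
    compared = 0
    diffs = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in valid and b in valid:
            compared += 1
            if a != b:
                diffs += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return diffs / compared


def jukes_cantor(p):
    """Jukes-Cantor distance (substitutions per site) for an observed
    fraction p of differing sites; defined only for 0 <= p < 0.75."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Illustrative values only: a low divergence and a high one.
low_p, high_p = 0.0125, 0.4
low_gap = jukes_cantor(low_p) - low_p      # small: d is close to p
high_gap = jukes_cantor(high_p) - high_p   # large: multiple hits matter
```

For small p the Jukes-Cantor distance is nearly equal to the raw fraction of differences, whereas at higher divergence the two separate noticeably because multiple substitutions at the same site become common.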

Additional tiling set information

The distance from the most distal cosmid to the true telomere is estimated to be 63 kb for the p-telomere and 20 kb for the q-telomere (Harold Riethman, pers. comm.). A 5q-telomere-containing ‘half-YAC’ (Riethman 2001) has been identified and sequenced by subcloning into a cosmid vector. No half-YAC has been identified for the 5p-telomere.

The boundary between euchromatin and heterochromatin at the centromere was identified by the presence of centromere-specific alpha satellite repeats.

Supplementary Finishing Information

The tiling set of chromosome 5 consists of 1763 finished clones. Of these, 1685 clones were drafted at the Joint Genome Institute and finished at the Stanford Human Genome Center, while 71 clones were drafted and finished at Lawrence Berkeley National Laboratory. The following clones were drafted and/or finished elsewhere:

AC022493 Human Genome Sequencing Center, Baylor

AC025156 Human Genome Sequencing Center, Baylor

AC009757 Whitehead Institute/MIT Center for Genome Research

AC020728 Washington University, Genome Sequencing Center

AC002428 Washington University, Genome Sequencing Center

AC022217 Washington University, Genome Sequencing Center

AP006257 Riken, Genomic Sciences Center

Supplementary references

1. Zhang, Z., Harrison, P. M., Liu, Y. & Gerstein, M. Millions of years of evolution preserved: a comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13, 2541-2558 (2003).

2. Wu, Q. & Maniatis, T. A striking organization of a large family of human neural cadherin-like cell adhesion genes. Cell 97, 779-790 (1999).

3. Tasic, B. et al. Promoter choice determines splice site selection in protocadherin α and γ pre-mRNA splicing. Mol. Cell 10, 21-33 (2002).

4. Wang, X., Su, H. & Bradley, A. Molecular mechanisms governing Pcdh-γ gene expression: evidence for a multiple promoter and cis-alternative splicing model. Genes Dev. 16, 1890-1905 (2002).

5. Wu, Q. et al. Comparative DNA sequence analysis of mouse and human protocadherin clusters. Genome Res. 11, 389-404 (2001).

6. Noonan, J. P. et al. Extensive linkage disequilibrium, a common 16.7 kilobase deletion, and evidence of balancing selection in the human protocadherin α cluster. Am. J. Hum. Genet. 72, 621-635 (2003).

7. Kohmura, N. et al. Diversity revealed by a novel family of cadherins expressed in neurons at a synaptic complex. Neuron 20, 1137-1151 (1998).

8. Obata, S. et al. Protocadherin Pcdh2 shows properties similar to, but distinct from, those of classical cadherins. J. Cell Sci. 108, 3765-3773 (1995).

9. Brudno, M. et al. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13, 721-731 (2003).

10. Cooper, G. M. et al. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res. 14, 539-548 (2004).

11. Boffelli, D. et al. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299, 1391-1394 (2003).
