Supplemental discussion of randomness of gene density

SUPPLEMENTARY METHODS

Construction of clone map and sequence map of Chromosome 18

We began by constructing a path of overlapping large-insert clones across Chromosome 18. The final version of the clone path contains 596 large-insert clones (primarily bacterial artificial chromosomes (BACs)) from 9 libraries (Table S3). The average length of a large-insert clone on the Chromosome 18 path is 192575 bp, and the average physical overlap between clones is approximately 50 kb. In many cases, the redundant overlapping regions of clones were not completely finished, so the average length of a finished accession is 162036 bp and the average overlap between finished accessions is 36713 bp.

The path was built in several stages. First, seed clones were selected for sequencing based on their positions in an initial physical map built from restriction-fragment-based fingerprints of clones (18q)1, and from an integrated STS marker map (18p) (http://stt.gsc.riken.jp/projects/imap.html). These clones were then assembled into ordered contigs based on shared sequence and STS marker content along with fingerprint data. Additional clones were then selected to extend and join the sequence contigs on the basis of shared sequence content, either in the form of clone-end sequences from characterized BAC libraries2 or STSs. Clone fingerprint data were not used to extend the sequence-ready clone paths. Once these options were exhausted, further clones extending the sequence contigs were identified by hybridization of radioactive ‘overgo’ probes3 to filters containing additional clone libraries (providing >30x physical coverage of the genome). The 18p telomere was isolated from a Chromosome 18-specific Fosmid library constructed at RIKEN, while the 18q telomere was isolated from a library of yeast artificial chromosomes (YACs) enriched for such regions (termed half-YACs4, 5). Unique sequences from the path of finished clones were concatenated to produce a ‘finished’ sequence of Chromosome 18.
A detailed discussion of validation of the accuracy of clone overlaps and of resolution of polymorphisms in clone overlaps can be found elsewhere6.

Building sequence ready maps at Broad Institute/Whitehead Center for Genome Research.

Sequence-ready clone paths providing deep coverage of some regions of Chromosome 18q were constructed in the following manner. YACs were selected based on their position on the Whitehead physical map7. Plugs of genomic DNA were prepared and YACs purified on the basis of size by CHEF gel electrophoresis8. Random shotgun sequences representing roughly 0.1X coverage of the YAC were generated and screened against the S. cerevisiae genome, the YAC vector and a variety of contaminants, and then masked for human repeats with RepeatMasker (www.repeatmasker.org). The remaining high-quality sequences were used to design ‘overgo’ probes, which were radioactively labeled by primer extension3, pooled in batches of up to 60 loci and hybridized to filters containing the RPCI-11 BAC library. For validation and deconvolution, positive clones were streaked to single colonies, picked and replicated to filters that were hybridized to each overgo probe individually. Deep paths of BACs were assembled on the basis of shared marker content, from which a minimal tiling path of clones was selected for sequencing.

Genome walking to extend clone paths.

Electronic identification of walking clones.

All available large-insert clone-end sequences were aligned to all available human genome sequence using the WU-BLAST program (http://blast.wustl.edu) without repeat masking. Clones for which the end sequences indicated that the insert might extend into gap regions were grown and then end sequenced for validation. Validated clones were shotgun sequenced and assembled, and electronic walks were iterated if necessary.

Hybridization-based identification of walking clones.
In cases where electronic walking yielded no candidates, overgo probes were designed and hybridized in pools (as above) to library filters containing the RPCI-11 and RPCI-13 libraries, representing a total of over 50x physical coverage of the genome. Positive clones were end sequenced for deconvolution and validation. The clones were then shotgun sequenced and assembled, and walks were iterated if necessary.

Sample sequencing of YACs to identify walking markers.

For some regions of the clone path, YACs from the CEPH library were identified that spanned gaps. These YACs were prepped and sample sequenced to 0.25-1X (as above). Sequences passing the two screens described above and not matching the known human genome were presumed to represent the region of the YAC insert spanning the gap. These sequences were used to design overgo probes for screening as above.

Building sequence ready maps at RIKEN Genomic Sciences Center.

For Chromosome 18p, seed clones were initially selected by PCR screening of the RPCI-11 BAC library using high-density human STS markers derived from the "Integrated Marker Arrangement Project" (http://stt.gsc.riken.jp/projects/imap.html). Clones were assembled into contigs on the basis of multiple shared STS markers. These clones were then sample sequenced, and new chromosome-walking primers were designed from the resulting sequence for further screening. A set of minimum-tiling path clones was then selected for full-scale sequencing. These steps were repeated until all gaps were closed or no additional clones could be identified. Some representative clones in contigs were also examined cytogenetically to confirm their localization on the chromosome, although not all such clones were included in the minimum tiling path.

Centromeric and telomeric sequence contig ends.

The 18p telomere was sequenced in a Fosmid derived from a flow-sorted Chromosome 18-specific library. It was identified by screening with primers based on telomeric repeats9, and ends in >100 tandem telomeric CCCTAA repeats.
A half-YAC containing the 18q telomere was sequenced and telomeric repeats were identified in the clone, but we were not able to link them to the assembly. We conservatively estimate (based on the YAC size and reads assembled) that no more than 21 kb of sequence are missing. Chromosome 18-specific Fosmids were also used to approach the centromere from 18p. We identified arrays of alpha centromeric repeat sequences on each side of the centromere.

Standards for clone overlap validation.

Overlaps between finished clones in the final path were validated as agreed to by the International Human Genome Sequencing Consortium. Each clone path overlap was required to have a minimum length of 2 kb and a minimum identity of 99.6% to be accepted without additional data. Overlaps that did not meet these thresholds were required to be validated by additional data. For clone pairs that did not meet the overlap length threshold, the following data types were able to demonstrate accurate juxtaposition of clones on the path:

1. Overlapping drafted clone. An unfinished clone (or clones) that spans the overlap region by reaching into unique sequence on either side.
2. Large insert clone read pair placement. Mated end read pairs from human whole genome shotgun Fosmid and BAC libraries that are placed in unique sequence on either side of the overlap and demonstrate appropriate spacing. Size constraints on Fosmid inserts make their spacing particularly informative. At least 2 spanning mate pairs were required, but the number was almost always much larger.
3. PCR spanning the overlap region. A clean PCR product derived from whole human genome DNA, generated from primers unique in the genome in sequences flanking the overlap.

For clone pairs that met the overlap length threshold but did not meet the polymorphism threshold, the nature of the differences was examined in detail.
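As an illustration only, the acceptance rule described above can be sketched in Python. The 2 kb and 99.6% thresholds come from the text; the function name, the return labels, and the order in which the two checks are applied are assumptions made for this sketch, not the consortium's or the authors' implementation.

```python
MIN_OVERLAP_BP = 2_000   # minimum overlap length stated in the text (2 kb)
MIN_IDENTITY = 0.996     # minimum identity stated in the text (99.6%)


def classify_overlap(length_bp: int, identity: float) -> str:
    """Return a hypothetical label describing how an overlap would be handled."""
    if length_bp < MIN_OVERLAP_BP:
        # Too short: juxtaposition must be shown by a drafted clone,
        # spanning mate pairs, or a spanning PCR product.
        return "needs_juxtaposition_evidence"
    if identity < MIN_IDENTITY:
        # Long enough but too divergent: differences are examined in detail.
        return "needs_polymorphism_review"
    return "accepted"
```

For example, an overlap of 36713 bp at 99.9% identity would be accepted without additional data, whereas a 1.5 kb overlap would require one of the three juxtaposition data types regardless of its identity.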
In a small number of cases the difference was due to an insertion/deletion of an Alu element or a difference in repeat copy number in an SSLP. These were deemed to be polymorphisms in the absence of further data if the juxtaposition standard was met. In two cases, base divergence above the threshold could not be explained by repeat insertion. These were resequenced by PCR from a panel of 24 human samples and demonstrated to be true alleles.

Large insert clone sequencing at Broad Institute/Whitehead Center for Genome Research.

Subclone libraries of large-insert clones were prepared in M13 or one of several plasmid subclone vectors, and sequenced with the dideoxy chain termination method using one of several versions of big dye chemistry10. Data were detected on several models of ABI sequencing machines and assembled with Phrap (http://www.phrap.org) or Arachne11, 12. Assemblies were visualized for finishing with either Gap413 or Consed14. A combination of the following methods was used to close sequence gaps and resolve low quality regions and misassemblies: transposon insertion-based sequencing, primer walking, PCR, and shattered insert libraries15. Finished sequence assemblies of all large-insert clones were validated by comparison to restriction digestion patterns generated by 3-5 six-cutter enzymes16.

Large insert clone sequencing at RIKEN Genomic Sciences Center.

Briefly, shotgun sequencing was performed to provide 8-fold coverage of draft sequences using an ET terminator cycle sequencing kit with MegaBACE 1000 capillary sequencers. In addition, we constructed plasmid clone libraries from appropriate restriction fragments and sequenced both ends of these clones to provide 2-fold additional coverage using one of several versions of the BigDye terminator cycle sequencing kit with ABI 3700 capillary sequencers. Data were assembled using Phred/Phrap/Consed17, 18, 14.
Sequence gap-filling and resequencing of low quality regions in the assembled data were performed by a nested deletion method19, PCR, primer walking and direct sequencing of BAC clones.

Estimation of clone path gap sizes.

We were able to estimate the sizes of the three euchromatic gaps in human Chromosome 18 (see Table S2) because the flanking sequences for each gap fall in a region showing extended synteny to the mouse genome3. Human and mouse sequences were masked with RepeatMasker21. Human sequences flanking gaps were aligned to the mouse with PatternHunter22, and short aligning sequences in the human were ordered along the mouse sequence. From the full set of mouse-human alignments, we extracted the set of locally bidirectionally-unique alignments. Order and position of aligning sequences were confirmed by visual inspection of a graphical representation of the alignments. We identified the human sequences closest to both sides of the gap, and determined the distance between them on the mouse assembly. The gap distance was then adjusted for the average expansion of the human genome relative to mouse of 15%. Finally, the distances between the last aligning human sequences and the ends of the finished clones in which they reside were subtracted.

Alignment of markers from genetic and radiation hybrid maps to Chromosome 18.

Of 158 markers on Chromosome 18 on a human genetic map23, we were able to download sequence or primers for 156 from NCBI databases. For those with only primer sequences, we first aligned the primers to the chromosome sequence using MegaBLAST24, then manually curated their position, retrieved the matching sequence and combined this with the markers for which we had complete sequence. We then placed all markers on the chromosome, again by MegaBLAST, and manually checked any markers whose best placement was possibly ambiguous (score < 2x second best score). All markers could be placed on the finished sequence of the chromosome.
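The ambiguity criterion used for marker placement (best score less than twice the second-best score) can be illustrated with a short Python sketch. The function name and the use of a plain list of alignment scores are assumptions for illustration; the text does not describe how the check was implemented, and positive scores are assumed.

```python
from typing import Sequence


def is_ambiguous_placement(scores: Sequence[float], ratio: float = 2.0) -> bool:
    """Flag a marker for manual review when its best hit does not clearly win.

    `scores` holds the alignment scores of all candidate placements of one
    marker (hypothetical input format). A marker with fewer than two hits
    has no competing placement and is not flagged here.
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return False
    best, second = ranked[0], ranked[1]
    # Criterion from the text: score < 2x second best score.
    return best < ratio * second
```

A marker with hits scoring 200 and 150 would be flagged, because 200 is less than 2 x 150; a marker with hits scoring 200 and 90 would not.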
We performed a similar analysis on 335 radiation hybrid map markers retrieved from UCSC (T. Furey, personal communication). Of these, we excluded 3 pairs that had multiple RH map positions under different names (but were in fact the same sequence). Using ePCR25, we were able to place all but 2 markers, which hit no sequence in the finished sequence of the chromosome.

Laboratory validation of gene model extensions

As described in the main text, many of our evidence-based gene models extend both 3’ and 5’ of the corresponding RefSeq gene model. To validate our annotation methods, we obtained independent evidence of transcription for a sample of our gene models. We examined the 5’ extensions not supported by RefSeq data in 15 transcripts from human Chromosome 17 that were manually annotated at the Broad Institute. Since all manual annotation of human chromosomes at the BI was performed using identical methodology, these 15 genes are an appropriate sample. The data generated confirm the accuracy of our gene models and the efficacy of the underlying annotation methodology. We performed PCR using one primer near the 5’ end of each gene model and a second primer matching a linker sequence present at the 5’ end of the cDNA library. We cloned and sequenced several PCR products for each of the 15 transcripts. Ten loci gave informative sequence data, in each case confirming our 5’ extensions. In nine cases this independent evidence extended to within 10 bp of the 5’ end of our gene model. In the remaining case, the PCR product confirmed 87 out of 123 bases in our extension. These observations validate our annotation methodology, which was performed with equal care for all 4 finished human chromosomes annotated by this team.
Five of the 15 produced no interpretable data: one locus gave no passing sequence reads; 2 loci gave sequence reads that appeared to arise from partially-spliced or unspliced RNA; and 2 loci gave sequence reads that suggested an exon structure divergent from both the RefSeq evidence and our gene models and so may represent an alternative splice form.

Validation of EST placements

As part of our analysis to validate our gene calls, all ESTs that were aligned to manually annotated genes on Chromosome 18 were aligned to the entire human genome (build 34). There were 4 cases of gene models for which the supporting ESTs had best hits to other chromosomes: G2024, HsG643, HsG2271 and HsG3285. Each of these was manually reviewed. G2024 is also supported by protein homology to c21orf81. HsG643 has a single exon with a polyA tail and a perfect ORF match to a known protein. It is annotated as a Retrotransposed Copy. HsG2271 also has a single exon and is annotated as a Retrotransposed Copy. It is also annotated by Ensembl. HsG3285 is annotated as a Novel CDS, is supported by ESTs and homology to a mouse protein, and is contained in an intron of HsG1161. These data reflect our careful use and interpretation of EST data, which we used whenever possible to improve known gene models, and to find potential homologs and pseudogenes of source genes located on other chromosomes.

Supplemental discussion of annotation rules and methods

We emphasized aligned mRNA and protein sequences when analyzing the gene content of the finished 76.1 Mb euchromatic portion of human Chromosome 18. In accord with Hawk2 (www.sanger.ac.uk/Info/workshops/hawk2) conventions, gene models were grouped into the following categories:

1. Known - identical to a human mRNA sequence, usually from RefSeq or MGC.
2. Novel_CDS - identical to spliced human ESTs and coding for a protein with nonidentical homology to a protein in the public databases.
3. Novel_Transcript - identical to a spliced human EST with canonical splice junctions, no homology to known mRNAs or proteins, at least one exon without repeat content and the longest ATG to stop codon open reading frame (ORF) spans more than one exon. This category includes genes predicted by ab initio tools, including Genscan31 and FGENESH (Softberry Inc., Mount Kisco, NY), which were annotated when one or more exons overlapped a mouse-human BLASTN alignment (evolutionary conserved regions or ecores32). The comparative version of FGENESH was not used.
4. Putative - identical to a spliced human EST with canonical splice junctions, no homology to known mRNAs or proteins, at least one exon without repeat content and the longest ATG to stop codon open reading frame (ORF) is entirely contained in one exon.
5. Pseudogene - gene models that contain a disrupted ORF containing more than 50% of the ORF of a known protein at greater than 50% identity. This category is further subdivided into single exon processed pseudogenes and multi-exon unprocessed pseudogenes.
6. Putative novel transcript - gene models containing an ORF with less than 50% of the ORF of a known protein and greater than 50% identity.

Annotation of Novel and Putative loci.

The few novel and putative loci that we annotated were required to align over at least 30% of their length to either the mouse or dog genome with at least 70% identity. Some lineage-specific and rarely detected genes will be lost using this method, as will gene models with a very high ratio of UTR to coding sequence. We found an additional set of 268 potential transcript fragments (comprising at most 220 loci) supported by only one or 2 spliced ESTs, or supported by 3 or more ESTs but lacking sufficient conservation on the dog or mouse genomes. Although some of these low confidence gene models may represent functional transcription units, overall these gene models are too speculative to include in our high confidence annotations.
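The conservation requirement for novel and putative loci can be written as a simple filter. The sketch below is illustrative: the thresholds (30% of length, 70% identity, mouse or dog) are taken from the text, while the `GenomeAlignment` record and the function name are hypothetical and do not describe the authors' pipeline.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class GenomeAlignment:
    species: str             # e.g. "mouse" or "dog"
    aligned_fraction: float  # fraction of the gene model length that aligns (0-1)
    identity: float          # fractional identity over the aligned portion (0-1)


def passes_conservation(alignments: Iterable[GenomeAlignment],
                        min_fraction: float = 0.30,
                        min_identity: float = 0.70) -> bool:
    """True if the locus aligns well enough to either the mouse or dog genome."""
    return any(
        a.species in {"mouse", "dog"}
        and a.aligned_fraction >= min_fraction
        and a.identity >= min_identity
        for a in alignments
    )
```

Under this reading, a locus aligning over 45% of its length to dog at 82% identity passes even if it has no mouse alignment, whereas a locus aligning over 20% of its length fails however high the identity.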
Splice form confidence levels.

We attempted to determine a confidence level for these splice forms to address concerns about using spliced ESTs as evidence for variant transcripts. We chose to exclude 97 potential transcript variants for which the EST structure cannot be distinguished from a partially-spliced RNA, even though some of these may be valid transcripts. We analyzed a larger set of candidate alternate splice forms drawn from annotation of human chromosomes 18, 8, 15 and 17, and found that 60% of new exons were a multiple of 3 bases long, preserving the reading frame across the insertion. This is significantly higher than the 39% predicted due to chance alone33. The frame will be restored in many of the remaining cases by additional splice variation seen within the same molecule, such as addition of new exons, alternate splice site usage, and exon skipping. Splice variants that introduce new stop codons, which may render the transcript a candidate for nonsense-mediated decay (NMD), occur at a somewhat higher rate than is seen in RefSeq sequences; as a result, 85% of the annotated splice variants are not candidates for NMD, compared to 96% of RefSeqs.

Comparison of Broad manual annotations to Ensembl

We compared our manual gene annotations (mapped by VEGA onto Build 35) to the Ensembl (www.ensembl.org) annotation of Build 35 for Chromosome 18. We compiled lists of all exonic features present in one annotation set but not the other. Out of 3310 total Broad exonic features, 399 were not present in Ensembl, and of 3197 Ensembl exonic features 399 were not present in Broad. All Broad exons not in Ensembl were broken down by HAWK category, and are listed below. 95 Broad and 78 Ensembl exons were manually reviewed.

Of exonic features present in Broad but not Ensembl, all appeared to be parts of valid genes:
39 are in ‘novel’ genes. 20 were sampled and all are supported by EST evidence.
17 are in ‘putative’ genes.
All were sampled and all are supported by EST evidence.
360 are in ‘known’ genes, which have RefSeq evidence. 20 were sampled and all are well supported. 41 are present in RefSeq gene models, while the rest are present in alternative splice forms that were constructed using evidence.
47 are in ‘novel CDS’ genes. 20 were sampled and all are supported by EST and homology to known proteins.
59 are in ‘predicted plus’ genes. All were sampled and in each case, 2 or more exons are supported by well-conserved regions in dog and/or mouse.
9 are in gene fragments. All were sampled and each one matches a portion of the CDS of a known gene.

Of exonic features present in Ensembl but not Broad, none appears to have supporting evidence (EST, mRNA, RefSeq etc.) to indicate that we missed a gene in our annotation:
2 correspond to exons within Broad pseudogenes but not included in our pseudogene model.
48 fall in introns of Broad gene models. All appear to have no supporting evidence. 3 of these overlap repeat sequences with coding potential.
16 overlap gene predictions only. 3 of these overlap repeats.
1 overlaps a repeat only.
4 exons correspond to one novel transcript gene that was not a part of our manual annotation because the EST evidence on which it is based (BQ716208) did not exist at that time. It has been added to our gene set.
2 exons correspond to a known gene (FLJ25715, NM_182570.1) that was not a part of our manual annotation because the RefSeq entry on which it is based did not exist at that time.
5 appear to be 5’ exons that are predicted in genes with Broad annotations, but not supported.

In addition, there were 36 genes annotated as pseudogenes by Broad that were annotated as genes by Ensembl.
In summary, although many exons present in the Broad set and absent from Ensembl fall into low confidence categories by HAWK rules, our review of a sizable sample shows no evidence that any of the exons not present in the Ensembl set represent misannotations according to HAWK standards. The two exceptions represent gene calls for which the primary evidence did not exist at the time the Broad manual annotation was performed, and they have been added to our gene set. Conversely, the smaller number of Ensembl exons not present in the Broad set appear highly enriched for loci that, under manual inspection, appear to be pseudogenic or to lack sufficient experimental evidence to be annotated as genes.

Automated whole genome annotation

The BI’s automated annotation system uses an evidence-based approach to identify and annotate genes. With this approach, every annotated gene is supported by some transcriptional evidence, either from full-length mRNA, expressed sequence tag (EST) or protein sequences. We employed this system on the finished whole human genome (build 34). First, we scanned the finished genome for CpG islands, repeats and tRNAs using CpG finder (E. Mauceli, unpublished), RepeatMasker21 and tRNAscan-SE39, respectively. We ran two commonly used ab initio gene prediction programs (FGENESH40 and GENSCAN31). We performed BlastX41 searches of the genome against the non-redundant protein database. Protein sequences from the top two BlastX hits at each locus were then aligned to the genome using the Genewise42 program. Human mRNA and EST sequences from public databases were aligned to the genomic axis using BLAT43. All searches except BLAT were done on repeat-masked sequence. EST and mRNA alignments with 90% identity and 50% query coverage were retained as potential supporting evidence, but only the uniquely aligned EST/mRNA features were used as primary evidence.
These high-confidence alignments with canonical splice junctions were then clustered to identify an initial set of gene loci. In our process, expressed evidence in the form of mRNA alignments from the RefSeq44 database, if present, was given precedence over other types of evidence in defining the reference transcript model at each locus. EST clusters with canonical splice junctions that differed from the reference transcript model were annotated as alternate splice forms. We used single-exon ESTs as evidence only if they overlapped either the 3’ or 5’ ends of transcripts derived from spliced expressed evidence. We assigned the start and stop codons to the annotated known genes using their homology to known proteins. For novel genes with no protein homology, we chose as the default ORF the longest reading frame with an in-frame start and stop codon. Gene models with sequences homologous to proteins (over 50% of the subject length) with a disrupted CDS of an active gene found elsewhere in the genome were annotated as pseudogenes. We classified and assigned gene product names to the annotated transcripts according to the human annotation workshop (HAWK) guidelines (http://www.sanger.ac.uk/HGP/havana/hawk.shtml).

Chimp analysis.

The sequence of Chromosome 18 was compared to the recently completed draft sequence of the chimpanzee genome (Chimpanzee Genome Sequencing Consortium, manuscript in preparation). The human and chimpanzee sequences were aligned using BLASTZ47 and chainNet48, and only reciprocally best alignments were kept. Divergence rates were estimated from all aligned chimpanzee nucleotides passing the relaxed NQS(30,25) quality filter (quality score of at least 30 at the compared base, at least 25 at the five flanking bases on each side, and any number of flanking substitutions allowed), yielding a predicted error rate of less than 1/20,000 (ref. 49). Chimpanzee chromosome nomenclature follows McConkey50.
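The relaxed NQS(30,25) filter described above can be illustrated with a minimal Python sketch. It assumes quality scores are available as a plain list with one integer per base; the function name and the decision to reject bases that lack five flanking positions on either side are assumptions for illustration, not details given in the text.

```python
from typing import List


def passes_relaxed_nqs(quals: List[int], i: int,
                       min_center: int = 30,
                       min_flank: int = 25,
                       flank: int = 5) -> bool:
    """Relaxed neighborhood quality standard check for the base at index i.

    Requires quality >= min_center at the compared base and >= min_flank at
    each of the `flank` bases on both sides. The number of flanking
    substitutions is not restricted in the relaxed form, so it is not checked.
    """
    # Assumption: a base without a full flanking window fails the filter.
    if i - flank < 0 or i + flank >= len(quals):
        return False
    if quals[i] < min_center:
        return False
    neighbors = quals[i - flank:i] + quals[i + 1:i + flank + 1]
    return all(q >= min_flank for q in neighbors)
```

Only bases passing this check in the chimpanzee read would contribute to the divergence estimate under this sketch; a single low-quality neighbor within the 11-base window is enough to exclude the compared base.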
SUPPLEMENTARY DATA:

Gene families

Chromosome 18 is home to several groups of paralogous genes, including members of the laminin and cadherin gene families. Many of these proteins function in hemidesmosomes and desmosomes, adhesive junctions that link the extracellular space to the intermediate filament cytoskeleton. These types of adhesions are particularly important in epithelial tissues. Two genes encoding the laminin subunits α1 and α3 flank the centromere, in 18p11 and 18q11 respectively. These laminin α chains, when complexed with other laminin subunits, are assembled into extracellular matrices that support attachment of epithelial cells. In hemidesmosomes, the integrin α6β4 links the laminin-containing basement membrane to intracellular intermediate filaments. A set of 11 cadherin genes lies on 18q, including the entire human complement of desmosomal cadherins (3 desmocollins and 4 desmogleins). Cadherins support cell-cell adhesion by binding to other cadherins expressed on nearby cells, through a calcium-dependent mechanism. In desmosomes, the desmocollins and desmogleins form the transmembrane linkage between adjacent epithelial cells and their intermediate filament cytoskeletons.

Numerous human skin disorders are associated with perturbation of the function of these Chromosome 18 adhesion genes. Mutations in the laminin α3 gene26 cause junctional epidermolysis bullosa, mutations in the desmoglein 1 gene27 result in keratosis palmoplantaris striata I, and mutations in the desmoglein 4 gene28 affect hair follicle biology, giving rise to a hairless phenotype in both human (inherited hypotrichosis) and mouse (lanceolate hair). Autoimmune targeting of desmoglein 1, 3 or 4 can cause the blistering diseases Pemphigus foliaceus or Pemphigus vulgaris28, 29.
Another interesting gene family encodes serine proteinase inhibitors (serpins30), which are unusual ‘suicide’ protease inhibitors that function by binding covalently to their targets, destroying both the protease and themselves. Chromosome 18q21.33-q22.1 contains a cluster of 10 serpin genes of clade B, constituting the majority of the 13 such genes in the human genome. While most serpins are secreted and function extracellularly, the clade B serpins have no signal sequence and most are not secreted. Rather, their roles include protection of cells from proteinase-mediated injury30. The cluster is in proximity to 3 cadherin genes (CDH 7, 19 and 20), 2 large gene deserts (1.6 and 2.0 Mb) and the cell death gene BCL-2.

Overlapping genes

Consistent with some of the recent reports of widespread occurrence of overlapping genes in the human and mouse genomes34, 35, we found a total of 58 pairs of overlapping genes on Chromosome 18. Table S11 lists the number and modes of overlap we observed. We discovered a complex pattern of overlaps by close examination of 21 pairs of known-known gene overlaps in a 165 kb region involving four distinct known genes: CLUL1, hypothetical protein similar to LOC147447, TYMS and ENOSF1. The hypothetical protein gene (similar to LOC147447) shares a tail-tail overlap with the CLUL1 gene and a head-head overlap with the TYMS gene. This head-head overlap includes 95 and 72 bases of the coding regions of TYMS and the hypothetical protein gene, respectively. In addition, the TYMS gene also shares a tail-tail overlap with a transcript variant of the rTS beta gene. The 3’ UTR of rTS beta is complementary to TYMS and has been experimentally shown to be involved in its down-regulation36. In another, unrelated tail-tail overlap between the Niemann-Pick disease, type C1 (NPC1) gene and the Colon cancer-associated protein Mic1 (MIC1 or C18orf8) gene, 83 bases of the coding region of NPC1 overlap the 3’ UTR of MIC1 and 80 bases of the coding region of MIC1 overlap the 3’ UTR of NPC1.
It is likely that some of these natural antisense overlaps found on Chromosome 18 are involved in the regulation of expression of the protein-coding genes at various levels such as transcription, mRNA processing, splicing, stability, transport and translation. Based on our observations on Chromosome 18, we believe that, because UTRs are under-represented in the public annotations, the number and size of overlaps between genes are likely more extensive than estimated previously. Although we observed several additional overlaps (>50) between known genes and potential novel transcriptional units, we did not include them in our analysis as these novel gene candidates did not meet our criteria to be included as novel loci. Some of these antisense transcriptional units may be noncoding or lineage-specific genes involved in regulation of known protein-coding genes.

Interpro analysis

We found that 62% of annotated genes on Chromosome 18 have Pfam domains. The majority (99%) of these are in the ‘known’ and ‘novel CDS’ classes of proteins. C2H2 type Zn finger domains are the most common domain on Chromosome 18 and the second most common in the genome. Of the 20 most frequently occurring Pfam domains found in Chromosome 18 proteins, 15 are also common domains in the entire human proteome. Of the 5 over-represented Pfam domains, 3 are due to the cadherin and serpin gene clusters. Two are laminin chain domains: two out of 5 laminin chain genes are on Chromosome 18. The remaining domain (Lipoxygenase, LH2 domain) has 5 copies present in 2 nonparalogous proteins.

Additional Paralogous Genes

Paralogous genes on Chromosome 18 were investigated by clustering with cd-hit37 followed by analysis of a clustalw alignment38. All proteins encoded by annotated transcripts and pseudogenes on Chromosome 18 were clustered with cd-hit using a 65% threshold (-c 0.65). Sequences from multiple loci that clustered together with cd-hit were investigated.
In addition to the known serpin and cadherin clusters, this method detected a myosin regulatory light chain cluster (18p11.31) consisting of 2 genes and a three-member elongin cluster (18q21.1). The other closely related sequence groups detected by this method were 8 groups of pseudogenes (3 ribosomal proteins, tera proteins, tubulin 4Q, p160-ROCK (ROCK1) and a pair of novel transcripts annotated in a duplicated portion). A more sensitive clustering was carried out using clustalw. Pseudogenes, gene fragments, novel transcripts and putative transcripts were removed from the Chromosome 18 protein set. The resulting set was subjected to cd-hit clustering using a threshold of 0.4 which, following removal of proteins less than 70 residues in length, resulted in 316 proteins. These 316 were aligned with clustalw and a bootstrapped phylogenetic tree was created. Members of all clades supported by a bootstrap value of 40 or more were investigated. 6 potential paralog clusters were identified. 3 are clearly paralogs: SLC14A1 and SLC14A2, ZNF24 and ZNF396, SMAD2 and SMAD4.

Conservation analysis of putative novel gene C18Orf2

To confirm that C18Orf2 encodes a functional protein, we devised an experiment to amplify fragments from several primates and resequence them to verify conservation of the splice sites, coding frame and coding sequence. Because the gene is also interrupted by a primate-specific inversion, we also chose primers that would amplify across the inversion breakpoints to attempt to determine the approximate date of the inversion. Samples used for amplification were one human (Coriell NA15510), 4 chimpanzees (including Clint, the genome sequence donor [Coriell NS06006]; the other chimpanzees are not publicly available), an orangutan, two old world monkeys (patas and macaque), two new world monkeys (woolly and spider), and a lemur.
Because the region is highly repetitive and there is uncertainty surrounding the exact location of the downstream inversion breakpoint, we were, in several cases, unable to design successful primers for the region. None of our PCR attempts across either breakpoint yielded products, nor did primers for 2 of the 5 coding exons (exons 3 and 6). For the remaining three coding exons, we were able to generate sequences for human, all four chimpanzees, and orangutan. For exon 4 only, we also generated and aligned sequence from the wooly monkey, which showed no conservation in either splice site, making it highly unlikely that C18Orf2 is conserved to new world monkeys.

For the three aligned exons in human, chimpanzee, and orangutan, we were able to confirm that the splice sites are conserved at both ends in chimpanzee and orangutan for all exons and that the ATG is conserved (the stop is in the missing exon 6). None of the chimpanzees showed any fixed changes or polymorphism within the aligned coding sequence (185 bp), an outcome that would be expected only about 10% of the time for a single chimpanzee over that length of neutral sequence. Visual examination of the traces over this region showed no heterozygous bases in any of the human, chimpanzee, or orangutan sequences (this by itself is unsurprising, but it indicates that interspecies variation was not masked by fortuitous assignment of a variant; heterozygous bases were observed in several individuals in intronic regions). However, the orangutan had 6 substitutions, 3 each in exons 4 and 5, all amino-acid altering.

Overall, these data are consistent with our hypothesis that C18Orf2 represents a newly evolved gene. Given the splicing mutations in exon 4 in the wooly monkey, it seems likely that, if this is a functional gene, it arose after the divergence of old world and new world monkeys.
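The figure of roughly 10% quoted above for an unchanged 185 bp stretch can be reproduced with a short calculation. The sketch below assumes that sites diverge independently and uses a per-site human-chimpanzee divergence of about 1.23%, a commonly cited genome-wide estimate; that rate is an assumption of this sketch and is not stated in the text.

```python
def prob_no_difference(length_bp, divergence_per_site):
    """Probability that a neutral stretch shows no difference at any site,
    assuming sites diverge independently at the given per-site rate."""
    return (1.0 - divergence_per_site) ** length_bp


ALIGNED_CODING_BP = 185      # aligned coding length, from the text
ASSUMED_DIVERGENCE = 0.0123  # assumption: per-site rate, not given in the text

p_unchanged = prob_no_difference(ALIGNED_CODING_BP, ASSUMED_DIVERGENCE)
print(f"P(no difference over {ALIGNED_CODING_BP} bp) = {p_unchanged:.3f}")
```

Under these assumptions the probability comes out close to 0.10, so a complete absence of human-chimpanzee differences is compatible with neutral sequence but somewhat unlikely.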
Search for newly evolved genes

Since a break in synteny allowed us to identify C18Orf2 as an apparent newly evolved gene in the primate lineage, we wondered whether there might be additional new genes on Chromosome 18 outside of conserved syntenic blocks, which would not have been found in our analysis. To address this, we asked the general question of whether any multiexon genes on Chromosome 18 lack a mouse homolog. We compared all manually annotated exons on Chromosome 18 with the set of conserved synteny anchors (see Methods) to identify transcripts that contain no exons that align to anchor sequences, and thus are not conserved to mouse and/or dog. Specifically, exons from all 1,194 manually annotated Chromosome 18 transcripts (derived from ESTs and mRNAs) were compared with 22,179 human-mouse synteny anchors. After eliminating pseudogenes, variants, partial transcripts, obvious paralogs, single-exon transcripts and low-confidence transcripts (transcript type_id 6 or above), we were left with 10 human Chromosome 18 transcripts that were potentially absent in mouse. Each was manually reviewed using BLAT43 and BLAST45 against other genome sequences and against the non-redundant protein database.

Seven of the transcripts have clear homologs in mouse. The remaining three candidates are all hypothetical proteins: HsG648, HsG2120 and C18Orf2 itself. HsG648 has homologs in chimp, dog and pig. It was apparently lost in the rodent lineage since, although it is absent in mouse and rat, synteny of the two genes flanking HsG648 is conserved in both species. HsG2120 has identifiable homologs in human and chimpanzee only, and not in dog, mouse, rat or other mammalian assemblies that are publicly available. This gene model has two exons, both of which show similarity to LTR-type repeats. Thus, HsG2120 probably does not represent a real gene. In summary, we identified no candidate newly evolved genes other than C18Orf2.
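The first filtering step of this search, finding transcripts none of whose exons coincide with a synteny anchor, can be sketched as an interval-overlap test. The sketch is a simplified illustration rather than the procedure actually run: it treats "aligns to an anchor" as coordinate overlap on a single chromosome, uses half-open integer intervals, and all names and coordinates are hypothetical.

```python
from bisect import bisect_right


def build_anchor_index(anchors):
    """Sort anchors by start and precompute the running maximum end."""
    ordered = sorted(anchors)
    starts = [start for start, _ in ordered]
    max_end = []
    running = float("-inf")
    for _, end in ordered:
        running = max(running, end)
        max_end.append(running)
    return starts, max_end


def overlaps_any(exon, index):
    """True if the half-open exon interval overlaps at least one anchor."""
    starts, max_end = index
    exon_start, exon_end = exon
    # Only anchors that start before the exon ends can overlap it.
    i = bisect_right(starts, exon_end - 1)
    # Among those, one overlaps if its end lies beyond the exon start.
    return i > 0 and max_end[i - 1] > exon_start


def transcripts_without_anchor(transcripts, anchors):
    """Names of transcripts with no exon overlapping any anchor."""
    index = build_anchor_index(anchors)
    return sorted(
        name
        for name, exons in transcripts.items()
        if not any(overlaps_any(exon, index) for exon in exons)
    )


# Hypothetical coordinates, for illustration only.
example_anchors = [(100, 200), (500, 650)]
example_transcripts = {
    "tx_conserved": [(50, 120), (300, 350)],
    "tx_candidate": [(210, 260), (700, 800)],
}

candidates = transcripts_without_anchor(example_transcripts, example_anchors)
print(candidates)
```

Transcripts returned by such a filter are only candidates; as in the text, each one still needs the later exclusions (pseudogenes, variants, paralogs and so on) and manual review.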
SUPPLEMENTARY DISCUSSION

Supplemental discussion of randomness of gene density

To test the significance of the observed low gene content on Chromosome 18, we compared the actual number of genes found with the number of genes expected on Chromosome 18 assuming a random distribution of genes across the genome. The observed gene density of Chromosome 18 would be expected to occur by chance with a probability of less than 10^-12. Gene density for a chromosome is defined here as the average number of annotated genes per Mb. We used the Ensembl annotation of build 34 so that we could analyze a uniform, standard annotation of the whole genome. A chi-squared test showed the genome-wide distribution of genes on chromosomes to be significantly non-random (p < 10^-100, see Table S9). We found that among autosomes, chromosomes 21, 9, 10, 15, 6, 7, 14, and 12 best conform to a random gene distribution model (p > 10^-7), whereas the remaining autosomes have gene counts that differ significantly (p < 10^-7) from the random distribution. Based on this whole-genome annotation, Chromosome 18 (p << 10^-6) has the second lowest gene density; only Chromosome 13 has a lower one. We note that, based on manual annotation of those chromosomes published to date, Chromosome 18 has the lowest gene density (4.4 genes/Mb), while Chromosome 13 has a significantly higher density (6.6 genes/Mb).

Additional discussion of conserved noncoding sequence methods

Coding sequence. The "coding sequence" set contains all bases that are annotated as coding in any transcript (both constitutive and alternative exons) as predicted by Ensembl. The methods used do not require annotation of coding sequence on informant genomes (i.e., mouse, dog). All alignments are done to the human reference. Prediction of significance of conservation is obviously symmetric, but the denominators of syntenic length, fraction conserved, and fraction coding are all calculated purely on the basis of human sequence and annotation.

Dog and Mouse “Composite” Z-scores.
Dog and mouse Z-scores associated with shared repeats were negligibly correlated and their joint probability density was centered near the origin. In contrast, windows associated with coding exons showed a strong positive correlation and led to a probability distribution concentrated on the positive diagonal. Motivated by this empirical separation, we defined a composite Z-score (Zcomp) as the linear combination of Zdog and Zmouse, Zcomp = a*Zmouse + (1 - a)*Zdog with 0 <= a <= 1, that has the lowest empirical variance for ancestral repeats. In this way, we minimized the confounding effect of background or spurious conservation when trying to identify true selection; using mouse and dog together achieves a better signal-to-noise tradeoff than either alone.

Impact of pseudogenes. We assessed the possibility that pseudogene sequences might display a conservation signature and, if so, whether our analysis of the distribution of conserved non-coding sequences could be biased by a significant number of such conserved sequences. In theory, this should not be so. Pseudogenes fall into two classes: those that are processed and those that result from segmental duplication (local or distant). The former class must be recent in order to show significant similarity to the parent gene (if they are truly pseudogenes), and so would not be aligned by our method, as in the vast majority of cases they would have no orthologous sequence in the conserved syntenic block. The latter might meet the criterion, but would likely have been discarded from our alignments because we aligned only unambiguous 1:1 orthologous sequences.

We started with our manual annotation of pseudogenes on Chromosome 18. The 171 pseudogenes identified contain a total of 332 "pseudoexons", with a total length of 145,160 bases, or 0.2% of the chromosome length and 0.18% of the conserved syntenic length.
Since our method detects conserved sequences on the basis of their presence in blocks of conserved synteny, genes or pseudogenes must align syntenically to be detected as conserved. In fact, only two of the pseudogenes on Chromosome 18 align syntenically: HsG663 and HsG866. These two pseudogenes have a total length of only 2105 bases, or 0.003% of the length of the syntenic blocks we detected. Thus, pseudogene sequences have a negligible impact on our analysis of Chromosome 18.

HAWK standards set a high bar for annotation of pseudogenes, and Chromosome 18 is relatively pseudogene poor. To assess the impact of all pseudogenes (on top of those identified by HAWK standards), we performed a genome-wide analysis on the set of putative loci detected genome-wide by Torrents et al.46 (from build 34), which represents a more comprehensive set of pseudogenes. For the 19 non-Broad annotated chromosomes (for which our alignments are on build 34 and match the Torrents coordinates), we assessed conservation of the 21,997 loci, consisting of 7,704,824 bases. In total, only 5259 50 bp windows (of a possible 1.5 million) containing 20 or more syntenically aligned bases overlapped pseudogene loci. Those windows that did align were contained in a total of only 74 of the 21,997 loci. We did observe strongly positive Z-scores (mean of 1.82 for mouse and 2.22 for dog) for those that aligned, which is not surprising in light of Torrents’ estimate that 5% of these regions might actually be undiscovered true coding genes. In fact, a manual examination of a small number revealed that in at least one case the pseudogene locus overlaps an Ensembl gene with RefSeq support.

Across the whole genome, our method identified just over 18 million non-exonic windows as being under positive selection (of over 500 million windows created by this method).
If every one of the 5259 aligned pseudogene windows had p=1 of being selected (the real average would probably be 0.5), they would contribute <0.04% of the non-exonic conservation in the genome. Even if all of these windows were in a single 5 Mb region (of 489 analyzed), they would change that region’s value by <20% of the calculated value. From this we conclude that, as expected, pseudogenes have negligible impact on our estimation of non-coding conservation.

Note on Build 34 alignments. When we generated our original alignments of build 34 to mouse and dog, we substituted the 4 manually annotated Broad chromosomes (8, 15, 17 and 18) for their build 34 counterparts so that we could take advantage of our manual annotations. When we moved to build 35 to use Ensembl for the whole-genome analysis, we used the standard build. Since the Torrents et al. pseudogene analysis has not yet been updated to build 35, we used our existing “build 34” genome. However, because we cannot rerun only the 4 chromosomes (uniqueness constraints of our synteny algorithm require that all the alignments be processed simultaneously), we analyzed only the subset for which we had equivalent build 34 alignments. The omitted chromosomes represent 13.5% of the genome, and 16.5% and 14.5% of pseudogene features and bases, respectively, in the Torrents set. As the results of the analysis overwhelmingly show that pseudogenes do not align, we do not believe that including these chromosomes would change this finding.

Impact of non-coding RNAs. Known non-coding RNAs (ncRNAs) represent a minute fraction of the chromosome sequence: only 4 tRNAs were found on the chromosome. Thus, they present far too little length for statistical significance in our test.

Supplementary References

1. International Human Genome Mapping Consortium. (2001) A physical map of the human genome. Nature 409:934-941.
2. International Human Genome Sequencing Consortium. (2001) Initial sequencing and analysis of the human genome.
Nature 409:860-921.
3. Marra, M.A., et al. (1997) High throughput fingerprint analysis of large-insert clones. Genome Res. 7:1072-1084.
4. Riethman, H.C., et al. (1989) Cloning human telomeric DNA fragments into Saccharomyces cerevisiae using a yeast-artificial-chromosome vector. Proc. Nat. Acad. Sci. 86:6240-6244.
5. Macina, R.A., et al. (1995) Molecular cloning and RARE cleavage mapping of human 2p, 6q, 8q, 12q, and 18q telomeres. Genome Res. 5:225-232.
6. International Human Genome Sequencing Consortium. (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931-945.
7. Hudson, T.J., et al. (1995) An STS-based map of the human genome. Science 270:1945-1954.
8. Chu, G., Vollrath, D., Davis, R.W. (1986) Separation of large DNA molecules by contour-clamped homogeneous electric fields. Science 234:1582-1585.
9. Park, H.S., et al. (2000) Newly identified sequences, derived from human chromosome 21qter, are also identified in the subtelomeric region of particular chromosomes and 2q13, and are conserved in the chimpanzee genome. FEBS Lett. 475:167-169.
10. Rosenblum, B.B., et al. (1997) New dye-labeled terminators for improved DNA sequencing patterns. Nucleic Acids Res. 25:4500-4504.
11. Batzoglou, S., et al. (2002) ARACHNE: a whole-genome shotgun assembler. Genome Res. 12:177-189.
12. Jaffe, D.B., et al. (2003) Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13:91-96.
13. Bonfield, J.K., Smith, K., Staden, R. (1995) A new DNA sequence assembly program. Nucleic Acids Res. 23:4992-4999.
14. Gordon, D., Abajian, C., Green, P. (1998) Consed: a graphical tool for sequence finishing. Genome Res. 8:195-202.
15. MacMurray, A.A., Sulston, J.E., Quail, M.A. (1998) Short-insert libraries as a method of problem solving in genome sequencing. Genome Res. 8:562.
16. Wong, G.K., Yu, J., Thayer, E.C. and Olson, M.V. (1997) Multiple-complete-digest restriction fragment mapping: generating sequence-ready maps for large-scale DNA sequencing. Proc Natl Acad Sci USA. 94:5225-5230.
17. Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186-194.
18. Ewing, B., et al. (1998) Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.
19. Hattori, M., et al. (1997) A novel method for making nested deletions and its application for sequencing of a 300 kb region of human APP locus. Nucleic Acids Res. 25:1802-1808.
20. Mouse Genome Sequencing Consortium. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520-562.
21. Smit, A. and Green, P. (1999) RepeatMasker at http://ftp.genome.washington.edu/RM/RepeatMasker.html.
22. Ma, B., Tromp, J., and Li, M. (2002) PatternHunter: faster and more sensitive homology search. Bioinformatics 18:440-445.
23. Kong, A., et al. (2002) A high-resolution recombination map of the human genome. Nat. Genet. 31:225-226.
24. Zhang, Z., et al. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7:203-214.
25. Rotmistrovsky, K., Jang, W., Schuler, G.D. (2004) A web server for performing electronic PCR. Nucleic Acids Res. 32 (Web Server issue):W108-W112.
26. McGowan, K.A., Marinkovich, M.P. (2000) Laminins and human disease. Microsc. Res. Tech. 51:262-279.
27. Hunt, D.M., et al. (2001) Spectrum of dominant mutations in the desmosomal cadherin desmoglein 1, causing the skin disease striate palmoplantar keratoderma. Eur. J. Hum. Genet. 3:197-203.
28. Kljuic, A., et al. (2003) Desmoglein 4 in hair follicle differentiation and epidermal adhesion: evidence from inherited hypotrichosis and acquired pemphigus vulgaris. Cell 113:249-260.
29. Garrod, D.R., Merritt, A.J., Nie, Z. (2002) Desmosomal cadherins. Curr. Opin. Cell. Biol. 5:537-545.
30. Silverman, G.A., et al. (2004) Human clade B serpins (ov-serpins) belong to a cohort of evolutionarily dispersed intracellular proteinase inhibitor clades that protect cells from promiscuous proteolysis. Cell. Molec. Life Sci. 61:301-325.
31. Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268:78-94.
32. Roest Crollius, H., et al. (2000) Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25:235-238.
33. Resch, A., et al. (2004) Evidence for a subpopulation of conserved alternative splicing events under selection pressure for protein reading frame preservation. Nucleic Acids Res. 32:1261-1269.
34. Yelin, R., et al. (2003) Widespread occurrence of antisense transcription in the human genome. Nat. Biotechnol. 4:379-386.
35. Kiyosawa, H., et al. (2003) Antisense transcripts with FANTOM2 clone set and their implications for gene regulation. Genome Res. 13:1324-1334.
36. Chu, J. and Dolnick, B.J. (2002) Natural antisense (rTSalpha) RNA induces site-specific cleavage of thymidylate synthase mRNA. Biochim. Biophys. Acta. 1587:183-193. Review.
37. Li, W., Jaroszewski, L., Godzik, A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 17:282-283.
38. Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.
39. Lowe, T.M. and Eddy, S.R. (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res. 25:955-64.
40. Salamov, A. and Solovyev, V. (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10:516-522.
41. Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
42. Birney, E., Clamp, M., Durbin, R. (2004) GeneWise and Genomewise. Genome Res. 14:988-95.
43. Kent, W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res. 12:656-64.
44. Pruitt, K.D., Tatusova, T., and Maglott, D.R. (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33:D501-D504.
45. Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25:3389-3402.
46. Torrents, D., Suyama, M., Zdobnov, E. and Bork, P. (2003) A genome-wide survey of human pseudogenes. Genome Res. 13:2559-2567.
47. Schwartz, S., et al. (2003) Human-mouse alignments with BLASTZ. Genome Res. 13:103-7.
48. Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W. and Haussler, D. (2003) Evolution's cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proc Natl Acad Sci USA. 100:11484-11489.
49. Altshuler, D., et al. (2000) An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407:513-6.
50. McConkey, E.H. (2004) Orthologous numbering of great ape and human chromosomes is essential for comparative genomics. Cytogenet. Gen. Res. 105:157-158.
