Supplemental discussion of randomness of gene density

SUPPLEMENTARY METHODS
Construction of clone map and sequence map of Chromosome 18
We began by constructing a path of overlapping large-insert clones across Chromosome
18. The final version of the clone path contains 596 large insert clones (primarily
bacterial artificial chromosomes (BACs)) from 9 libraries (Table S3). The average length
of a large insert clone on the Chromosome 18 path is 192575 bp, and the average
physical overlap between clones is approximately 50kb. In many cases, the redundant
overlapping regions of clones were not completely finished, so the average length of a
finished accession is 162036 bp and the average overlap between finished accessions is
36713 bp. The path was built in several stages. First, seed clones were selected for
sequencing based on their positions in an initial physical map built from restriction-fragment fingerprints of clones (18q)1, and from an integrated STS marker map
(18p) (http://stt.gsc.riken.jp/projects/imap.html). These clones were then assembled into
ordered contigs based on shared sequence and STS marker content along with fingerprint
data. Additional clones were then selected to extend and join the sequence contigs, on the
basis of shared sequence content, either in the form of clone-end sequences from
characterized BAC libraries2 or STSs. Clone fingerprint data were not used to extend the
sequence-ready clone paths. Once these options were exhausted, further clones extending
the sequence contigs were identified by hybridization of radioactive ‘overgo’ probes3 to
filters containing additional clone libraries (providing >30x physical coverage of the
genome). The 18p telomere was isolated from a Chromosome 18-specific Fosmid library
constructed at RIKEN, while the 18q telomere was isolated from a library of yeast
artificial chromosomes (YACs) enriched for such regions (termed half-YACs4, 5). Unique
sequences from the path of finished clones were concatenated to produce a ‘finished’
sequence of Chromosome 18. A detailed discussion of validation of the accuracy of clone
overlaps and of resolution of polymorphisms in clone overlaps can be found elsewhere6.
Building sequence ready maps at Broad Institute/Whitehead Center for Genome
Research.
Sequence-ready clone paths providing deep coverage of some regions of Chromosome
18q were constructed in the following manner. YACs were selected based on their
position on the Whitehead physical map7. Plugs of genomic DNA were prepared and
YACs purified on the basis of size by CHEF gel electrophoresis8. Random shotgun
sequences representing roughly 0.1X coverage of the YAC were generated and screened
against the S. cerevisiae genome, the YAC vector and a variety of contaminants, and then
masked for human repeats with RepeatMasker (www.repeatmasker.org). The remaining
high quality sequences were used to design ‘overgo’ probes, which were radioactively
labeled by primer extension3, pooled in batches of up to 60 loci and hybridized to filters
containing the RPCI-11 BAC library. For validation and deconvolution, positive clones
were streaked to single colonies, picked and replicated to filters that were hybridized to
each overgo probe individually. Deep paths of BACs were assembled on the basis of
shared marker content, from which a minimal tiling path of clones was selected for
sequencing.
Genome walking to extend clone paths.
Electronic identification of walking clones. All available large-insert clone-end sequences
were aligned to all available human genome sequence using the WU-BLAST program
(http://blast.wustl.edu) without repeat masking. Clones for which the end sequences
indicated that the insert may extend into gap regions were grown and then end sequenced
for validation. Validated clones were shotgun sequenced and assembled, and electronic
walks iterated if necessary.
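The electronic selection step above can be sketched as follows. This is a minimal illustration, not the pipeline that was actually used: the `EndHit` record, the function name and the 200 kb insert-size bound are all assumptions introduced here (the bound is chosen only because the average clone on the path is roughly 190 kb). A clone is flagged when its end read points toward a contig end that lies closer than one insert length away.

```python
from dataclasses import dataclass


@dataclass
class EndHit:
    """Hypothetical record for one clone-end alignment on a sequence contig."""
    clone: str
    position: int       # 0-based start of the end-read alignment on the contig
    points_right: bool  # True if the read is oriented toward the contig's right end


def may_extend_into_gap(hit: EndHit, contig_length: int, max_insert: int = 200_000) -> bool:
    """Return True if the insert could reach past the contig end the read points to.

    max_insert is an assumed upper bound on insert size, not a value from the text.
    """
    if hit.points_right:
        distance_to_end = contig_length - hit.position
    else:
        distance_to_end = hit.position
    return distance_to_end < max_insert


hits = [
    EndHit("cloneA", 950_000, True),   # near the right end, pointing outward
    EndHit("cloneB", 950_000, False),  # near the right end, pointing inward
    EndHit("cloneC", 50_000, False),   # near the left end, pointing outward
]
candidates = [h.clone for h in hits if may_extend_into_gap(h, contig_length=1_000_000)]
print(candidates)
```

Clones flagged this way would still need the laboratory end-sequence validation described above before being sequenced.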
Hybridization-based identification of walking clones. In cases where electronic walking
yielded no candidates, overgo probes were designed and hybridized in pools (as above) to
library filters containing the RPCI-11 and RPCI-13 libraries, representing a total of over 50x
physical coverage of the genome. Positive clones were end sequenced for deconvolution
and validation. The clones were then shotgun sequenced and assembled, and walks
iterated if necessary.
Sample sequencing of YACs to identify walking markers. For some regions of the clone
path YACs from the CEPH library were identified that spanned gaps. These YACs were
prepped and sample sequenced to 0.25-1X (as above). Sequences passing the above
screens and not matching the known human genome were presumed to represent the
region of the YAC insert spanning the gap. These sequences were used to design overgo
probes for screening as above.
Building sequence ready maps at RIKEN Genomic Sciences Center.
For Chromosome 18p, seed clones were initially selected by PCR screening of the RPCI-11 BAC library using high-density human STS markers derived from the "Integrated
Marker Arrangement Project" (http://stt.gsc.riken.jp/projects/imap.html). Clones were
assembled into contigs on the basis of multiple shared STS markers. These clones were
then sample sequenced, from which new chromosome-walking primers were designed for
further screening. A set of minimum-tiling path clones was then selected for full-scale
sequencing. These steps were repeated until all gaps were closed or no additional clones
could be identified. Some representative clones in contigs were also examined
cytogenetically to confirm their localization on the chromosome, although not all such
clones were included in the minimum tiling path.
Centromeric and telomeric sequence contig ends.
The 18p telomere was sequenced in a Fosmid derived from a flow-sorted Chromosome
18-specific library. It was identified by screening with primers based on telomeric
repeats9, and ends in >100 tandem telomeric CCCTAA repeats. A half-YAC containing
the 18q telomere was sequenced, and telomeric repeats identified in the clone, but we
were not able to link them to the assembly and conservatively estimate (based on the
YAC size and reads assembled) that no more than 21 kb of sequence are missing.
Chromosome 18-specific Fosmids were also used to approach the centromere from 18p.
We identified arrays of alpha centromeric repeat sequences on each side of the
centromere.
Standards for clone overlap validation.
Overlaps between finished clones in the final path were validated as agreed to by the
International Human Genome Sequencing Consortium. Each clone path overlap was
required to have a minimum length of 2kb, and a minimum identity of 99.6% to be
accepted without additional data. Overlaps that did not meet this threshold were required
to be validated by additional data.
For clone pairs that did not meet the overlap length threshold, the following data
types were able to demonstrate accurate juxtaposition of clones on the path:
1. Overlapping drafted clone. An unfinished clone (or clones) that spans the overlap
region by reaching into unique sequence on either side.
2. Large insert clone read pair placement. Mated end read pairs from human
whole-genome shotgun Fosmid and BAC libraries that are placed in unique sequence on
either side of the overlap and demonstrate appropriate spacing. Size constraints on
Fosmid inserts make their spacing particularly informative. At least 2 spanning
mate pairs were required, but the number was almost always much larger.
3. PCR spanning the overlap region. A clean PCR product derived from whole
human genome DNA generated from primers unique in the genome in sequences
flanking the overlap.
For clone pairs that met the overlap threshold, but did not meet the polymorphism
threshold, the nature of the differences was examined in detail. In a small number of
cases the difference was due to an insertion/deletion of an Alu element or a difference in
repeat copy number in an SSLP. These were deemed to be polymorphisms in the
absence of further data if the juxtaposition standard was met. In two cases, base
divergence exceeded the threshold and could not be explained by repeat insertion. These
were resequenced by PCR from a panel of 24 human samples and demonstrated to be true
alleles.
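The acceptance rule described in this section can be summarized in a short sketch. The two thresholds (2 kb minimum length, 99.6% minimum identity) come from the text; the function name and the status labels are hypothetical and serve only to show which follow-up each failing overlap would need.

```python
MIN_OVERLAP_BP = 2_000   # minimum overlap length stated in the text
MIN_IDENTITY = 0.996     # minimum overlap identity stated in the text


def overlap_status(length_bp: int, identity: float) -> str:
    """Classify a clone overlap; the returned labels are illustrative only."""
    if length_bp < MIN_OVERLAP_BP:
        # Too short: needs a spanning draft clone, mate pairs, or a spanning PCR product.
        return "needs juxtaposition evidence"
    if identity < MIN_IDENTITY:
        # Long enough but too divergent: differences must be examined in detail.
        return "needs polymorphism review"
    return "accepted"


print(overlap_status(36_713, 0.9991))
print(overlap_status(1_500, 1.0))
print(overlap_status(10_000, 0.990))
```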
Large insert clone sequencing at Broad Institute/Whitehead Center for Genome
Research. Subclone libraries of large-insert clones were prepared in M13 or one of
several plasmid subclone vectors, and sequenced with the dideoxy chain termination
method using one of several versions of BigDye chemistry10. Data were collected on
several models of ABI sequencing machines and assembled with Phrap
(http://www.phrap.org) or Arachne11, 12. Assemblies were visualized for finishing with
either Gap413 or Consed14. A combination of the following methods was used to close
sequence gaps and resolve low quality regions and misassemblies: transposon
insertion-based sequencing, primer walking, PCR, and shattered insert libraries15. Finished
sequence assemblies of all large insert clones were validated by comparison to restriction
digestion patterns generated by three to five 6-cutter enzymes16.
Large insert clone sequencing at RIKEN Genomic Sciences Center. Briefly, shotgun
sequencing was performed to provide 8-fold coverage of draft sequences using an ET
terminator cycle sequencing kit with MegaBACE 1000 capillary sequencers. In addition,
we constructed plasmid clone libraries from appropriate restriction fragments and
sequenced both ends of these clones to provide 2-fold additional coverage using one of
several versions of the BigDye terminator cycle sequencing kit with ABI 3700 capillary
sequencers. Data were assembled using Phred/Phrap/Consed17, 18, 14. Sequence gap-filling
and resequencing of low quality regions in the assembled data were performed by a
nested deletion method19, PCR, primer walking and direct sequencing of BAC clones.
Estimation of clone path gap sizes. We were able to estimate the sizes of the three
euchromatic gaps in human Chromosome 18 (see Table S2) because the flanking
sequences for each gap fall in a region showing extended synteny to the mouse genome3.
Human and mouse sequences were masked with RepeatMasker21. Human sequences
flanking gaps were aligned to the mouse with PatternHunter22, and short aligning
sequences in the human were ordered along the mouse sequence. From the full set of
mouse-human alignments, we extracted the set of locally bidirectionally-unique
alignments. Order and position of aligning sequences were confirmed by visual
inspection of a graphical representation of the alignments. We identified human
sequences closest to both sides of the gap, and determined the distance between them on
the mouse assembly. The gap distance was then adjusted for the average expansion of the
human genome relative to mouse of 15%. Finally, the distances between the last aligning
human sequences and the ends of the finished clones in which they reside were
subtracted.
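The arithmetic of this estimate can be written out as a brief sketch. It assumes that the 15% adjustment is applied as a multiplicative factor of 1.15 to the mouse distance, which is our reading of the text rather than something it states explicitly; the function and parameter names are illustrative.

```python
def estimate_gap_size(mouse_distance_bp: int,
                      left_flank_bp: int,
                      right_flank_bp: int,
                      expansion: float = 0.15) -> int:
    """Estimate a human gap size from the syntenic mouse distance.

    mouse_distance_bp: distance on the mouse assembly between the alignments
        closest to each side of the gap.
    left_flank_bp / right_flank_bp: distance from each of those alignments to
        the end of the finished human clone in which it resides.
    expansion: assumed average expansion of human relative to mouse.
    """
    scaled = mouse_distance_bp * (1 + expansion)
    return round(scaled - left_flank_bp - right_flank_bp)


# Illustrative numbers only, not values from Table S2.
print(estimate_gap_size(100_000, 5_000, 10_000))
```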
Alignment of markers from genetic and radiation hybrid maps to Chromosome 18.
Of 158 markers on Chromosome 18 on a human genetic map23, we were able to
download sequence or primers for 156 from NCBI databases. For those with only primer
sequences, we first aligned the primers to the chromosome sequence using
MegaBLAST24, then manually curated their position, retrieved the matching sequence
and combined this with the markers for which we had complete sequence. We then
placed all markers on the chromosome, again by MegaBLAST, and manually checked
any markers whose best placement was possibly ambiguous (score < 2x second best
score). All markers could be placed on the finished sequence of the
chromosome.
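The ambiguity criterion used to select markers for manual review (best score less than twice the second-best score) can be expressed as a small sketch. The function name is hypothetical, and treating a marker with a single hit as unambiguous is an assumption.

```python
def is_ambiguous(scores: list[float]) -> bool:
    """Return True if the best alignment score is less than 2x the second best."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return False  # assumption: a single placement is never ambiguous
    return ranked[0] < 2 * ranked[1]


print(is_ambiguous([400, 150]))  # best score is more than twice the runner-up
print(is_ambiguous([400, 250]))  # runner-up is too close to the best score
```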
We performed a similar analysis on 335 radiation hybrid map markers retrieved
from UCSC (T. Furey, personal communication). Of these, we excluded 3 pairs that had
multiple RH map positions under different names (but were in fact the same sequence).
Using ePCR25, we were able to place all remaining markers except 2, which matched no
sequence in the finished chromosome.
Laboratory validation of gene model extensions
As described in the main text, many of our evidence-based gene models extend both 3’
and 5’ of the corresponding RefSeq gene model. To validate our annotation methods, we
obtained independent evidence of transcription of a sample of our gene models. We
examined the 5’ extensions not supported by RefSeq data in 15 transcripts from human
Chromosome 17 that were manually annotated at the Broad Institute. Since all manual
annotation of human chromosomes at the BI was performed using identical methodology,
these 15 genes are an appropriate sampling. The data generated confirm the accuracy of
our gene models and the efficacy of the underlying annotation methodology. We
performed PCR using one primer near the 5’ end of each gene model and a second primer
matching a linker sequence present at the 5’ end of the cDNA library. We cloned and
sequenced several PCR products for each of 15 transcripts. 10 loci gave informative
sequence data, in each case confirming our 5’ extensions. In nine cases this independent
evidence extended to at least within 10 bp of the 5’ end of our gene model. In the
remaining case, the PCR product confirmed 87 out of 123 bases in our extension. These
observations validate our annotation methodology, which was performed with equal care
for all 4 finished human chromosomes annotated by this team. Five of the 15 produced no
interpretable data: one locus gave no passing sequence reads; 2 loci gave sequence reads
that appeared to arise from partially-spliced or unspliced RNA; and 2 loci gave sequence
reads that suggested an exon structure divergent from both the RefSeq evidence and our
gene models and so may represent an alternative splice form.
Validation of EST placements
As part of our analysis to validate our gene calls, all ESTs that were aligned to manually
annotated genes on Chromosome 18 were aligned to the entire human genome (build 34).
There were 4 cases of gene models for which the supporting ESTs had best hits to other
chromosomes: G2024, HsG643, HsG2271 and HsG3285. Each of these was manually
reviewed. G2024 is also supported by protein homology to c21orf81. HsG643 has a
single exon with polyA tail and a perfect ORF match to a known protein. It is annotated
as a Retrotransposed Copy. HsG2271 also has a single exon and is annotated as a
Retrotransposed Copy. It is also annotated by Ensembl. HsG3285 is annotated as a Novel
CDS, is supported by ESTs and homology to a mouse protein, and is contained in an
intron of HsG1161. These data reflect our careful use and interpretation of EST data,
which we used whenever possible to improve known gene models, and to find potential
homologs and pseudogenes of source genes located on other chromosomes.
Supplemental discussion of annotation rules and methods
We emphasized aligned mRNA and protein sequences when analyzing the gene content
of the finished 76.1 Mb euchromatic portion of human Chromosome 18. In accord with
Hawk2 (www.sanger.ac.uk/Info/workshops/hawk2) conventions, gene models were
grouped into the following categories:
1. Known - identical to a human mRNA sequence, usually from RefSeq or MGC.
2. Novel_CDS - identical to spliced human ESTs and coding for a protein with non-identical homology to a protein in the public databases.
3. Novel_Transcript - identical to a spliced human EST with canonical splice junctions,
no homology to known mRNAs or proteins, at least one exon without repeat content and
a longest ATG to stop codon open reading frame (ORF) that spans more than one exon.
This category includes genes predicted by ab initio tools, including Genscan31 and
FGENESH (Softberry Inc., Mount Kisco, NY), which were annotated when one or more exons
overlapped a mouse-human BLASTN alignment (evolutionarily conserved regions or
ecores32). The comparative version of FGENESH was not used.
4. Putative - identical to a spliced human EST with canonical splice junctions, no
homology to known mRNAs or proteins, at least one exon without repeat content and the
longest ATG to stop codon open reading frame (ORF) is entirely contained in one exon.
5. Pseudogene - gene models that contain a disrupted ORF containing more than 50% of
the ORF of a known protein at greater than 50% identity. This category is further
subdivided into single exon processed pseudogenes and multi-exon unprocessed
pseudogenes.
6. Putative novel transcript - gene models containing an ORF with less than 50% of the
ORF of a known protein and greater than 50% identity.
Annotation of Novel and Putative loci. The few novel and putative loci that we annotated
were required to align over at least 30% of their length to either the mouse or dog
genome with at least 70% identity. Some lineage specific and rarely detected genes will
be lost using this method, as will gene models with a very high ratio of UTR to coding
sequence. We found an additional set of 268 potential transcript fragments (composing at
most 220 loci) supported by only one or 2 spliced ESTs, or supported by 3 or more ESTs
but lacking sufficient conservation on dog or mouse genomes. Although some of these
low confidence gene models may represent functional transcription units, overall these
gene models are too speculative to include in our high confidence annotations.
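The conservation requirement for novel and putative loci can be illustrated with a minimal sketch. The thresholds (at least 30% of model length aligned, at least 70% identity, in either mouse or dog) are taken from the text; the function name and the dictionary layout of the alignment summary are assumptions made for the example.

```python
def passes_conservation(model_length: int,
                        alignments: dict[str, tuple[int, float]],
                        min_fraction: float = 0.30,
                        min_identity: float = 0.70) -> bool:
    """Check whether a gene model is sufficiently conserved in mouse or dog.

    alignments maps a genome name to (aligned_bases, identity), a hypothetical
    summary of the model's alignment to that genome.
    """
    for genome in ("mouse", "dog"):
        if genome not in alignments:
            continue
        aligned_bases, identity = alignments[genome]
        if aligned_bases / model_length >= min_fraction and identity >= min_identity:
            return True
    return False


print(passes_conservation(1_000, {"mouse": (250, 0.90), "dog": (400, 0.75)}))
print(passes_conservation(1_000, {"mouse": (900, 0.60)}))
```

A model passes if either genome alone satisfies both thresholds, which matches the "either the mouse or dog genome" wording above.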
Splice form confidence levels. We attempted to determine a confidence level for these
splice forms to address concerns about using spliced ESTs as evidence for variant
transcripts. We chose to exclude 97 potential transcript variants for which the EST
structure cannot be distinguished from a partially-spliced RNA, even though some of
these may be valid transcripts. We analyzed a larger set of candidate alternate splice
forms drawn from annotation of human chromosomes 18, 8, 15 and 17, and found that
60% of new exons were a multiple of 3 bases long, preserving the reading frame across
the insertion. This is significantly higher than the 39% predicted due to chance alone33.
The frame will be restored in many of the remaining cases by additional splice variation
seen within the same molecule, such as addition of new exons, alternate splice site usage,
and exon skipping. Splice variants that introduce new stop codons, which may render the
transcript a candidate for nonsense-mediated decay (NMD), occur at a somewhat higher
rate than is seen in RefSeq sequences: 85% of the annotated splice
variants are not candidates for NMD, compared to 96% of RefSeqs.
Comparison of Broad manual annotations to Ensembl
We compared our manual gene annotations (mapped by VEGA onto Build 35) to the
Ensembl (www.ensembl.org) annotation of Build 35 for chromosome 18. We compiled
lists of all exonic features present in one annotation set but not the other. Out of 3310
total Broad exonic features, 399 were not present in Ensembl, and of 3197 Ensembl
exonic features 399 were not present in Broad. All Broad exons not in Ensembl were
broken down by HAWK category, and are listed below. 95 Broad and 78 Ensembl exons
were manually reviewed.
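The list compilation described here amounts to a two-way set difference. The sketch below assumes that an exonic feature is identified by exact (start, end, strand) coordinates; the text does not state the matching criterion, so this is an illustrative simplification, and the coordinates shown are invented.

```python
Exon = tuple[int, int, str]  # (start, end, strand); assumed representation


def exclusive_exons(set_a: list[Exon], set_b: list[Exon]) -> tuple[set[Exon], set[Exon]]:
    """Return exons found only in set_a and exons found only in set_b."""
    a, b = set(set_a), set(set_b)
    return a - b, b - a


broad = [(100, 200, "+"), (300, 400, "+"), (900, 950, "-")]
ensembl = [(100, 200, "+"), (300, 410, "+")]
only_broad, only_ensembl = exclusive_exons(broad, ensembl)
print(sorted(only_broad))
print(sorted(only_ensembl))
```

Under exact matching, an exon whose boundary differs by even a few bases counts as absent from the other set, so a tolerance-based comparison would give smaller lists.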
Of exonic features present in Broad but not Ensembl, all appeared to be parts of
valid genes:
39 are in ‘novel’ genes. 20 were sampled and all are supported by EST evidence.
17 are in ‘putative’ genes. All were sampled and all are supported by EST evidence.
360 are in ‘known’ genes, which have RefSeq evidence. 20 were sampled and all are well
supported. 41 are present in RefSeq gene models, while the rest are present in alternative
splice forms that were constructed using evidence.
47 are in ‘novel CDS’ genes. 20 were sampled and all are supported by EST and
homology to known proteins.
59 are in ‘predicted plus’ genes. All were sampled and in each case, 2 or more exons are
supported by well-conserved regions in dog and/or mouse.
9 are in gene fragments. All were sampled and each one matches a portion of CDS of a
known gene.
Of exonic features present in Ensembl but not Broad, none appears to have
supporting evidence (EST, mRNA, RefSeq etc.) to indicate that we missed a gene in our
annotation:
2 correspond to exons within Broad pseudogenes but not included in our pseudogene
model.
48 fall in introns of Broad gene models. All appear to have no supporting evidence. 3 of
these overlap repeat sequences with coding potential.
16 overlap gene predictions only. 3 of these overlap repeats.
1 overlaps a repeat only.
4 exons correspond to one novel transcript gene that was not a part of our manual
annotation because the EST evidence on which it is based (BQ716208) did not exist at
that time. It has been added to our gene set.
2 exons correspond to a known gene (FLJ25715, NM_182570.1) that was not a part of
our manual annotation because the RefSeq entry on which it is based did not exist at that
time.
5 appear to be 5’ exons that are predicted in genes with Broad annotations, but not
supported.
In addition, there were 36 genes annotated as pseudogenes by Broad that were annotated
as genes by Ensembl.
In summary, although many exons present in the Broad set and absent from
Ensembl fall into low confidence categories by HAWK rules, our review of a sizable
sampling shows no evidence that any of the exons not present in the Ensembl set
represent misannotations according to HAWK standards. The two exceptions represent
gene calls for which the primary evidence did not exist at the time the Broad manual
annotation was performed, and they have been added to our gene set. Conversely, the
smaller number of Ensembl exons not present in the Broad set appear highly enriched for
loci that, under manual inspection, appear to be pseudogenic or to lack sufficient
experimental evidence to be annotated as genes.
Automated whole genome annotation
The BI’s automated annotation system uses an evidence-based approach to identify and
annotate genes. With this approach, every annotated gene is supported by some
transcriptional evidence, either from full-length mRNA, expressed sequence tag (EST) or
protein sequences. We employed this system on the finished whole human genome (build
34).
First, we scanned the finished genome for CpG islands, repeats and tRNAs using
CpG finder (E. Mauceli, unpublished), RepeatMasker21 and tRNAScan-SE respectively39.
We ran two commonly used ab initio gene predictors (FGENESH40 and
GENSCAN31). We performed BlastX41 searches on the genome against a non-redundant
protein database. Protein sequences from the top two BlastX hits at each locus were then
aligned to the genome using the Genewise42 program. Human mRNA and EST sequences
from public databases were aligned to the genomic axis using BLAT43. All searches
except BLAT were done on repeat-masked sequence. EST and mRNA alignments with
90% identity and 50% query coverage were retained as potential supporting evidence, but
only the uniquely aligned EST/mRNA features were used as primary evidence. These
high-confidence alignments with canonical splice junctions were then clustered to identify
an initial set of gene loci. According to our process, expressed evidence in the form of
mRNA alignments from the RefSeq44 database, if present, was given higher precedence
over other types of evidence in defining the reference transcript model at each locus. EST
clusters with canonical splice junctions that were different from the reference transcript
model were annotated as alternate splice forms. We used single-exon ESTs as evidence
only if they overlapped either the 3’ or 5’ ends of transcripts derived from spliced
expressed evidence. We assigned the start and stop codons to the annotated known genes
using their homology to known proteins. For novel genes with no protein homology, we
chose a default ORF with the longest reading frame with an in-frame start and stop
codon. Gene models with sequences homologous to proteins (over 50% of the subject
length) with disrupted CDS of an active gene found elsewhere in the genome were
annotated as pseudogenes. We classified and assigned gene product names to the
annotated transcripts according to the human annotation workshop (HAWK) guidelines
(http://www.sanger.ac.uk/HGP/havana/hawk.shtml).
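The default ORF choice for novel genes without protein homology (the longest reading frame with an in-frame start and stop codon) can be sketched as below. This is a simplified illustration rather than the annotation system's code: it scans only the three forward frames of a spliced transcript sequence and considers only ATG starts, both of which are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq: str):
    """Return (start, end) of the longest ATG-to-stop ORF, or None if there is none.

    Coordinates are 0-based and end-exclusive; the stop codon is included.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if best is None or (i + 3 - start) > (best[1] - best[0]):
                    best = (start, i + 3)
                start = None  # look for the next ORF in this frame
    return best


transcript = "GGATGAAATAAATGCCCGGGAAATTTTAGCC"
orf = longest_orf(transcript)
print(orf, transcript[orf[0]:orf[1]])
```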
Chimp analysis. The sequence of Chromosome 18 was compared to the recently
completed draft sequence of the chimpanzee genome (Chimpanzee Genome Sequencing
Consortium, manuscript in preparation). The human and chimpanzee sequences were
aligned using BLASTZ47 and chainNet48, and only reciprocally best alignments were
kept. Divergence rates were estimated from all aligned chimpanzee nucleotides passing
the relaxed NQS(30,25) quality filter (quality score 30 at the compared base, 25 at the
five flanking bases on each side, and any number of flanking substitutions allowed),
yielding a predicted error rate of less than 1/20,00049. Chimpanzee chromosome
nomenclature follows McConkey50.
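The relaxed NQS(30,25) filter described above can be sketched as a check on per-base quality scores. The thresholds and the five-base flanks come from the text; the function name and the decision to reject bases lying within five positions of a read end are assumptions. Because the filter is the relaxed form, flanking substitutions are not examined at all.

```python
def passes_nqs(quals: list[int], i: int,
               center_min: int = 30, flank_min: int = 25, flank: int = 5) -> bool:
    """Return True if base i passes the relaxed neighborhood quality standard."""
    if i - flank < 0 or i + flank >= len(quals):
        return False  # assumption: bases without full flanks on both sides fail
    if quals[i] < center_min:
        return False
    neighbours = quals[i - flank:i] + quals[i + 1:i + flank + 1]
    return all(q >= flank_min for q in neighbours)


quals = [40] * 11
print(passes_nqs(quals, 5))

low_flank = quals.copy()
low_flank[3] = 20  # one flanking base below the flank threshold
print(passes_nqs(low_flank, 5))
```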
SUPPLEMENTARY DATA:
Gene families
Chromosome 18 is home to several groups of paralogous genes including members of the
laminin and cadherin gene families. Many of these proteins function in hemidesmosomes
and desmosomes, adhesive junctions that link the extracellular space to the intermediate
filament cytoskeleton. These types of adhesions are particularly important in epithelial
tissues.
Two genes encoding the laminin subunits α1 and α3 flank the centromere, in
18p11 and 18q11, respectively. These laminin α chains, when complexed with other
laminin subunits, are assembled into extracellular matrices that support attachment of
epithelial cells. In hemidesmosomes, the integrin α6β4 links the laminin-containing
basement membrane to intracellular intermediate filaments.
A set of 11 cadherin genes lies on 18q, including the entire human complement of
desmosomal cadherins (3 desmocollins and 4 desmogleins). Cadherins support cell-cell
adhesion by binding to other cadherins expressed on nearby cells, through a calcium-dependent mechanism. In desmosomes, the desmocollins and desmogleins form the
transmembrane linkage between adjacent epithelial cells and their intermediate filament
cytoskeletons.
Numerous human skin disorders are associated with perturbation of function of
these Chromosome 18 adhesion genes. Mutations in the laminin α3 gene26 cause
junctional epidermolysis bullosa, mutations in the desmoglein 1 gene27 result in keratosis
palmoplantaris striata I, and mutations in the desmoglein 4 gene28 affect hair follicle
biology giving rise to a hairless phenotype in both human (inherited hypotrichosis) and
mouse (lanceolate hair). Autoimmune targeting of desmoglein 1, 3 or 4 can cause the
blistering diseases Pemphigus foliaceus or Pemphigus vulgaris28, 29.
Another interesting gene family encodes serine proteinase inhibitors (serpins30),
which are unusual ‘suicide’ protease inhibitors that function by binding covalently to
their targets, destroying both the protease and themselves. Chromosome 18q21.33-q22.1
contains a cluster of 10 serpin genes of clade B, constituting the majority of the 13 such
genes in the human genome. While most serpins are secreted and function extracellularly,
the clade B serpins have no signal sequence and most are not secreted. Rather, their roles
include protection of cells from proteinase-mediated injury30. The cluster is in proximity
to 3 cadherin genes (CDH 7, 19 and 20), 2 large gene deserts (1.6 and 2.0 Mb) and the
cell death gene BCL-2.
Overlapping genes
Consistent with some of the recent reports of widespread occurrence of overlapping
genes in human and mouse genomes34, 35, we found a total of 58 pairs of overlapping
genes on Chromosome 18. Table S11 lists the number and modes of overlap we
observed.
We discovered a complex pattern of overlaps by close examination of 21 pairs of
known-known gene overlaps in a 165 kb region involving four distinct known genes:
CLUL1, hypothetical protein similar to LOC147447, TYMS and ENOSF1. The
hypothetical protein gene (similar to LOC147447) shares a tail-tail overlap with CLUL1
gene and head-head overlap with TYMS gene. This head-head overlap includes 95 and
72 bases of the coding region of TYMS and hypothetical protein gene respectively. In
addition, the TYMS gene also shares a tail-tail overlap with a transcript variant of the rTS
beta gene. The 3’ UTR of rTS beta is complementary to TYMS and has been
experimentally shown to be involved in its down regulation36. In another unrelated tail-tail overlap, between the Niemann-Pick disease, type C1 (NPC1) gene and the Colon cancer-associated protein Mic1 (MIC1 or C18orf8) gene, 83 bases of coding region of NPC1
overlap with the 3’ UTR of MIC1 and 80 bases of coding region of MIC1 overlap the 3’
UTR of NPC1.
It is likely that some of these natural antisense overlaps found on Chromosome 18
may be involved in the regulation of expression of the protein-coding genes at various
levels such as transcription, mRNA processing, splicing, stability, transport and
translation. Based on our observations on Chromosome 18, we believe that due to the
under-representation of UTRs in the public annotations, the number and size of overlaps
between genes is likely more extensive than estimated previously. Although we observed
several additional overlaps (>50) between known genes and potential novel
transcriptional units, we did not include them in our analysis as these novel gene
candidates did not meet our criteria to be included as novel loci. Some of these antisense
transcriptional units may be noncoding or lineage-specific genes involved in regulation of
known protein-coding genes.
Interpro analysis
We found that 62% of annotated genes on Chromosome 18 have Pfam domains.
The majority (99%) of these are in the ‘known’ and ‘novel CDS’ classes of proteins.
C2H2 type Zn finger domains are the most common domain on Chromosome 18 and second most
common in the genome. Of the 20 most frequently occurring Pfam domains found in
Chromosome 18 proteins, 15 are also common domains in the entire human proteome. Of
the 5 over-represented Pfam domains, 3 are due to the cadherin and serpin gene clusters.
Two are laminin chain domains: two out of 5 laminin chain genes are on Chromosome
18. The remaining domain (Lipoxygenase, LH2 domain) has 5 copies present in 2 non-paralogous proteins.
Additional Paralogous Genes
Paralogous genes on Chromosome 18 were investigated by clustering with cd-hit37
followed by analysis of a clustalw alignment38. All proteins encoded by annotated
transcripts and pseudogenes on Chromosome 18 were clustered with cd-hit using a 65%
threshold (-c 0.65). Sequences from multiple loci that clustered together with cd-hit were
investigated. In addition to the known serpin and cadherin clusters, this method detected
a myosin regulatory light chain cluster (18p11.31) consisting of 2 genes and a three
member elongin cluster (18q21.1). The other closely related sequence groups detected by
this method were 8 groups of pseudogenes (3 ribosomal proteins, tera proteins, tubulin
4Q, p160-ROCK (ROCK1) and a pair of novel transcripts annotated in a duplicated
portion).
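Selecting the clusters worth investigating, namely those whose members come from more than one locus, can be sketched as follows. The input is assumed to be a list of (cluster, protein, locus) triples already parsed from the clustering output; the parsing itself is omitted, and all identifiers shown are made up for illustration.

```python
from collections import defaultdict


def multi_locus_clusters(assignments):
    """Return {cluster_id: sorted loci} for clusters spanning more than one locus.

    assignments: iterable of (cluster_id, protein_id, locus_id) triples.
    """
    loci_by_cluster = defaultdict(set)
    for cluster_id, _protein_id, locus_id in assignments:
        loci_by_cluster[cluster_id].add(locus_id)
    return {c: sorted(loci) for c, loci in loci_by_cluster.items() if len(loci) > 1}


assignments = [
    (0, "protA.1", "locusA"),
    (0, "protA.2", "locusA"),  # two splice forms of one locus: not a paralog group
    (1, "protB.1", "locusB"),
    (1, "protC.1", "locusC"),  # two loci in one cluster: candidate paralogs
]
print(multi_locus_clusters(assignments))
```

Grouping by locus rather than by protein keeps alternative splice forms of a single gene from being reported as paralogs.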
A more sensitive clustering was carried out using clustalw. Pseudogenes, gene
fragments, novel transcripts and putative transcripts were removed from the Chromosome
18 protein set. The resulting set was subjected to cd-hit clustering using a threshold of 0.4
and, following removal of proteins less than 70 residues in length, resulted in 316
proteins. These 316 were aligned with clustalw and a bootstrapped phylogenetic tree was
created. Members of all clades supported by a bootstrap value of 40 or more were
investigated. 6 potential paralog clusters were identified. 3 are clearly paralogs:
SLC14A1 and SLC14A2, ZNF24 and ZNF396, SMAD2 and SMAD4.
Conservation analysis of putative novel gene C18Orf2
To confirm that C18Orf2 encodes a functional protein, we devised an experiment to
amplify fragments from several primates and resequence them to verify splice site
conservation and coding frame and coding sequence conservation. Because the gene is
also interrupted by a primate-specific inversion, we also chose primers that would
amplify across the inversion breakpoints to attempt to determine the approximate date of
the inversion. Samples used for amplification were one human (Coriell NA15510), 4
chimpanzees (including Clint, the genome sequence donor [Coriell NS06006]; the other
chimpanzees are not publicly available), an orangutan, two old world monkeys (patas and
macaque), two new world monkeys (woolly and spider), and a lemur.
Because the region is highly repetitive and there is uncertainty surrounding the
exact location of the downstream inversion breakpoint, we were, in several cases, unable
to design successful primers for the region. None of our PCR attempts across either
breakpoint yielded products, nor did primers for 2 of the 5 coding exons (exons 3 and 6).
For the remaining three coding exons, we were able to generate sequences for human, all
four chimpanzees, and orangutan. For exon 4 only we also generated and aligned
sequence from the woolly monkey, which showed no conservation in either splice site,
making it highly unlikely that C18Orf2 is conserved to new world monkeys.
For the three aligned exons in human, chimpanzee, and orangutan, we were able
to confirm that the splice sites are conserved at both ends in chimpanzee and orangutan
for all exons and that the ATG is conserved (the stop is in the missing exon 6). None of
the chimpanzees showed any fixed changes or polymorphism within the aligned coding
sequence (185 bp), which would be expected 10% of the time for a single chimpanzee
over that length of neutral sequence. Visual examination of the traces over this region
showed no heterozygous bases in any of the human, chimpanzee, or orangutan sequences
(this by itself is unsurprising but indicates that interspecies variation was not masked by
fortuitous assignment of a variant; hets were observed in several individuals in intronic
regions). However, the orangutan had 6 substitutions, 3 each in exons 4 and 5, all amino-acid altering.
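The figure of roughly 10% quoted above for observing no changes over 185 bp of neutral sequence can be approximated with a simple Poisson model. The sketch below is illustrative only: the per-site human-chimpanzee divergence of 1.23% is an assumed value that is not stated in this text, and the function name `prob_no_differences` is hypothetical.

```python
import math


def prob_no_differences(length_bp, divergence_per_bp):
    """Poisson probability of zero differences over length_bp neutral sites."""
    expected_differences = length_bp * divergence_per_bp
    return math.exp(-expected_differences)


ALIGNED_CODING_BP = 185      # aligned coding sequence length, from the text
ASSUMED_DIVERGENCE = 0.0123  # assumption: ~1.23% divergence per neutral site

p_zero = prob_no_differences(ALIGNED_CODING_BP, ASSUMED_DIVERGENCE)
print(f"P(no differences over {ALIGNED_CODING_BP} bp) = {p_zero:.3f}")
```

Under this assumed divergence the probability comes out close to 0.10; a different divergence estimate would shift the value accordingly.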
Overall, these data are consistent with our hypothesis that C18Orf2 represents
a newly evolved gene. Given the splicing mutations in exon 4 in the woolly monkey, it
seems likely if this is a functional gene that it arose after the divergence of old world and
new world monkeys.
Search for newly evolved genes
Since a break in synteny allowed us to identify c18orf2 as an apparent newly evolved
gene in the primate lineage, we wondered whether there might be additional new genes
on Chromosome 18 outside of conserved syntenic blocks, which would not have been
found in our analysis. To address this we asked the general question of whether any
multiexon genes on Chromosome 18 lack a mouse homolog. We compared all manually
annotated exons on Chromosome 18 with the set of conserved synteny anchors (see
Methods) to identify transcripts that contain no exons that align to anchor sequences, and
thus are not conserved to mouse and/or dog. Exons from all 1,194 Chromosome 18
manually annotated transcripts (derived from ESTs and mRNAs) were compared with 22,179 human-mouse synteny anchors. After eliminating pseudogenes, variants, partial transcripts,
obvious paralogs, single-exon transcripts and low confidence transcripts (transcript
type_id 6 or above) we were left with 10 human Chromosome 18 transcripts that were
potentially absent in mouse. Each was manually reviewed using BLAT43 and BLAST45
against other genome sequences and against the non-redundant protein database. Seven
of the transcripts have clear homologs in mouse. The remaining three candidates are all
hypothetical proteins: HsG648, HsG2120 and c18orf2 itself. HsG648 has homologs in
chimp, dog and pig. It was apparently lost in the rodent lineage since, although it is
absent in mouse and rat, synteny of the two genes flanking HsG648 is conserved in both
species. HsG2120 has identifiable homologs in human and chimpanzee only, and not in
dog, mouse, rat or other mammalian assemblies that are publicly available. This gene
model has two exons, both of which show similarity to LTR type repeats. Thus, HsG2120
probably does not represent a real gene. In summary, apart from c18orf2 we identified no
candidate newly evolving genes.
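The exon-versus-anchor comparison described above amounts to an interval-overlap test. The sketch below shows one way to flag transcripts in which no exon overlaps any synteny anchor. It is a minimal sketch under stated assumptions: coordinates are half-open (start, end) pairs on a single chromosome, and the transcript names, coordinates and function names are all hypothetical rather than taken from the actual analysis.

```python
from bisect import bisect_right


def build_anchor_index(anchors):
    """Sort and merge (start, end) anchors; return parallel start/end lists."""
    merged = []
    for start, end in sorted(anchors):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [m[0] for m in merged], [m[1] for m in merged]


def overlaps_anchor(exon, starts, ends):
    """True if the half-open exon (start, end) overlaps any merged anchor."""
    start, end = exon
    # Index of the last merged anchor that begins before the exon ends.
    i = bisect_right(starts, end - 1) - 1
    return i >= 0 and ends[i] > start


def unanchored_transcripts(transcripts, anchors):
    """Names of transcripts with no exon overlapping any anchor."""
    starts, ends = build_anchor_index(anchors)
    return sorted(
        name
        for name, exons in transcripts.items()
        if not any(overlaps_anchor(exon, starts, ends) for exon in exons)
    )


# Hypothetical coordinates for illustration only.
anchors = [(100, 200), (150, 260), (1000, 1100)]
transcripts = {
    "txA": [(50, 120), (300, 400)],
    "txB": [(270, 290), (500, 600)],
    "txC": [(1099, 1200)],
}
result = unanchored_transcripts(transcripts, anchors)
print(result)
```

Merging the anchors first keeps the lookup to a single binary search per exon, which matters when tens of thousands of anchors are involved.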
SUPPLEMENTARY DISCUSSION:
Supplemental discussion of randomness of gene density
To test the significance of the observed low gene content on Chromosome 18 we
compared the actual number of genes found to the number of genes expected on
Chromosome 18 assuming a random distribution of genes on the genome. The observed
gene density of Chromosome 18 would be expected to occur by chance with a probability
of less than 10^-12. Gene density for a chromosome is defined here as the average number
of annotated genes per Mb. We used Ensembl annotation of build 34 so that we could
analyze a uniform, standard annotation of the whole genome. A chi-squared test showed
the genome-wide distribution of genes on chromosomes to be significantly non-random
(p < 10^-100, see Table S9). We found that among autosomes, chromosomes 21, 9, 10, 15,
6, 7, 14, and 12 best conform to a random gene distribution model (p > 10^-7), whereas the
remaining autosomes have gene counts that differ significantly (p < 10^-7) from the random
distribution. Based on this whole genome annotation, Chromosome 18 (p << 10^-6) has the
second lowest gene density, surpassed only by Chromosome 13. We note that based on
manual annotation of those chromosomes published to date, Chromosome 18 has the
lowest gene density (4.4 genes/Mb), while 13 has a significantly higher density (6.6
genes/Mb).
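The comparison of observed and expected gene counts can be sketched as follows. Under a random (length-proportional) model, the expected count for a chromosome is the genome-wide gene total multiplied by that chromosome's share of the total length. This is a minimal sketch with toy numbers: the two chromosome names, counts and lengths are hypothetical and are not the values in Table S9, and the function name is an assumption.

```python
def chi_square_gene_counts(counts, lengths_mb):
    """Chi-squared statistic for gene counts against a length-proportional model.

    Returns the total statistic and, per chromosome, the expected count and
    that chromosome's contribution to the statistic.
    """
    total_genes = sum(counts.values())
    total_length = sum(lengths_mb.values())
    statistic = 0.0
    per_chromosome = {}
    for chrom, observed in counts.items():
        expected = total_genes * lengths_mb[chrom] / total_length
        contribution = (observed - expected) ** 2 / expected
        per_chromosome[chrom] = (expected, contribution)
        statistic += contribution
    return statistic, per_chromosome


# Toy data (hypothetical): two equally long chromosomes with unequal gene counts.
toy_counts = {"chrA": 300, "chrB": 100}
toy_lengths_mb = {"chrA": 100.0, "chrB": 100.0}

statistic, detail = chi_square_gene_counts(toy_counts, toy_lengths_mb)
print(statistic, detail)
```

The statistic has one fewer degree of freedom than the number of chromosomes; converting it to a p-value requires a chi-squared survival function, for example `scipy.stats.chi2.sf`, which is not used here so that the sketch stays within the standard library.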
Additional discussion of conserved noncoding sequence methods
Coding sequence. The "coding sequence" set contains all bases that are annotated as
coding in any transcript (both constitutive and alternative exons) as predicted by
Ensembl. The methods used do not require annotation of coding sequence on informant
genomes (i.e., mouse, dog). All alignments are done to the human reference. Prediction
of significance of conservation is obviously symmetric, but the denominators of syntenic
length, fraction conserved, and fraction coding are all calculated purely based on human
sequence and annotation.
Dog and Mouse “Composite” Z-scores. Dog and mouse Z-scores associated with shared
repeats were negligibly correlated and their joint probability density was centered near
the origin. In contrast, windows associated with coding exons showed a strong positive
correlation and led to a probability distribution concentrated on the positive diagonal.
Motivated by this empirical separation, we defined a composite Z-score (Zcomp) as the
linear combination of Zdog and Zmouse (a·Zmouse + (1-a)·Zdog, 0 <= a <= 1) with the
lowest empirical variance for ancestral repeats. In this way, we minimized the
confounding effect of the background or spurious conservation when trying to identify
true selection; using mouse and dog together achieves a better signal-to-noise tradeoff
than either alone.
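The weight a that minimizes the variance of a·Zmouse + (1-a)·Zdog has a closed form. Writing Vm and Vd for the variances of the two scores on ancestral repeats and C for their covariance, the variance of the combination is a^2·Vm + (1-a)^2·Vd + 2a(1-a)·C, which is minimized at a = (Vd - C)/(Vm + Vd - 2C), clipped to the interval [0, 1]. The sketch below applies this formula. It is illustrative only: the Z-score values are made up, the function names are hypothetical, and the original analysis may have found the weight by a different procedure.

```python
from statistics import mean, variance


def covariance(xs, ys):
    """Sample covariance (n - 1 denominator), matching statistics.variance."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def min_variance_weight(z_mouse, z_dog):
    """Weight a in [0, 1] minimizing Var(a*Zmouse + (1 - a)*Zdog)."""
    vm, vd = variance(z_mouse), variance(z_dog)
    c = covariance(z_mouse, z_dog)
    denom = vm + vd - 2 * c  # equals Var(Zmouse - Zdog), never negative
    if denom == 0:
        return 0.5  # variance does not depend on a; any weight is optimal
    a = (vd - c) / denom
    return min(1.0, max(0.0, a))


def composite(z_mouse, z_dog, a):
    """Composite Z-score for each window."""
    return [a * m + (1 - a) * d for m, d in zip(z_mouse, z_dog)]


# Made-up Z-scores for ancestral-repeat windows (illustration only).
z_mouse_ar = [1.0, -1.0, 2.0, -2.0]
z_dog_ar = [0.5, -0.5, -0.5, 0.5]

a_opt = min_variance_weight(z_mouse_ar, z_dog_ar)
z_comp = composite(z_mouse_ar, z_dog_ar, a_opt)
print(a_opt, variance(z_comp))
```

Because the objective is a convex quadratic in a, clipping the unconstrained optimum to [0, 1] gives the constrained minimum.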
Impact of pseudogenes. We assessed the possibility that pseudogene sequences might
display a conservation signature, and if so, whether our analysis of distribution of
conserved non-coding sequences could become biased in the event that there exists a
significant number of such conserved sequences. In theory, this should not be so.
Pseudogenes fall into two classes, those that are processed and those that result from
segmental duplication (local or distant). The former class must be recent in order to show
significant similarity to the parent gene (if they are truly pseudogenes), and so would not
be aligned in our method as they would have no orthologous sequence in the conserved
syntenic block in the vast majority of cases. The latter might meet the criterion, but
would likely have been discarded from our alignments because we aligned only
unambiguous 1:1 orthologous sequences.
We started with our manual annotation of pseudogenes on Chromosome 18. The
171 pseudogenes identified contain a total of 332 "pseudoexons", with a total length of
145,160 bases, or 0.2% of the chromosome length and 0.18% of the conserved syntenic
length. Since our method detects conserved sequences on the basis of their presence in
blocks of conserved synteny, genes or pseudogenes must align syntenically to be detected
as conserved. In fact, only two of the pseudogenes on Chromosome 18 align
syntenically: HsG663 and HsG866. These two pseudogenes have a total length of only
2,105 bases, or 0.003% of the length of the syntenic blocks we detected. Thus,
pseudogene sequences have a negligible effect on our analysis of Chromosome 18.
HAWK standards set a high bar for annotation of pseudogenes, and Chromosome
18 is relatively pseudogene poor. To assess the impact of all pseudogenes (on top of those
identified by HAWK standards), we performed a genome-wide analysis on the set of
putative loci detected genome-wide by Torrents et al.46 (from build 34), which represents
a more comprehensive set of pseudogenes. For the 19 non-Broad annotated chromosomes
(for which our alignments are on Build 34 and match the Torrents coordinates), we
assessed conservation of the 21,997 loci consisting of 7,704,824 bases. In total, only 5,259
50-bp windows (of a possible 1.5 million) containing 20 or more syntenically aligned
bases overlapped pseudogene loci. Those windows that did align were contained in a total
of only 74 of the 21,997 loci. We did observe strongly positive Z-scores (mean of 1.82
for mouse and 2.22 for dog) for those that aligned, which is not surprising in light of
Torrents’ estimate that 5% of these regions might actually be undiscovered true coding
genes. In fact, a manual examination of a small number revealed that in at least one case
the pseudogene locus overlaps an Ensembl gene with RefSeq support.
Across the whole genome, our method identified just over 18 million non-exonic
windows as being under positive selection (of over 500 million windows created by this
method). If every one of the 5,259 aligned pseudogene windows had p=1 of being selected (the
real average would probably be 0.5), they would contribute <0.04% of the non-exonic
conservation in the genome. Even if all of these windows were in a single 5 Mb region
(of 489 analyzed), they would change that region's value by <20% of the calculated
value. From this we conclude that, as expected, pseudogenes have negligible impact on
our estimation of non-coding conservation.
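The two bounds quoted above follow from simple arithmetic on the reported counts, as the sketch below shows. It takes "just over 18 million" as exactly 18 million and assumes, for the worst-case estimate, that selected windows are spread evenly across the 489 regions; both are simplifying assumptions rather than figures from the analysis.

```python
PSEUDOGENE_WINDOWS = 5259      # aligned 50 bp windows overlapping pseudogene loci
SELECTED_WINDOWS = 18_000_000  # assumption: "just over 18 million" taken as 18 million
REGIONS = 489                  # number of 5 Mb regions analyzed

# Upper bound on the genome-wide contribution, if every window counted as selected.
genome_fraction = PSEUDOGENE_WINDOWS / SELECTED_WINDOWS

# Worst case: all pseudogene windows fall in one region with an average load
# (assumes selected windows are evenly distributed across regions).
mean_selected_per_region = SELECTED_WINDOWS / REGIONS
worst_case_region_fraction = PSEUDOGENE_WINDOWS / mean_selected_per_region

print(f"genome-wide: {genome_fraction:.4%}")
print(f"single-region worst case: {worst_case_region_fraction:.1%}")
```

Both values fall under the stated limits of 0.04% and 20%, and halving them for the more realistic p=0.5 only strengthens the conclusion.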
Note on Build 34 alignments. When we generated our original alignments of build
34 to mouse and dog, we substituted the 4 manually annotated Broad chromosomes (8,
15, 17 and 18) for their build 34 counterparts so that we could take advantage of our
manual annotations. When we moved to build 35 to use Ensembl for the whole genome
analysis, we used the standard build. Since the Torrents et al. pseudogene analysis has
not yet been updated to build 35, we used our existing “build 34” genome, but since we
cannot rerun only the 4 chromosomes (uniqueness constraints of our synteny algorithm
require that all the alignments be processed simultaneously), we analyzed only the subset
for which we had equivalent build 34 alignments. The omitted chromosomes represent
13.5% of the genome and 16.5% and 14.5% of pseudogene features and bases
respectively in the Torrents set. As the results of the analysis overwhelmingly show that
pseudogenes do not align, we do not believe that including these chromosomes would
change this finding.
Impact of non-coding RNAs. Known non-coding RNAs (ncRNAs) represent a minute
fraction of the chromosome sequence: only 4 tRNAs were found on the chromosome.
Thus, they present far too little length for statistical significance in our test.
Supplementary References
1.
International Human Genome Mapping Consortium. (2001) A physical map
of the human genome. Nature 409:934-941.
2.
International Human Genome Sequencing Consortium. (2001) Initial
sequencing and analysis of the human genome. Nature 409:860-921.
3.
Marra, M.A., et al. (1997) High throughput fingerprint analysis of large-insert
clones. Genome Res. 7:1072-1084.
4.
Riethman, H.C., et al. (1989) Cloning human telomeric DNA fragments into
Saccharomyces cerevisiae using a yeast-artificial-chromosome vector. Proc.
Nat. Acad. Sci. 86:6240-6244.
5.
Macina, R.A., et al. (1995) Molecular cloning and RARE cleavage mapping
of human 2p, 6q, 8q, 12q, and 18q telomeres. Genome Res. 5:225-232.
6.
International Human Genome Sequencing Consortium. (2004) Finishing the
euchromatic sequence of the human genome. Nature 431:931-945.
7.
Hudson, T.J., et al. (1995) An STS-based map of the human genome. Science
270:1945-1954.
8.
Chu, G., Vollrath, D., Davis, R.W. (1986) Separation of large DNA molecules
by contour-clamped homogeneous electric fields. Science 234:1582-1585.
9.
Park, H.S., et al. (2000) Newly identified sequences, derived from human
chromosome 21qter, are also identified in the subtelomeric region of
particular chromosomes and 2q13, and are conserved in the chimpanzee
genome. FEBS Lett. 475:167-169.
10.
Rosenblum, B.B., et al. (1997). New dye-labeled terminators for improved
DNA sequencing patterns. Nucleic Acids Res. 25:4500-4504.
11.
Batzoglou, S., et al. (2002) ARACHNE: a whole-genome shotgun assembler.
Genome Res. 12:177-189.
12.
Jaffe, D.B., et al. (2003) Whole-genome sequence assembly for mammalian
genomes: Arachne 2. Genome Res. 13:91-96.
13.
Bonfield, J.K., Smith, K., Staden, R. (1995) A new DNA sequence assembly
program. Nucleic Acids Res. 23:4992-4999.
14.
Gordon D., Abajian C., Green P. (1998) Consed: a graphical tool for sequence
finishing. Genome Res. 8:195-202.
15.
MacMurray, A.A., Sulston, J.E., Quail, M.A. (1998) Short-insert libraries as a
method of problem solving in genome sequencing. Genome Res. 8:562.
16.
Wong, G.K., Yu, J., Thayer, E.C. and Olson, M.V. (1997) Multiple-complete-digest restriction fragment mapping: generating sequence-ready maps for
large-scale DNA sequencing. Proc Natl Acad Sci USA. 94:5225-5230.
17.
Ewing, B. and Green, P. (1998) Base-calling of automated sequencer traces
using phred. II. Error probabilities. Genome Res. 8:186-194.
18.
Ewing, B., et al. (1998) Base-calling of automated sequencer traces using
phred. I. Accuracy assessment. Genome Res. 8:175-185.
19.
Hattori, M., et al. (1997) A novel method for making nested deletions and its
application for sequencing of a 300 kb region of human APP locus. Nucleic
Acids Res. 25:1802-1808.
20.
Mouse Genome Sequencing Consortium. (2002) Initial sequencing and
comparative analysis of the mouse genome. Nature 420:520-562.
21.
Smit, A. and Green, P. (1999) RepeatMasker at
http://ftp.genome.washington.edu/RM/RepeatMasker.html.
22.
Ma, B., Tromp, J., and Li, M. (2002) PatternHunter: faster and more sensitive
homology search. Bioinformatics 18:440-445.
23.
Kong, A., et al. (2002) A high-resolution recombination map of the human
genome. Nat. Genet. 31:225-226.
24.
Zhang, Z., et al. (2000) A greedy algorithm for aligning DNA sequences. J.
Comput. Biol. 7:203-214.
25.
Rotmistrovsky, K., Jang W., Schuler G.D. (2004) A web server for
performing electronic PCR. Nucleic Acids Res. 32 (Web Server issue):W108-W112.
26.
McGowan, K.A., Marinkovich, M.P. (2000) Laminins and human disease.
Microsc. Res. Tech. 51:262-279.
27.
Hunt, D.M., et al. (2001) Spectrum of dominant mutations in the desmosomal
cadherin desmoglein 1, causing the skin disease striate palmoplantar
keratoderma. Eur. J. Hum. Genet. 3:197-203.
28.
Kljuic, A., et al. (2003) Desmoglein 4 in hair follicle differentiation and
epidermal adhesion: evidence from inherited hypotrichosis and acquired
pemphigus vulgaris. Cell 113:249-260.
29.
Garrod, D.R., Merritt, A.J., Nie, Z. (2002) Desmosomal cadherins. Curr.
Opin. Cell. Biol. 5:537-545.
30.
Silverman, G.A., et al. (2004) Human clade B serpins (ov-serpins) belong to a
cohort of evolutionarily dispersed intracellular proteinase inhibitor clades that
protect cells from promiscuous proteolysis. Cell. Molec. Life Sci. 61:301-325.
31.
Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in
human genomic DNA. J. Mol. Biol. 268:78-94.
32.
Roest Crollius, H., et al. (2000) Estimate of human gene number provided by
genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat.
Genet. 25:235–238.
33.
Resch, A., et al. (2004) Evidence for a subpopulation of conserved alternative
splicing events under selection pressure for protein reading frame
preservation. Nucleic Acids Res. 34:1261-1269.
34.
Yelin, R., et al. (2003) Widespread occurrence of antisense transcription in the
human genome. Nat. Biotechnol. 4:379-386.
35.
Kiyosawa H., et al. (2003) Antisense transcripts with FANTOM2 clone set
and their implications for gene regulation. Genome Res. 13:1324-1334.
36.
Chu, J. and Dolnick, B.J. (2002) Natural antisense (rTSalpha) RNA induces
site-specific cleavage of thymidylate synthase mRNA. Biochem. Biophys.
Acta. 1587:183-193. Review.
37.
Li, W., Jaroszewski, L., Godzik, A. (2001) Clustering of highly homologous
sequences to reduce the size of large protein databases. Bioinformatics.
17:282-283.
38.
Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving
the sensitivity of progressive multiple sequence alignment through sequence
weighting, position-specific gap penalties and weight matrix choice. Nucleic
Acids Res. 22:4673-4680.
39.
Lowe T.M. and Eddy S.R. (1997) tRNAscan-SE: a program for improved
detection of transfer RNA genes in genomic sequence. Nucleic Acids Res.
25:955-64.
40.
Salamov A. and Solovyev V. (2000) Ab initio gene finding in Drosophila
genomic DNA. Genome Res. 10:516-522.
41.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W.
and Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res. 25:3389-3402.
42.
Birney, E., Clamp, M., Durbin, R. (2004) GeneWise and Genomewise.
Genome Res. 14:988-95.
43.
Kent, W.J. (2002) BLAT--the BLAST-like alignment tool. Genome Res.
12:656-64.
44.
Pruitt, K.D., Tatusova, T., and Maglott D.R. (2005) NCBI Reference
Sequence (RefSeq): a curated non-redundant sequence database of genomes,
transcripts and proteins. Nucleic Acids Res. 33:D501-D504.
45.
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation
of protein database search programs. Nucleic Acids Res. 25:3389-3402.
46.
Torrents, D., Suyama, M., Zdobnov, E. and Bork, P. (2003) A genome-wide
survey of human pseudogenes. Genome Res. 13:2559-2567.
47.
Schwartz, S., et al. (2003) Human-mouse alignments with BLASTZ. Genome
Res. 13:103-7.
48.
Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W. and Haussler, D. (2003)
Evolution's cauldron: duplication, deletion, and rearrangement in the mouse
and human genomes. Proc Natl Acad Sci USA. 100:11484-11489.
49.
Altshuler, D., et al. (2000) An SNP map of the human genome generated by
reduced representation shotgun sequencing. Nature 407:513-6.
50.
McConkey, E.H. (2004) Orthologous numbering of great ape and human
chromosomes is essential for comparative genomics. Cytogenet. Gen. Res.
105:157-158.