Supplementary Information
Generation of the chromosome sequences
Both chromosomes were sequenced using a clone-by-clone shotgun sequencing strategy1
supported by the BAC-based whole genome physical map2. The quality of the
chromosome 2 and 4 sequences was determined to exceed the 99.99% accuracy standard
established by the International Human Genome Sequencing Consortium (IHGSC) for
sequencing the human genome3. In addition to the centromeric gaps, there are 17 gaps in
the chromosome 2 sequence and 12 gaps in the chromosome 4 sequence. Sequences
extend into the centromere on both chromosomes and reach the p-arm telomeric sequence
on both chromosomes. The sequence reaches within 200kb of the telomere on 2q and
within 10kb of the telomere on 4q4. Based on the size estimates of remaining gaps, the
available sequence represents greater than 99.6% of the total euchromatic sequence
(Supplementary Table 1).
Attempts were made to close all remaining gaps first using probes derived from
BAC-end sequences from clones flanking each gap and from available chimpanzee
sequence (PanGSC, in preparation) against all available libraries (100-fold clone
coverage of the genome, Supplementary Methods below) and second using sequence
placement of fosmid paired end reads (Consortium, 2004). From the fosmid end
placements, 76 clones were selected which added just over 500kb of sequence (now
included in build35/hg17). Remaining gaps were sized using FISH and using comparative
placement to mouse, rat and chimp. In only one case was additional chimpanzee
sequence detectable that fell in the human gap region. Some of the remaining gaps are
associated with segmental duplications (60%), local complex repetitive structures (20%),
local increases in G+C content of 2 to 10% (50%), and locally high regions of G+C
content (35%, with 10% >=55% G+C). The rapid local increases in GC content flanking
gaps are consistent with the notion that GC-rich regions are difficult to clone5.
The integrity of the sequence and its assembly was tested using a variety of
methods. First, as each clone was finished, an in silico digest of the sequence was
compared for verification purposes to restriction digests of the clone DNA. Second, we
checked the fully assembled sequence by performing in silico digests of clone-sized
fragments across the chromosome against the underlying fingerprint data used to
construct the physical map. In this way, we directly confirmed greater than 99.99% of
the testable bands. Third, we examined BAC, fosmid and plasmid6 paired end sequences
to assess consistency, although such inconsistencies are not conclusive evidence of a
“problem” and often only identify polymorphism. We searched for two or more ends clustered
and missing, or spanning inappropriate distances. Based on BAC placements, eight
suspicious areas were identified, but fosmid and plasmid data through those regions were
supportive of the genome sequence. For the 13 areas with inconsistent fosmid pairing
data, we sequenced each of them and only two suggested modification of the genome
sequence. Five were conclusively determined to be polymorphic (Supplementary Table
9). Finally, we compared the order of the placement of BAC ends against the
chromosome 2 and 4 sequences to the order of the BACs within the fingerprint map2. No
inconsistencies were found.
Orthology to mouse
The relationship between the human chromosome 2 and 4 sequences and the
mouse genome could be readily defined for ~94% and ~97% of the sequenced
chromosomes, respectively. The largest defined segment (44.1 Mb) on chromosome 2
includes two distal-less homeobox containing genes (DLX1 and DLX2) and the HOXD
gene cluster. The largest defined segment (21.7 Mb) on chromosome 4 includes a
replication-independent member of the histone 2A family, H2AFZ. In mice, this
particular histone functions in embryonic development and lack of H2AFZ leads to
embryonic lethality.
Supplementary mRNA information
As described in the text, we examined discrepancies between our genomic sequence and
known mRNAs that created frameshifts and/or truncations in the genomic sequence. In
cases where there was no support for the mRNA sequence variant, we attempted to
construct a gene that avoided the sequence discrepancy. In cases where the site was
determined to be an error in our BAC sequence (updates to our BAC sequence have also
been submitted) or polymorphic, and without the base change we could not create a
protein, the amino acids for the following GenBank7 gi identifiers were added to the
chromosome gene index, even though the BAC-based sequence did not support those
translations. The following mRNAs were not included in the gene index for the reasons
indicated: 19923278 (missing nucleotide causes frameshift, confirmed BAC error),
21314676 (missing nucleotide causes frameshift, confirmed BAC error), 23510369 (extra
nucleotide causes frameshift, BAC sequence supported by PCR), 24850108 (two missing
nucleotides cause frameshift, confirmed BAC error), 27735096 (extra nucleotide causes
frameshift, repetitive region could not be sampled by PCR for verification), 4557854
(several missing nucleotides cause frameshift, BAC sequence supported by PCR
screening), 6006032 (several missing nucleotides cause frameshift, BAC sequence
supported by PCR screening), 8394043 (several missing nucleotides cause frameshift,
BAC sequence supported by PCR screening), 11496888 (frameshift causes internal stop,
ESTs suggest it is a genomic sequencing error), 23468353 (frameshift, confirmed BAC
error), and 16876981 (frameshift, confirmed BAC error).
One mRNA, gi17572816, spans across from chr4 to chr4_random. In the
hg17/build35/May, 2004 release of the chromosomal sequence, this problem has been fixed.
One mRNA, gi21314625, was found to be incomplete (80 bases missing in the middle of
the gene, a confirmed deletion in the BAC sequence which is found in an overlapping
BAC). This has been corrected in the May, 2004, release. Two additional mRNAs
(gi9507164 and gi5031798), as well as the TWIST2 gene (gi|17981707|ref|NM_057179.1|,
gi|21620050|gb|BC033168.1|, gi|17389790|gb|BC017907.1|), which lies in the gap of
NT_005120 and NT_022173, spanned across a sequence gap with some exons falling in
the gap and were not included in the gene set. For gi9507164, although the gap still exists
in the May, 2004, release, we have extended sequence into that gap and can now account
for all exonic sequence.
Based on placements of mRNAs against chromosome 2 and 4, only one possible
deletion was detected and that one was on chromosome 4. The mRNA, BC029568,
appears to indicate a deletion in AC018687. It was found completely in AC018692, a
finished clone that was not used in build34. Basepairs 1-176 of that mRNA are
not in AC018687. The data in AC018692 are currently represented (in build35) in
chr4_random. In the next release of the human genome, we will update the main
chromosome 4 sequence to include the data from AC018692.
Additionally, alignment with the mRNA NM_005277 indicates a 156 and a 236
basepair deletion in the clone AC093819.3. These were verified by alignment with an
overlapping BAC clone and have been correctly updated in GenBank.
As described in the main text, but here provided in greater detail, we also
investigated 161 mRNA regions where the matched genomic sequence contained
substitution or insertion/deletion differences from the mRNA, 83 of which caused a
frameshift and/or truncation of the protein product. To determine the origin of the
difference, we re-sequenced the region of interest in a panel of 24 diverse individuals8, in
the starting BAC and in some cases in overlapping BACs. A total of 78
substitution/triplet insertion discrepancies were examined (only 2 of the 78 were triplet
insertion; one was polymorphic and in the other case all individuals agreed with the
BAC). In eight cases, primers could not be chosen because the sequence was too
repetitive. In eight cases, all genomic samples agreed with the BAC suggesting an error
in the mRNA or a highly rare polymorphism. In two cases, all individuals matched the
mRNA, but the sequence of the control DNA matched the BAC and other underlying
BACs could not be sampled. Only one was determined to be an error in the BAC. A total
of 44 were found to be polymorphic in the population of 24 individuals. An additional 15
were found to not be polymorphic within the panel of 24 individuals but of those 11 were
polymorphic within the RP11 library and in four cases all overlapping RP11 clones
sampled agreed with the BAC sequence.
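As a quick consistency check, the outcome categories reported above can be tallied. The category labels in this sketch are shorthand introduced here for illustration and are not terminology from the analysis itself; the counts are those stated in the text.

```python
# Tally of the 78 substitution/triplet-insertion discrepancies described above.
# The dictionary keys are illustrative labels, not names used by the authors.
substitution_outcomes = {
    "primers_could_not_be_chosen": 8,
    "all_samples_agree_with_bac": 8,
    "individuals_match_mrna_control_matches_bac": 2,
    "bac_error": 1,
    "polymorphic_in_panel_of_24": 44,
    "not_polymorphic_in_panel_of_24": 15,  # 11 polymorphic within RP11, 4 agree with BAC
}

total_substitutions = sum(substitution_outcomes.values())
frameshift_regions = 83  # regions causing a frameshift and/or truncation

print(total_substitutions)                       # 78
print(total_substitutions + frameshift_regions)  # 161
```

The two totals agree with the 78 substitution discrepancies and the 161 mRNA regions stated in the text.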
Of the 83 possible frameshifts, two were too repetitive to obtain data. Eight were
found to be errors in the genomic sequence. In 69 of the 83 frameshift cases, the sequence
in the 24 genomic samples agreed with the sequence of the BAC, again suggesting errors
in the underlying mRNA. Of those, 54 were simple single base insertion or deletion
errors in the mRNA that shifted or truncated the reading frame in the cDNA relative to
the genomic sequence. The other 15 had multiple base insertion/deletion differences
such that the frame was eventually restored. In such cases, comparison with the related
mouse confirmed the genomic translation with a more conserved match between mouse
and the genomic translation than with the original mRNA translation. Most interestingly,
four were determined to be polymorphic in the population (Table 1; Supplementary Table
5). The first polymorphic example is found on chromosome 2 in the protein PMS1
(postmeiotic segregation increase 1), identified by its homology to a yeast protein
involved in DNA mismatch repair. Some cases of hereditary nonpolyposis colorectal
cancer are associated with mutations in this gene. This particular single base deletion
event in the BAC with respect to the mRNA causes a change of frame (4 amino acids
before the stop codon) and extension of the open reading frame by an additional 15
amino acids. Another PMS1 mRNA (BC036376) avoids the genomic region of this
frameshift with an alternative splicing event 71 basepairs upstream. Similarities to
predicted mouse proteins end at amino acid 149, and this frameshift discrepancy is at
amino acid 163.
The polymorphic frameshift in the final exon of the PAS domain containing
serine/threonine kinase (PASK/ NM_015148) leads to a longer open reading frame. The
amino acids extending past the stop indicated by NM_015148, however, do not match
the orthologous predicted gene in the mouse genome.
The third polymorphism was identified in ancient ubiquitous protein 1 (AUP1/
NM_012103), one of three isoforms referred to as the longest isoform at 476 amino acids.
The frameshift modifies the protein beginning at position 147 and then truncates the
protein with a stop codon at amino acid 149. The similarity to mouse corresponds with an
alternate isoform and does not extend through this region (ends at aa 113).
A fourth polymorphism, an insertion of four basepairs in the genomic sequence
relative to the mRNA, was identified on chromosome 4 in TSARG2, testis and
spermatogenesis cell related protein 2 and causes truncation of the protein at amino acid
281 out of 305 (frameshift occurs at amino acid position 279). Two underlying mRNAs
agree with the genomic data and read through the region of this 4 basepair deletion
without a frameshift, confirming the polymorphism. The similarity to the related mouse
protein (ENSMUST00000033917) extends through the region of the frameshift (to amino
acid 287) indicated by the mRNA sequence.
In addition to the frameshifts detected by comparisons of the mRNA with the
genomic DNA, we also detected possible frameshifts by comparing all genomic coding
regions to the random genomic shotgun data available through the HAPMAP project6.
We detected 28 additional potential high quality frameshifts in coding regions of
REFSEQ9 or MGC10 mRNAs on chromosome 2, only 6 of which were annotated in
dbSNP11. On chromosome 4, we detected 15 possible frameshifts in coding regions of
REFSEQ mRNAs, only one of which was annotated in dbSNP. In all cases but one, the
frameshift led to truncation of the protein. This suggests an additional four possible
polymorphic frameshifts or BAC errors where the chimp sequence agreed with the
mRNA, each of which will require further validation.
Supplemental non-coding RNA information
The lower than expected density of ncRNA genes on these two chromosomes may reflect
the fact that many ncRNA genes are strongly clustered, such as the large array of 18
cysteine tRNAs at 7q36 (Hillier et al., 2003), the tandem array of 5-25 U2 snRNAs at
17q21 (Pavelitz et al., 1999), or the complex array of ~30 U1 snRNA genes and tRNAs at
1p36 (Lindgren et al. 1985). Chromosomes 2 and 4 appear to lack any large clusters of
this sort. There are seven small clusters of 2-3 genes each within 100 kb of each other:
two tRNA pairs, one pair of miRNAs, and four sets of snoRNAs in the introns of coding
genes (2 in EF-1-beta, 3 in nucleolin, 2 in APG16L, and 2 in ribosomal protein S3a).
Supplemental Segmental Duplication Analysis
Segmental duplications have long been noted for their potential role in the evolution of
new genes12. To examine the transcriptional and coding potential of duplicated regions,
we analyzed a non-overlapping set of known genes with or without additional spliced
ESTs. For each group, we categorized every exon as unique or duplicated on the basis of
its overlap with duplicated sequence (Supplementary Table 10). About 5.49% (1162 out
of 21164) of exons of chromosome 2 are duplicated, while only 2.47% (277 out of
11213) of exons of chromosome 4 are duplicated. Our analysis shows that the relative
number of transcribed exons is slightly greater for duplicated DNA when compared with
non-duplicated DNA on chromosome 2 and 4. These results support a previous
observation13 that recently duplicated regions are rich in genes/transcripts.
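The duplicated-exon percentages quoted above follow directly from the exon counts. A minimal sketch of that arithmetic, where the function name is introduced here for illustration:

```python
def percent_duplicated(duplicated_exons: int, total_exons: int) -> float:
    """Percentage of exons overlapping duplicated sequence, rounded to 2 decimals."""
    return round(100.0 * duplicated_exons / total_exons, 2)

print(percent_duplicated(1162, 21164))  # chromosome 2 -> 5.49
print(percent_duplicated(277, 11213))   # chromosome 4 -> 2.47
```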
Supplementary Methods
Physical Mapping
The human libraries listed here were probed for filling gaps in the chromosome 2 and 4
maps. In addition, we have probed against filters from the RP43 chimpanzee library.
Library                                            Estimated Coverage
RPCI-4 RPCI Male PAC                               4x
RPCI-5 RPCI Male PAC                               6x
RPCI-11 Segments 1-5 Male BAC                      32.2x (refs 14, 15)
RPCI-13 Segments 1-4 Female BAC                    21.8x
Genome Systems BAC library 1                       10x
Cal Tech Segments A, B                             13x (ref 16)
Cal Tech D                                         17x (ref 16)
Broad/MIT fosmid library (H_AA)                    15x
MRC Geneservice single chromosome cosmids/fosmids:
  LL02NC02 – chromosome 2                          4x
  LL02NC03 – chromosome 2                          2.6x
  LA04NC01 – chromosome 4                          3x
RP libraries: http://www.chori.org/bacpac/
Cal Tech libraries: http://informa.bio.caltech.edu/idx_www_tree.html
Genome Systems libraries: http://www.genomesystems.com/
MRC libraries: http://www.hgmp.mrc.ac.uk/geneservice/reagents/products/descriptions
Sequence and Assembly Validation
Each chromosome sequence was segmented into 150-kb fragments each with 40%
overlap with the previous simulated clone. An in silico HindIII digest (ISD) of these
simulated clones was performed; bands less than 600 bp were removed from
consideration to be consistent with the fingerprint data. Fingerprint data from the human
genome physical mapping effort2 were converted from mobilities to sizes so that clones
from the fingerprinting effort could then be directly compared to the chromosome
fragments.
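The segmentation and in silico digest described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline actually used: the recognition site (AAGCTT), the cut offset within the site, and the function names are assumptions introduced here, and the restriction site is left as a parameter.

```python
def simulated_clones(chrom_length, clone_size=150_000, overlap_fraction=0.40):
    """Yield (start, end) for simulated clones, each overlapping the previous
    one by overlap_fraction of the clone size (end is exclusive)."""
    step = round(clone_size * (1.0 - overlap_fraction))  # 90 kb with the defaults
    start = 0
    while start < chrom_length:
        end = min(start + clone_size, chrom_length)
        yield start, end
        if end == chrom_length:
            break
        start += step


def in_silico_digest(sequence, site="AAGCTT", cut_offset=1, min_band=600):
    """Return fragment sizes after cutting at every occurrence of `site`.

    `site` and `cut_offset` are assumed values for illustration; fragments
    shorter than `min_band` are dropped, mirroring the 600 bp cutoff.
    """
    sequence = sequence.upper()
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    boundaries = [0] + cuts + [len(sequence)]
    sizes = [b - a for a, b in zip(boundaries, boundaries[1:])]
    return [s for s in sizes if s >= min_band]
```

With the default settings each simulated clone starts 90 kb after the previous one, so consecutive clones share 60 kb.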
The ISD of each 150-kb simulated clone (2,662 clones for chromosome 2; 2,094
for chromosome 4) was then compared to the overlapping clones in the fingerprint map.
We required that each band in the simulated clone be confirmed by no more than two of
the overlapping fingerprints. A band was considered confirmed if the size of
the fragment in the simulated clone differed by less than 3% from that of a band in the
fingerprints under consideration. For bands between 600 and 700 bp (N=1828 bands in
chromosome 2 are in this range; N=1438 bands in chr 4) and for bands greater than 11-kb
(N=2119 chromosome 2; N=1424 chromosome 4), we allowed the discrepancy to be
<4% given the slightly higher variability of the fingerprint gels in those ranges17. Only
405 and 226 additional bands on chromosome 2 and 4, respectively, were confirmed using
the 4% threshold rather than the 3% threshold.
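The size-tolerance rule above can be expressed compactly. The sketch below is a simplification with assumptions labeled: the relative difference is measured against the simulated band size, a band is treated as confirmed when any overlapping fingerprint contains a matching band, and the requirement on the number of supporting fingerprints is not modeled. All function names are introduced here for illustration.

```python
def tolerance_for(band_size):
    """Relative tolerance: 4% for bands of 600-700 bp or above 11 kb, otherwise 3%."""
    if 600 <= band_size <= 700 or band_size > 11_000:
        return 0.04
    return 0.03


def band_matches(simulated_size, fingerprint_sizes):
    """True if some fingerprint band lies within the tolerance of the simulated band.

    Assumption: the difference is taken relative to the simulated band size.
    """
    tol = tolerance_for(simulated_size)
    return any(
        abs(fp - simulated_size) / simulated_size < tol for fp in fingerprint_sizes
    )


def unconfirmed_bands(simulated_bands, overlapping_fingerprints):
    """Simulated bands with no match in any of the overlapping fingerprints."""
    return [
        band
        for band in simulated_bands
        if not any(band_matches(band, fp) for fp in overlapping_fingerprints)
    ]
```

For example, `unconfirmed_bands([1000, 5000], [[1020, 3000], [4000]])` reports only the 5,000 bp band, since 1,020 bp is within 3% of 1,000 bp.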
An ISD of the full chromosome 2 sequence revealed 71,331 total bands of which
58,741 were greater than 600 bp. For chromosome 4, there were 58,729 total bands and
48,002 bands greater than 600 bp. Using the above method on chromosome 2, only 6
bands from 5 clones (2 of the clones were at contig ends) were not confirmed. For
chromosome 4, there were only 6 unmatched bands (from 12 clones, 7 of which were at
contig ends). In this way, 99.99% of the testable fingerprint bands were confirmed.
Additionally, the order of the in silico digest placements in the FPC database was
consistent with the assembly position, with the only minor discrepancies found adjacent
to contig ends where sequence had been used to resolve the final assembly ordering.
Tiling Path Verification, Overlap Polymorphisms and Assaying mRNA/genomic
discrepancies
To evaluate clone overlaps where the rate of difference between overlapping clone
sequences was higher than 1 in 1000 bases, a PCR product encompassing the differences
was sequenced from each BAC in the overlap region and from each of a panel of 24
ethnically diverse genomic DNA samples8. If the 24 samples showed allelic variation, the
overlap was judged to be correct, but if the 24 samples yielded persistent heterozygosity,
the sequence was determined to be derived from a repeated sequence, with sequence
differences between the copies.
To investigate discrepancies between mRNA and genomic sequence, PCR
products were generated, resequenced and the polymorphic bases examined in the DNA
from 24 individuals, the original BAC and in some cases other BACs.
Initially, we tested 21 and 32 overlaps for chromosome 2 and 4, respectively, that
contained at least three differences per kb. On chromosome 2, there were 3,547 differences
in 955 kb, or 3.7 differences per kb on average. On chromosome 4, there were 8,785
differences in 2,186 kb, or 4.0 differences
per kb on average. The highest rate of apparent variation identified was 5.6 per kb on
chromosome 2 and 6.3 per kb on chromosome 4.
We further evaluated the overlap regions of 2,718 human clones (from
chromosomes 2, 4 and 7) by aligning each overlap using cross_match (P. Green,
unpublished) with the following parameters: -minmatch 50 -minscore 2000
-discrep_lists -tags -penalty -1. We counted the number of phred18 high quality (base
quality value >19) substitution, insertion and deletion events (where each insertion or
deletion event, regardless of the numbers of basepairs inserted, was counted as one
polymorphic event). We counted events in 5kb non-overlapping windows. Of those
2,718 overlaps (8,309 windows), 678 had at least 5kb of overlap and more than 3
“polymorphic events” in at least one of the 5kb windows. The greatest number of events
per window was 136.
We identified 24 different regions where there were at least 3 consecutive windows
with more polymorphic events than two standard deviations from the mean. When
considering substitution + insertion + deletion events, the mean number of polymorphic
events per 5kb window was 5.23 (s.d. 6.44). There were 19 overlaps with at least 3
consecutive windows of > 2 standard deviations (>18). When considering only
substitution events, the mean number of events per 5kb window was 4.20 (s.d. 5.57). There
were 21 overlaps with at least 3 consecutive windows greater than 2 standard deviations
(>15). When considering only insertion + deletion events, the mean number of events
per 5kb window was 1.03 (s.d. 1.48). There were 4 overlaps with at least 3 consecutive
windows of greater than 2 standard deviations. We examined these regions for features
that may indicate balancing selection. Of the 24, 15 were in the region of a gene.
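The window-based screen above can be sketched as follows. This is a minimal illustration, assuming event positions are 0-based offsets within an overlap and that a trailing partial window is ignored; neither detail is stated in the text, and the function names are introduced here.

```python
def events_per_window(event_positions, overlap_length, window=5_000):
    """Count events in non-overlapping windows (a trailing partial window is dropped)."""
    n_windows = overlap_length // window
    counts = [0] * n_windows
    for pos in event_positions:
        idx = pos // window
        if idx < n_windows:
            counts[idx] += 1
    return counts


def has_elevated_run(counts, threshold, min_run=3):
    """True if at least `min_run` consecutive windows exceed `threshold`."""
    run = 0
    for count in counts:
        run = run + 1 if count > threshold else 0
        if run >= min_run:
            return True
    return False


# Threshold for all event types: mean + 2 s.d. = 5.23 + 2 * 6.44 (about 18.11)
threshold_all = 5.23 + 2 * 6.44
```

Applying the same formula to the substitution-only statistics gives 4.20 + 2 * 5.57 = 15.34, consistent with the ">15" cutoff quoted above.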
Sequencing in other primates
We attempted to amplify across each exon in each gene of interest from the following
primates: Homo sapiens, Celebes crested macaque, Sumatran orangutan, Gorilla gorilla,
black-handed spider monkey (new world monkey), and chimpanzee (Pan troglodytes).
Primers were chosen in highly conserved human/chimp intronic regions directly flanking
the exons. Multiple primers were chosen to increase the possibility of getting a
successful product. For primates where we were not able to amplify a given exon, new
primers were chosen using sequence conservation data from
other monkeys where amplification had been successful.
For mRNAs with no similarity to mouse (even after translating mouse genomic
sequence in all 6 frames) or to anything in the non-redundant protein database, we
resequenced six: a) NM_022492, exon 4 of 9, b) BC021739, exon 2 of 5, c)
NM_153031, exon 1 of a 2 exon gene, d) NM_152997, exon 3 of a 5 exon gene, e)
NM_178497, exon 2 of a 2 exon gene, and f) NM_005750, exon 1 of a 2 exon gene. As
described in the text, only NM_153031 did not have an open reading frame in all of the
sequenced exons.
For our detailed examination of NM_153031, we sequenced exon 1, the only
coding exon (501 bp of coding nucleotides) of a 2 exon gene (NM_153031/gi 23308528/
hypothetical protein FLJ32063). The chimp PCR gave a slightly larger product
than the other primates. Only chimp had an open reading frame throughout the exon and
was 99% identical at the nucleotide level to the mRNA. In both orangutan (97%
identical) and gorilla (87% identical), there is an early frameshift (and several other
frameshifts later in the exon). In monkey (92% identical), a frameshift occurred about
halfway through the coding region followed by other frameshifts. If the frameshifts are
artificially removed, the percent identities at the amino acid level are chimp 99%, gorilla
77%, monkey 83% and orangutan 95%. Again, if frameshifts are artificially removed, the
Ka/Ks data suggest these genes are evolving at varying rates (human/chimp=0.342;
human/gorilla=0.881; human/monkey=0.563; human/orangutan=0.562). Based on these
data, this protein appears to only be functional in human and chimp (if it is functional at
all) and not in the more distant primates.
Sequence-Map Comparisons
Locations of STS markers are determined using a combination of three methods. First, all
available complete marker sequences are aligned using BLAT19 with parameter
-ooc=11.ooc. These are further filtered to include only the best alignments with at least
60% coverage. Second, fasta sequences created using primer sequence information are
aligned using BLAT with parameters -tileSize=10 -ooc=10.ooc -minMatch=1
-minScore=1 -minIdentity=75. These are then filtered based on the number of mismatches
and deviance from the reported product size. For cases with no mismatches, the size is
allowed to deviate up to 200 bases. Similarly, combinations of one mismatch/150 bases,
two mismatches/50 bases, and three mismatches/25 bases are used. Third, e-PCR20 is run
using primer information with parameters N=1 M=50 W=5, where N is the number of
allowed mismatches, M is the allowed deviance from the reported product size, and W is
the word size. The results of the three methods are combined with preference being
given to full sequence alignments. That is, in cases where full sequence alignments are
found, these are the only placements reported, otherwise primer-based locations are
reported. In build34, placements were located for 406 of 407 deCODE21 markers on chr2,
and 299 of 302 markers on chr4. The missing markers are D2S2355 (chr2 37.04),
D4S2969 (chr4 80.96), D4S2627 (chr4 91.86), and D4S1635 (chr4 66.38).
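The primer-based filtering rule and the preference for full-sequence alignments can be summarized in a few lines. This is a sketch under stated assumptions: the function and constant names are introduced here, placements with more than three mismatches are assumed to be rejected, and the size limits are treated as inclusive.

```python
# Maximum allowed deviation (bases) from the reported product size,
# keyed by the number of primer mismatches, as listed in the text.
MAX_SIZE_DEVIATION = {0: 200, 1: 150, 2: 50, 3: 25}


def keep_primer_placement(mismatches, observed_size, reported_size):
    """Apply the mismatch/size-deviation combinations to one primer-based hit."""
    limit = MAX_SIZE_DEVIATION.get(mismatches)
    if limit is None:  # assumption: more than three mismatches is rejected
        return False
    return abs(observed_size - reported_size) <= limit


def choose_placements(full_sequence_hits, primer_hits):
    """Report full-sequence alignments when present, otherwise primer-based hits."""
    return full_sequence_hits if full_sequence_hits else primer_hits
```

For instance, a hit with two mismatches and a product 60 bases longer than reported would be discarded, while the same size deviation with one mismatch would be kept.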
The deCODE markers provide a reasonable test of the long range consistency of
the construction of the chromosomal sequence in that they are distributed fairly evenly
across the chromosome. On chromosome 2, 406 markers are distributed over 243,387,553
bp (density = 1 per 599,477 bp), evenly except for the centromere (approximately 89 to
97 Mb), which has no markers, and the region between 47 and 56 Mb, which has only
4 markers. On chromosome 4, 299 markers are distributed over 191,713,478 bp (density =
1 per 641,182 bp), evenly except for the regions from 45 to 52 Mb and from 125 to
135 Mb, which have 4 markers each. There is only one case where the
positions of the deCODE markers relative to the sequence suggest an inversion:
[AFM112YD4 genetic position (chr2, 258.19, 0) physical position (chr2,241488777)/
AFM356TE5 genetic position (chr2, 256.32, 0) physical position (chr2,241560233) ].
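A simple way to flag such order conflicts is to sort markers by physical position and look for adjacent pairs whose genetic positions decrease. The sketch below is illustrative only; it assumes genetic and physical coordinates increase in the same direction along the chromosome, and the function name is introduced here.

```python
def order_conflicts(markers):
    """markers: list of (name, genetic_position, physical_position).

    Returns adjacent marker pairs, ordered by physical position, whose
    genetic positions run in the opposite direction.
    """
    by_physical = sorted(markers, key=lambda m: m[2])
    return [
        (a[0], b[0])
        for a, b in zip(by_physical, by_physical[1:])
        if b[1] < a[1]
    ]


markers = [
    ("AFM112YD4", 258.19, 241_488_777),
    ("AFM356TE5", 256.32, 241_560_233),
]
print(order_conflicts(markers))  # [('AFM112YD4', 'AFM356TE5')]
```

The two markers lie about 71 kb apart on the sequence, so a discordance at this scale is also compatible with limited genetic map resolution.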
As described in the main text, <1% of the genetic markers could not be found on
the chromosome 2/4 sequence. We categorized those 78 markers having sequence that
could not be localized. For example, 43 were largely repetitive making them difficult to
localize reliably by computational methods. Additionally, 23 were localized to
chromosomes other than chromosome 2 or 4. A total of six fell in regions where we
currently have remaining gaps in the chromosome sequence. Finally, six fall in regions
where we do not have any corresponding sequence gaps suggesting polymorphisms,
unknown gaps in coverage, or clerical errors during genetic mapping.
GC content
GC content was measured in 20kb windows across the chromosomes. As described
in the paper, but provided with more detail here, the GC content of chromosomes 2
(40.2%) and 4 (38.2% overall, with the q-arm 37.4%) is lower than that of the genome as
a whole (41%). On chromosome 2, the region of lowest GC content is a 40 kb stretch at 29.9%
containing no known genes. The regions of highest GC content are near the q-telomere
on both chromosomes. Adjacent to sequence gaps on chromosome 2 and 4, we find rapid
increases in GC content. On chromosome 4, 19% of the chromosome has a GC content of
less than 35% as compared to 9% in the genome as a whole, yet one of the highest GC
content windows in the genome at 72.7% is also located on chromosome 4. This 20 kb
region is only 25% repetitive and contains the double homeobox 4 gene (DUX4), which
may play a role in facioscapulohumeral muscular dystrophy22. Also related to the overall
low GC content, 2.3% of chromosome 4 has a GC content of >=50%, while 7.4% of the
genome has a GC content of >=50%.
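A windowed GC calculation of the kind described above might look like the following. This is a minimal sketch; excluding ambiguous bases (such as N) from the denominator is an assumption made here, and the function name is illustrative.

```python
def gc_content_windows(sequence, window=20_000):
    """Yield (window_start, gc_fraction) for non-overlapping windows.

    Only A, C, G and T are counted, so ambiguous bases do not dilute the
    fraction; windows with no unambiguous bases are skipped.
    """
    sequence = sequence.upper()
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        acgt = gc + chunk.count("A") + chunk.count("T")
        if acgt:
            yield start, gc / acgt
```

The fraction of windows at or above a given GC threshold (for example 0.50) can then be computed directly from the yielded values.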
CpG Islands
CpG islands23 were predicted by searching the sequence one base at a time, scoring each
dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments.
Each segment was then evaluated to determine GC content (>=50%), length (>200bp),
and ratio of observed proportion of CG dinucleotides to the expected proportion on the
basis of the GC content of the segment (>0.6), using a modification of a program
developed by G. Miklem (personal communication). A CpG island was defined to
overlap a predicted gene if it fell within 5 kb upstream or 1 kb downstream of the
predicted first exon. A 200kb window size was used on repeat-masked DNA for this
analysis.
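The scoring scheme and the three acceptance criteria can be illustrated with a simplified scan. This sketch is not the program used for the analysis: it finds positively scoring segments with a single left-to-right pass and trims each to the point where its running score peaks, and the expected CG count is taken as length * (GC fraction / 2)^2, which is one reading of "expected on the basis of the GC content". Function names are introduced here.

```python
def candidate_segments(seq, cg_score=17, other_score=-1):
    """Return (start, end) pairs (end exclusive) of positively scoring segments."""
    seq = seq.upper()
    segments = []
    score = best = 0
    start = best_end = None
    for i in range(len(seq) - 1):
        is_cg = seq[i:i + 2] == "CG"
        if start is None:
            if not is_cg:
                continue
            start, score, best = i, 0, 0  # open a new segment at a CG
        score += cg_score if is_cg else other_score
        if score > best:
            best, best_end = score, i + 2
        if score <= 0:  # segment exhausted: keep it up to its peak
            segments.append((start, best_end))
            start = None
    if start is not None:
        segments.append((start, best_end))
    return segments


def is_cpg_island(segment):
    """Apply the GC content, length and observed/expected CG criteria."""
    segment = segment.upper()
    length = len(segment)
    gc = segment.count("C") + segment.count("G")
    if length <= 200 or gc == 0:
        return False
    expected_cg = gc * gc / (4 * length)  # length * (gc / (2 * length)) ** 2
    obs_exp = segment.count("CG") / expected_cg
    return gc / length >= 0.5 and obs_exp > 0.6
```

Candidate segments would be passed to `is_cpg_island` as `seq[start:end]`; segments that fail any one criterion are discarded.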
Analysis of the repeat-masked chromosome 2 and 4 sequences revealed 1,662
(7/Mb) and 1,004 (5.4/Mb, the lowest density of any human chromosome) CpG islands
with an average length of ~800 bp. Of the known mRNAs localized to chromosomes 2
and 4 (see below), the 5’ ends of 69.5% and 66.5%, respectively, were within 5 kb upstream to 1 kb
downstream of a CpG island. For the full gene sets presented here, the proportion with
overlapping CpG islands was 67% (chromosome 2) and 64% (chromosome 4), higher
than the reported figure of 60% for the genome as a whole24.
Identifying Known Genes
We built a non-redundant set of human mRNAs using the human division of REFSEQ25
and the human division of the Mammalian Full Length cDNA collection26 (October,
2003). We removed redundancy between the two gene sets retaining the largest mRNA in
cases where an identical shorter mRNA was detected. Each mRNA was searched against
a soft-masked (see repeat masking methods) chromosome 2 and 4 sequence using
BLAT19 and the following parameters: -q=rna -trimHardA -fine. Those mRNAs
matching with >=30% coverage were re-aligned using SPIDEY27, BLAT19, EXALIN
(Zhang M, Yeh RT, Gish W, manuscript submitted) and EST_GENOME28 to their
matching chromosomal region(s) with an additional 100,000 basepairs added to the start
and end of the genomic sequence. Parameters used for each application were as follows
(strand assigned by first BLAT round where needed):
SPIDEY: -S <strand> -d T
BLAT: -q=rna -trimHardA -fine -ooc=11.ooc
EXALIN: -p HumanSPM.par -s 2 -i <strand>
EST_GENOME: -match 5 -mismatch 11 -gap_penalty 11 -intron_penalty 130 -splice_penalty 50
We sorted the output from each program to find the best alignments based on these
criteria: (a) coverage/identity of the mRNA (b) coverage/identity of the coding sequence
(c) percentage of canonical splice sites (GT/AG, CT/AC or GC/AG).
In addition to the criteria used to determine candidate genes on chr 2/4, each
mRNA was searched against the entire human genome (build34) to find matches >=60%
coverage. Those mRNAs were then aligned using SPIDEY to each of these regions to
verify that the best hit for each mRNA truly was to chromosome 2 or 4. Further, all
matches giving a single exon hit on chromosome 2/4 and having a multi-exon hit on
another chromosome were classified as potential pseudogenes (chromosome 2: 193
REFSEQ + 292 MGC, chromosome 4: 157 REFSEQ + 188 MGC).
All mRNAs which aligned with >97% mRNA coverage and identity, 100%
protein coverage/identity and 100% of canonical splice sites were considered “good”
genes and required no further manual review (chromosome 2: 714 REFSEQ + 484 MGC,
chromosome 4: 430 REFSEQ + 257 MGC). The remaining mRNA alignments were
subjected to full manual review (chromosome 2: 534 REFSEQ + 313 MGC, chromosome
4: 252 REFSEQ + 135 MGC).
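The triage rule separating "good" genes from those needing manual review can be written as a small predicate. The record type and field names below are hypothetical, introduced only to make the thresholds concrete; values are percentages.

```python
from dataclasses import dataclass


@dataclass
class AlignmentSummary:
    """Hypothetical per-mRNA summary; all values are percentages."""
    mrna_coverage: float
    mrna_identity: float
    protein_coverage: float
    protein_identity: float
    canonical_splice: float  # share of splice sites that are canonical


def needs_manual_review(a: AlignmentSummary) -> bool:
    """False only for alignments meeting every 'good' gene criterion."""
    good = (
        a.mrna_coverage > 97
        and a.mrna_identity > 97
        and a.protein_coverage == 100
        and a.protein_identity == 100
        and a.canonical_splice == 100
    )
    return not good
```

Under this rule an alignment with exactly 97% mRNA coverage is still sent for review, because the threshold is strict.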
All discrepancies were manually evaluated both in the genomic and mRNA
sequences. To facilitate the manual review process EAnnot (see detailed methods
description below) provided information about internal stops, frameshifts and mismatches
of the gene models with the translation of the underlying mRNA. Also, based on other
mRNA and EST information from GenBank, EAnnot reported the position and possible
source of differences, polymorphisms, genomic sequence and underlying mRNA errors.
If there were ESTs or mRNAs matching both underlying mRNA and genomic sequence
the difference was assumed to be due to a polymorphism. If the genomic sequence was
different than the underlying mRNA but there were other mRNAs or ESTs that supported
the genomic sequence, the difference was attributed to the underlying mRNA sequence
error or possible polymorphism. If there were no mRNAs or ESTs to support the genomic
sequence or if the source of the difference could not be determined with confidence the
problematic region was sent for screening in a panel of 24 individuals to determine the
source of the discrepancy.
Gene models were manually adjusted based on genomic sequence data. These
changes included extending or truncating translation by finding small 5' exons and using
alternative stop sites based on genomic sequence data, adjusting untranslated regions and
correcting mis-identified pseudogenes. If needed, splice sites were adjusted to keep
translation in frame and regions around non-canonical splice sites were checked for the
presence of alternative canonical splice sites that would not change the CDS. As a part of
the checking process, the annotated CDS was translated and searched against the
translated mRNAs for verification.
After creation of the final non-redundant set of mRNAs, a summary of the
performance of each of the alignment tools was created. Each genomic region (extended
by 2kb in each direction) was given to the tools along with the mRNA sequence identical
to that region. In 75% of the cases on chromosome 2 and 81% of the cases on
chromosome 4, the SPIDEY gene prediction was identical to the final
mapping of the mRNA. Further, 84% of the exons on chromosome 2 and 83% of the
exons on chromosome 4 were predicted exactly. For EST_GENOME, 79% on
chromosome 2 and 82% on chromosome 4 of the full genes, and 95% of the exons on
both chromosomes 2 and 4 were predicted identically to the final mRNA. For EXALIN,
83% on chromosome 2 and 88% on chromosome 4 of the full genes and 91% on
chromosome 2 and 92% on chromosome 4 of the exons were predicted identically to the
final mRNA. For BLAT, 92% on chromosome 2 and 94% on chromosome 4 of the exons
were predicted identically to the final mRNA. Additionally, for BLAT, 94% of the
mRNAs on chromosome 2 and 95% on chromosome 4 were aligned identically over
greater than 99% of their length.
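The gene-level and exon-level agreement figures quoted above could be computed along the following lines. This is a minimal sketch: the dictionary layout (mRNA identifier mapped to a list of exon coordinates) and the function name are assumptions made for illustration, not the format used in the original analysis.

```python
def agreement(final, predicted):
    """Return (fraction of genes, fraction of exons) predicted exactly.

    Both arguments map an mRNA id to a list of (start, end) exon coordinates;
    'final' holds the curated mapping and 'predicted' the output of one tool.
    """
    genes_ok = 0
    exons_ok = 0
    exons_total = 0
    for mrna_id, final_exons in final.items():
        pred_exons = set(predicted.get(mrna_id, []))
        # A gene counts as correct only if every exon matches exactly.
        if pred_exons == set(final_exons):
            genes_ok += 1
        exons_ok += sum(1 for exon in final_exons if exon in pred_exons)
        exons_total += len(final_exons)
    return genes_ok / len(final), exons_ok / exons_total
```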
Only genes with identities to known mRNAs have been entered into the GenBank
annotation for each clone. Genes in other classes are available on our web site but were
not submitted to the public databases.
To create a non-redundant set of known genes, first, we removed identical
predictions. Next, we manually examined each locus for any remaining redundant forms
that were not removed automatically. When two forms had the same CDS but differed in
the length of their UTRs we kept the longest one. All known genes with multiple variants
were verified by their entries in LocusLink.
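The automatic part of this redundancy removal can be sketched as follows. The record layout (keys 'id', 'cds' and 'utr_length') is hypothetical, and the manual examination of each locus is not represented.

```python
def non_redundant(models):
    """Keep one model per distinct CDS, preferring the longest UTRs.

    Each model is a dict with hypothetical keys:
      'id'         -- identifier of the gene model
      'cds'        -- tuple of (start, end) coding exons
      'utr_length' -- total number of untranslated bases
    Identical predictions collapse to the first one seen.
    """
    best = {}
    for model in models:
        kept = best.get(model["cds"])
        if kept is None or model["utr_length"] > kept["utr_length"]:
            best[model["cds"]] = model
    return list(best.values())
```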
Known genes without similarity to mouse
A set of 102 genes predicted from the known mRNAs did not have any similarities to the
mouse protein database, 65% of which were single exons. Of these genes, 66 were found
to have some similarity to mouse or rat when instead searching the entire translated
genomic sequences. Only three of these, however, were convincing similarities and
many of the 66 may well only be conserved UTR. In the end, 36 had no match at all to
mouse or rat sequences and are described in the main text.
Identifying Frameshifts in Known Genes using HAPMAP
HAPMAP6 reads were aligned to the human genome (hg16/build34) using SSAHA29.
Only best-in-genome alignments with greater than or equal to 97% identity were retained.
We identified those high quality insertion/deletion SNPs (quality >=25 and average
quality >=30) that overlapped CDS exons and manually reviewed the alignments. We
detected 27 potential frameshifts in coding regions of REFSEQ mRNAs on human
chromosome 2, only 6 of which were annotated in dbSNP. On chromosome 4, we
detected 15 possible frameshifts in coding regions of REFSEQ mRNAs, only one of
which was annotated in dbSNP. Only in the case of FLJ38379 did the frameshift result in
a “longer” open reading frame. In all other cases, the frameshift resulted in a truncated
protein.
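The selection of candidate frameshifts for manual review can be sketched as below. The record keys are hypothetical, and the test that an insertion or deletion length is not a multiple of three is an assumption added here to make the notion of a frameshift explicit; the original analysis relied on manual review of the alignments.

```python
def candidate_frameshifts(indels, cds_exons, min_quality=25, min_avg_quality=30):
    """Select high-quality indels that overlap a CDS exon and shift the frame.

    indels    -- iterable of dicts with hypothetical keys 'pos', 'length',
                 'quality' and 'avg_quality'
    cds_exons -- list of (start, end) coding intervals, inclusive
    """
    hits = []
    for indel in indels:
        if indel["quality"] < min_quality or indel["avg_quality"] < min_avg_quality:
            continue
        in_cds = any(start <= indel["pos"] <= end for start, end in cds_exons)
        # An indel whose length is not a multiple of 3 changes the reading frame.
        if in_cds and indel["length"] % 3 != 0:
            hits.append(indel)
    return hits
```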
Alternative splicing evaluation and polyA signal identification
A tool, EAnnot30, has been developed to incorporate and cluster vertebrate mRNA and
EST data as well as polyA signal identification to predict gene structures, identify
polymorphism data, and integrate available data to annotate the resulting genes. For this
analysis, EAnnot was used for polyA identification and quantification of alternative
splicing events.
EAnnot gene boundary determination. EAnnot begins with correcting mapping gaps,
repetitive mappings, misidentification of splice sites, and other errors. EAnnot then gives
the strand assignments to spliced mRNAs and ESTs based on splicing consensus
provided by SplicePredictor31 (http://ftp.zmdb.iastate.edu). EAnnot also utilizes
information from dbEST32 regarding the clonal origin of the EST that is critical for the
positive identification of ESTs linked to the same clone (clone linking). Gene boundaries
are further determined by sequence overlaps, strands, and clone linking. The number of
gene boundaries reflects the number of genes. Gene prediction proceeds within each
gene boundary.
EAnnot Sequence clustering. EAnnot sorts mRNAs into mRNA clusters based on
splicing patterns. EAnnot then compares ESTs with mRNA clusters to identify ESTs
with novel splicing patterns (alternatively spliced ESTs). These alternatively spliced
ESTs are further grouped into EST clusters. Non-redundant mRNA clusters and novel
EST clusters are used to generate gene models. For newly identified gene boundaries
(loci) without mRNA information, all ESTs are clustered.
PolyA identification: All ESTs and mRNAs (spliced and non-spliced) are examined for
the presence of a stretch of As at the 3' end of the sequence or a stretch of Ts at the 5' end
of the sequence. The presence of at least 8 As or Ts among the last or first 10 bases of the
sequence is required for the further analysis. If at least 8 As or Ts are found, EAnnot will
continue to count As immediately before the last 10 bps or Ts immediately after the first
10 bps until a different nucleotide is found. EAnnot requires at least 5 As or Ts at either
end of the sequence not mapped to genomic sequence. In addition, the difference between
the number of As or Ts at either end of the sequence and the number of As or Ts not
mapped to the genomic sequence should not exceed 8. If all conditions listed above are
satisfied, polyA signals (AATAAA or ATTAAA) are searched for within 50 bp
upstream of the polyadenylation site. If the polyA sites and the polyA signals are
found, they are assigned in pairs.
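The rules above can be sketched in Python for the 3' polyA case (the 5' polyT case is symmetrical). This is not the EAnnot implementation: the function names are hypothetical, the tail count is taken to be the As among the last 10 bases plus the uninterrupted run before them, and where several signals lie in the window the one closest to the site is reported. Both choices are assumptions.

```python
SIGNALS = ("AATAAA", "ATTAAA")

def polya_tail_length(seq):
    """Count tail As following the rules above; return 0 if the tail is too weak."""
    seq = seq.upper()
    last10 = seq[-10:]
    count = last10.count("A")
    # At least 8 As are required among the last 10 bases.
    if len(last10) < 10 or count < 8:
        return 0
    # Keep counting As immediately before the last 10 bases.
    i = len(seq) - 11
    while i >= 0 and seq[i] == "A":
        count += 1
        i -= 1
    return count

def has_polya_site(seq, unmapped_tail_as):
    """Apply the unmapped-base conditions (at least 5 As, difference at most 8)."""
    tail = polya_tail_length(seq)
    return tail > 0 and unmapped_tail_as >= 5 and tail - unmapped_tail_as <= 8

def find_signal(genomic, site, window=50):
    """Return the start of the signal nearest the site within the window, or None.

    'site' is the 0-based genomic index of the polyadenylation site; only the
    'window' bases immediately upstream of it are searched.
    """
    start = max(0, site - window)
    region = genomic[start:site].upper()
    best = max(region.rfind(signal) for signal in SIGNALS)
    return None if best < 0 else start + best
```

A transcript ending in twelve As, ten of which do not map to the genome, would pass `has_polya_site`; the signal search is then run on the genomic sequence upstream of the mapped end.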
EST comparisons
All human expressed sequence tags (ESTs)32 in GenBank (July 15, 2003) were searched
against NCBI build34 of the human genome using BLAT. There were 4,643,853 human
ESTs which met the following criteria: EST length >50 bp, fewer than 50 alignments
in the genome, <=5% substitutions, <=4% insertions, >=80% of the EST aligned. Of
those 4.6 million ESTs, 262,813 (6% of the ESTs) found their best similarity to
chromosome 2 and 139,791 (3% of the ESTs) found their best match to chromosome 4.
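The EST acceptance criteria and the quoted percentages can be expressed as a brief sketch. The function and argument names are illustrative assumptions; the thresholds and counts are those given in the text.

```python
def passes_est_filter(length, n_alignments, pct_substitutions,
                      pct_insertions, pct_aligned):
    """Return True if an EST alignment meets all criteria listed above."""
    return (length > 50
            and n_alignments < 50
            and pct_substitutions <= 5.0
            and pct_insertions <= 4.0
            and pct_aligned >= 80.0)

TOTAL_ESTS = 4_643_853

def percent_of_total(n_ests):
    """Share of all qualifying ESTs, rounded to the nearest whole percent."""
    return round(100 * n_ests / TOTAL_ESTS)
```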
The ESTs were associated with their overlapping known or predicted gene but
were not integrated with mRNA data or with other gene predictions to produce new gene
structures because of the ambiguity inherent in the data, i.e. it is impossible to be certain
which transcript/isoform a given EST is associated with. However, the ESTs were
clustered with the mRNA data to provide overall estimates for alternative splicing (see
above).
Homology-based prediction of predicted protein coding genes and pseudogenes
GENEWISE33
Pseudogenes and predicted genes were predicted using similarity to known proteins
following the protocol originally developed for the annotation of mammalian
pseudogenes5,34,35. In brief, candidate (pseudo)genic regions were first identified in
chromosome 2 and 4 by comparing all DNA fragments between common repeats with a
non-redundant protein database (RefSeq + SwissProt36 + trEMBL36) using BLASTx29
(Evalue cutoff = 0.001). In order to find and delimit regions corresponding to
independent genes or pseudogenes, the intervals containing all DNA regions in the same
chromosome that had common matching known proteins were reanalyzed separately with
BLAST2GENE37. Regions with repetitive, viral, coiled coil and LCR fragments were
also filtered out of the candidate set. Finally, to obtain the corresponding (pseudo)gene
structure and information about in-frame truncations present in pseudogenes, each of
these regions was further compared against the closest protein sequence using
GENEWISE33 (Score > 20). This procedure led us to the identification of an initial set of
2945 chromosome 2 and 1738 chromosome 4 candidate (pseudo)genes. In parallel, we
also applied the same similarity search through the remaining human and the complete
mouse genomes, which led to the identification of 43,247 (including chromosomes 2 and
4) and 43,715 (pseudo)genes for each of the organisms respectively. Fragmentation was
removed from the GENEWISE set using manual inspection of cases where multiple
neighboring GENEWISE predictions had similarities to multiple regions of the same
protein.
As human-mouse orthology relations are expected to be detectable only for
functional, and not for neutrally evolving, regions, we have used this basic criterion to
differentiate between genes and pseudogenes. For this analysis, to identify the
orthologous regions of mouse (Mm3) and human (hg16/build34) and thus the mouse and
human (pseudo)gene sets we have applied a two step approach, based on sequence
comparison and evaluation of genomic position. First we have compared with BLASTP
the translations of all human and mouse predicted regions (stop codons and frameshifts
were circumvented for this comparison) and extracted up to 23,000 best reciprocal
mouse-human hits that we take as candidate orthologous pairs. We accepted as reliable
ortholog pairs only best reciprocal hits that had other neighboring best reciprocal hits in
both organisms. With this filter we excluded those pairs detected where recently formed
processed pseudogenes (almost identical to their parental gene) were likely to be involved
(the majority of these pairs were formed by a single exon in one organism and an intron
containing prediction in the other). By fixing the accepted pair of orthologs, we compared
with BLASTp the unassigned human predictions in between with the corresponding
unassigned mouse predictions in the orthologous region.
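The first step of this approach, best reciprocal hits filtered by neighbouring support, can be sketched as follows. The data layout is hypothetical, and "neighbouring" is taken here to mean within `max_dist` ranks on the same chromosome, which is an assumption since the text does not define the distance used. The quadratic scan is kept for clarity and would need indexing for tens of thousands of pairs.

```python
def reciprocal_best_hits(best_h2m, best_m2h):
    """Return (human, mouse) pairs that are each other's best hit."""
    return [(h, m) for h, m in best_h2m.items() if best_m2h.get(m) == h]

def with_neighbor_support(pairs, human_order, mouse_order, max_dist=1):
    """Keep pairs that have another reciprocal pair nearby in both genomes.

    human_order / mouse_order map a prediction id to (chromosome, rank),
    where rank is the position of the prediction along the chromosome.
    """
    def near(a, b, order):
        (chrom_a, rank_a), (chrom_b, rank_b) = order[a], order[b]
        return chrom_a == chrom_b and abs(rank_a - rank_b) <= max_dist

    kept = []
    for h, m in pairs:
        supported = any(
            (h2, m2) != (h, m)
            and near(h, h2, human_order)
            and near(m, m2, mouse_order)
            for h2, m2 in pairs
        )
        if supported:
            kept.append((h, m))
    return kept
```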
KA/KS analysis was performed using the PAML package38. We inferred the
ancestral sequence of each of the target sequences (A) with the PAMP program using a
protein-based DNA multiple alignment (CLUSTALW39).
Because Ka/Ks values normally cannot be used directly to distinguish genes from
pseudogenes, we evaluated the pattern of evolution, and hence the functionality, of a
given candidate sequence set by comparing its distribution of Ka/Ks values with
benchmark distributions obtained from reliable functional and pseudogenic regions. The
associated error (calculated to be 3%) comes from an internal cross-validation
performed with benchmark distributions and reflects its discriminative power (see 5,34).
TWINSCAN40-42
The TWINSCAN (http://www.genes.cs.wustl.edu) program is designed to analyze
eukaryotic genomic sequences by simultaneously modeling both gene structure and
evolutionary conservation. Here, genes were predicted for human chromosomes 2 and 4
using the entire mouse genome sequence41 as the informant using TWINSCAN version
1.2 available from the authors' web site. The resulting set for chromosome 2 was
composed of 1925 predictions with an average of 8.85 exons per gene. The chromosome
4 set consisted of 1172 predictions with 7.65 exons per gene. These sets were then
filtered for inclusion in the final gene set.
Initially, each TWINSCAN prediction was compared to a mouse protein database
using WU-BLASTP and the following parameters:
M=BLOSUM80 Q=12 R=4 topcomboN=1 -sort_by_totalscore -sort_by_subjectlength
E=0.01 hspmax=0 B=1e9 V=1e9 Y=1e7 Z=1e7 restest
The mouse protein database was composed of mouse ENSEMBL predictions43 (6 May
2003, v. 18.30.1), and TWINSCAN predictions for all of mouse (v1.2 against human
build 34/hg16 available from the authors' web site). Similarities within a P-value
exponent range of 9 from the most significant similarity were examined to find an
unprocessed (multi-exon) mouse prediction (preference given to those similarities in the
orthologous region of mouse as defined by the mouse/human orthology maps available
from http://www.genome.ucsc.edu; on human chromosome 2 and 4 there are 48 and 32
identifiable segments sharing the same order of highly conserved sequences in the two
species). This information was also used to indicate processed pseudogenes when single
exon gene predictions found similarities with multi-exon gene predictions in mouse, and
the aligned region in the human prediction spanned an intron in the multi-exon mouse
gene prediction.
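The processed-pseudogene indication described above can be sketched as a simple test. The representation of the mouse gene (a list of protein coordinates at which an intron interrupts the coding sequence) and the function names are assumptions made for illustration.

```python
def spans_intron(aligned_start, aligned_end, intron_positions):
    """True if any intron position lies strictly inside the aligned region.

    Coordinates are protein positions in the multi-exon mouse prediction.
    """
    return any(aligned_start < pos < aligned_end for pos in intron_positions)

def looks_processed(human_exon_count, mouse_exon_count,
                    aligned_start, aligned_end, intron_positions):
    """Flag a single-exon human prediction aligned across a mouse intron."""
    return (human_exon_count == 1
            and mouse_exon_count > 1
            and spans_intron(aligned_start, aligned_end, intron_positions))
```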
Because of differing fragmentation in the human and mouse predicted gene sets, the
aligned region of the mouse protein (rather than the entire protein) was then compared
back to a human protein database also using WU-BLASTP and the parameters listed
above for the human to mouse comparison. In summary, we reduced the number of false
positives and pseudogenes by requiring that the predicted genes have a highly significant
match in the mouse gene set in the orthologous region of mouse where assigned, and in
turn that the matching mouse gene have as its best match the original chromosome 2 or 4
predicted gene. In detail, the human protein database was composed of the human
ENSEMBL predictions (NCBI build34/hg16, 4 Nov 2003 v 18.34.1) and human
chromosome 2 and 4 known and predicted genes. Each similarity
within a p-value exponent of 9 of the best hit was examined to look for a protein which
overlapped the original human chromosome 2/4 gene prediction. If one was found, it was
defined as reciprocal or reciprocal best. If there was not a similarity within a p-value
exponent of 9 of the best similarity, but there was a similarity at >e-49 (or more
significant than 1e-29 if the best hit was less than 1e-49), that original human
chromosome 2/4 prediction was defined as "reciprocal friend". The categories of
reciprocal best and reciprocal friend were further subdivided into those where the aligned
regions contained an intron versus those which did not. While these “reciprocal friends”
have some likelihood of being unprocessed pseudogenes, information derived from the
search of these proteins against the rest of the human genome and Ka/Ks information was
used to estimate the likelihood that each of these are pseudogenes. Comparisons of the
predicted proteins to mouse protein databases were also used to identify processed
pseudogenes (single exon genes whose similarity spanned an intron in a multi-exon
human gene) and to identify possible duplication events and gene families. Single exon
genes with similarities to multi-exon mouse genes, where the region of similarity spanned
an intron in the mouse gene, were identified as possible pseudogenes. Each predicted
gene was also compared to a database of human proteins to aid in identifying processed
pseudogenes and to identify the closest relative of each protein to aid in identification of
unprocessed pseudogenes. Therefore, single exon genes with matches to multi-exon
genes in either the human or mouse genomes, predictions showing signs of nonfunctionality, and those that produced L1/reverse transcriptase were also removed from
the set.
After categorization of each of the human chromosome 2/4 predictions, those
predictions were filtered to form a non-redundant set in a method similar to that used for
human chromosome 7. First, all known genes were considered as absolute; only identical
versions of known genes and identical subsets of known genes were removed. From this
base, all predictions which overlapped known genes were removed. If there were no
known genes in a region, a hierarchical method for removing predictions was used. Any
GENEWISE predicted protein coding gene which did not overlap a known gene was kept
as the next set of “reliable” predictions. Finally, any “reciprocal best” TWINSCAN
prediction was kept that did not overlap either the known gene or GENEWISE sets.
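The hierarchy described above (known genes, then GENEWISE predictions, then "reciprocal best" TWINSCAN predictions) can be sketched as follows. Intervals are hypothetical (chromosome, start, end) tuples, strand is ignored and any shared base counts as an overlap; these are simplifying assumptions.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) intervals share at least one base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def build_gene_set(known, genewise, twinscan_reciprocal_best):
    """Combine the three tiers, dropping predictions that overlap a higher tier."""
    final = list(known)
    for tier in (genewise, twinscan_reciprocal_best):
        # Each tier is compared only against the tiers accepted before it.
        accepted = [p for p in tier
                    if not any(overlaps(p, kept) for kept in final)]
        final.extend(accepted)
    return final
```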
Annotation of RANBP2 region
We are particularly interested in defining the evolutionary mechanism of the RanBP2
gene cluster, located in the chromosome 2 region (87050000-113380000). We
manually reviewed annotation of the genes surrounding each of the cluster forms. In
doing this, we defined two main regions of interest: 1- chr2:87050000-88235000
(region1), 2- chr2:106600000-113380000 (region2). At this point, we set up a procedure
to proceed with manual annotation: 1) take the known gene set for region1/region2. In
case of overlapping matches, take the best in coverage, with preference to SwissProt; 2)
Blast2seq against region1/region2 using the protein sequence; 3) Manually verify each
match with E<0.001 and annotate in region1/region2 only the reliable matches; 4) map
ESTs for each predicted gene.
Of course, applying this procedure we obtained matches involving genes but also
some involving fragments. All of these were checked for overlap with the pseudogene
sets and/or merged with the existing gene set.
Protein Similarity Searching
After creation of the chromosome 2 and 4 gene sets, the predicted amino acid
sequences were each compared to the August 22, 2004, NCBI non-redundant protein
database using WUBLAST and the following parameters:
topcomboN=1 -sort_by_totalscore -sort_by_subjectlength E=1e-6 hspmax=0 B=1e9
V=1e9
At this threshold (E=1e-06), 98% of the final chromosome 2 and 4 gene sets found some
similarity in the non-redundant protein database.
Comparative Data
Protein sets for Tetraodon nigroviridis44 (assembly version 7; annotation release 25.2.2,
Feb 9, 2004; Genoscope data available via
http://www.genoscope.cns.fr/externe/tetraodon/Ressource.html), Fugu rubripes45,46
(Fugu build 2c; annotation release 25.2c.1 from July 5, 2004;
http://fugu.hgmp.mrc.ac.uk), and Danio rerio (sequencing data were generated by the
Zebrafish Sequencing Group at the Sanger Centre and data are available from
http://www.sanger.ac.uk/Projects/D_rerio ; assembly release Zv4: annotation release
25.4.1 from Feb 9, 2004) were downloaded from http://www.ensembl.org. The Ciona
intestinalis47 protein set (release 1.0 April 2002) was downloaded from
http://genome.jbgi-psf.org/ciona/ . Each human chr 2/4 protein was compared to the
relevant databases using BLASTP with the following settings (topcomboN=1 E=1e-04
hspmax=0 -filter=seg+xnu B=1e9 V=1e9).
Interpro
Interpro48 (http://www.ebi.ac.uk/interpro) combines sequence-pattern information from
available databases and integrates motif and domain assignments into a hierarchical
classification system. Comparisons were with Interpro 8.0 using Interproscan49 v3.3.
In the Interpro analyses, 994 (74%) and 584 (73%) of the chromosome 2 and chromosome 4 proteins, respectively, can be classified
at a threshold of E=0.01. Of those, 662 on chromosome 2 (67%) and 395 on chromosome
4 (68%) are multi-domain. The Pfam analyses for Table 2 involved only those matches
meeting the threshold of E=0.01.
Segmental Duplication Analysis
We used a BLAST-based detection scheme53 to identify all pairwise similarities
representing duplicated regions (≥1 kb and ≥90% identity) within the finished sequence
of chromosome 2 and chromosome 4 and compared to all other chromosomes in the
NCBI genome assembly (build 34). Highly identical duplications were confirmed by a
second detection strategy which assays for excess random-read coverage across the
genome13. While there was, in general, very good correspondence between these two
methods, one ~200 kb region (chr4:191166956-191387833) was identified by whole-genome
shotgun sequence detection that could not be confirmed by whole-genome analysis of the
human genome (build34). The program Parasight (J. Bailey, unpublished) was used to
generate images of pairwise alignments. We also analyzed pairwise alignments for
percent identity and the number of aligned bases. Satellite repeats were detected using
RepeatMasker (version: 2002/05/15) on slow settings (A Smit and P Green,
unpublished).
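The length and identity thresholds used to define duplicated regions can be applied to a table of pairwise alignments as in the sketch below. The record keys are hypothetical, and the BLAST-based detection itself is not reproduced.

```python
def classify_duplications(alignments, min_length=1000, min_identity=90.0):
    """Split qualifying pairwise alignments into intra- and interchromosomal.

    Each alignment is a dict with hypothetical keys 'chrom_a', 'chrom_b',
    'length' (aligned bases) and 'identity' (percent).
    """
    result = {"intrachromosomal": [], "interchromosomal": []}
    for aln in alignments:
        if aln["length"] >= min_length and aln["identity"] >= min_identity:
            same = aln["chrom_a"] == aln["chrom_b"]
            key = "intrachromosomal" if same else "interchromosomal"
            result[key].append(aln)
    return result
```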
References
1.
The International Human Genome Sequencing Consortium. Initial sequencing and
analysis of the human genome. Nature 409, 860-921 (2001).
2.
The International Human Genome Mapping Consortium. A physical map of the human
genome. Nature 409, 934-941 (2001).
3.
Felsenfeld, A., Peterson, J., Schloss, J. & Guyer, M. Assessing the quality of the DNA
sequence from the Human Genome Project. Genome Res. 9, 1-4 (1999).
4.
Riethman, H. C. et al. Integration of telomere sequences with the draft human genome
sequence. Nature 409, 948-951 (2001).
5.
Hillier, L. W. et al. The DNA sequence of human chromosome 7. Nature 424, 157-64
(2003).
6.
The International HapMap Consortium. The International HapMap Project. Nature 426,
789-796 (2003).
7.
Benson, D. A., et al. GenBank: update. Nucleic Acids Res. 32, D23-6 (2004).
8.
Sachidanandam, R., et al. A map of human genome sequence variation containing 1.42
million single nucleotide polymorphisms. Nature 409, 928-933 (2001).
9.
Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI Reference Sequence project: update
and current status. Nucleic Acids Res 31, 34-7 (2003).
10.
Strausberg, R. L. et al. Generation and initial analysis of more than 15,000 full-length
human and mouse cDNA sequences. Proc Natl Acad Sci U S A 99, 16899-903 (2002).
11.
Wheeler, D. L. et al. Database resources of the National Center for Biotechnology
Information: update. Nucleic Acids Res 32, D35-40 (2004).
12.
Ohno, S., Wolf, U. & Atkin, N. B. Evolution from fish to mammals by gene duplication.
Hereditas 59, 169-87 (1968).
13.
Bailey, J. A. et al. Human-specific duplication and mosaic transcripts: the recent
paralogous structure of chromosome 22. Am J Hum Genet 70, 83-100 (2002).
14.
Osoegawa, K., et al. A Bacterial Artificial Chromosome library for sequencing the
complete human genome. Genome Res. 11, 483-496 (2001).
15.
Osoegawa, K., et al. An improved approach for construction of Bacterial Artificial
Chromosome libraries. Genomics 52, 1-8 (1998).
16.
Kim, U. J., et al. Construction and characterization of a human bacterial artificial
chromosome library. Genomics 34, 213-218 (1996).
17.
Marra, M. A., et al. A map for sequence analysis of the Arabidopsis thaliana ge. Nature
Genet. 22, 265-270 (1999).
18.
Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer
traces using phred. I. Accuracy assessment. Genome Res 8, 175-185 (1998).
19.
Kent, W. J. et al. The human genome browser at UCSC. Genome Res 12, 996-1006
(2002).
20.
Schuler, G. D. Sequence mapping by electronic PCR. Genome Res 7, 541-550 (1997).
21.
Kong, A. et al. A high-resolution recombination map of the human genome. Nat Genet
31, 241-247 (2002).
22.
Gabriels, J. et al. Nucleotide sequence of the partially deleted D4Z4 locus in a patient
with FSHD identifies a putative gene within each 3.3 kb element. Gene 236, 232-235
(1999).
23.
Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J. Mol. Biol.
196, 261-282 (1987).
24.
Antequera, F. & Bird, A. Number of CpG islands and genes in human and mouse. Proc.
Natl. Acad. Sci. USA 90, 11995-11999 (1993).
25.
Pruitt, K. D. & Maglott, D. R. RefSeq and LocusLink: NCBI gene-centered resources.
Nucleic Acids Res. 29, 137-140 (2001).
26.
Mammalian Gene Collection Program Team. Generation and initial analysis of more than
15,000 full-length human and mouse cDNA sequences. Proc. Natl. Acad. Sci. USA 99,
16899-16903 (2002).
27.
Wheelan, S. J., Church, D. M. & Ostell, J. M. Spidey: a tool for mRNA-to-genomic
alignments. Genome Res 11, 1952-1957 (2001).
28.
Mott, R. EST_GENOME: a program to align spliced DNA sequences to unspliced
genomic DNA. Comput. Appl. Biosci. 13, 477-478 (1997).
29.
Ning, Z., Cox, A. J. & Mullikin, J. C. SSAHA: a fast search method for large DNA
databases. Genome Res 11, 1725-1729 (2001).
30.
Ding, L. et al. EAnnot: a genome annotation tool using experimental evidence. Genome
Res 14, 2503-2509 (2004).
31.
Usuka, J., Zhu, W. & Brendel, V. Optimal spliced alignment of homologous cDNA to a
genomic DNA template. Bioinformatics 16, 203-211 (2000).
32.
Boguski, M. S., Lowe, T. M. & Tolstoshev, C. M. dbEST--database for "expressed
sequence tags". Nature Genet. 4, 332-333 (1993).
33.
Birney, E. & Durbin, R. Using GeneWise in the Drosophila annotation experiment.
Genome Res. 10, 547-548 (2000).
34.
Torrents, D., Suyama, M., Zdobnov, E. & Bork, P. A genome-wide survey of human
pseudogenes. Genome Res 13, 2559-67 (2003).
35.
RGSP-Consortium. Genome sequence of the Brown Norway rat yields insights into
mammalian evolution. Nature 428, 493-521 (2004).
36.
O'Donovan, C., et al. High-quality protein knowledge resource: SWISS-PROT and
TrEMBL. Brief Bioinform. 3, 275-284 (2002).
37.
Suyama, M., Torrents, D. & Bork, P. BLAST2GENE: a comprehensive conversion of
BLAST output into independent genes and gene fragments. Bioinformatics, Epub ahead
of print (2004).
38.
Yang, Z. PAML: a program package for phylogenetic analysis by maximum likelihood.
Comput. Appl. Biosci. 13, 555-556 (1997).
39.
Thompson, J. D., Higgins, D. G. & Gibson, T. J. CLUSTAL W: improving the sensitivity
of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight choice. Nucleic Acids Res. 22, 4673-4680 (1994).
40.
Flicek, P., Keibler, E., Hu, P., Korf, I. & Brent, M. R. Leveraging the mouse genome for
gene prediction in human: From whole-genome shotgun reads to a global synteny map.
Genome Res. 13, 46-54 (2003).
41.
Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of
the mouse genome. Nature 420, 520-562 (2002).
42.
Korf, I., Flicek, P., Duan, D. & Brent, M. R. Integrating genomic homology into gene
structure prediction. Bioinformatics 17, S1:140-148 (2001).
43.
Hubbard, T., et al. The Ensembl genome database project. Nucleic Acids Res. 30, 38-41
(2002).
44.
Roest-Crollius, H. et al. Estimate of human gene number provided by genome-wide
analysis using Tetraodon nigroviridis DNA sequence. Nature Genet. 25, 235-238 (2000).
45.
Elgar, G., et al. Generation and analysis of 25Mb of genomic DNA from the pufferfish
Fugu rubripes by sequence scanning. Genome Res. 9, 960-971 (1999).
46.
Aparicio, S. et al. Whole-genome shotgun assembly and analysis of the genome of Fugu
rubripes. Science 297, 1301-1310 (2002).
47.
Dehal, P. et al. The draft genome of Ciona intestinalis: insights into chordate and
vertebrate origins. Science 298, 2157-2167 (2002).
48.
Mulder, N. J., et al. InterPro: an integrated documentation resource for protein families,
domains and functional sites. Brief Bioinform. 3, 225-235 (2002).
49.
Zdobnov, E. M. & Apweiler, R. InterProScan--an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847-848 (2001).
50.
Schwartz, S., et al. Human-Mouse alignments with BLASTZ. Genome Res. 13, 103-107
(2003).
51.
Kolbe, D. et al. Regulatory potential scores from genome-wide three-way alignments of
human, mouse, and rat. Genome Res 14, 700-707 (2004).
52.
Chang, C. C. & Lin, C. J. LIBSVM: a library for support vector machines (2001).
Software available at http://www.csie.ntu.edu.tw/~cjilin/libsvm/
53.
Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental
duplications: organization and impact within the current human genome project
assembly. Genome Res 11, 1005-17 (2001).
54.
Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the
human genome. Nature 431, 931-945 (2004).
Supplemental Legends
Supplementary Figure 1 Comparison of mapped positions of deCODE STSs
and their location within the (a) chromosome 2 and (b) chromosome 4
sequences. The apparent break at 90 Mb (a) and 50 Mb (b) reflects the position
of the centromere.
Supplementary Figure 2 Sequence identity distribution of segmental
duplications. For all pairwise alignments, the total number of aligned bases was
calculated and binned based on percent sequence identity. Sequence identity
distributions for interchromosomally (red) and intrachromosomally (blue)
duplicated bases are shown. Assuming a clock-like rate for nuclear DNA
substitution, percent similarity is linear with evolutionary time providing a
surrogate for the age of the duplication or gene conversion event.
Supplementary Figure 3 Pattern of recent segmental duplications on
chromosome 2 and chromosome 4. Large (>10 kb) highly-similar (>95%)
intrachromosomal (blue) and interchromosomal (red) segmental duplications are
shown for chromosome 2 (a) and chromosome 4 (b). Chromosome 2 (a) and
chromosome 4 (b) are drawn at a greater scale with respect to other human
chromosomes. The ancestral telomere-telomere fusion region (A) and the likely
ancestral centromere region (B) are shown for chromosome 2. A 750 kb tandem
intrachromosomal duplication cluster in chromosome 4 (C) demarcates a
tandem cluster of microsomal UDP-glucuronosyltransferase genes. Two
subtelomeric duplication clusters (D and E) are displayed in chromosome 4: (D)
interstitial 1.1 Mb region of subtelomeric duplication identified within 4q26; (E)
sixteen duplications ranging in size from 10-20 kb distributed among nine
subtelomeric regions (chromosomes 1, 2, 5-8, 10, 16, 19 and 20).
Supplementary Figures 4 and 5 Location and % identity of segmental
duplications on chromosome 2 (Supplementary Figure 4) and 4 (Supplementary
Figure 5). Interchromosomal (red bars) and intrachromosomal (blue bars) duplications are
represented along the horizontal line (0.2 Mb increments). The % identity of
each pairwise alignment is shown along the vertical axis with colors representing
different chromosomes (color key shown at the bottom of each figure). Black bars
above the horizontal line represent alpha satellite sequences. Light green bars
above the horizontal line represent regions detected by whole-genome shotgun
sequence detection strategy (WSSD). A ~200 kb duplication block near the
FSHD (facioscapulohumeral muscular dystrophy) region of 4q35
(chr4:191166956-191387833) was detected by WSSD but not by the whole-genome analysis method. The ancestral telomere-telomere fusion region (A) and
the likely ancestral centromere region (B) are shown for chromosome 2. A 750 kb
tandem intrachromosomal duplication cluster in chromosome 4 (C) demarcates a
tandem cluster of microsomal UDP-glucuronosyltransferase genes. Two
subtelomeric duplication clusters (D and E) are displayed in chromosome 4.
Supplementary Table 1 Contiguous sequence accessions and lengths for
chromosome 2 and 4 sequences. Estimated gap sizes are also included as
estimated by FISH and by orthologous mouse and rat sequences.
Supplementary Table 2 Brief summary of repetitive content of chromosomes 2
and 4.
Supplementary Table 3 Recombination rates for human chromosomes 2 and 4
based on deCODE and Genethon genetic maps.
Supplementary Table 4 Known mRNAs that had no match at all to the mouse or
rat genomes or their protein sets. The chromosome, gi identifier for the mRNA,
number of coding exons, amino acid length, interpro domains, comments, and
information on gi’s for which we obtained sequences for the mRNA in multiple
primates.
Supplementary Table 5 Additional details to support Table 1 in main text related
to polymorphic mRNAs detected on human chromosomes 2 and 4.
Supplementary Table 6 Fraction of duplicated basepairs. The percentages of
non-redundant duplications (>90% sequence identity; >1 kb in length) were
based on the total genome size 2,865,069,170, chromosome 2 size 237,541,603
and chromosome 4 size 186,841,959.
Supplementary Table 7 Accessions identifying regions containing highly
polymorphic overlap regions. The chromosome and starting and ending
positions of those highly polymorphic overlap regions are provided for human
genome build35/hg17. General observations of sequence features in the region
including gene content are also provided.
Supplementary Table 8 Positions of possible human deletions based on fosmid-end placements.
Supplementary Table 9 Polymorphisms detected from fosmid paired-end
placements and verified by fosmid clone-based sequencing. The Broad Institute
library fosmid name, fosmid clone accession, and a description of each
polymorphism are provided.
Supplementary Table 10 Distribution of non-redundant known genes and
spliced ESTs within duplicated regions of chromosomes 2 and 4. The non-redundant transcript set was screened for confirmatory transcriptional support by
best genomic placement of EST. The spliced EST and transcripts required at
least two exons and evidence of splicing (two or more exons). In addition, each
gene feature was binned as duplicated if at least 50 bp overlapped a duplicated
region. Thus, exons less than 50bp were not included in this analysis. Known
gene + EST is composed of non-redundant known gene and spliced EST
transcription. Duplicated known genes and their coordinates are listed.