2005-10-11925A
SUPPLEMENTAL INFORMATION
Interspersed repeats
The repeat content of HSA11 consists of 21.1% LINEs, 13.3% SINEs, 8.1% LTR
elements, 2.7% DNA elements, with the remaining 2.7% composed of small RNAs,
satellite repeats, simple repeats, low complexity sequence and VNTRs (Table S5).
The percentage of Alu repeats (9.42%) is slightly lower than the genome average of
10.6%; all other elements are very close to the genome averages.
The Tandem Repeat Finder (TRF) program1 is often used to detect tandem repeats
larger than six bp; however, the bulk of the results overlap with those from the
RepeatMasker program (A.F.A. Smit, R. Hubley & P. Green; http://repeatmasker.org).
In the case of HSA11, out of nearly 3 Mb predicted by TRF, 83% overlap with
RepeatMasker as follows: satellite repeats - 36.87%, simple repeats - 20.94%, SINEs -
10.11%, LINEs - 4.52%, LTR elements - 4.00%, low complexity - 3.28%,
unclassified repeats - 2.74%, and DNA elements - 0.58%. The remaining 17% (511
Kb) are tandem repeats that were predicted by TRF only.
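As a quick consistency check on the breakdown above, the per-class percentages can be summed. The sketch below is in Python and uses only the figures quoted in the text; the variable names are illustrative and not part of the original analysis.

```python
# Share of TRF-predicted bases overlapping each RepeatMasker class (percent),
# copied from the text above.
trf_overlap_pct = {
    "satellite repeats": 36.87,
    "simple repeats": 20.94,
    "SINEs": 10.11,
    "LINEs": 4.52,
    "LTR elements": 4.00,
    "low complexity": 3.28,
    "unclassified repeats": 2.74,
    "DNA elements": 0.58,
}

# Total overlap with RepeatMasker and the remainder predicted by TRF only.
overlap_total = round(sum(trf_overlap_pct.values()), 2)
trf_only = round(100.0 - overlap_total, 2)
print(f"overlap with RepeatMasker: {overlap_total}%  TRF only: {trf_only}%")
```

The classes sum to roughly 83%, leaving about 17% for TRF-only predictions, in line with the rounded values given in the text.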
Gene catalog
The 1524 protein-coding genes have an average exon length of 301 bp. The largest
number of exons, at least 64, was found in the ATM gene, which encodes a
phosphatidylinositol-3 kinase that, when mutated, affects DNA damage repair and
can lead to the autosomal recessive disorder ataxia telangiectasia. The longest intron,
just over 589 Kb, was found in the OPCML gene, which encodes a protein that binds
opioid alkaloids in the presence of acidic lipids. The longest gene on HSA11 is DLG2,
a membrane-associated guanylate kinase family member, which spans nearly 2.17 Mb
over 28 exons and encodes an 870 amino acid protein.
Overlapping genes
We found many cases of internal genes (genes within genes) and overlapping genes
(neighboring genes that overlap by at least one base pair, usually in their untranslated
5’- and 3’-ends, and are most commonly in opposite orientation with one another). In
most cases, the overlaps reported here are supported by multiple mRNA or EST
sequences. In addition to 242 pseudogenes and RNA genes contained within another
gene, we identified 369 protein-coding genes with some form of overlap with at least
one other gene. Of these, 90 genes were found to be completely contained within 42
surrounding genes, 182 genes showed partial overlaps with other genes, 19 genes had
a mixed overlap type (i.e., they contained at least one gene and partially overlapped
another), and 36 genes were involved in co-transcription or read-through (that
is, 12 cases, each involving the joining of a pair of genes). The total genomic overlap
is 1.9 Mb, with the longest being 43.3 Kb for genes BDNF (brain-derived
neurotrophic factor) and BDNFOS (brain-derived neurotrophic factor opposite strand).
BDNFOS is a non-coding RNA in reverse orientation with BDNF and may regulate
the expression of this gene. Both genes have distinct CpG islands at their 5'-ends.
BDNF is induced by cortical neuron activity and is necessary for survival of striatal
neurons in the brain2,3. This gene may also be important in the regulation of stress
response and in the biology of mood disorders4. We emphasize that the above cases
will each require careful analysis to confirm the true boundaries of the genes; it is
possible that some cases of apparent overlap may reflect inaccurate determination of
gene boundaries. The rate of overlapping genes found here, 24%, is similar to
previously published observations5.
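The 24% rate can be reproduced from the counts given above. The sketch below assumes that the 369 overlapping protein-coding genes are the sum of the listed categories, including the 42 surrounding (host) genes; that reading is an assumption, and the dictionary keys are illustrative labels.

```python
# Overlap categories for protein-coding genes, as quoted in the text.
# Assumption: the 42 surrounding (host) genes are counted among the 369.
overlap_counts = {
    "contained within another gene": 90,
    "surrounding (host) genes": 42,
    "partial overlap": 182,
    "mixed overlap": 19,
    "co-transcribed or read-through": 36,
}
total_protein_coding = 1524  # from the gene catalog section

genes_with_overlap = sum(overlap_counts.values())
overlap_rate = genes_with_overlap / total_protein_coding
print(genes_with_overlap, f"{overlap_rate:.1%}")
```

Under this reading the categories add up to 369 genes, or about 24% of the 1524 protein-coding genes.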
Clustered gene families
In addition to the olfactory receptor genes, we identified 142 genes (140 expressed,
two pseudogenes) in 37 clusters with at least two members from the same gene family
(Table S7). The cluster with the most members, 13, is from the MS4A family,
proteins with at least 4 potential transmembrane domains and N- and C-terminal
cytoplasmic domains. The apolipoprotein gene cluster on 11q23 consists of four
members which are components of high-density lipoprotein. These genes are strongly
associated with plasma triglyceride levels and are major risk factors for various
disorders including coronary artery disease, hypertriglyceridemia, and systemic
non-neuropathic amyloidosis. They have also been associated with blood glucose, plasma
lipoprotein levels, total cholesterol, and triglycerides in a gender-specific manner. The
beta hemoglobin gene cluster on 11p15.4 consists of five members, in 5' to 3' order
HBE1 (epsilon 1), HBG2 (gamma G), HBG1 (gamma A), HBD (delta), and HBB
(beta). Along with the alpha globin gene cluster located on chromosome 16, which
also contains five members, these loci determine the structure of the 2 types of
polypeptide chains in adult hemoglobin, Hb A. The normal adult hemoglobin tetramer
consists of two alpha chains and two beta chains. Mutations in beta-globin can lead to
a number of disorders including sickle cell anemia, beta thalassemia, and
erythrocytosis. In addition, there are two ultrahigh-sulfur keratin-associated protein
(KRTAP5) gene clusters at 11p15.5 (6 members) and 11q13.4 (7 members) which
resulted from an intrachromosomal duplication on HSA11 (ref. 6). These genes show
preferential expression in human hair root, suggesting they are required for hair
formation. All of the KRTAP5 genes are highly conserved in chimpanzee, but
interestingly, in mouse they are part of two non-adjacent, significantly larger blocks
of synteny to chromosome 7; however, there is only one KRTAP cluster on mouse
chromosome 7. The two human clusters lie adjacent to synteny breakpoints which, in
mouse, are found in proximity to the single cluster.
Pseudogenes and RNAs
We annotated 765 pseudogenes, including 205 olfactory receptor pseudogenes, 558
non-olfactory receptor pseudogenes, and two tRNA pseudogenes. Most of the
non-olfactory receptor pseudogenes identified in this report were derived from the
"Retroposed Genes" UCSC genome browser track (Supplemental Methods). The
average pseudogene spans about 1.2 Kb, and all but three are currently annotated as
processed pseudogenes. The 203 olfactory receptor pseudogenes are the exception and
have arisen by duplications of large chromosomal domains followed by extensive
gene duplication and divergence. 204 of the pseudogenes are internal to other
expressed genes. TRIM5, a known gene, which spans over 275 Kb, contains six
olfactory receptor pseudogenes, the TRIMP1-2 pseudogene, nine expressed olfactory
receptor genes, and the TRIM22 gene.
Of the 60 RNA genes, 38 are internal to other genes, with the most extreme cases
being eight small nucleolar RNAs which are internal to the novel transcript predicted
from accession AK095849, and nine (eight small nucleolar, one small Cajal
body-specific) RNAs which are internal to the novel CDS KIAA1731. Eight predicted
tRNAs are located in a small cluster around 59.08 Mb.
Gene deserts
According to the criteria of Ovcharenko et al.7, we annotated 19 gene deserts greater
than 651 Kb, the longest being over 3.4 Mb (36,652,967-40,088,070 bp) flanked by
genes RAG1 and AK127441, a novel transcript (Table S17). In total the 19 gene
deserts, five of which are less than 100 Kb apart from one another, account for 23.4
Mb of the HSA11 sequence. These gene deserts contain 88 pseudogenes and one
RNA, but there are no annotated expressed genes of any type within them.
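The idea of scanning for gene deserts can be sketched as a search for long intergenic intervals. The Python example below is a minimal illustration, assuming a desert is simply a gap between annotated expressed genes that exceeds a fixed length cutoff; the full criteria of Ovcharenko et al. are not reproduced here, and the gene coordinates are hypothetical values chosen so that the gap matches the longest desert quoted above.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive coordinates


def find_gene_deserts(genes: List[Interval], min_len: int = 651_000) -> List[Interval]:
    """Return intergenic gaps longer than min_len bp between annotated genes."""
    deserts = []
    furthest_end = None  # right-most gene end seen so far
    for start, end in sorted(genes):
        if furthest_end is not None and start - furthest_end - 1 > min_len:
            deserts.append((furthest_end + 1, start - 1))
        furthest_end = end if furthest_end is None else max(furthest_end, end)
    return deserts


# Hypothetical gene coordinates; only the gap between the first two genes
# corresponds to the desert coordinates given in the text.
toy_genes = [
    (36_600_000, 36_652_966),
    (40_088_071, 40_150_000),
    (40_200_000, 40_300_000),
]
for d_start, d_end in find_gene_deserts(toy_genes):
    print(d_start, d_end, d_end - d_start + 1)
```

Tracking the right-most gene end, rather than only the previous gene, keeps nested or overlapping genes from creating spurious gaps.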
In a separate analysis8, we identified two neighboring large, ancient duplications,
which are also conserved in mouse and dog. These duplications, from 22.8 to 24.8 Mb (2
Mb) and from 24.8 to 26.3 Mb (1.5 Mb), completely overlap with two of the gene deserts we
identified here. These intrachromosomal duplications are composed of long
intermittent sequences with similarity as low as 60%, and suggest that some
gene deserts originated from duplications of segments lacking genes in a mammalian
common ancestor.
CpG islands
CpG islands are unmethylated regions of the genome that are associated with the
5'-ends of many house-keeping genes and regulated genes. Of the 1369 calculated CpG
islands9 (Supplemental Methods, Table S11), 895 are associated with expressed genes,
including 781 (58.7%) known genes, 53 (51%) novel CDSs, 60 (27.1%) novel
transcripts, and one (25%) putative gene. 806 of these genes contain CpG islands in
or near their 5'-ends, 23 in their 3'-ends, 60 internally, and six which are completely
encompassed by the CpG island. 291 genes share a CpG island with at least one other
member, which means they may be under the same regulatory control. Of the 157
shared CpG islands, 149 are shared by two genes and eight are shared by three genes.
The longest CpG island on HSA11, which is highly conserved in chimp, dog, mouse,
and rat, is 7,460 bp and is found at the 5'-end of CCND1. CCND1, or cyclin D1, is a
member of the highly conserved cyclin family, whose members are characterized by a
dramatic periodicity in protein abundance throughout the cell cycle. Mutations,
amplification and overexpression of this gene, which alter cell cycle progression, are
observed frequently in a variety of tumors and may contribute to tumorigenesis.
Imprinting
While CpG islands are not normally methylated, there are several cases in the human
genome where methylation occurs on one of the parental alleles, violating the usual
rule of inheritance that both alleles in a heterozygote are equally expressed. This
phenomenon is called genomic imprinting. When a gene is suppressed through
imprinting from one parent, and the allele from the other parent is not expressed
because of mutation, the child will be deficient for that gene. The 11p15.5 region of
HSA11 contains two of the best-studied reciprocally imprinted genes, H19
and IGF2, disruption of which can lead to Beckwith-Wiedemann syndrome, whose
cardinal features are exomphalos, macroglossia, and gigantism in the
neonate. Mutations in several imprinted genes in this region can lead to this syndrome, as well as
other syndromes. Table S18 lists all of the imprinting-related genes on HSA11 as
derived from the OMIM database.
Duplication Analysis
We performed a detailed analysis of duplicated genomic sequence (≥90% sequence
identity and ≥1 kb in length) comparing HSA11 against the May 2004 assembly of the
human genome (Supplemental Methods). We estimated that 4.23% (5.55 Mb) of
HSA11 consists of segmental duplications (Tables S19, S20). Compared to other
finished chromosomes as well as the genome average (5.3%), HSA11 is not enriched
for segmental duplications. Unlike the genome wide distribution, in which the aligned
base pairs of interchromosomal duplications are slightly lower than the
intrachromosomal duplications, duplications in HSA11 are predominantly
interchromosomal: 14.14 Mb out of 17.75 Mb of aligned base pairs and 1399 out of
1667 pairwise alignments are with the non-homologous chromosomes (Fig. S3, Table
S19). While the duplications with higher divergence (>0.08) tended to be short, those
with lower divergence are more scattered in the length distribution (Fig. S4). A
bimodal distribution pattern of sequence identity is observed based on the distribution
pattern of the alignments. The majority of interchromosomal duplication alignments
show 93-95% sequence identity while intrachromosomal duplications show 95-97%
sequence identity.
Segmental duplications are particularly clustered in the subtelomeric and
pericentromeric regions of HSA11p, with the subtelomeric region accounting for
18.3% (305/1667) and the pericentromeric region accounting for 13.6% (226/1667) of
the total alignment (Table S21). This subtelomeric region is clustered with
interchromosomal duplications mostly mapping to the subtelomeric or
pericentromeric regions in other chromosomes. The pericentromeric region on the
HSA11p is mainly clustered with intrachromosomal duplications (Fig. S5). In
addition, the overwhelming majority of the other segmental duplications are clustered
in another 12 blocks (>100 kb and > 5 duplication alignments) (Figs. S6, S7). These
regions contain genes or fragments of genes, such as IFITM, FOLH, ALDH, NOX4,
TRIM49, NAALAD as well as OR family members (Table S22). However, only 41 OR
pseudogenes and 18 intact genes, all but two from class II, overlap with segmentally
duplicated regions, mostly scattered across the chromosome. This suggests that
segmental duplication was not a major factor in the expansion of this large gene
family, at least on HSA11. Copy number polymorphisms and macro insertions, deletions,
and inversions among different human populations have recently been reported10-13.
We observed that at least four of the HSA11 duplication clusters overlap with or are
adjacent to known copy number polymorphism sites, suggesting the clustered
duplications play a role in the generation of these polymorphisms.
Comparative biology
To define further the chromosomal landscape, we performed a comparative analysis
of finished HSA11 versus the draft mouse14, rat15, dog16, chimpanzee17 and chicken18
genomes. Using DNA alignments we constructed a map of conserved synteny
between HSA11 and the other genomes (Fig. S8, Methods). By scanning these regions
for contiguous collinear nucleotide similarity, 36 blocks of conserved synteny larger
than 250 kb were identified between human and mouse, with the longest segment
being 17.4 Mb. Results for the other organisms can be found in Table S23. In the map
of conserved synteny, the chicken seems to lack some of the gene-rich regions.
However, this may simply be due to the exclusion of smaller blocks of conserved
synteny. Indeed, comparative analysis using the tblastx program suggests that many
of the genes in these regions are present in the chicken genome.
Additionally, we identified 6218 conserved non-coding elements (CNEs) by
combining the blastn hits for mammals (mouse, rat and dog), and separately for
mammals plus chicken (942 CNEs) (Table S24). The CNEs defined among mammals only
are fairly evenly distributed along the chromosome,
with a slightly higher density in the region of 90-120 Mb on
HSA11 (Fig. S9). However, the elements that are conserved with chicken show a
more skewed distribution with a higher density of CNEs from 11p-tel to
approximately 18 Mb, and then much lower from there to the centromere. The reasons
for these trends are unclear, but may partially reflect the lack of clearly defined
synteny with chicken.
Out of 7,487 evolutionarily conserved regions (ECRs) with an average length of 112 bp,
covering 841,811 bp across HSA11 (data courtesy of O. Jaillon, Genoscope), 5,964
(79.66%) overlapped potential protein-coding genes and 798 (10.66%) overlapped
pseudogenes. 661 (8.83%) of the ECRs overlapped a CpG island, while only 154
(2.06%) fell within gene desert regions.
Supplemental Methods
Large insert clone sequencing and mapping at RIKEN Genomic Sciences Center
Large-insert BAC and fosmid clone DNA was prepared by the standard alkaline lysis
method (Kurabo PI-1100). Shotgun libraries were constructed from randomly sheared
DNA (1-2 kb) (HydroShear, GeneMachines) cloned into a plasmid vector. The
template DNA was prepared either by PCR amplification of the insert DNA (TaKaRa
Ex Taq, Biometra and ABI GeneAmp PCR System 9700), GenomiPhi amplification
(Amersham Biosciences) or plasmid DNA isolation (Kurabo PI-1100). Cycle
sequencing was performed by BigDye v3.1 chemistry and ABI3700 and ABI3730
sequencers (Applied Biosystems), and by ET chemistry and the MegaBACE1000
sequencer (Amersham Biosciences). Basecalling, quality assessment and assembly
were carried out using the Phred/Phrap software package19,20. Assemblies of clones
sequenced at 8-10 fold redundancy were visualized for finishing with Consed21 and
Sequencher (Gene Codes Corp.). A combination of the following methods was used
to close sequence gaps and resolve low-quality or problematic regions: nested
deletion22, primer walking, PCR, direct sequencing of large-insert BAC and fosmid
clones, and subcloning of BAC clones into fosmid vectors. The average accuracy of
the finished sequence data was estimated to be greater than Phrap 40. Clones were
finished according to the agreed international standard for the human genome
(http://genomeold.wustl.edu/Overview/g16stand.php). There were 41 sequence gaps,
for an estimated total of 13.7 Kb, which could not be resolved by sequencing (Table
S25). For the Human Genome Project (HGP), there were a number of quality control
checks that were performed to ensure the highest quality data and uniformity
throughout the genome. Like all the other human chromosomes, HSA11 was
inspected to make sure that none of the following applied: missing known genes,
missing STSs, contamination, partially present genes, compressions or insertions, and
false clone overlaps.
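The accuracy figure of Phrap 40 quoted above can be read on the standard phred scale, where a quality value Q corresponds to an error probability of 10^(-Q/10). The short Python sketch below illustrates the conversion; the function names are illustrative.

```python
import math


def error_probability(quality: float) -> float:
    """Convert a phred-scaled quality value to an error probability."""
    return 10 ** (-quality / 10)


def phred_quality(error_prob: float) -> float:
    """Convert an error probability to a phred-scaled quality value."""
    return -10 * math.log10(error_prob)


# A quality of 40 corresponds to about one expected error per 10,000 bases.
print(error_probability(40))
print(phred_quality(1e-4))
```

An average accuracy greater than 40 on this scale therefore means fewer than one expected error per 10,000 finished bases.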
One particularly difficult region to finish was the 566 Kb interval from 88.92-89.49
Mb, which consists of several large intra- and inter-chromosomal duplications (Fig.
S10). There is a 350 Kb, near-perfect (99.65%) intrachromosomal duplication in this
region which contains two copies each of the PSMAL and TRIM49 genes, and at least
four retrotransposed processed pseudogenes, making this region also a challenge to
annotate.
As noted in the main text, gap sizes were estimated by fiber-FISH analysis where
possible. Initial size estimates were made roughly, to the nearest 1-10 Kb. In cases
where additional sequence data were incorporated to reduce a gap, the estimated
size of the gap was decreased by the exact amount of new sequence added, which leads
to gap sizes that appear very precise. Fiber-FISH analysis at best
can only give estimates in the 1-10 Kb range.
Large insert clone sequencing at Broad Institute/Whitehead Center for Genome
Research
Subclone libraries of large-insert clones were prepared in M13 or one of several
plasmid subclone vectors, and sequenced with the dideoxy chain termination method
using one of several versions of BigDye chemistry23. Data were detected on several
models of ABI sequencing machines and assembled with Phrap
(http://www.phrap.org) or Arachne24,25. Assemblies were visualized for finishing with
either Gap426 or Consed21. A combination of the following methods was used to close
sequence gaps and resolve low quality regions and misassemblies: transposon
insertion-based sequencing, primer walking, PCR, and shattered insert libraries27.
Finished sequence assemblies of all large insert clones were validated by comparison
to restriction digestion patterns generated by three to five six-cutter enzymes28.
STS markers on genetic maps
This data is from the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway).
Positions of STS markers are determined using both full sequences and primer
information. Full sequences are aligned using blat29, while isPCR (Jim Kent) and
ePCR (http://www.ncbi.nih.gov/sutils/e-pcr/) are used to find locations using primer
information. Both sets of placements are combined to give final positions. In nearly
all cases, full sequence and primer-based locations are in agreement, but in cases of
disagreement, full sequence positions are used. Sequence and primer information for
the markers were obtained from the primary sites for each of the maps, and from
UniSTS (http://www.ncbi.nih.gov/entrez/query.fcgi?db=unists). This track was
designed and implemented by Terry Furey.
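The rule for combining the two sets of placements can be sketched as a merge in which full-sequence positions take precedence. The Python example below is only an illustration of that rule; the marker names and coordinates are hypothetical, and the real track was built with blat, isPCR and ePCR output rather than simple dictionaries.

```python
from typing import Dict, Tuple

Placement = Tuple[str, int, int]  # (chromosome, start, end)


def combine_placements(full_seq: Dict[str, Placement],
                       primer: Dict[str, Placement]) -> Dict[str, Placement]:
    """Merge marker placements, preferring full-sequence positions on conflict."""
    final = dict(primer)    # start from primer-based placements
    final.update(full_seq)  # full-sequence placements override disagreements
    return final


# Hypothetical markers: MARKER_B disagrees between the two methods.
full_seq_hits = {"MARKER_A": ("chr11", 1_000, 1_250), "MARKER_B": ("chr11", 5_000, 5_300)}
primer_hits = {"MARKER_B": ("chr11", 7_000, 7_300), "MARKER_C": ("chr11", 9_000, 9_150)}
print(combine_placements(full_seq_hits, primer_hits))
```

Markers placed by only one method keep that placement, while a disagreement such as the one for the hypothetical MARKER_B resolves to the full-sequence position.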
Construction of gene catalog
Alignments of all available (as of August 2005) human RefSeq30 and GenBank31
messenger RNA sequences to the finished sequence were derived from the UCSC
Genome Browser according to their methodology (http://genome.ucsc.edu/index.html).
Data from the following tracks were inspected manually to ensure accurate
transcriptional start and stop sites, and to correct splice sites: Known Genes, RefSeq
Genes, Human mRNAs, Ensembl Genes, CCDS, Retroposed Genes, and sno/miRNA.
In addition, data from these tracks was reviewed: Human ESTs, Other RefSeq, Other
mRNAs, and Other ESTs. Non-canonical splice sites were used only if supported by
sufficient complementary DNA-based evidence. Partial transcripts (those containing a
partial open reading frame) were annotated in cases for which there was firm evidence
of their existence. All gene models (Table S6) were created manually using these
aligned sequences as evidence, following HAWK2
(www.sanger.ac.uk/Info/workshops/hawk2) transcript type conventions. Evidence
was given relative priority as follows (high–low): RefSeq, other mRNAs, spliced
ESTs, unspliced ESTs, non-human orthologous mRNAs. When there was more than
one variant for a gene, we selected the longest genomic transcript as the representative
model. Gene symbols for biologically characterized loci were assigned by the HUGO
Gene Nomenclature Committee (http://www.gene.ucl.ac.uk/nomenclature/). Our
annotations will be made available to the Vertebrate Genome Annotation database
(VEGA, http://vega.sanger.ac.uk/Homo_sapiens).
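The selection of a representative model and the evidence ranking described above can be sketched in a few lines of Python. This is an illustration only: the annotation itself was manual, the Transcript class and its fields are hypothetical, and how the priority list was applied in practice is not specified beyond the order given in the text.

```python
from dataclasses import dataclass
from typing import List

# Evidence types in the priority order given in the text (high to low).
EVIDENCE_PRIORITY = [
    "RefSeq",
    "other mRNA",
    "spliced EST",
    "unspliced EST",
    "non-human orthologous mRNA",
]


@dataclass
class Transcript:
    name: str
    start: int      # genomic start, 1-based inclusive
    end: int        # genomic end, inclusive
    evidence: str   # one of EVIDENCE_PRIORITY

    @property
    def genomic_length(self) -> int:
        return self.end - self.start + 1


def representative_model(variants: List[Transcript]) -> Transcript:
    """Pick the variant with the longest genomic span as the representative."""
    return max(variants, key=lambda t: t.genomic_length)


def rank_by_evidence(variants: List[Transcript]) -> List[Transcript]:
    """Order variants from highest- to lowest-priority supporting evidence."""
    return sorted(variants, key=lambda t: EVIDENCE_PRIORITY.index(t.evidence))
```

For a hypothetical gene with a 4 kb EST-supported variant and an 8 kb RefSeq-supported variant, `representative_model` would return the 8 kb variant, and `rank_by_evidence` would list the RefSeq-supported variant first.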
As part of our validation process, we compared our gene annotations with those from
Ensembl, RefSeq and the Consensus CDS (CCDS) project
(http://www.ncbi.nlm.nih.gov/projects/CCDS/). In Ensembl, 1147 (96%) of known
genes closely matched our annotation, while 1231 (80.8%) of all expressed genes
were annotated in their database. In RefSeq, 1158 (96.9%) of known genes closely
matched our annotation, while 1227 (80.5%) of all expressed genes were annotated
with coordinates in their database. In CCDS, in which four different centers must
agree on a consensus CDS for each gene, to date 745 genes (62.3% known, 48.9% all
expressed), including one novel transcript, have been annotated.
Retroposed genes, including pseudogenes
This data is from the UCSC Genome Browser
The retroGene track shows processed mRNAs that have been inserted back into the
genome since the mouse/human split32. RetroGenes can be either functional genes that
have acquired a promoter from a neighboring gene, non-functional pseudogenes, or
transcribed pseudogenes. All mRNAs of a species from GenBank were aligned to the
genome using blastz33. mRNAs that aligned twice in the genome (once with introns
and once without introns) were initially screened. Next, a series of features were
scored to determine candidates for retrotransposition events. These features include
position and length of the polyA tail, degree of synteny with mouse, coverage of
repetitive elements, number of exons that can still be aligned to the retroGene and
degree of divergence from the parent gene. These features are combined using heuristic
weighting based on analysis of known processed pseudogenes. RetroGenes in the
final set have a score greater than a threshold of 425, based on a ROC plot against the Vega
(http://vega.sanger.ac.uk/) annotated pseudogenes. The RetroFinder program and
browser track were developed by Robert Baertsch at UCSC.
CpG Islands
This data is from the UCSC Genome Browser
CpG islands were predicted by searching the sequence one base at a time, scoring
each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring
segments. Each segment was then evaluated for the following criteria:
1. GC content of 50% or greater
2. length greater than 200 bp
3. ratio greater than 0.6 of observed number of CG dinucleotides to the
expected number on the basis of the number of Gs and Cs in the segment
The CpG count is the number of CG dinucleotides in the island. The
percentage CpG is the ratio of CpG nucleotide bases (twice the CpG
count) to the length. The ratio of observed to expected CpG is calculated
according to the formula cited in Gardiner-Garden et al.9:
Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)
where N is the length of the segment.
This track was generated using a modification of a program developed by G. Miklem
and L. Hillier.
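The scoring scan and the three criteria can be illustrated with a small Python sketch. It is a simplified reading of the description above, not the program used for the track: a run opens at a CG dinucleotide, accumulates +17 or -1 per dinucleotide, and is closed at its best-scoring position once the running score falls below zero. N is taken to be the segment length, and all function names are illustrative.

```python
from typing import List, Tuple


def dinucleotide_scores(seq: str, cg_score: int = 17, other_score: int = -1) -> List[int]:
    """Score the dinucleotide starting at each base: +17 for CG, -1 otherwise."""
    seq = seq.upper()
    return [cg_score if seq[i:i + 2] == "CG" else other_score
            for i in range(len(seq) - 1)]


def max_scoring_segments(seq: str) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of maximally scoring runs."""
    segments = []
    start = None
    score = best = 0
    best_end = 0
    for i, s in enumerate(dinucleotide_scores(seq)):
        if start is None:
            if s > 0:  # open a new run at a CG dinucleotide
                start, score, best, best_end = i, s, s, i
            continue
        score += s
        if score > best:
            best, best_end = score, i
        if score < 0:  # run exhausted: keep it up to its best-scoring CG
            segments.append((start, best_end + 2))
            start = None
    if start is not None:
        segments.append((start, best_end + 2))
    return segments


def obs_exp_cpg(segment: str) -> float:
    """Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)."""
    segment = segment.upper()
    c, g = segment.count("C"), segment.count("G")
    if c == 0 or g == 0:
        return 0.0
    return segment.count("CG") * len(segment) / (c * g)


def is_cpg_island(segment: str) -> bool:
    """Apply the three criteria: GC >= 50%, length > 200 bp, Obs/Exp > 0.6."""
    segment = segment.upper()
    if len(segment) <= 200:
        return False
    gc_fraction = (segment.count("C") + segment.count("G")) / len(segment)
    return gc_fraction >= 0.5 and obs_exp_cpg(segment) > 0.6
```

A candidate span returned by `max_scoring_segments` would then be passed to `is_cpg_island`; a synthetic sequence with a 300 bp run of CG repeats flanked by AT repeats, for example, yields a single candidate covering exactly that run.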
Segmental duplication analysis
We used a BLAST-based detection scheme34 to identify all pairwise similarities
representing duplicated regions (≥1 kb and ≥90% identity) within the finished
sequence of HSA11 and compared to all other chromosomes in the NCBI genome
assembly (May, 2004 build 35). A total of 2146 pairwise alignments representing
17.75 Mb of aligned basepairs and 5.55 Mb of non-redundant duplicated bases were
analyzed on HSA11. The program Parasight
(http://humanparalogy.gene.cwru.edu/parasight/) was used to generate images of
pairwise alignments. Divergence of duplication, the number of substitutions per site
between the two sequences, was calculated using Kimura's two-parameter method,
which corrects for multiple events and transversion/transition mutational biases35.
Analysis of haplotype structural variation was performed using the program
Miropeats (threshold =3000)36.
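The divergence estimate can be made concrete with the standard form of Kimura's two-parameter distance, K = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions among compared sites. The Python sketch below assumes two pre-aligned sequences of equal length and skips gapped or ambiguous columns; it illustrates the formula and is not the pipeline used in the analysis.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1: str, seq2: str) -> float:
    """Kimura two-parameter distance between two aligned DNA sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguity codes
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1    # A<->G or C<->T
        else:
            transversions += 1  # purine <-> pyrimidine
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    # math.log raises ValueError if the sequences are too diverged (argument <= 0).
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

For two sequences differing by transitions at 10% of sites, the corrected distance is about 0.112 substitutions per site, slightly above the raw 10% difference because the method accounts for multiple events at the same site.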
Gene content of the duplicated regions in each 1% identity interval from 90% to 100%
was analyzed using a non-redundant/non-overlapping set of known genes. A gene feature
(exon) was considered duplicated if >50 bp of the feature overlapped a duplication. Thus,
exons less than 50 bp were lost in this analysis.
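The exon overlap rule can be sketched as a simple interval test. The Python example below assumes 1-based inclusive coordinates and tests each exon against each duplicated region separately; the coordinates in the usage note are hypothetical.

```python
from typing import Iterable, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based inclusive


def overlap_bp(a: Interval, b: Interval) -> int:
    """Number of bases shared by two inclusive intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def exon_is_duplicated(exon: Interval, duplications: Iterable[Interval],
                       min_overlap: int = 50) -> bool:
    """An exon counts as duplicated if more than min_overlap bp overlap a duplication."""
    return any(overlap_bp(exon, dup) > min_overlap for dup in duplications)
```

With this rule, a hypothetical 41 bp exon lying entirely inside a duplication is still not counted, which is the loss of short exons noted above, whereas a 101 bp exon sharing 51 bp with a duplication is counted.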
Comparative analysis
A BLAST-based method was used to define the map of
conserved synteny. The chimpanzee, mouse, rat, dog and chicken genomes were
obtained from http://genome.ucsc.edu. We used the repeat-masked version of the
chimpanzee November 2003 freeze (panTro1), the mouse March 2005 freeze (mm6),
the rat June 2003 freeze (rn3), the dog July 2004 freeze (canFam1) and the chicken
February 2004 freeze (galGal2). First, blastn37 was performed using the "-e 0.1"
option to align each of the genomes to the HSA11 sequence. After blast analysis,
adjacent hits that were properly oriented and reasonably spaced (< 1,000,000 bp) from
the same chromosome were merged into small blocks, as long as there was no ‘other’
hit between them. Blocks shorter than n x 25,000 bp (where ‘n’ equals the step number
in the iterative block-building process) were removed, and remaining blocks were
used as hits in the next step. After the analysis, synteny blocks of length at least
250,000 bp were obtained. For the conserved non-coding region analysis,
intersections (overlaps) of blastn hits from mammals (mouse, rat and dog) and
mammals plus chicken were defined as conserved regions. Conserved regions
overlapping with Ensembl genes were removed and the remaining regions were used
as CNEs if they were greater than 100 bp in length for mammals, or greater than 50
bp in length for mammals plus chicken. Percent identities were not used for the
analysis.
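The CNE definition above amounts to intersecting hit intervals across species, discarding regions that overlap genes, and applying a length cutoff. The Python sketch below illustrates those steps under simplifying assumptions: hits are given as sorted, non-overlapping, half-open intervals on HSA11, and all coordinates in the usage note are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) on HSA11


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted lists of non-overlapping intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # advance whichever interval finishes first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def overlaps_any(region: Interval, features: List[Interval]) -> bool:
    """True if the region overlaps at least one feature."""
    return any(region[0] < f_end and f_start < region[1] for f_start, f_end in features)


def call_cnes(hit_sets: List[List[Interval]], genes: List[Interval],
              min_len: int) -> List[Interval]:
    """Intersect hits across species, drop gene overlaps, keep regions longer than min_len."""
    conserved = hit_sets[0]
    for hits in hit_sets[1:]:
        conserved = intersect(conserved, hits)
    return [r for r in conserved
            if not overlaps_any(r, genes) and r[1] - r[0] > min_len]
```

Following the text, `call_cnes` would be run with `min_len=100` on the mouse, rat and dog hit lists, and with `min_len=50` when the chicken hits are added as a fourth list.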
Supplemental Resources
Blast 2 sequences (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html)
BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat)
Consensus CDS (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/)
DDBJ (http://www.ddbj.nig.ac.jp/Welcome-e.html)
DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
Ensembl (http://www.ensembl.org/index.html)
Entrez (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi)
Entrez Gene (RefSeq) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene)
Eponine Transcription Start Site finder (http://servlet.sanger.ac.uk:8080/eponine/)
Exofish (http://www.genoscope.cns.fr/cgi-bin/exofish.cgi)
HUGO Gene Nomenclature Committee (HGNC)
(http://www.gene.ucl.ac.uk/nomenclature/)
Human Annotation Workshop (Hawk) (www.sanger.ac.uk/Info/workshops/hawk2)
Human Olfactory Receptor Data Exploratorium (HORDE)
(http://bip.weizmann.ac.il/HORDE/)
miRBase (http://microrna.sanger.ac.uk/sequences/index.shtml)
Online Mendelian Inheritance in Man (OMIM)
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)
ORF Finder (Open Reading Frame Finder)
(http://www.ncbi.nlm.nih.gov/gorf/gorf.html)
RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker)
Rfam (http://www.sanger.ac.uk/Software/Rfam/index.shtml)
Spidey (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/)
Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.html)
tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/)
UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
UCSC Sequence and Annotation Downloads
(http://hgdownload.cse.ucsc.edu/downloads.html)
UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables)
VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html)
Vega Genome Browser (http://vega.sanger.ac.uk/)
Supplemental References
1. Benson, G. Tandem repeats finder: a program to analyze DNA sequences.
Nucleic Acids Res. 27, 573-580 (1999)
2. Dugich-Djordjevic, M. M. et al. Regionally specific and rapid increases in
brain-derived neurotrophic factor messenger RNA in the adult rat brain
following seizures induced by systemic administration of kainic acid.
Neuroscience 47, 303-315 (1992)
3. Canals, J. M. et al. Expression of brain-derived neurotrophic factor in cortical
neurons is regulated by striatal target area. J. Neurosci. 21, 117-124 (2001)
4. Jiang, X. et al. BDNF variation and mood disorders: a novel functional
promoter polymorphism and Val66Met are associated with anxiety but have
opposing effects. Neuropsychopharmacology 30, 1353-1361 (2005)
5. Nusbaum, C. et al. DNA sequence and analysis of human chromosome 18.
Nature 437, 551-555 (2005)
6. Yahagi, S. et al. Identification of two novel clusters of ultrahigh-sulfur
keratin-associated protein genes on human chromosome 11. Biochem. Biophys.
Res. Commun. 318, 655-664 (2004)
7. Ovcharenko, I. et al. Evolution and functional classification of vertebrate gene
deserts. Genome Res. 15, 137-145 (2005)
8. Itoh, T., Toyoda, A., Taylor, T. D., Sakaki, Y. & Hattori, M. Identification of
large ancient duplications associated with human gene deserts. Nat. Genet. 37,
1041-1043 (2005)
9. Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J.
Mol. Biol. 196, 261-282 (1987)
10. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat.
Genet. 37, 727-732 (2005)
11. Sharp, A. J. et al. Segmental duplications and copy number variation in the
human genome. Am. J. Hum. Genet. 77, 78-88 (2005)
12. Sebat, J. et al. Large-scale copy number polymorphism in the human genome.
Science 305, 525-528 (2004)
13. Iafrate, A. J. et al. Detection of large-scale variation in the human genome.
Nat. Genet. 36, 949-951 (2004)
14. Mouse Genome Sequencing Consortium. Initial sequencing and comparative
analysis of the mouse genome. Nature 420, 520-562 (2002)
15. Gibbs, R. A. et al. Genome sequence of the Brown Norway rat yields insights
into mammalian evolution. Nature 428, 493-521 (2004)
16. Lindblad-Toh, K. et al. Genome sequence, comparative analysis and haplotype
structure of the domestic dog. Nature 438, 803-819 (2005)
17. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the
chimpanzee genome and comparison with the human genome. Nature 437, 69-87
(2005)
18. Hillier, L. W. et al. Sequence and comparative analysis of the chicken genome
provide unique perspectives on vertebrate evolution. Nature 432, 695-716
(2004)
19. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred.
II. Error probabilities. Genome Res. 8, 186-194 (1998)
20. Ewing, B., Hillier, L., Wendl, M. C., & Green, P. Base-calling of automated
sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175-185
(1998)
21. Gordon, D., Abajian, C., & Green, P. Consed: a graphical tool for sequence
finishing. Genome Res. 8, 195-202 (1998)
22. Hattori, M. et al. A novel method for making nested deletions and its
application for sequencing of a 300 kb region of human APP locus. Nucleic
Acids Res. 25, 1802-1808 (1997)
23. Rosenblum, B.B. et al. New dye-labeled terminators for improved DNA
sequencing patterns. Nucleic Acids Res. 25, 4500-4504 (1997)
24. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome
Res. 12, 177-189 (2002)
25. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes:
Arachne 2. Genome Res. 13, 91-96 (2003)
26. Bonfield, J. K., Smith, K., & Staden, R. A new DNA sequence assembly
program. Nucleic Acids Res. 23, 4992-4999 (1995)
27. MacMurray, A. A., Sulston, J. E., & Quail, M. A. Short-insert libraries as a
method of problem solving in genome sequencing. Genome Res. 8, 562 (1998)
28. Wong, G. K., Yu, J., Thayer, E. C. & Olson, M. V. Multiple-complete-digest
restriction fragment mapping: generating sequence-ready maps for large-scale
DNA sequencing. Proc. Natl. Acad. Sci. USA. 94, 5225-5230 (1997)
29. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656-664
(2002)
30. Pruitt, K. D., Tatusova, T., & Maglott, D. R. NCBI Reference Sequence
(RefSeq): a curated non-redundant sequence database of genomes, transcripts
and proteins. Nucleic Acids Res. 33, D501-D504 (2005)
31. Benson, D. A. et al. GenBank. Nucleic Acids Res. 33, D34-D38 (2005)
32. Kent, W. J., Baertsch, R., Hinrichs, A., Miller, W., & Haussler, D. Evolution's
cauldron: Duplication, deletion, and rearrangement in the mouse and human
genomes. Proc. Natl. Acad. Sci. USA 100, 11484-11489 (2003)
33. Schwartz, S. et al. Human-Mouse Alignments with BLASTZ. Genome Res. 13,
103-107 (2003)
34. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded
blockset aligner. Genome Res. 14, 708-715 (2004)
35. Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E.
Segmental duplications: organization and impact within the current human
genome project assembly. Genome Res. 11, 1005-1017 (2001)
36. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput.
Appl. Biosci. 11, 615-619 (1995)
37. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997)
38. Bailey, J. A. et al. Recent segmental duplications in the human genome.
Science 297, 1003-1007 (2002)