2005-10-11925A

SUPPLEMENTAL INFORMATION

Interspersed repeats

The repeat content of HSA11 consists of 21.1% LINEs, 13.3% SINEs, 8.1% LTR elements, and 2.7% DNA elements, with the remaining 2.7% composed of small RNAs, satellite repeats, simple repeats, low-complexity sequence and VNTRs (Table S5). The proportion of Alu repeats (9.42%) is slightly lower than the genome average of 10.6%; all other element classes are very close to the genome averages. The Tandem Repeats Finder (TRF) program [1] is often used to detect tandem repeats larger than six bp; however, the bulk of its results overlap with those from the RepeatMasker program (A.F.A. Smit, R. Hubley & P. Green; http://repeatmasker.org). In the case of HSA11, of the nearly 3 Mb predicted by TRF, 83% overlaps with RepeatMasker annotations as follows: satellite repeats, 36.87%; simple repeats, 20.94%; SINEs, 10.11%; LINEs, 4.52%; LTR elements, 4.00%; low complexity, 3.28%; unclassified repeats, 2.74%; and DNA elements, 0.58%. The remaining 17% (511 Kb) are tandem repeats that were predicted by TRF only.

Gene catalog

The 1524 protein-coding genes have an average exon length of 301 bp. The largest number of exons, at least 64, was found in the ATM gene, which encodes a phosphatidylinositol-3 kinase; mutations in ATM affect DNA damage repair and can lead to the autosomal recessive disorder ataxia telangiectasia. The longest intron, just over 589 Kb, was found in the OPCML gene, which encodes a protein that binds opioid alkaloids in the presence of acidic lipids. The longest gene on HSA11 is DLG2, a membrane-associated guanylate kinase family member, which spans nearly 2.17 Mb over 28 exons and encodes an 870 amino acid protein.

Overlapping genes

We found many cases of internal genes (genes within genes) and overlapping genes (neighboring genes that overlap by at least one base pair, usually in their untranslated 5'- and 3'-ends, and that are most commonly in opposite orientation to one another).
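
The distinction between internal and overlapping genes can be illustrated with a small interval check. This is a hypothetical sketch, assuming each gene is reduced to a name, inclusive start and end coordinates, and a strand; the `Gene` class, the `classify_overlap` and `opposite_orientation` functions, and the example coordinates are illustrative assumptions and are not taken from the annotation pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    """Hypothetical minimal gene record (coordinates are inclusive)."""
    name: str
    start: int
    end: int
    strand: str  # '+' or '-'


def classify_overlap(a: Gene, b: Gene) -> str:
    """Return 'none', 'internal' (one gene inside the other) or 'partial'."""
    # Number of shared base pairs; at least one is required for an overlap.
    shared = min(a.end, b.end) - max(a.start, b.start) + 1
    if shared < 1:
        return "none"
    a_contains_b = a.start <= b.start and b.end <= a.end
    b_contains_a = b.start <= a.start and a.end <= b.end
    if a_contains_b or b_contains_a:
        return "internal"
    return "partial"


def opposite_orientation(a: Gene, b: Gene) -> bool:
    """True when the two genes lie on opposite strands."""
    return a.strand != b.strand


# Made-up coordinates, for illustration only.
host = Gene("host", 1000, 9000, "+")
inner = Gene("inner", 3000, 4000, "-")
neighbor = Gene("neighbor", 8500, 12000, "-")
far = Gene("far", 20000, 21000, "+")

for other in (inner, neighbor, far):
    print(other.name, classify_overlap(host, other), opposite_orientation(host, other))
```

A gene with a "mixed" overlap type, as described in this section, would be one that returns `internal` against one neighbor and `partial` against another.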
In most cases, the overlaps reported here are supported by multiple mRNA or EST sequences. In addition to 242 pseudogenes and RNA genes contained within another gene, we identified 369 protein-coding genes with some form of overlap with at least one other gene. Of these, 90 genes were found to be completely contained within 42 surrounding genes, 182 genes showed partial overlaps with other genes, 19 genes had a mixed overlap type (i.e., they contained at least one gene and partially overlapped another), and 36 genes were involved in co-transcription or read-through (that is, 12 cases, each involving the joining of a pair of genes). The total genomic overlap is 1.9 Mb, with the longest overlap being 43.3 Kb, between BDNF (brain-derived neurotrophic factor) and BDNFOS (brain-derived neurotrophic factor opposite strand). BDNFOS is a non-coding RNA in reverse orientation to BDNF and may regulate the expression of that gene. Both genes have distinct CpG islands at their 5'-ends. BDNF is induced by cortical neuron activity and is necessary for survival of striatal neurons in the brain [2,3]. This gene may also be important in the regulation of the stress response and in the biology of mood disorders [4]. We emphasize that each of the above cases will require careful analysis to confirm the true boundaries of the genes; it is possible that some cases of apparent overlap reflect inaccurate determination of gene boundaries. The rate of overlapping genes found here, 24%, is similar to previously published observations [5].

Clustered gene families

In addition to the olfactory receptor genes, we identified 142 genes (140 expressed, two pseudogenes) in 37 clusters with at least two members from the same gene family (Table S7). The cluster with the most members, 13, is from the MS4A family, whose proteins have at least four potential transmembrane domains and N- and C-terminal cytoplasmic domains.
The apolipoprotein gene cluster on 11q23 consists of four members, which are components of high-density lipoprotein. These genes are strongly associated with plasma triglyceride levels and are major risk factors for various disorders including coronary artery disease, hypertriglyceridemia, and systemic nonneuropathic amyloidosis. They have also been associated with blood glucose, plasma lipoprotein levels, total cholesterol, and triglycerides in a gender-specific manner. The beta hemoglobin gene cluster on 11p15.4 consists of five members, in 5' to 3' order: HBE1 (epsilon 1), HBG2 (gamma G), HBG1 (gamma A), HBD (delta), and HBB (beta). Along with the alpha globin gene cluster located on chromosome 16, which also contains five members, these loci determine the structure of the two types of polypeptide chains in adult hemoglobin, Hb A. The normal adult hemoglobin tetramer consists of two alpha chains and two beta chains. Mutations in beta-globin can lead to a number of disorders, including sickle cell anemia, beta thalassemia and erythrocytosis. In addition, there are two ultrahigh-sulfur keratin-associated protein (KRTAP5) gene clusters, at 11p15.5 (6 members) and 11q13.4 (7 members), which resulted from an intrachromosomal duplication on HSA11 [6]. These genes show preferential expression in human hair root, suggesting they are required for hair formation. All of the KRTAP5 genes are highly conserved in chimpanzee but, interestingly, in mouse they are part of two non-adjacent, significantly larger blocks of synteny to chromosome 7; however, there is only one KRTAP cluster on mouse chromosome 7. The two human clusters lie adjacent to synteny breakpoints which, in mouse, are found in proximity to the single cluster.

Pseudogenes and RNAs

We annotated 765 pseudogenes, including 205 olfactory receptor pseudogenes, 558 non-olfactory receptor pseudogenes, and two tRNA pseudogenes.
Most of the non-olfactory receptor pseudogenes identified in this report were derived from the "Retroposed Genes" UCSC Genome Browser track (Supplemental Methods). The average pseudogene spans about 1.2 Kb, and all but three are currently annotated as processed pseudogenes. The 203 olfactory receptor pseudogenes are the exception and have arisen by duplications of large chromosomal domains followed by extensive gene duplication and divergence. Of the pseudogenes, 204 are internal to other expressed genes. TRIM5, a known gene that spans over 275 Kb, contains six olfactory receptor pseudogenes, the TRIMP1-2 pseudogene, nine expressed olfactory receptor genes, and the TRIM22 gene. Of the 60 RNA genes, 38 are internal to other genes, with the most extreme cases being eight small nucleolar RNAs that are internal to the novel transcript predicted from accession AK095849, and nine RNAs (eight small nucleolar, one small Cajal body-specific) that are internal to the novel CDS KIAA1731. Eight predicted tRNAs are located in a small cluster around 59.08 Mb.

Gene deserts

According to the criteria of Ovcharenko et al. [7], we annotated 19 gene deserts greater than 651 Kb, the longest being over 3.4 Mb (36,652,967-40,088,070 bp), flanked by the genes RAG1 and AK127441, a novel transcript (Table S17). In total, the 19 gene deserts, five of which are less than 100 Kb apart from one another, account for 23.4 Mb of the HSA11 sequence. These gene deserts contain 88 pseudogenes and one RNA, but there are no annotated expressed genes of any type within them. In a separate analysis [8], we identified two neighboring large, ancient duplications, which are also conserved in mouse and dog. These duplications, from 22.8 to 24.8 Mb (2 Mb) and from 24.8 to 26.3 Mb (1.5 Mb), completely overlap with two of the gene deserts we identified here.
These intrachromosomal duplications are composed of long intermittent sequences with similarity as low as 60%, and suggest that some gene deserts originated from duplications of segments lacking genes in a mammalian common ancestor.

CpG islands

CpG islands are unmethylated regions of the genome that are associated with the 5'-ends of many house-keeping genes and regulated genes. Of the 1369 calculated CpG islands [9] (Supplemental Methods, Table S11), 895 are associated with expressed genes, including 781 (58.7%) known genes, 53 (51%) novel CDSs, 60 (27.1%) novel transcripts, and one (25%) putative gene. Of these genes, 806 contain CpG islands in or near their 5'-ends, 23 in their 3'-ends, and 60 internally, and six are completely encompassed by the CpG island. A total of 291 genes share a CpG island with at least one other gene, which means they may be under the same regulatory control. Of the 157 shared CpG islands, 149 are shared by two genes and eight are shared by three genes. The longest CpG island on HSA11, which is highly conserved in chimp, dog, mouse, and rat, is 7,460 bp and is found at the 5'-end of CCND1. CCND1, or cyclin D1, is a member of the highly conserved cyclin family, whose members are characterized by a dramatic periodicity in protein abundance throughout the cell cycle. Mutations, amplification and overexpression of this gene, which alter cell cycle progression, are observed frequently in a variety of tumors and may contribute to tumorigenesis.

Imprinting

While CpG islands are not normally methylated, there are several cases in the human genome where methylation occurs on one of the parental alleles, violating the usual rule of inheritance that both alleles in a heterozygote are equally expressed. This phenomenon is called genomic imprinting. When a gene is suppressed through imprinting from one parent, and the allele from the other parent is not expressed because of mutation, the child will be deficient for that gene.
The 11p15.5 region of HSA11 contains two of the most well-studied reciprocally imprinted genes, H19 and IGF2, disruption of which can lead to Beckwith-Wiedemann syndrome, whose cardinal features are exomphalos, macroglossia, and gigantism in the neonate. Mutations in several imprinted genes in this region can lead to this syndrome, as well as to others. Table S18 lists all of the imprinting-related genes on HSA11 as derived from the OMIM database.

Duplication Analysis

We performed a detailed analysis of duplicated genomic sequence (≥90% sequence identity and ≥1 kb in length), comparing HSA11 against the May 2004 assembly of the human genome (Supplemental Methods). We estimated that 4.23% (5.55 Mb) of HSA11 consists of segmental duplications (Tables S19, S20). Compared to other finished chromosomes as well as the genome average (5.3%), HSA11 is not enriched for segmental duplications. Unlike the genome-wide distribution, in which the aligned base pairs of interchromosomal duplications are slightly fewer than those of intrachromosomal duplications, duplications in HSA11 are predominantly interchromosomal: 14.14 Mb out of 17.75 Mb of aligned base pairs and 1399 out of 1667 pairwise alignments are with non-homologous chromosomes (Fig. S3, Table S19). While the duplications with higher divergence (>0.08) tended to be short, those with lower divergence are more scattered in the length distribution (Fig. S4). The alignments show a bimodal distribution of sequence identity: the majority of interchromosomal duplication alignments show 93-95% sequence identity, while intrachromosomal duplications show 95-97% sequence identity. Segmental duplications are particularly clustered in the subtelomeric and pericentromeric regions of HSA11p, with the subtelomeric region accounting for 18.3% (305/1667) and the pericentromeric region accounting for 13.6% (226/1667) of the total alignments (Table S21).
This subtelomeric region is clustered with interchromosomal duplications that mostly map to the subtelomeric or pericentromeric regions of other chromosomes. The pericentromeric region of HSA11p is mainly clustered with intrachromosomal duplications (Fig. S5). In addition, the overwhelming majority of the other segmental duplications are clustered in another 12 blocks (>100 kb and >5 duplication alignments) (Figs. S6, S7). These regions contain genes or fragments of genes, such as IFITM, FOLH, ALDH, NOX4, TRIM49 and NAALAD, as well as OR family members (Table S22). However, only 41 OR pseudogenes and 18 intact genes, all but two from class II, overlap with segmentally duplicated regions, mostly scattered across the chromosome. This suggests that segmental duplication was not a major factor in the expansion of this large gene family, at least on HSA11. Copy number polymorphisms, macro insertions and deletions, and inversions among different human populations have recently been reported [10-13]. We observed that at least four of the HSA11 duplication clusters overlap with or are adjacent to known copy number polymorphism sites, suggesting that the clustered duplications play a role in the generation of these polymorphisms.

Comparative biology

To further define the chromosomal landscape, we performed a comparative analysis of finished HSA11 versus the draft mouse [14], rat [15], dog [16], chimpanzee [17] and chicken [18] genomes. Using DNA alignments, we constructed a map of conserved synteny between HSA11 and the other genomes (Fig. S8, Methods). By scanning these regions for contiguous collinear nucleotide similarity, 36 blocks of conserved synteny larger than 250 kb were identified between human and mouse, with the longest segment being 17.4 Mb. Results for the other organisms can be found in Table S23. In the map of conserved synteny, the chicken seems to lack some of the gene-rich regions. However, this may simply be due to the exclusion of smaller blocks of conserved synteny.
Indeed, comparative analysis using the Tblastx program suggests that many of the genes in these regions are present in the chicken genome. Additionally, we identified 6218 conserved non-coding elements (CNEs) by combining the blastn hits for mammals (mouse, rat and dog), and separately 942 CNEs for mammals plus chicken (Table S24). The CNEs defined among mammals only are fairly evenly distributed along the chromosome, with a slightly higher density in the region of 90-120 Mb on HSA11 (Fig. S9). However, the elements that are also conserved with chicken show a more skewed distribution, with a higher density of CNEs from 11p-tel to approximately 18 Mb and a much lower density from there to the centromere. The reasons for these trends are unclear, but may partially reflect the lack of clearly defined synteny with chicken. Out of 7,487 evolutionarily conserved regions (ECRs) with an average length of 112 bp, covering 841,811 bp across HSA11 (data courtesy of O. Jaillon, Genoscope), 5,964 (79.66%) overlapped potential protein-coding genes and 798 (10.66%) overlapped pseudogenes. Of the ECRs, 661 (8.83%) overlapped a CpG island, while only 154 (2.06%) fell within gene desert regions.

Supplemental Methods

Large insert clone sequencing and mapping at RIKEN Genomic Sciences Center

Large-insert BAC and fosmid clone DNA was prepared by the standard alkaline lysis method (Kurabo PI-1100). Shotgun libraries were constructed from randomly sheared DNA (1-2 kb) (HydroShear, GeneMachines) cloned into a plasmid vector. The template DNA was prepared by PCR amplification of the insert DNA (TaKaRa Ex Taq, Biometra and ABI GeneAmp PCR System 9700), GenomiPhi amplification (Amersham Biosciences) or plasmid DNA isolation (Kurabo PI-1100). Cycle sequencing was performed with BigDye v3.1 chemistry on ABI3700 and ABI3730 sequencers (Applied Biosystems), and with ET chemistry on the MegaBACE1000 sequencer (Amersham Biosciences).
Basecalling, quality assessment and assembly were carried out using the Phred/Phrap software package [19,20]. Assemblies of clones sequenced at 8-10 fold redundancy were visualized for finishing with Consed [21] and Sequencher (Gene Codes Corp.). A combination of the following methods was used to close sequence gaps and resolve low-quality or problematic regions: nested deletion [22], primer walking, PCR, direct sequencing of large-insert BAC and fosmid clones, and subcloning of BAC clones into fosmid vectors. The average accuracy of the finished sequence data was estimated to be greater than Phrap 40. Clones were finished according to the agreed international standard for the human genome (http://genomeold.wustl.edu/Overview/g16stand.php). There were 41 sequence gaps, for an estimated total of 13.7 Kb, which could not be resolved by sequencing (Table S25). For the Human Genome Project (HGP), a number of quality control checks were performed to ensure the highest quality data and uniformity throughout the genome. Like all the other human chromosomes, HSA11 was inspected to make sure that none of the following applied: missing known genes, missing STSs, contamination, partially present genes, compressions or insertions, and false clone overlaps. One particularly difficult region to finish was the 566 Kb interval from 88.92-89.49 Mb, which consists of several large intra- and inter-chromosomal duplications (Fig. S10). There is a 350 Kb, near-perfect (99.65%) intrachromosomal duplication in this region which contains two copies each of the PSMAL and TRIM49 genes, and at least four retrotransposed processed pseudogenes, making this region also a challenge to annotate. As noted in the main text, gaps were size-estimated by fiber-FISH analysis where possible. Initial size estimates were made roughly to the nearest 1-10 Kb.
In cases where additional sequence data were incorporated to reduce the gaps, the estimated sizes of the gaps were decreased by the exact amount of new data added, leading to what seem to be very precise estimates of the gap sizes. Fiber-FISH analysis can at best give estimates in the 1-10 Kb range.

Large insert clone sequencing at Broad Institute/Whitehead Center for Genome Research

Subclone libraries of large-insert clones were prepared in M13 or one of several plasmid subclone vectors, and sequenced with the dideoxy chain termination method using one of several versions of BigDye chemistry [23]. Data were detected on several models of ABI sequencing machines and assembled with Phrap (http://www.phrap.org) or Arachne [24,25]. Assemblies were visualized for finishing with either Gap4 [26] or Consed [21]. A combination of the following methods was used to close sequence gaps and resolve low-quality regions and misassemblies: transposon insertion-based sequencing, primer walking, PCR, and shattered insert libraries [27]. Finished sequence assemblies of all large-insert clones were validated by comparison to restriction digestion patterns generated by 3-5 six-cutter enzymes [28].

STS markers on genetic maps

These data are from the UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway). Positions of STS markers are determined using both full sequences and primer information. Full sequences are aligned using blat [29], while isPCR (Jim Kent) and ePCR (http://www.ncbi.nih.gov/sutils/e-pcr/) are used to find locations using primer information. Both sets of placements are combined to give final positions. In nearly all cases, full sequence and primer-based locations are in agreement; in cases of disagreement, full sequence positions are used. Sequence and primer information for the markers was obtained from the primary sites for each of the maps, and from UniSTS (http://www.ncbi.nih.gov/entrez/query.fcgi?db=unists).
This track was designed and implemented by Terry Furey.

Construction of gene catalog

Alignments of all available (as of August 2005) human RefSeq [30] and GenBank [31] messenger RNA sequences to the finished sequence were derived from the UCSC Genome Browser according to their methodology (http://genome.ucsc.edu/index.html). Data from the following tracks were inspected manually to ensure accurate transcriptional start and stop sites, and to correct splice sites: Known Genes, RefSeq Genes, Human mRNAs, Ensembl Genes, CCDS, Retroposed Genes, and sno/miRNA. In addition, data from these tracks were reviewed: Human ESTs, Other RefSeq, Other mRNAs, and Other ESTs. Non-canonical splice sites were used only if supported by sufficient complementary DNA-based evidence. Partial transcripts (those containing a partial open reading frame) were annotated in cases for which there was firm evidence of their existence. All gene models (Table S6) were created manually using these aligned sequences as evidence, following HAWK2 (www.sanger.ac.uk/Info/workshops/hawk2) transcript type conventions. Evidence was given relative priority as follows (high to low): RefSeq, other mRNAs, spliced ESTs, unspliced ESTs, non-human orthologous mRNAs. When there was more than one variant for a gene, we selected the longest genomic transcript as the representative model. Gene symbols for biologically characterized loci were assigned by the HUGO Gene Nomenclature Committee (http://www.gene.ucl.ac.uk/nomenclature/). Our annotations will be made available to the Vertebrate Genome Annotation database (VEGA, http://vega.sanger.ac.uk/Homo_sapiens). As part of our validation process, we compared our gene annotations with those from Ensembl, RefSeq and the Consensus CDS (CCDS) project (http://www.ncbi.nlm.nih.gov/projects/CCDS/). In Ensembl, 1147 (96%) of known genes closely matched our annotation, while 1231 (80.8%) of all expressed genes were annotated in their database.
In RefSeq, 1158 (96.9%) of known genes closely matched our annotation, while 1227 (80.5%) of all expressed genes were annotated with coordinates in their database. In CCDS, in which four different centers must agree on a consensus CDS for each gene, 745 genes (62.3% of known, 48.9% of all expressed), including one novel transcript, have been annotated to date.

Retroposed genes, including pseudogenes

These data are from the UCSC Genome Browser. The retroGene track shows processed mRNAs that have been inserted back into the genome since the mouse/human split [32]. RetroGenes can be either functional genes that have acquired a promoter from a neighboring gene, non-functional pseudogenes, or transcribed pseudogenes. All mRNAs of a species from GenBank were aligned to the genome using blastz [33]. mRNAs that aligned twice in the genome (once with introns and once without introns) were initially screened. Next, a series of features was scored to determine candidates for retrotransposition events. These features include position and length of the polyA tail, degree of synteny with mouse, coverage of repetitive elements, number of exons that can still be aligned to the retroGene, and degree of divergence from the parent gene. These features are combined using heuristic weighting based on analysis of known processed pseudogenes. RetroGenes in the final set have a score greater than 425, a threshold based on a ROC plot against the Vega (http://vega.sanger.ac.uk/) annotated pseudogenes. The RetroFinder program and browser track were developed by Robert Baertsch at UCSC.

CpG Islands

These data are from the UCSC Genome Browser. CpG islands were predicted by searching the sequence one base at a time, scoring each dinucleotide (+17 for CG and -1 for others) and identifying maximally scoring segments. Each segment was then evaluated for the following criteria:

1. GC content of 50% or greater
2. length greater than 200 bp
3. ratio greater than 0.6 of observed number of CG dinucleotides to the expected number on the basis of the number of Gs and Cs in the segment

The CpG count is the number of CG dinucleotides in the island. The Percentage CpG is the ratio of CpG nucleotide bases (twice the CpG count) to the length. The ratio of observed to expected CpG is calculated according to the formula cited in Gardiner-Garden et al. [9]:

Obs/Exp CpG = Number of CpG * N / (Number of C * Number of G)

where N is the length of the sequence. This track was generated using a modification of a program developed by G. Miklem and L. Hillier.

Segmental duplication analysis

We used a BLAST-based detection scheme [34] to identify all pairwise similarities representing duplicated regions (≥1 kb and ≥90% identity) within the finished sequence of HSA11 and between HSA11 and all other chromosomes in the NCBI genome assembly (May 2004, build 35). A total of 2146 pairwise alignments representing 17.75 Mb of aligned base pairs and 5.55 Mb of non-redundant duplicated bases were analyzed on HSA11. The program Parasight (http://humanparalogy.gene.cwru.edu/parasight/) was used to generate images of pairwise alignments. Divergence of duplications, the number of substitutions per site between the two sequences, was calculated using Kimura's two-parameter method, which corrects for multiple events and transversion/transition mutational biases [35]. Analysis of haplotype structural variation was performed using the program Miropeats (threshold = 3000) [36]. Gene content of the duplicated regions at each 1% interval of 90%-100% identity was analyzed using a non-redundant/non-overlapping set of known genes. A gene feature (exon) was considered duplicated if >50 bp of the feature overlapped a duplication. Thus, exons shorter than 50 bp were lost in this analysis.

Comparative analysis

A BLAST-based method was used to define the map of conserved synteny. The chimpanzee, mouse, rat, dog and chicken genomes were obtained from http://genome.ucsc.edu.
We used the repeat-masked version of the chimpanzee November 2003 freeze (panTro1), the mouse March 2005 freeze (mm6), the rat June 2003 freeze (rn3), the dog July 2004 freeze (canFam1) and the chicken February 2004 freeze (galGal2). First, blastn [37] was performed using the "-e 0.1" option to align each of the genomes to the HSA11 sequence. After the blast analysis, adjacent hits from the same chromosome that were properly oriented and reasonably spaced (<1,000,000 bp) were merged into small blocks, as long as there was no 'other' hit between them. Blocks shorter than n x 25,000 bp (where 'n' is the step number in the iterative block-building process) were removed, and the remaining blocks were used as hits in the next step. After the analysis, synteny blocks of at least 250,000 bp in length were obtained. For the conserved non-coding region analysis, intersections (overlaps) of blastn hits from mammals (mouse, rat and dog) and from mammals plus chicken were defined as conserved regions. Conserved regions overlapping with Ensembl genes were removed and the remaining regions were used as CNEs if they were greater than 100 bp in length for mammals, or greater than 50 bp in length for mammals plus chicken. Percent identities were not used for the analysis.
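
The CNE definition above amounts to interval arithmetic on HSA11 coordinates. The following is a minimal Python sketch of that idea, not the pipeline actually used. It assumes that each species' blastn hits have already been reduced to sorted, non-overlapping, half-open (start, end) intervals on HSA11, and that "intersection" means the region covered by hits from every species in the set. The function names (`intersect`, `overlaps_any`, `call_cnes`) and all coordinates are hypothetical.

```python
from functools import reduce


def intersect(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def overlaps_any(region, genes):
    """True if the region shares at least one base with any gene interval."""
    return any(region[0] < g_end and g_start < region[1] for g_start, g_end in genes)


def call_cnes(hits_by_species, genes, min_len):
    """Intersect hits across species, drop gene overlaps, keep regions longer than min_len."""
    conserved = reduce(intersect, hits_by_species)
    return [r for r in conserved
            if not overlaps_any(r, genes) and (r[1] - r[0]) > min_len]


# Made-up hit coordinates, for illustration only.
mouse = [(100, 400), (1000, 1300), (5000, 5080)]
rat = [(150, 420), (1050, 1400), (5000, 5090)]
dog = [(120, 380), (900, 1250), (5010, 5070)]
chicken = [(200, 260), (5000, 5075)]
genes = [(1000, 2000)]  # stand-in for Ensembl gene spans

mammal_cnes = call_cnes([mouse, rat, dog], genes, min_len=100)
mammal_chicken_cnes = call_cnes([mouse, rat, dog, chicken], genes, min_len=50)
print(mammal_cnes)
print(mammal_chicken_cnes)
```

In keeping with the text, the sketch filters only on length and gene overlap; percent identity plays no role.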

Supplemental Resources

Blast 2 sequences (http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html)
BLAT (http://genome.ucsc.edu/cgi-bin/hgBlat)
Consensus CDS (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/)
DDBJ (http://www.ddbj.nig.ac.jp/Welcome-e.html)
DIGIT (http://digit.gsc.riken.go.jp/cgi-bin/index.cgi)
Ensembl (http://www.ensembl.org/index.html)
Entrez (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi)
Entrez Gene (RefSeq) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene)
Eponine Transcription Start Site finder (http://servlet.sanger.ac.uk:8080/eponine/)
Exofish (http://www.genoscope.cns.fr/cgi-bin/exofish.cgi)
HUGO Gene Nomenclature Committee (HGNC) (http://www.gene.ucl.ac.uk/nomenclature/)
Human Annotation Workshop (Hawk) (www.sanger.ac.uk/Info/workshops/hawk2)
Human Olfactory Receptor Data Exploratorium (HORDE) (http://bip.weizmann.ac.il/HORDE/)
miRBase (http://microrna.sanger.ac.uk/sequences/index.shtml)
Online Mendelian Inheritance in Man (OMIM) (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)
ORF Finder (Open Reading Frame Finder) (http://www.ncbi.nlm.nih.gov/gorf/gorf.html)
RepeatMasker (http://www.repeatmasker.org/cgi-bin/WEBRepeatMasker)
Rfam (http://www.sanger.ac.uk/Software/Rfam/index.shtml)
Spidey (http://www.ncbi.nlm.nih.gov/IEB/Research/Ostell/Spidey/)
Tandem Repeats Finder (http://tandem.bu.edu/trf/trf.html)
tRNAscan-SE (http://www.genetics.wustl.edu/eddy/tRNAscan-SE/)
UCSC Genome Browser (http://genome.ucsc.edu/cgi-bin/hgGateway)
UCSC Sequence and Annotation Downloads (http://hgdownload.cse.ucsc.edu/downloads.html)
UCSC Table Browser (http://genome.ucsc.edu/cgi-bin/hgTables)
VecScreen (http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html)
Vega Genome Browser (http://vega.sanger.ac.uk/)

Supplemental References

1. Benson, G. Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res. 27, 573-580 (1999)
2. Dugich-Djordjevic, M. M. et al. Regionally specific and rapid increases in brain-derived neurotrophic factor messenger RNA in the adult rat brain following seizures induced by systemic administration of kainic acid. Neuroscience 47, 303-315 (1992)
3. Canals, J. M. et al. Expression of brain-derived neurotrophic factor in cortical neurons is regulated by striatal target area. J. Neurosci. 21, 117-124 (2001)
4. Jiang, X. et al. BDNF variation and mood disorders: a novel functional promoter polymorphism and Val66Met are associated with anxiety but have opposing effects. Neuropsychopharmacology 30, 1353-1361 (2005)
5. Nusbaum, C. et al. DNA sequence and analysis of human chromosome 18. Nature 437, 551-555 (2005)
6. Yahagi, S. et al. Identification of two novel clusters of ultrahigh-sulfur keratin-associated protein genes on human chromosome 11. Biochem. Biophys. Res. Commun. 318, 655-664 (2004)
7. Ovcharenko, I. et al. Evolution and functional classification of vertebrate gene deserts. Genome Res. 15, 137-145 (2005)
8. Itoh, T., Toyoda, A., Taylor, T. D., Sakaki, Y. & Hattori, M. Identification of large ancient duplications associated with human gene deserts. Nat. Genet. 37, 1041-1043 (2005)
9. Gardiner-Garden, M. & Frommer, M. CpG islands in vertebrate genomes. J. Mol. Biol. 196, 261-282 (1987)
10. Tuzun, E. et al. Fine-scale structural variation of the human genome. Nat. Genet. 37, 727-732 (2005)
11. Sharp, A. J. et al. Segmental duplications and copy number variation in the human genome. Am. J. Hum. Genet. 77, 78-88 (2005)
12. Sebat, J. et al. Large-scale copy number polymorphism in the human genome. Science 305, 525-528 (2004)
13. Iafrate, A. J. et al. Detection of large-scale variation in the human genome. Nat. Genet. 36, 949-951 (2004)
14. Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature 420, 520-562 (2002)
15. Gibbs, R. A. et al. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428, 493-521 (2004)
16. Lindblad-Toh, K. et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438, 803-819 (2005)
17. Chimpanzee Sequencing and Analysis Consortium. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437, 69-87 (2005)
18. Hillier, L. W. et al. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432, 695-716 (2004)
19. Ewing, B. & Green, P. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8, 186-194 (1998)
20. Ewing, B., Hillier, L., Wendl, M. C. & Green, P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8, 175-185 (1998)
21. Gordon, D., Abajian, C. & Green, P. Consed: a graphical tool for sequence finishing. Genome Res. 8, 195-202 (1998)
22. Hattori, M. et al. A novel method for making nested deletions and its application for sequencing of a 300 kb region of human APP locus. Nucleic Acids Res. 25, 1802-1808 (1997)
23. Rosenblum, B. B. et al. New dye-labeled terminators for improved DNA sequencing patterns. Nucleic Acids Res. 25, 4500-4504 (1997)
24. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177-189 (2002)
25. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91-96 (2003)
26. Bonfield, J. K., Smith, K. & Staden, R. A new DNA sequence assembly program. Nucleic Acids Res. 23, 4992-4999 (1995)
27. MacMurray, A. A., Sulston, J. E. & Quail, M. A. Short-insert libraries as a method of problem solving in genome sequencing. Genome Res. 8, 562 (1998)
28. Wong, G. K., Yu, J., Thayer, E. C. & Olson, M. V. Multiple-complete-digest restriction fragment mapping: generating sequence-ready maps for large-scale DNA sequencing. Proc. Natl. Acad. Sci. USA 94, 5225-5230 (1997)
29. Kent, W. J. BLAT--the BLAST-like alignment tool. Genome Res. 12, 656-664 (2002)
30. Pruitt, K. D., Tatusova, T. & Maglott, D. R. NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33, D501-D504 (2005)
31. Benson, D. A. et al. GenBank. Nucleic Acids Res. 33, D34-D38 (2005)
32. Kent, W. J., Baertsch, R., Hinrichs, A., Miller, W. & Haussler, D. Evolution's cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. USA 100, 11484-11489 (2003)
33. Schwartz, S. et al. Human-Mouse Alignments with BLASTZ. Genome Res. 13, 103-107 (2003)
34. Blanchette, M. et al. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14, 708-715 (2004)
35. Bailey, J. A., Yavor, A. M., Massa, H. F., Trask, B. J. & Eichler, E. E. Segmental duplications: organization and impact within the current human genome project assembly. Genome Res. 11, 1005-1017 (2001)
36. Parsons, J. D. Miropeats: graphical DNA sequence comparisons. Comput. Appl. Biosci. 11, 615-619 (1995)
37. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389-3402 (1997)
38. Bailey, J. A. et al. Recent segmental duplications in the human genome. Science 297, 1003-1007 (2002)