FEMS Yeast Research 4 (2003) 207–215
www.fems-microbiology.org

The Hansenula polymorpha (strain CBS4732) genome sequencing and analysis

Massoud Ramezani-Rad a,*, Cornelis P. Hollenberg a, Juergen Lauber b, Holger Wedler b, Eike Griess b, Christian Wagner c, Kaj Albermann c, Jean Hani c, Michael Piontek d, Ulrike Dahlems e, Gerd Gellissen e

a Institute for Microbiology, Heinrich-Heine University Düsseldorf, Universitätsstrasse 1, 40225 Düsseldorf, Germany
b Qiagen Genomic Services, Qiagen GmbH, Max-Volmer Strasse 4, 40724 Hilden, Germany
c Biomax Informatics AG, Lochhamer Strasse 11, 82152 Martinsried, Germany
d ARTES Biotechnology GmbH, Agnesstrasse 8, 45136 Essen, Germany
e Rhein Biotech GmbH, Eichsfelder Strasse 11, 40595 Düsseldorf, Germany

Received 28 December 2002; received in revised form 24 February 2003; accepted 13 March 2003
First published online 30 April 2003

Abstract

The methylotrophic yeast Hansenula polymorpha is a recognised model system for the investigation of peroxisomal function, of special metabolic pathways such as methanol metabolism and nitrate assimilation, and of thermostability. Strain RB11, an odc1 derivative of the H. polymorpha isolate CBS4732 (synonymous with ATCC34438, NRRL-Y-5445, CCY38-22-2), has been developed as a platform for heterologous gene expression. The scientific and industrial significance of this organism is now being met by the characterisation of its entire genome. The H. polymorpha RB11 genome consists of approximately 9.5 Mb and is organised as six chromosomes ranging in size from 0.9 to 2.2 Mb. Over 90% of the genome was sequenced with high accuracy and assembled into 48 contigs organised on eight scaffolds (supercontigs). After manual annotation, 5933 open reading frames (ORFs) were predicted, 4767 of which showed significant homology to entries in a non-redundant protein database. The remaining 1166 ORFs showed no significant similarity to known proteins.
The number of ORFs is comparable to that of other sequenced budding yeasts of similar genome size.

© 2003 Federation of European Microbiological Societies. Published by Elsevier B.V. All rights reserved.

Keywords: Yeast genomics; Sequence analysis; Hansenula polymorpha

* Corresponding author. Tel.: +49 (211) 311 3425; Fax: +49 (211) 311 5370. E-mail address: ramezani@uni-duesseldorf.de (M. Ramezani-Rad).

1. Introduction

Yeasts constitute an important group of industrial microorganisms. Its long tradition of human use and the extensive knowledge of its genetics and physiology have made baker's yeast, Saccharomyces cerevisiae, a eukaryotic model organism for basic research and industrial applications [1]. In 1996, it was the first eukaryotic organism for which the complete genome sequence was established [2]. The initial focus on S. cerevisiae has been extended by investigations of a range of alternative yeast species. As a consequence, the number of fully or partially sequenced budding yeast genomes has continued to grow. Among others, a comparative genomic exploration of 13 species selected from the hemiascomycetous yeasts was conducted [3].

The methylotrophic yeast Hansenula polymorpha (syn. Pichia angusta) is one of the most important industrially applied non-conventional yeasts [4,5]. H. polymorpha is a ubiquitous yeast species occurring naturally in spoiled orange juice, maize meal, the gut of various insect species and soil. It grows as white to cream, butyrous colonies and does not form filaments [6]. H. polymorpha isolates are homothallic, and reproduction occurs vegetatively by budding. H. polymorpha belongs to the fungal family Saccharomycetaceae, subfamily Saccharomycetoideae [6,7]. Most research has been performed with three basic strains designated H. polymorpha DL-1, CBS4732 and NCYC495. These strains are of independent origin and unclear relationship and exhibit different features, including different chromosome numbers.
Depending on strain and separation conditions, between two and seven chromosomes can be distinguished [8,9]. Strain CBS4732 (syn. ATCC34438, NRRL-Y-5445, CCY38-22-2) was originally isolated from soil irrigated with waste water from a distillery in Pernambuco, Brazil [10]. Its odc1 derivatives LR9 [11] and RB11 [12] have been developed as hosts for heterologous gene expression [12]. Recombinant compounds produced in these hosts include enzymes such as the feed additive phytase [13,14], anticoagulants such as hirudin and saratin [15–17] and an efficient vaccine against hepatitis B infection [18–20].

1567-1356/03/$22.00. doi:10.1016/S1567-1356(03)00125-9

The significance of H. polymorpha in basic research stems largely from studies focussed on peroxisome homeostasis [21] and nitrate assimilation [22]. Although much is known about the physiology, biochemistry and ultrastructure of this yeast (for a review see the monograph on H. polymorpha [4]), little information is available about its genomic structure and function [23]. Several groups worldwide initiated studies on its genome several years ago. As part of the comparative genome analysis of 13 hemiascomycetous yeasts mentioned above, part of the H. polymorpha (P. angusta) genome sequence was established using a partial random sequencing strategy with a coverage of 0.3 genome equivalents. Using this approach, about 3 Mb of raw sequencing data of the H. polymorpha genome was obtained [3]. We performed a genome analysis aimed at a higher coverage, using a BAC-to-BAC approach. This work has now culminated in the comprehensive genome analysis of this organism. A first description of the data generated is provided in this study. Access to the genome data can be granted upon request (G.G.)
and after signing a Material Transfer Agreement. Access has already been granted to six academic groups working on various aspects of the functional genomics of H. polymorpha.

The present paper describes the results of the sequencing and characterisation of 8.733 Mb assembled into 48 contigs. The sequence covers over 90% of the estimated total genome content of 9.5 Mb located on six chromosomes ranging in size between 0.9 and 2.2 Mb [23]. The established sequence contains 5933 ORFs.

2. Materials and methods

2.1. Construction of the genomic BAC library

For sequencing, H. polymorpha strain RB11, an odc1 derivative of wild-type strain CBS4732, was selected [12]. For the construction of the genomic BAC library of H. polymorpha, the vector pBACe3.6 was used and prepared according to Osoegawa et al. [24]. H. polymorpha cells from a 50 ml YPD (1% yeast extract, 2% peptone, 2% glucose) culture were washed twice with TSE buffer (25 mM Tris–HCl, 300 mM sucrose, 25 mM EDTA, pH 8) and resuspended in TSE buffer. Agarose plugs were then prepared from these cells according to the Bio-Rad manual of the CHEF DR II pulsed-field gel electrophoresis system (PFGE system) using 1.5% low melting point agarose. Pre-electrophoresis was carried out on a Bio-Rad PFGE system. Partial digestion of genomic DNA was carried out according to Osoegawa et al. [25] using Sau3AI for restriction. Gel electrophoresis was carried out on a Bio-Rad PFGE system according to the conditions given on Rod Wing's homepage (Clemson University, Genomics Institute, construction of BAC libraries protocol: 6 V cm⁻¹, 90 s pulse, 13°C, 18 h). Agarose digestion with gelase, ligation and transformation were carried out using the same protocol. Subsequent electroporation of DH10B cells (Invitrogen) was again carried out according to Osoegawa et al. [25], and bacteria were plated onto 2×YT plates supplemented with chloramphenicol as selecting agent.
Clones obtained from that procedure were picked and used to inoculate 1.2 ml of 2×YT supplemented with chloramphenicol. These bacterial cultures were used to prepare glycerol stocks in 96-well microtitre plate format as a resource for all subsequent work.

2.2. Construction of shotgun libraries from BAC DNA

Large-scale preparations of BAC DNA were carried out using the Large-Construct kit from Qiagen (Qiagen GmbH, Hilden, Germany; cat. no. 12462). After sonification and enzymatic repair of the ends, fragments of the desired size (usually 1.2–1.5 kb) were isolated from a 1% preparative agarose gel using the MinElute Gel Extraction kit (Qiagen, cat. no. 28604) and inserted into a SmaI-digested and alkaline phosphatase-treated pUC19 vector [26]. Ligation was carried out with the Rapid Ligation kit (Roche) according to the manufacturer's protocol. The ligation mixture was then desalted using a QIAquick kit (Qiagen, cat. no. 28304) according to the instructions of the supplier, with the exception of the elution step, which was carried out with ddH2O. One-tenth volume of the eluted DNA was used for transformation of competent Escherichia coli DH10B cells using a Genepulser II device (Bio-Rad). One millilitre of Luria–Bertani (LB) medium [26] was added and the cells were incubated for 1 h at 37°C. To determine the yield of recombinant clones, 1/200 and 1/20 volumes of the transformed cells were plated onto Petri dishes containing LB agar, ampicillin, X-Gal and isopropyl thiogalactose (IPTG) [26] and grown overnight at 37°C. Usually the transformation rate was greater than 10^8 transformants per µg vector DNA and the white:blue ratio was approximately 10:1 or better.

2.3. Plasmid preparation of shotgun clones

For subsequent DNA sequencing, plasmid DNA from white colonies was isolated after growth in 1.2 ml 2×YT cultures containing ampicillin for 24 h at 37°C with shaking at 220 rpm. Plasmid purification of shotgun clones was carried out using the REAL Prep 96 kit (Qiagen, cat. no. 26173).

2.4. DNA sequencing

DNA sequencing reactions were set up using BigDye Terminator v 2.0 cycle sequencing chemistry (Applied Biosystems, cat. no. 4314416) and purified using the DyeEx 96 kit (Qiagen, cat. no. 63183). Sequencing data were generated using ABI Prism 3700 sequence analyzers.

2.5. Sequence assembly

Base calling and quality checks were carried out using Phred [27]. Sequences were assembled with Phrap, and editing was performed after import into gap4. BAC assemblies and raw data were visualised and edited using the STADEN package (version 4.5; developed by Roger Staden et al.; http://www.mrc-lmb.cam.ac.uk/pubseq/staden_home.html).

2.6. Automated bioinformatic annotation

Fully automated annotation was carried out using the ConSequence™ software system provided by Qiagen (based on Pedant-Pro™ from Biomax Informatics AG) [28].

3. Results and discussion

3.1. Genome sequencing

A BAC library with approximately >17× coverage was constructed in pBACe3.6 and characterised by end sequencing and restriction digestion. Insert sizes of BAC clones ranged from below 50 kb to over 100 kb per clone. A total of 2880 BAC clones were generated with an average insert size of 65 kb. In all, 4892 BAC-end sequences were generated with an average read length of 483 bases (phred20). The BAC-end sequencing success rate was 85.5%. In total, 213 BAC clones were selected for analysis, of which 188 BACs representing the minimal tiling path were selected for shotgun sequencing, BAC-by-BAC. Sequencing coverage of BACs was 8.27-fold on average (Fig. 1). Of these BACs, 162 were assembled into a single contig, 15 into two contigs, 9 into three contigs and 2 into four contigs.

Fig. 1. Summary of sequencing statistics.

3.2. Genome assembly

The BAC library constructed covers the genome 18-fold.
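The phred20 read length quoted for the BAC-end sequences, and the Phred base calling described in Section 2.5, rest on the standard Phred relation Q = -10 log10(P) between a quality score Q and the estimated error probability P of a base call. The following Python sketch illustrates that relation only; the function names and the example quality values are hypothetical and are not taken from the sequencing pipeline used in this study.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Estimated probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def count_bases_at_or_above(qualities, threshold=20):
    """Number of bases in a read whose score meets the threshold (phred20 by default)."""
    return sum(1 for q in qualities if q >= threshold)


# Hypothetical per-base scores for a very short read (illustration only).
example_qualities = [35, 28, 19, 20, 8, 40]
phred20_bases = count_bases_at_or_above(example_qualities)

print(f"Q20 error probability: {phred_to_error_probability(20):.2%}")
print(f"bases at or above Q20: {phred20_bases} of {len(example_qualities)}")
```

Under this relation a phred20 base has an estimated 1% chance of being miscalled, which is why read lengths are commonly reported as the number of bases at or above that threshold.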
4892 BAC-end sequences from those clones yielded approximately 2.4 Mb of raw data, covering 25% of the genome (at 1×). On average, one BAC-end sequence is located every 2 kb on the genome, suggesting an estimated genome size of about 9.78 Mb. Pulsed-field gel electrophoresis of H. polymorpha RB11 chromosomes revealed six bands, and the sum of the molecular masses of the chromosomal DNA bands suggested a genome size of about 9–10 Mb [5] (Table 1).

Table 1
Overview of genome organisation and assembled sequences in supercontigs

Chromosome karyotype | Size (Mb)
I | 0.95
II | 1.25
III | 1.5
IV | 1.7
V | 1.9
VI | 2.2
Total: 6 | 9.5

Chromosome markers (as listed): URA3; CPY (PRC1); GAP; rDNA (5.8S, 18S, 26S); HARS1; PEP4 (PRA1); TPS1; MOX; FMD

Sequencing supercontig | Size (bp)
5 | 968 770
6 | 983 699
1 | 1 220 583
8 | 1 290 524
4 | 1 306 376
3 | 1 494 936
2 | 218 529
7 | 1 250 065
Total: 8 | 8 733 482

Fig. 2. Overview of supercontigs. The framed numbers within a stretch of BACs representing the respective supercontigs indicate the approximate size of a particular gap between neighbouring ends.

Mapping the end sequences onto the growing and eventually final genomic sequence showed a very even distribution of those end sequences with no local clustering, underlining the good random cloning of large genomic sub-fragments into this BAC library. The only exceptions were clones and end sequences falling into the rDNA region of the genome. No further large repetitive regions were noticed. Smaller repeat regions have all been resolved for each individual BAC. Further, no repeats that could confound a correct BAC-to-BAC assembly were found within BAC/BAC overlapping regions.
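The BAC-end arithmetic given earlier in this section, and the supercontig total in Table 1, can be reproduced with a few lines of Python. This is a minimal sketch using only figures quoted in the text; the variable names are illustrative, and the pairing of supercontig numbers with sizes assumes that the two columns of Table 1 are listed in the same order.

```python
# Figures quoted in the text and in Table 1.
BAC_END_READS = 4892
MEAN_READ_LENGTH_BP = 483        # average phred20 read length
MEAN_SPACING_BP = 2_000          # one BAC-end sequence every 2 kb
ESTIMATED_GENOME_BP = 9_500_000  # approximately 9.5 Mb

# Assumption: supercontig numbers and sizes are paired in listing order.
SUPERCONTIG_SIZES_BP = {
    5: 968_770, 6: 983_699, 1: 1_220_583, 8: 1_290_524,
    4: 1_306_376, 3: 1_494_936, 2: 218_529, 7: 1_250_065,
}

# Raw BAC-end data and the share of the genome it represents at 1x.
raw_bases = BAC_END_READS * MEAN_READ_LENGTH_BP
raw_fraction = raw_bases / ESTIMATED_GENOME_BP

# Genome size implied by the average spacing of BAC-end sequences.
size_from_spacing = BAC_END_READS * MEAN_SPACING_BP

# Unique assembled sequence and the share of the estimated genome it covers.
assembled_bases = sum(SUPERCONTIG_SIZES_BP.values())
assembled_fraction = assembled_bases / ESTIMATED_GENOME_BP

print(f"raw BAC-end data: {raw_bases / 1e6:.2f} Mb ({raw_fraction:.0%} of genome)")
print(f"size from spacing: {size_from_spacing / 1e6:.2f} Mb")
print(f"assembled: {assembled_bases:,} bp ({assembled_fraction:.1%} of genome)")
```

The spacing-based estimate is simply the number of end sequences multiplied by their average spacing, so it depends directly on the 2 kb figure and should be read as a rough cross-check of the electrophoretic estimate rather than an independent measurement.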
In addition to the BAC-to-BAC assembly based on overlapping regions, all BAC-end sequences with their forward/reverse constraints per clone, as well as sizing information for individual BAC clones, were used to layer a BAC map on top of the resulting assemblies. The consistency of the assembly was checked against that BAC map for each BAC/BAC overlap and assembly. No discrepancies were detected between any single BAC/BAC overlap assembly and the BAC map backbone.

Table 2
Comparison of the S. cerevisiae and H. polymorpha genomes

Feature | S. cerevisiae | H. polymorpha
Genome size (Mb) | 13.5 | ~9.5
Sequenced non-redundant genome length (bp) | 12 156 307 | 8 733 442
GC content (%) | 38.1 | 47.9
Number of ORFs (with similarities) | 6449 (5978) | 5933 (4767)
Average ORF distance (bp) | 1885 | 1472
Average protein length (aa) | 471 | 437
Number of tRNAs | 278 | 80

Fig. 3. Functional comparison of S. cerevisiae and H. polymorpha gene content (general functional categories).

The genome was assembled into 48 contigs, which could be logically joined, using clones physically bridging known gaps, into eight supercontigs with a unique total size of 8.733 Mb. These derive from the six known chromosomes, to which gene markers had been assigned after electrophoretic separation [5] (Table 1 and Fig. 2). Sequence overlaps between individual BACs with a total size of 1.521 Mb (approximately 15% of the total sequence generated) were used to measure the sequencing accuracy. It was determined to be 99.998%, or fewer than 1.75 errors in 100 kb. As the same technologies, expertise and work scheme were applied for all sequencing work, we conclude from this analysis that more than 90% of the total genome was sequenced with this high accuracy of 99.998%.
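The relation between the accuracy percentage and the error rate per 100 kb quoted above can be checked with a short Python sketch. The function and variable names are illustrative. The reading of "total sequence generated" as unique sequence plus BAC/BAC overlap is an assumption made here because it reproduces the approximately 15% figure; it is not stated explicitly in the text.

```python
def errors_per_100kb(accuracy_pct: float) -> float:
    """Expected number of erroneous bases in 100 kb at a given accuracy (%)."""
    return (1 - accuracy_pct / 100) * 100_000


def accuracy_percent(errors_in_100kb: float) -> float:
    """Accuracy (%) implied by a given number of errors per 100 kb."""
    return (1 - errors_in_100kb / 100_000) * 100


# Figures quoted in the text (Mb).
UNIQUE_SEQUENCE_MB = 8.733
OVERLAP_MB = 1.521

# Assumption: total sequence generated = unique sequence + overlap.
overlap_fraction = OVERLAP_MB / (UNIQUE_SEQUENCE_MB + OVERLAP_MB)

print(f"1.75 errors per 100 kb -> {accuracy_percent(1.75):.5f}% accuracy")
print(f"99.998% accuracy -> {errors_per_100kb(99.998):.2f} errors per 100 kb")
print(f"overlap share of total sequence: {overlap_fraction:.1%}")
```

An error rate of 1.75 per 100 kb corresponds to 99.99825% accuracy, which is consistent with the rounded value of 99.998% reported for the overlapping regions.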
The estimated 10% of the genome not yet sequenced includes telomeric regions, approximately 45–50 additional rDNA repeats (with a total of approximately 0.3 Mb only), and small gaps, some of which are indicated as boxes in Fig. 2. These results indicate that using end sequencing to map the BAC clones allowed high accuracy and eventual direct alignment onto the assembled genomic contigs, as well as sequence comparisons between all sequences obtained during the course of the project (BACs, but also shotgun sequences from three different shotgun libraries with inserts in the 1, 3 and 6–8 kb range).

3.3. Genome organisation

The Pedant-Pro™ Sequence Analysis Suite was used for gene identification. Out of the sequenced 8.73 Mb, 5933 ORFs have been extracted for proteins longer than 80 amino acids. ORFs whose sequence is entirely contained within another reading frame have been excluded from the analysis. Seventy shorter ORFs (<80 amino acids) with significant BLAST similarities have been extracted manually. 4767 ORFs show significant similarities to a non-redundant protein database. Out of the 4767 ORFs with similarities, 4109 showed significant similarity to ORFs from S. cerevisiae. The remaining 1166 ORFs have no significant similarities to known sequences. 410 ORFs are shorter than 100 amino acids. The numbers are not comparable due to different automatic gene-prediction methods and due to the different genomes. Only after an in-depth analysis will an evaluation of the number of questionable ORFs be possible, which may reduce the number of ORFs shorter than 100 amino acids. Calculation of the gene density and protein length, taking into account the gene numbers, showed an average ORF distance of 1472 bp and an average protein length of 437 amino acids. No experiments have been performed so far for the evaluation of these predicted numbers.

Introns have been identified by homology to known proteins and confirmed by using GeneWise [29]. In a preliminary analysis, 91 intron-containing genes were identified in this way. These include all genes identified previously [3] as intron-containing genes. Eighty tRNAs were identified, corresponding to all 20 amino acids. From approximately 50 rRNA clusters [5], seven clusters have been fully sequenced. Although representing only 10% of the estimated total number of rDNA repeats present in H. polymorpha, the seven fully sequenced rDNA repeats are completely identical and have a precise length of 5033 bp.

The main functional categories and their distribution in the gene set were automatically predicted as follows: transposable elements, 1%; energy, 5%; cellular communication, signal transduction mechanism, 6%; protein synthesis, 6%; cell rescue, defense and virulence, 9%; cellular transport and transport mechanisms, 12%; cell cycle and DNA processing, 12%; protein fate (folding, modification, destination), 12%; transcription, 14%; and metabolism, 23% (Fig. 3). Localisation was assigned to 2858 ORFs.

3.4. Comparison with S. cerevisiae sequences

The comparative genomic analysis of closely related organisms allowed us to identify species-specific genes and permitted us to estimate the rates of sequence divergence of the derived proteins. Comparing the genomic organisation of S. cerevisiae to that of H. polymorpha reveals differences and similarities at different levels (Table 2 and Figs. 3 and 4). The overall H. polymorpha genome exhibits a GC content of 47.9% compared to 38.1% found for the S. cerevisiae genome. The amino acid composition properties are essentially driven by GC content. The size of the genome of S. cerevisiae is 13.5 Mb (sequenced non-redundant genome length 12 156 kb) in comparison to the 9.5 Mb (sequenced non-redundant genome length 8733 kb) of H. polymorpha.

Fig. 4. Functional comparison of S. cerevisiae and H. polymorpha gene content (functional categories of metabolism).

For the comparison of H. polymorpha to S. cerevisiae we have used the MIPS comprehensive yeast genome database CYGD [30]. It includes 6449 genes. Out of these, 471 genes are marked as questionable. As the exact gene number of S. cerevisiae is still under debate [3] in the literature, we have taken all MIPS genes into account for the comparisons. S. cerevisiae contains 6449 ORFs with an average distance of 1885 bp in comparison to 5933 ORFs in H. polymorpha with an average distance of 1472 bp. The gene density in H. polymorpha appears higher than that in S. cerevisiae when correlating the number of ORFs in the two organisms with the size of the respective genomes.

An exhaustive synteny analysis has been performed between H. polymorpha and S. cerevisiae. It revealed clusters of up to eight syntenic proteins in both organisms. Six clusters were found to contain six syntenic proteins, two clusters were found to contain seven syntenic proteins and one cluster contains eight syntenic proteins (Table 3).

Table 3
Synteny analysis between H. polymorpha and S. cerevisiae

H.p. BAC | H.p. ORF | BLAST E value | S.c. ORF | S.c. Description | S.c. Chr.
cqbh_00 | orf129 | 7.00E-24 | ypr185w | APG13 – protein required for the autophagic process | 16
cqbh_00 | orf158 | 4.00E-57 | ypr186c | PZF1 – TFIIIA (transcription initiation factor) | 16
cqbh_00 | orf155 | 2.00E-39 | ypr187w | RPO26 – DNA-directed RNA polymerase I, II, III 18 kDa subunit | 16
cqbh_00 | orf135 | 0.0 | ypr189w | SKI3 – antiviral protein | 16
cqbh_00 | orf121 | 4.00E-69 | ypr190c | RPC82 – DNA-directed RNA polymerase III, 82 kDa subunit | 16
cqbh_00 | orf117 | 6.00E-50 | ypr191w | QCR2 – ubiquinol-cytochrome-c reductase 40 kDa chain II | 16
cqgr.00 | orf129 | 1.00E-42 | ylr403w | SFP1 – zinc finger protein | 12
cqgs.00 | orf143 | 1.00E-101 | ylr405w | similarity to Azospirillum brasilense nifR3 protein | 12
cqag_00 | orf148 | 3.00E-44 | ylr406c | RPL31B – 60S large subunit ribosomal protein L31.e.c12 | 12
cqhn.00 | orf161 | 4.00E-10 | ylr407w | hypothetical protein | 12
cqgr.00 | orf168 | 0.0 | ylr409c | strong similarity to Schizosaccharomyces pombe β-transducin | 12
cqhm.00 | orf177 | 0.0 | ylr410w | VIP1 – strong similarity to S. pombe protein Asp1p | 12
cqan_00 | orf362 | 2.00E-12 | yjr086w | STE18 – GTP-binding protein γ subunit of the pheromone pathway | 10
cqan_00 | orf357 | 5.00E-21 | yjr088c | weak similarity to S. pombe hypothetical protein SPBC14C8.18c | 10
cqan_00 | orf324 | 1.00E-123 | yjr090c | GRR1 – required for glucose repression and for glucose and cation transport | 10
cqan_00 | orf304 | 3.00E-77 | yjr091c | JSN1 – suppresses the high-temperature lethality of tub2-150 | 10
cqan_00 | orf248 | 1.00E-32 | yjr092w | BUD4 – budding protein | 10
cqan_00 | orf231 | 1.00E-14 | yjr093c | FIP1 – component of pre-mRNA polyadenylation factor PF I | 10
cqan_00 | orf230 | 1.00E-24 | yjr094w-a | RPL43B – 60S large subunit ribosomal protein | 10
cqga.00 | orf27 | 5.00E-38 | ygr091w | PRP31 – pre-mRNA splicing protein | 7
cqga.00 | orf19 | 1.00E-167 | ygr092w | DBF2 – ser/thr protein kinase related to Dbf20p | 7
cqga.00 | orf15 | 4.00E-46 | ygr093w | similarity to hypothetical S. pombe protein | 7
cqga.00 | orf30 | 0.0 | ygr094w | VAS1 – valyl-tRNA synthetase | 7
cqga.00 | orf42 | 3.00E-28 | ygr095c | RRP46 – involved in rRNA processing | 7
cqga.00 | orf56 | 7.00E-45 | ygr096w | similarity to bovine Graves disease carrier protein | 7
cqfq.00 | orf75 | 4.00E-30 | ygl191w | COX13 – cytochrome-c oxidase chain VIa | 7
cqfq.00 | orf72 | 1.00E-145 | ygl190c | CDC55 – ser/thr phosphatase 2A regulatory subunit B | 7
cqfq.00 | orf70 | 8.00E-33 | ygl189c | RPS26A – 40S small subunit ribosomal protein S26e.c7 | 7
cqfq.00 | orf68 | 4.00E-40 | ygl187c | COX4 – cytochrome-c oxidase chain IV | 7
cqfq.00 | orf64 | 5.00E-36 | ygl185c | weak similarity to dehydrogenases | 7
cqfq.00 | orf60 | 3.00E-87 | ygl184c | STR3 – strong similarity to Emericella nidulans and similarity to other cystathionine β-lyase and Cys3p | 7
cqav_00 | orf272 | 8.00E-24 | ygl111w | weak similarity to hypothetical protein S. pombe | 7
cqav_00 | orf276 | 2.00E-55 | ygl110c | similarity to hypothetical protein SPCC1906.02c S. pombe | 7
cqav_00 | orf330 | 1.00E-25 | ygl106w | MLC1 – Myo2p light chain | 7
cqav_00 | orf315 | 8.00E-69 | ygl105w | ARC1 – protein with specific affinity for G4 quadruplex nucleic acids | 7
cqav_00 | orf294 | 2.00E-62 | ygl103w | RPL28 – 60S large subunit ribosomal protein L27a.e | 7
cqav_00 | orf292 | 2.00E-14 | ygl102c | questionable ORF | 7
cqav_00 | orf263 | 1.00E-100 | ygl100w | SEH1 – nuclear pore protein | 7
cqbp_00 | orf17 | 2.00E-46 | ydr447c | RPS17B – ribosomal protein S17.e.B | 4
cqbp_00 | orf21 | 1.00E-125 | ydr448w | ADA2 – general transcriptional adapter or co-activator | 4
cqbp_00 | orf26 | 5.00E-67 | ydr449c | similarity to hypothetical protein S. pombe | 4
cqbp_00 | orf75 | 2.00E-63 | ydr450w | RPS18A – ribosomal protein S18.e.c4 | 4
cqbp_00 | orf68 | 5.00E-15 | ydr451c | YHP1 – strong similarity to Yox1p | 4
cqbp_00 | orf189 | 1.00E-135 | ydr452w | PHM5 – similarity to human sphingomyelin phosphodiesterase (PIR:S06957) | 4
cqaq_00 | orf216 | 2.00E-22 | ydr362c | TFC6 – TFIIIC (transcription initiation factor) subunit, 91 kDa | 4
cqaq_00 | orf202 | 2.00E-75 | ydr365c | weak similarity to Streptococcus M protein | 4
cqaq_00 | orf191 | 3.00E-23 | ydr367w | similarity to hypothetical protein SPAC26H5.13c S. pombe | 4
cqaq_00 | orf165 | 2.00E-96 | ydr372c | similarity to hypothetical S. pombe protein | 4
cqaq_00 | orf180 | 1.00E-139 | ydr375c | BCS1 – mitochondrial protein of the CDC48/PAS1/SEC18 (AAA) family of ATPases | 4
cqaq_00 | orf251 | 1.00E-104 | ydr380w | ARO10 – similarity to Pdc6p, Thi3p and to pyruvate decarboxylases | 4
cqaq_00 | orf245 | 1.00E-20 | ydr381w | YRA1 – RNA annealing protein | 4
cqaq_00 | orf236 | 2.00E-20 | ydr382w | RPP2B – 60S large subunit acidic ribosomal protein | 4
cqdw.p1 | orf217 | 7.00E-82 | ydr061w | similarity to E. coli modF and photorepair protein phrA | 4
cqdw.p1 | orf208 | 0.0 | ydr062w | LCB2 – serine C-palmitoyltransferase subunit | 4
cqdw.p1 | orf271 | 1.00E-28 | ydr067c | similarity to YNL099c | 4
cqdw.p1 | orf228 | 4.00E-90 | ydr069c | DOA4 – ubiquitin-specific protease | 4
cqdw.p1 | orf257 | 5.00E-34 | ydr071c | similarity to Ovis aries arylalkylamine N-acetyltransferase | 4
cqdw.p1 | orf250 | 7.00E-57 | ydr072c | IPT1 – mannosyl diphosphorylinositol ceramide synthase | 4

Overall, 80 nuclear tRNA genes were identified in the H. polymorpha genome sequence (Table 4), in comparison to S. cerevisiae, where 278 tRNA genes have been found. Despite these differences, both yeasts have nearly the same number of different tRNA species: 40 in H. polymorpha and 41 in S. cerevisiae. The lower number of tRNA genes in H. polymorpha is consistent with the tRNA analysis of RST sequences from Pichia sorbitophila [3], a close relative of H. polymorpha. One-third of the P. sorbitophila genome was found to contain only 23 nuclear tRNA genes. The estimated number for the complete P. sorbitophila genome (~70) is thus comparably low.

Table 4
Nuclear tRNA genes identified in the H. polymorpha genome

tRNA species | Anticodon | H. polymorpha | S. cerevisiae
tRNA-Ala | AGC | 3 | 11
tRNA-Ala | UGC | 2 | 5
tRNA-Arg | ACG | 2 | 6
tRNA-Arg | CCG | 1 | 1
tRNA-Arg | CCU | 1 | 1
tRNA-Arg | UCU | 3 | 11
tRNA-Asn | GUU | 2 | 10
tRNA-Asp | GUC | 3 | 16
tRNA-Cys | GCA | 1 | 4
tRNA-Gln | CUG | 1 | 1
tRNA-Gln | UUG | 2 | 9
tRNA-Glu | CUC | 3 | 2
tRNA-Glu | UUC | 2 | 9
tRNA-Gly | CCC | 0 | 2
tRNA-Gly | GCC | 4 | 16
tRNA-Gly | UCC | 2 | 3
tRNA-His | GUG | 2 | 7
tRNA-Ile | AAU | 4 | 13
tRNA-Ile | UAU | 1 | 2
tRNA-Leu | AAG | 1 | 13
tRNA-Leu | CAA | 3 | 10
tRNA-Leu | CAG | 1 | 0
tRNA-Leu | GAG | 0 | 1
tRNA-Leu | UAA | 1 | 7
tRNA-Leu | UAG | 1 | 3
tRNA-Lys | CUU | 3 | 14
tRNA-Lys | UUU | 2 | 7
tRNA-Met | CAU | 4 | 10
tRNA-Phe | GAA | 3 | 10
tRNA-Pro | AGG | 0 | 2
tRNA-Pro | UGG | 3 | 10
tRNA-Ser | AGA | 2 | 11
tRNA-Ser | CGA | 1 | 1
tRNA-Ser | GCU | 2 | 2
tRNA-Ser | UGA | 1 | 3
tRNA-Thr | AGU | 3 | 11
tRNA-Thr | CGU | 1 | 1
tRNA-Thr | UGU | 1 | 3
tRNA-Trp | CCA | 2 | 6
tRNA-Tyr | GUA | 2 | 8
tRNA-Val | AAC | 3 | 14
tRNA-Val | CAC | 1 | 2
Total | | 80 | 278
Different tRNAs | | 40 | 41

The relevant genes of the mating system and the pheromone signal transduction pathway that were identified are shown in Table 5. Data analyses indicate that H. polymorpha contains several genes attributed to the regulation of mating, such as STE3, STE6, GPA1, STE18, CDC42, STE50 and STE11. These data suggest that a conserved mitogen-activated protein kinase pathway might regulate mating in H. polymorpha. In addition, the data analyses indicate that H. polymorpha contains a gene that corresponds to the mating type regulatory protein gene at the HMR locus of Kluyveromyces lactis (HMRa1). The cryptic mating type loci like HMRa1 in S. cerevisiae and K. lactis act as reservoirs of mating type information in mating type switching in homothallic yeast strains. The function of this homologue in H. polymorpha remains unknown.

Table 5
Mating-specific genes in H. polymorpha

Hp_ORF | AA length | BLAST hit | AA length | BLASTP score | Function
BJ_37 | 215 | Kl_YCR097w | 126 | 154 | mating-type regulatory protein, silenced copy at HMR locus
BO_26 | 433 | Sc_STE3 | 470 | 509 | pheromone a-factor receptor
CA_130 | 1227 | Sc_STE6 | 1290 | 1167 | ATP-binding cassette transporter protein
BI_65 | 700 | Sc_STE11 | 738 | 690 | pheromone response
AG_50 | 398 | Sc_STE50 | 364 | 223 | pheromone response
AN_362 | 127 | Sc_STE18 | 110 | 126 | G protein γ subunit
AY_145 | 295 | Sc_GPA1 | 472 | 130 | G protein α subunit
AL_42 | 197 | Sc_CDC42 | 192 | 248 | G protein

Acknowledgements

Erika Wedler, Kathleen Balke, Nicole Lokmer, and Dörte Möstl are acknowledged for their excellent technical work during the entire DNA sequencing phase of the project.

References

[1] Joseph, R. (1999) Yeasts: production and commercial uses. In: Encyclopedia of Food Microbiology, Vol. 3 (Robinson, R.K., Batt, C.A. and Patel, P.D., Eds.), pp. 2335–2341. Academic Press, San Diego, CA.
[2] Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., Louis, E.J., Mewes, H.W., Murakami, Y., Philippsen, P., Tettelin, H. and Oliver, S.G. (1996) Life with 6000 genes. Science 274, 563–567.
[3] Feldmann, H. (Ed.) (2000) Génolevures. Genomic exploration of the hemiascomycetous yeasts. FEBS Lett. 487, 1–150.
[4] Gellissen, G. (2000) Heterologous protein production in methylotrophic yeasts. Appl. Microbiol. Biotechnol. 54, 741–750.
[5] Gellissen, G. (Ed.) (2002) Hansenula polymorpha - Biology and Applications. Wiley-VCH, Weinheim.
[6] Barnett, J.A., Payne, R.W. and Yarrow, D. (2000) Yeasts: Characteristics and Identification, 3rd edn. Cambridge University Press, Cambridge.
[7] Middelhoven, W.J. (2002) History, habitat, variability, nomenclature and phylogenetic position of Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 1–7.
Wiley-VCH, Weinheim.
[8] Marri, L., Rossolini, G.M. and Satta, G. (1993) Chromosome polymorphism among strains of Hansenula polymorpha. Appl. Environ. Microbiol. 59, 939–941.
[9] Lahtchev, K. (2002) Basic genetics of Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 8–20. Wiley-VCH, Weinheim.
[10] Morais, J.O.F. and Maia, M.H.D. (1959) Estudos de microorganismos encontrados em leitos de despéjos de caldas de destilarias de Pernambuco. II. Uma nova espécie de Hansenula, H. polymorpha. Anais de Escola Superior de Química, Universidade do Recife 1, 15–20.
[11] Roggenkamp, R., Hansen, H., Eckart, M., Janowicz, Z. and Hollenberg, C.P. (1986) Transformation of the methylotrophic yeast Hansenula polymorpha by autonomous replication and integration vectors. Mol. Gen. Genet. 202, 302–308.
[12] Suckow, M. and Gellissen, G. (2002) The expression platform based on H. polymorpha strain RB11 and its derivatives - history, status and perspectives. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 105–123. Wiley-VCH, Weinheim.
[13] Mayer, A.F., Hellmuth, K., Schlieker, H., Lopez-Ulibarri, R., Oertel, S., Dahlems, U., Strasser, A.W.M. and van Loon, A.P.G.M. (1999) An expression system matures: a highly efficient and cost-effective process for phytase production by recombinant strains of Hansenula polymorpha. Biotechnol. Bioeng. 63, 373–381.
[14] Papendieck, A., Dahlems, U. and Gellissen, G. (2002) Technical enzyme production and whole-cell biocatalysis: application of Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 255–271. Wiley-VCH, Weinheim.
[15] Avgerinos, G.C., Turner, B.G., Gorelick, K.J., Papendieck, A., Weydemann, U. and Gellissen, G. (2001) Production and clinical development of a Hansenula polymorpha-derived PEGylated hirudin. Sem. Thromb. Hemostas. 27, 357–371.
[16] Barnes, C.S., Krafft, B., Frech, M., Hofmann, U.R., Papendieck, A., Dahlems, U., Gellissen, G. and Hoylaerts, M.F. (2001) Production and characterization of saratin, an inhibitor of von Willebrand's factor-dependent platelet adhesion to collagen. Sem. Thromb. Hemostas. 27, 337–347.
[17] Bartelsen, O., Barnes, C.S. and Gellissen, G. (2002) Production of anticoagulants in Hansenula polymorpha. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 211–228. Wiley-VCH, Weinheim.
[18] Janowicz, Z.A., Melber, K., Merckelbach, A., Jacobs, E., Harford, N., Comberbach, M. and Hollenberg, C.P. (1991) Simultaneous expression of the S and L surface antigens of hepatitis B and formation of mixed particles in the methylotrophic yeast, Hansenula polymorpha. Yeast 7, 431–433.
[19] Schaefer, S., Piontek, M., Ahn, S.-J., Papendieck, A., Janowicz, Z.A. and Gellissen, G. (2001) Recombinant hepatitis B vaccines - characterization of the viral disease and vaccine production in the methylotrophic yeast, Hansenula polymorpha. In: Novel Therapeutic Proteins - Selected Case Studies (Dembowsky, K. and Stadler, P., Eds.), pp. 245–274. Wiley-VCH, Weinheim.
[20] Schaefer, S., Piontek, M., Ahn, S.-J., Papendieck, A., Janowicz, Z.A., Timmermans, I. and Gellissen, G. (2002) Recombinant hepatitis B vaccines - disease characterization and vaccine production. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 175–210. Wiley-VCH, Weinheim.
[21] Van der Klei, I.J. and Veenhuis, M. (2002) Hansenula polymorpha: a versatile model organism in peroxisome research. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 76–94. Wiley-VCH, Weinheim.
[22] Siverio, J.M. (2002) Biochemistry and genetics of nitrate assimilation. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 21–40. Wiley-VCH, Weinheim.
[23] Waschk, D., Klabunde, J., Suckow, M. and Hollenberg, C.P.
(2002) Characteristics of the Hansenula polymorpha genome. In: Hansenula polymorpha - Biology and Applications (Gellissen, G., Ed.), pp. 95–104. Wiley-VCH, Weinheim.
[24] Osoegawa, K., de Jong, P.J., Frengen, E. and Ioannou, P.A. (1999) Construction of bacterial artificial chromosome (BAC/PAC) libraries. Current Protocols in Human Genetics. 5.15.1–5.15.33.
[25] Osoegawa, K., Woon, P.Y., Zhao, B., Frengen, E., Tateno, M., Catanese, J.J. and de Jong, P.J. (1998) An improved approach for construction of bacterial artificial chromosome libraries. Genomics 52, 1–8.
[26] Sambrook, J., Fritsch, E.F. and Maniatis, T. (1989) Molecular Cloning, A Laboratory Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
[27] Ewing, B., Hillier, L., Wendl, M.C. and Green, P. (1998) Base-calling of automated sequencer traces using phred. Genome Res. 8, 175–194.
[28] Frishman, D., Albermann, K., Hani, J., Heumann, K., Metanomski, A., Zollner, A. and Mewes, H.W. (2001) Functional and structural genomics using PEDANT. Bioinformatics 17, 44–57.
[29] Birney, E. and Durbin, R. (2000) Using GeneWise in the Drosophila annotation experiment. Genome Res. 10, 547–548.
[30] Mewes, H.W., Frishman, D., Güldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Munsterkotter, M., Rudd, S. and Weil, B. (2002) MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 30, 31–34.