Supplemental Materials

Oil accumulation mechanisms of the oleaginous microalga Chlorella protothecoides revealed through its genome, transcriptomes and proteomes

Chunfang Gao1,#,$ Email: n1cfgao@sina.com
Yun Wang2,# Email: wangyun@genomics.cn
Yue Shen2,# Email: shenyue@genomics.cn
Dong Yan1 Email: severedong588@163.com
Xi He1 Email: 337979924@qq.com
Junbiao Dai1,* Email: jbdai@tsinghua.edu.cn
Qingyu Wu1,* Email: qingyu@tsinghua.edu.cn

1 MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing 100084, China
2 BGI-Shenzhen, Shenzhen 518083, China
# These authors contributed equally to this work.
$ Current address: Department of Criminal Science and Technology, People's Public Security University of China, Beijing 100038, China
* Corresponding authors:
Dr. Junbiao Dai, Center for Epigenetics and Chromatin, School of Life Sciences, Tsinghua University, Beijing 100084, China; Phone: +86-10-62796190; Fax: +86-10-62796190; E-mail: jbdai@tsinghua.edu.cn.
Dr. Qingyu Wu, School of Life Sciences, Tsinghua University, Beijing 100084, China; Phone: +86-10-62781825; Fax: +86-10-62781825; E-mail: qingyu@mail.tsinghua.edu.cn

[Figure 1 image: lanes Cp 0710 and Yeast; yeast marker bands (kb): 1900+1640, 1120+1100, 945, 915, 815+785+745, 680, 610, 550, 450, 375, 295, 225]

Supplemental Figure 1: Pulsed field gel electrophoresis of C. protothecoides chromosomes. Electrophoresis conditions were as follows: 1% agarose gel in 0.5× TBE buffer, with the pulse time ramped from 47 sec to 188 sec, for 18 h at 5 V/cm.

[Figure 2 image: x-axis Depth (X), 0 to 40; y-axis Percentage (%), 0 to 5]

Supplemental Figure 2: 17-mer estimation of genome size. The genome size of C. protothecoides sp. 0710 was estimated to be 27.6 Mb based on reads from the short-insert-size library.
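The 27.6 Mb estimate is consistent with the usual k-mer relation, genome size ≈ total k-mer count / peak k-mer depth, applied to the values reported in Supplemental Table 1 (K_num = 414,205,607; K_depth = 15). The source does not state the formula, so the sketch below assumes that relation; the function name is illustrative only.

```python
def estimate_genome_size(kmer_count: int, peak_depth: int) -> int:
    """Estimate genome size (bp) as total k-mer count divided by peak k-mer depth.

    Assumption: this simple ratio is the relation behind Supplemental Table 1;
    the supplement itself does not spell out the formula.
    """
    if peak_depth <= 0:
        raise ValueError("peak_depth must be positive")
    return round(kmer_count / peak_depth)


# Values copied from Supplemental Table 1 (17-mers).
K_NUM = 414_205_607
K_DEPTH = 15

size_bp = estimate_genome_size(K_NUM, K_DEPTH)
print(f"Estimated genome size: {size_bp:,} bp ({size_bp / 1e6:.1f} Mb)")
```

With the Table 1 inputs, the ratio reproduces the tabulated Genome_size value of 27,613,707 bp.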
[Figure 3 image: bar chart titled "Comparison of GO classification". GO terms are grouped into Cellular Component, Molecular Function and Biological Process; the axes are "Percent of genes" and "Number of genes"; the legend lists CPRO, CVAR and CSUB.]

Supplemental Figure 3: Comparison of genes among C. protothecoides sp. 0710, C. variabilis NC64A and C. subellipsoidea C-169. Every gene from each genome was annotated and clustered based on GO. Each bar represents the number of genes, and the species are color-coded.

Supplemental Figure 4: The gene families of four sequenced green algae. C. protothecoides, C. variabilis NC64A, C. subellipsoidea C-169 and C. reinhardtii were compared. The Venn diagram shows the shared and unique gene families. Numbers of gene families are indicated in black. CPRO: C. protothecoides; CVAR: C. variabilis NC64A; CSUB: C. subellipsoidea C-169; CREI: C. reinhardtii.

Supplemental Figure 5: The phylogenetic tree of H+/hexose cotransporter homologs in the seven green algae with sequenced genomes (C. protothecoides, C. variabilis NC64A, C. subellipsoidea C-169, C. reinhardtii, V. carteri, M. pusilla CCMP1545 and O. tauri). The tree was generated by the Neighbor-Joining method in MEGA 4 with the bootstrap set to 1000. All the homologs were categorized into three classes. Two of the branches contained proteins from all seven algae, and these were defined as common proteins. The third branch, which included the three HUP proteins, was present only in three Chlorella species and the related alga C. subellipsoidea C-169; this group was defined as the HUP-like proteins. This kind of protein may be restricted to Chlorella. (The Study Accession URL: http://purl.org/phylo/treebase/phylows/study/TB2:S15529)

Supplemental Figure 6: Sequence comparison of HUP1 and the three HUP-like proteins (Cpr004256.1, Cpr001753.1 and Cpr003452.1) in C. protothecoides. The 12 transmembrane helices in Cpr004256.1, Cpr001753.1 and Cpr003452.1 were predicted according to the structure of HUP1. The 6 conserved amino acids responsible for hexose recognition are labeled in red.

[Figure 7 image: workflow diagram. Lanes A and H, 100 μg per sample; 23 cuts per lane; trypsin digestion; LC/MS/MS (LTQ Orbitrap Velos); SEQUEST analysis program with the protein database of C. protothecoides; 1,931 proteins identified; 674 proteins (PSM ≥ 5); comparison of PSM and area, changes ≥ 1.5-fold; up-regulated: 205; down-regulated: 293.]

Supplemental Figure 7: The main process in the proteomic study of autotrophic and heterotrophic C. protothecoides.
After one-dimensional protein separation, each lane was cut into 23 strips, each of which was digested in-gel with trypsin separately. Every sample was then analyzed by LC/MS/MS (LTQ Orbitrap Velos). For protein identification, the 23 results of each lane were combined and searched against the protein database of C. protothecoides with SEQUEST. As a result, 1,931 proteins were identified in autotrophic and heterotrophic cells. The PSM and area values from protein identification were used for the differential expression analysis. In heterotrophic cells, 205 proteins were up-regulated (both PSM and area increased ≥ 1.5-fold) and 293 proteins were down-regulated (both PSM and area decreased ≥ 1.5-fold).

Supplemental Figure 8: Comparison of the transcriptome and proteome in autotrophic and heterotrophic cells. ARNA and HRNA: transcriptomic gene expression in autotrophic and heterotrophic cells, respectively (coverage >=50% or rpkm>=10); APROT and HPROT: proteomic gene expression in autotrophic and heterotrophic cells, respectively (PSM>=5).

Supplemental Figure 9: The functional classification of the genes differentially expressed in the comparative transcriptomic analysis of autotrophic and heterotrophic C. protothecoides. After mRNA sequencing, gene expression levels under autotrophic and heterotrophic conditions were compared, and genes with log2 (fold-change) >1 and P < 0.01 were defined as significantly changed. As a result, 984 genes were up-regulated and 1136 genes were down-regulated in heterotrophic cells. All of these genes were classified into 9 clusters according to KEGG metabolic pathway annotation.

[Figure 10 image: panel A, columns Nile Red, Chlorophyll, Merge and Bright-field, rows Autotrophic and Heterotrophic; panel B, structures labeled CP and LB.]

Supplemental Figure 10: The elimination of the chloroplast after transition from autotrophic to heterotrophic growth. (A) Visualization of the chloroplast or lipid body by confocal microscopy in autotrophic (upper) and heterotrophic (lower) cells.
Red: Nile Red fluorescence, showing the lipid body. Green: chlorophyll autofluorescence, pseudocolored in green, indicating the presence of the chloroplast. (B) The ultrastructure of C. protothecoides cultivated under autotrophic and heterotrophic conditions, observed by transmission electron microscopy (TEM). The autotrophic cell (left) contained a cup-like chloroplast (CP). The heterotrophic cell (right) contained a large lipid body (LB).

Supplemental Figure 11: The heat map of PFAM domains with biased distribution in C. protothecoides. The numbers in C. reinhardtii were used as the reference and are indicated in black. Red and green indicate higher or lower numbers of PFAM domains in the different algae. CPRO: C. protothecoides; CVAR: C. variabilis NC64A; CSUB: C. subellipsoidea C-169; CREI: C. reinhardtii; VCAR: V. carteri; MPUS: M. pusilla CCMP1545; OTAR: O. tauri.

Supplemental Table 1: Chlorella protothecoides sp. 0710 17-mer statistics

Species             K    K_num         K_depth   Genome_size   X
C. protothecoides   17   414,205,607   15        27,613,707    16.70

Supplemental Table 2: Statistics of the completeness of the genome based on 248 CEGs

             #Prots   %Completeness   -   #Total   Average   %Ortho
Complete     225      90.73           -   274      1.22      19.56
  Group 1    56       84.85           -   62       1.11      8.93
  Group 2    51       91.07           -   64       1.25      21.57
  Group 3    58       95.08           -   75       1.29      25.86
  Group 4    60       92.31           -   73       1.22      21.67
Partial      236      95.16           -   300      1.27      23.31
  Group 1    60       90.91           -   71       1.18      15
  Group 2    53       94.64           -   73       1.38      28.3
  Group 3    59       96.72           -   77       1.31      27.12
  Group 4    64       98.46           -   79       1.23      23.44

"Prots" indicates the number of the 248 ultra-conserved CEGs present in the genome; "%Completeness" indicates the percentage of the 248 ultra-conserved CEGs present; "Total" indicates the total number of CEGs present, including putative orthologs; "Average" indicates the average number of orthologs per CEG; "%Ortho" indicates the percentage of detected CEGs that have more than 1 ortholog.

Supplemental Table 3: Repeats in the genome of Cp0710 with combined approaches.

Type           Repeat size (bp)   % of genome
TRF            431,366            1.8817
RepeatMasker   544,464            2.3750
ProteinMask    564,023            2.4603
De novo        1,028,311          4.4856
Total a        1,397,655          6.0967

a The overlaps between repeats were excluded before the calculation.

Supplemental Table 4: Results of repeat prediction without TRF.

                 RepeatMasker            ProteinMasker           De novo                 Combined a
Type             Length (bp)  % genome   Length (bp)  % genome   Length (bp)  % genome   Length (bp)  % genome
DNA              5,342        0.0233     6,327        0.0276     84,696       0.3695     96,207       0.4197
LINE             4,431        0.0193     96           0.0004     84,183       0.3672     88,710       0.3870
SINE             642          0.0028     0            0.0000     0            0.0000     642          0.0028
LTR              15,007       0.0655     14,494       0.0632     41,173       0.1796     66,527       0.2902
Satellite        851          0.0037     449          0.0020     2,919        0.0127     2,921        0.0127
Simple repeat    140,139      0.6113     139,387      0.6080     153,278      0.6686     197,878      0.8632
Low complexity   379,791      1.6567     330,534      1.4418     373,521      1.6293     399,048      1.7407
Tandem Repeat    0            0          73,073       0.3188     0            0.0000     73,073       0.3188
Unknown b        210          0.0009     0            0.0000     302,440      1.3193     302,650      1.3202
Total            544,464      2.3750     564,023      2.4603     1,028,311    4.4856     1,123,650    4.9015

a The length of repeats was calculated after overlaps had been excluded.
b This refers to the repeats that cannot be classified by RepeatMasker.

Supplemental Table 5: General statistics of predicted protein-coding genes

                           De novo                            Homolog
                           Augustus   SNAP      GlimmerHMM    CREI      CVAR      CSUB      GLEAN
Total number of genes      5473       7446      20068         4302      5166      4782      6247
Average length of mRNA     2960.31    2379.82   1068.94       1553.34   1948.45   1714.04   2598.03
Average length of CDS      1413.21    1137.91   937.18        888.11    1030.37   950.78    1292.44
Average number of exons    7.30       6.35      3.10          3.91      4.65      4.23      6.28
Average length of exon     193.66     179.10    301.90        227.32    221.67    224.60    205.73
Average length of intron   245.67     231.99    62.61         228.85    251.65    236.07    247.17
Total number of exons      39939      47307     62297         16807     24013     20243     39245
Total number of introns    34466      39861     42229         12505     18847     15461     32998

CREI: Chlamydomonas reinhardtii; CVAR: Chlorella variabilis NC64A; CSUB: Coccomyxa subellipsoidea C-169.
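The counts in Supplemental Table 5 can be cross-checked against each other. If each gene model with n exons contributes n − 1 introns, the total intron count should equal total exons minus total genes, and the average number of exons should equal exons divided by genes. The sketch below runs that check on the values in the table. It rests on two assumptions that the supplement does not state: one transcript per gene model, and the first two numeric columns belonging to Augustus and SNAP in that order. The function names are illustrative.

```python
# (genes, exons, introns, reported average exons per gene), copied from Supplemental Table 5.
# Assumption: the first two de novo columns are Augustus and SNAP, in that order.
GENE_SETS = {
    "Augustus":   (5473, 39939, 34466, 7.30),
    "SNAP":       (7446, 47307, 39861, 6.35),
    "GlimmerHMM": (20068, 62297, 42229, 3.10),
    "CREI":       (4302, 16807, 12505, 3.91),
    "CVAR":       (5166, 24013, 18847, 4.65),
    "CSUB":       (4782, 20243, 15461, 4.23),
    "GLEAN":      (6247, 39245, 32998, 6.28),
}


def expected_introns(genes: int, exons: int) -> int:
    """Introns implied by the exon count if every gene with n exons has n - 1 introns."""
    return exons - genes


def mean_exons_per_gene(genes: int, exons: int) -> float:
    """Average number of exons per gene."""
    return exons / genes


def is_consistent(genes: int, exons: int, introns: int, reported_mean: float) -> bool:
    """True if the intron total and the reported mean agree with the gene and exon totals."""
    introns_ok = expected_introns(genes, exons) == introns
    mean_ok = abs(mean_exons_per_gene(genes, exons) - reported_mean) < 0.005
    return introns_ok and mean_ok


for name, row in GENE_SETS.items():
    status = "consistent" if is_consistent(*row) else "inconsistent"
    print(f"{name:<11} {status}")
```

A check like this is mainly useful for catching column shifts when a flattened table is rebuilt, since a misplaced value breaks the exon, intron and gene arithmetic.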
Supplemental Table 6: Statistics of functional annotation

Type            Database    Gene   Percentage (%)
Total gene                  7039   100
Annotated       TrEMBL      5725   81.33
                Swissprot   4537   64.46
                KEGG        3910   55.55
                InterPro    4599   65.34
                GO          3559   50.56
All annotated               5831   82.87

Supplemental Table 7: Proteins involved in nitrogen transport and assimilation in green algae a.

Reference proteins b

Transporter
XP_001694496.1   nitrate transporter
XP_001694067.1   nitrite transporter
XP_001701575.1   ammonium transporter
XP_001691580.1   urea active transporter
XP_001694885.1   amino acid transporter

Assimilation
XP_001696697.1   nitrate reductase
XP_001696787.1   nitrite reductase
NP_176922.1 c    urease
NP_173602.1 c    urease accessory protein F
AAD16984.1 c     urease accessory protein UREG
NP_850239.1 c    urease accessory protein D
XP_001692927.1   glutamine synthetase
XP_001693082.1   glutamate synthase
XP_001702270.1   glutamate dehydrogenase
XP_001703658.1   nitrogen regulatory protein PII

Urea Cycle
XP_001690709.1   carbamoyl phosphate synthase, large subunit
XP_001690929.1   ornithine carbamoyltransferase
XP_001696749.1   argininosuccinate synthase
CAA34615.1       argininosuccinate lyase
NP_192629.1 c    arginase

MPUS OTAR VCAR CSUB CVAR CPRO
XP_003057942.1 XP_003081529.1 EFJ43737.1 EIE23748.1 EFN52690.1 XP_003058319.1 XP_003081525.1 EFJ43209.1 EIE18297.1 EFN58263.1 Cpr000340.1 XP_003063809.1 XP_003084401.1 EFJ40601.1 EIE23179.1 EFN53204.1 Cpr001664.1 XP_003083319.1 EFJ41618.1 EIE20547.1 EFN55634.1 XP_003074291.1 EFJ48238.1 EIE22399.1 EFN60084.1 Cpr002375.1 XP_003058323.1 XP_003081526.1 EFJ43675.1 EIE21865.1 EFN52691.1 Cpr000877.1 XP_003057941.1 XP_003081527.1 EFJ43735.1 EIE21866.1 EFN52613.1 Cpr001933.9 XP_003083318.1 XP_003083320.1 XP_003083317.1 XP_003078095.1 EFN50428.1 XP_003057550.1 XP_003074553.1 EFJ51602.1 EIE23502.1 EFN56917.1 Cpr003038.1 XP_003057676.1 XP_003083015.1 EFJ40691.1 EIE24001.1 EFN59782.1 Cpr004691.1 EFJ45751.1 EIE23148.1 EFN55208.1 Cpr003571.2 EFN50797.1 Cpr004333.1 XP_003062922.1 EFJ41943.1 XP_003058904.1 XP_003080815.1 EFJ40705.1 EIE20839.1 EFN52062.1 Cpr002904.1 XP_003056510.1 XP_003074205.1 EFJ50768.1 EIE22039.1 EFN58131.1 Cpr003449.1 XP_003063309.1 XP_003083741.1 EFJ48533.1 EIE18346.1 EFN55601.1 Cpr002598.3 XP_003055265.1 XP_003082446.1 EFJ46826.1 EIE22869.1 EFN52305.1 Cpr003750.5

a Only the one protein with the highest score in BlastP (E-value<1E-5) is listed.
b Most of the reference proteins used for BlastP were proteins of C. reinhardtii.
c The proteins of Arabidopsis thaliana were used when they were not found in C. reinhardtii.
MPUS: Micromonas pusilla CCMP1545; OTAR: Ostreococcus tauri; VCAR: Volvox carteri; CSUB: Coccomyxa subellipsoidea C-169; CVAR: Chlorella variabilis NC64A; CPRO: Chlorella protothecoides sp. 0710.

Supplemental Table 8: H+/hexose cotransporters in Chlorella kessleri and their homologs in green algae (BlastP, E-value<1E-5, Alignment length>30%)

Green algae   Homologs
CKES          P15686.2 Q39524.1 Q39525.1
CPRO          Cpr004256.1 Cpr001753.1 Cpr003452.1 Cpr002964.1 Cpr003677.1 Cpr003720.1 Cpr003252.1 Cpr003700.3 Cpr005023.1
CVAR          EFN53774.1 EFN53666.1 EFN50678.1 EFN60010.1 EFN50549.1 EFN54539.1 EFN53468.1 EFN59043.1 EFN52027.1 EFN59533.1 EFN58991.1 EFN55620.1 EFN50679.1
CSUB          EIE21809.1 EIE25646.1 EIE22313.1 EIE22314.1 EIE19027.1 EIE27221.1 EIE20603.1 EIE22323.1 EIE25022.1 EIE21778.1 EIE25408.1 EIE22371.1 EIE26813.1 EIE26164.1 EIE25526.1 EIE20660.1 EIE21964.1 EIE20094.1 EIE20926.1
CREI          XP_001693177.1 XP_001701103.1
VCAR          EFJ48518.1 EFJ42942.1
MPUS          XP_003063909.1 XP_003062688.1 XP_003054757.1 XP_003062182.1 XP_003059968.1 XP_003059519.1
OTAR          XP_003082978.1 XP_003078139.1 XP_003077948.1 XP_003080102.1

CKES: Chlorella kessleri; CPRO: Chlorella protothecoides sp. 0710; CVAR: Chlorella variabilis NC64A; CSUB: Coccomyxa subellipsoidea C-169; CREI: Chlamydomonas reinhardtii; VCAR: Volvox carteri; MPUS: Micromonas pusilla CCMP1545; OTAR: Ostreococcus tauri.

Supplemental Table 9: Transcriptome sequencing data statistics.
Culture condition         Autotrophy   Heterotrophy
Total raw reads (M)       43.7         40.9
Total raw base (G)        3.93         3.68
Total clean reads (M)     40.2         37.9
Total clean base (G)      3.62         3.41
Reads map to genome (M)   32.4         30.8
Mapping ratio (%)         80.7         81.2

Supplemental Methods

Pulsed Field Gel Electrophoresis

Pulsed field gel electrophoresis (PFGE) was carried out according to Blanc et al.1, with slight modifications. The algal cells were harvested from 4-day-old cultures by centrifugation at 5,000×g for 5 minutes. Approximately 0.25 ml of cell pellet was resuspended in 200 μl deionized water, mixed with 2% low melting point agarose in 100 mM EDTA at 42°C, poured into plug molds (Bio-Rad), and placed at 4°C for about 20-30 minutes to harden the agarose. Once solidified, all the plugs were incubated with solution V (500 mM EDTA, 10 mM Tris, pH 7.5) overnight at 37°C. On the next day, 400 μl of 5 mg/ml proteinase K in 500 mM EDTA (pH 7.5) was added to the tube, which was gently inverted a few times to homogenize the solution. The plugs were then incubated overnight and washed with deionized water. Subsequent washes were done using 1× TE in the cold room overnight. Electrophoresis was performed under the following conditions: 1% agarose gel in 0.5× TBE buffer, with the pulse time ramped from 47 sec to 188 sec, for 18 h at 5 V/cm. Saccharomyces cerevisiae chromosomes were used as the PFGE markers (225 to 1900 kb).

Supplemental References

1. Blanc, G. et al. The genome of the polar eukaryotic microalga Coccomyxa subellipsoidea reveals traits of cold adaptation. Genome Biology 13, R39 (2012).