Notes from the July 1, 2009 conference call for the Cacao Genome Sequencing Project

Conference call attendees:

Ed Cahoon               University of Nebraska                               ecahoon2@unlnotes.unl.edu
Niina Haiminen          IBM                                                  nhaimin@us.ibm.com
David Kuhn              USDA-ARS SHRS, Miami, FL                             David.Kuhn@ars.usda.gov
Uilson Lopes            Mars Center for Cocoa Science (MCCS), Itabuna BR     uilson.lopes@effem.com
Dorrie Main             Washington State University                          dorrie@bioinform.wsu.edu
Greg May                NCGR                                                 gdm@ncgr.org
Keithanne Mockaitis     Indiana University                                   kmockait@cgb.indiana.edu
Juan Carlos Motamayor   Mars                                                 juan.motamayor@effem.com
Gerald Tuskan           JGI                                                  tuskanga@ornl.gov

I. Sequencing of the Matina 1-6 leaf transcriptome

Discussion started with Keithanne Mockaitis, who had done the Matina 1-6 RNA sequencing on the 454 FLX Titanium. The first run was assembled with Newbler, but a significant number of sequences (~500k of the total 1.1M) were put into a “repeat” category and not included in the assembly. [This first assembly is called “Matina 1-6 454 v1 052909 large contigs” on the BLAST database menu on the cacaogenomedb.org website.] Keithanne sequenced another half plate of the same Matina 1-6 RNA library and assembled that, but the repeat number is still very high (700k of 1.5M).

Assembly2 on the IU website
https://lims.cgb.indiana.edu/files/Cacao/
Username: Matina
Password: Tihi295b

I think Gerald Tuskan or Greg May suggested that Keithanne contact Erica Lundquist (sp?) at Roche and ask about the unusually high number of sequences in the repeat category.

Keithanne had taken the sequences in the repeat category and mapped them against the assembled sequences (~30,000 unigene contigs), and 65% of the repeat sequences mapped. Most of this was done just minutes before the conference call, so it’s not known yet whether the other 35% of the sequences in the repeat category represent novel sequences or something else.
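For reference, the read counts quoted above can be turned into fractions with a few lines of Python. This is only a back-of-envelope sketch: the counts are the rounded figures from the call, the function names are made up for illustration, and applying the 65% mapping rate to the 700k repeat reads of the second assembly is an assumption, since the notes do not say which repeat set was mapped.

```python
# Rounded read counts as quoted in the notes (not exact values).
ASSEMBLIES = {
    "assembly 1": {"total": 1_100_000, "repeat": 500_000},
    "assembly 2": {"total": 1_500_000, "repeat": 700_000},
}

# Fraction of repeat-category reads that mapped to the ~30,000 unigene contigs.
MAPPED_FRACTION = 0.65


def repeat_fraction(total_reads, repeat_reads):
    """Share of all reads that Newbler placed in the repeat category."""
    return repeat_reads / total_reads


def unmapped_repeat_reads(repeat_reads, mapped_fraction):
    """Repeat-category reads that did not map back to the assembled contigs."""
    return round(repeat_reads * (1 - mapped_fraction))


if __name__ == "__main__":
    for name, counts in ASSEMBLIES.items():
        frac = repeat_fraction(counts["total"], counts["repeat"])
        print(f"{name}: {frac:.1%} of reads in the repeat category")

    # Assumption: the 65% figure refers to the repeat reads of assembly 2.
    leftover = unmapped_repeat_reads(ASSEMBLIES["assembly 2"]["repeat"], MAPPED_FRACTION)
    print(f"repeat reads still unexplained: about {leftover:,}")
```

Under these assumptions, a little under half of the reads fall into the repeat category in both assemblies, and roughly a quarter of a million repeat reads remain to be explained.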
Suggestions were made to BLAST those sequences, or to try to map the Illumina GAII sequences (Greg May) onto the repeat category sequences. I’m not sure whether someone (Keithanne perhaps?) was going to pursue this.

Part of the reason I was so interested in the quality of the assembly of the Matina 1-6 leaf RNA sequences from the 454 sequencing run was that, in a previous conference call, we had discussed the need to have a single reference sequence set for the SNP discovery project. I thought we had agreed to use the Matina 1-6 leaf RNA sequence generated by Keithanne. I also thought that Dorrie had agreed to BLAST annotate those sequences and that Greg would use that for aligning the Illumina GAII sequence of the leaf transcriptome of diverse cacao cultivars that we were sending him from Miami. Therefore, I was worried that a lot of time would be wasted if the assembly were incorrect and everything had to be done again.

Isidore and Niina had used their special annotation tool to annotate the genomic DNA 454 sequence generated by Brian’s group in Stoneville and Tim Harkins at Roche 454 and, it turns out, that’s what Greg used to align the Illumina GAII sequence data for the Matina 1-6 RNA and 5 other cultivars.

So, here’s what I think would be good:

1. Niina uses Velvet or some other assembler to assemble all of the sequence data generated by Keithanne. This assembly is compared to the Newbler assembly of the same data (Assembly2 on the IU website). Whichever assembly is deemed the best will be the reference sequence set for the SNP discovery project.
2. Dorrie does a BLAST annotation of the reference sequence set and Niina does an IBM annotation of the reference sequence set.
3. Greg uses this annotated reference sequence set for the alignment of the Illumina GAII generated sequences from the genetically diverse cultivars of cacao for SNP discovery.
4. The annotated reference sequence set is available on the website both for BLAST search and for download.
5. If possible, the IBM annotation is made BLAST searchable, so that the annotated protein translations of the sequences can be searched directly with a protein or peptide query.
6. An updated cacao EST dataset consisting of all the ESTs on NCBI plus all the new 454 data on Matina 1-6 and the contigs from Illumina GAII is created on the website, so that we can get as much sequence information as possible in a single search for designing overgo probes to candidate genes.

II. De novo assembly of the cacao genome from primarily 454 data: Issues and strategies

Discussion then turned to the topic of whether it would be possible to assemble the cacao genome from a 15x shotgun coverage with 454 Titanium sequence. The following is some background information that may not have been clear to all those involved in the call, for which I apologize.

Previously, Niina and Isidore had done simulations that strongly suggested that it would not be possible to complete the cacao genome assembly just from 454 data. In response to these concerns, we had solicited a proposal from Chris Saski at CUGI to increase the number of Sanger BAC end sequences (BES) from 12,000 (6,000 BACs of the minimum tiling path, MTP) to 60,000 (30,000 BACs that would provide an increased number of ends per BAC in the MTP, so that it would be a pseudo 20kb paired end library). Increasing the amount of Sanger reads is believed to improve the chances of assembling the genome because Sanger reads are longer and have fewer errors.

In addition, Chris Saski suggested an assembly strategy that takes advantage of the physical map that CUGI has almost finished producing. Basically, the idea is to 454 sequence pools of BACs (30 – 50) representing about 3 Mb of sequence whose position and orientation are known from the physical map. Chris has proposed a pilot project where we will do one 3 Mb region.
If assembly is successful, the strategy would be scaled up to cover the whole genome (~150 pools), which would represent ~15 runs on the 454 (12 pools per run), approximately our original budgeted coverage. Niina and Isidore proposed a simulation of the pilot project sequencing using 3 Mb of rice genome sequence to test the CUGI assembly pipeline and determine the effect of introduced error on the efficiency of assembly.

OK, back to the conference call. Everyone agreed (or rather no one disagreed) with the idea of inviting Chris Saski to the Santa Fe meeting to present the pilot project proposal. I asked the group to consider what our strategy would be if the pilot project failed. Gerald Tuskan responded that:

1. The increase in Sanger BES was a good thing, no matter what, for completing the assembly.
2. If we had many more genetically mapped markers (which will come from both the overgos on the physical map and from the SNP discovery project), those marker sequences would help us to finish the assembly.
3. He would ask Jeremy Schutz (sp?) of Alpha Hudson what the cost of 4x coverage Sanger whole genome shotgun sequencing would be.

So, there appears to be a Plan C (of currently unknown cost) if Plan B does not pan out.

III. The Santa Fe meeting agenda

We finished up with discussion of the Santa Fe meeting agenda. I suggested we invite Chris Saski to present his proposal, and I also put myself in on Wednesday afternoon to try to sum up and get an idea of how far we’ve come, what problems have occurred since we started and how we are addressing them.

Cocoa Genome Meeting
August 4-5, 2009
National Center for Genome Resources, Santa Fe, New Mexico

Suggested changes to the agenda which came from the conference call are in blue.

Tues Aug 4: Project updates and discussion

Morning:
  BAC status, physical map (Kuhn)
  SNP development (May and Schnell)
  454 sequencing (Scheffler)

Afternoon:
  Update on potato genome – lessons for cacao (Visser)
  Genome assembly and annotation (Tuskan and Haiminen)
  Proposal for a pilot project to test assembly of genomic 454 DNA sequence of a 3 Mb region of the cacao genome from a minimum tiling path of BACs (Chris Saski, CUGI)
  Database and website (Main)

Wed Aug 5: Quality traits and project timelines

Morning:
  Defining flavor and fat composition targets for Mars (Jerome)
  Flavor evaluation of USDA clones – lessons learned (Seguine)
  Controlled fermentation for flavor evaluation (Schwann)
  Evaluations of TAG composition and interspecific hybrids (Lopes)
  High throughput TAG analysis (Cahoon)

Afternoon:
  Cacao genome sequencing project summary, issues and actions (Kuhn)
  Completing the genome – next 12 month timeline (Schnell)
  Tools for breeding community – next steps (Schnell)
  Proprietary quality trait development – next steps (Motamayor)