Cocoa Genome Meeting - Cacao Genome Database

Notes from the July 1, 2009 conference call for the Cacao Genome Sequencing Project
Conference call attendees:
Ed Cahoon, University of Nebraska, ecahoon2@unlnotes.unl.edu
Niina Haiminen, IBM, nhaimin@us.ibm.com
David Kuhn, USDA-ARS SHRS, Miami, FL, David.Kuhn@ars.usda.gov
Uilson Lopes, Mars Center for Cocoa Science (MCCS), Itabuna BR, uilson.lopes@effem.com
Dorrie Main, Washington State University, dorrie@bioinform.wsu.edu
Greg May, NCGR, gdm@ncgr.org
Keithanne Mockaitis, Indiana University, kmockait@cgb.indiana.edu
Juan Carlos Motamayor, Mars, juan.motamayor@effem.com
Gerald Tuskan, JGI, tuskanga@ornl.gov
I. Sequencing of the Matina 1-6 leaf transcriptome
Discussion started with Keithanne Mockaitis, who had done the Matina 1-6 RNA sequencing
on the 454 GS FLX Titanium. The first run was assembled with Newbler, but a
significant number of sequences (~500k of the total 1.1M) were put into a “repeat”
category and not included in the assembly. [This first assembly is called “Matina 1-6 454
v1 052909 large contigs” on the BLAST database menu on the cacaogenomedb.org website.]
Keithanne sequenced another half plate of the same Matina 1-6 RNA library and assembled
that, but the repeat number is still very high (700k of 1.5M).
Assembly2 on the IU website https://lims.cgb.indiana.edu/files/Cacao/
Username: Matina
Password: Tihi295b
I think Gerald Tuskan or Greg May suggested that Keithanne contact Erica Lundquist (sp?)
at Roche and ask about the unusually high number of sequences in the repeat category.
Keithanne had taken the sequences in the repeat category and mapped them against the
assembled sequences (~30,000 unigene contigs) and 65% of the repeat sequences mapped.
Most of this was done just minutes before the conference call, so it’s not yet known
whether the other 35% of the sequences in the repeat category represent novel sequences
or something else. Suggestions were made to BLAST those sequences, or to try to map the
Illumina GAII sequences (Greg May) onto the repeat category sequences. I’m not sure
whether anyone (Keithanne perhaps?) was going to pursue this.
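The read counts quoted above can be put on a common scale with a few lines of Python. This is only a sketch: the counts are the rounded figures from these notes, and applying the 65% mapping rate to the 700k repeat reads of the second assembly is an assumption, since the notes do not say which repeat set was mapped.

```python
# Rounded read counts quoted in the notes.
ASSEMBLIES = {
    "first assembly": {"total": 1_100_000, "repeat": 500_000},
    "second assembly": {"total": 1_500_000, "repeat": 700_000},
}


def repeat_fraction(total, repeat):
    """Fraction of reads placed in the Newbler "repeat" category."""
    return repeat / total


for name, counts in ASSEMBLIES.items():
    share = repeat_fraction(**counts)
    print(f"{name}: {share:.0%} of reads in the repeat category")

# Assumption: the 65% mapping rate applies to the 700k repeat reads.
MAPPED_PERCENT = 65
unexplained = ASSEMBLIES["second assembly"]["repeat"] * (100 - MAPPED_PERCENT) // 100
print(f"repeat reads not mapping to the unigene contigs: about {unexplained:,}")
```

On these figures the repeat share is roughly 45% to 47% in both assemblies, and about 245,000 repeat reads remain unexplained after mapping.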
Part of the reason I was so interested in the quality of the assembly of the Matina 1-6 leaf
RNA sequences from the 454 sequencing run was that, in a previous conference call, we
had discussed the need to have a single reference sequence set for the SNP discovery
project. I thought we had agreed to use the Matina 1-6 leaf RNA sequence generated by
Keithanne. I also thought that Dorrie had agreed to BLAST annotate those sequences and
that Greg would use that for aligning the Illumina GAII sequence of the leaf transcriptome
of diverse cacao cultivars that we were sending him from Miami. Therefore, I was worried
that a lot of time would be wasted if the assembly were incorrect and everything had to be
done again. Isidore and Niina had used their special annotation tool to annotate the
genomic DNA 454 sequence generated by Brian’s group in Stoneville and Tim Harkins at
Roche 454 and, it turns out, that’s what Greg used to align the Illumina GAII sequence data
for the Matina 1-6 RNA and 5 other cultivars. So, here’s what I think would be good:
1. Niina uses Velvet or some other assembler to assemble all of the sequence data
generated by Keithanne. This assembly is compared to the Newbler assembly of the same
data (Assembly2 on the IU website). Whichever assembly is deemed the best will be the
reference sequence set for the SNP discovery project.
2. Dorrie does a BLAST annotation of the reference sequence set and Niina does an IBM
annotation of the reference sequence set.
3. Greg uses this annotated reference sequence set for the alignment of the Illumina GAII
generated sequences from the genetically diverse cultivars of cacao for SNP discovery.
4. The annotated reference sequence set is available on the website both for BLAST search
and for download.
5. If possible, the IBM annotation is made BLAST searchable, so that the annotated protein
translations of the sequences can be searched directly with a protein or peptide query.
6. An updated cacao EST dataset, consisting of all the ESTs on NCBI plus all the new 454
data on Matina 1-6 and the contigs from Illumina GAII, is created on the website, so that
we can get as much sequence information as possible in a single search for designing
overgo probes to candidate genes.
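Step 1 above leaves open how the "best" assembly would be judged. One common first-pass summary is contig count, total length and N50; the sketch below computes these for two lists of contig lengths. It is an illustration only: the notes do not name a comparison criterion, the function names are hypothetical, and the contig lengths are made-up placeholders, not values from the Newbler or Velvet assemblies. For a transcriptome, N50 is a crude measure and would normally be read alongside other checks.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def summarize(lengths):
    """Basic size statistics for one assembly."""
    return {
        "contigs": len(lengths),
        "total_bp": sum(lengths),
        "longest": max(lengths, default=0),
        "n50": n50(lengths),
    }


# Placeholder contig lengths (hypothetical, for illustration only).
assembly_a = [2, 2, 2, 3, 3, 4, 8, 8]
assembly_b = [100, 200, 300]

for label, lengths in (("assembly A", assembly_a), ("assembly B", assembly_b)):
    print(label, summarize(lengths))
```

In practice the length lists would be read from the two assemblies' FASTA files rather than typed in.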
II. De novo assembly of the cacao genome from primarily 454 data: Issues and
strategies
Discussion then turned to the topic of whether it would be possible to assemble the cacao
genome from a 15x shotgun coverage with 454 Titanium sequence. The following is some
background information that may not have been clear to all those involved in the call,
for which I apologize.
Previously, Niina and Isidore had done simulations that strongly suggested that it would
not be possible to complete the cacao genome assembly just from 454 data. In response to
these concerns, we had solicited a proposal from Chris Saski at CUGI to increase the
number of Sanger BAC end sequences (BES) from 12,000 (6,000 BACs of the minimum
tiling path, MTP) to 60,000 (30,000 BACs, which would provide an increased number of ends
per BAC in the MTP so that it would act as a pseudo 20kb paired-end library). Increasing
the number of Sanger reads is believed to improve the chances of assembling the genome
because Sanger reads are longer and have a lower error rate.
In addition, Chris Saski suggested an assembly strategy that takes advantage of the
physical map that CUGI has almost finished producing. Basically, the idea is to 454
sequence pools of BACs (30 – 50) representing about 3 Mb of sequence whose position and
orientation are known from the physical map. Chris has proposed a pilot project where we
will do one 3 Mb region. If assembly is successful, the strategy would be scaled up to cover
the whole genome (~150 pools) which would represent ~15 runs on the 454 (12 pools per
run), which was approximately our original budgeted coverage.
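The scale-up figures in this paragraph can be checked with simple arithmetic. The sketch below uses only the numbers quoted above (150 pools, about 3 Mb per pool, 12 pools per run, 30 to 50 BACs per pool); the variable names are illustrative.

```python
import math

POOLS = 150               # pools needed to cover the whole genome
POOL_SIZE_MB = 3          # approximate sequence represented by one pool
POOLS_PER_RUN = 12        # pools sequenced per 454 run
BACS_PER_POOL = (30, 50)  # lower and upper bound quoted in the notes

sequence_covered_mb = POOLS * POOL_SIZE_MB
runs_needed = math.ceil(POOLS / POOLS_PER_RUN)
bacs_needed = tuple(POOLS * n for n in BACS_PER_POOL)

print(f"sequence covered: ~{sequence_covered_mb} Mb")
print(f"454 runs at {POOLS_PER_RUN} pools per run: {runs_needed}")
print(f"BACs required: {bacs_needed[0]:,} to {bacs_needed[1]:,}")
```

At exactly 12 pools per run, 150 pools come to 13 runs, a little under the ~15 quoted; the notes do not say whether the difference is rounding or allowance for repeats. The BAC range of 4,500 to 7,500 brackets the 6,000-BAC minimum tiling path mentioned earlier.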
Niina and Isidore proposed a simulation of the pilot project sequencing using 3 Mb of rice
genome sequence to test the CUGI assembly pipeline and determine the effect of
introduced error on the efficiency of assembly.
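The general shape of such a simulation can be sketched as follows: sample reads from a known reference at a chosen coverage, introduce errors at a chosen rate, then pass the reads to the assembler. This is not the IBM simulation or the CUGI pipeline. The function names are hypothetical, a random string stands in for the rice sequence, the read length of 400 is an assumed value, and errors are modeled as substitutions only, which is a simplification because 454 errors are mostly insertions and deletions in homopolymer runs.

```python
import random


def sample_reads(reference, read_len, coverage, rng):
    """Draw fixed-length reads at uniformly random positions to reach the coverage."""
    n_reads = (len(reference) * coverage) // read_len
    last_start = len(reference) - read_len
    starts = (rng.randrange(last_start + 1) for _ in range(n_reads))
    return [reference[s:s + read_len] for s in starts]


def add_substitutions(read, error_rate, rng):
    """Replace each base with a different base with probability error_rate."""
    out = []
    for base in read:
        if rng.random() < error_rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(0)  # fixed seed so the run is repeatable
# Stand-in reference; a real test would load 3 Mb of rice sequence.
reference = "".join(rng.choice("ACGT") for _ in range(3000))
reads = sample_reads(reference, read_len=400, coverage=15, rng=rng)
noisy_reads = [add_substitutions(r, error_rate=0.01, rng=rng) for r in reads]
print(len(noisy_reads), "simulated reads")
```

Repeating the run at several error rates and assembling each read set would show how quickly assembly quality degrades as error increases.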
OK, back to the conference call. Everyone agreed (or rather no one disagreed) with the
idea of inviting Chris Saski to the Santa Fe meeting to present the pilot project proposal. I
asked the group to consider what our strategy would be if the pilot project failed. Gerald
Tuskan responded that:
1. The increase in Sanger BES was a good thing, no matter what, for completing the
assembly.
2. If we had many more genetically mapped markers (which will come from both the
overgos on the physical map and from the SNP discovery project), those marker
sequences would help us to finish the assembly.
3. He would ask Jeremy Schmutz of HudsonAlpha what the cost of 4x coverage Sanger
whole genome shotgun sequencing would be.
So, there appears to be a Plan C (of currently unknown cost) if Plan B does not pan out.
III. The Santa Fe meeting agenda
We finished up with discussion of the Santa Fe meeting agenda. I suggested we invite
Chris Saski to present his proposal and I also put myself in on Wednesday afternoon to try
to sum up and get an idea of how far we’ve come, what problems have occurred since we
started and how we are addressing them.
Cocoa Genome Meeting
August 4-5, 2009
National Center for Genome Resources, Santa Fe, New Mexico
Suggested changes to the agenda which came from the conference call are in blue.
Tues Aug 4: Project updates and discussion
Morning:
BAC status, physical map (Kuhn)
SNP development (May and Schnell)
454 sequencing (Scheffler)
Afternoon:
Update on potato genome – lessons for cacao (Visser)
Genome assembly and annotation (Tuskan and Haiminen)
Proposal for a pilot project to test assembly of genomic 454 DNA sequence of a
3 Mb region of the cacao genome from a minimum tiling path of BACs. (Chris Saski,
CUGI)
Database and website (Main)
Wed Aug 5: Quality traits and Project timelines
Morning:
Defining flavor and fat composition targets for Mars (Jerome)
Flavor Evaluation of USDA clones- lessons learned (Seguine)
Controlled fermentation for flavor evaluation (Schwann)
Evaluations of TAG composition and interspecific hybrids (Lopes)
High throughput TAG analysis (Cahoon)
Afternoon:
Cacao genome sequencing project summary, issues and actions (Kuhn)
Completing the genome - next 12 month timeline (Schnell)
Tools for breeding community – next steps (Schnell)
Proprietary quality trait development – next steps (Motamayor)