NCBI Genome Annotation (Maglott)

NCBI’s Genome Annotation: Overview
• Incremental processing
• Re-annotation (batch)
• Post-annotation review
• Case studies
NOTE: limiting discussion to annotation of genes and pseudogenes
Donna Maglott, Ph.D.
for the RefSeq and Annotation groups
NCBI: Incremental processing
• Maintain gene/sequence relationship
– Data sources
• MGI’s ftp site (names, MGI ids, sequence accessions)
• Sequences and annotation from INSD (DDBJ, EMBL, GenBank)
• UniProt (names, sequence)
• CCDS Collaboration (CDS definition; Kim will expand on this later.)
• Gene family-specific databases
• HomoloGene
• UniGene
• Scientific community
– Actions
• Create or update data via Entrez Gene
• Create or update RefSeq sequences
• Identify conflicts and discuss with stakeholders
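The "identify conflicts" action can be pictured as a comparison of two views of the same gene/sequence relationship. The following is only a minimal sketch under stated assumptions: the function name `find_conflicts`, the dictionary layout (sequence accession mapped to a gene identifier), and all accessions and identifiers are made up for illustration and do not describe NCBI's or MGI's actual data or pipeline.

```python
def find_conflicts(source_a, source_b):
    """Compare two accession -> gene-identifier mappings.

    Returns accession -> (identifier in A, identifier in B) for every
    accession that both sources report but attach to different genes.
    """
    conflicts = {}
    # Only accessions known to both sources can be in conflict.
    for accession in source_a.keys() & source_b.keys():
        if source_a[accession] != source_b[accession]:
            conflicts[accession] = (source_a[accession], source_b[accession])
    return conflicts


# Toy data only: these accessions and identifiers are placeholders.
ncbi_view = {
    "ACC0001.1": "MGI:0000001",
    "ACC0002.1": "MGI:0000002",
    "ACC0003.1": "MGI:0000003",
}
mgi_view = {
    "ACC0001.1": "MGI:0000001",
    "ACC0002.1": "MGI:0000009",
    "ACC0004.1": "MGI:0000004",
}

conflicts = find_conflicts(ncbi_view, mgi_view)
print(conflicts)
```

In this sketch a discordant accession is only flagged; the slide's point is that resolution happens by discussion with stakeholders, not by an automatic rule.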
Create or update RefSeq sequences
http://www.ncbi.nlm.nih.gov/projects/sviewer/?id=NG_007044&v=947227:11232
Curated annotation of the T-cell receptor alpha/delta locus
Changed annotation after release of Mm37.1
Annotated as version 3, overlapping model at 5’ end
Identify conflicts and discuss with stakeholders
Hi MGI,
I believe Tnrc18 (geneID: 381742, MGI:3648294) and Zfp469 (geneID: 231861,
MGI:2448535) should be merged. This region of chr 5 has major assembly
problems in the reference assembly, but the Celera assembly appears to
accurately represent the structure as compared to transcript data and the
orthologous regions of the human and rat genomes. In the reference
assembly, Zfp469 and Tnrc18 are on separate scaffolds… there are multiple
mouse and human transcripts spanning both loci… Zfp469 is currently
represented as NM_178242.2 (based on BC049818.1), and this appears to be a
valid transcript variant that uses a well-supported early polyA signal/site.
However, mouse AK141849.1, CB249799.1, AK173280.1, and human
AB058759.1 all extend past this early polyA signal/site to include an
additional 13 exons that overlap with Tnrc18 and potentially encode an
additional 1125 aa at the C-terminus…There is also an issue of nomenclature. I
cannot find any evidence of the transcripts associated with Zfp469 encoding a zinc
finger protein…the long variant does contain a trinucleotide repeat so the Tnrc18
name may be more appropriate, although the repeat is not found in the shorter
NM_178242.2 variant. Also be aware that we had mis-associated the human
TNRC18 nomenclature (HGNC: 11962)..
Terence Murphy
One Gene or Two?
BLAST alignment of human RefSeq NM_001080495.2 to Mm37.1
BLAST alignment of human RefSeq NM_001080495.2 to Celera
Exon coverage better in Celera assembly
NCBI: Incremental processing
• Maintain gene/sequence relationship (cont’d)
– Products
• RefSeq sequences via
– Entrez Nucleotide,
– Entrez Protein,
– ftp
• Gene-specific data via
– Entrez Gene
– ftp
• Nomenclature propagated to UniGene and HomoloGene
NCBI: Re-annotation of genes
• Timing
– Always with a new assembly
– May occur without a re-assembly
• Evidence used
– cDNAs aligned to the genome by Splign
– Proteins aligned to the genome by ProSplign
– Ab initio predictions (Gnomon)
– Annotated RefSeq genomic sequences
NCBI: Re-annotation
• Tracking/identification of annotation (decreasing weight)
– Best RefSeq placement (Splign/global alignment)
– Comparison to previous annotation
• Assembly to assembly
• Clone to clone
• ‘product’ to ‘product’
– Best GenBank placement
• Products of annotation
– If tracked, reassign GeneID and RefSeq model accession(s)
– If novel and transcribed, assign new GeneID and RefSeq model accession(s)
– If novel pseudogene, assign new GeneID and annotate on the genomic sequence without assigning a model RefSeq accession
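The tracking order and the three "products of annotation" rules above can be read as a small decision procedure. The sketch below is one possible reading, not NCBI's implementation: the `Model` class, the evidence keys, the action strings, and the GeneID values (101, 202, 900001, 900002) are all hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional, Tuple

# Evidence types in decreasing weight, following the order on the slide.
TRACKING_PRIORITY = ("refseq_placement", "previous_annotation", "genbank_placement")

REUSE = "reassign existing GeneID and model accession(s)"
NEW_WITH_MODEL = "new GeneID and new model accession(s)"
NEW_NO_MODEL = "new GeneID, annotate on genomic sequence, no model accession"


@dataclass
class Model:
    """A gene model from a new annotation run (hypothetical structure)."""
    name: str
    transcribed: bool = True
    is_pseudogene: bool = False
    # Maps an evidence type to the existing GeneID it points to, if any.
    tracking: dict = field(default_factory=dict)


def track(model: Model) -> Optional[int]:
    """Return the GeneID supported by the highest-weight evidence, if any."""
    for evidence in TRACKING_PRIORITY:
        gene_id = model.tracking.get(evidence)
        if gene_id is not None:
            return gene_id
    return None


def assign(model: Model, next_gene_id: int) -> Optional[Tuple[int, str]]:
    """Apply the three rules from the slide to one model."""
    existing = track(model)
    if existing is not None:
        return existing, REUSE
    if model.is_pseudogene:
        return next_gene_id, NEW_NO_MODEL
    if model.transcribed:
        return next_gene_id, NEW_WITH_MODEL
    return None  # a case the slide does not cover


# Toy inputs: the RefSeq placement outranks the GenBank placement.
tracked_result = assign(
    Model("m1", tracking={"genbank_placement": 202, "refseq_placement": 101}), 900001
)
novel_result = assign(Model("m2"), 900001)
pseudo_result = assign(Model("m3", transcribed=False, is_pseudogene=True), 900002)
print(tracked_result, novel_result, pseudo_result, sep="\n")
```

Checking evidence in a fixed priority order means a lower-weight signal is consulted only when every higher-weight one is absent, which is one simple way to express "decreasing weight".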
One Gene or Two?
BLAST alignment of human RefSeq NM_001080495.2 to Mm37.1
NCBI: Post annotation review
• Data reviewed
– GeneIDs with sequence data, not annotated
– GeneIDs annotated previously, not in the current annotation
– CDS features NOT included in the CCDS set
• Actions taken
– Create new RefSeqs if necessary
– Update existing RefSeqs if necessary
– Provide annotation comment to explain cases currently under review
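The three categories of reviewed data are, in effect, set differences between what is known and what the current annotation contains. As a rough sketch, assuming each input is simply a collection of identifiers (the function name `review_queues`, the parameter names, and the toy identifiers are illustrative only):

```python
def review_queues(genes_with_sequence, previous_annotation, current_annotation,
                  annotated_cds, ccds_set):
    """Build the three review lists described on the slide as set differences."""
    current = set(current_annotation)
    return {
        # GeneIDs with sequence data that were not annotated this round.
        "not_annotated": sorted(set(genes_with_sequence) - current),
        # GeneIDs annotated previously but absent from the current annotation.
        "dropped": sorted(set(previous_annotation) - current),
        # CDS features that are not part of the CCDS set.
        "cds_outside_ccds": sorted(set(annotated_cds) - set(ccds_set)),
    }


# Toy identifiers only; none of these refer to real records.
queues = review_queues(
    genes_with_sequence={1, 2, 3, 4},
    previous_annotation={1, 2, 5},
    current_annotation={1, 3},
    annotated_cds={"cdsA", "cdsB"},
    ccds_set={"cdsA"},
)
print(queues)
```

A GeneID can appear in more than one list (here, 2 is both unannotated and dropped), so each list is best treated as a separate queue for curators.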
NCBI: Representative review cases
• Under-representation of ncRNAs in the RefSeq set
– May result in failure to annotate ncRNAs that overlap a protein-coding gene
– More RefSeqs for this category are being generated
• Management of ‘read-through’ transcripts
– Generate RefSeq if multiple lines of evidence
– Discuss with all nomenclature groups
NCBI: Representative review cases
• Adjudication of the name to be assigned to a given genomic location
– Evaluate conserved synteny
– Discuss with all nomenclature groups
A case history: Arhgap27 and 5730442P18Rik
VEGA: One Gene
NCBI/MGI: Two Genes
A case history: Arhgap27 and 5730442P18Rik
Contributing factors
Computation:
• Model prediction suggests one gene, but independent RefSeq mRNAs force two
• Only one gene is in the CCDS set
Curation:
• NCBI merged in 2005
• Reversed the merge in 2006 in response to a request from MGI
• UniProt treats as one gene, Arhgap27
• Evidence: one cDNA from rat, one cDNA from mouse, and Arhgap12 are all consistent with the one-gene model
A case history: A read-through locus
http://www.ncbi.nlm.nih.gov/projects/sviewer/?id=NG_006051&v=2242:4418;g&p=theme:0&m=1
A case history: A (vertical) read-through locus
http://tinyurl.com/2oy36j
A case history: olfactory receptors
Action: Merge the Gene records in collaboration with MGI
A case history: interspersed loci
http://www.ncbi.nlm.nih.gov/projects/sviewer/?id=NC_000073.5&v=52381146-52384105