
Quality assessment of the Rice GD with known rice genes
Longjiang Fan1,2 (樊龙江), Jianbin Wang1 (王建斌), Yang Zhang1 (张扬)
(Bioinplant Lab, 1. Institute of Bioinformatics / 2. Institute of Seed Science, Zhejiang University,
Hangzhou 310029; E-mail: fanlj@zju.edu.cn / bioinplant@zju.edu.cn)
Known genes and Method
SWISS-PROT is a curated protein sequence database that strives to provide a high
level of annotation. As of 7 March 2002, the database contained 395 rice entries,
of which 101 were translated from genomic DNA sequences in GenBank. These 101
genes were taken as the known rice genes and used to query the Rice GD with BLAST.
Each query gene was then classified into groups with different evidence indices
according to the BLAST results. Every gene sequence runs from the first nucleotide
to the last nucleotide of the CDS.
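As a rough illustration of how the BLAST results for each query gene might be collected before classification, here is a minimal sketch. It assumes tab-separated BLAST output with the usual 12 columns (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score); the source does not say which output format or tooling was used. The names `Hit` and `parse_hits`, and the identifiers in the usage lines, are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One matched region between a query gene and a contig (hypothetical record)."""
    query: str
    contig: str
    identity: float  # percent identity reported by BLAST
    q_start: int     # 1-based start of the match on the query gene
    q_end: int       # 1-based end of the match on the query gene


def parse_hits(lines):
    """Group tabular BLAST hits by query gene.

    Assumes 12 tab-separated columns per line; blank lines and
    lines starting with '#' are skipped.
    """
    hits = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Defensive: normalize so that q_start <= q_end.
        q_start, q_end = sorted((int(fields[6]), int(fields[7])))
        hits[fields[0]].append(
            Hit(fields[0], fields[1], float(fields[2]), q_start, q_end)
        )
    return dict(hits)


# Usage with two made-up hits for one made-up query gene.
example = [
    "# made-up example, not data from this study",
    "gene1\tcontigA\t98.50\t400\t5\t1\t1\t400\t120\t519\t0.0\t700",
    "gene1\tcontigB\t96.00\t300\t10\t2\t796\t1095\t1\t300\t1e-150\t500",
]
hits_by_gene = parse_hits(example)
```

Keeping only the query-side coordinates is enough to measure how much of each gene is matched; the contig-side coordinates would be needed for anything that depends on positions within a contig.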
Results
Evidence Index     Query gene number     Percent (%)
1. Found                  96                95.0
   1.1 MM                 96                95.0
   1.2 PM                 48                47.5
   1.3 SM                 42                41.6
   1.4 GM                  6                 5.9
   1.5 OM                  1                 1.0
   1.6 SC                 86                85.1
   1.7 DC                  8                 7.9
   1.8 MC                  5                 5.0
2. Un-found                5                 5.0
Total                    101               100
* Found and Un-found: a query gene is counted as found (Found) if it has one or more matched
regions that cover at least 60% of its total sequence with at least 90% identity to one or more
contigs in the Rice GD; otherwise it is counted as not found (Un-found);
MM: most (over 80%) of the whole gene sequence is matched;
PM: perfectly matched, with no unmatched region anywhere in the query gene (including genes
covered by two contigs);
SM: matched with small unmatched regions or gaps (none over 100 bp), and at least 80% of the
whole query gene sequence is matched;
GM: matched with a large gap (over 100 bp);
OM: matched with more than one overlapping region of over 100 bp in a contig;
SC: the query gene sequence is covered by a single contig in the Rice GD;
DC: the query gene sequence is covered by two contigs in the Rice GD;
MC: multi-copy; there are more than two MM returns or two matched regions of over 1000 bp for
one query gene.
** Whether a sequence region matches a contig in the Rice GD is decided entirely from BLAST
results obtained with default values. An unmatched region between two matched regions is
counted as a gap if all or part of it is not found.
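The definitions above can be sketched as a small classifier. This is only one possible reading, not the authors' procedure: it treats every stretch of the query not covered by a retained match as an unmatched region, takes the 60% and 80% thresholds as "at least 60%" and "over 80%", and leaves out OM and MC because they depend on contig-side positions and hit counts that are not modelled here. The function names and the `(q_start, q_end, identity, contig)` tuple layout are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def unmatched_regions(merged, gene_len):
    """Return the stretches of the query (1..gene_len) not covered by any match."""
    regions, pos = [], 1
    for start, end in merged:
        if start > pos:
            regions.append((pos, start - 1))
        pos = max(pos, end + 1)
    if pos <= gene_len:
        regions.append((pos, gene_len))
    return regions


def classify(gene_len, matches, min_identity=90.0):
    """Assign evidence-index labels to one query gene.

    matches: list of (q_start, q_end, identity, contig) tuples.
    Assumption: matches below min_identity are ignored entirely.
    """
    kept = [m for m in matches if m[2] >= min_identity]
    merged = merge_intervals([(s, e) for s, e, _, _ in kept])
    covered = sum(e - s + 1 for s, e in merged)

    # Found requires at least 60% of the gene to be matched (integer arithmetic).
    if covered * 100 < 60 * gene_len:
        return {"Un-found"}
    labels = {"Found"}

    if covered * 100 > 80 * gene_len:
        labels.add("MM")

    regions = unmatched_regions(merged, gene_len)
    longest = max((e - s + 1 for s, e in regions), default=0)
    if not regions:
        labels.add("PM")
    elif longest > 100:
        labels.add("GM")
    elif covered * 100 >= 80 * gene_len:
        labels.add("SM")

    contigs = {contig for _, _, _, contig in kept}
    if len(contigs) == 1:
        labels.add("SC")
    elif len(contigs) == 2:
        labels.add("DC")
    return labels


# Made-up example: two contigs leave a 50 bp unmatched region in a 1000 bp gene.
labels = classify(1000, [(1, 500, 98.0, "c1"), (551, 1000, 97.0, "c2")])
```

In this reading PM, SM and GM are mutually exclusive, which agrees with the table, where the three counts (48, 42 and 6) add up to the 96 found genes.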
[Figure: one known gene (bottom) and six contigs aligned along the gene. Horizontal axis: gene size (bp), 0 to 4500; vertical axis ticks 0 to 7. A 395 bp gap is marked.]
Figure. A putative alignment model of a known rice query gene (accession X57563),
based on BLAST results. The six contigs, numbered 2 to 7, are 150020, 99015, 115832,
37882, 47679 and 18003, respectively. All matched regions (grey) have identities over
95%. This is an extreme example among the 101 query gene entries.
Conclusion
Based on this assessment, the Rice GD has comprehensive functional coverage of the
rice genome: 95.0% of the known rice genes tested were found. For most rice genes,
the whole gene sequence (PM+SM) can be found in the database (89.1%), and most are
covered by a single contig (85.1%). A small fraction (5.0%) of rice genes are
multi-copy genes.
Querying a genome sequence database with the genomic DNA sequences of known genes
is a good way to assess its quality.
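The percentages quoted in the conclusion follow directly from the counts in the results table. A short check, using only those counts (the helper name `percent` is ours):

```python
# Counts taken from the results table; 101 query genes in total.
TOTAL = 101
counts = {
    "Found": 96, "MM": 96, "PM": 48, "SM": 42, "GM": 6,
    "OM": 1, "SC": 86, "DC": 8, "MC": 5, "Un-found": 5,
}


def percent(n, total=TOTAL):
    """Share of the query genes, rounded to one decimal place."""
    return round(100 * n / total, 1)


coverage = percent(counts["Found"])                    # found genes
whole_gene = percent(counts["PM"] + counts["SM"])      # PM + SM
single_contig = percent(counts["SC"])                  # covered by one contig
multi_copy = percent(counts["MC"])                     # multi-copy genes

print(coverage, whole_gene, single_contig, multi_copy)
```

The counts are also internally consistent: Found plus Un-found gives the 101 query genes, and PM, SM and GM together account for all 96 found genes.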
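About the first author
Longjiang Fan (樊龙江), 37, Ph.D., is an associate professor at the Institute of
Bioinformatics / Institute of Seed Science and Engineering, Zhejiang University.
Current research focuses mainly on rice genome bioinformatics and quantitative
genetics. Publications include 4 SCI-indexed papers and 10 papers in top-tier
national journals. Awards include a Second Prize for Scientific and Technological
Progress from the State Education Commission (1991) and Third Prizes for Scientific
and Technological Progress from the Ministry of Agriculture (1999) and Zhejiang
Province (2001).
Lab (Bioinplant Lab) homepage:
http://www.cab.zju.edu.cn/depart/nx/Bioinplant/bioinplant_page.htm
E-mail: fanlj@zju.edu.cn or bioinplant@zju.edu.cn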