geneid

From ccu.umanitoba.ca!bonnie.concordia.ca!newsserver.csri.toronto.edu!utgpu!watserv1!watmath!att!news.cs.indiana.edu!sdd.hp.com!wupost!micro-heart-of-gold.mit.edu!bu.edu!steen Sat Feb 22 14:16:23 CST 1992

Article 1612 of bionet.software:

Xref: ccu.umanitoba.ca bionet.software:1612 bionet.molbio.genome-program:182

Path: ccu.umanitoba.ca!bonnie.concordia.ca!newsserver.csri.toronto.edu!utgpu!watserv1!watmath!att!news.cs.indiana.edu!sdd.hp.com!wupost!micro-heart-of-gold.mit.edu!bu.edu!steen

From: steen@darwin.bu.edu (Steen Knudsen)

Newsgroups: bionet.software,bionet.molbio.genome-program

Subject: NETGENE AND GENEID ONLINE SERVER

Message-ID: <STEEN.92Feb21160330@darwin.bu.edu>

Date: 21 Feb 92 21:03:30 GMT

Sender: news@bu.edu

Followup-To: bionet.software

Organization: Bio-Molecular Engineering Research Center

Lines: 168

GENEID AND NETGENE ONLINE SYSTEMS FOR PREDICTION OF GENE STRUCTURE

version 1.0 2/1/1992

GENEID

_______________________________________________________________________________

GeneID is an artificial intelligence system for analyzing vertebrate genomic DNA and predicting exons and gene structure (1). A prototype is implemented as a fast, automatic email-response system. Users can optionally have their DNA sequence analyzed by NetGene (2) at the same time.

REGISTRATION:

Before submitting a sequence for analysis, or in the same mail, you need to register by sending a line with the word "register", followed by your name and address. Example:

register, Don Johnson, Miami Vice, Baywiev Marina Dock A12, Miami, FL 34566-1234, U.S.A.

NOTE>> The line can be longer than 80 characters as long as it contains NO linebreaks (that is, do NOT press the <Return> key until the end of the address).

Send the line in a mail to: geneid@darwin.bu.edu. The registration information will only be used for maintaining a file of the number and geographic distribution of the users.
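The registration rule above (one line, keyword first, no linebreaks) can be sketched in Python. This is only an illustration: the function name registration_line is hypothetical, and the comma-separated layout is copied from the example line rather than from any stated specification of what the server parses.

```python
def registration_line(name: str, *address_parts: str) -> str:
    """Build a single-line registration request (hypothetical helper).

    The comma-separated layout follows the example in the announcement;
    the exact format the server requires is an assumption.
    """
    fields = ["register", name, *address_parts]
    # Collapse embedded whitespace, including stray line breaks, so the
    # whole request stays on one line however long it gets.
    cleaned = [" ".join(field.split()) for field in fields]
    return ", ".join(cleaned)


line = registration_line(
    "Don Johnson", "Miami Vice", "Baywiev Marina Dock A12",
    "Miami", "FL 34566-1234", "U.S.A.",
)
print(line)
```

The line may exceed 80 characters; what matters is that it contains no linebreak.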

SUBMITTING SEQUENCES:

You can submit only one sequence per mail. The sequence must be in the format shown below (approximately the same format as used for FASTA, BLAST and GRAIL). Put the sequence after the keyword "Genomic Sequence":

Genomic Sequence

>seqname

TTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCC

CGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTG

GCCAACTCCATCACTA...................

(Restrict the line length to 80 characters. The seqname is limited to 20 characters).

NOTE>> IF YOUR MAIL DOES NOT CONTAIN THE KEYWORD "GENOMIC SEQUENCE", OR ANY OTHER KEYWORDS LISTED IN THIS FILE, NO MAIL WILL BE RETURNED TO YOU.

If the reply file with the results exceeds the mail limit of 300 kB, the reply will be split into several files.

On a UNIX system you can send a file containing the sequence (here called File) as follows:

mail -v geneid@darwin.bu.edu < File
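As a rough sketch of the submission format described above, the following Python helper writes the keyword line, the >seqname line, and the sequence wrapped to a fixed width. The function name build_submission and the default width of 60 are assumptions made for illustration; only the keyword, the 20-character name limit and the 80-character line limit come from the announcement.

```python
def build_submission(seqname: str, sequence: str, width: int = 60) -> str:
    """Format one sequence as a mail body (hypothetical helper)."""
    if len(seqname) > 20:
        raise ValueError("seqname is limited to 20 characters")
    if not 1 <= width <= 80:
        raise ValueError("line length must be between 1 and 80 characters")
    # Strip whitespace so a sequence pasted with its own line breaks is
    # re-wrapped cleanly; upper-casing is a cosmetic choice, not a rule.
    residues = "".join(sequence.split()).upper()
    lines = ["Genomic Sequence", f">{seqname}"]
    lines += [residues[i:i + width] for i in range(0, len(residues), width)]
    return "\n".join(lines) + "\n"


message = build_submission("seqname", "ttggccactc cctctctgcg\ncgctcgctcg")
print(message, end="")
```

The resulting text could be saved to a file and mailed with the UNIX command given above.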

LIMITS:

GeneID currently will not accept sequences shorter than 100 bp or longer than 20 kb.

CONFIDENTIALITY:

Your submitted sequence will be deleted automatically immediately after reception by GeneID.

ANALYSIS:

GeneID will scan your sequence for potential splice sites, start codons, and stop codons. Then it will try to assemble these into potential first exons, internal exons, and last exons. Exons will be evaluated according to a number of characteristics related to coding and splicing, and only likely exons will be kept. Mutually exchangeable exons (normally overlapping and in the same frame) will be put together in classes.

Only the top 15 ranking first and last exon classes, and the top 35 ranking internal exon classes from each sequence will be kept. These are assembled into potential gene models with an open reading frame, which are ranked according to the quality of the exons they contain. The top 20 models will be included in the return mail. Your return mail will also contain lists of the sites and exons created during the analysis.

GeneID will not analyze the reverse complement of your sequence. If you suspect a gene on the other strand, submit the reverse complement sequence separately.
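Since the other strand has to be submitted separately, the reverse complement can be produced before mailing. A minimal Python sketch, assuming the sequence contains only A, C, G, T and N (other IUPAC ambiguity codes are passed through unchanged); the function name is illustrative:

```python
# Translation table for the four bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    # Complement every base, then reverse the string.
    return sequence.translate(_COMPLEMENT)[::-1]


print(reverse_complement("ATGC"))  # prints GCAT
```

Applying the function twice returns the original sequence, which is a quick sanity check before submitting.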

TIPS FOR USE OF GENEID:

GeneID will try to identify first, internal, and last exons in each of the sequences you submit, and try to assemble these into models of ONE likely gene in each sequence. To avoid missing any exons, the number of exons will be vastly overpredicted, and only a few of them are likely to be true (they tend to be the top ranking exons, but a few true exons rank very low). But these few true exons are likely to be found in the gene models because they fit together to form a continuous open reading frame. Thus you should look to the gene models to find a probable coding region.

If you submit a sequence that turns out to contain two genes, the behavior of GeneID is unpredictable. It could either predict one large gene containing both, or it could predict only the gene with the most typical characteristics.

If you submit a sequence that contains only part of a gene, GeneID will try to identify an entire gene in this sequence. Thus the predicted first exon may actually be part of a true internal exon, or the predicted last exon may be part of a true internal exon. If GeneID fails to predict any genes, you might look at the potential exon lists.

Thus you can experiment with input and response, by starting out with sequences that are not too long (for example shorter than 10 kb), and see if GeneID is able to extend the gene if you extend the sequence. If you have very large sequences, it may be a good idea to request analysis by NetGene first (see below). NetGene will analyze sequences up to 100 kb, and may find regions containing exons of very high likelihood. These regions can then be resubmitted to GeneID for further analysis.

GeneID will not construct models with more than 22 exons.

If the sequence contains frameshift errors in exons, then that may affect the quality of the prediction in the current implementation.

ACCURACY:

In a test on 28 genes from GenBank, 91% of the nucleotides were correctly predicted as coding or non-coding. Since these two categories are unequally represented, a better measure of accuracy may be the correlation coefficient, which was found to be 0.68. See paper for details.
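To see why the percentage of correct nucleotides can flatter a predictor when coding and non-coding positions are unequally represented, the Python sketch below computes both measures from per-nucleotide counts. It assumes the correlation coefficient is the usual correlation between predicted and true binary labels (the phi, or Matthews, coefficient); the paper should be consulted for the exact definition used. The counts are invented for illustration and are not taken from the 28-gene test.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of nucleotides labelled correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def correlation_coefficient(tp: int, tn: int, fp: int, fn: int) -> float:
    """Phi (Matthews) correlation between predicted and true labels."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Undefined when a row or column of the table is empty;
        # returning 0.0 here is a common convention, not a rule.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Invented example: 1000 nucleotides, 100 of them coding, and a
# "predictor" that calls every position non-coding.
counts = dict(tp=0, tn=900, fp=0, fn=100)
print(accuracy(**counts))
print(correlation_coefficient(**counts))
```

In this invented case the accuracy is 90% although no coding nucleotide is found, while the correlation coefficient is zero.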

ANALYSIS TIME:

Analysis time depends on the load on the system and grows approximately linearly with the length of the input sequence. Expect at least 1 minute per kb. Longer response times can occur if the system is temporarily down (check with the UNIX command: "finger geneid@darwin.bu.edu").

FURTHER INFORMATION:

A preprint of a paper describing the development and testing of GeneID is available as a Stuffit.hqx file for Macintosh. Simply include the line:

Preprint Request

in your mail to geneid@darwin.bu.edu, and the manuscript will be mailed to you.

REFERENCING:

Publication of output from GeneID must be referenced as follows:

(1) Guigo, R., Knudsen, S., Drake, N., and Smith, T. (1992) Prediction of Gene Structure. Journal of Molecular Biology. In Press.

PROBLEMS, COMMENTS, AND SUGGESTIONS:

Can be mailed to steen@darwin.bu.edu.

Users of the MBCRR and BMERC national computer resources have direct online access to GeneID from their account. Contact Tom Graf at tom@mbcrr.harvard.edu for information on these accounts.

NETGENE

________________________________________________________________________________

Users now have the option of having their submitted sequence analyzed by NetGene as well. NetGene predicts splice sites and gives information about the likelihood of the prediction. NetGene detects both coding regions and splice signals, and combines that information to predict both small and large exons (it predicts one end of the exon, the acceptor or donor site).

Simply include the keyword "NetGene" between the keyword "Genomic Sequence" and your sequence. The results of the NetGene analysis will be mailed to you separately. The only difference in sequence format is that NetGene will accept sequences UP TO 100 kb.

Thus, NetGene can be used in conjunction with GeneID by first submitting a large sequence to NetGene (specify the keyword "NetGene"; GeneID will not respond if the sequence is larger than 20 kb). Regions that show exons with very high likelihood can then be resubmitted to GeneID (<20kb) for further analysis. The minimum sequence length that NetGene will faithfully analyze is 451 bp.
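The length limits of the two services can be checked before mailing. The Python sketch below is a hedged illustration: the function names are hypothetical, 1 kb is assumed to mean 1,000 bp, and placing "NetGene" on its own line directly after "Genomic Sequence" is one reading of "between the keyword and your sequence" rather than a documented layout.

```python
# Length limits in bp as stated in the announcement (1 kb taken as 1,000 bp).
GENEID_RANGE = (100, 20_000)
NETGENE_RANGE = (451, 100_000)


def accepting_services(length_bp: int) -> list[str]:
    """List the services whose length limits admit this sequence."""
    services = []
    if GENEID_RANGE[0] <= length_bp <= GENEID_RANGE[1]:
        services.append("GeneID")
    if NETGENE_RANGE[0] <= length_bp <= NETGENE_RANGE[1]:
        services.append("NetGene")
    return services


def with_netgene_keyword(message: str) -> str:
    """Insert a "NetGene" line after the "Genomic Sequence" line.

    The one-keyword-per-line layout is an assumption.
    """
    first, _, rest = message.partition("\n")
    if first.strip().lower() != "genomic sequence":
        raise ValueError('message must start with "Genomic Sequence"')
    return f"{first}\nNetGene\n{rest}"


print(accepting_services(50_000))
print(with_netgene_keyword("Genomic Sequence\n>seqname\nACGT\n"), end="")
```

A 50 kb sequence, for example, falls outside the GeneID limit, so only NetGene would analyze it; regions it flags could then be cut out and submitted to GeneID.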

REFERENCING AND FURTHER INFORMATION:

Publication of output from NetGene must be referenced as follows:

(2) Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal of Molecular Biology 220:49-65.

PROBLEMS, COMMENTS AND SUGGESTIONS:

Can be mailed to steen@darwin.bu.edu.
