The USPTO Genetic Sequence Database, USGENE, on STN

Robert Austin – FIZ Karlsruhe
Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• Similarity searching GETSIM (FASTA)
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
BLAST is a registered trademark of the U.S. National Library of Medicine (NLM)
2
STN sequence searchable databases
• CAS REGISTRY℠
– Chemical Abstracts Service (CAS) Registry File
• DGENE
– Thomson Scientific GENESEQ™
• PCTGEN
– WIPO/PCT Patent Application Biosequences
• USGENE
– The USPTO Genetic Sequence Database
See Effective patent sequence searching on STN:
http://www.stn-international.com/training_center/bioseq/epss.pdf
3
A new subject for many….
Bluff Your Way in Genetics!!
http://www.stn-international.com/training_center/bioseq/bluff.pdf
4
USGENE is the USPTO Genetic
Sequence Database
• Sequences from all relevant USPTO published patent
applications and granted (issued) patents
• Assignee and full inventor names; publication,
application and parent case PCT numbers and dates;
original publication title, abstract, and claims
• Organism name, sequence length, Molecule Type, SEQ
ID, and feature tables for features/annotations
• Produced by the SequenceBase Corporation
• Updated weekly – within 7 days of publication
• 1982 – present
5
USGENE consolidates unique USPTO
sequence data from different sources
• USPTO Publication Site for Issued and Published
Sequences (PSIPS)
– The official mega-publication download site, 2001-date
• International Nucleotide Sequence Database
Collaboration (INSDC) (NCBI/EMBL/DDBJ, Genbank)
– U.S. granted patent nucleotide sequences, 1982-date
• USPTO Protein Database (NCBI/EMBL)
– U.S. granted patent protein/peptide sequences, 1982-date
• USPTO Patents and Published Applications Full-Text
– Filling in omissions, coverage gaps and to enhance timeliness
The USGENE Sequence Source (/SSO) field indicates which
source any given USGENE sequence record was derived from.
6
USGENE combines these sequences with
bibliographic data and claims text
• USPTO biblio, title, abstract and claims text
• INSDC: USPTO nucleotide sequences
• NCBI/EMBL-EBI: USPTO peptide sequences
• USPTO PSIPS sequences
• USPTO full-text sequences
7
An individual publication is represented by
one or more USGENE sequence records
AN .... Protein   USGENE   PI US …. B2   SEQ 1 ….
AN .... DNA       USGENE   PI US …. B2   SEQ 2 ….
AN .... cDNA      USGENE   PI US …. B2   SEQ n ….
8
Each USGENE sequence record includes
full patent bibliography, title and abstract
ALL display format.
L1   ANSWER 1 OF 1 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN   7255990.1   DNA   USGENE   (1) (2)
TI   Method for screening genes expressing at desired sites (Patent)   (3)
IN   Higo Kenichi (Tsukuba, JP); Iwamoto Masao (Tsukuba, JP)   (4)
PA   National Institute of Agrobiological Sciences (Ibaraki JP)   (5)
PI   US 7255990       B2   20070814   (6)
     US 20040086855   A1   20040506
     WO 2003044227    A    20030530
AI   US 2001-221596        20011121
RLI  WO 2001-JP10195       20011121
ED   20070817
DT   Patent
AB   The present invention relates to a method for inferring a plant   (7)
     organ, in which a certain gene is to be expressed, using a part of a
     base sequence, a method for searching for a gene which is to be
     expressed at a desired site, and a composition, kit, system and
     program for carrying out these methods. The present invention also
     relates to a method for inferring a plant organ, in which a plant
     gene is to be expressed, based on information about the presence or
     absence of a base sequence which is highly similar to . . . .
See (1) - (7) on slide 12.
USGENE records are typically available within 7 days of publication by the USPTO.
9
Each USGENE sequence record includes
patent or published application claims text
ALL display format (cont.)
CLM  (8)
     US7255990 B2: 1. A method for detecting a gene which is expressed in
     a flower and other organs in a rice plant, comprising the steps
     of: (1) searching a gene population using a Tourist C transposon
     sequence consisting of SEQ ID NO: 1 as a key sequence, (2) selecting a
     gene having the transposon sequence in the vicinity of a putative
     protein coding region, and (3) detecting expression of said gene in
     the flower and other organs.
     2. The method according to claim 1, wherein the expression of said
     gene includes expression of at least one site selected from a stamen
     and a pistil.
     3. The method according to claim 1, wherein the gene population is a
     library and the key sequence is a probe sequence.
     4. The method according to claim 3, wherein the database is a DNA
     library.
     5. The method according to claim 3, wherein the search . . . .
10
All USGENE sequences are provided
in STN standardized format
ALL display format (cont.)
SSO   NUCLEIC; USPTO; GRANTED   (9)
ORGN  Zea mays   (10)
SQL   352   (11)
SEQ   (12)
        1 gggtctgttt agttcccaaa caaaattttt cacgctgtta cataggatgt
       51 ttggacacat gcatagagta ctaaatgtag aaaaaaaaca attaaacatt
      101 tcgccttgaa attacgagac aaatctttta agcctaattg cgccatgatt
      151 tgacaatttg gtgctacaat aaatatttgc taataataga ttaattaggc
      201 ttaataaatt cgtcttgcag tttccagacg gaatctgtaa tttattttat
      251 gagatacagc tgcttcgatc ttccatcaca tattcagacc gtacctaatc
      301 tgaaaggtta gtaatttgaa ctgcgtagta atgctacaag gtaaatcaat
      351 ca
FEATURE TABLE:   (13)
Key         |Location  |
============+==========+=======================
misc_feature|(1)..(352)|
See (8) - (13) on slide 13.
11
USGENE sample record annotations
1) USGENE Accession Number (AN), including the
sequence identity number (SEQ ID NO)
2) Molecule Type (MTY)
3) Original publication title – a “Published Application”
or “Patent” indication is given in parentheses
4) Full inventor names, city and state/country
5) Patent assignee name, city and state/country
6) Publication, application and related PCT parent
case application details and dates
7) Original patent or published application abstract
12
USGENE sample record annotations
8) Published application or granted patent claims
9) The Sequence Source (SSO) – nucleic or protein;
PSIPS/USPTO, NCBI, etc; granted or application
10) Organism (where given) – providing the name of
the organism from which the sequence is derived
11) Searchable and sortable Sequence Length (SQL)
12) Standardized patent sequence (SEQ) – each
USGENE record is based upon a sequence
13) Feature table including sequence modifications,
features and/or annotations, as provided by the
patent applicant or assignee
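The field conventions in annotations 1 and 9 can be illustrated with a short Python sketch. It is not part of STN or USGENE; the function names are hypothetical, and it assumes only what the sample record shows: an accession number such as 7255990.1 is the publication number, a dot, then the SEQ ID NO, and an SSO value such as "NUCLEIC; USPTO; GRANTED" is a semicolon-separated list.

```python
from typing import NamedTuple


class AccessionNumber(NamedTuple):
    publication_number: str
    seq_id_no: int


def parse_accession_number(an: str) -> AccessionNumber:
    """Split an accession number such as '7255990.1' at its last dot.

    Hypothetical helper: assumes the 'publication number + SEQ ID NO'
    layout seen in the sample record.
    """
    publication, sep, seq_id = an.strip().rpartition(".")
    if not sep or not publication or not seq_id.isdigit():
        raise ValueError(f"unexpected accession number: {an!r}")
    return AccessionNumber(publication, int(seq_id))


def parse_sequence_source(sso: str) -> list[str]:
    """Split an SSO value such as 'NUCLEIC; USPTO; GRANTED' into its parts."""
    return [part.strip() for part in sso.split(";") if part.strip()]
```

Applied to exported records, helpers like these make it easy to group sequences by publication or to separate granted-patent sequences from application sequences.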
13
The original format of a USGENE sequence is
available for display using the SEQO display
=> S 20070224666.21/AN
L1            1 20070224666.21/AN
=> D TRI SEQO
L1   ANSWER 1 OF 1 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI   Alleles of the zwf gene from coryneform bacteria
     (PublishedApplication)
MTY  DNA
SQL  1263
SEQO
     gtg gcc ctg gtc gta cag aaa tat ggc ggt tcc tcg ctt gag agt gcg   48
     Met Ala Leu Val Val Gln Lys Tyr Gly Gly Ser Ser Leu Glu Ser Ala
     1               5                   10                  15
     gaa cgc att aga aac gtc gct gaa cgg atc gtt gcc acc aag aag gct   96
     Glu Arg Ile Arg Asn Val Ala Glu Arg Ile Val Ala Thr Lys Lys Ala
                 20                  25                  30
     gga aat gat gtc gtg gtt gtc tgc tcc gca atg gga gac acc acg gat   144
     Gly Asn Asp Val Val Val Val Cys Ser Ala Met Gly Asp Thr Thr Asp
             35                  40                  45
     . . . . .
USGENE Accession Numbers (/AN) comprise the publication number + the
sequence identity number (SEQ ID NO).
Often the SEQO original format includes the patent applicant’s alignment of the
nucleotide sequence coding region with its corresponding protein sequence.
14
In contrast, NCBI/EMBL/DDBJ patent records
have minimal bibliographic and text data
Note: USPTO peptide sequence records
available at EMBL (shown here) are
also available at NCBI, but not at DDBJ.
15
USGENE represents a new tool for tackling
business critical searches
• DGENE and REGISTRY sequences are indexed
by Thomson from the DWPI℠ basic and by CAS
from the CAplus℠ basic respectively
– 65% of basics are PCT published applications
• USGENE provides sequences from both USPTO
granted patents and published applications
– Updated weekly, within 7 days of USPTO publication
• Sequence listing variation often occurs between
published application and granted patent stage
– Especially important, e.g. for freedom-to-operate
16
USGENE provides sequences from both USPTO
published applications and granted patents
[Diagram: USGENE holds separate Protein (SEQ 1) and DNA (SEQ 2) records for both the US …. A1 published application and the US …. B2 granted patent; the DGENE records for SEQ 1 and SEQ 2 cite the WO …. A1 publication from a WPINDEX patent family with WO, EP, FR and US members.]
In contrast, DGENE sequences are indexed from DWPI basic publications.
WPINDEX = Derwent World Patents Index® on STN
DGENE = GENESEQ™ on STN
USGENE® = USPTO Genetic Sequence Database
17
Sequence listing variation often occurs between
published application and granted patent stage
L1   ANSWER 1 OF 1 WPINDEX COPYRIGHT 2007 THE THOMSON CORP on STN
AN   1994-358278 [44]   WPINDEX
TI   New polynucleotide(s) specific for hepatitis C virus types 4, 5 and 6 and related antigenic peptide(s) and antibodies, useful in vaccines,
     diagnosis, HCV typing and treatment
DC   B04; D16; S03
IN   PIKE I H; SIMMONDS P; YAP P L
PA   (COMM-N) COMMON SERVICES AGENCY; (MURE-N) MUREX DIAGNOSTICS INT INC; . . .
PI   WO 9425602       A1  19941110 (199444)* EN 70[5]
     AU 9465797       A   19941121 (199508)  EN
     FI 9505224       A   19951220 (199611)  FI
     EP 698101        A1  19960228 (199613)  EN [0]
     JP 09500009      W   19970107 (199711)  JA 52[0]
     AU 695259        B   19980813 (199844)  EN
     EP 698101        B1  20041103 (200475)  EN
     DE 69434116      E   20041209 (200481)  DE
     US 20050032047   A1  20050210 (200512)  EN
     US 6881821       B2  20050419 (200527)  EN
     . . . . .
ADT  WO 9425602 A1 WO 1994-GB957 19940505 . . . .
PRAI GB 1994-263      19940107
     GB 1993-9237     19930505
In this example the patent family has:
• 9 sequences from WO 9425602 in DGENE
• 58 sequences from US 6881821 in USGENE
18
USGENE covers a comprehensive variety of
USPTO patent publication types
PK     Patent Kind covered in USGENE (field /PK)
USA1   Published patent application
USA2   Republished patent application
USA9   Corrected published patent application
USA    Granted patent (until 2000)
USB1   Granted patent without pre-grant publication (2001 onwards)
USB2   Granted patent with pre-grant publication (2001 onwards)
USE    Reissued patent
USP1   Published plant patent application
USP2   Granted plant patent without pre-grant publication
USP3   Granted plant patent with pre-grant publication
WOA    WIPO/PCT published patent application (parent case data)
19
Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• Similarity searching GETSIM (FASTA)
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
20
USGENE offers the same sequence search
options as DGENE
• NCBI BLAST similarity
– RUN BLAST
• FASTA similarity
– RUN GETSIM
• Sequence Code Match (SCM)
– RUN GETSEQ
• Offline BATCH and ALERT options
The DGENE Workshop Manual is the complete guide:
http://www.stn-international.com/training_center/bioseq/dgene_wm.pdf
21
The 7 basic steps of USGENE BLAST
1) SAVE, UPLOAD, and VERIFY the query (L1)
2) RUN the BLAST search (/SQP or /SQN)
3) Decide how many answers to keep (L2)
4) SORT SCORE in Descending order (L3)
5) Review answers in a free-of-charge format,
   e.g. D L3 TRI ORGN ALIGN 1-
6) Display selected answers in bibliographic
   format, e.g. D L3 BIB AB CLM ALIGN 1,3,10
7) Ensure transcript was captured and Logoff
22
The 7 basic steps of USGENE BLAST
Search Question:
Find relevant U.S. published application and
patent references for this protein sequence:
  1 vqtvplsrlf dhamleahra helaidtyqe feetyipkdq kysflhdsqt
 51 sfcfsdsipt psnmeetqqk snlellrisl llieswlepv rflrsmfann
101 lvydtsdsdd yhllkdleeg iqtlmgrled gsrrtgqilk qtyskfdtns
151 hnhdallkny gllycfrkdm dkvetflrmv qcrsvegscg f
23
The 7 basic steps of USGENE BLAST
1) SAVE, UPLOAD, and VERIFY the sequence
query text file (L1)
• Upload options
– STN Express®: Use UPLOAD command or Upload
Query Wizard (STN Express 8.2+)
– STN® on the Web℠: Use Upload feature or
Sequence Assistant (link below)
• Verify the sequence with D LQUE
STN on the Web Sequence Search Assistant:
http://www.stn-international.com/training_center/bioseq/seq_se_ass.pdf
24
Requirements for sequences for the
STN Express Upload Query Wizard
• Sequence queries must be saved individually in
text (.txt) format
• Files may
– Be 3 letter codes (amino acids) or single letter
– Have header information as seen in, e.g. WIPO
ST.25, USPTO PSIPS or EMBL formats
– Include sequence count numbers
• Query (.txt) files must
– Be 10,000 characters or less
– Not have any lines longer than 300 characters
• After upload to STN verify with D LQUE
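The two hard limits above can be checked before upload with a few lines of Python. This is only an illustrative sketch, not part of STN Express: the function name is hypothetical, and it assumes that the 10,000-character limit counts every character in the file, line breaks included, which the slide does not specify.

```python
MAX_CHARACTERS = 10_000  # "Be 10,000 characters or less"
MAX_LINE_LENGTH = 300    # "Not have any lines longer than 300 characters"


def check_query_text(text: str) -> list[str]:
    """Return the problems found in a query file's text.

    An empty list means both limits are met. Assumption: the character
    limit applies to the whole text, including line breaks.
    """
    problems = []
    if len(text) > MAX_CHARACTERS:
        problems.append(
            f"file has {len(text)} characters (limit {MAX_CHARACTERS})"
        )
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            problems.append(
                f"line {number} has {len(line)} characters "
                f"(limit {MAX_LINE_LENGTH})"
            )
    return problems
```

A check like this does not replace D LQUE, which remains the way to confirm what was actually uploaded.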
25
Examples of formats that work
USPATFULL/USPAT2 format
DETD SEQUENCE CHARACTERISTICS:
SEQ ID NO: 4
LENGTH: 724
TYPE: PRT
ORGANISM: Artificial Sequence
FEATURE:
OTHER INFORMATION: Description of Artificial Sequence; Note = synthetic construct
SEQUENCE: 4
Met Ser Phe Val Asp His Pro Pro Asp Trp Leu Glu Glu Val Gly Glu
1               5                   10                  15
Gly Leu Arg Glu Phe Leu Gly Leu Glu Ala Gly Pro Pro Lys Pro Lys
            20                  25                  30

USPTO PSIPS ST.25 format
<210> SEQ ID NO 137
<211> LENGTH: 951
<212> TYPE: DNA
<213> ORGANISM: Zea mays
<400> SEQUENCE: 137
accgaggccg acttcccgtt cactggccac gacgggacgt gcgatctcaa actgaaaaat 60
acaagggttg tatccataga ttcgttcgag cgtgtgccca tcaactacga gagagcgctg 120
cagaaggccg tggcgcacca gcctgttagt gccagcattg aagcatctcg gcgcgcgttc 180
cagctctaca gttctggcat cttcgacggg agatgcggga cgtacctgga ccacggtgtg 240
26
a) Choose the Upload Query Wizard
From the Discover! button menu.
OR
From the Select Discover!
Wizard window.
27
b) Browse to locate sequence file
Click Next button to go
to the next step.
28
c) Change File type to .txt
29
d) Verify it’s the right query!
30
e) Select STN file to upload to
Use PCTGEN to upload queries and
verify them (lower connect hour). The
resulting L-numbers may be searched
in DGENE, PCTGEN, or USGENE.
Click Finish for the file to be “scrubbed”
and uploaded to STN.
31
1) SAVE, UPLOAD and VERIFY (cont.)
=> FILE PCTGEN
=> UPL R BLAST
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
=> D L1 LQUE
L1   ANSWER 1 PCTGEN COPYRIGHT 2007 WIPO on STN
LQUE vqtvplsrlfdhamleahrahelaidtyqefeetyipkdqkysflhdsqtsfcfsdsi
     ptpsnmeetqqksnlellrislllieswlepvrflrsmfannlvydtsdsddyhllkd
     leegiqtlmgrledgsrrtgqilkqtyskfdtnshnhdallknygllycfrkdmdkve
     tflrmvqcrsvegscgf
=>
These commands are automatically run by the
STN Express Sequence Query Upload wizard.
The sequence query is now ready for searching
directly in USGENE using the L-number (L1).
32
The 7 basic steps of USGENE BLAST
2) RUN the BLAST search
• Protein search: RUN BLAST L1 /SQP
• Nucleotide search: RUN BLAST L1 /SQN
• Translated search: RUN BLAST L1 /TSQN
33
2) RUN the USGENE BLAST search
=> FILE USGENE
FILE 'USGENE' ENTERED AT 12:09:16 ON 03 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED:              2 OCT 2007   <20071002/UP>
MOST RECENT PUBLICATION DATE:  27 SEP 2007   <20070927/PD>
FILE COVERS 1982 TO DATE
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE
IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN BLAST L1 /SQP -F F
BLAST Version 2.2
The BLAST software is used herein with permission of the
National Center for Biotechnology Information (NCBI) of
the National Library of Medicine (NLM). See also, . . . .
BLAST SEARCHING . . . .
USGENE is updated within 7 days of publication by the USPTO.
Turn the Low Complexity Filter off with the syntax… /SQP -F F
34
RUN BLAST command syntax
Similarity Searching with BLAST (protein/polypeptides)
=> RUN BLAST L1 (sequence or L-number)
   /SQP    (protein) (default)
   -e      (Expect-value)
   -f      (Filter) (on by default)
   -w      (Word size)
   -m      (Matrix)
   -g      (Gap penalty)
   -x      (Gap extension)
   BATCH   (offline)
   ALERT   (Alert/SDI)
35
RUN BLAST command syntax
Similarity Searching with BLAST (Nucleic acids)
=> RUN BLAST L1 (sequence or L-number)
   /SQN    (nucleotide)
   SIN     (single strand)
   COM     (complementary strand)
   BOTH    (both strands) (default)
   -e      (Expect-value)
   -f      (Filter)
   -w      (Word size)
   -g      (Gap penalty)
   -x      (Gap extension)
   -q      (penalty for mismatch)
   -r      (reward for match)
   BATCH   (offline)
   ALERT   (Alert/SDI)
36
RUN BLAST advanced options
Expectation Value (-E)
Expectation value (E-Value) is the statistical significance
threshold for reporting matches against a sequence database.
The E-value can be any positive number, and the default value is
10. This means that 10 matches may be expected to be found
merely by chance. In general, the E-value is lowered to make the
search more precise and raised to retrieve more answers.
Word Size (-W)
Word Size is the length of the character string fragments of a
sequence query which are used as the basis for a BLAST
search. For SQN the default is 11 and the range 7-23. For all
other BLAST searches the default is 3 and the range 2-3. For
short search queries, reducing the default word size can give
improved search results.
37
RUN BLAST advanced options (cont.)
Low Complexity Filtering (on by default) (-F)
The low complexity filter can eliminate segments of low
compositional complexity that are statistically significant
but biologically uninteresting, as determined by specific
programs for peptide or nucleotide sequences. Filtering is
applied to the query sequence and is indicated by a series
of Xs for peptide sequences and Ns for nucleotide sequences.
Low complexity filtering can be turned off (i.e. set to
F - false).
Peptide similarity matrices (-M)
For peptide based searches SQP and TSQN the advanced
options provide additional scoring matrices to the default
BLOSUM62 (next slide)
38
Guidelines from NCBI on the use of
Advanced Settings for peptide sequence
searching are as follows:
Query Length   Matrix      Gap costs
<35            PAM-30      (9,1)
35 – 50        PAM-70      (10,1)
50 – 85        BLOSUM-80   (10,1)
>85            BLOSUM-62   (11,1) (BLAST default)
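The guideline table can be expressed as a small lookup, sketched here in Python. The function name is hypothetical and this is not an STN or NCBI API. One assumption is needed because the table's ranges share their end points: a query of exactly 50 or 85 residues is assigned to the lower row.

```python
def suggest_peptide_settings(query_length: int) -> tuple[str, tuple[int, int]]:
    """Return (matrix, (gap open cost, gap extension cost)) for a peptide query.

    Follows the guideline table above. Assumption: lengths of exactly
    50 and 85, which the table lists in two rows, fall in the lower row.
    """
    if query_length < 1:
        raise ValueError("query length must be positive")
    if query_length < 35:
        return "PAM-30", (9, 1)
    if query_length <= 50:
        return "PAM-70", (10, 1)
    if query_length <= 85:
        return "BLOSUM-80", (10, 1)
    return "BLOSUM-62", (11, 1)  # the BLAST default
```

For an 11-residue peptide this returns PAM-30 with gap costs (9,1), and for a 191-residue protein it returns the BLOSUM-62 default.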
39
The 7 basic steps of USGENE BLAST
3) Decide how many answers to keep (L2)
• How many answers would you like to keep? (ALL) or ?:
• Recommendation: Keep ALL answers
40
3) Decide how many answers to keep
1350 ANSWERS FOUND BELOW EXPECTATION VALUE OF 10.0
[Histogram: Similarity Score (y-axis, scale marks at 195 and 390) against Answer Count (x-axis, scale marks at 270, 540, 810, 1080 and 1350)]
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
The graphic representation gives a count of hit sequences (x-axis) and similarity
score (y-axis). The graph gives a visual clue about the proportion of similar and not
so similar sequences in the answer set.
Recommendation: keep ALL answers
41
The 7 basic steps of USGENE BLAST
4) SORT by SCORE descending (L3)
• SOR L2 SCORE D
• Option: limit using text terms and/or dates (L4)
• Remember to SORT L4 SCORE D !! (L5)
42
4) SORT by SCORE descending
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L2   RUN STATEMENT CREATED
L2        1350 VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQT
               SFCFSDSIPTPSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANN
               LVYDTSDSDDYHLLKDLEEGIQTLMGRLEDGSRRTGQILKQTYSKFDTNS
               HNHDALLKNYGLLYCFRKDMDKVETFLRMVQCRSVEGSCGF/SQP.-F F
Answer set arranged by accession number; to sort by descending
similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L2
L3        1350 SOR L2 SCORE D
Use SORT SCORE D to sort by descending BLAST score.
43
The 7 basic steps of USGENE BLAST
5) Review answers using a free-of-charge format
including alignment (ALIGN), while “parked” in
the STNGUIDE℠ file
• D L5 TRI ORGN ALIGN 1-
• FILE STNGUIDE
44
5) Review answers with a free-of-charge
format including alignment
=> D L3 TRI ORGN ALIGN 1-30; FILE STNGUIDE
L3   ANSWER 1 OF 1350 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI   Recombinant DNA transfer vectors (Patent)
MTY  Protein
SQL  191
ORGN Unknown
BLASTALIGN
Query = 191 letters
Length = 191
Score = 390 bits (1001), Expect = e-113
Identities = 191/191 (100%), Positives = 191/191 (100%)
Query: 1  VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
          VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Sbjct: 1  VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Query: 61 PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
          PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
Sbjct: 61 PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
. . . .
This top hit comes from a U.S. issued patent.
45
5) Review answers with a free-of-charge
format including alignment
L3   ANSWER 5 OF 1350 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI   Genetic polymorphisms associated with myocardial infarction, methods
     of detection and uses thereof (PublishedApplication)
MTY  Protein
SQL  217
ORGN Homo Sapiens
BLASTALIGN
Query = 191 letters
Length = 217
Score = 387 bits (995), Expect = e-113
Identities = 189/191 (98%), Positives = 191/191 (100%)
Query: 1   VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
           VQTVPLSRLFDHAML+AHRAH+LAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Sbjct: 1   VQTVPLSRLFDHAMLQAHRAHQLAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Query: 61  PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
           PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
Sbjct: 61  PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
Query: 121 IQTLMGRLEDGSRRTGQILKQTYSKFDTNSHNHDALLKNYGLLYCFRKDMDKVETFLRMV
           IQTLMGRLEDGSRRTGQILKQTYSKFDTNSHNHDALLKNYGLLYCFRKDMDKVETFLRMV
Sbjct: 147 IQTLMGRLEDGSRRTGQILKQTYSKFDTNSHNHDALLKNYGLLYCFRKDMDKVETFLRMV
. . . .
The 5th from top hit comes from a U.S. published application.
BLAST alignment details are explained on the next slide. . . .
46
Understanding BLAST alignments
Query        the length of the query sequence
Length       the length of the answer sequence
Score        a relative score assigned by BLAST
Expect       Expectation Value – a value representing the
             chance that an answer is a random hit. The closer
             to zero, the less likely the hit is random
Identities   the number of exact letter matches between query
             and answer within the displayed local alignment.
             The amino acid letter is repeated* in the display
Positives    a combination of identities and amino acid family
             matches shown with + (plus) in the alignment
Gaps         shown as dashes - where BLAST must break the
             query or answer to maintain an alignment
(* For nucleic acid searches a vertical bar is used to indicate nucleotide identities in the alignment display.)
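When alignment displays are saved in a session transcript, the header values defined above can be pulled out with a regular expression. The Python sketch below is an illustration only: the function name is hypothetical, and the pattern assumes the two-line "Score = … / Identities = …" layout shown in the BLASTALIGN examples in this deck, including the "e-113" style in which the leading 1 of the Expect value is omitted.

```python
import re

# Pattern for the two header lines of a BLASTALIGN display (assumed layout).
HEADER = re.compile(
    r"Score\s*=\s*(?P<bits>[\d.]+)\s*bits\s*\((?P<raw>\d+)\),\s*"
    r"Expect\s*=\s*(?P<expect>[\d.eE+-]+)\s*"
    r"Identities\s*=\s*(?P<ident>\d+)/(?P<length>\d+)\s*\(\d+%\),\s*"
    r"Positives\s*=\s*(?P<pos>\d+)/\d+"
)


def parse_alignment_header(text: str) -> dict:
    """Extract score, expect value, identities and positives from a header."""
    match = HEADER.search(text)
    if match is None:
        raise ValueError("no BLAST alignment header found")
    expect = match["expect"]
    if expect.lower().startswith("e"):  # "e-113" is shorthand for "1e-113"
        expect = "1" + expect
    return {
        "bits": float(match["bits"]),
        "raw_score": int(match["raw"]),
        "expect": float(expect),
        "identities": int(match["ident"]),
        "positives": int(match["pos"]),
        "aligned_length": int(match["length"]),
    }
```

Dividing `identities` by `aligned_length` gives the identity fraction over the displayed local alignment, which is the figure BLAST prints as a percentage.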
47
USGENE provides text search options for
refining sequence searches
• The USGENE default text search index – known
on STN as the Basic Index (/BI) – comprises
– Original publication Title (/TI) and abstract (/AB)
– Organism name (/ORGN) and Molecule Type (/MTY)
• The Exemplary Claim (/ECLM) and Feature
Table (/FEAT) can also be added to a search
– Either specify the fields: => S VIRUS/BI,FEAT
– Or use SET SFIELDS: => SET SFIELDS BI ECLM
• The Basic Index and Feature Table both offer
simultaneous left and right truncation (SLART)
48
USGENE provides bibliographic search
options for refining sequence searches
• Patent Assignee (/PA) and Inventor (/IN)
– Examples: GLAXO/PA, SMITH JOHN/IN
• Granted or application Sequence Source (/SSO)
– Examples: APPLICATION/SSO, GRANTED/SSO
• Publication date (/PD) or publication year (/PY)
– Examples: PY < 2001, PD < 1 Mar 1995
• Application date (/AD) or application year (/AY)
– Examples: AY < 2002, AD < 1 Mar 1998
• WO application date (/RLD) or year (/RLY)
– Examples: RLY < 1993, RLD < 1 Aug 1986
49
Option: refine USGENE BLAST results with
text and/or date search terms
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L2   RUN STATEMENT CREATED
L2        1350 VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQT
               SFCFSDSIPTPSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANN
               LVYDTSDSDDYHLLKDLEEGIQTLMGRLEDGSRRTGQILKQTYSKFDTNS
               HNHDALLKNYGLLYCFRKDMDKVETFLRMVQCRSVEGSCGF/SQP.-F F
Answer set arranged by accession number; to sort by descending
similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L2
L3        1350 SOR L2 SCORE D
=> S L2 AND SOMATOMAMMOTROPIN/BI,ECLM AND AY<1996 AND GRANTED/SSO
L4           7 L2 AND SOMATOMAMMOTROPIN AND AY<1996 AND GRANTED/SSO
=> SOR SCORE D
PROCESSING COMPLETED FOR L4
L5           7 SOR L4 SCORE D
The BLAST search (L2) is further refined to sequences from granted patents, with
application year prior to 1996, and to a specific text search term (L4).
If you limit using text and/or date terms
remember to SORT SCORE D again!
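The refine-then-re-sort pattern can also be applied to hit data that has been exported from a session. The Python sketch below only mirrors the logic of the L4 and L5 steps; it does not call STN, and the record layout (keys `score`, `application_year`, `sso` and `text`) is a hypothetical one chosen for the illustration.

```python
def refine_and_sort(hits, term, before_year, source):
    """Keep hits matching a text term, an application-year limit and a
    sequence source, then sort them by descending similarity score.

    Each hit is a dict with hypothetical keys: 'score', 'application_year',
    'sso' (for example 'PROTEIN; EMBL; GRANTED') and 'text'.
    """
    kept = [
        hit
        for hit in hits
        if term.lower() in hit["text"].lower()
        and hit["application_year"] < before_year
        and source in hit["sso"]
    ]
    # Filtering does not order the result by score, so sort again,
    # just as SOR SCORE D is repeated after the text and date refinement.
    return sorted(kept, key=lambda hit: hit["score"], reverse=True)
```

Calling `refine_and_sort(hits, "somatomammotropin", 1996, "GRANTED")` corresponds to the refinement shown above, followed by the second sort.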
50
The 7 basic steps of USGENE BLAST
6) Display selected relevant answers in a
bibliographic format including alignment
• D L5 BIB AB CLM ALIGN 1 5 6
7) Ensure your STN Express session transcript
was captured and then logoff
51
6) Display selected USGENE answers in a
preferred bibliographic format
=> D BIB AB CLM ORGN SSO ALIGN 1 3 5
L5   ANSWER 1 OF 7 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN   4363877.1   Protein   USGENE
TI   Recombinant DNA transfer vectors (Patent)
IN   Goodman Howard M. (San Francisco, CA); Shine John (San Francisco, CA);
     Seeburg Peter H. (San Francisco, CA)
PA   The Regents of the University of California (Berkeley CA)
PI   US 4363877       A   19821214
AI   US 1978-897710       19780419
AB   Recombinant DNA transfer vectors containing codons for human
     somatomammotropin and for human growth hormone.
CLM  US4363877 A: What is claimed is:
     1. A recombinant DNA transfer vector comprising codons for human
     chorionic somatomammotropin comprising the nucleotide . . . .
ORGN Unknown
SSO  PROTEIN; EMBL; GRANTED
BLASTALIGN . . . .
This sequence hit comes from a U.S. granted patent, with an application date prior
to 1996, and a key concept in the abstract and claims.
Note: this USGENE sequence record, sourced from EMBL, is an example of one which
is not indexed in DGENE or REGISTRY.
52
Useful USGENE display fields/formats
TRIAL*   Title, Molecule Type, Sequence Length
SCAN*    Random Title
ALIGN*   BLAST/GETSIM Sequence Alignment
SCORE*   Similarity Score (for post-processing)
BIB      Inventors, Assignees, numbers, dates
AB       Original abstract
ECLM     Exemplary (1st) claim text
CLM      All claims text
BRIEF    BIB + AB + ECLM, sequence, sequence
         source (SSO), feature table (FEAT)
ALL      BRIEF with CLM instead of ECLM
(* Free of charge display formats in USGENE.)
53
The importance of using the correct BLAST
advanced options
=> RUN BLAST GSSFLSPEHQR/SQP
BLAST Version 2.2 . . . .
NO ANSWERS FOUND BELOW THRESHOLD OF 10
=> RUN BLAST GSSFLSPEHQR/SQP -M PAM30 -W 2 -E 1000 -F F
BLAST Version 2.2 . . . .
712 ANSWERS FOUND BELOW EXPECTATION VALUE OF 1000.0
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L1   RUN STATEMENT CREATED
L1         712 GSSFLSPEHQR/SQP.-M PAM30 -W 2 -E 1000 -F F
Answer set arranged by accession number; to sort by descending
similarity score, enter at an arrow prompt (=>) "sor score d".
Changing BLAST options is especially
important for short sequence queries!
54
The importance of using the correct BLAST
advanced options (cont.)
=> SOR L1 SCORE D
PROCESSING COMPLETED FOR L1
L2         712 SOR L1 SCORE D
=> D TRI ALIGN
L2   ANSWER 1 OF 712 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI   Fluorescently labeled growth hormone secretagogue (Patent)
MTY  Protein
SQL  18
BLASTALIGN
Query = 11 letters
Length = 18
Score = 37.5 bits (81), Expect = 1e-09
Identities = 11/11 (100%), Positives = 11/11 (100%)
Query: 1 GSSFLSPEHQR 11
         GSSFLSPEHQR
Sbjct: 1 GSSFLSPEHQR 11
Correct use of BLAST options
finds relevant sequence hits.
55
Review: 7 steps of USGENE BLAST
1) SAVE, UPLOAD, and VERIFY the query (L1)
2) RUN the BLAST search (/SQP or /SQN)
3) Decide how many answers to keep (L2)
4) SORT SCORE in Descending order (L3)
5) Review answers in a free-of-charge format,
   e.g. D L3 TRI ORGN ALIGN 1-
6) Display selected answers in bibliographic
   format, e.g. D L3 BIB AB CLM ALIGN 1,3,10
7) Ensure transcript was captured and Logoff
56
Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• Similarity searching GETSIM (FASTA)
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
57
USGENE answer sets may be grouped by source
publications using Family SORT (FSORT)
• FSORT gathers multiple sequence hits from the same
applications together via publication, application and/or
WO/PCT related application numbers
• FSORT organizes answers into two subgroups: multiple
sequence hit (multi-record) families and single sequence
hit (individual-record) families
• When FSORT is used on an answer set previously
sorted by similarity SCORE, the two FSORT subgroups
each separately retain their similarity sort order
• FSORT makes it possible to review, e.g. just the most
similar sequence answer for each application retrieved,
or all the sequences from a single application
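The grouping behaviour described above can be pictured with a simplified Python sketch. It is not the FSORT implementation: the function name is hypothetical, and it groups hits only by the publication number embedded in the accession number, whereas FSORT also links records through application and WO/PCT related application numbers.

```python
from collections import defaultdict


def family_sort(hits):
    """Group score-sorted hits by publication and split them into
    multi-record and single-record families.

    `hits` is a list of (accession_number, score) pairs already sorted by
    descending score. Simplification: the family key is the part of the
    accession number before the last dot.
    """
    families = defaultdict(list)
    for accession, score in hits:
        publication = accession.rpartition(".")[0]
        families[publication].append((accession, score))
    # Dicts keep insertion order, so families appear in order of their
    # best hit and each family keeps its own descending score order.
    multi = [records for records in families.values() if len(records) > 1]
    single = [records for records in families.values() if len(records) == 1]
    return multi, single
```

Taking the first record of each family then gives the most similar sequence for each publication retrieved.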
58
USGENE answer sets may be grouped by source
publications using Family SORT (FSORT)
Search Question:
Find all relevant U.S. published application and
patent references with sequences similar to the
Banana Bunchy Top Virus (BBTV) Replication
Initiation Protein (NCBI: AAG44003).
59
Banana Bunchy Top Virus (BBTV) Replication
Initiation Protein (NCBI: AAG44003)
60
SAVE, UPLOAD and VERIFY
=> FILE PCTGEN
=> UPL R BLAST
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
=> D L1 LQUE
L1   ANSWER 1 PCTGEN COPYRIGHT 2007 WIPO on STN
LQUE MSSFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIK
     LGGLKKKYSSRAHWERARGSDEDNAKYCSKETLILELGFPASQGSNRRKLSEMVSRSPE
     RMRIEQPEIYHRYTSVKKLKKFKEEFVHPCLDRPWQIQLTEAIDEEPDDRSIIWVYGPN
     GNEGKSTYAKSLMKKDWFYTRGGKKENILFSYVDEGSEKHIVFDIPRCNQDYLNYDVIE
     ALKDRVIESTKYKPIKLVELINIHVIVMANFMPEFCKISEDRIKIIYC
=>
These commands are automatically run by the STN
Express Sequence Query Upload wizard (slides 27-31).
There are 17 sequence records in DGENE for CA2325774.
The sequence query is now ready for searching
directly in USGENE using the L-number (L1).
61
RUN the USGENE BLAST search
=> FILE USGENE
FILE 'USGENE' ENTERED AT 22:44:51 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED:              2 OCT 2007   <20071002/UP>
MOST RECENT PUBLICATION DATE:  27 SEP 2007   <20070927/PD>
FILE COVERS 1982 TO DATE
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE
IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN BLAST L1 /SQP -F F
BLAST Version 2.2
The BLAST software is used herein with permission of the
National Center for Biotechnology Information (NCBI) of
the National Library of Medicine (NLM). See also, . . . .
BLAST SEARCHING . . . .
USGENE is updated within 7 days of publication by the USPTO.
Turn the Low Complexity Filter off with the syntax… /SQP -F F
62
Decide how many answers to keep
57 ANSWERS FOUND BELOW EXPECTATION VALUE OF 10.0
[Histogram: Similarity Score (y-axis, scale marks at 260 and 520) against Answer Count (x-axis, scale marks at 10, 20, 30, 40 and 50)]
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
The graphic representation gives a count of hit sequences (x-axis) and similarity
score (y-axis). The graph gives a visual clue about the proportion of similar and not
so similar sequences in the answer set.
Recommendation: keep ALL answers
63
SORT by SCORE descending
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L2   RUN STATEMENT CREATED
L2          57 MSSFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQG
               YLSLKKSIKLGGLKKKYSSRAHWERARGSDEDNAKYCSKETLILELGFPA
               SQGSNRRKLSEMVSRSPERMRIEQPEIYHRYTSVKKLKKFKEEFVHPCLD
               RPWQIQLTEAIDEEPDDRSIIWVYGPNGNEGKSTYAKSLMKKDWFYTRGG
               KKENILFSYVDEGSEKHIVFDIPRCNQDYLNYDVIEALKDRVIESTKYKP
               IKLVELINIHVIVMANFMPEFCKISEDRIKIIYC/SQP.-F F
Answer set arranged by accession number; to sort by descending
similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L2
L3          57 SOR L2 SCORE D
=> SET FORMAT .MYUSGENE BIB AB ECLM ORGN SQL SCORE ALIGN
SET COMMAND COMPLETED
=> SET DFORMAT .MYUSGENE
SET COMMAND COMPLETED
Option: set a customized display format
with SET FORMAT. The new format may be
set as the file default with SET DFORMAT.
64
Display selected USGENE answers using
the new customized default display format
=> D 1-2
L3   ANSWER 1 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN   5846705.16 Protein USGENE
TI   Nucleotide sequence of two circular SSDNA associated with banana
     bunchy top virus and method for detection of banana bunchy top virus
     (Patent)
IN   Wu Rey-Yuh (Taipei, TW); You Li-Ru (Taipei, TW); Soong Tai-Seng
     (Taipei, TW)
PA   Development Center for Biotechnology (Taipei TW)
PI   US 5846705 A 19981208
AI   US 1995-418071 19950406
AB   Nucleotide sequences of two circular single-stranded DNAs . . . .
ECLM US5846705 A: 1. An isolated DNA molecule comprising a . . . .
ORGN Unknown
SQL  286
SCORE 520
BLASTALIGN
Query = 284 letters
Length = 286
Score = 520 bits (1338), Expect = e-152
Identities = 247/282 (87%), Positives = 268/282 (94%)
Query: 3 SFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKLGG
         S KWCFTLNYSSAAERE+FL+LLKEE+++YAVVGDEVAP++GQKHLQGYLSLKK I+LGG
Sbjct: 5 SLKWCFTLNYSSAAERENFLSLLKEEDVHYAVVGDEVAPATGQKHLQGYLSLKKRIRLGG
. . . . .
The top hit is SEQ ID 16 from US5846705.
65
The second hit sequence comes from the
same U.S. patent as the top hit
L3   ANSWER 2 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN   5846705.17 Protein USGENE
TI   Nucleotide sequence of two circular SSDNA associated with banana
     bunchy top virus and method for detection of banana bunchy top virus
     (Patent)
IN   Wu Rey-Yuh (Taipei, TW); You Li-Ru (Taipei, TW); Soong Tai-Seng
     (Taipei, TW)
PA   Development Center for Biotechnology (Taipei TW)
PI   US 5846705 A 19981208
AI   US 1995-418071 19950406
AB   Nucleotide sequences of two circular single-stranded DNAs . . . .
ECLM US5846705 A: 1. An isolated DNA molecule comprising a nucleotide
     sequence encoding a polypeptide comprising amino acid . . . .
ORGN Unknown
SQL  285
SCORE 340
BLASTALIGN
Query = 284 letters
Length = 285
Score = 340 bits (872), Expect = 2e-98
Identities = 171/288 (59%), Positives = 217/288 (74%), Gaps = 7/288
Query: 1 MSSFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKL
         MSSFKWCFTLNYSSAAEREDFLALLKEE+++Y+VVGDEVAP++GQKHL GYLSLKKSI+L
Sbjct: 1 MSSFKWCFTLNYSSAAEREDFLALLKEEDVHYSVVGDEVAPATGQKHLGGYLSLKKSIRL
. . . . .
The 2nd hit is SEQ ID 17 from US5846705.
66
USGENE answer sets may be grouped by source
publications using Family SORT (FSORT)
=> FSORT L3
SEL L3 1- PN,APPS
L4 SEL L3 1- PN APPS : 45 TERMS
'L4' DELETED
L4 57 FSO L3
12 Multi-record Families    Answers 1-50
   Family 1                 Answers 1-3
   Family 2                 Answers 4-5
   Family 3                 Answers 6-7
   Family 4                 Answers 8-13
   Family 5                 Answers 14-19
   Family 6                 Answers 20-25
   Family 7                 Answers 26-31
   Family 8                 Answers 32-37
   Family 9                 Answers 38-39
   Family 10                Answers 40-44
   Family 11                Answers 45-47
   Family 12                Answers 48-50
7 Individual Records        Answers 51-57
0 Non-patent Records
The 57 sequence hits belong to 12 multi-hit
and 7 individual-hit source publications.
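The grouping step can be pictured with a short Python sketch. It is an illustration only, not how STN implements FSORT: it assumes that the part of a USGENE accession number before the dot stands for the source publication (STN itself groups on PN and APPS), and it uses the three scores shown in this example.

```python
def family_sort(hits):
    """Group (accession, score) hits by source publication.

    Assumption: the text before the dot in an accession number such as
    '5846705.16' identifies the publication. STN FSORT works on PN and APPS.
    """
    # Equivalent in spirit to SOR SCORE D: best scores first.
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)
    families = {}
    for accession, score in ranked:
        publication = accession.split(".")[0]
        families.setdefault(publication, []).append((accession, score))
    # Dicts keep insertion order, so families are ordered by their best hit.
    return list(families.values())


hits = [("5756708.26", 243), ("5846705.17", 340), ("5846705.16", 520)]
families = family_sort(hits)
# First member of each family, comparable to D PFAM=1-2.
best_per_family = [members[0] for members in families]
print(best_per_family)
```

Taking the first member of each group mirrors what a PFAM display of one record per family shows.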
67
Use the patent family display (PFAM) feature to
display selected records from an FSORT L-number
General format of PFAM:
=> D L# PFAM=# RECORD# FORMAT
Examples using PFAM:
=> D PFAM=1-10
1st member of patent family number 1-10 in
default display format
=> D PFAM=2 TRI ORGN ALIGN 1-TOTAL
All members of family number 2 in a free
sequence review format
68
The top answer is the same as before….
=> D PFAM=1-2
The first record from families 1 & 2 in default format.
L4   ANSWER 1 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY1
AN   5846705.16 Protein USGENE
TI   Nucleotide sequence of two circular SSDNA associated with banana
     bunchy top virus and method for detection of banana bunchy top virus
     (Patent)
IN   Wu Rey-Yuh (Taipei, TW); You Li-Ru (Taipei, TW); Soong Tai-Seng
     (Taipei, TW)
PA   Development Center for Biotechnology (Taipei TW)
PI   US 5846705 A 19981208
AI   US 1995-418071 19950406
AB   Nucleotide sequences of two circular single-stranded DNAs . . . .
ECLM US5846705 A: 1. An isolated DNA molecule comprising a . . . .
ORGN Unknown
SQL  286
SCORE 520
BLASTALIGN
Query = 284 letters
Length = 286
Score = 520 bits (1338), Expect = e-152
Identities = 247/282 (87%), Positives = 268/282 (94%)
Query: 3 SFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKLGG
         S KWCFTLNYSSAAERE+FL+LLKEE+++YAVVGDEVAP++GQKHLQGYLSLKK I+LGG
Sbjct: 5 SLKWCFTLNYSSAAERENFLSLLKEEDVHYAVVGDEVAPATGQKHLQGYLSLKKRIRLGG
. . . . .
The top hit is SEQ ID 16 from US5846705.
69
…but the second answer displayed is now
the best answer from the 2nd family
L4   ANSWER 4 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY2
AN   5756708.26 Protein USGENE
TI   DNA sequences of banana bunchy top virus (Patent)
IN   Karan Mirko (Holland Park, AU); Burns Thomas Michael (Herston, AU);
     Dale James Langham (Moggill, AU); Harding Robert Maxwell (Lawnton, AU)
PA   Queensland University of Technology (Brisbane AU)
PI   US 5756708 A 19980526
AI   US 1994-202186 19940224
DT   Patent
AB   The invention provides DNA molecules consisting essentially of a
     nucleotide sequence or part thereof which are associated . . . .
ECLM US5756708 A: 1. An isolated DNA molecule derived from banana bunchy
     top virus, consisting of a nucleotide sequence selected . . . .
ORGN Unknown
SQL  290
SCORE 243
BLASTALIGN
Query = 284 letters
Length = 290
Score = 243 bits (621), Expect = 3e-69
Identities = 117/282 (41%), Positives = 183/282 (64%), Gaps = 6/282
Query: 5 KWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKLGGLK
         +WCFTLNY +  E  + +  ++   L YA+VGDEVAPS+GQ+HLQG++ LK   +L GLK
Sbjct: 7 RWCFTLNYETEEEAANVVRRIESLNLVYAIVGDEVAPSTGQRHLQGFIHLKTGRRLQGLK
. . . . .
The 2nd hit is now SEQ ID 26 from US5756708.
70
Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• Similarity searching GETSIM (FASTA)
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
71
STN Express 8.2+ post-processing tools
• Table Tool to create tabulated results
– Good for scanning/reviewing search results
• Predefined Report Tool for a report using a
Standard Patent Record layout
– Easy way to tidy up your patent results for a client
• Customized Report Tool to control all options
– E.g. fonts, cover page, which data fields to include
72
USGENE results may be tabulated using
STN Express 8.2+ Table Tool
Search Question:
Find all relevant U.S. published application and
patent references with sequences similar to the
Human osteoprotegerin (OPG) mRNA, complete
CDS (NCBI: U94332).
73
Human osteoprotegerin (OPG) mRNA,
complete CDS (NCBI: U94332)
74
Ensure you capture your STN session
Record your session as
a Transcript (.TRN) file
or as an RTF file.
75
SAVE, UPLOAD and VERIFY
=> FILE PCTGEN
=> UPL R BLAST
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
These commands are automatically run by the STN Express
Sequence Query Upload wizard (slides 27-31).
=> D L1 LQUE
L1   ANSWER 1 PCTGEN COPYRIGHT 2007 WIPO on STN
LQUE gtatatataacgtgatgagcgtacgggtgcggagacgcaccggagcgctcgcccagccg
     ccgctccaagcccctgaggtttccggggaccacaatgaacaagttgctgtgctgcgcgc
     tcgtgtttctggacatctccattaagtggaccacccaggaaacgtttcctccaaagtac
     . . . . .
     tggccattgagctgtttcctcacaattggcgagatcccatggatgataa
The sequence query is now ready for searching
directly in USGENE using the L-number (L1).
76
RUN the USGENE BLAST search
=> FILE USGENE
FILE 'USGENE' ENTERED AT 04:38:02 ON 10 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 8 OCT 2007 <20071008/UP>
MOST RECENT PUBLICATION DATE: 4 OCT 2007 <20071004/PD>
FILE COVERS 1982 TO DATE
USGENE is updated within 7 days of publication by the USPTO.
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE
IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN BLAST L1 /SQN -F F
BLAST Version 2.2
Turn the Low Complexity Filter off
with the syntax /SQN -F F
The BLAST software is used herein with permission of the
National Center for Biotechnology Information (NCBI) of
the National Library of Medicine (NLM). See also, . . . .
BLAST SEARCHING . . . .
77
Decide how many answers to keep
554 ANSWERS FOUND BELOW EXPECTATION VALUE OF 10.0
[Histogram: Similarity Score (y-axis, marked at 1343 and 2686) against Answer Count (x-axis, marked at 110, 220, 330, 440 and 550)]
The graphic representation gives a count of hit sequences (x-axis) and
similarity score (y-axis). The graph gives a visual clue about the proportion
of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
78
SORT by SCORE descending
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L2
RUN STATEMENT CREATED
L2
554 GTATATATAACGTGATGAGCGTACGGGTGCGGAGACGCACCGGAGCGCTC
GCCCAGCCGCCGCTCCAAGCCCCTGAGGTTTCCGGGGACCACAATGAACA
. . . . .
TGGAAATGGCCATTGAGCTGTTTCCTCACAATTGGCGAGATCCCATGGAT
GATAA/SQN.-F F
Answer set arranged by accession number; to sort by descending
similarity score, enter at an arrow prompt (=>) "sor score d".
=> SET SFIELDS BI ECLM PERM
SET COMMAND COMPLETED
Use SET SFIELDS to change the
USGENE default search index.
=> S L2 AND (OSTEO? OR BONE) AND GRANTED/SSO AND AY<2001
L3
245 L2 AND (OSTEO?/BI,ECLM OR BONE/BI,ECLM) AND GRANTED/SSO AND
AY<2001
=> SOR SCORE D
PROCESSING COMPLETED FOR L3
L4
245 SOR L3 SCORE D
After refining using date and text
terms remember to SOR SCORE D.
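As a rough picture of what this refinement does, the Python sketch below applies the same three conditions to a small in-memory list and then re-sorts by score. The field names (`text`, `sso`, `ay`, `score`) are hypothetical, the two "HYPOTHETICAL" records are invented for contrast, and the GRANTED status of the two real records is inferred from their B1 kind codes rather than shown on the slides.

```python
import re

# OSTEO? OR BONE: the STN "?" truncation is approximated with \w*.
TEXT_TERMS = re.compile(r"\b(osteo\w*|bone)\b", re.IGNORECASE)


def keep(record):
    """Mimic: L2 AND (OSTEO? OR BONE) AND GRANTED/SSO AND AY<2001."""
    return (
        bool(TEXT_TERMS.search(record["text"]))
        and "GRANTED" in record["sso"]
        and record["ay"] < 2001
    )


def refine(records):
    kept = [r for r in records if keep(r)]
    # Refining loses the score order, so sort again (like SOR SCORE D).
    return sorted(kept, key=lambda r: r["score"], reverse=True)


records = [
    {"pn": "US 6919434", "text": "A protein which inhibits osteoclast ...",
     "sso": {"GRANTED"}, "ay": 1999, "score": 2375},
    {"pn": "HYPOTHETICAL-1", "text": "Kinase assay",
     "sso": {"GRANTED"}, "ay": 1998, "score": 2500},
    {"pn": "US 6284740", "text": "Osteoprotegerin",
     "sso": {"GRANTED"}, "ay": 1997, "score": 2686},
    {"pn": "HYPOTHETICAL-2", "text": "Bone growth factor",
     "sso": {"GRANTED"}, "ay": 2003, "score": 2600},
]
refined = [r["pn"] for r in refine(records)]
print(refined)
```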
79
Grouped by source publications using
Family SORT (FSORT)
=> FSORT L4
SEL L4 1- PN,APPS
L5 SEL L4 1- PN APPS : 25 TERMS
L5 245 FSO L4
11 Multi-record Families    Answers 1-244
   Family 1                 Answers 1-11
   Family 2                 Answers 12-22
   Family 3                 Answers 23-33
   Family 4                 Answers 34-44
   Family 5                 Answers 45-71
   Family 6                 Answers 72-83
   Family 7                 Answers 84-118
   Family 8                 Answers 119-179
   Family 9                 Answers 180-240
   Family 10                Answers 241-242
   Family 11                Answers 243-244
1 Individual Record         Answer 245
0 Non-patent Records
The 245 sequence hits belong to 11 multi-hit
and 1 individual-hit source publications.
80
Reviewing the SCORE display can be one
way to identify answers of interest
=> D PFAM=1- SCORE
The SCORE for the best answer from each family.
L5 ANSWER 1 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY1
SCORE 2686
L5 ANSWER 12 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY2
SCORE 2686
. . . . .
L5 ANSWER 119 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY8
SCORE 2375
L5 ANSWER 180 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY9
SCORE 2375
L5 ANSWER 241 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY10
SCORE 40
L5 ANSWER 243 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY11
SCORE 40
L5 ANSWER 245 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
SCORE 2686
Note: the FSORT individual-hit record also has a top score.
81
Use the PFAM feature to display selective
records from an FSORT L-number
=> D PFAM=1-9,12
L5   ANSWER 1 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY1
AN   6284740.5 cDNA USGENE
TI   Osteoprotegerin (Patent)
IN   Boyle William J. (Moorpark, CA); Lacey David L. (Thousand Oaks, CA);
     Calzone Frank J. (Westlake Village, CA); . . . .
PA   Amgen Inc (Thousand Oaks CA)
PI   US 6284740 B1 20010904
AI   US 1997-974186 19971118
AB   The present invention discloses a novel secreted polypeptide, termed
     Osteoprotegerin, which is a member of the tumor necrosis . . . .
ECLM US6284740 B1: What is claimed is: 1. A method of increasing levels of
     osteoprotegerin in a mammal comprising administering to . . . .
ORGN not provided
SQL  1355
SCORE 2686
BLASTALIGN
Query = 1355 letters
Length = 1355
Score = 2686 bits (1355), Expect = 0.0
Identities = 1355/1355 (100%)
Strand = Plus / Plus
Query: 1 gtatatataacgtgatgagcgtacgggtgcggagacgcaccggagcgctcgcccagccgc
         ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1 gtatatataacgtgatgagcgtacgggtgcggagacgcaccggagcgctcgcccagccgc
The top hit is SEQ ID 5 from US6284740.
82
Use the PFAM feature to display selective
records from an FSORT L-number (cont.)
L5   ANSWER 180 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY9
AN   6919434.6 DNA USGENE
TI   Monoclonal antibodies that bind OCIF (Patent)
IN   Goto Masaaki (Tochigi, JP); Tsuda Eisuke (Tochigi, JP); . . . .
PA   Sankyo Co Ltd (Tokyo JP)
PI   US 6919434 B1 20050719
AI   US 1999-338063 19990623
AB   A protein which inhibits osteoclast diffraction and/or maturation and
     a method for producing the protein. The protein is produced by human
     embryonic lung fibroblasts and has a molecular weight of . . . .
ECLM US6919434 B1: 1. An isolated monoclonal antibody produced by a
     hybridoma selected from the group consisting of A1G5 having Accession
     No. FERM BP-7441, D2F4 having Accession No. FERM BP-7442, . . . .
ORGN Unknown
SQL  1206
SCORE 2375
BLASTALIGN
Query = 1355 letters
Length = 1206
Score = 2375 bits (1198), Expect = 0.0
Identities = 1204/1206 (99%)
Strand = Plus / Plus
Query: 94 atgaacaagttgctgtgctgcgcgctcgtgtttctggacatctccattaagtggaccacc
          |||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1  atgaacaacttgctgtgctgcgcgctcgtgtttctggacatctccattaagtggaccacc
This hit is SEQ ID 6 from US6919434.
83
After logging off from STN select the table
tool from the main STN Express tool bar
The most recent
Transcript is
automatically
selected.
84
If available choose any template you have
defined previously
The first time you
use the table tool,
no templates have
been defined yet.
85
Choose a previously defined template
Pick the chosen
answer set L-number
and record numbers.
86
Set highlighting preferences
Extra terms that were
not originally searched
may be highlighted.
87
Set up report cover page
Here, we have decided
not to add a cover page.
88
Select fields, fonts, colors, change field order,
customize field names and save templates
Choose fields, order, formats
and personalized names.
89
STN Express Table Tool output can be
edited and adjusted as needed
If needed, go back and edit your
choice of fields, formats, etc.
90
STN Express Table Tool output can be
edited, adjusted and saved in Excel format
Note: see separate appendix
for the full printout of this table.
91
Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• Similarity searching GETSIM (FASTA)
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
92
Sequence code match (SCM) searching in
USGENE using RUN GETSEQ
• GETSEQ is designed to retrieve either exact
matches to a sequence query, or answers with
conservative variation using special symbols
• It can also be used to retrieve exact length
matches, or subsequence hits, i.e. where the
query is a small part of a larger hit sequence
• GETSEQ can prove to be a fast, precise and
effective alternative to BLAST for very short
sequence queries, e.g. DNA probes and primers
The DGENE Workshop Manual is the complete guide (page 38):
http://www.stn-international.com/training_center/bioseq/dgene_wm.pdf
93
Sequence code match (SCM) searching in
USGENE using RUN GETSEQ
Search Question:
Find all relevant U.S. published application and
patent references which were applied for prior to
2001, disclosing sequences with this fragment:
DSDGLAPPQHLIRV
94
RUN GETSEQ command syntax
Sequence Code Match searching with GETSEQ
=> RUN GETSEQ L1 (sequence or L-number)
/SQEP (exact protein) (default)
/SQEFP (exact family protein)
/SQSP (subsequence protein)
/SQSFP (subsequence family protein)
/SQEN (exact nucleotide)
/SQSN (subsequence nucleotide)
95
Amino acid families for RUN GETSEQ
SQEFP and SQSFP search options
Group                           Amino acids
Neutral – weakly hydrophobic    P, A, G, S, T
Acid Amine – hydrophilic        Q, N, E, D, B, Z
Basic – hydrophilic             H, K, R
Hydrophobic                     I, M, L, V
Aromatic                        F, W, Y
Cross-linking                   C
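A minimal Python sketch of how a family-level comparison could work with the groups in this table. It is an assumption about the matching idea, not the GETSEQ implementation: two residues are treated as equivalent when they are identical or fall in the same group, and any symbol outside the table only matches itself. The second sequence in the example is hypothetical.

```python
# The six groups from the table, written as strings of one-letter codes.
FAMILIES = ["PAGST", "QNEDBZ", "HKR", "IMLV", "FWY", "C"]
FAMILY_OF = {aa: index for index, group in enumerate(FAMILIES) for aa in group}


def same_family(a, b):
    """True if two residues are identical or share a family."""
    a, b = a.upper(), b.upper()
    if a == b:
        return True
    return a in FAMILY_OF and FAMILY_OF.get(a) == FAMILY_OF.get(b)


def family_match(query, candidate):
    """Exact-length comparison at family level (in the spirit of /SQEFP)."""
    return len(query) == len(candidate) and all(
        same_family(q, c) for q, c in zip(query, candidate)
    )


# Hypothetical variant: every position changed within its own family.
print(family_match("DSDGLAPPQHLIRV", "ESEGIAPPNHVLKV"))
```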
96
GETSEQ searches can be combined with
other search terms, e.g. application year
=> FILE USGENE
FILE 'USGENE' ENTERED AT 20:51:09 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE
IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN GETSEQ DSDGLAPPQHLIRV/SQSP
RUN GETSEQ AT 20:51:38 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 114 DSDGLAPPQHLIRV/SQSP
=> S L1 AND AY<2001
L2
72 L1 AND AY<2001
72 sequence hits (L2) containing the
sequence fragment of interest have
been found in USGENE.
97
The BRIEF format provides full bibliography
and abstract ….
=> D BRIEF
L2   ANSWER 1 OF 72 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN   6326464.44 Protein USGENE
TI   P53 protein variants and therapeutic uses thereof (Patent)
IN   Conseiller Emmanuel (Paris, FR); Bracco Laurent (Paris, FR)
PA   Aventis Pharma S A (Antony FR)
PI   US 6326464 B1 20011204
     WO 1997004092 A 19970206
AI   US 1998-983035 19980220
RLI  WO 1996-FR1111 19960717
ED   20070328
DT   Patent
AB   Proteins derived from the product of tumor suppressor gene p53 and
     having enhanced functions for therapeutical use are disclosed. The
     proteins advantageously have enhanced tumour suppressor and
     programmed cell death inducer functions, particularly in
     proliferative disease contexts where wild-type p53 protein is
     inactivated. Nucleic acids coding for such molecules, vectors
     containing same, and therapeutical use thereof, particularly in gene
     therapy, are also disclosed.
This sequence hit comes from a U.S. granted patent,
with an application date prior to 2001.
Continued on next slide….
98
…. plus the exemplary claim and sequence
D BRIEF (cont.)
ECLM US6326464 B1: What is claimed is: 1. A variant of p53 protein wherein
     a C-terminal portion of the protein comprising a regulation domain
     and a part of an oligomerization domain is deleted from residue 326
     or from residue 337 and replaced by an artificial leucine zipper
     comprising residues 334-363 of SEQ ID No: 26; and a transactivation
     domain is deleted and replaced by a VP16 transactivation domain.
SSO  PROTEIN; EMBL; GRANTED
ORGN Unknown
SQL  335
SEQ    1 mgeyftlqir grerfemfre lnealelkda qagkepgrgg ggsggggsgg
      51 ggsrpapaap tpaapapaps wplsssvpsq ktyqgsygfr lgflhsgtak
     101 svtctyspal nkmfcqlakt cpvqlwvdst pppgtrvram aiykqsqhmt
     151 evvrrcphhe rcsdsdglap pqhlirvegn lrveylddrn tfrhsvvvpy
                       ======= =======
     201 eppevgsdct tihynymcns scmggmnrrp iltiitleds sgnllgrnsf
     251 evrvcacpgr drrteeenlr kkgephhelp pgstkralpn ntssspqpkk
     301 kpldgdlkal keklkaleek lkaleeklka lvger
HITS AT: 164-177
The hit portion of the answer sequence
is highlighted with double underlining.
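The HITS AT positions can be reproduced with a plain substring search. The Python sketch below is an illustration of the subsequence idea only (GETSEQ's internals are not described here); the helper name `hits_at` is hypothetical and the sequence is the 335-residue record shown on this slide.

```python
# Sequence of the displayed record, in the 10-residue blocks of the SEQ field.
BLOCKS = """
mgeyftlqir grerfemfre lnealelkda qagkepgrgg ggsggggsgg
ggsrpapaap tpaapapaps wplsssvpsq ktyqgsygfr lgflhsgtak
svtctyspal nkmfcqlakt cpvqlwvdst pppgtrvram aiykqsqhmt
evvrrcphhe rcsdsdglap pqhlirvegn lrveylddrn tfrhsvvvpy
eppevgsdct tihynymcns scmggmnrrp iltiitleds sgnllgrnsf
evrvcacpgr drrteeenlr kkgephhelp pgstkralpn ntssspqpkk
kpldgdlkal keklkaleek lkaleeklka lvger
"""
sequence = "".join(BLOCKS.split())


def hits_at(query, seq):
    """Return 1-based, inclusive (start, end) spans of each exact occurrence."""
    query, seq = query.lower(), seq.lower()
    spans = []
    start = seq.find(query)
    while start != -1:
        spans.append((start + 1, start + len(query)))
        start = seq.find(query, start + 1)
    return spans


print(hits_at("DSDGLAPPQHLIRV", sequence))
```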
99
Sequence code match (SCM) searching in
USGENE using RUN GETSEQ
Search Question:
Find all relevant U.S. published application and
patent references disclosing one or more of the
sequences represented by this Markush:
LGPX1QLCX2LVX3CAP
X1 = V or L
X2 = any amino acid except G or H
X3 = any amino acid
100
Variability symbols for RUN GETSEQ
sequence code match searches
Symbol   Function
[]       Specify alternate residues
[-]      Exclude a specific residue or alternate residues
{}       Repeat the preceding symbol(s) (number or range)
?        Repeat the preceding symbol(s) zero or one time
*        Repeat the preceding symbol(s) zero or more times
+        Repeat the preceding symbol(s) one or more times
^        Query appears at the beginning or the end of a sequence
|        Alternate sequence expressions
.        A gap of one residue
:        A gap of zero or one residues
&        Concatenate (join together) sequence queries
101
GETSEQ can be a flexible alternative to
BLAST for short sequence queries
=> FILE USGENE
=> RUN GETSEQ LGP[VL]QLC[-GH]LV.CAP/SQSP
RUN GETSEQ AT 22:40:20 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 13 LGP[VL]QLC[-GH]LV.CAP/SQSP
13 sequence hits (L1) have been found in USGENE
containing the sequence fragment(s) of interest.
=> D TRI SEQ
L1   ANSWER 1 OF 13 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI   Method and composition for treating and preventing tumor metastasis
     in vivo (PublishedApplication)
MTY  Protein
SQL  20
SEQ    1 tvlllgplql calvhcappa
             ====== ========
HITS AT: 5-18
The hit portion of the answer sequence
is highlighted with double underlining.
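For readers who know regular expressions, the query above can be related to a regex. The Python sketch below is a loose analogy, not the GETSEQ syntax definition: it handles only the symbols used in this query plus the `:` gap, and the assumed equivalences (`[-..]` as a negated class, `.` as any one residue, `:` as an optional residue) are simplifications.

```python
import re


def getseq_to_regex(pattern):
    """Translate a small subset of GETSEQ symbols into a Python regex.

    Assumed equivalences: [..] alternates (unchanged), [-..] exclusion,
    '.' a gap of one residue (unchanged), ':' a gap of zero or one residues.
    Other GETSEQ symbols such as & or ^ are not handled in this sketch.
    """
    return pattern.replace("[-", "[^").replace(":", ".?")


def find_hits(pattern, seq):
    """Return 1-based, inclusive (start, end) spans of each match."""
    regex = re.compile(getseq_to_regex(pattern), re.IGNORECASE)
    return [(m.start() + 1, m.end()) for m in regex.finditer(seq)]


print(find_hits("LGP[VL]QLC[-GH]LV.CAP", "tvlllgplqlcalvhcappa"))
```

Run against the 20-residue answer shown on this slide, the sketch reports the same span as the HITS AT line.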
102
Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• GETSIM (FASTA) similarity searching
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
103
Similarity searching in USGENE using
FASTA-based RUN GETSIM
• GETSIM was originally developed by FIZ
Karlsruhe for DGENE, and it has since been
implemented in both PCTGEN and USGENE
• It is based on the industry standard FASTA
methodology, and offers the same basic search
modes as BLAST (/SQP, /SQN and /TSQN)
• Since GETSIM requires more computational
time than BLAST, it is usually a good idea to
make use of the offline BATCH search mode
The DGENE Workshop Manual is the complete guide (page 60):
http://www.stn-international.com/training_center/bioseq/dgene_wm.pdf
104
General differences between FASTA
(GETSIM) and BLAST algorithms
BLAST                                         FASTA (GETSIM)
Faster than FASTA                             Slower than BLAST
Equivalent for highly similar sequences       Equivalent for highly similar sequences
Misses some less similar sequences            Better for less similar sequences
Comparison of shorter sequence parts          Comparison of entire sequence length
Less sensitive when using default settings    More sensitive, misses fewer homologs
Less separation between true homologs         More separation between true homologs
and random hits                               and random hits
Calculates probabilities                      Calculates significance “on the fly” from
                                              the given dataset
105
Similarity searching in USGENE using
FASTA-based RUN GETSIM
Search Question:
Find sequences in U.S. published application
and patents which are similar to the following
nucleic acid query sequence:
GGGUUUAGGAGUGGUAGGUCUUACGA
UGCCAGCUGUAAUGCCUACCGGATAA
106
RUN GETSIM command syntax
Similarity Searching with GETSIM (protein/polypeptides)
=> RUN GETSIM L1 (sequence or L-number)
/SQP (protein) (default)
BATCH (offline)
ALERT (current awareness)
107
RUN GETSIM command syntax
Similarity Searching with GETSIM (nucleotides)
=> RUN GETSIM L1 (sequence or L-number)
/SQN (nucleotide)
SIN (single strand) (default)
COM (complementary strand)
BOTH (both strands)
BATCH (offline)
ALERT (current awareness)
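The three strand options can be pictured with a few lines of Python. This is a sketch under stated assumptions: it takes "complementary strand" to mean the reverse complement, reads U as T, and writes the result with DNA letters; how GETSIM handles these details internally is not stated on the slide. The function names are hypothetical.

```python
# Base pairing table; U is treated like T, output uses DNA letters.
COMPLEMENT = str.maketrans("ACGTUacgtu", "TGCAAtgcaa")


def complement_strand(seq):
    """Reverse complement of a nucleotide string (assumed meaning of COM)."""
    return seq.translate(COMPLEMENT)[::-1]


def strands_to_search(seq, mode="SIN"):
    """Return the query strand(s) implied by SIN, COM or BOTH."""
    if mode == "SIN":
        return [seq]
    if mode == "COM":
        return [complement_strand(seq)]
    if mode == "BOTH":
        return [seq, complement_strand(seq)]
    raise ValueError(f"unknown strand option: {mode}")


print(strands_to_search("GGGUUUAGGAGU", "BOTH"))
```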
108
Similarity searching in USGENE using
FASTA-based RUN GETSIM
=> FILE USGENE
FILE 'USGENE' ENTERED AT 20:09:16 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
Sequences of less than 256 characters may be searched
directly on the command line. Longer sequences must be
uploaded (see slides 27-31).
=> RUN GETSIM GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCU
ACCGGATAA/SQN
RUN GETSIM AT 20:10:11 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
100000 SEQUENCES PROCESSED
. . . .
5260000 SEQUENCES PROCESSED
6914 ANSWERS FOUND ABOVE A THRESHOLD OF 56
QUERY SELF SCORE VALUE IS 260
6914 sequence hits have been found above the
similarity threshold automatically set by STN.
GETSIM calculates a query self score, to help
assess answer similarity.
109
Decide how many answers to keep
[Histogram: Similarity Score (y-axis, marked at 126 and 251) against Answer Count (x-axis, marked at 1380, 2760, 4140, 5520 and 6900)]
The graphic representation gives a count of hit sequences (x-axis) and
similarity score (y-axis). The graph gives a visual clue about the proportion
of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
110
SORT by SCORE descending
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L1 RUN STATEMENT CREATED
L1 6914 GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACC
GGATAA/SQN
Answer set arranged by accession number; to sort by descending
similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L1
L2
6914 SOR L1 SCORE D
As with a BLAST search, the
initial GETSIM search answer set
should be sorted by similarity
score descending, to bring the
best answers to the top.
111
Review answers with a free-of-charge
format including alignment
=> D TRI ORGN SCORE ALIGN 1-100
L2    ANSWER 1 OF 6914 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI    Capsid polypeptides and use to inhibit viral packaging (Patent)
MTY   DNA
SQL   4580
ORGN  Unknown
SCORE 251 96% of query self score 260
ALIGN Smith-Waterman score: 251
52 na overlap starting at 1958
ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa
:::...:::::.::.:::.:..::::.::::::.:.::.:::.:::::: ::
gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa
L2    ANSWER 2 OF 6914 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI    Detection kits, such as nucleic acid arrays, for detecting the
      expression or 10,000 or more Drosophila genes and uses thereof
      (PublishedApplication)
MTY   DNA
SQL   2327
ORGN  DROSOPHILA
SCORE 101 38% of query self score 260
ALIGN Smith-Waterman score: 101
46 na overlap starting at 144
ggagugguaggu_cuuacgaugccagcuguaaugccuaccggataa
::::.::. : . : .: :   :::::.  ::.::: :::: :: :
ggagtggtggctccatatgcctccagcttcaatgcccaccgcatca
The GETSIM SCORE display includes a similarity percentage
(Score/Query Self Score x 100).
The GETSIM ALIGN display:
• First line: portion of query with similarity
• Second line: similarity (identical: two dots, family match: one dot,
no match: blank)
• Third line: portion of retrieved sequence with similarity
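The three-line ALIGN layout can be reproduced with a small Python sketch. It assumes, based only on the two alignments shown here, that a "family match" for nucleotides means u paired with t; the real GETSIM rules may be broader. The function name `match_line` is hypothetical.

```python
def match_line(query, subject):
    """Build the middle line of a GETSIM-style alignment display.

    ':' marks identical letters, '.' marks a u/t pair (assumed family
    match), and a blank marks everything else, including gap characters.
    """
    def symbol(q, s):
        q, s = q.lower(), s.lower()
        if q == s:
            return ":"
        if {q, s} == {"u", "t"}:
            return "."
        return " "

    return "".join(symbol(q, s) for q, s in zip(query, subject))


query = "ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa"
subject = "gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa"
print(query)
print(match_line(query, subject))
print(subject)
```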
112
Display selected USGENE answers in a
preferred bibliographic format
=> D BIB AB ECLM ORGN SQL SCORE ALIGN
L2   ANSWER 1 OF 6914 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN   5831013.1 DNA USGENE
TI   Capsid polypeptides and use to inhibit viral packaging (Patent)
IN   Bruenn Jeremy A. (Buffalo, NY); Yao Wensheng (Kenmore, NY)
PA   The Research Foundation of State University of New York (Amherst NY)
PI   US 5831013 A 19981103
AI   US 1996-674351 19960702
DT   Patent
AB   The present invention is directed to a viral capsid polypeptide
     capable of inhibiting viral packaging, the viral capsid polypeptide
     consisting of a portion of a viral capsid protein of an RNA virus and
     including a multimerization domain of the viral capsid protein. The
     invention further provides an isolated nucleic acid . . . .
ECLM US5831013 A: 1. A viral capsid polypeptide capable of inhibiting
     viral packaging, said viral capsid polypeptide having an amino acid
     sequence selected from the group consisting of amino acids 1 to 473
     of SEQ ID NO:2 and amino acids 1 to 443 of SEQ ID NO:4.
ORGN Unknown
SQL  4580
SCORE 251 96% of query self score 260
ALIGN Smith-Waterman score: 251
52 na overlap starting at 1958
ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa
:::...:::::.::.:::.:..::::.::::::.:.::.:::.:::::: ::
gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa
USGENE records can be displayed in a wide
variety of customized formats.
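The percentage in the SCORE field follows from the score and the query self score. A minimal Python sketch, assuming the value is truncated to a whole number (this matches the two percentages displayed in this example, but the rounding rule itself is an assumption):

```python
def percent_of_self_score(score, self_score):
    """Score as an integer percentage of the query self score (truncated)."""
    return score * 100 // self_score


# Values taken from the two answers displayed in this example.
print(percent_of_self_score(251, 260))
print(percent_of_self_score(101, 260))
```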
113
Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• GETSIM (FASTA) similarity searching
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
114
BLAST and GETSIM similarity searches can
both be run offline in BATCH search mode
• Multiple BATCH requests may be queued, to run
sequentially one after another
– A maximum of 16 requests can be queued per STN Login ID
• BATCH request results may be collected in an online
session up to 3 months from initiation
– Already retrieved results may be re-retrieved multiple times at no
additional cost, up to 8 days from the initial retrieval
• BATCH is most useful for GETSIM queries, as these can
take considerable computational time when run online
– Also a higher query length limit of 2,000 characters is permitted
115
Similarity searching in USGENE using
GETSIM in offline BATCH mode
=> FILE USGENE
FILE 'USGENE' ENTERED AT 20:40:17 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
To automatically search the nucleotide sequence
and its complement specify BOTH.
=> RUN GETSIM GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACCG
GATAA/SQN BOTH BATCH
Add BATCH for BATCH mode.
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS): EXAMPLE3
RUN GETSIM AT 20:40:44 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
Name the BATCH search.
BATCH PROCESSING STARTED FOR EXAMPLE3
=> LOG H
Most GETSIM searches take
between 5 and 20 minutes to run.
SESSION WILL BE HELD FOR 120 MINUTES
STN INTERNATIONAL SESSION SUSPENDED AT 20:41:23 ON 06 OCT 2007
116
Use RUN GETBATCH to retrieve and
manage the results of BATCH searches
* * * * * * RECONNECTED TO STN INTERNATIONAL * * * * * *
SESSION RESUMED IN FILE 'USGENE' AT 20:57:23 ON 06 OCT 2007
FILE 'USGENE' ENTERED AT 20:57:23 ON 06 OCT 2007
Log in within 2 hours if you want to reconnect to your
previous STN session.
=> RUN GETBATCH
Please enter your batch identifier
or enter # for batch id list
or enter * for batch id at top of list
or enter - before batch id to delete
or enter . for (end)
BATCH REQUEST: #
Enter # for a BATCH ID list.
Batch result files remaining:
EXAMPLE1 Retrieved (getsim)
EXAMPLE2 Retrieved (getsim)
EXAMPLE3 Completed (getsim)
----------------------
BATCH result files status can be: Queued, Running,
Completed or Retrieved.
Please enter your batch identifier
or enter # for batch id list
or enter * for batch id at top of list
or enter - before batch id to delete
or enter . for (end)
BATCH REQUEST: EXAMPLE3
Enter the name of the BATCH search results to retrieve.
117
Decide how many answers to keep
5230 ANSWERS FOUND ABOVE A THRESHOLD OF 66
QUERY SELF SCORE VALUE IS 260
[Histogram: Similarity Score (y-axis, marked at 126 and 251) against Answer Count (x-axis, marked at 1050, 2100, 3150, 4200 and 5250)]
The graphic representation gives a count of hit sequences (x-axis) and
similarity score (y-axis). The graph gives a visual clue about the proportion
of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
118
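One way to read the numbers above is relative to the query self score. The following short Python sketch shows that arithmetic. Expressing a score as a fraction of the self score is an illustrative convention assumed here, not something the GETSIM output reports, and fraction_of_self is a hypothetical helper. The value 251 is the highest mark on the score axis of the graph.

```python
def fraction_of_self(score, self_score):
    """Score as a fraction of the query self score (illustrative convention)."""
    if self_score <= 0:
        raise ValueError("self score must be positive")
    return score / self_score

SELF_SCORE = 260  # QUERY SELF SCORE VALUE from the transcript
THRESHOLD = 66    # reporting threshold from the transcript
TOP_SCORE = 251   # highest value marked on the score axis

for label, score in [("threshold", THRESHOLD), ("top of axis", TOP_SCORE)]:
    share = fraction_of_self(score, SELF_SCORE)
    print(f"{label}: {score} is {share:.1%} of the self score")
```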
After BATCH retrieval, all search, sort and display options are the same as in online search mode
L1 RUN STATEMENT CREATED
L1 5230 GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACCGGATAA/SQN.BOTH
Answer set arranged by accession number; to sort by descending
similarity score, enter at an arrow prompt (=>) "sor score d".
Batch result files remaining:
EXAMPLE1 Retrieved (getsim)
EXAMPLE2 Retrieved (getsim)
EXAMPLE3 Retrieved (getsim)
----------------------
=> SOR SCORE D
PROCESSING COMPLETED FOR L1
L2 5230 SOR L1 SCORE D
=> D TRI ORGN ALIGN 1-10
L2    ANSWER 1 OF 5230 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI    Capsid polypeptides and use to inhibit viral packaging (Patent)
MTY   DNA
SQL   4580
ORGN  Unknown
ALIGN Smith-Waterman score: 251
      52 na overlap starting at 1958
      ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa
      :::...:::::.::.:::.:..::::.::::::.:.::.:::.:::::: ::
      gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa
BATCH result files status can be: Queued, Running, Completed or Retrieved.
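In the alignment display above, the middle line appears to mark identical residues with ":" and u/t pairs with ".", leaving a blank at the single true mismatch. The following minimal Python sketch recounts those categories from the two aligned strings. It is an illustration only, not the GETSIM scoring method; compare_aligned is a hypothetical helper, and treating u and t as equivalent is an assumption drawn from the display.

```python
def compare_aligned(query, subject):
    """Count exact matches, u/t pairs and mismatches in an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned strings must have equal length")
    exact = ut_pairs = mismatches = 0
    for q, s in zip(query.lower(), subject.lower()):
        if q == s:
            exact += 1
        elif {q, s} == {"u", "t"}:
            # RNA uracil against DNA thymine: counted separately from exact matches.
            ut_pairs += 1
        else:
            mismatches += 1
    return exact, ut_pairs, mismatches

# Aligned strings copied from the ALIGN display above.
query = "ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa"
subject = "gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa"

exact, ut_pairs, mismatches = compare_aligned(query, subject)
print(exact, ut_pairs, mismatches)  # prints: 38 13 1
```

With u and t treated as equivalent, 51 of the 52 aligned positions agree, which is consistent with a score of 251 close to the query self score of 260.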
In general, the multifile sequence searching workflow uses PN, AP, PRN and RLN numbers
[Diagram: crossover links between USGENE, REGISTRY, HCAPLUS, DGENE and PCTGEN, labeled RN and PN APPS]
See also Effective patent sequence searching on STN (Part V):
http://www.stn-international.com/training_center/bioseq/epss.pdf
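The idea of crossing over between files on shared numbers can be sketched in a few lines of Python. This is only an illustration of linking records that share a number in any of the four fields; it is not how STN implements crossover. The helper shared_numbers is hypothetical, and the record values are placeholders rather than real patent numbers.

```python
FIELDS = ("PN", "AP", "PRN", "RLN")  # field codes named on the slide

def shared_numbers(record_a, record_b, fields=FIELDS):
    """Return the numbers that two records have in common across the given fields."""
    numbers_a = {n for f in fields for n in record_a.get(f, ())}
    numbers_b = {n for f in fields for n in record_b.get(f, ())}
    return numbers_a & numbers_b

# Placeholder records: two different publications that claim the same priority.
usgene_hit = {"PN": ["PN-0001"], "AP": ["AP-0001"], "PRN": ["AP-0000"]}
dgene_hit = {"PN": ["PN-0002"], "AP": ["AP-0002"], "PRN": ["AP-0000"]}

print(shared_numbers(usgene_hit, dgene_hit))
```

Pooling all four fields before intersecting means a priority number in one record can match an application number in another, which is the kind of link a family-based crossover relies on.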
Multifile searching with DGENE
• The simple document-based approach
– See STN transcript appendix number 1.
• The simple patent family-based approach
– See STN transcript appendix number 2.
• The advanced patent family approach
– See STN transcript appendix number 3.
Appendices are provided in the USGENE Workshop Manual:
http://www.stn-international.com/archive/presentations/USGENE_ws_1107.pdf
How does USGENE compare to other USPTO sequence data sources?
Database                  Update frequency  Typical timeliness  Backfile coverage  Value added
USGENE                    Weekly            7 days              1982 -
DGENE (DWPI basics)       Biweekly          65 days             1981 -
REGISTRY (CAplus basics)  Daily             27 days             1957 -
NCBI/EMBL                 Daily             1-3 months          1982 -
How does USGENE compare to other USPTO sequence data sources? (cont.)
[Table: rows USGENE, DGENE (DWPI basics), REGISTRY (CAplus basics), NCBI/EMBL; columns USPTO PGPs, USPTO Patents, USPTO claims text, Value added]
Comparing STN databases…
• DGENE
– The most comprehensive patent sequence database
– Implemented in-house at major patent offices
• REGISTRY
– More timely than DGENE; complementary indexing
– Unique non-patent literature coverage
• USGENE
– More timely than DGENE and REGISTRY (7 days)
– Sequences from equivalent USPTO applications and patents
• PCTGEN
– The most timely database (24 hours)
– Sequences from equivalent WIPO/PCT publications
Conclusions
• USGENE is a vital new tool for business-critical patent
searches, providing a complete collection of U.S. issued
patent sequences with searchable claims text
• USGENE also provides a collection of published
application sequence data, not covered by NCBI/EMBL
• USGENE provides the most timely source of USPTO
patent sequence data – within 7 days of publication
• DGENE and REGISTRY provide additional value-added
indexing for U.S. patents and published applications
• DGENE, REGISTRY and USGENE are all required for a
comprehensive search of USPTO sequence data
Visit www.fiz-k.com/usgene for the latest
USGENE reference materials