The USPTO Genetic Sequence Database, USGENE®, on STN
Robert Austin – FIZ Karlsruhe

Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• Similarity searching GETSIM (FASTA)
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions
BLAST is a registered trademark of the U.S. National Library of Medicine (NLM)
2

STN sequence searchable databases
• CAS REGISTRY℠ – Chemical Abstracts Service (CAS) Registry File
• DGENE – Thomson Scientific GENESEQ™
• PCTGEN – WIPO/PCT Patent Application Biosequences
• USGENE – The USPTO Genetic Sequence Database
See Effective patent sequence searching on STN: http://www.stn-international.com/training_center/bioseq/epss.pdf
3

A new subject for many….
Bluff Your Way in Genetics!! http://www.stn-international.com/training_center/bioseq/bluff.pdf
4

USGENE is the USPTO Genetic Sequence Database
• Sequences from all relevant USPTO published patent applications and granted (issued) patents
• Assignee and full inventor names; publication, application and parent case PCT numbers and dates; original publication title, abstract, and claims
• Organism name, sequence length, Molecule Type, SEQ ID, and feature tables for features/annotations
• Produced by the SequenceBase Corporation
• Updated weekly – within 7 days of publication
• 1982 – present
5

USGENE consolidates unique USPTO sequence data from different sources
• USPTO Publication Site for Issued and Published Sequences (PSIPS) – The official mega-publication download site, 2001-date
• International Nucleotide Sequence Database Collaboration (INSDC) (NCBI/EMBL/DDBJ, GenBank) – U.S. granted patent nucleotide sequences, 1982-date
• USPTO Protein Database (NCBI/EMBL) – U.S.
granted patent protein/peptide sequences, 1982-date
• USPTO Patents and Published Applications Full-Text – Fills in omissions and coverage gaps and enhances timeliness
The USGENE Sequence Source (/SSO) field indicates which source any given USGENE sequence record was derived from.
6

USGENE combines these sequences with bibliographic data and claims text
[Diagram: USPTO biblio, title, abstract and claims text combined with INSDC USPTO nucleotide sequences, NCBI/EMBL-EBI USPTO peptide sequences, USPTO PSIPS sequences, and USPTO full-text sequences.]
7

An individual publication is represented by one or more USGENE sequence records
[Diagram: one publication (US …. B2) represented by three USGENE records: AN .... Protein (SEQ 1), AN .... DNA (SEQ 2), AN .... cDNA (SEQ n).]
8

Each USGENE sequence record includes full patent bibliography, title and abstract
ALL display format. See (1) - (7) on slide 12.
Note: USGENE records are typically available within 7 days of publication by the USPTO.

L1   ANSWER 1 OF 1 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN   7255990.1 DNA USGENE   (1) (2)
TI   Method for screening genes expressing at desired sites (Patent)   (3)
IN   Higo Kenichi (Tsukuba, JP); Iwamoto Masao (Tsukuba, JP)   (4)
PA   National Institute of Agrobiological Sciences (Ibaraki JP)   (5)
PI   US 7255990 B2 20070814   (6)
     US 20040086855 A1 20040506
     WO 2003044227 A 20030530
AI   US 2001-221596 20011121
RLI  WO 2001-JP10195 20011121
ED   20070817
DT   Patent
AB   The present invention relates to a method for inferring a plant organ, in which a certain gene is to be expressed, using a part of a base sequence, a method for searching for a gene which is to be expressed at a desired site, and a composition, kit, system and program for carrying out these methods. The present invention also relates to a method for inferring a plant organ, in which a plant gene is to be expressed, based on information about the presence or absence of a base sequence which is highly similar to . . . .   (7)
9

Each USGENE sequence record includes patent or published application claims text
ALL display format (cont.)

CLM  US7255990 B2:   (8)
     1. A method for detecting a gene which is expressed in a flower and other organs in a rice plant, comprising the steps of: (1) searching a gene population using a Tourist C transposon sequence consisting of SEQ ID NO: 1 as a key sequence, (2) selecting a gene having the transposon sequence in the vicinity of a putative protein coding region, and (3) detecting expression of said gene in the flower and other organs.
     2. The method according to claim 1, wherein the expression of said gene includes expression of at least one site selected from a stamen and a pistil.
     3. The method according to claim 1, wherein the gene population is a library and the key sequence is a probe sequence.
     4. The method according to claim 3, wherein the database is a DNA library.
     5. The method according to claim 3, wherein the search . . . .
10

All USGENE sequences are provided in STN standardized format
ALL display format (cont.)

SSO   NUCLEIC; USPTO; GRANTED   (9)
ORGN  Zea mays   (10)
SQL   352   (11)
SEQ     1 gggtctgttt agttcccaaa caaaattttt cacgctgtta cataggatgt   (12)
       51 ttggacacat gcatagagta ctaaatgtag aaaaaaaaca attaaacatt
      101 tcgccttgaa attacgagac aaatctttta agcctaattg cgccatgatt
      151 tgacaatttg gtgctacaat aaatatttgc taataataga ttaattaggc
      201 ttaataaatt cgtcttgcag tttccagacg gaatctgtaa tttattttat
      251 gagatacagc tgcttcgatc ttccatcaca tattcagacc gtacctaatc
      301 tgaaaggtta gtaatttgaa ctgcgtagta atgctacaag gtaaatcaat
      351 ca
FEATURE TABLE:   (13)
Key         |Location  |
============+==========+=======================
misc_feature|(1)..(352)|
See (8) - (13) on slide 13.
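The STN standardized SEQ display interleaves position counters with ten-letter blocks, so a copied sequence has to be rejoined before its length can be compared with the SQL field. The following is a minimal sketch in Python, assuming the SEQ block has been copied without its field label or annotation markers; `join_stn_sequence` and `matches_sql` are hypothetical helper names, not STN commands.

```python
import re


def join_stn_sequence(display_text: str) -> str:
    """Rejoin an STN-style SEQ display into one lowercase string.

    Assumption: the input holds only position counters (1, 51, 101, ...)
    and sequence letters, i.e. the 'SEQ' label itself was not copied.
    """
    # Keep alphabetic runs only; digits and whitespace are discarded.
    return "".join(re.findall(r"[A-Za-z]+", display_text)).lower()


def matches_sql(display_text: str, sql: int) -> bool:
    """Check the rejoined sequence length against the record's SQL value."""
    return len(join_stn_sequence(display_text)) == sql


# First two rows of the sample record shown above (100 residues).
sample = """
  1 gggtctgttt agttcccaaa caaaattttt cacgctgtta cataggatgt
 51 ttggacacat gcatagagta ctaaatgtag aaaaaaaaca attaaacatt
"""

sequence = join_stn_sequence(sample)
print(len(sequence), sequence[:10])
```

Applied to the full SEQ block of a record, `matches_sql(block, 352)` would be the check against the SQL value shown in the sample.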
11

USGENE sample record annotations
1) USGENE Accession Number (AN), including the sequence identity number (SEQ ID NO)
2) Molecule Type (MTY)
3) Original publication title – a “Published Application” or “Patent” indication is given in parentheses
4) Full inventor names, city and state/country
5) Patent assignee name, city and state/country
6) Publication, application and related PCT parent case application details and dates
7) Original patent or published application abstract
12

USGENE sample record annotations
8) Published application or granted patent claims
9) The Sequence Source (SSO) – nucleic or protein; PSIPS/USPTO, NCBI, etc; granted or application
10) Organism (where given) – providing the name of the organism from which the sequence is derived
11) Searchable and sortable Sequence Length (SQL)
12) Standardized patent sequence (SEQ) – each USGENE record is based upon a sequence
13) Feature table including sequence modifications, features and/or annotations, as provided by the patent applicant or assignee
13

The original format of a USGENE sequence is available for display using the SEQO display
Note: USGENE Accession Numbers (/AN) comprise the publication number + the sequence identity number (SEQ ID NO).
Note: Often the SEQO original format includes the patent applicant’s alignment of the nucleotide sequence coding region with its corresponding protein sequence.

=> S 20070224666.21/AN
L1            1 20070224666.21/AN
=> D TRI SEQO
L1    ANSWER 1 OF 1 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI    Alleles of the zwf gene from coryneform bacteria (PublishedApplication)
MTY   DNA
SQL   1263
SEQO
gtg gcc ctg gtc gta cag aaa tat ggc ggt tcc tcg ctt gag agt gcg      48
Met Ala Leu Val Val Gln Lys Tyr Gly Gly Ser Ser Leu Glu Ser Ala
1               5                   10                  15
gaa cgc att aga aac gtc gct gaa cgg atc gtt gcc acc aag aag gct      96
Glu Arg Ile Arg Asn Val Ala Glu Arg Ile Val Ala Thr Lys Lys Ala
            20                  25                  30
gga aat gat gtc gtg gtt gtc tgc tcc gca atg gga gac acc acg gat     144
Gly Asn Asp Val Val Val Val Cys Ser Ala Met Gly Asp Thr Thr Asp
        35                  40                  45
. . . . .
14

In contrast, NCBI/EMBL/DDBJ patent records have minimal bibliographic and text data
Note: USPTO peptide sequence records available at EMBL (shown here) are also available at NCBI, but not at DDBJ.
15

USGENE represents a new tool for tackling business critical searches
• DGENE and REGISTRY sequences are indexed by Thomson from the DWPI℠ basic and by CAS from the CAplus℠ basic respectively
– 65% of basics are PCT published applications
• USGENE provides sequences from both USPTO granted patents and published applications
– Updated weekly, within 7 days of USPTO publication
• Sequence listing variation often occurs between published application and granted patent stage
– Especially important, e.g. for freedom-to-operate
16

USGENE provides sequences from both USPTO published applications and granted patents
[Diagram: a WPINDEX patent family (WO …. A1, EP …. A1, EP …. B1, FR …. A1, US …. A1, US …. B2) with USGENE sequence records (Protein SEQ 1, DNA SEQ 2) for both US …. A1 and US …. B2, and DGENE sequence records (SEQ 1, SEQ 2) for WO …. A1.]
In contrast, DGENE sequences are indexed from DWPI basic publications.
WPINDEX = Derwent World Patents Index® on STN
DGENE = GENESEQ™ on STN
USGENE® = USPTO Genetic Sequence Database
17

Sequence listing variation often occurs between published application and granted patent stage
Note: In this example the patent family has:
• 9 sequences from WO 9425602 in DGENE
• 58 sequences from US 6881821 in USGENE

L1   ANSWER 1 OF 1 WPINDEX COPYRIGHT 2007 THE THOMSON CORP on STN
AN   1994-358278 [44] WPINDEX
TI   New polynucleotide(s) specific for hepatitis C virus types 4, 5 and 6 and related antigenic peptide(s) and antibodies, useful in vaccines, diagnosis, HCV typing and treatment
DC   B04; D16; S03
IN   PIKE I H; SIMMONDS P; YAP P L
PA   (COMM-N) COMMON SERVICES AGENCY; (MURE-N) MUREX DIAGNOSTICS INT INC; . . .
PI   WO 9425602 A1 19941110 (199444)* EN 70[5]
     AU 9465797 A 19941121 (199508) EN
     FI 9505224 A 19951220 (199611) FI
     EP 698101 A1 19960228 (199613) EN [0]
     JP 09500009 W 19970107 (199711) JA 52[0]
     AU 695259 B 19980813 (199844) EN
     EP 698101 B1 20041103 (200475) EN
     DE 69434116 E 20041209 (200481) DE
     US 20050032047 A1 20050210 (200512) EN
     US 6881821 B2 20050419 (200527) EN
     . . . . .
ADT  WO 9425602 A1 WO 1994-GB957 19940505 . . . .
PRAI GB 1994-263 19940107
     GB 1993-9237 19930505
18

USGENE covers a comprehensive variety of USPTO patent publication types
PK     Patent Kind covered in USGENE (field /PK)
USA1   Published patent application
USA2   Republished patent application
USA9   Corrected published patent application
USA    Granted patent (until 2000)
USB1   Granted patent without pre-grant publication (2001 onwards)
USB2   Granted patent with pre-grant publication (2001 onwards)
USE    Reissued patent
USP1   Published plant patent application
USP2   Granted plant patent without pre-grant publication
USP3   Granted plant patent with pre-grant publication
WOA    WIPO/PCT published patent application (parent case data)
19

USGENE offers the same sequence search options as DGENE
• NCBI BLAST similarity – RUN BLAST
• FASTA similarity – RUN GETSIM
• Sequence Code Match (SCM) – RUN GETSEQ
• Offline BATCH and ALERT options
The DGENE Workshop Manual is the complete guide: http://www.stn-international.com/training_center/bioseq/dgene_wm.pdf
21

The 7 basic steps of USGENE BLAST
1) SAVE, UPLOAD, and VERIFY the query (L1)
2) RUN the BLAST search (/SQP or /SQN)
3) Decide how many answers to keep (L2)
4) SORT SCORE in Descending order (L3)
5) Review answers in a free-of-charge format, e.g. D L3 TRI ORGN ALIGN 1
6) Display selected answers in bibliographic format, e.g. D L3 BIB AB CLM ALIGN 1,3,10
7) Ensure transcript was captured and Logoff
22

The 7 basic steps of USGENE BLAST
Search Question: Find relevant U.S.
published application and patent references for this protein sequence:
  1 vqtvplsrlf dhamleahra helaidtyqe feetyipkdq kysflhdsqt
 51 sfcfsdsipt psnmeetqqk snlellrisl llieswlepv rflrsmfann
101 lvydtsdsdd yhllkdleeg iqtlmgrled gsrrtgqilk qtyskfdtns
151 hnhdallkny gllycfrkdm dkvetflrmv qcrsvegscg f
23

The 7 basic steps of USGENE BLAST
1) SAVE, UPLOAD, and VERIFY the sequence query text file (L1)
– Upload options
  • STN Express®: Use UPLOAD command or Upload Query Wizard (STN Express 8.2+)
  • STN® on the Web℠: Use Upload feature or Sequence Assistant (link below)
– Verify the sequence with D LQUE
STN on the Web Sequence Search Assistant: http://www.stn-international.com/training_center/bioseq/seq_se_ass.pdf
24

Requirements for sequences for the STN Express Upload Query Wizard
• Sequence queries must be saved individually in text (.txt) format
• Files may
– Be 3 letter codes (amino acids) or single letter
– Have header information as seen in, e.g. WIPO ST.25, USPTO PSIPS or EMBL formats
– Include sequence count numbers
• Query (.txt) files must
– Be 10,000 characters or less
– Not have any lines longer than 300 characters
• After upload to STN verify with D LQUE
25

Examples of formats that work

USPATFULL/USPAT2 format
DETD SEQUENCE CHARACTERISTICS:
SEQ ID NO: 4
LENGTH: 724
TYPE: PRT
ORGANISM: Artificial Sequence
FEATURE:
OTHER INFORMATION: Description of Artificial Sequence; Note = synthetic construct
SEQUENCE: 4
Met Ser Phe Val Asp His Pro Pro Asp Trp Leu Glu Glu Val Gly Glu
1               5                   10                  15
Gly Leu Arg Glu Phe Leu Gly Leu Glu Ala Gly Pro Pro Lys Pro Lys
            20                  25                  30

USPTO PSIPS ST.25 format
<210> SEQ ID NO 137
<211> LENGTH: 951
<212> TYPE: DNA
<213> ORGANISM: Zea mays
<400> SEQUENCE: 137
accgaggccg acttcccgtt cactggccac gacgggacgt gcgatctcaa actgaaaaat      60
acaagggttg tatccataga ttcgttcgag cgtgtgccca tcaactacga gagagcgctg     120
cagaaggccg tggcgcacca gcctgttagt gccagcattg aagcatctcg gcgcgcgttc     180
cagctctaca gttctggcat cttcgacggg agatgcggga cgtacctgga ccacggtgtg     240
26

a) Choose the Upload Query Wizard
– From the Discover! button menu, OR
– From the Select Discover! Wizard window.
27

b) Browse to locate sequence file
Click the Next button to go to the next step.
28

c) Change File type to .txt
29

d) Verify it’s the right query!
30

e) Select STN file to upload to
Use PCTGEN to upload queries and verify them (lower connect-hour charge). The resulting L-numbers may be searched in DGENE, PCTGEN, or USGENE.
Click Finish for the file to be “scrubbed” and uploaded to STN.
31

1) SAVE, UPLOAD and VERIFY (cont.)
Note: The FILE and UPLOAD commands below are automatically run by the STN Express Sequence Query Upload wizard.

=> FILE PCTGEN
=> UPL R BLAST
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
=> D L1 LQUE
L1    ANSWER 1 PCTGEN COPYRIGHT 2007 WIPO on STN
LQUE  vqtvplsrlfdhamleahrahelaidtyqefeetyipkdqkysflhdsqtsfcfsdsi
      ptpsnmeetqqksnlellrislllieswlepvrflrsmfannlvydtsdsddyhllkd
      leegiqtlmgrledgsrrtgqilkqtyskfdtnshnhdallknygllycfrkdmdkve
      tflrmvqcrsvegscgf
=>
The sequence query is now ready for searching directly in USGENE using the L-number (L1).
32

The 7 basic steps of USGENE BLAST
2) RUN the BLAST search
– Protein search: RUN BLAST L1 /SQP
– Nucleotide search: RUN BLAST L1 /SQN
– Translated search: RUN BLAST L1 /TSQN
33

2) RUN the USGENE BLAST search
Note: USGENE is updated within 7 days of publication by the USPTO.
Note: Turn the Low Complexity Filter off with the syntax /SQP -F F

=> FILE USGENE
FILE 'USGENE' ENTERED AT 12:09:16 ON 03 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN BLAST L1 /SQP -F F
BLAST Version 2.2
The BLAST software is used herein with permission of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). See also, . . . .
BLAST SEARCHING . . . .
34

RUN BLAST command syntax
Similarity Searching with BLAST (protein/polypeptides)
=> RUN BLAST
   L1      (sequence or L-number)
   /SQP    (protein) (default)
   -e      (Expect-value)
   -f      (Filter) (on by default)
   -w      (Word size)
   -m      (Matrix)
   -g      (Gap penalty)
   -x      (Gap extension)
   BATCH   (offline)
   ALERT   (Alert/SDI)
35

RUN BLAST command syntax
Similarity Searching with BLAST (Nucleic acids)
=> RUN BLAST
   L1      (sequence or L-number)
   /SQN    (nucleotide)
   SIN     (single strand)
   COM     (complementary strand)
   BOTH    (both strands) (default)
   -e      (Expect-value)
   -f      (Filter)
   -w      (Word size)
   -g      (Gap penalty)
   -x      (Gap extension)
   -q      (penalty for mismatch)
   -r      (reward for match)
   BATCH   (offline)
   ALERT   (Alert/SDI)
36

RUN BLAST advanced options
Expectation Value (-E)
The expectation value (E-value) is the statistical significance threshold for reporting matches against a sequence database. The E-value can be any positive number, and the default value is 10, which means that 10 matches may be expected to be found merely by chance. In general, lower the E-value to make the search more precise and raise it to retrieve more answers.
Word Size (-W)
Word Size is the length of the character string fragments of a sequence query which are used as the basis for a BLAST search. For SQN the default is 11 and the range is 7-23. For all other BLAST searches the default is 3 and the range is 2-3. For short search queries, reducing the word size below the default can improve search results.
37

RUN BLAST advanced options (cont.)
Low Complexity Filtering (on by default) (-F)
The low complexity filter masks segments of the query that have low compositional complexity, as determined by specific programs for peptide or nucleotide sequences. Matches to such segments can be statistically significant but biologically uninteresting. Filtering is applied to the query sequence and is indicated by a series of Xs for peptide sequences and Ns for nucleotide sequences. Low complexity filtering can be turned off (i.e. set to F - false).
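The option rules above (positive E-value, word size 7-23 for /SQN and 2-3 otherwise, filter on unless set to F) can be checked before a command is typed at the STN prompt. The sketch below is a hypothetical Python helper, `build_run_blast`, that only assembles the command text; it does not talk to STN, and the option order it emits is an assumption taken from the transcripts in this deck rather than a documented requirement.

```python
def build_run_blast(query, field="SQP", evalue=None, word_size=None,
                    low_complexity_filter=True, matrix=None):
    """Assemble a RUN BLAST command string from the options described above.

    Hypothetical helper: validates only the ranges stated on the slides.
    """
    field = field.upper()
    if field not in ("SQP", "SQN", "TSQN"):
        raise ValueError("field must be SQP, SQN or TSQN")

    parts = [f"RUN BLAST {query} /{field}"]

    if matrix is not None:
        # Matrices are described only for the peptide-based searches.
        if field == "SQN":
            raise ValueError("-M applies to SQP and TSQN searches")
        parts.append(f"-M {matrix}")

    if word_size is not None:
        low, high = (7, 23) if field == "SQN" else (2, 3)
        if not low <= word_size <= high:
            raise ValueError(f"word size for /{field} must be {low}-{high}")
        parts.append(f"-W {word_size}")

    if evalue is not None:
        if evalue <= 0:
            raise ValueError("E-value must be a positive number")
        parts.append(f"-E {evalue:g}")

    if not low_complexity_filter:
        parts.append("-F F")  # F = false, i.e. filter off

    return " ".join(parts)


print(build_run_blast("L1", low_complexity_filter=False))
print(build_run_blast("L1", field="SQN", word_size=7, evalue=1000))
```

Leaving an option as `None` omits it from the command, so the file defaults described above (E-value 10, word size 11 or 3, filter on) would apply.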
Peptide similarity matrices (-M)
For peptide based searches SQP and TSQN the advanced options provide additional scoring matrices to the default BLOSUM62 (next slide)
38

Guidelines from NCBI on the use of Advanced Settings for peptide sequence searching are as follows:
Query Length   Matrix      Gap costs
<35            PAM-30      (9,1)
35 – 50        PAM-70      (10,1)
50 – 85        BLOSUM-80   (10,1)
>85            BLOSUM-62   (11,1) (BLAST default)
39

The 7 basic steps of USGENE BLAST
3) Decide how many answers to keep (L2)
– How many answers would you like to keep? (ALL) or ?:
– Recommendation: Keep ALL answers
40

3) Decide how many answers to keep
Note: The graphic representation gives a count of hit sequences (x-axis) and similarity score (y-axis). The graph gives a visual clue about the proportion of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers

1350 ANSWERS FOUND BELOW EXPECTATION VALUE OF 10.0
[Histogram: Similarity Score (y-axis, ticks at 195 and 390) against Answer Count (x-axis, ticks at 270, 540, 810, 1080 and 1350).]
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
41

The 7 basic steps of USGENE BLAST
4) SORT by SCORE descending (L3)
– SOR L2 SCORE D
– Option: limit using text terms and/or dates (L4)
– Remember to SORT L4 SCORE D !! (L5)
42

4) SORT by SCORE descending
Note: Use SORT SCORE D to sort by descending BLAST score.

HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L2    RUN STATEMENT CREATED
L2    1350 VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQT
           SFCFSDSIPTPSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANN
           LVYDTSDSDDYHLLKDLEEGIQTLMGRLEDGSRRTGQILKQTYSKFDTNS
           HNHDALLKNYGLLYCFRKDMDKVETFLRMVQCRSVEGSCGF/SQP.-F F
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L2
L3    1350 SOR L2 SCORE D
43

The 7 basic steps of USGENE BLAST
5) Review answers using a free-of-charge format including alignment (ALIGN), while “parked” in the STNGUIDE℠ file
– D L5 TRI ORGN ALIGN 1
– FILE STNGUIDE
44

5) Review answers with a free-of-charge format including alignment
Note: This top hit comes from a U.S. issued patent.

=> D L3 TRI ORGN ALIGN 1-30; FILE STNGUIDE
L3    ANSWER 1 OF 1350 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI    Recombinant DNA transfer vectors (Patent)
MTY   Protein
SQL   191
ORGN  Unknown
BLASTALIGN
Query = 191 letters
Length = 191
Score = 390 bits (1001), Expect = e-113
Identities = 191/191 (100%), Positives = 191/191 (100%)
Query: 1   VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
           VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Sbjct: 1   VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Query: 61  PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
           PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
Sbjct: 61  PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
. . . .
45

5) Review answers with a free-of-charge format including alignment
Note: The 5th hit from the top comes from a U.S. published application.

L3    ANSWER 5 OF 1350 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI    Genetic polymorphisms associated with myocardial infarction, methods of detection and uses thereof (PublishedApplication)
MTY   Protein
SQL   217
ORGN  Homo Sapiens
BLASTALIGN
Query = 191 letters
Length = 217
Score = 387 bits (995), Expect = e-113
Identities = 189/191 (98%), Positives = 191/191 (100%)
Query: 1   VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
           VQTVPLSRLFDHAML+AHRAH+LAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Sbjct: 1   VQTVPLSRLFDHAMLQAHRAHQLAIDTYQEFEETYIPKDQKYSFLHDSQTSFCFSDSIPT
Query: 61  PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
           PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
Sbjct: 61  PSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANNLVYDTSDSDDYHLLKDLEEG
Query: 121 IQTLMGRLEDGSRRTGQILKQTYSKFDTNSHNHDALLKNYGLLYCFRKDMDKVETFLRMV
           IQTLMGRLEDGSRRTGQILKQTYSKFDTNSHNHDALLKNYGLLYCFRKDMDKVETFLRMV
Sbjct: 147 IQTLMGRLEDGSRRTGQILKQTYSKFDTNSHNHDALLKNYGLLYCFRKDMDKVETFLRMV
. . . .
Note: BLAST alignment details are explained on the next slide.
46

Understanding BLAST alignments
Query        the length of the query sequence
Length       the length of the answer sequence
Score        a relative score assigned by BLAST
Expect       Expectation Value – a value representing the chance that an answer is a random hit. The closer to zero, the less likely the hit is random
Identities   the number of exact letter matches between query and answer within the displayed local alignment. The amino acid letter is repeated* in the display
Positives    a combination of identities and amino acid family matches shown with + (plus) in the alignment
Gaps         shown as dashes (-) where BLAST must break the query or answer to maintain an alignment
(* For nucleic acid searches a vertical bar is used to indicate nucleotide identities in the alignment display.)
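The statistics lines defined above have a regular shape, so they can be pulled out of a captured transcript for later comparison. Below is a minimal Python sketch, assuming the transcript text was saved to a string; `parse_blast_stats` is a hypothetical name. Note that the display abbreviates some expectation values (for example `e-113`), so the sketch prefixes a `1` before converting.

```python
import re


def parse_blast_stats(text: str) -> dict:
    """Extract Score, Expect and the Identities/Positives/Gaps ratios."""
    stats = {}

    score = re.search(r"Score\s*=\s*([\d.]+)\s*bits\s*\((\d+)\)", text)
    if score:
        stats["bits"] = float(score.group(1))
        stats["raw_score"] = int(score.group(2))

    expect = re.search(r"Expect\s*=\s*([0-9.eE+-]+)", text)
    if expect:
        value = expect.group(1)
        if value[0] in "eE":          # "e-113" is shorthand for 1e-113
            value = "1" + value
        stats["expect"] = float(value)

    # Each ratio is stored as (matches, aligned length).
    for name, num, den in re.findall(
            r"(Identities|Positives|Gaps)\s*=\s*(\d+)/(\d+)", text):
        stats[name.lower()] = (int(num), int(den))

    return stats


line = ("Score = 387 bits (995), Expect = e-113 "
        "Identities = 189/191 (98%), Positives = 191/191 (100%)")
print(parse_blast_stats(line))
```

The ratios are kept as integer pairs rather than percentages so that the aligned length, which can be shorter than the query, stays visible.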
47

USGENE provides text search options for refining sequence searches
• The USGENE default text search index – known on STN as the Basic Index (/BI) – comprises
– Original publication Title (/TI) and abstract (/AB)
– Organism name (/ORGN) and Molecule Type (/MTY)
• The Exemplary Claim (/ECLM) and Feature Table (/FEAT) can also be added to a search
– Either specify the fields: => S VIRUS/BI,FEAT
– Or use SET SFIELDS: => SET SFIELDS BI ECLM
• The Basic Index and Feature Table both offer simultaneous left and right truncation (SLART)
48

USGENE provides bibliographic search options for refining sequence searches
• Patent Assignee (/PA) and Inventor (/IN)
– Examples: GLAXO/PA, SMITH JOHN/IN
• Granted or application Sequence Source (/SSO)
– Examples: APPLICATION/SSO, GRANTED/SSO
• Publication date (/PD) or publication year (/PY)
– Examples: PY < 2001, PD < 1 Mar 1995
• Application date (/AD) or application year (/AY)
– Examples: AY < 2002, AD < 1 Mar 1998
• WO application date (/RLD) or year (/RLY)
– Examples: RLY < 1993, RLD < 1 Aug 1986
49

Option: refine USGENE BLAST results with text and/or date search terms
Note: The BLAST search (L2) is further refined to sequences from granted patents, with application year prior to 1996, and to a specific text search term (L4).
Note: If you limit using text and/or date terms remember to SORT SCORE D again!

HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L2    RUN STATEMENT CREATED
L2    1350 VQTVPLSRLFDHAMLEAHRAHELAIDTYQEFEETYIPKDQKYSFLHDSQT
           SFCFSDSIPTPSNMEETQQKSNLELLRISLLLIESWLEPVRFLRSMFANN
           LVYDTSDSDDYHLLKDLEEGIQTLMGRLEDGSRRTGQILKQTYSKFDTNS
           HNHDALLKNYGLLYCFRKDMDKVETFLRMVQCRSVEGSCGF/SQP.-F F
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L2
L3    1350 SOR L2 SCORE D
=> S L2 AND SOMATOMAMMOTROPIN/BI,ECLM AND AY<1996 AND GRANTED/SSO
L4    7 L2 AND SOMATOMAMMOTROPIN AND AY<1996 AND GRANTED/SSO
=> SOR SCORE D
PROCESSING COMPLETED FOR L4
L5    7 SOR L4 SCORE D
50

The 7 basic steps of USGENE BLAST
6) Display selected relevant answers in a bibliographic format including alignment
– D L5 BIB AB CLM ALIGN 1 5 6
7) Ensure your STN Express session transcript was captured and then logoff
51

6) Display selected USGENE answers in a preferred bibliographic format
Note: This sequence hit comes from a U.S. granted patent, with an application date prior to 1996, and a key concept in the abstract and claims.

=> D BIB AB CLM ORGN SSO ALIGN 1 3 5
L5    ANSWER 1 OF 7 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN    4363877.1 Protein USGENE
TI    Recombinant DNA transfer vectors (Patent)
IN    Goodman Howard M. (San Francisco, CA); Shine John (San Francisco, CA); Seeburg Peter H. (San Francisco, CA)
PA    The Regents of the University of California (Berkeley CA)
PI    US 4363877 A 19821214
AI    US 1978-897710 19780419
AB    Recombinant DNA transfer vectors containing codons for human somatomammotropin and for human growth hormone.
CLM   US4363877 A: What is claimed is: 1. A recombinant DNA transfer vector comprising codons for human chorionic somatomammotropin comprising the nucleotide . . . .
ORGN  Unknown
SSO   PROTEIN; EMBL; GRANTED
BLASTALIGN . . . .
Note: this USGENE sequence record, sourced from EMBL, is an example of one which is not indexed in DGENE or REGISTRY.
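The refine-then-re-sort pattern in the transcript above is easy to get half right, since the limited answer set reverts to accession number order. A small Python sketch of the two command strings follows; `refine_and_resort` is a hypothetical helper that mirrors the syntax shown in the transcript and does not connect to STN.

```python
def refine_and_resort(l_number, term, fields=("BI",),
                      app_year_before=None, source=None):
    """Return the search and sort commands for limiting a BLAST answer set.

    Hypothetical helper: field codes (/BI, /ECLM, AY, /SSO) are the ones
    listed on the slides above; nothing else is validated.
    """
    clauses = [l_number, f"{term}/{','.join(fields)}"]
    if app_year_before is not None:
        clauses.append(f"AY<{app_year_before}")
    if source is not None:
        clauses.append(f"{source}/SSO")
    search = "S " + " AND ".join(clauses)
    # The sort must be repeated on the new L-number created by the search.
    return [search, "SOR SCORE D"]


for command in refine_and_resort("L2", "SOMATOMAMMOTROPIN",
                                 fields=("BI", "ECLM"),
                                 app_year_before=1996, source="GRANTED"):
    print("=>", command)
```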
52

Useful USGENE display fields/formats
TRIAL*   Title, Molecule Type, Sequence Length
SCAN*    Random Title
ALIGN*   BLAST/GETSIM Sequence Alignment
SCORE*   Similarity Score (for post-processing)
BIB      Inventors, Assignees, numbers, dates
AB       Original abstract
ECLM     Exemplary (1st) claim text
CLM      All claims text
BRIEF    BIB + AB + ECLM, sequence, sequence source (SSO), feature table (FEAT)
ALL      BRIEF with CLM instead of ECLM
(* Free of charge display formats in USGENE.)
53

The importance of using the correct BLAST advanced options
Note: Changing BLAST options is especially important for short sequence queries!

=> RUN BLAST GSSFLSPEHQR/SQP
BLAST Version 2.2
. . . .
NO ANSWERS FOUND BELOW THRESHOLD OF 10
=> RUN BLAST GSSFLSPEHQR/SQP -M PAM30 -W 2 -E 1000 -F F
BLAST Version 2.2
. . . .
712 ANSWERS FOUND BELOW EXPECTATION VALUE OF 1000.0
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L1    RUN STATEMENT CREATED
L1    712 GSSFLSPEHQR/SQP.-M PAM30 -W 2 -E 1000 -F F
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
54

The importance of using the correct BLAST advanced options (cont.)
Note: Correct use of BLAST options finds relevant sequence hits.

=> SOR L1 SCORE D
PROCESSING COMPLETED FOR L1
L2    712 SOR L1 SCORE D
=> D TRI ALIGN
L2    ANSWER 1 OF 712 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI    Fluorescently labeled growth hormone secretagogue (Patent)
MTY   Protein
SQL   18
BLASTALIGN
Query = 11 letters
Length = 18
Score = 37.5 bits (81), Expect = 1e-09
Identities = 11/11 (100%), Positives = 11/11 (100%)
Query: 1   GSSFLSPEHQR 11
           GSSFLSPEHQR
Sbjct: 1   GSSFLSPEHQR 11
55

Review: 7 steps of USGENE BLAST
1) SAVE, UPLOAD, and VERIFY the query (L1)
2) RUN the BLAST search (/SQP or /SQN)
3) Decide how many answers to keep (L2)
4) SORT SCORE in Descending order (L3)
5) Review answers in a free-of-charge format, e.g. D L3 TRI ORGN ALIGN 1
6) Display selected answers in bibliographic format, e.g.
D L3 BIB AB CLM ALIGN 1,3,10
7) Ensure transcript was captured and Logoff
56

USGENE answer sets may be grouped by source publications using Family SORT (FSORT)
• FSORT gathers multiple sequence hits from the same applications together via publication, application and/or WO/PCT related application numbers
• FSORT organizes answers into two subgroups: multiple sequence hit (multi-record) families and single sequence hit (individual-record) families
• When FSORT is used on an answer set previously sorted by similarity SCORE, the two FSORT subgroups each separately retain their similarity sort order
• FSORT makes it possible to review, e.g. just the most similar sequence answer for each application retrieved, or all the sequences from a single application
58

USGENE answer sets may be grouped by source publications using Family SORT (FSORT)
Search Question: Find all relevant U.S. published application and patent references with sequences similar to the Banana Bunchy Top Virus (BBTV) Replication Initiation Protein (NCBI: AAG44003).
59

Banana Bunchy Top Virus (BBTV) Replication Initiation Protein (NCBI: AAG44003)
60

SAVE, UPLOAD and VERIFY
Note: There are 17 sequence records in DGENE for CA2325774.
Note: The FILE and UPLOAD commands below are automatically run by the STN Express Sequence Query Upload wizard (slides 27-31).

=> FILE PCTGEN
=> UPL R BLAST
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
=> D L1 LQUE
L1    ANSWER 1 PCTGEN COPYRIGHT 2007 WIPO on STN
LQUE  MSSFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIK
      LGGLKKKYSSRAHWERARGSDEDNAKYCSKETLILELGFPASQGSNRRKLSEMVSRSPE
      RMRIEQPEIYHRYTSVKKLKKFKEEFVHPCLDRPWQIQLTEAIDEEPDDRSIIWVYGPN
      GNEGKSTYAKSLMKKDWFYTRGGKKENILFSYVDEGSEKHIVFDIPRCNQDYLNYDVIE
      ALKDRVIESTKYKPIKLVELINIHVIVMANFMPEFCKISEDRIKIIYC
=>
The sequence query is now ready for searching directly in USGENE using the L-number (L1).
61

RUN the USGENE BLAST search
Note: USGENE is updated within 7 days of publication by the USPTO.
Note: Turn the Low Complexity Filter off with the syntax /SQP -F F

=> FILE USGENE
FILE 'USGENE' ENTERED AT 22:44:51 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN BLAST L1 /SQP -F F
BLAST Version 2.2
The BLAST software is used herein with permission of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). See also, . . . .
BLAST SEARCHING . . . .
62

Decide how many answers to keep
Note: The graphic representation gives a count of hit sequences (x-axis) and similarity score (y-axis). The graph gives a visual clue about the proportion of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers

57 ANSWERS FOUND BELOW EXPECTATION VALUE OF 10.0
[Histogram: Similarity Score (y-axis, ticks at 260 and 520) against Answer Count (x-axis, ticks at 10, 20, 30, 40 and 50).]
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
63

SORT by SCORE descending
Note: Option: set a customized display format with SET FORMAT. The new format may be set as the file default with SET DFORMAT.

HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?:ALL
L2    RUN STATEMENT CREATED
L2    57 MSSFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQG
         YLSLKKSIKLGGLKKKYSSRAHWERARGSDEDNAKYCSKETLILELGFPA
         SQGSNRRKLSEMVSRSPERMRIEQPEIYHRYTSVKKLKKFKEEFVHPCLD
         RPWQIQLTEAIDEEPDDRSIIWVYGPNGNEGKSTYAKSLMKKDWFYTRGG
         KKENILFSYVDEGSEKHIVFDIPRCNQDYLNYDVIEALKDRVIESTKYKP
         IKLVELINIHVIVMANFMPEFCKISEDRIKIIYC/SQP.-F F
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L2
L3    57 SOR L2 SCORE D
=> SET FORMAT .MYUSGENE BIB AB ECLM ORGN SQL SCORE ALIGN
SET COMMAND COMPLETED
=> SET DFORMAT .MYUSGENE
SET COMMAND COMPLETED
64

Display selected USGENE answers using the new customized default display format
Note: The top hit is SEQ ID 16 from US5846705.

=> D 1-2
L3    ANSWER 1 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN    5846705.16 Protein USGENE
TI    Nucleotide sequence of two circular SSDNA associated with banana bunchy top virus and method for detection of banana bunchy top virus (Patent)
IN    Wu Rey-Yuh (Taipei, TW); You Li-Ru (Taipei, TW); Soong Tai-Seng (Taipei, TW)
PA    Development Center for Biotechnology (Taipei TW)
PI    US 5846705 A 19981208
AI    US 1995-418071 19950406
AB    Nucleotide sequences of two circular single-stranded DNAs . . . .
ECLM  US5846705 A: 1. An isolated DNA molecule comprising a . . . .
ORGN  Unknown
SQL   286
SCORE 520
BLASTALIGN
Query = 284 letters
Length = 286
Score = 520 bits (1338), Expect = e-152
Identities = 247/282 (87%), Positives = 268/282 (94%)
Query: 3   SFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKLGG
           S KWCFTLNYSSAAERE+FL+LLKEE+++YAVVGDEVAP++GQKHLQGYLSLKK I+LGG
Sbjct: 5   SLKWCFTLNYSSAAERENFLSLLKEEDVHYAVVGDEVAPATGQKHLQGYLSLKKRIRLGG
. . . . .
65

The second hit sequence comes from the same U.S.
patent as the top hit
L3 ANSWER 2 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN 5846705.17 Protein USGENE
TI Nucleotide sequence of two circular SSDNA associated with banana bunchy top virus and method for detection of banana bunchy top virus (Patent)
IN Wu Rey-Yuh (Taipei, TW); You Li-Ru (Taipei, TW); Soong Tai-Seng (Taipei, TW)
PA Development Center for Biotechnology (Taipei TW)
PI US 5846705 A 19981208
AI US 1995-418071 19950406
AB Nucleotide sequences of two circular single-stranded DNAs . . . .
ECLM US5846705 A: 1. An isolated DNA molecule comprising a nucleotide sequence encoding a polypeptide comprising amino acid . . . .
ORGN Unknown
SQL 285
SCORE 340
BLASTALIGN
Query = 284 letters
Length = 285
Score = 340 bits (872), Expect = 2e-98
Identities = 171/288 (59%), Positives = 217/288 (74%), Gaps = 7/288
Query: 1 MSSFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKL
         MSSFKWCFTLNYSSAAEREDFLALLKEE+++Y+VVGDEVAP++GQKHL GYLSLKKSI+L
Sbjct: 1 MSSFKWCFTLNYSSAAEREDFLALLKEEDVHYSVVGDEVAPATGQKHLGGYLSLKKSIRL
. . . . .
The 2nd hit is SEQ ID 17 from US5846705.

66 USGENE answer sets may be grouped by source publications using Family SORT (FSORT)
=> FSORT L3
SEL L3 1- PN,APPS
L4 SEL L3 1- PN APPS : 45 TERMS
'L4' DELETED
L4 57 FSO L3
12 Multi-record Families    Answers 1-50
   Family 1                 Answers 1-3
   Family 2                 Answers 4-5
   Family 3                 Answers 6-7
   Family 4                 Answers 8-13
   Family 5                 Answers 14-19
   Family 6                 Answers 20-25
   Family 7                 Answers 26-31
   Family 8                 Answers 32-37
   Family 9                 Answers 38-39
   Family 10                Answers 40-44
   Family 11                Answers 45-47
   Family 12                Answers 48-50
7 Individual Records        Answers 51-57
0 Non-patent Records
The 57 sequence hits belong to 12 multi-hit and 7 individual-hit source publications.

67 Use the patent family display (PFAM) feature to display selective records from a FSORT L-number
General format of PFAM:
=> D L# PFAM=# RECORD# FORMAT
Examples using PFAM:
=> D PFAM=1-10
   1st member of patent family number 1-10 in default display format
=> D PFAM=2 TRI ORGN ALIGN 1-TOTAL
   All members of family number 2 in a free sequence review format

68 The top answer is the same as before….
=> D PFAM=1-2
The first record from families 1 & 2 in default format.
L4 ANSWER 1 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY1
AN 5846705.16 Protein USGENE
TI Nucleotide sequence of two circular SSDNA associated with banana bunchy top virus and method for detection of banana bunchy top virus (Patent)
IN Wu Rey-Yuh (Taipei, TW); You Li-Ru (Taipei, TW); Soong Tai-Seng (Taipei, TW)
PA Development Center for Biotechnology (Taipei TW)
PI US 5846705 A 19981208
AI US 1995-418071 19950406
AB Nucleotide sequences of two circular single-stranded DNAs . . . .
ECLM US5846705 A: 1. An isolated DNA molecule comprising a . . . .
ORGN Unknown
SQL 286
SCORE 520
BLASTALIGN
Query = 284 letters
Length = 286
Score = 520 bits (1338), Expect = e-152
Identities = 247/282 (87%), Positives = 268/282 (94%)
Query: 3 SFKWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKLGG
         S KWCFTLNYSSAAERE+FL+LLKEE+++YAVVGDEVAP++GQKHLQGYLSLKK I+LGG
Sbjct: 5 SLKWCFTLNYSSAAERENFLSLLKEEDVHYAVVGDEVAPATGQKHLQGYLSLKKRIRLGG
. . . . .
The top hit is SEQ ID 16 from US5846705.
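The idea behind grouping sequence hits by source publication can be sketched offline in Python. This is only an illustration, not how STN implements FSORT: it assumes that the part of a USGENE accession number before the dot identifies the U.S. publication and the part after it is the SEQ ID (as in 5846705.16, SEQ ID 16 from US5846705), whereas FSORT itself links records through the PN and APPS fields. The function name group_by_publication and the ordering of groups by best score are hypothetical choices made for this sketch.

```python
from collections import defaultdict

# (accession number, similarity score) pairs copied from the displays above.
# Assumption: "<publication number>.<SEQ ID>" is the accession number layout.
hits = [
    ("5756708.26", 243),
    ("5846705.17", 340),
    ("5846705.16", 520),
]


def group_by_publication(hits):
    """Group hits by source publication, best-scoring group and record first."""
    groups = defaultdict(list)
    for accession, score in hits:
        publication, seq_id = accession.split(".")
        groups[publication].append((int(seq_id), score))
    # Sort the records inside each group by descending score.
    for records in groups.values():
        records.sort(key=lambda record: record[1], reverse=True)
    # Sort the groups by the score of their best record.
    return sorted(groups.items(), key=lambda item: item[1][0][1], reverse=True)


grouped = group_by_publication(hits)
for number, (publication, records) in enumerate(grouped, start=1):
    print(f"Group {number}: US{publication}")
    for seq_id, score in records:
        print(f"  SEQ ID {seq_id}  SCORE {score}")
```

With these three records the two sequences from US5846705 fall into the first group and the US5756708 sequence into the second, which mirrors the order in which the PFAM displays present the families.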
69 …but the second answer displayed is now the best answer from the 2nd family L4 AN TI IN ANSWER 4 OF 57 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY2 5756708.26 Protein USGENE DNA sequences of banana bunchy top virus (Patent) Karan Mirko (Holland Park, AU); Burns Thomas Michael (Herston, AU); Dale James Langham (Moggill, AU); Harding Robert Maxwell(Lawnton, AU) PA Queensland University of Technology(Brisbane AU) PI US 5756708 A 19980526 AI US 1994-202186 19940224 DT Patent AB The invention provides DNA molecules consisting essentially of a nucleotide sequence or part thereof which are associated . . . . ECLM US5756708 A: 1. An isolated DNA molecule derived from banana bunchy top virus, consisting of a nucleotide sequence selected . . . . ORGN Unknown SQL 290 SCORE 243 The 2nd hit is now SEQ ID 26 from US5756708. BLASTALIGN Query = 284 letters Length = 290 Score = 243 bits (621), Expect = 3e-69 Identities = 117/282 (41%), Positives = 183/282 (64%), Gaps = 6/282 Query: 5 KWCFTLNYSSAAEREDFLALLKEEELNYAVVGDEVAPSSGQKHLQGYLSLKKSIKLGGLK +WCFTLNY + E + + ++ L YA+VGDEVAPS+GQ+HLQG++ LK +L GLK Sbjct: 7 RWCFTLNYETEEEAANVVRRIESLNLVYAIVGDEVAPSTGQRHLQGFIHLKTGRRLQGLK . . . . . 70 Agenda • • • • • • • • • • STN sequence searchable databases USGENE database content The 7 basic steps of USGENE BLAST® BLAST and Patent Family SORT (FSORT) Post-processing BLAST search results Sequence Code Match (SCM) with GETSEQ Similarity searching GETSIM (FASTA) Offline BATCH search mode Multifile searching with DGENE Comparisons and conclusions 71 STN Express 8.2+ post-processing tools • Table Tool to create tabulated results – Good for scanning/reviewing search results • Predefined Report Tool for a report using a Standard Patent Record layout – Easy way to tidy-up your patent results for a client • Customized Report Tool to control all options – E.g. 
fonts, cover page, which data fields to include

72 USGENE results may be tabulated using STN Express 8.2+ Table Tool
Search Question: Find all relevant U.S. published application and patent references with sequences similar to the Human osteoprotegerin (OPG) mRNA, complete CDS (NCBI: U94332).

73 Human osteoprotegerin (OPG) mRNA, complete CDS (NCBI: U94332)

74 Ensure you capture your STN session
Record your session as a Transcript (.TRN) file or as an RTF file.

75 SAVE, UPLOAD and VERIFY
=> FILE PCTGEN
=> UPL R BLAST
UPLOAD SUCCESSFULLY COMPLETED
L1 GENERATED
=> D L1 LQUE
L1 ANSWER 1 PCTGEN COPYRIGHT 2007 WIPO on STN
LQUE gtatatataacgtgatgagcgtacgggtgcggagacgcaccggagcgctcgcccagccg
     ccgctccaagcccctgaggtttccggggaccacaatgaacaagttgctgtgctgcgcgc
     tcgtgtttctggacatctccattaagtggaccacccaggaaacgtttcctccaaagtac
     . . . . .
     tggccattgagctgtttcctcacaattggcgagatcccatggatgataa
=>
These commands are automatically run by the STN Express Sequence Query Upload wizard (slides 27-31).
The sequence query is now ready for searching directly in USGENE using the L-number (L1).

76 RUN the USGENE BLAST search
=> FILE USGENE
FILE 'USGENE' ENTERED AT 04:38:02 ON 10 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 8 OCT 2007 <20071008/UP>
MOST RECENT PUBLICATION DATE: 4 OCT 2007 <20071004/PD>
FILE COVERS 1982 TO DATE
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN BLAST L1 /SQN -F F
BLAST Version 2.2
The BLAST software is used herein with permission of the National Center for Biotechnology Information (NCBI) of the National Library of Medicine (NLM). See also, . . . .
BLAST SEARCHING . . . .
USGENE is updated within 7 days of publication by the USPTO.
Turn the Low Complexity Filter off with the syntax… /SQN -F F
77 Decide how many answers to keep
554 ANSWERS FOUND BELOW EXPECTATION VALUE OF 10.0
[Histogram: similarity score (y-axis, marked at 2686 and 1343) against answer count (x-axis, marked at 110, 220, 330, 440, 550)]
The graphic representation gives a count of hit sequences (x-axis) and similarity score (y-axis). The graph gives a visual clue about the proportion of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL

78 SORT by SCORE descending
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L2 RUN STATEMENT CREATED
L2 554 GTATATATAACGTGATGAGCGTACGGGTGCGGAGACGCACCGGAGCGCTC
       GCCCAGCCGCCGCTCCAAGCCCCTGAGGTTTCCGGGGACCACAATGAACA
       . . . . .
       TGGAAATGGCCATTGAGCTGTTTCCTCACAATTGGCGAGATCCCATGGAT
       GATAA/SQN.-F F
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
=> SET SFIELDS BI ECLM PERM
SET COMMAND COMPLETED
Use SET SFIELDS to change the USGENE default search index.
=> S L2 AND (OSTEO? OR BONE) AND GRANTED/SSO AND AY<2001
L3 245 L2 AND (OSTEO?/BI,ECLM OR BONE/BI,ECLM) AND GRANTED/SSO AND AY<2001
=> SOR SCORE D
PROCESSING COMPLETED FOR L3
L4 245 SOR L3 SCORE D
After refining using date and text terms remember to SOR SCORE D.

79 Grouped by source publications using Family SORT (FSORT)
=> FSORT L4
SEL L4 1- PN,APPS
L5 SEL L4 1- PN APPS : 25 TERMS
L5 245 FSO L4
11 Multi-record Families    Answers 1-244
   Family 1                 Answers 1-11
   Family 2                 Answers 12-22
   Family 3                 Answers 23-33
   Family 4                 Answers 34-44
   Family 5                 Answers 45-71
   Family 6                 Answers 72-83
   Family 7                 Answers 84-118
   Family 8                 Answers 119-179
   Family 9                 Answers 180-240
   Family 10                Answers 241-242
   Family 11                Answers 243-244
1 Individual Record         Answer 245
0 Non-patent Records
The 245 sequence hits belong to 11 multi-hit and 1 individual-hit source publications.

80 Reviewing the SCORE display can be one way to identify answers of interest
=> D PFAM=1- SCORE
The SCORE for the best answer from each family.
L5 ANSWER 1 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY1
SCORE 2686
L5 ANSWER 12 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY2
SCORE 2686
. . . . .
L5 ANSWER 119 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY8
SCORE 2375
L5 ANSWER 180 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY9
SCORE 2375
L5 ANSWER 241 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY10
SCORE 40
L5 ANSWER 243 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY11
SCORE 40
L5 ANSWER 245 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
SCORE 2686
Note: the FSORT individual-hit record also has a top score.

81 Use the PFAM feature to display selective records from an FSORT L-number
=> D PFAM=1-9,12
L5 ANSWER 1 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY1
AN 6284740.5 cDNA USGENE
TI Osteoprotegerin (Patent)
IN Boyle William J. (Moorpark, CA); Lacey David L. (Thousand Oaks, CA); Calzone Frank J. (Westlake Village, CA); . . . .
PA Amgen Inc (Thousand Oaks CA)
PI US 6284740 B1 20010904
AI US 1997-974186 19971118
AB The present invention discloses a novel secreted polypeptide, termed Osteoprotegerin, which is a member of the tumor necrosis . . . .
ECLM US6284740 B1: What is claimed is: 1. A method of increasing levels of osteoprotegerin in a mammal comprising administering to . . . .
ORGN not provided
SQL 1355
SCORE 2686
The top hit is SEQ ID 5 from US6284740.
BLASTALIGN
Query = 1355 letters
Length = 1355
Score = 2686 bits (1355), Expect = 0.0
Identities = 1355/1355 (100%)
Strand = Plus / Plus
Query: 1 gtatatataacgtgatgagcgtacgggtgcggagacgcaccggagcgctcgcccagccgc
         ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1 gtatatataacgtgatgagcgtacgggtgcggagacgcaccggagcgctcgcccagccgc

82 Use the PFAM feature to display selective records from an FSORT L-number (cont.)
L5 ANSWER 180 OF 245 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN FAMILY9
AN 6919434.6 DNA USGENE
TI Monoclonal antibodies that bind OCIF (Patent)
IN Goto Masaaki (Tochigi, JP); Tsuda Eisuke (Tochigi, JP); . . . .
PA Sankyo Co Ltd (Tokyo JP)
PI US 6919434 B1 20050719
AI US 1999-338063 19990623
AB A protein which inhibits osteoclast diffraction and/or maturation and a method for producing the protein. The protein is produced by human embryonic lung fibroblasts and has a molecular weight of . . . .
ECLM US6919434 B1: 1. An isolated monoclonal antibody produced by a hybridoma selected from the group consisting of A1G5 having Accession No. FERM BP-7441, D2F4 having Accession No. FERM BP-7442, . . . .
ORGN Unknown
SQL 1206
SCORE 2375
BLASTALIGN
Query = 1355 letters
Length = 1206
Score = 2375 bits (1198), Expect = 0.0
Identities = 1204/1206 (99%)
Strand = Plus / Plus
Query: 94 atgaacaagttgctgtgctgcgcgctcgtgtttctggacatctccattaagtggaccacc
          |||||||| |||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct: 1  atgaacaacttgctgtgctgcgcgctcgtgtttctggacatctccattaagtggaccacc
This hit is SEQ ID 6 from US6919434.

83 After logging off from STN select the table tool from the main STN Express tool bar
The most recent Transcript is automatically selected.

84 If available choose any template you have defined previously
The first time you use the table tool, no templates have been defined yet.

85 Choose a previously defined template
Pick the chosen answer set L-number and record numbers.
86 Set highlighting preferences
Extra terms that were not originally searched may be highlighted.

87 Set up report cover page
Here, we have decided not to add a cover page.

88 Select fields, fonts, colors, change field order, customize field names and save templates
Choose fields, order, formats and personalized names.

89 STN Express Table Tool output can be edited and adjusted as needed
If needed, go back and edit your choices of fields, formats, etc.

90 STN Express Table Tool output can be edited, adjusted and saved in Excel format
Note: see separate appendix for the full printout of this table.

91 Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• Similarity searching GETSIM (FASTA)
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions

92 Sequence code match (SCM) searching in USGENE using RUN GETSEQ
• GETSEQ is designed to retrieve either exact matches to a sequence query, or answers with conservative variation using special symbols
• It can also be used to retrieve exact length matches, or subsequence hits, i.e. where the query is a small part of a larger hit sequence
• GETSEQ can prove to be a fast, precise and effective alternative to BLAST for very short sequence queries, e.g. DNA probes and primers
The DGENE Workshop Manual is the complete guide (page 38): http://www.stn-international.com/training_center/bioseq/dgene_wm.pdf

93 Sequence code match (SCM) searching in USGENE using RUN GETSEQ
Search Question: Find all relevant U.S.
published application and patent references which were applied for prior to 2001, disclosing sequences with this fragment: DSDGLAPPQHLIRV

94 RUN GETSEQ command syntax
Sequence Code Match searching with GETSEQ
=> RUN GETSEQ L1 (sequence or L-number)
   /SQEP (exact protein) (default)
   /SQEFP (exact family protein)
   /SQSP (subsequence protein)
   /SQSFP (subsequence family protein)
   /SQEN (exact nucleotide)
   /SQSN (subsequence nucleotide)

95 Amino acid families for RUN GETSEQ SQEFP and SQSFP search options
Group                            Amino acids
Neutral – weakly hydrophobic     P, A, G, S, T
Acid Amine – hydrophilic         Q, N, E, D, B, Z
Basic – hydrophilic              H, K, R
Hydrophobic                      I, M, L, V
Aromatic                         F, W, Y
Cross-linking                    C

96 GETSEQ searches can be combined with other search terms, e.g. application year
=> FILE USGENE
FILE 'USGENE' ENTERED AT 20:51:09 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
>>> SIMULTANEOUS LEFT AND RIGHT TRUNCATION (SLART) IS AVAILABLE IN THE BASIC INDEX (/BI) AND FEATURE TABLE (/FEAT) FIELDS <<<
=> RUN GETSEQ DSDGLAPPQHLIRV/SQSP
RUN GETSEQ AT 20:51:38 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 114 DSDGLAPPQHLIRV/SQSP
=> S L1 AND AY<2001
L2 72 L1 AND AY<2001
72 sequence hits (L2) containing the sequence fragment of interest have been found in USGENE.

97 The BRIEF format provides full bibliography and abstract ….
=> D BRIEF
L2 ANSWER 1 OF 72 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN 6326464.44 Protein USGENE
TI P53 protein variants and therapeutic uses thereof (Patent)
IN Conseiller Emmanuel (Paris, FR); Bracco Laurent (Paris, FR)
PA Aventis Pharma S A (Antony FR)
PI US 6326464 B1 20011204
   WO 1997004092 A 19970206
AI US 1998-983035 19980220
RLI WO 1996-FR1111 19960717
ED 20070328
DT Patent
AB Proteins derived from the product of tumor suppressor gene p53 and having enhanced functions for therapeutical use are disclosed. The proteins advantageously have enhanced tumour suppressor and programmed cell death inducer functions, particularly in proliferative disease contexts where wild-type p53 protein is inactivated. Nucleic acids coding for such molecules, vectors containing same, and therapeutical use thereof, particularly in gene therapy, are also disclosed.
This sequence hit comes from a U.S. granted patent, with an application date prior to 2001.
Continued on next slide….

98 …. plus the exemplary claim and sequence
D BRIEF (cont.)
ECLM US6326464 B1: What is claimed is: 1. A variant of p53 protein wherein a C-terminal portion of the protein comprising a regulation domain and a part of an oligomerization domain is deleted from residue 326 or from residue 337 and replaced by an artificial leucine zipper comprising residues 334-363 of SEQ ID No: 26; and a transactivation domain is deleted and replaced by a VP16 transactivation domain.
SSO PROTEIN; EMBL; GRANTED
ORGN Unknown
SQL 335
SEQ
  1 mgeyftlqir grerfemfre lnealelkda qagkepgrgg ggsggggsgg
 51 ggsrpapaap tpaapapaps wplsssvpsq ktyqgsygfr lgflhsgtak
101 svtctyspal nkmfcqlakt cpvqlwvdst pppgtrvram aiykqsqhmt
151 evvrrcphhe rcsdsdglap pqhlirvegn lrveylddrn tfrhsvvvpy
                  ======= =======
201 eppevgsdct tihynymcns scmggmnrrp iltiitleds sgnllgrnsf
251 evrvcacpgr drrteeenlr kkgephhelp pgstkralpn ntssspqpkk
301 kpldgdlkal keklkaleek lkaleeklka lvger
HITS AT: 164-177
The hit portion of the answer sequence is highlighted with double underlining.

99 Sequence code match (SCM) searching in USGENE using RUN GETSEQ
Search Question: Find all relevant U.S.
published application and patent references disclosing one or more of the sequences represented by this Markush:
LGPX1QLCX2VX3CAP
X1 = V or L
X2 = any amino acid except G or H
X3 = any amino acid

100 Variability symbols for RUN GETSEQ sequence code match searches
Symbol   Function
[]       Specify alternate residues
[-]      Exclude a specific residue or alternate residues
{}       Repeat the preceding symbol(s) (number or range)
?        Repeat the preceding symbol(s) zero or one time
*        Repeat the preceding symbol(s) zero or more times
+        Repeat the preceding symbol(s) one or more times
^        Query appears at the beginning or the end of a sequence
|        Alternate sequence expressions
.        A gap of one residue
:        A gap of zero or one residues
&        Concatenate (join together) sequence queries

101 GETSEQ can be a flexible alternative to BLAST for short sequence queries
=> FILE USGENE
=> RUN GETSEQ LGP[VL]QLC[-GH]LV.CAP/SQSP
RUN GETSEQ AT 22:40:20 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
L1 RUN STATEMENT CREATED
L1 13 LGP[VL]QLC[-GH]LV.CAP/SQSP
13 sequence hits (L1) have been found in USGENE containing the sequence fragment(s) of interest.
=> D TRI SEQ
L1 ANSWER 1 OF 13 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI Method and composition for treating and preventing tumor metastasis in vivo (PublishedApplication)
MTY Protein
SQL 20
SEQ 1 tvlllgplql calvhcappa
          ====== ========
HITS AT: 5-18
The hit portion of the answer sequence is highlighted with double underlining.
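For readers who know regular expressions, the GETSEQ variability symbols can be related to them with a short Python sketch. It is an offline illustration only and makes several assumptions: it translates just the two symbols whose spelling differs from regex syntax in an obvious way ([-GH] becomes [^GH], and : becomes .?), it leaves ^ and & unhandled, and the helper names getseq_to_regex and find_hits are hypothetical. It says nothing about how STN evaluates a GETSEQ query internally.

```python
import re


def getseq_to_regex(query: str) -> str:
    """Translate a small subset of GETSEQ variability symbols to a Python regex."""
    # [-GH] (exclude residues) corresponds to the negated character class [^GH].
    pattern = query.replace("[-", "[^")
    # ':' is a gap of zero or one residues. '.' (a gap of exactly one residue)
    # already has that meaning in a regex, so it is left unchanged.
    pattern = pattern.replace(":", ".?")
    return pattern


def find_hits(query: str, sequence: str):
    """Return 1-based (start, end) positions of subsequence matches."""
    regex = re.compile(getseq_to_regex(query), re.IGNORECASE)
    return [(match.start() + 1, match.end()) for match in regex.finditer(sequence)]


# Query string and answer sequence as shown on the slide above.
query = "LGP[VL]QLC[-GH]LV.CAP"
answer = "tvlllgplqlcalvhcappa"

print(getseq_to_regex(query))    # LGP[VL]QLC[^GH]LV.CAP
print(find_hits(query, answer))  # [(5, 18)]
```

The match at positions 5-18 agrees with the HITS AT line in the display for answer 1.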
102 Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• GETSIM (FASTA) similarity searching
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions

103 Similarity searching in USGENE using FASTA-based RUN GETSIM
• GETSIM was originally developed by FIZ Karlsruhe for DGENE, and it has since been implemented in both PCTGEN and USGENE
• It is based on the industry standard FASTA methodology, and offers the same basic search modes as BLAST (/SQP, /SQN and /TSQN)
• Since GETSIM requires more computational time than BLAST, it is usually a good idea to make use of the offline BATCH search mode
The DGENE Workshop Manual is the complete guide (page 60): http://www.stn-international.com/training_center/bioseq/dgene_wm.pdf

104 General differences between FASTA (GETSIM) and BLAST algorithms
BLAST | FASTA (GETSIM)
Faster than FASTA | Slower than BLAST
Equivalent for highly similar sequences (both)
Misses some less similar sequences | Better for less similar sequences
Comparison of shorter sequence parts | Comparison of entire sequence length
Less sensitive when using default settings | More sensitive, misses fewer homologs
Less separation between true homologs and random hits | More separation between true homologs and random hits
Calculates probabilities | Calculates significance “on the fly” from the given dataset

105 Similarity searching in USGENE using FASTA-based RUN GETSIM
Search Question: Find sequences in U.S.
published applications and patents which are similar to the following nucleic acid query sequence: GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACCGGATAA

106 RUN GETSIM command syntax
Similarity Searching with GETSIM (protein/polypeptides)
=> RUN GETSIM L1 (sequence or L-number)
   /SQP (protein) (default)
   BATCH (offline)
   ALERT (current awareness)

107 RUN GETSIM command syntax
Similarity Searching with GETSIM (nucleotides)
=> RUN GETSIM L1 (sequence or L-number)
   /SQN (nucleotide)
   SIN (single strand) (default)
   COM (complementary strand)
   BOTH (both strands)
   BATCH (offline)
   ALERT (current awareness)

108 Similarity searching in USGENE using FASTA-based RUN GETSIM
=> FILE USGENE
FILE 'USGENE' ENTERED AT 20:09:16 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
Sequences of less than 256 characters may be searched directly on the command line. Longer sequences must be uploaded (see slides 27-31).
=> RUN GETSIM GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACCGGATAA/SQN
RUN GETSIM AT 20:10:11 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
100000 SEQUENCES PROCESSED
. . . .
5260000 SEQUENCES PROCESSED
6914 ANSWERS FOUND ABOVE A THRESHOLD OF 56
QUERY SELF SCORE VALUE IS 260
6914 sequence hits have been found above the similarity threshold automatically set by STN.
GETSIM calculates a query self score, to help assess answer similarity.

109 Decide how many answers to keep
[Histogram: similarity score (y-axis, marked at 251 and 126) against answer count (x-axis, marked at 1380, 2760, 4140, 5520, 6900)]
The graphic representation gives a count of hit sequences (x-axis) and similarity score (y-axis). The graph gives a visual clue about the proportion of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL

110 SORT by SCORE descending
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL
L1 RUN STATEMENT CREATED
L1 6914 GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACCGGATAA/SQN
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
=> SOR SCORE D
PROCESSING COMPLETED FOR L1
L2 6914 SOR L1 SCORE D
As with a BLAST search, the initial GETSIM search answer set should be sorted by similarity score descending, to bring the best answers to the top.

111 Review answers with a free-of-charge format including alignment
=> D TRI ORGN SCORE ALIGN 1-100
L2 ANSWER 1 OF 6914 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI Capsid polypeptides and use to inhibit viral packaging (Patent)
MTY DNA
SQL 4580
ORGN Unknown
SCORE 251 96% of query self score 260
ALIGN Smith-Waterman score: 251
52 na overlap starting at 1958
ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa
:::...:::::.::.:::.:..::::.::::::.:.::.:::.:::::: ::
gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa
L2 ANSWER 2 OF 6914 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
TI Detection kits, such as nucleic acid arrays, for detecting the expression of 10,000 or more Drosophila genes and uses thereof (PublishedApplication)
MTY DNA
SQL 2327
ORGN DROSOPHILA
SCORE 101 38% of query self score 260
ALIGN Smith-Waterman score: 101
46 na overlap starting at 144
ggagugguaggu_cuuacgaugccagcuguaaugccuaccggataa
::::.::. : . : .: :   :::::.  ::.::: :::: :: :
ggagtggtggctccatatgcctccagcttcaatgcccaccgcatca
The GETSIM SCORE display includes a similarity percentage of query self score (Score/Query Self Score x 100).
The GETSIM ALIGN display:
• First line: portion of query with similarity
• Second line: similarity (two dots: identical; one dot: family match; blank: no match)
• Third line: portion of retrieved sequence with similarity

112 Display selected USGENE answers in a preferred bibliographic format
=> D BIB AB ECLM ORGN SQL SCORE ALIGN
USGENE records can be displayed in a wide variety of customized formats.
L2 ANSWER 1 OF 6914 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN
AN 5831013.1 DNA USGENE
TI Capsid polypeptides and use to inhibit viral packaging (Patent)
IN Bruenn Jeremy A. (Buffalo, NY); Yao Wensheng (Kenmore, NY)
PA The Research Foundation of State University of New York (Amherst NY)
PI US 5831013 A 19981103
AI US 1996-674351 19960702
DT Patent
AB The present invention is directed to a viral capsid polypeptide capable of inhibiting viral packaging, the viral capsid polypeptide consisting of a portion of a viral capsid protein of an RNA virus and including a multimerization domain of the viral capsid protein. The invention further provides an isolated nucleic acid . . . .
ECLM US5831013 A: 1.
A viral capsid polypeptide capable of inhibiting viral packaging, said viral capsid polypeptide having an amino acid sequence selected from the group consisting of amino acids 1 to 473 of SEQ ID NO:2 and amino acids 1 to 443 of SEQ ID NO:4.
ORGN Unknown
SQL 4580
SCORE 251 96% of query self score 260
ALIGN Smith-Waterman score: 251
52 na overlap starting at 1958
ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa
:::...:::::.::.:::.:..::::.::::::.:.::.:::.:::::: ::
gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa

113 Agenda
• STN sequence searchable databases
• USGENE database content
• The 7 basic steps of USGENE BLAST®
• BLAST and Patent Family SORT (FSORT)
• Post-processing BLAST search results
• Sequence Code Match (SCM) with GETSEQ
• GETSIM (FASTA) similarity searching
• Offline BATCH search mode
• Multifile searching with DGENE
• Comparisons and conclusions

114 BLAST and GETSIM similarity searches can both be run offline in BATCH search mode
• Multiple BATCH requests may be queued, to run sequentially one after another
  – A maximum of 16 requests can be queued per STN Login ID
• BATCH request results may be collected in an online session up to 3 months from initiation
  – Already retrieved results may be re-retrieved multiple times at no additional cost, up to 8 days from the initial retrieval
• BATCH is most useful for GETSIM queries, as these can take considerable computational time when run online
  – Also a higher query length limit of 2,000 characters is permitted

115 Similarity searching in USGENE using GETSIM in offline BATCH mode
=> FILE USGENE
FILE 'USGENE' ENTERED AT 20:40:17 ON 06 OCT 2007
COPYRIGHT (C) 2007 SEQUENCEBASE CORP
FILE LAST UPDATED: 2 OCT 2007 <20071002/UP>
MOST RECENT PUBLICATION DATE: 27 SEP 2007 <20070927/PD>
FILE COVERS 1982 TO DATE
To automatically search the nucleotide sequence and its complement specify BOTH.
=> RUN GETSIM GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACCGGATAA/SQN BOTH BATCH
Add BATCH for BATCH mode.
PLEASE ENTER BATCH IDENTIFIER (MAX. 8 CHARS): EXAMPLE3
Name the BATCH search.
RUN GETSIM AT 20:40:44 ON 06 OCT 2007
COPYRIGHT (C) 2007 FIZ KARLSRUHE GMBH
BATCH PROCESSING STARTED FOR EXAMPLE3
=> LOG H
SESSION WILL BE HELD FOR 120 MINUTES
STN INTERNATIONAL SESSION SUSPENDED AT 20:41:23 ON 06 OCT 2007
Most GETSIM searches take between 5 and 20 minutes to run.

116 Use RUN GETBATCH to retrieve and manage the results of BATCH searches
* * * * * * RECONNECTED TO STN INTERNATIONAL * * * * * *
SESSION RESUMED IN FILE 'USGENE' AT 20:57:23 ON 06 OCT 2007
FILE 'USGENE' ENTERED AT 20:57:23 ON 06 OCT 2007
Log in within 2 hours if you want to reconnect to your previous STN session.
=> RUN GETBATCH
Please enter your batch identifier
or enter # for batch id list
or enter * for batch id at top of list
or enter - before batch id to delete
or enter . for (end)
BATCH REQUEST: #
Enter # for a BATCH ID list.
Batch result files remaining:
EXAMPLE1 Retrieved (getsim)
EXAMPLE2 Retrieved (getsim)
EXAMPLE3 Completed (getsim)
----------------------
BATCH result files status can be: Queued, Running, Completed or Retrieved.
Please enter your batch identifier
or enter # for batch id list
or enter * for batch id at top of list
or enter - before batch id to delete
or enter . for (end)
BATCH REQUEST: EXAMPLE3
Enter the name of the BATCH search results to retrieve.

117 Decide how many answers to keep
5230 ANSWERS FOUND ABOVE A THRESHOLD OF 66
QUERY SELF SCORE VALUE IS 260
[Histogram: similarity score (y-axis, marked at 251 and 126) against answer count (x-axis, marked at 1050, 2100, 3150, 4200, 5250)]
The graphic representation gives a count of hit sequences (x-axis) and similarity score (y-axis). The graph gives a visual clue about the proportion of similar and not so similar sequences in the answer set.
Recommendation: keep ALL answers
HOW MANY ANSWERS WOULD YOU LIKE TO KEEP ? (ALL) OR ?: ALL

118 After BATCH retrieval, all search, sort and display options are the same as in online search mode
L1 RUN STATEMENT CREATED
L1 5230 GGGUUUAGGAGUGGUAGGUCUUACGAUGCCAGCUGUAAUGCCUACCGGATAA/SQN.BOTH
Answer set arranged by accession number; to sort by descending similarity score, enter at an arrow prompt (=>) "sor score d".
Batch result files remaining:
EXAMPLE1 Retrieved (getsim)
EXAMPLE2 Retrieved (getsim)
EXAMPLE3 Retrieved (getsim)
----------------------
BATCH result files status can be: Queued, Running, Completed or Retrieved.
=> SOR SCORE D
PROCESSING COMPLETED FOR L1
L2 5230 SOR L1 SCORE D
=> D TRI ORGN ALIGN 1-10 L2 TI MTY SQL ORGN ALIGN ANSWER 1 OF 5230 USGENE COPYRIGHT 2007 SEQUENCEBASE CORP on STN Capsid polypeptides and use to inhibit viral packaging (Patent) DNA 4580 Unknown Smith-Waterman score: 251 52 na overlap starting at 1958 ggguuuaggagugguaggucuuacgaugccagcuguaaugccuaccggataa :::...:::::.::.:::.:..::::.::::::.:.::.:::.:::::: :: gggtttaggagtggtaggtcttacgatgccagctgtaatgcctaccggagaa 119 Agenda • • • • • • • • • • STN sequence searchable databases USGENE database content The 7 basic steps of USGENE BLAST® BLAST and Patent Family SORT (FSORT) Post-processing BLAST search results Sequence Code Match (SCM) with GETSEQ GETSIM (FASTA) similarity searching Offline BATCH search mode Multifile searching with DGENE Comparisons and conclusions 120 In general, multifile sequence searching workflow uses PN, AP, PRN and RLN numbers USGENE REGISTRY RN PN APPS PN APPS HCAPLUS DGENE PN APPS PN APPS PCTGEN See also Effective patent sequence searching on STN (Part V): http://www.stn-international.com/training_center/bioseq/epss.pdf 121 Multifile searching with DGENE • The simple document based approach – See STN transcript appendix number 1. • The simple patent family based approach – See STN transcript appendix number 2. • The advanced patent family approach – See STN Transcript appendix number 3. Appendices are provided in the USGENE Workshop Manual: http://www.stn-international.com/archive/presentations/USGENE_ws_1107.pdf 122 Agenda • • • • • • • • • • STN sequence searchable databases USGENE database content The 7 basic steps of USGENE BLAST® BLAST and Patent Family SORT (FSORT) Post-processing BLAST search results Sequence Code Match (SCM) with GETSEQ GETSIM (FASTA) similarity searching Offline BATCH search mode Multifile searching with DGENE Comparisons and conclusions 123 How does USGENE compare to other USPTO sequence data sources? 
Database                   Update frequency   Typical timeliness   Backfile coverage   Value added
USGENE                     Weekly             7 days               1982 -
DGENE (DWPI basics)        Biweekly           65 days              1981 -
REGISTRY (CAplus basics)   Daily              27 days              1957 -
NCBI/EMBL                  Daily              1-3 months           1982 -

124 How does USGENE compare to other USPTO sequence data sources? (cont.)
[Comparison table. Rows: USGENE, DGENE (DWPI basics), REGISTRY (CAplus basics), NCBI/EMBL. Columns: USPTO PGPs, USPTO Patents, USPTO claims text, Value added]

125 Comparing STN databases…
• DGENE
  – The most comprehensive patent sequence database
  – Implemented in-house at major patent offices
• REGISTRY
  – More timely than DGENE; complementary indexing
  – Unique non-patent literature coverage
• USGENE
  – More timely than DGENE and REGISTRY (7 days)
  – Sequences from equivalent USPTO applications and patents
• PCTGEN
  – The most timely database (24 hours)
  – Sequences from equivalent WIPO/PCT publications

126 Conclusions
• USGENE is a vital new tool for business critical patent searches, providing a complete collection of U.S. Issued Patent sequences with searchable claims text
• USGENE also provides a collection of published application sequence data, not covered by NCBI/EMBL
• USGENE provides the most timely source of USPTO patent sequence data – within 7 days of publication
• DGENE and REGISTRY provide additional value-added indexing for U.S. patents and published applications
• DGENE, REGISTRY and USGENE are all required for a comprehensive search of USPTO sequence data

127 Visit www.fiz-k.com/usgene for the latest USGENE reference materials

128 The USPTO Genetic Sequence Database, USGENE®, on STN www.fiz-k.com/usgene