Patent Classifications as ‘Knowledge’
…towards a more conscious (auto)categorization of patents
Arcanum Development, 2013

A usual hierarchic categorization task…
Given a hierarchic taxonomy (classification system)
Provide, for a document, a list of taxonomy nodes (classification symbols) that best match its subject matter
“Best” is based on…
– For experts: understanding the subject matter and considering the classification rules
– For computers: providing the hierarchy and training with sample categorized documents …and the rules?!
However, the patent classification task is somewhat more complicated…

Typical methods of categorization
[Figure labels: Roots (sections); Non-classifying (‘preclassification’) levels; … during training…]
– flat
– the best is the winner on each level
– greedy hierarchic: traversing down on the best only

Common features of patent classification schemes
Hierarchic
– the subject matter covered by a higher level contains the subject matter of a lower level
– but: a document may be assigned to a higher level when none of the lower levels fit
Some nodes (symbols) cannot be used (in general or alone) for classification
– hierarchy levels
– indexing schemes
Schemes contain specific rules (relations) between symbols
– Place / priority / precedence / limiting rules
– Indexing rules
– References to symbols to be taken into consideration
Rules given in the scheme are extended by definitions / manuals of classification
Schemes can be multilingual
Used by various offices and cultures (maybe slightly differently)

Using relations in patent categorization
[Figure labels: Last place rule; Takes precedence; Hierarchic, rules, references]

Why is a (more) formal analysis and presentation advantageous?
Currently, the rules are presented as text, in various master files, in a machine-readable but not machine-interpretable way (it is ‘content’ but not yet ‘knowledge’)
Lots of complex rules are spread over multiple sources (e.g. definitions) and places (e.g.
reverse references)
Both for humans and for computer programs, it is troublesome to collect and apply all the rules systematically
It is therefore worth converting IPC content into more explicit IPC ‘knowledge’…

Hypotheses
Tests were made
– to examine the confusions of patent examiners and of various autocategorizers
– …and whether they correlate with the relations given in IPC
Assumption
– more references in IPC: higher overlap between subject matter areas
Hypotheses
– the more references in IPC between two areas, the higher the confusion of humans and computers
– the knowledge coded in IPC is, indeed, used by patent categorizers

Testing the hypotheses
If patent examiners take references in IPC seriously
– the more references between two symbols, the higher the number of co-classifications
Practice can differ between two offices
– the more references in IPC, the higher the likelihood of a different decision
Confusion of autocategorizers
– more failures if subject matter areas overlap

Cocategorization vs. IPC references
[Figure labels: A47, A61; B65; B60-B65; A61 vs C07 and C12; C07-C12; F16; G-H]
When there are references in IPC, patent examiners take them seriously
– References mark overlapping subject matter areas and/or
– References propose the use of secondary (indexing) symbols
On class level, the frequency of references in IPC is similar to the frequency with which symbols of both classes are used together in patent documents

Differences in examiners’ practice
[Figure labels: A47, A61; B65; B60-B65; A61 vs C07 and C12; C07-C12; F16; G-H]
When there are references in IPC, patent examiners may assign them differently
– References mark overlapping subject matter areas
On class level, the frequency of references in IPC is similar to the differences between the selected first symbols (pre-reform practice, to simulate preclassification)

Confusion of autocategorizers
[Figure labels: A47, A61; B65]
When there are references in IPC, autocategorizers fail more frequently
– the “first symbol” may be selected differently
On class level, the frequency of references in IPC is similar to the differences
between the selected first symbols of an autocategorizer (2002 data, to simulate preclassification)
[Figure labels: B60-B65; A61 vs C07 and C12; C07-C12; F16; G-H]

Conclusion
1. Reference statistics in IPC
2. Co-classification
3. Human classification differences
4. Preclassification autocategorization errors
show similar characteristics on higher levels of IPC
This may be even more important on lower levels, where the rules are more complex
Therefore, easier access to the rules may be welcomed by both human and machine categorizers

Presentation of IPCInfo
An analysis and data preparation were performed as in-house research
– defining relevant relation types (about 15 main relations and ~20 further ones) (excerpts below)
– parsing the IPC scheme, definitions, catchwords and RCL
– building a relation graph in an RDBMS (>1.5 million relations)
The result is presented on a user interface
Convertible to RDF or OWL for further use

Patent taxonomy relations, samples
reference: (transitive!)
  A01B 1/00 Hand tools (edge trimmers for lawns A01G 3/06)
  A01G 3/06 Hand-held edge trimmers or shears for lawns (mowers combined with lawn edgers A01D 43/16)
precedence: (over 600 transitive cases, e.g. A61M 3/00 A61M 5/00 A61M 36/00 [in definitions!])
  A01B 3/24 Tractor-drawn ploughs (A01B 3/04 takes precedence)
  A01B 3/04 Animal-drawn ploughs
limiting:
  A01N PRESERVATION OF BODIES…; BIOCIDES, e.g. AS DISINFECTANTS, AS PESTICIDES OR AS HERBICIDES; …
  in Definitions for A01N subclass: Fungicidal, bactericidal, insecticidal, disinfecting or antiseptic paper D21H

Patent taxonomy relations, samples
indexing: guidance heading before A61K 101/00
  Indexing scheme associated with group A61K 51/00, relating to the nature of the radioactive substance
placerule: note before A01N 25/00, even specifying an exception…
  In groups A01N 27/00-A01N 65/00, in the absence of an indication to the contrary, an active ingredient is classified in the last appropriate place.
priorities (standardseq): for main groups in IPC where no place rule is applied
cooccurrence: e.g. in catchwords; the text of IPC also mentions the reference
  CONDITIONING harvested crops A01D 43/10, A01D 82/00
  A01D 43/10 with means for crushing or bruising the mown crop
  A01D 82/00 Crop conditioners, i.e. machines for crushing or bruising stalks (mowers combined with means for crushing or bruising the mown crop A01D 43/10)

Presentation of IPCInfo / 2

Thank you…
And keep reading if interested…

Formalization
With mathematical notations
Targeted at an audience not familiar with IPC

The ‘patent’ (auto)categorization task
Regular multiclass hierarchic categorization task
– Given a hierarchic taxonomy (a patent classification) with categories
– Given a set of training documents, each associated with multiple categories
  …or… an expert knowing both the state of the art of the field and the taxonomy
– For a document, provide a list of potential categories (preferably with relevance)
– The categorization level may be fixed (preclassification) or full
But…

The ‘patent’ (auto)categorization task, but…
Really a regular multiclass hierarchic categorization task?
– Taxonomy: text and definitions (manuals or handbooks) and revisions, and therefore:
  known relations between categories (rules of classification, e.g.
last place rule, takes precedence)
  secondary categories, non-primary categories (indexing codes, ‘not used as first symbol’)
  some categories are excluded from ‘final’ categorization (top levels of the hierarchy) but required in preclassification (where secondary categories cannot be used)
– Documents contain
  metadata (priorities, inventor, applicant)
  various “fields” (title, abstract, description, claims)
  some fields are subject to independent categorization (claims), some fields may be used only globally (abstract, description)
– Changes: subject matter of a symbol, classification rules and procedures
  provided categories may require revision, since the taxonomy can be revised at regular intervals or immediately
  e.g. there is no longer a ‘main classification symbol’
  preclassification may help to reduce the scope but requires handling failures

Notations: Hierarchic taxonomy
Taxonomy: T
Category: C, supercategory: ⊗ ∉ C
Parent function: p: C → C ∪ ⊗, a function describing a non-directed tree graph
Ancestors: p⁺: C → C⁺ ∪ ⊗, transitive closure of p
Child function (subcategories): c: C → C*, c = p⁻¹
Descendants: c⁺: C → C*, transitive closure of c
Roots of taxonomy (‘sections’): C_⊗ ⊂ C, C_⊗ = {r ∈ C | p(r) = ⊗}

Notations: Patent taxonomy
Level of category: L, l: C → L (e.g. ‘subclass’)
Classifying category level: L_c ⊂ L
Classifying category: C_c ⊂ C, C_c = {c ∈ C : l(c) ∈ L_c}
Non-classifying category: C_c‾ ⊂ C, C_c‾ = C ∖ C_c
Category symbol: s: C ↔ $ ($ stands for string)
Category sort relation: c1 < c2 ⇔ s(c1) < s(c2); min and max are also applicable for C⁺
Category interval: [f,t] = {c ∈ C | f ≤ c ∧ c ≤ t}
Usually, descendants form a contiguous interval, i.e. ∀ a ∈ C : d ∈ [min(c⁺(a)), max(c⁺(a))] ⇔ d ∈ c⁺(a)

Notations: category relations
Relation types: R ⊂ (C → (℘(C) ∪ ⊗))
All relations in a taxonomy: T_R ⊂ C × C × R
All relations for a category: r_∀ : C → (R × C)*
Obvious relation types in hierarchies: {parent, child, ancestor, descendant} ⊂ R, defined as parent ≈ p, child ≈ c etc.
Further obvious relation: sibling (s), as child of parent (c)
  sibling(c) := {s ∈ c(p(c)) | s ≠ c}, ∀ c ∈ C
Interval and set relations: union of the single-category form, e.g. descendant({c1,[c2,c3]})
  result abbreviated as an interval or set: descendant(a) = [min(c⁺(a)), max(c⁺(a))]

Patent taxonomy relations on a single version
Invertible relations
– Simple reference: category ‘refers’ to another
– ‘Takes precedence’ reference
– limiting references, very similar to precedence
– Allowed indexing symbols on an interval
Precedence relations on siblings
– placerule: first place rule or last place rule
– priority: siblings prioritized by ‘standardized sequence’
cooccurrence of references (commutative)

Patent taxonomy relations, samples
reference: (may refer further!)
  A01B 1/00 Hand tools (edge trimmers for lawns A01G 3/06)
  A01G 3/06 Hand-held edge trimmers or shears for lawns (mowers combined with lawn edgers A01D 43/16)
precedence:
  A01B 3/24 Tractor-drawn ploughs (A01B 3/04 takes precedence)
  A01B 3/04 Animal-drawn ploughs
limiting:
  A01N PRESERVATION OF BODIES…; BIOCIDES, e.g. AS DISINFECTANTS, AS PESTICIDES OR AS HERBICIDES; …
  in Definitions for A01N subclass: Fungicidal, bactericidal, insecticidal, disinfecting or antiseptic paper D21H

Patent taxonomy relations, samples
indexing: guidance heading before A61K 101/00
  Indexing scheme associated with group A61K 51/00, relating to the nature of the radioactive substance
placerule: note before A01N 25/00, even specifying an exception…
  In groups A01N 27/00-A01N 65/00, in the absence of an indication to the contrary, an active ingredient is classified in the last appropriate place.
priorities (stand.seq.): main groups in IPC where no place rule is applied
cooccurrence: in catchwords; the text of IPC also mentions the reference
  CONDITIONING harvested crops A01D 43/10, A01D 82/00
  A01D 43/10 with means for crushing or bruising the mown crop
  A01D 82/00 Crop conditioners, i.e.
machines for crushing or bruising stalks (mowers combined with means for crushing or bruising the mown crop A01D 43/10)

Patent taxonomy relations, multiple versions
Patent taxonomies change over time
A former category (or a set of categories) may be
– transferred to a single new category or a set of new categories
or it is recognized that the subject matter is
– covered by a single existing category or a set of existing categories
In the newer version, all categories associated with a single former category or a set of former categories are in a concordance relation
The concordance relation may be computed by transitively traversing category changes over multiple versions

Patent taxonomy relations: concordance relation sample
2011:
  B24B 49/00 Measuring or gauging equipment for controlling the feed movement of the grinding tool or work; Arrangements of indicating or measuring equipment, e.g. for indicating the start of the grinding operation
2012: B24B 49/00 → B24B 37/005 - 37/015, B24B 49/00
  B24B 37/005 . Control means for lapping machines or devices
  B24B 37/013 . . Devices or means for detecting lapping completion
  B24B 37/015 . . Temperature control
  B24B 49/00 Measuring or gauging equipment for controlling the feed movement of the grinding tool or work; Arrangements of indicating or measuring equipment, e.g. for indicating the start of the grinding operation (B24B 33/06, B24B 37/005 takes precedence; if applicable to other machine tools, B23Q 15/00-B23Q 17/00 take precedence)

Effect of relations on categorization
A weighted directed graph can be built between categories
Whenever an ‘oracle’ (e.g. a flat categorizer, a fielded search etc.) proposes a category, related categories must be evaluated and verified, possibly in a given order and also considering weights
Training may also benefit from knowing in advance
– the order of evaluation, e.g. standardized sequences, priority rules
– relations: to enhance a good hit, suppress a false hit, or co-classify
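
Sketch: evaluation order from a relation graph
A minimal Python sketch of the idea on the last slide, under stated assumptions. The class RelationGraph, the function evaluation_order, the numeric relation weights and the oracle scores are all hypothetical and chosen only for illustration; they are not taken from IPCInfo. The sketch shows one possible way to expand the categories proposed by an oracle with directly related categories, checking ‘takes precedence’ targets before the proposed category itself. The symbols come from the relation samples above.

```python
# Hypothetical weights per relation type; the slides give no numeric values.
RELATION_WEIGHTS = {"precedence": 1.0, "reference": 0.5, "cooccurrence": 0.3}


class RelationGraph:
    """Weighted directed graph between category symbols (illustrative only)."""

    def __init__(self):
        # source symbol -> list of (relation type, target symbol, weight)
        self.edges = {}

    def add(self, source, relation, target):
        weight = RELATION_WEIGHTS[relation]
        self.edges.setdefault(source, []).append((relation, target, weight))

    def related(self, symbol):
        """Directly related categories, heaviest relation first."""
        return sorted(self.edges.get(symbol, []), key=lambda edge: -edge[2])


def evaluation_order(graph, proposals):
    """Return the symbols to verify, in order.

    `proposals` maps a symbol to an oracle score. Proposals are visited by
    descending score. For each one, targets that take precedence come first,
    then the proposal itself, then its other directly related categories.
    Only direct relations are followed; transitive ones are not expanded.
    """
    order = []
    seen = set()

    def push(symbol):
        if symbol not in seen:
            seen.add(symbol)
            order.append(symbol)

    for symbol in sorted(proposals, key=lambda s: (-proposals[s], s)):
        related = graph.related(symbol)
        for relation, target, _weight in related:
            if relation == "precedence":
                push(target)
        push(symbol)
        for relation, target, _weight in related:
            if relation != "precedence":
                push(target)
    return order


graph = RelationGraph()
graph.add("A01B 3/24", "precedence", "A01B 3/04")  # A01B 3/04 takes precedence
graph.add("A01B 1/00", "reference", "A01G 3/06")
graph.add("A01G 3/06", "reference", "A01D 43/16")

proposals = {"A01B 3/24": 0.8, "A01B 1/00": 0.6}  # made-up oracle scores
order = evaluation_order(graph, proposals)
print(order)
```

In this sketch A01B 3/04 is verified before A01B 3/24 because of the precedence relation, and A01D 43/16 is not reached because only direct relations are followed. Following references transitively, or using the weights to rescore candidates, would be natural extensions but are left out here.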