International Atomic Energy Agency
International Nuclear Information System (INIS)

CAI, Thesaurus, Subject Categories and Metadata Extraction Tool (MET)
13th Joint INIS/ETDE Technical Committee Meeting
20-22 October 2011, Vienna, Austria

Neviana Rashkova
INIS Subject Specialist

CONTENT
• COMPUTER ASSISTED INDEXING - CAI
• INIS/ETDE THESAURUS
• SUBJECT CATEGORIES
• INIS INPUT QUALITY CONTROL - UPDATE (in co-operation with L. Iliev, Computer Support Group)

IAEA 13th INIS/ETDE Joint Technical Committee Meeting, 20-21 October 2011

COMPUTER ASSISTED INDEXING - CAI
• Assists the indexer in choosing the subject category and descriptors, based on text analysis of the abstract and title
• Offers an opportunity for off-line work (batch indexing)
• Incorporates the latest version of the INIS Thesaurus
• Uses “hidden terms” pointing to a valid Thesaurus term
Currently we have:
• 28 accounts created for Member States
• 19 countries with access to CAI
• 6 accounts created for external users
• This year: 53 658 documents indexed (55% of the input) from: Springer, ELSEVIER, ANS, IOPP, IAEA, MemSt, AIP

INIS/ETDE THESAURUS
• A thesaurus is “a controlled and dynamic vocabulary of semantically and generically related terms which covers a specific domain of knowledge” (part of the UNESCO definition)
• Types of relations for terms:
  • BT (levels 1, 2 … 10); NT (levels 1, 2 … 10); RT: related term; UF(+): used for; SF: seen for
• Contains:
  • 21882 valid terms
  • 8677 forbidden terms
  • 30559 total

INIS/ETDE THESAURUS
• Maintaining the INIS/ETDE Thesaurus
  • Regularly updated simultaneously at INIS and ETDE
  • New terms proposed by Member States
  • Terms revised if needed
  • Discussion Group of experts for new proposals and updates
• Translations
  • Original in English
  • Other languages: German, French, Arabic, Russian, Chinese
  • INIS Liaison Officers of the respective countries provide translations, with yearly updates for the new terms

USES OF INIS/ETDE THESAURUS
• For indexing
  • WinFibre
  • CAI (hidden terms)
  • Independent use
• For retrieval
  • Incorporated in INIS search
  • For independent advanced search
  • For establishing a search strategy
  • As a dictionary

USES OF INIS/ETDE THESAURUS
• Other potential applications
  • Retrieval: navigation search together with subject classification
  • Automation in text analysis: provides a multiple-level taxonomy
  • Learning tool: gives immediate structured information about the terms and their relations

Sample entries:

BRUCE-1 REACTOR
  Tiverton, Ontario, Canada.
  *BT1 candu type reactors
  *BT1 natural uranium reactors
  *BT1 phwr type reactors
  RT bruce site

BUBBLE CHAMBERS
  *BT1 gas track detectors
  NT1 cryogenic bubble chambers
  NT1 heavy liquid bubble chambers
  NT1 ultrasonic bubble chambers
  RT digitizers

INIS/ETDE SUBJECT CATEGORIES
• INIS/ETDE subject categories update
  • Review the existing subject categories to include newer concepts and/or areas of research and development
  • Make the "ETDE only" categories available for INIS
  • Consider the introduction of new categories
• Four new subject categories
  • S77 NANOSCIENCE AND NANOTECHNOLOGY
  • S79 ASTROPHYSICS, COSMOLOGY AND ASTRONOMY
  • S96 KNOWLEDGE MANAGEMENT AND PRESERVATION
  • S97 MATHEMATICAL METHODS AND COMPUTING

INIS/ETDE SUBJECT CATEGORIES
• ETDE/INIS Joint Reference Series No. 2 (Rev. 1), INIS Scope Descriptions
• The current categorization scheme contains 49 subject categories, for both INIS and ETDE.
• The categories have three-character alphanumeric codes
• The document defines the subject categories and provides the scope descriptions
• A Subject Index is included as an aid to subject classifiers
• Cross references to other categories are provided where appropriate
• The tool is provided to Member States to assist in subject indexing

INIS INPUT QUALITY CONTROL UPDATE

INTRODUCTION
The general goal of the procedure is to improve the quality of input
• Identifies documents with errors in input and extracts them for manual check by a specialist
• Knowledge Base created using a large number of expert decisions made by human indexers: intellectual choices for usage of a specific subject category/descriptor (SC/D) combination
• Implemented in a computer program, currently in use
• Uses documents from the immediately preceding time period
• At the time of implementation, 75% of identified records proved to be real errors

CURRENT PROCEDURE
• Based on old statistics
  • period 1980-1984
  • 26 000 documents used
• Subject categories changed several times
  • new categories added
  • artificially adjusted values to replace the real statistics
• Thesaurus updated many times
  • new descriptors
  • new concepts

CURRENT PROCEDURE
THE RESULTS FROM THE QA PROCEDURE DO NOT REFLECT THE REAL SITUATION
• Too many false warnings (~ 50% of all documents)
• More bad records allowed in production
• No longer relevant: no consistent approach for all category/descriptor pairs
THE OLD QA PROCEDURE NEEDS REVISION

UPDATED PROCEDURE
• Based on real statistics using the whole INIS database
• Takes into account all subject categories
• Takes into account the accumulated experience about specific erroneous usage of category/descriptor combinations
• Flexible towards changes of descriptor weights
UPDATED PROCEDURE IS EXPECTED TO IMPROVE QUALITY AND SAVE TIME

PRELIMINARY ANALYSES
• Analysis of the documentation on the procedure for category match value (CMV) calculation:
  An Expert System for Quality Control in Bibliographic Databases*
  Claudio Todeschini, International Nuclear Information System, International Atomic Energy Agency, Wagramerstrasse 5, A-1400 Vienna, Austria
  Michael P. Farrell, Carbon Dioxide Information Center, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, U.S.A.
  *Based on work performed at Oak Ridge National Laboratory, operated for the U.S. Department of Energy under Contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. Work was partially supported by the Carbon Dioxide Research Division, U.S. Department of Energy.
• Analysis of the program for quality control and testing the formula
• Criteria for category/descriptor combination

WORK DONE
• Conversion of all existing categories to the currently used set of categories
• Calculation of frequencies: category/descriptor table
• Comparison between two statistics (new/all SC)
• Decision about which period to use for the statistics
• Adjustment to avoid expected errors
• Identification of known combinations giving nearly 100% errors
• Creation of a table of “bad” combinations, assigned a different weight (to reach a very low CMV)
• Possibility to change weights manually

FINE TUNING
EXPECTED ERRORS - examples:
• Materials Science: GROWTH - CRYSTAL GROWTH
• Plasma Physics: IGNITION - THERMONUCLEAR IGNITION
• Physics of Elementary Particles and Fields: PRODUCTION - PARTICLE PRODUCTION
• COLOR, FLAVOR, HOLOGRAPHY, TRANSPORT, CAVITIES, etc.
17 descriptors in 18 subject categories have been adjusted

TOOLS DEVELOPED
Tools were developed to perform the steps:
• Scanning the records from the Reference DB to compile full statistics for the subject category/descriptor pairs
• Report showing the difference between the current table and the one to replace it
• A table for manual “tuning” of some pairs
• Unfinished report showing the effect of changing the table on raw (unprocessed) and processed records

COMPARISON WITH IRPS (processed records)
[Figure]

COMPARISON WITH IRPS (unprocessed records)
[Figure]

THRESHOLD DETERMINATION
[Figure: number of documents versus CMV (axis from -1 to 5) for S12 Management of radioactive wastes..., S21 Specific nuclear reactors and associated plants, and S36 Materials science]

THRESHOLD DETERMINATION - Category S21
[Figure: number of documents versus CMV (axis from -1 to 5); series S21, S12, S36]

THRESHOLD DETERMINATION - Category S36
[Figure: number of documents versus CMV (axis from -1 to 5); series S36, S43, S12]

THRESHOLD DETERMINATION - Category S43
[Figure: number of documents versus CMV (axis from -1 to 5); series S43, S36, S12]

DISCUSSION
• First analyses suggest a natural threshold value: CMV ∈ (1,2)
• Analysis of the number of documents to be scanned for different threshold CMVs is necessary
• Tests are necessary to assess the errors when the threshold value is chosen in different intervals
• Further testing over different sets of records is required before implementation
• Possibility of integration in WinFibre

Thank you!
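
ILLUSTRATIVE SKETCH (not part of the presented procedure)
The slides do not give the CMV formula itself, so the Python sketch below only illustrates the general shape of the screening described under UPDATED PROCEDURE and WORK DONE: a category/descriptor frequency table built from reference records, a manually tuned table of “bad” combinations, and a threshold below which a record is extracted for manual checking. The scoring rule (sum of relative pair frequencies), the override weight of -5.0, the threshold of 1.0, the sample records and all function names are assumptions made for illustration; they are not the procedure used at INIS.

```python
from collections import Counter


def build_pair_table(records):
    """Count, per subject category, how many records carry each descriptor."""
    pair_counts = Counter()
    category_counts = Counter()
    for category, descriptors in records:
        category_counts[category] += 1
        for descriptor in set(descriptors):
            pair_counts[(category, descriptor)] += 1
    return pair_counts, category_counts


def category_match_value(category, descriptors, pair_counts, category_counts, overrides):
    """Hypothetical CMV: sum of per-pair weights (a stand-in, not the INIS formula)."""
    total_in_category = category_counts.get(category, 0)
    cmv = 0.0
    for descriptor in sorted(set(descriptors)):
        pair = (category, descriptor)
        if pair in overrides:
            # Manually tuned weight for a known "bad" combination.
            cmv += overrides[pair]
        elif total_in_category:
            # Relative frequency of the descriptor within the category.
            cmv += pair_counts.get(pair, 0) / total_in_category
    return cmv


def flag_for_review(records, threshold, pair_counts, category_counts, overrides):
    """Return the records whose CMV falls below the threshold."""
    return [
        (category, descriptors)
        for category, descriptors in records
        if category_match_value(
            category, descriptors, pair_counts, category_counts, overrides
        ) < threshold
    ]


# Invented reference records: (subject category, descriptors).
reference = [
    ("S36", ["CRYSTAL GROWTH", "ALLOYS"]),
    ("S36", ["CRYSTAL GROWTH"]),
    ("S21", ["CANDU TYPE REACTORS"]),
    ("S21", ["CANDU TYPE REACTORS", "PHWR TYPE REACTORS"]),
]
pair_counts, category_counts = build_pair_table(reference)

# Hypothetical "bad" combination with a strongly negative weight.
BAD_COMBINATIONS = {("S36", "GROWTH"): -5.0}

incoming = [
    ("S36", ["CRYSTAL GROWTH", "ALLOYS"]),
    ("S21", ["CRYSTAL GROWTH"]),
    ("S36", ["GROWTH", "CRYSTAL GROWTH"]),
]
flagged = flag_for_review(incoming, 1.0, pair_counts, category_counts, BAD_COMBINATIONS)
for category, descriptors in flagged:
    print(category, descriptors)
```

In this toy data the first incoming record passes, while the other two are flagged: one because its descriptor never occurs in that category, the other because of the override. A real threshold would have to come from the distribution analysis shown in the THRESHOLD DETERMINATION slides.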