INIS DB on Internet Open Access Pilot

International Atomic Energy Agency
International Nuclear Information System (INIS)
CAI, Thesaurus, Subject Categories and
Metadata Extraction Tool (MET)
13th Joint INIS/ETDE Technical Committee Meeting
20-22 October 2011, Vienna, Austria
Neviana Rashkova
INIS Subject Specialist
IAEA
CONTENT
• COMPUTER ASSISTED INDEXING – CAI
• INIS/ETDE THESAURUS
• SUBJECT CATEGORIES
• INIS INPUT QUALITY CONTROL - UPDATE
in co-operation with L. Iliev, Computer Support Group
IAEA
13th INIS/ETDE Joint Technical Committee Meeting, 20-21 October 2011
COMPUTER ASSISTED INDEXING – CAI
• Assists the indexer in choosing subject category and descriptors, based on text analysis of the abstract and title
• Offers an opportunity for off-line work – batch indexing
• Incorporates the latest version of the INIS Thesaurus
• Uses “hidden terms” pointing to a valid Thesaurus term
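The “hidden terms” idea can be pictured as a lookup from phrases found in the text to valid descriptors. The following is a minimal Python sketch under stated assumptions: the table `HIDDEN_TERMS`, its two entries and the function `suggest_descriptors` are hypothetical illustrations and are not taken from the CAI software.

```python
# Hypothetical hidden-term table: phrase found in text -> valid Thesaurus descriptor.
# The entries are illustrative only; the real CAI table is not shown in the slides.
HIDDEN_TERMS = {
    "bruce-1": "BRUCE-1 REACTOR",
    "bubble chamber": "BUBBLE CHAMBERS",
}


def suggest_descriptors(title: str, abstract: str) -> list[str]:
    """Return valid descriptors whose hidden terms occur in the title or abstract."""
    text = f"{title} {abstract}".lower()
    found = {
        descriptor
        for phrase, descriptor in HIDDEN_TERMS.items()
        if phrase in text  # plain substring match, the simplest possible analysis
    }
    return sorted(found)
```

A real system would need proper tokenization and many more entries; the sketch only shows how a non-valid phrase can point the indexer to a valid term.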
Currently we have:
• 28 accounts created for Member states
• 19 countries with access to CAI
• 6 accounts created for external users
• This year - 53 658 documents indexed - 55% of the input from:
Springer, ELSEVIER, ANS, IOPP, IAEA, MemSt, AIP
INIS/ETDE THESAURUS
• Thesaurus is “a controlled and dynamic vocabulary of
semantically and generically related terms which covers a specific
domain of knowledge” (part of UNESCO definition)
• Types of relations for terms:
• BT – broader term (levels 1, 2 … 10); NT – narrower term (levels 1, 2 … 10); RT – related term; UF(+) – used for; SF – seen for
• Contains:
• 21882 valid terms
• 8677 forbidden terms
• 30559 total
INIS/ETDE THESAURUS
• Maintaining the INIS/ETDE Thesaurus
• Regularly updated simultaneously at INIS and ETDE
• New terms proposed by Member States
• Terms revised if needed
• Discussion Group of experts – for new proposals and updates
• Translations
• Original - in English
• Other languages: German, French, Arabic, Russian, Chinese
• INIS Liaison Officers of the respective countries provide translations, with yearly updates for new terms
USES OF INIS/ETDE THESAURUS
• For indexing
• WinFibre
• CAI – hidden terms
• Independent use
• For retrieval
• Incorporated in INIS search
• For independent advanced search
• For establishing a search strategy
• As a dictionary
USES OF INIS/ETDE THESAURUS
• Other potential applications
• Retrieval – for navigation search together with subject classification
• Automation in text analysis – provides multiple level taxonomy
• Learning tool – gives immediate structured information about the terms and
their relations
BRUCE-1 REACTOR
Tiverton, Ontario, Canada.
*BT1 candu type reactors
*BT1 natural uranium reactors
*BT1 phwr type reactors
RT bruce site
BUBBLE CHAMBERS
*BT1 gas track detectors
NT1 cryogenic bubble chambers
NT1 heavy liquid bubble chambers
NT1 ultrasonic bubble chambers
RT digitizers
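The two entries above can be held in a small data structure that makes the BT/NT/RT relations available to programs. This is a minimal Python sketch, not the INIS implementation: the class `ThesaurusEntry`, the dictionary `ENTRIES` and the function `expand_query` are hypothetical names, and only the first-level relations shown on the slide are included.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    """One descriptor with its broader (BT), narrower (NT) and related (RT) terms."""
    term: str
    broader: list[str] = field(default_factory=list)
    narrower: list[str] = field(default_factory=list)
    related: list[str] = field(default_factory=list)


# The two entries shown on the slide (first-level relations only).
ENTRIES = {
    "BRUCE-1 REACTOR": ThesaurusEntry(
        term="BRUCE-1 REACTOR",
        broader=["candu type reactors", "natural uranium reactors", "phwr type reactors"],
        related=["bruce site"],
    ),
    "BUBBLE CHAMBERS": ThesaurusEntry(
        term="BUBBLE CHAMBERS",
        broader=["gas track detectors"],
        narrower=[
            "cryogenic bubble chambers",
            "heavy liquid bubble chambers",
            "ultrasonic bubble chambers",
        ],
        related=["digitizers"],
    ),
}


def expand_query(term: str) -> list[str]:
    """Return the term plus its narrower terms, one way to widen a search."""
    entry = ENTRIES[term]
    return [entry.term.lower(), *entry.narrower]
```

Expanding a query by narrower terms is one example of the "navigation search" use mentioned above; following `broader` instead would generalize the search.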
INIS/ETDE SUBJECT CATEGORIES
• INIS/ETDE subject categories update
• Review the existing subject categories to include newer concepts and/or
areas of research and development
• Make the "ETDE only" categories available for INIS
• Consider the introduction of new categories
• Four new Subject categories
• S77 NANOSCIENCE AND NANOTECHNOLOGY
• S79 ASTROPHYSICS, COSMOLOGY AND ASTRONOMY
• S96 KNOWLEDGE MANAGEMENT AND PRESERVATION
• S97 MATHEMATICAL METHODS AND COMPUTING
INIS/ETDE SUBJECT CATEGORIES
• ETDE/INIS Joint Reference Series No. 2 (Rev. 1)
INIS Scope Descriptions
• The current categorization scheme contains 49 subject categories, both for INIS and ETDE. The categories have three-character alphanumeric codes
• The document defines the subject categories and provides the scope descriptions
• Subject Index is included as an aid to subject classifiers
• Cross references to other categories are provided where appropriate
• The tool is provided to Member States to assist in subject indexing
INIS INPUT QUALITY CONTROL UPDATE
INTRODUCTION
The general goal of the procedure is to improve the quality of input
• Identifies documents with errors in input and extracts them for
manual check by a specialist
• Knowledge Base created using a large number of expert decisions
made by human indexers: intellectual choices for usage of a
specific subject category/descriptor (SC/D) combination
• Implemented in a computer program, currently in use
• Uses documents from the immediately preceding time period
• At the time of implementation – 75% of identified records were
proved to be real errors
CURRENT PROCEDURE
• Based on old statistics
• period 1980-1984
• 26 000 documents used
• Subject categories changed several times
• new categories added
• artificially adjusted values to replace the real statistics
• Thesaurus updated many times
• new descriptors
• new concepts
CURRENT PROCEDURE
THE RESULTS FROM THE QA PROCEDURE DO NOT
REFLECT THE REAL SITUATION
• Too many false warnings (~ 50% of all documents)
• More bad records allowed in production
• No longer relevant: no consistent approach for all
category/descriptor pairs
THE OLD QA PROCEDURE NEEDS REVISION
UPDATED PROCEDURE
• Based on real statistics using the whole INIS database
• Takes into account all subject categories
• Takes into account the accumulated experience with specific
erroneous usage of category/descriptor combinations
• Flexible towards changes of descriptor weights
UPDATED PROCEDURE IS EXPECTED TO IMPROVE
QUALITY AND SAVE TIME
PRELIMINARY ANALYSES
• Analysis of the documentation on procedure for category
match value (CMV) calculation
An Expert System for Quality Control in Bibliographic Databases*
Claudio Todeschini
International Nuclear Information System, International Atomic Energy Agency, Wagramerstrasse 5, A-1400 Vienna, Austria
Michael P. Farrell
Carbon Dioxide Information Center, Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee 37831, U.S.A.
*Based on work performed at Oak Ridge National Laboratory, operated for the U.S. Department of Energy under Contract No. DE-AC05-84OR21400 with Martin Marietta Energy Systems, Inc. Work was partially supported by the Carbon Dioxide Research Division, U.S. Department of Energy.
• Analysis of the program for quality control and testing the
formula
• Criteria for category/descriptor combination
WORK DONE
• Conversion of all existing categories to the currently used set of
categories
• Calculation of frequencies – table category/descriptor
• Comparison between two statistics new/all SC
• Decision about which period to use for the statistics
• Adjustment to avoid expected errors
• Identification of known combinations giving nearly 100% errors
• Creating a table for “bad” combinations, assigned a different weight (to
reach very low CMV)
• Possibility to manually change weights
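The frequency table and the table of “bad” combinations can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program used at INIS: the record layout (a category code plus a list of descriptors), the function names `pair_frequencies` and `apply_overrides`, and the default override weight of 0 are all hypothetical, and the slides do not give the CMV formula that would consume these weights.

```python
from collections import Counter


def pair_frequencies(records):
    """Count how often each (subject category, descriptor) pair occurs.

    `records` is an iterable of (category, descriptors) tuples; this record
    layout is an assumption made for the sketch.
    """
    counts = Counter()
    for category, descriptors in records:
        for descriptor in set(descriptors):  # count each pair once per record
            counts[(category, descriptor)] += 1
    return counts


def apply_overrides(counts, bad_pairs, bad_weight=0):
    """Return a weight table in which known "bad" pairs get a fixed low weight."""
    weights = dict(counts)
    for pair in bad_pairs:
        weights[pair] = bad_weight  # manual adjustment replaces the statistic
    return weights
```

Keeping the overrides in a separate step leaves the raw statistics untouched, so weights can be changed manually without rescanning the database.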
FINE TUNING
EXPECTED ERRORS – examples:
Materials Science
GROWTH - CRYSTAL GROWTH
Plasma physics
IGNITION – THERMONUCLEAR IGNITION
Physics of Elementary Particles and Fields
PRODUCTION – PARTICLE PRODUCTION
COLOR, FLAVOR, HOLOGRAPHY, TRANSPORT,
CAVITIES,…etc.
17 descriptors in 18 subject categories have been adjusted
TOOLS DEVELOPED
Tools were developed to perform the steps:
• Scanning the records from the Reference DB to make full
statistics for the subject category-descriptor pairs
• Report to show the differences between the current table and the one to replace it
• A table for manual “tuning” of some pairs
• Unfinished report to show the effect of changing the table on raw
(unprocessed) and processed records
COMPARISON WITH IRPS (processed records)
COMPARISON WITH IRPS (unprocessed records)
THRESHOLD DETERMINATION
S12 Management of radioactive wastes...
S21 Specific nuclear reactors and associated plants
S36 Materials science
[Chart: number of documents versus CMV for series S12, S21 and S36; CMV axis from -1 to 5, document axis from 0 to 180]
THRESHOLD DETERMINATION
Category S21
[Chart: number of documents versus CMV for series S21, S12 and S36; CMV axis from -1 to 5, document axis from 0 to 140]
THRESHOLD DETERMINATION
Category S36
[Chart: number of documents versus CMV for series S36, S43 and S12; CMV axis from -1 to 5, document axis from 0 to 250]
THRESHOLD DETERMINATION
Category S43
[Chart: number of documents versus CMV for series S43, S36 and S12; CMV axis from -1 to 5, document axis from 0 to 250]
DISCUSSION
• First analyses suggest a natural threshold value: CMV ∈ (1,2)
• Analysis of the number of documents to be scanned for different threshold CMV values is necessary
• Tests are necessary to assess the errors when the threshold value is chosen in the different intervals
• Further testing over different sets of records is required before implementation
• Possibility for integration in WinFibre
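The analysis of how many documents would have to be scanned at each candidate threshold can be sketched as a simple count. This is a minimal Python illustration, assuming that a document is sent for manual checking when its CMV is strictly below the threshold; the function name `flagged_counts` is hypothetical, and any input values would have to come from real CMV calculations.

```python
def flagged_counts(cmv_values, thresholds):
    """For each threshold, count documents whose CMV falls below it.

    Assumes, for this sketch, that a document is sent for manual checking
    when its CMV is strictly below the threshold.
    """
    return {t: sum(1 for v in cmv_values if v < t) for t in thresholds}
```

Running this over several thresholds inside the interval (1,2), on different sets of records, would show how the manual workload grows as the threshold is raised.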
Thank you!