Automating & Evaluating Metadata Generation

Elizabeth D. Liddy
Center for Natural Language Processing
School of Information Studies
Syracuse University

Outline
• Semantic Web
• Metadata
• 3 Metadata R&D Projects

Semantic Web
• Links digital information in such a way as to make the information easily processable by computers globally
• Enables publishing data in a re-purposable form
• Built on a syntax which uses URIs and RDF to represent and exchange data on the web
  – Maps directly & unambiguously to a model
  – Generic parsers are available
• However, the requisite processing is still largely manual

Metadata
• Structured data about resources
• Supports a wide range of operations:
  – Management of information resources
  – Resource discovery
• Enables communication and cooperation among:
  – Software developers
  – Publishers
  – The recording & television industries
  – Digital libraries
  – Providers of geographical & satellite-based information
  – The peer-to-peer community

Metadata (cont’d)
• Value-added information which enables information objects to be:
  – Identified
  – Represented
  – Managed
  – Accessed
• Standards within industries enable interoperability between repositories & users
• However, still produced manually

Educational Metadata Schema Elements
Dublin Core Metadata Elements: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, Type
GEM Metadata Elements: Audience, Cataloging, Duration, Essential Resources, Pedagogy, Grade, Standards, Quality
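The schema above ties the two outline threads together: Dublin Core elements are commonly expressed in RDF, the Semantic Web syntax just described. As a minimal sketch, assuming the third-party rdflib package, here is one such record; the resource URI is a hypothetical placeholder, and the element values are taken from the sample metadata shown later in this deck:

```python
# A minimal sketch of expressing a Dublin Core record as RDF, the syntax the
# Semantic Web slides refer to. Assumes the rdflib package; the resource URI
# is hypothetical, and the values are illustrative.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

resource = URIRef("http://www.pbs.org/lessonplans/grand-canyon-flood")  # hypothetical URI

g = Graph()
g.bind("dc", DC)
g.add((resource, DC.title, Literal("Grand Canyon: Flood! - Stream Channel Erosion Activity")))
g.add((resource, DC.subject, Literal("Science--Geology")))
g.add((resource, DC.format, Literal("text/HTML")))
g.add((resource, DC.date, Literal("1998-09-02")))

# Serialize as RDF/XML, the re-purposable form a generic parser can consume.
print(g.serialize(format="xml"))
```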
Semantic Web = Metadata?
• But both…
  – Seek the same goals
  – Use standards & crosswalks between schemas
  – Look for comprehensive, well-understood, well-used sets of terms for describing the content of information resources
  – Enable mutual sharing, accessing, and reuse of information resources

NSDL Metadata Projects
• Breaking the MetaData Generation Bottleneck
  – CNLP
  – University of Washington
• StandardConnection
  – University of Washington
  – CNLP
• MetaTest
  – CNLP
  – Center for Human Computer Interaction, Cornell University

Breaking the MetaData Generation Bottleneck
• Goal: Demonstrate the feasibility of automatically generating high-quality metadata for digital libraries through Natural Language Processing
• Data: Full-text resources from clearinghouses that provide teaching resources to teachers, students, administrators, and parents
• Metadata Schema: Dublin Core + Gateway to Educational Materials (GEM) Schema

Method: Information Extraction
• Natural Language Processing
  – Technology that enables a system to accomplish human-like understanding of document contents
  – Extracts both explicit and implicit meaning
• Sublanguage Analysis
  – Utilizes domain- and genre-specific regularities rather than full-fledged linguistic analysis
• Discourse Model Development
  – Extractions specialized for the communication goals of the document type and the activities under discussion

Information Extraction
Types of features recognized & utilized:
• Non-linguistic
  – Length of document
  – HTML and XML tags
• Linguistic
  – Root forms of words
  – Part-of-speech tags
  – Phrases (Noun, Verb, Proper Noun, Numeric Concept)
  – Categories (Proper Name & Numeric Concept)
  – Concepts (sense-disambiguated words / phrases)
  – Semantic relations
  – Discourse-level components

Sample Lesson Plan: Stream Channel Erosion Activity
Student/Teacher Background: Rivers and streams form the channels in which they flow. A river channel is formed by the quantity of water and debris that is carried by the water in it. The water carves and maintains the conduit containing it. Thus, the channel is self-adjusting. If the volume of water or the amount of debris is changed, the channel adjusts to the new set of conditions. …
Student Objectives: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam. …

NLP Processing of Lesson Plan
Input: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.
Morphological Analysis: The student will discuss stream sedimentation that occurred in the Grand Canyon as a result of the controlled release from Glen Canyon Dam.
Lexical Analysis: The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN that|WDT occurred|VBD in|IN the|DT Grand|NP Canyon|NP as|IN a|DT result|NN of|IN the|DT controlled|JJ release|NN from|IN Glen|NP Canyon|NP Dam|NP .|.

NLP Processing of Lesson Plan (cont’d)
Syntactic Analysis - Phrase Identification: The|DT student|NN will|MD discuss|VB <CN> stream|NN sedimentation|NN </CN> that|WDT occurred|VBD in|IN the|DT <PN> Grand|NP Canyon|NP </PN> as|IN a|DT result|NN of|IN the|DT <CN> controlled|JJ release|NN </CN> from|IN <PN> Glen|NP Canyon|NP Dam|NP </PN> .|.
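The bracketed output above comes from CNLP's own pipeline. As an illustrative stand-in only, a minimal sketch of the same two stages (part-of-speech tagging, then phrase identification) using the off-the-shelf NLTK toolkit; the chunk grammar is a toy assumption, not CNLP's actual rules, and NLTK tags proper nouns NNP where the slides show NP:

```python
# Illustrative stand-in for the tagging and phrase-identification stages,
# using NLTK rather than CNLP's own tools. The chunk grammar is a toy
# assumption: one rule each for common-noun (CN) and proper-noun (PN) phrases.
import nltk

nltk.download("averaged_perceptron_tagger", quiet=True)  # newer NLTK: "averaged_perceptron_tagger_eng"

sentence = ("The student will discuss stream sedimentation that occurred in the "
            "Grand Canyon as a result of the controlled release from Glen Canyon Dam .")

tokens = sentence.split()          # simple whitespace tokenization for this example
tagged = nltk.pos_tag(tokens)      # Penn Treebank tags, e.g. ('student', 'NN')
print(" ".join(f"{w}|{t}" for w, t in tagged))

# Toy chunk grammar: proper-noun phrases (PN) and common-noun phrases (CN).
grammar = r"""
  PN: {<NNP>+}
  CN: {<JJ>*<NN>+}
"""
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)       # nltk.Tree with PN/CN subtrees bracketed
tree.pprint()
```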
NLP Processing of Lesson Plan (cont’d)
Semantic Analysis Phase 1 - Proper Name Interpretation: The|DT student|NN will|MD discuss|VB <CN> stream|NN sedimentation|NN </CN> that|WDT occurred|VBD in|IN the|DT <PN cat=geography/location> Grand|NP Canyon|NP </PN> as|IN a|DT result|NN of|IN the|DT <CN> controlled|JJ release|NN </CN> from|IN <PN cat=geography/structure> Glen|NP Canyon|NP Dam|NP </PN> .|.

NLP Processing of Lesson Plan (cont’d)
Semantic Analysis Phase 2 - Event & Role Extraction:
Teaching event: discuss
  actor: student
  topic: stream sedimentation
Event: stream sedimentation
  location: Grand Canyon
  cause: controlled release

MetaExtract
[Diagram: an HTML document flows through an HTML Converter, PreProcessor, and eQuery Extraction Module, with tf/idf weighting, a Cataloger, and a configurable Metadata Retrieval Module, into an Output Gathering Program that produces the HTML document with attached metadata. Elements generated include Date, Rights, Publisher, Format, Language, Resource Type, Title, Creator, Description, Grade/Level, Essential Resources, Duration, Pedagogy, Audience, Standard, Keywords, and Relation.]

Automatically Generated Metadata
Title: Grand Canyon: Flood! - Stream Channel Erosion Activity
Grade Levels: 6, 7, 8
GEM Subjects: Science--Geology; Mathematics--Geometry; Mathematics--Measurement
Keywords:
  Named Entities: Colorado River (river), Grand Canyon (geography / location), Glen Canyon Dam (geography / structures)
  Subject Keywords: channels, conduit, controlled_release, dam, flow_volume, hold, reservoir, rivers, sediment, streams
  Material Keywords: clayboard, cookie_sheet, cup, paper_towel, pencil, roasting_pan, sand, water

Automatically Generated Metadata (cont’d)
Pedagogy: Collaborative learning; Hands-on learning
Tool For: Teachers
Resource Type: Lesson Plan
Format: text/HTML
Placed Online: 1998-09-02
Name: PBS Online
Role: onlineProvider
Homepage: http://www.pbs.org
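The Named Entities and Subject Keywords in the record above are folded in from the bracketed pipeline output. A minimal sketch of that folding step: the tag format follows the slides, but the parsing code and field names are illustrative assumptions, not the MetaExtract implementation:

```python
# Illustrative sketch (not the actual MetaExtract code): pull categorized
# proper names and common-noun phrases out of the bracketed pipeline output
# shown above and fold them into draft metadata fields.
import re

tagged = ("The|DT student|NN will|MD discuss|VB <CN> stream|NN sedimentation|NN </CN> "
          "that|WDT occurred|VBD in|IN the|DT <PN cat=geography/location> Grand|NP Canyon|NP </PN> "
          "as|IN a|DT result|NN of|IN the|DT <CN> controlled|JJ release|NN </CN> "
          "from|IN <PN cat=geography/structure> Glen|NP Canyon|NP Dam|NP </PN> .|.")

def words(span: str) -> str:
    """Strip the word|TAG annotations, keeping just the words."""
    return " ".join(tok.split("|")[0] for tok in span.split())

record = {"Named Entities": [], "Subject Keywords": []}

for cat, span in re.findall(r"<PN cat=(\S+)>(.*?)</PN>", tagged):
    record["Named Entities"].append(f"{words(span)} ({cat})")

for span in re.findall(r"<CN>(.*?)</CN>", tagged):
    record["Subject Keywords"].append(words(span).replace(" ", "_"))

print(record)
# {'Named Entities': ['Grand Canyon (geography/location)',
#                     'Glen Canyon Dam (geography/structure)'],
#  'Subject Keywords': ['stream_sedimentation', 'controlled_release']}
```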
Metadata Evaluation Experiment
• Blind test of automatic vs. manually generated metadata
• Subjects:
  – Teachers
  – Education students
  – Professors of education
• Web-based experiment
  – Subjects provided with educational resources and metadata records
  – 2 conditions tested

Metadata Evaluation Experiment: Blind Test of Automatic vs. Manual Metadata
Expectation Condition – Subjects reviewed:
  1st – the metadata record
  2nd – the lesson plan
and then judged whether the metadata provided an accurate preview of the lesson plan on a 1-to-5 scale.
Satisfaction Condition – Subjects reviewed:
  1st – the lesson plan
  2nd – the metadata record
and then judged the accuracy and coverage of the metadata on a 1-to-5 scale, with 5 being high.

Qualitative Experimental Results
                                    Expec   Satis   Comb
# Manual Metadata Records             153     571    724
# Automatic Metadata Records          139     532    671
Manual Metadata Average Score        4.03    3.81   3.85
Automatic Metadata Average Score     3.76    3.55   3.59
Difference                           0.27    0.26   0.26

MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest

StandardConnection
• Goal: Determine the feasibility & quality of automatically mapping teaching standards to learning resources
  – Standard: “Solve linear equations and inequalities algebraically and non-linear equations using graphing, symbol-manipulating or spreadsheet technology.”
  – Resource: “Simultaneous Equations Using Elimination”
• Data:
  – Educational resources: lesson plans, activities, assessment units, etc.
  – Teaching standards: Achieve/McREL Compendix

Cross-Mapping through the Compendix Meta-Language
[Diagram: the Compendix entry (URI: M8.4.11ABCJ) sits at the hub, with mappings radiating out to individual state standards: Washington, California, New York, Florida, Arkansas, Alaska, Michigan, and Texas.]

StandardConnection Components
• Educational resources: lesson plans, activities, assessment units, etc.
• State standards
• Compendix, e.g. Mathematics 6.2.1 C: Adds, subtracts, multiplies, & divides whole numbers and decimals
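The hub-and-spoke mapping above amounts to a crosswalk table: many state standards resolve to one Compendix entry, so any two states' standards can be equated through the shared meta-language. A minimal sketch follows; the state standard identifiers are invented for illustration (only the Compendix URI M8.4.11ABCJ appears on the slides), and pairing that URI with the McREL 8.4.11 text shown later is an assumption on my part:

```python
# Illustrative sketch of cross-mapping state standards through the Compendix
# meta-language. The Compendix URI M8.4.11ABCJ is from the slides; the state
# standard identifiers are hypothetical placeholders.
STATE_TO_COMPENDIX = {
    ("Washington", "WA-MATH-9.2"): "M8.4.11ABCJ",   # hypothetical state ID
    ("Texas",      "TX-ALG-4.1"):  "M8.4.11ABCJ",   # hypothetical state ID
}

COMPENDIX = {
    # Pairing this URI with the 8.4.11 standard text is an assumption.
    "M8.4.11ABCJ": ("Uses a variety of methods (e.g., with graphs, algebraic "
                    "methods, and matrices) to solve systems of equations and "
                    "inequalities"),
}

def crosswalk(state: str, standard_id: str) -> str:
    """Resolve a state standard to its Compendix entry, enabling
    state-to-state equivalence through the shared meta-language."""
    uri = STATE_TO_COMPENDIX[(state, standard_id)]
    return f"{uri}: {COMPENDIX[uri]}"

print(crosswalk("Washington", "WA-MATH-9.2"))
```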
Lesson Plan: “Simultaneous Equations Using Elimination”
Submitted by: Leslie Howe
Email: teachhowe2@hotmail.com
School/University/Affiliation: Farragut High School, Knoxville, TN
Grade Level: 9, 10, 11, 12, Higher education, Vocational education, Adult/Continuing education
Subject(s): Mathematics / Algebra
Duration: 30 minutes
Description: The elimination method is an effective method for solving a system of two unknowns. This lesson provides students with immediate feedback using a computer program or online applet.
Goals: The student will be able to solve a system of two equations when there are two unknowns.
Materials: Online computer applet / program: http://www.usit.com/howe2/eqations/index.htm. A similar downloadable C++ application is available at the same site.
Procedure: A system of two unknowns can be solved by multiplying each equation by the constant that will make the coefficient of one of the variables become the LCM (least common multiple) of the initial coefficients. Students may use the scroll bars on the indicated applet to multiply the equations by constants until the GCF is located. When the "add" button is activated after the correct constants are chosen, one of the variables will be eliminated. The process can be repeated for the second variable. The student may enter the solution of the system by using the scroll bars. When the "check" button is pressed, the answer is evaluated and the student is given immediate feedback. (The same procedure can be done using the downloadable C++ application.) After 5-10 correct responses, the student should make the transition to paper and solve the equations without using the applet. The student can still use the applet to check the answer. The applet will generate problems in a random fashion. All solutions are integers.
Assessment: The lesson itself provides alternative assessment. The correct responses are recorded.

Lesson Plan with Assigned Standard
The same lesson plan, now with the automatically assigned standard added to its metadata:
Standard: McREL 8.4.11 - Uses a variety of methods (e.g., with graphs, algebraic methods, and matrices) to solve systems of equations and inequalities

Automatic Assigning of Standards as a Retrieval Process
DOCUMENT COLLECTION = Compendix Standards
• Standards are processed and indexed
• The index of standards is assembled from the subject heading, secondary subject, actual standard, and vocabulary
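A minimal sketch of building such an index, using scikit-learn's TfidfVectorizer as an off-the-shelf stand-in for the project's own indexer; the two standards texts are abbreviated, and concatenating subject heading, secondary subject, standard text, and vocabulary into one document follows the slide:

```python
# Illustrative sketch of indexing the standards collection, using
# scikit-learn as a stand-in for the project's own TF/IDF indexer.
# Each "document" concatenates the fields named on the slide: subject
# heading, secondary subject, the standard text, and its vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

standards = {
    "M8.4.11": "Mathematics Algebra Uses a variety of methods, e.g. graphs, "
               "algebraic methods, and matrices, to solve systems of equations "
               "and inequalities equation inequality matrix graph",
    "M6.2.1C": "Mathematics Arithmetic Adds, subtracts, multiplies, and divides "
               "whole numbers and decimals addition subtraction multiplication "
               "division decimal",
}

vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(standards.values())  # sparse TF/IDF matrix, one row per standard
print(index.shape)  # (number of standards, vocabulary size)
```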
Automatic Assigning of Standards as a Retrieval Process (cont’d)
QUERY = NLP-Processed Lesson Plan
• New lesson plan → Filtering: sections are eliminated or given greater weight (e.g., citations are removed), leaving the relevant parts of the lesson plan
• Natural Language Processing: includes part-of-speech tagging and bracketing of phrases & proper names, e.g. Simultaneous|JJ Equations|NNS Using|VBG Elimination|NN
• TF/IDF: relative frequency weights of words, phrases, proper names, etc.
• Query = top 30 terms: equation, eliminate, solve, …
• The weighted query is matched against the index of terms from the standards, and the best-matching standard is assigned to the lesson plan

Teaching Standard Assignment as Retrieval Task: Experiment
• Exploratory test run
  – 3,326 standards (documents)
  – TF/IDF term-weighting scheme
  – 2,239 lesson plans (queries): top 30 weighted terms from each as a query vector
• Manual evaluation
  – Focusing on understanding of issues & solutions

Information Retrieval Experiments
• Baseline results
  – 68 queries (lesson plans) evaluated
  – 24 queries (35%): the appropriate standard was ranked first
  – 28 queries (41%): the predominant standard was in the top 5
  – Room for improvement, but promising
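A toy, self-contained sketch of the retrieval-style assignment evaluated above: lesson plan as query, top weighted terms retained, cosine match against a TF/IDF index of standards. scikit-learn again stands in for the project's own machinery, and the two-standard "collection" is only an illustration:

```python
# Toy end-to-end sketch of standard assignment as retrieval: index two
# standards, turn a lesson plan into a query of its top-weighted terms,
# and rank standards by cosine similarity. TOP_N is 30 on the slides.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

standards = {
    "M8.4.11": "Uses a variety of methods (e.g., with graphs, algebraic methods, "
               "and matrices) to solve systems of equations and inequalities",
    "M6.2.1C": "Adds, subtracts, multiplies, and divides whole numbers and decimals",
}

lesson_plan = ("The elimination method is an effective method for solving a system "
               "of two unknowns. The student will be able to solve a system of two "
               "equations when there are two unknowns.")

TOP_N = 30
vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(standards.values())

# Build the query vector from the lesson plan's top-weighted terms only.
q = vectorizer.transform([lesson_plan]).toarray()[0]
keep = np.argsort(q)[::-1][:TOP_N]
query = np.zeros_like(q)
query[keep] = q[keep]

# Rank standards by cosine similarity to the query; the top hit is assigned.
scores = cosine_similarity(query.reshape(1, -1), index)[0]
for sid, score in sorted(zip(standards, scores), key=lambda p: -p[1]):
    print(f"{sid}\t{score:.3f}")
```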
Future Research
• Improve current retrieval performance
  – Matching algorithm, document expansion, etc.
• Apply a classification approach to the StandardConnection project
• Compare the information retrieval approach and the classification approach
• Improve browsing access for teachers & administrators

Browsing Access to Learning Resources
[Diagram: a browsable map of standards, organized by strand, with automatic assignment linking standards to lesson plans.]
• Standard 8.3.6: Solves simple inequalities and non-linear equations with rational number solutions, using concrete and informal methods
• Standard 8.4.11: Uses a variety of methods (e.g., with graphs, algebraic methods, and matrices) to solve systems of equations and inequalities
• Standard 8.4.12: Understands formal notation (e.g., sigma notation, factorial representation) and various applications (e.g., compound interest) of sequences and series
• Linked lesson plan with standards attached, e.g. Standard 8.4.11

MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest

MetaTest: Life-Cycle Evaluation of Metadata
1. Initial generation
   – Methods: manual, automatic
   – Costs: time, human resources, technology
2. Accessing DL resources
   – Users’ interactions: browsing, searching
   – Relative contribution of each metadata element
3. Search effectiveness
   – Precision
   – Recall

GOAL: Measure Quality & Usefulness of Metadata
[Diagram: metadata generation (methods: manual, semi-automatic, automatic; costs: time, human resources, technology) produces metadata whose evaluation spans user understanding and system performance in browsing and searching, measured by precision and recall.]

Evaluation Methodology
• Automatically metatag a digital library collection that has already been manually metatagged
• Solicit a range of appropriate digital library users
• For each metadata element:
  1. Users qualitatively evaluate it in light of the digital resource
  2. Conduct a standard IR experiment
  3. Observe subjects while searching & browsing
• Monitor with eye-tracking & think-aloud protocols

Information Retrieval Experiment
• Users ask queries of the system
• The system retrieves documents using either:
  – Manually assigned metadata
  – Automatically generated metadata
• The system ranks documents by its estimation of relevance
• Users review the retrieved documents & judge relevance
• Compute precision & recall
• Compare results according to:
  – Method of assignment
  – The metadata element which enabled retrieval

User Studies: Methods & Questions
1. Observations of users seeking DL resources
   – How do users search & browse the digital library?
   – Do search attempts utilize the available metadata?
   – Which metadata elements are most important to users?
   – Which are used consistently for the best results?

User Studies: Methods & Questions (cont’d)
2. Eye-tracking with think-aloud protocols
   – Which metadata elements do users spend the most time viewing?
   – What are users thinking about when seeking digital library resources?
   – Show the correlation between what users are looking at and what they are thinking.
   – Use eye-tracking to measure the number & duration of fixations, scan paths, dilation, etc.
3. Individual subject data
   – How does expertise / role influence seeking resources from digital libraries?

Sample Lesson Plans
[Image: sample lesson plans from the digital library.]

Eye Scan Path for Bug Club Document
[Image: eye scan path overlaid on the Bug Club lesson plan.]

Eye Scan Path for Sigmund Freud Document
[Image: eye scan path overlaid on the Sigmund Freud lesson plan.]

What, When, Where, and How Long
[Table: word fixated, fixation number, and fixation duration for each fixation.]

In Summary: Metadata Research Goals
1. Improve access via automatic metadata generation:
   • Provide richer, more complete, and more consistent metadata.
   • Increase the number of resources available electronically.
   • Increase the speed with which they are added.
2. Add appropriate teaching standards to each resource.
3. Provide empirical results on the quality, utility, and cost of automatic vs. manual metadata generation.
4. Show evidence as to which metadata elements are needed.
5. Inform HCI design with a better understanding of users’ behaviors when browsing and searching digital libraries.
6. Employ automatic metadata generation to help build the Semantic Web.
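The Information Retrieval Experiment above compares manual and automatic metadata by precision and recall. As a closing illustration, a minimal sketch of those two measures over a ranked result list; the document IDs and relevance judgments are invented for the example:

```python
# Illustrative sketch of the precision and recall measures used in the
# MetaTest IR experiment. Document IDs and relevance judgments are invented.
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# One query, two runs: retrieval via manual vs. automatic metadata.
relevant = {"lp042", "lp117", "lp203"}          # user relevance judgments
manual_run = ["lp042", "lp117", "lp555", "lp203", "lp610"]
automatic_run = ["lp042", "lp555", "lp117", "lp777", "lp888"]

for name, run in [("manual", manual_run), ("automatic", automatic_run)]:
    p, r = precision_recall(run, relevant)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```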