Automatic Metadata Generation & Evaluation
Elizabeth D. Liddy
Center for Natural Language Processing
School of Information Studies
Syracuse University
Outline
• Semantic Web
• Metadata
• 3 Metadata R & D Projects
Semantic Web
• Links digital information in such a way as to make the
information easily processable by computers globally
• Enables publishing data in a re-purposable form
• Built on syntax which uses URIs and RDF to represent and
exchange data on the web
– Maps directly & unambiguously to a model
– Generic parsers are available
• However, requisite processing is still largely manual
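
As a concrete illustration (not from the talk), publishing a resource description in re-purposable RDF with Dublin Core terms might look like the following Python sketch using rdflib; the resource URI and field values are hypothetical:

```python
# Minimal sketch: describing a lesson plan with Dublin Core terms in RDF.
# The resource URI and values are hypothetical, for illustration only.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DC

g = Graph()
resource = URIRef("http://example.org/lessons/stream-erosion")
g.add((resource, DC.title, Literal("Stream Channel Erosion Activity")))
g.add((resource, DC.creator, Literal("PBS Online")))
g.add((resource, DC.subject, Literal("Science--Geology")))
g.add((resource, DC.language, Literal("en")))

# Serialize to Turtle so any generic RDF parser can consume it.
print(g.serialize(format="turtle"))
```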
Metadata
• Structured data about resources
• Supports a wide range of operations:
– Management of information resources
– Resource discovery
• Enables communication and co-operation amongst:
– Software developers
– Publishers
– Recording & television industry
– Digital libraries
– Providers of geographical & satellite-based information
– Peer-to-peer community
Metadata (cont’d)
• Value-added information which enables information objects to be:
  – Identified
  – Represented
  – Managed
  – Accessed
• Standards within industries enable interoperability
between repositories & users
• However, metadata is still largely produced manually
Educational Metadata Schema Elements

Dublin Core Metadata Elements:
• Contributor
• Coverage
• Creator
• Date
• Description
• Format
• Identifier
• Language
• Publisher
• Relation
• Rights
• Source
• Subject
• Title
• Type

GEM Metadata Elements:
• Audience
• Cataloging
• Duration
• Essential Resources
• Pedagogy
• Grade
• Standards
• Quality
Semantic Web ≠ MetaData?
• But both:
– Seek same goals
– Use standards & crosswalks between schema
– Look for comprehensive, well-understood, well-used
sets of terms for describing content of information
resources
– Enable mutual sharing, accessing, and reuse of
information resources
NSDL MetaData Projects
• Breaking the MetaData Generation Bottleneck
  – CNLP
  – University of Washington
• StandardConnection
– University of Washington
– CNLP
• MetaTest
– CNLP
– Center for Human-Computer Interaction, Cornell University
Breaking the MetaData Generation Bottleneck
• Goal: Demonstrate feasibility of automatically generating high-quality metadata for digital libraries through Natural Language Processing
• Data: Full-text resources from clearinghouses which provide teaching resources to teachers, students, administrators and parents
• Metadata Schema: Dublin Core + Gateway to Educational Materials (GEM) Schema
Method: Information Extraction
• Natural Language Processing
– Technology which enables a system to accomplish
human-like understanding of document contents
– Extracts both explicit and implicit meaning
• Sublanguage Analysis
  – Utilizes domain- and genre-specific regularities rather than full-fledged linguistic analysis
• Discourse Model Development
– Extractions specialized for communication goals
of document type and activities under discussion
Information Extraction
Types of Features recognized & utilized:
• Non-linguistic
• Length of document
• HTML and XML tags
• Linguistic
• Root forms of words
• Part-of-speech tags
• Phrases (Noun, Verb, Proper Noun, Numeric Concept)
• Categories (Proper Name & Numeric Concept)
• Concepts (sense disambiguated words / phrases)
• Semantic Relations
• Discourse Level Components
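
The CNLP extractor itself is not shown in the talk; the following sketch merely approximates a few of the feature types listed above (document length, root forms, part-of-speech tags) with the NLTK toolkit. Resource names may vary by NLTK version:

```python
# Sketch of some non-linguistic and linguistic features listed above,
# approximated with NLTK (not the CNLP system itself).
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads; resource names may differ in newer NLTK releases.
for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
    nltk.download(pkg, quiet=True)

text = "The student will discuss stream sedimentation."

# Non-linguistic feature: document length in characters and tokens.
tokens = nltk.word_tokenize(text)
print("chars:", len(text), "tokens:", len(tokens))

# Linguistic features: part-of-speech tags and root forms (lemmas).
tagged = nltk.pos_tag(tokens)
lemmatizer = WordNetLemmatizer()
for word, tag in tagged:
    pos = "v" if tag.startswith("V") else "n"  # crude POS mapping for the sketch
    print(word, tag, lemmatizer.lemmatize(word.lower(), pos=pos))
```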
Sample Lesson Plan
Stream Channel Erosion Activity
Student/Teacher Background:
Rivers and streams form the channels in which they flow. A river channel
is formed by the quantity of water and debris that is carried by the water
in it. The water carves and maintains the conduit containing it. Thus, the
channel is self-adjusting. If the volume of water, or amount of debris is
changed, the channel adjusts to the new set of conditions.
…..
…..
Student Objectives:
The student will discuss stream sedimentation that occurred in the Grand
Canyon as a result of the controlled release from Glen Canyon Dam.
…
NLP Processing of Lesson Plan
Input:
The student will discuss stream sedimentation that occurred in the
Grand Canyon as a result of the controlled release from Glen Canyon
Dam.
Morphological Analysis:
The student will discuss stream sedimentation that occurred in the
Grand Canyon as a result of the controlled release from Glen Canyon
Dam.
Lexical Analysis:
The|DT student|NN will|MD discuss|VB stream|NN sedimentation|NN
that|WDT occurred|VBD in|IN the|DT Grand|NP Canyon|NP as|IN a|DT
result|NN of|IN the|DT controlled|JJ release|NN from|IN Glen|NP
Canyon|NP Dam|NP .|.
NLP Processing of Lesson Plan (cont’d)
Syntactic Analysis - Phrase Identification:
The|DT student|NN will|MD discuss|VB <CN> stream|NN sedimentation|NN
</CN> that|WDT occurred|VBD in|IN the|DT <PN> Grand|NP Canyon|NP
</PN> as|IN a|DT result|NN of|IN the|DT <CN> controlled|JJ release|NN
</CN> from|IN <PN> Glen|NP Canyon|NP Dam|NP </PN> .|.
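
CNLP's phrase identifier is internal to the system; a rough stand-in using NLTK's regular-expression chunker can produce the same <CN>/<PN> bracketing. The chunk grammar below is a guessed approximation, not the actual sublanguage rules:

```python
# Rough approximation of phrase identification with NLTK's regexp chunker.
# The grammar is a guess at common-noun (CN) and proper-noun (PN) phrase
# rules, not CNLP's actual sublanguage grammar.
import nltk

grammar = r"""
  PN: {<NNP>+}            # proper-noun phrase, e.g. Glen Canyon Dam
  CN: {<JJ>*<NN|NNS>+}    # common-noun phrase, e.g. controlled release
"""
chunker = nltk.RegexpParser(grammar)

tagged = [("the", "DT"), ("controlled", "JJ"), ("release", "NN"),
          ("from", "IN"), ("Glen", "NNP"), ("Canyon", "NNP"), ("Dam", "NNP")]

tree = chunker.parse(tagged)
# Render chunks in the slide's <CN>...</CN> / <PN>...</PN> notation.
parts = []
for node in tree:
    if isinstance(node, nltk.Tree):
        words = " ".join(word for word, _ in node.leaves())
        parts.append(f"<{node.label()}> {words} </{node.label()}>")
    else:
        parts.append(node[0])
print(" ".join(parts))
```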
Semantic Analysis Phase 1 - Proper Name Interpretation:
The|DT student|NN will|MD discuss|VB <CN> stream|NN sedimentation|NN
</CN> that|WDT occurred|VBD in|IN the|DT <PN cat=geography/location>
Grand|NP Canyon|NP </PN> as|IN a|DT result|NN of|IN the|DT <CN>
controlled|JJ release|NN </CN> from|IN <PN cat=geography/structure>
Glen|NP Canyon|NP Dam|NP </PN> .|.
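
A toy sketch of the proper-name interpretation step: in place of CNLP's classifier, a simple gazetteer lookup assigns a category to each proper name (entries invented for illustration):

```python
# Toy proper-name categorizer: a gazetteer lookup standing in for
# CNLP's proper-name interpretation. Entries are illustrative only.
GAZETTEER = {
    "Grand Canyon": "geography/location",
    "Glen Canyon Dam": "geography/structure",
    "Colorado River": "geography/river",
}

def categorize(proper_name: str) -> str:
    return GAZETTEER.get(proper_name, "unknown")

for name in ("Grand Canyon", "Glen Canyon Dam"):
    print(f'<PN cat={categorize(name)}> {name} </PN>')
```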
NLP Processing of Lesson Plan (cont’d)
Semantic Analysis Phase 2 - Event & Role Extraction

Teaching event: discuss
  actor: student
  topic: stream sedimentation
Event: stream sedimentation
  location: Grand Canyon
  cause: controlled release
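
The actual event-and-role extraction is driven by CNLP's discourse model; as a loose analogy only, a dependency parse recovers similar roles. A sketch with spaCy (assumes the en_core_web_sm model is installed; the dependency-to-role mapping is a simplification):

```python
# Loose analogy to event & role extraction via spaCy's dependency parse.
# The role mapping (nsubj->actor, dobj->topic, "in" pobj->location) is a
# simplification, not CNLP's discourse model.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model
doc = nlp("The student will discuss stream sedimentation "
          "that occurred in the Grand Canyon.")

for token in doc:
    if token.pos_ != "VERB":
        continue
    roles = {"event": token.lemma_}
    for child in token.children:
        if child.dep_ == "nsubj":
            roles["actor"] = child.text
        elif child.dep_ == "dobj":
            # keep compound modifiers, e.g. "stream sedimentation"
            mods = [t.text for t in child.lefts if t.dep_ == "compound"]
            roles["topic"] = " ".join(mods + [child.text])
        elif child.dep_ == "prep" and child.text.lower() == "in":
            for grandchild in child.children:
                if grandchild.dep_ == "pobj":
                    roles["location"] = " ".join(
                        t.text for t in grandchild.subtree)
    print(roles)
```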
MetaExtract System Architecture
[Architecture diagram: an HTML document is run through an HTML Converter and PreProcessor, then through the eQuery Extraction Module, Retrieval Module, and TF/IDF components, which generate the metadata elements (Title, Creator, Description, Publisher, Format, Language, Resource Type, Rights, Relation, Date, Catalog Date, Grade/Level, Duration, Essential Resources, Pedagogy, Audience, Standard, Keywords); a Cataloger supplies configuration, and an Output Gathering Program emits the HTML document with metadata attached.]
Automatically Generated Metadata

Title: Grand Canyon: Flood! - Stream Channel Erosion Activity
Grade Levels: 6, 7, 8
GEM Subjects: Science--Geology; Mathematics--Geometry; Mathematics--Measurement
Keywords:
  Named Entities: Colorado River (river), Grand Canyon (geography / location), Glen Canyon Dam (geography / structures)
  Subject Keywords: channels, conduit, controlled_release, dam, flow_volume, hold, reservoir, rivers, sediment, streams
  Material Keywords: clayboard, cookie_sheet, cup, paper_towel, pencil, roasting_pan, sand, water
Automatically Generated Metadata (cont’d)

Pedagogy: Collaborative learning; Hands on learning
Tool For: Teachers
Resource Type: Lesson Plan
Format: text/HTML
Placed Online: 1998-09-02
Name: PBS Online
Role: onlineProvider
Homepage: http://www.pbs.org
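
Records like the one above can be embedded directly in the resource itself; a minimal sketch (my illustration, not MetaExtract's output format) writing Dublin Core-style <meta> tags into an HTML head, with values taken from the record above:

```python
# Minimal sketch: embedding a generated metadata record as
# Dublin Core-style <meta> tags in an HTML <head>.
from html import escape

record = {
    "DC.Title": "Grand Canyon: Flood! - Stream Channel Erosion Activity",
    "DC.Type": "Lesson Plan",
    "DC.Format": "text/HTML",
    "DC.Date": "1998-09-02",
    "DC.Publisher": "PBS Online",
}

meta_tags = "\n".join(
    f'<meta name="{name}" content="{escape(value)}">'
    for name, value in record.items()
)
print(f"<head>\n{meta_tags}\n</head>")
```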
Metadata Evaluation Experiment

• Blind test of automatic vs. manually generated metadata
• Subjects:
  – Teachers
  – Education Students
  – Professors of Education
• Web-based experiment
  – Subjects provided with educational resources and metadata records
  – 2 conditions tested
Metadata Evaluation Experiment
Blind Test of Automatic vs. Manual Metadata

Expectation Condition - Subjects reviewed:
  1st - metadata record
  2nd - lesson plan
and then judged whether the metadata provided an accurate preview of the lesson plan on a 1-to-5 scale.

Satisfaction Condition - Subjects reviewed:
  1st - lesson plan
  2nd - metadata record
and then judged the accuracy and coverage of the metadata on a 1-to-5 scale, with 5 being high.
Qualitative Experimental Results

                                   Expectation   Satisfaction   Combined
# Manual Metadata Records              153            571           724
# Automatic Metadata Records           139            532           671
Manual Metadata Average Score         4.03           3.81          3.85
Automatic Metadata Average Score      3.76           3.55          3.59
Difference                            0.27           0.26          0.26
MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest
StandardConnection

• Goal: Determine feasibility & quality of automatically mapping teaching standards to learning resources
• Data: Educational Resources: Lesson Plans, Activities, Assessment Units, etc., e.g. “Simultaneous Equations Using Elimination”
• Teaching Standards: Achieve/McREL Compendix, e.g. “Solve linear equations and inequalities algebraically and non-linear equations using graphing, symbol-manipulating or spreadsheet technology.”
Cross-mapping through the Compendix Meta-language
[Diagram: state standards mappings (Washington, California, New York, Florida, Arkansas, Alaska, Michigan, Texas) are cross-mapped through the central Compendix; e.g. URI: M8.4.11ABCJ]
StandardConnection Components
[Diagram: Educational Resources (Lesson Plans, Activities, Assessment Units, etc.) are connected through the Compendix to State Standards; e.g. Mathematics 6.2.1 C: “Adds, subtracts, multiplies, & divides whole numbers and decimals”]
Lesson Plan: “Simultaneous Equations Using Elimination”
Submitted by: Leslie Howe
Email: teachhowe2@hotmail.com
School/University/Affiliation: Farragut High School, Knoxville, TN
Grade Level: 9, 10, 11, 12, Higher education,
Vocational education, Adult/Continuing education
Subject(s): Mathematics / Algebra
Duration: 30 minutes
Description: The Elimination method is an
effective method for solving a system of two
unknowns. This lesson provides students with
immediate feedback using a computer program or
online applet.
Goals: The student will be able to solve a system
of two equations when there are two unknowns.
Materials: Online computer applet / program
http://www.usit.com/howe2/eqations/index.htm
Similar downloadable C++ application available at
the same site.
Procedure: A system of two unknowns can be
solved by multiplying each equation by the
constant that will make the coefficient of one
of the variables become the LCM (least
common multiple) of the initial coefficients.
Students may use the scroll bars on the
indicated applet to multiply the equations by
constants until the GCF is located. When the
"add" button is activated after the correct
constants are chosen one of the variables will
be eliminated. The process can be repeated
for the second variable. The student may
enter the solution of the system by using
scroll bars. When the "check" button is
pressed the answer is evaluated and the
student is given immediate feedback. (The
same procedure can be done using the
downloadable C++ application.) After 5-10
correct responses the student should make the
transition to paper and solve the equations
without using the applet. The student can still
use the applet to check the answer. The applet
will generate problems in a random fashion.
All solutions are integers.
Assessment: The lesson itself provides alternative
assessment. The correct responses are
recorded.
Lesson Plan: “Simultaneous Equations Using Elimination” (with standard attached)

The same lesson plan as above, now carrying its automatically assigned standard:

Standard: McREL 8.4.11 Uses a variety of methods (e.g., with graphs, algebraic methods, and matrices) to solve systems of equations and inequalities
Automatic Assigning of Standards as a Retrieval Process

DOCUMENT COLLECTION = Compendix Standards
  – An index of terms from the standards is assembled from the subject heading, secondary subject, the actual standard, and its vocabulary.

QUERY = NLP-Processed Lesson Plan
  – Filtering: sections of a new lesson plan are eliminated or given greater weight (e.g. citations are removed), leaving the relevant parts.
  – Natural Language Processing: part-of-speech tagging and bracketing of phrases & proper names, e.g. Simultaneous|JJ Equations|NNS Using|VBG Elimination|NN
  – TF/IDF: relative frequency weights of words, phrases, proper names, etc.
  – Query = top 30 terms: equation, eliminate, solve, …

ASSIGNMENT: the top-ranked standard is assigned to the lesson plan. (A compressed sketch of this pipeline follows.)
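
A compressed sketch of that pipeline using scikit-learn in place of CNLP's processing: index the standards with TF/IDF, reduce a lesson plan to its top weighted terms, and rank standards by cosine similarity. All texts are illustrative stand-ins:

```python
# Sketch of standards-as-documents retrieval with scikit-learn,
# standing in for CNLP's pipeline. Texts are illustrative stand-ins.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

standards = [
    "Uses a variety of methods to solve systems of equations and inequalities",
    "Solves simple inequalities and non-linear equations with rational numbers",
    "Adds, subtracts, multiplies, and divides whole numbers and decimals",
]
lesson_plan = ("Simultaneous equations using elimination. The student will "
               "solve a system of two equations with two unknowns.")

# Index the standards collection (L2-normalized TF/IDF vectors).
vectorizer = TfidfVectorizer(stop_words="english")
index = vectorizer.fit_transform(standards)

# Reduce the lesson plan to its top-weighted terms (top 30 in the project;
# fewer exist in this toy vocabulary) and use them as the query.
plan_vec = vectorizer.transform([lesson_plan]).toarray()[0]
feature_names = vectorizer.get_feature_names_out()
top = np.argsort(plan_vec)[::-1][:30]
terms = [feature_names[i] for i in top if plan_vec[i] > 0]
query = vectorizer.transform([" ".join(terms)])

# Rank standards by cosine similarity (dot product of normalized vectors).
scores = linear_kernel(query, index).ravel()
for rank, i in enumerate(np.argsort(scores)[::-1], start=1):
    print(rank, f"{scores[i]:.3f}", standards[i])
```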
Teaching Standard Assignment as Retrieval Task Experiment

• Exploratory test run
  – 3,326 standards (documents)
  – 2,239 lesson plans (queries)
  – TF/IDF term weighting scheme
  – top 30 weighted terms from each lesson plan as a query vector
• Manual evaluation
  – Focusing on understanding of issues & solutions
Information Retrieval Experiments
• Baseline Results
– 68 queries (lesson plans) evaluated
– 24 (35%) queries - appropriate standard was
ranked first
– 28 (41%) queries - predominant standard was in
top 5
– Room for improvement, but promising
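
The two baseline figures reduce to simple rank-based metrics; a sketch of the computation over hypothetical per-query ranks of the appropriate standard:

```python
# Computing the two baseline metrics above from per-query ranks:
# the rank of the appropriate standard for each evaluated lesson plan.
# The ranks here are hypothetical.
ranks = [1, 3, 1, 7, 2, 1, 12, 5, 1]  # one entry per evaluated query

ranked_first = sum(1 for r in ranks if r == 1)
in_top_5 = sum(1 for r in ranks if r <= 5)

n = len(ranks)
print(f"ranked first: {ranked_first}/{n} ({ranked_first / n:.0%})")
print(f"in top 5:     {in_top_5}/{n} ({in_top_5 / n:.0%})")
```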
Future Research
• Improve current retrieval performance
– Matching algorithm, document expansion, etc
• Apply a classification approach to the StandardConnection Project
• Compare information retrieval approach and
classification approach
• Improve browsing access for teachers &
administrators
Browsing Access to Learning Resources
[Diagram: a browsable map of standards (e.g. by strand), with automatic assignment of standards mapping each lesson plan to its attached standard (here, Standard 8.4.11). Example standards shown:
  Standard 8.3.6: Solves simple inequalities and non-linear equations with rational number solutions, using concrete and informal methods.
  Standard 8.4.11: Uses a variety of methods (e.g., with graphs, algebraic methods, and matrices) to solve systems of equations and inequalities.
  Standard 8.4.12: Understands formal notation (e.g., sigma notation, factorial representation) and various applications (e.g., compound interest) of sequences and series.]
MetaData Research Projects
1. Breaking the MetaData Generation Bottleneck
2. StandardConnection
3. MetaTest
Life-Cycle Evaluation of Metadata

1. Initial generation
   - Methods: Manual, Automatic
   - Costs: Time, Human Resources, Technology
2. Accessing DL resources
   - Users’ interactions: Browsing, Searching
   - Relative contribution of each metadata element
3. Search Effectiveness
   - Precision
   - Recall
GOAL: Measure Quality & Usefulness of Metadata
[Diagram: Metadata Generation (METHODS: Manual, Semi-Automatic, Automatic; COSTS: Time, Human Resources, Technology) produces Metadata, which is evaluated through Browsing and Searching (Precision, Recall) at the User, System, and Understanding levels.]
Evaluation Methodology

• Automatically metatag a Digital Library collection that has already been manually meta-tagged.
• Solicit a range of appropriate Digital Library users.
• For each metadata element:
  1. Users qualitatively evaluate it in light of the digital resource.
  2. Conduct a standard IR experiment.
  3. Observe subjects while searching & browsing.
     - Monitor with eye-tracking & think-aloud protocols
Information Retrieval Experiment
• Users ask queries of system
• System retrieves documents using either:
– Manually assigned metadata
– Automatically generated metadata
• System ranks documents by its estimate of relevance
• Users review retrieved documents & judge relevance
• Compute precision & recall
• Compare results according to:
– Method of assignment
– The Metadata element which enabled retrieval
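
A sketch of the final precision and recall computation for a single query, given the user's relevance judgments (document IDs hypothetical):

```python
# Precision & recall for one query from user relevance judgments.
# Document IDs are hypothetical.
retrieved = {"d1", "d2", "d3", "d4", "d5"}   # system's result set
relevant = {"d2", "d3", "d7", "d9"}          # documents judged relevant

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # fraction of retrieved that are relevant
recall = len(hits) / len(relevant)      # fraction of relevant that are retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")  # 0.40, 0.50
```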
User Studies: Methods & Questions

1. Observations of Users Seeking DL Resources
   – How do users search & browse the digital library?
   – Do search attempts utilize the available metadata?
   – Which metadata elements are most important to users?
   – Which are used consistently for the best results?
User Studies: Methods & Questions (cont’d)

2. Eye-tracking with Think-aloud Protocols (a minimal fixation-metrics sketch follows this list)
   – Which metadata elements do users spend most time viewing?
   – What are users thinking about when seeking digital library resources?
   – Show correlation between what users are looking at and thinking.
   – Use eye-tracking to measure the number & duration of fixations, scan paths, dilation, etc.
3. Individual Subject Data
   – How does expertise / role influence seeking resources from digital libraries?
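
For the eye-tracking measures in item 2, a minimal sketch aggregating the number and duration of fixations per metadata element from hypothetical gaze data:

```python
# Aggregating eye-tracking fixations per metadata element.
# Each fixation: (element the gaze landed on, duration in ms).
# The data are hypothetical.
from collections import defaultdict

fixations = [("Title", 220), ("Description", 310), ("Title", 180),
             ("Keywords", 140), ("Description", 260)]

count = defaultdict(int)
total_ms = defaultdict(int)
for element, duration in fixations:
    count[element] += 1
    total_ms[element] += duration

# Report elements in descending order of total viewing time.
for element in sorted(total_ms, key=total_ms.get, reverse=True):
    print(f"{element}: {count[element]} fixations, {total_ms[element]} ms total")
```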
Sample Lesson Plans
Eye Scan Path For Bug Club Document
Eye Scan Path For Sigmund Freud Document
What, When, Where, and How Long
[Table: each word fixated, with its fixation number and fixation duration]
In Summary: Metadata Research Goals

1. Improve access via automatic metadata generation:
   • Provide richer, more complete and consistent metadata.
   • Increase the number of resources available electronically.
   • Increase the speed with which they are added.
2. Add appropriate teaching standards to each resource.
3. Provide empirical results on quality, utility, and cost of automatic vs. manual metadata generation.
4. Show evidence as to which metadata elements are needed.
5. Inform HCI design with a better understanding of users’ behaviors when browsing and searching Digital Libraries.
6. Employ automatic metadata generation to build the Semantic Web.