Enhancing search : an update on taxonomies, metadata and thesauri.

Enhancing search
An update on taxonomies, metadata and thesauri
Leonard Will
Willpower Information
1
Summary
1 Metadata creation is cataloguing
2 Taxonomies are classifications
3 Thesauri and classifications are
complementary ways of grouping concepts
4 Facet analysis is a useful technique for
constructing schemes systematically
5 Most computer search interfaces are
inadequate
2
Metadata = catalogue records
• Resources: any things that can be identified
– documents, web pages, images, sound files, teaching
packages, books, museum objects, people,
organisations
• Metadata: structured information about resources
– May be included with resources (e.g. “CIP”) or collected
in separate “union catalogues” (e.g. OAI-PMH)
– Some from the resource itself (size, format), some from
external sources (provenance, location, accessibility)
3
Metadata standards
• Anglo-American Cataloguing Rules (AACR)
• Encoded Archival Description (EAD)
• Learning Object Metadata (LOM)
• Spectrum standard for museum information
• Friend of a Friend (FOAF) and vCard
• e-Government Metadata Standard (eGMS)
• Dublin Core - lowest common denominator
4
Kinds of standards
• Content standards: which pieces of information
are to be recorded (DC, AACR)
• Value standards: how is the information to be
recorded (= DC encoding schemes)
– formats (ISO date format, NCA name formats, AACR)
– lists of valid values (thesauri, authority files)
• Structure standards: how the information is to
be grouped and labelled for use by computers
and humans (XML schemas, MARC)
• Application profiles: Choices from the above
5
Dublin Core metadata
• Title
• Creator
• Subject
• Description
• Publisher
• Contributor
• Date
• Type
• Format
• Identifier
• Source
• Language
• Relation
• Coverage
• Rights
• + element refinements
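A minimal sketch of how a record using the fifteen elements above might be held in a program. The function name make_record and the sample values are illustrative assumptions, not part of any Dublin Core specification; element refinements are left out.

```python
# The fifteen Dublin Core element names listed above, lower-cased.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**values):
    """Build a record, rejecting names that are not Dublin Core elements.

    make_record is a hypothetical helper written for this sketch.
    """
    unknown = set(values) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    # Elements may be repeated, so every value is stored as a list.
    return {
        name: list(val) if isinstance(val, (list, tuple)) else [val]
        for name, val in values.items()
    }


record = make_record(
    title="Enhancing search",
    creator="Leonard Will",
    subject=["thesauri", "classification schemes"],  # illustrative values
)
print(record["subject"])
```

Storing every value as a list keeps single and repeated elements uniform, which simplifies later indexing.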
6
Subject
“Typically, Subject will be expressed as
keywords, key phrases or classification
codes that describe a topic of the
resource.
Recommended best practice is to select a
value from a controlled vocabulary or
formal classification scheme.”
7
Taxonomies = controlled vocabularies
• “Taxonomy”: woolly meaning -> confusion
– keep it for biological classification systems
• Knowledge organization systems (KOS)
– a better expression for the general concept
• Main types are
– thesauri
– classification schemes
– ontologies
8
Thesauri and classification schemes
• Thesauri and classification schemes are
alternative ways of showing concepts
and their relationships
• They are complementary and both
approaches are needed
• They can both be built on the principles
of facet analysis
9
Building blocks of all knowledge
organisation schemes
• concepts
• relationships
35 mm cameras
  CC: H012
  BT: film cameras
aqualungs
  CC: D002
  BT: diving equipment
camera accessories
  CC: H002
  BT: photographic equipment
  NT: flash guns
      light meters
      tripods
  RT: cameras
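The two building blocks can be sketched as a small data structure: one object per concept, with its relationships held as lists of terms. The class name Concept and its field names are assumptions made for this illustration; the entries are the three shown on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One concept, labelled by its preferred term (hypothetical class)."""
    term: str
    cc: str = ""                              # classification code (CC)
    bt: list = field(default_factory=list)    # broader terms
    nt: list = field(default_factory=list)    # narrower terms
    rt: list = field(default_factory=list)    # related terms

    def lines(self):
        """Return the entry as display lines, relationships indented."""
        out = [self.term]
        if self.cc:
            out.append(f"  CC: {self.cc}")
        for tag, terms in (("BT", self.bt), ("NT", self.nt), ("RT", self.rt)):
            for i, t in enumerate(terms):
                # Only the first term of each relationship carries the tag.
                prefix = f"{tag}: " if i == 0 else "    "
                out.append("  " + prefix + t)
        return out


concepts = [
    Concept("camera accessories", "H002", bt=["photographic equipment"],
            nt=["flash guns", "light meters", "tripods"], rt=["cameras"]),
    Concept("35 mm cameras", "H012", bt=["film cameras"]),
    Concept("aqualungs", "D002", bt=["diving equipment"]),
]

# Alphabetical display: sort by preferred term, then print each entry.
for concept in sorted(concepts, key=lambda c: c.term):
    print("\n".join(concept.lines()))
```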
10
Relationships are between concepts, not words
[Diagram: two concepts, each carrying several labels, linked as BT and NT]
BT concept: vehicles, road vehicles, conveyances, voitures, 388.34, 629.2
NT concept: cars, automobiles, autos, private cars, 388.342, 629.222
Choose one term as a descriptor to label the concept:
cars USE automobiles
11
Preferred term substitution
“Anything on farming?”
“I use the term agriculture for farming, so I’ll search for that”
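Preferred term substitution amounts to a lookup table of USE references. A minimal sketch, assuming terms are matched case-insensitively; the names USE and preferred are hypothetical, and the entries come from the examples in these slides.

```python
# Entry (non-preferred) terms mapped to the preferred term: "X USE Y".
USE = {
    "farming": "agriculture",
    "cars": "automobiles",
    "autos": "automobiles",
    "private cars": "automobiles",
}


def preferred(term):
    """Return the preferred term for a query term.

    A term with no USE reference is assumed to be preferred already.
    """
    key = term.strip().lower()
    return USE.get(key, key)


print(preferred("Farming"))
print(preferred("automobiles"))
```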
12
Relationships between concepts
• Paradigmatic, or a priori: apply generally,
independently of any specific document
– shoes BT footwear
– shoes RT shoemakers
A thesaurus can
show these
• Syntagmatic, or a posteriori: concepts
that are related only in the context of a
specific document
– shoes : history
– shoes : prices
A classification scheme
can also show these
13
Searching hierarchies
“I need information on road vehicles”
“I know that buses, cars and lorries are all kinds of road vehicles, so I’ll search for these terms as well as for road vehicles”
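Hierarchical search of this kind can be sketched as query expansion: follow the NT links from the term asked for and add every narrower term found. The names NT and expand are assumptions for this sketch; only the road vehicles example comes from the slide.

```python
# Narrower-term (NT) links, keyed by the broader term.
NT = {
    "road vehicles": ["buses", "cars", "lorries"],
}


def expand(term, nt=NT):
    """Return the term followed by all its narrower terms, at every level."""
    found, queue = [], [term]
    while queue:
        current = queue.pop(0)
        if current not in found:      # guards against repeats and cycles
            found.append(current)
            queue.extend(nt.get(current, []))
    return found


print(" OR ".join(expand("road vehicles")))
```

The expanded terms are joined with OR because a resource indexed with any one of them is relevant to the broader request.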
14
Searching related terms
“Please give me information about agriculture”
“OK, I’ll look for that. Would you also be interested in items dealing with forestry, livestock or pet breeding?”
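Related-term prompting can be sketched as a reciprocal lookup: each RT link is stored once and can be followed from either end. The names RT_PAIRS, build_rt_index and suggestions are hypothetical; the pairs are taken from the dialogue above.

```python
from collections import defaultdict

# Each RT link is recorded once; the index makes it reciprocal.
RT_PAIRS = [
    ("agriculture", "forestry"),
    ("agriculture", "livestock"),
    ("agriculture", "pet breeding"),
]


def build_rt_index(pairs):
    """Index RT links in both directions."""
    index = defaultdict(set)
    for a, b in pairs:
        index[a].add(b)
        index[b].add(a)
    return index


RT = build_rt_index(RT_PAIRS)


def suggestions(term):
    """Return related terms to offer the user, in alphabetical order."""
    return sorted(RT.get(term, ()))


print(suggestions("agriculture"))
```

Unlike narrower terms, related terms are offered to the user rather than added to the search automatically, as in the dialogue.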
15
Paradigmatic relationships
in a thesaurus
• Many relationships are indicated as
RT/RT, but their nature is not specified,
so they cannot be used for systematic
grouping (ontologies overcome this)
• Hierarchical generic-specific
relationship (BT/NT) allows (requires)
grouping of concepts into facets - the
terms have to be in the same facet
16
What is a facet?
(Sometimes called a fundamental facet)
A high-level grouping of concepts of the same
inherent category, e.g. activities, disciplines,
people, materials, places, times. For example:
• animals, mice, daffodils and bacteria could all be members of a living organisms facet;
• digging, writing and cooking could all be members of an activities facet;
• birthdays, wars and football matches could all be members of an events facet.
A concept cannot belong to more than one facet
17
What is an array?
(Sometimes called a subfacet)
A grouping of concepts within a facet by
some stated characteristic of division.
vehicles
<vehicles by number of wheels>
. bicycles
. tricycles
. four-wheeled vehicles
. . automobiles
<vehicles by load carried>
. goods vehicles
. . lorries
. passenger vehicles
. . automobiles
. . buses
Each node label (in angle brackets) introduces an array and shows its characteristic of division
A concept may occur in more than one array
19
Parametric search
• Searching for resources that have one
or more specified characteristics
• e.g. vehicles which
– have three wheels
AND
– are used for carrying passengers
• This is an important and useful aspect
of post-coordinate searching, but it is
not faceted classification
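A parametric search can be sketched as a filter over attribute values. The resource records below, their attribute names and the function parametric_search are invented for illustration.

```python
# Hypothetical resource descriptions with two searchable characteristics.
RESOURCES = [
    {"name": "bicycle", "wheels": 2, "load": "passengers"},
    {"name": "tricycle", "wheels": 3, "load": "passengers"},
    {"name": "three-wheeled van", "wheels": 3, "load": "goods"},
    {"name": "automobile", "wheels": 4, "load": "passengers"},
    {"name": "lorry", "wheels": 6, "load": "goods"},
]


def parametric_search(resources, **criteria):
    """Return the resources that match every specified characteristic (AND)."""
    return [
        r for r in resources
        if all(r.get(key) == value for key, value in criteria.items())
    ]


hits = parametric_search(RESOURCES, wheels=3, load="passengers")
print([r["name"] for r in hits])
```

The filter tests values of independent attributes; it does not arrange concepts in a classified sequence, which is why it is not faceted classification.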
20
Ways of displaying concepts and
their paradigmatic relationships
1. Alphabetically, with their relationships
35 mm cameras
  BT: film cameras
aqualungs
  BT: diving equipment
camera accessories
  BT: photographic equipment
  NT: flash guns
      light meters
      tripods
  RT: cameras
21
Ways of displaying concepts and
their paradigmatic relationships
2. Hierarchically - one tree for each facet
(fields of work)
. diving
. photography
. physics
. . optics
(people)
<people by age>
. infants
. children
. adults
<people by occupation>
. divers
. models (people)
. photographers
. physicists
(equipment)
. diving equipment
. . aqualungs
. . diving suits
. . . dry suits
. . . wet suits
. . face masks
. photo equipment
. . cameras
22
Ways of displaying concepts and
their paradigmatic relationships
3. In subject groups or categories (microthesauri)
– one tree for each facet in each category
770: PHOTOGRAPHY
(fields of work)
. photography
. . colour photography
(people)
. models (people)
. photographers
(equipment)
. photo equipment
. . cameras
797.23: DIVING
(fields of work)
. diving
. . scuba diving
. . snorkel diving
(people)
. divers
(equipment)
. diving equipment
. . aqualungs
. . diving suits
. . . dry suits
23
Combining concepts :
syntagmatic relationships
(places)
A1 Italy
A2 The Netherlands
A3 Russia
(people)
B1 potters
B2 repairers
B3 ceramicists
(activities)
C1 moulding
C2 throwing
C3 decoration
(objects)
D1 earthenware
D2 porcelain
D3 stoneware
Node labels (in parentheses) show facet names
Combine to express compound subjects either post-coordinate, for searching:
porcelain AND decoration AND Russia
or pre-coordinate, for browsing:
porcelain decoration in Russia: D2C3A3
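The two ways of combining can be sketched side by side. The pairing of codes C1 to C3 and D1 to D3 with terms is read from the slide layout and agrees with the example string D2C3A3; the document index and the function names are invented for illustration.

```python
# Facet codes from the slide.
CODES = {
    "Italy": "A1", "The Netherlands": "A2", "Russia": "A3",
    "potters": "B1", "repairers": "B2", "ceramicists": "B3",
    "moulding": "C1", "throwing": "C2", "decoration": "C3",
    "earthenware": "D1", "porcelain": "D2", "stoneware": "D3",
}

# Hypothetical index: term -> identifiers of documents indexed with it.
INDEX = {
    "porcelain": {"doc1", "doc2", "doc4"},
    "decoration": {"doc1", "doc3", "doc4"},
    "Russia": {"doc1", "doc2", "doc3"},
}


def post_coordinate(index, *terms):
    """AND search at query time: documents indexed with every term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()


def pre_coordinate(*terms):
    """Join the codes of the terms, in the order given, into one string."""
    return "".join(CODES[t] for t in terms)


print(post_coordinate(INDEX, "porcelain", "decoration", "Russia"))
print(pre_coordinate("porcelain", "decoration", "Russia"))
```

Post-coordination combines concepts when the question is asked; pre-coordination fixes the combination at indexing time, so that the resulting strings can be filed in a browsable sequence.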
24
Order of combining facets
thing - kind - part - property - material - process - operation - system operated on - product - byproduct - agent - space - time - form
e.g.
porcelain (thing) decoration (process) in Russia (space)
A facet may occur more than once in a string
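Applying a fixed order of combination can be sketched as a stable sort on facet rank. Reading "process" and "operation" as two separate steps in the order is an assumption, as are the names CITATION_ORDER and subject_string.

```python
# Order of combining facets, as listed above.
CITATION_ORDER = [
    "thing", "kind", "part", "property", "material", "process",
    "operation", "system operated on", "product", "byproduct",
    "agent", "space", "time", "form",
]
RANK = {facet: position for position, facet in enumerate(CITATION_ORDER)}


def subject_string(components):
    """Arrange (term, facet) pairs in citation order.

    sorted() is stable, so terms from a facet that occurs more than
    once keep the order in which they were supplied.
    """
    ordered = sorted(components, key=lambda pair: RANK[pair[1]])
    return [term for term, _facet in ordered]


parts = [("Russia", "space"), ("decoration", "process"), ("porcelain", "thing")]
print(subject_string(parts))
```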
25
Faceted classification
with processes subordinated to objects
(processes)
A        ceramic production processes in general
AA       forming in general
AAA      coiling
AAB      moulding
AAC      throwing
AB       decoration in general
ABA      glazing
ABB      transfer printing
(objects)
B        ceramics in general
BB       earthenware in general
         (processes)
BB.AA    forming of earthenware
BB.AAB   moulding of earthenware
BB.AB    decoration of earthenware
BB.ABA   glazing of earthenware
BB.ABB   transfer printing of earthenware
BC       porcelain in general
         (processes)
BC.AA    forming of porcelain
BC.AAB   moulding of porcelain
Words shown in blue may be omitted as they are implied by the hierarchical structure
26
Faceted classification
generation of subject strings
(objects)
B        ceramics
BB       earthenware
         (processes)
BB.AA    forming
BB.AAB   moulding
BB.AB    decoration
BB.ABA   glazing
BB.ABB   transfer printing
BC       porcelain
         (processes)
BC.AA    forming
BC.AAB   moulding
ceramics > earthenware > forming
ceramics > earthenware > forming > moulding
ceramics > earthenware > decoration
ceramics > earthenware > decoration > glazing
ceramics > earthenware > decoration > transfer printing
ceramics > porcelain
ceramics > porcelain > forming
ceramics > porcelain > forming > moulding
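Generating the strings can be sketched as a walk down the hierarchy, joining the captions met on the way. The nested structure SCHEDULE and the function subject_strings are assumptions for this sketch. The hierarchy is stored explicitly and not derived from the notation. The walk also yields strings for the two top classes, ceramics and ceramics > earthenware, which do not appear in the list above.

```python
# Each node is (caption, [child nodes]); captions follow the slide.
SCHEDULE = ("ceramics", [
    ("earthenware", [
        ("forming", [("moulding", [])]),
        ("decoration", [("glazing", []), ("transfer printing", [])]),
    ]),
    ("porcelain", [
        ("forming", [("moulding", [])]),
    ]),
])


def subject_strings(node, path=()):
    """Yield one subject string per class, in schedule order."""
    caption, children = node
    here = path + (caption,)
    yield " > ".join(here)
    for child in children:
        yield from subject_strings(child, here)


for string in subject_strings(SCHEDULE):
    print(string)
```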
27
Alphabetical index
ceramic production processes                              A
ceramics                                                  B
coiling : forming : ceramic production                    AAA
decoration : ceramic production                           AB
decoration : earthenware : ceramics                       BB.AB
earthenware : ceramics                                    BB
forming : ceramic production                              AA
forming : earthenware : ceramics                          BB.AA
forming : porcelain : ceramics                            BC.AA
glazing : decoration : ceramic production                 ABA
glazing : decoration : earthenware : ceramics             BB.ABA
moulding : earthenware : ceramics                         BB.AAB
moulding : forming : ceramic production                   AAB
moulding : porcelain : ceramics                           BC.AAB
porcelain : ceramics                                      BC
throwing : forming : ceramic production                   AAC
transfer printing : decoration : ceramic production       ABB
transfer printing : decoration : earthenware : ceramics   BB.ABB
28
The same concepts viewed in
different ways
Classification view
• Good for browsing or surveying a topic
• Like a map
• Like a book’s contents page
• Shows related concepts together
• Usually arranged by discipline
• Shows syntagmatic and paradigmatic relationships
• Shows compound topics as pre-combined subject strings
Thesaurus view
• Good for searching if you know what you want
• Like a gazetteer
• Like a book’s index
• Gets quickly to individual concepts
• Usually arranged by facet
• Shows paradigmatic relationships
• Lets you combine concepts when searching
29
Some clarifications
• A classification can be both hierarchical and faceted
• A classification built on faceted principles can be
enumerative
• A symbolic notation is not essential, and should not
determine the structure
• A classification can arrange compound topics in a
useful linear sequence - a thesaurus cannot
• One-to-one mapping between a thesaurus and a
classification is not possible
• A “guide to popular topics” may be used to
supplement a systematic classification
30
Use of a thesaurus
• A thesaurus as a search aid with
unindexed material
– Allows searching on terms linked to the term
asked for
• Software support for formulating
questions
– Browsing the thesaurus to choose terms
– Combining terms with AND, OR, NOT and ( )
31
An ambiguous search interface
Does this mean:
(lorries OR cars) AND diesel ?
or does it mean:
lorries OR (cars AND diesel) ?
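The two readings retrieve different sets, which a small sketch with sets makes visible. The postings (document numbers per term) are invented for illustration.

```python
# Hypothetical postings: term -> numbers of the documents indexed with it.
lorries = {1, 2}
cars = {3, 4}
diesel = {2, 3}

# Reading 1: (lorries OR cars) AND diesel
reading_1 = (lorries | cars) & diesel

# Reading 2: lorries OR (cars AND diesel)
reading_2 = lorries | (cars & diesel)

print(sorted(reading_1))
print(sorted(reading_2))

# Without brackets, Python's own precedence rules pick reading 2,
# because & binds more tightly than |.
unbracketed = lorries | cars & diesel
print(unbracketed == reading_2)
```

An interface that offers AND and OR without brackets forces the user to guess which of these the system will apply.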
32
Thesaurus creation and
management
• Standards
– BS/ISO standards give helpful guidance
– Draft revised BS standard now out for comments
• Software
– Many packages available
– Best if integrated with database used for cataloguing
• Cooperative thesaurus development and
use
– DIY is a major and continuing task
33
Thesaurus development never
ends
• It is an ongoing task
• It needs a knowledgeable thesaurus
editor
• It needs cooperation and input from
indexers and users
• User feedback
34
What we need
• Software for the combined development of thesaurus and
classification
– Thesaurofacet; Classaurus; ROOT; Bliss; Taxomita
• Software support for combining facets when searching,
using a thesaurus. Often referred to as faceted classification,
but not the same thing
– Flamenco; View-based searching; No zero match (NZM)
• Software support for browsing in a classified catalogue with
notation, captions and an alphabetical index
35
Links and further information
<http://www.willpowerinfo.co.uk/>
36