Controlling values The equivalence relationship

The vocabulary problem
What is this?
Synonymy
Restroom, bathroom, toilet, loo, WC, ladies’
room, men’s room, little girls’ room, little boys’
room. . .
Synonymy: Using different words to identify the
same concept.
Another vocabulary problem
What is mercury?
What is bank?
What is python?
What is java?
Polysemy
Polysemy: Using the same word
(morphologically speaking) to identify different
concepts.
Java: Island in Indonesia, variety of coffee bean,
object-oriented programming language.
Yet more vocabulary problems
The White House has been lobbying Congress to
support the proposed budget. . .
Freedom of the press is an important value in the
United States. . .
I’m tired of taking the bus; I need some new
wheels. . .
Metonymy and synecdoche
Metonymy: Using a related concept to stand for
another concept.
Synecdoche: Using the word for part of
something to stand for the entire thing.
Furnas et al.’s experiment
Furnas et al. asked people (including subject experts) to
label a variety of items (recipes, text editing operations,
“common content objects”). Surprisingly, there was little
agreement among the names submitted by participants.
Conclusion: “The idea of an ‘obvious,’ ‘self-evident,’ or
‘natural’ term is a myth! Since even the best possible
name is not very useful, it follows that there can exist
no rules, guidelines or procedures for choosing a good
name, in the sense of ‘accessible to the unfamiliar user.’”
Furnas et al.’s recommendations
Furnas et al. suggest that interface designers:
• Implement unlimited aliasing.
• Disambiguate terms that can be used in
multiple senses by presenting possibilities to
users and asking them to select the appropriate
one.
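The two recommendations above can be sketched in a few lines, assuming a small hand-built alias table. The table contents and the `resolve` function are hypothetical illustrations, not Furnas et al.’s implementation.

```python
# Hypothetical alias table: each word a user might type maps to every
# concept it could name. Aliases can be added without limit.
ALIASES = {
    "restroom": ["bathroom"],
    "loo": ["bathroom"],
    "wc": ["bathroom"],
    "java": ["Java (island)", "Java (coffee)", "Java (programming language)"],
}


def resolve(word):
    """Return (concept, []) for an unambiguous word, or (None, choices)
    when the interface should ask the user to pick a sense."""
    candidates = ALIASES.get(word.strip().lower(), [])
    if len(candidates) == 1:
        return candidates[0], []
    return None, candidates
```

An unambiguous alias resolves directly; an ambiguous one returns the list of senses that the interface would present to the user. An unknown word returns no concept and no choices.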
Limitations of Furnas et al.’s study
• Participants were asked to label objects, not to
describe how they would search for them.
• The study assumes a search interface, not a
browsing (or menu-driven) interface.
In a search interface, users must recall or guess an
object’s name. In a browsing interface, users merely
need to recognize the appropriate term.
Vocabulary problems and
information systems
Designers of organizational systems have been
grappling with the ambiguities of language for
many years.
Synonymy, polysemy, and related phenomena complicate
the goal of collocating, or bringing together, like items in
an information system (those by the same author,
with the same title, or on the same subject).
Vocabulary control
In library and information science (LIS), vocabulary
control is similar to Furnas et al.’s idea of aliasing:
multiple terms that might stand for the same concept
are grouped together.
One term is typically designated as preferred:
this is the term used in a display. In a card
catalog, the card filed under the preferred term
carries the actual entry; the other terms are
just cross-references.
Example of a controlled term
Preferred term: bathroom
Equivalent terms: restroom, loo, toilet, WC,
ladies’ room, men’s room, little girls’ room,
little boys’ room, ladies room, ladys room,
lady’s room, ladie’s room, ladys’ room...
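As a rough sketch, the controlled term above can be stored as an entry vocabulary in which every equivalent term points at the preferred term, much as cross-reference cards did. The names `ENTRY_VOCABULARY` and `display` are assumptions made for this illustration, and only some of the equivalents are included.

```python
PREFERRED = "bathroom"
EQUIVALENTS = ["restroom", "loo", "toilet", "WC",
               "ladies' room", "ladies room", "ladys room"]

# Entry vocabulary: every variant (lowercased) points at the preferred term.
ENTRY_VOCABULARY = {term.lower(): PREFERRED for term in EQUIVALENTS}
ENTRY_VOCABULARY[PREFERRED] = PREFERRED


def display(term):
    """Show the preferred term, or a 'see' reference for a variant."""
    key = term.lower()
    if key not in ENTRY_VOCABULARY:
        return f"{term}: not in vocabulary"
    preferred = ENTRY_VOCABULARY[key]
    if key == preferred:
        return preferred
    return f"{term}: see {preferred}"
```

Looking up a nonpreferred term such as "WC" yields a cross-reference to "bathroom" rather than an entry of its own.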
Digression into the library catalog
Library catalogs have three traditional access
points: author, title, and subject. In the old card
catalog, these were the three ways that users
could search.
Each of these access points has associated
vocabulary control.
Catalog entries
Entry is an old term for a catalog record. For example,
Herman Melville’s Moby Dick might have an entry in
the card catalog under the subject Fiction—Whaling.
The main entry designates the primary access point and,
in the card catalog, the card with all the bibliographic
information. (Other entries might have a cross-reference
to the main entry only.)
The entry for Moby Dick under Fiction—Whaling might
say merely “See Melville, Herman. Moby Dick.”
Main entry confusion
For many people, the designation of a primary
access point or main entry is anachronistic in the
world of online systems. We can search any
attribute now: why select a “primary” one?
Taylor notes three arguments for retaining the
main entry: standardization of citation,
subarrangement, and collocation of works.
Control of names
Names, such as author or title names, are
controlled via authority files.
Authority files both disambiguate names that
identify multiple people or items and group
variations for the same person or item (that is,
they deal with polysemy and synonymy).
Authority file examples
In the UT author authority file, the headings for
Patricia Williams show:
• Names are disambiguated by using middle
initials and dates of birth.
• Cross references are used for some authors.
• There may still be two headings for one person!
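A minimal sketch of an authority file follows, assuming each record holds one authorized heading plus its cross-referenced variants. The record structure, the function name, and the sample names and dates are invented for illustration; they are not taken from the UT authority file.

```python
from dataclasses import dataclass, field


@dataclass
class AuthorityRecord:
    heading: str                                   # authorized (preferred) form
    variants: list = field(default_factory=list)   # cross-referenced forms


# Invented records: two people sharing a name are kept apart by
# middle initial and year of birth.
AUTHORITY_FILE = [
    AuthorityRecord("Doe, Jane Q., 1950-", ["Doe, J. Q.", "Doe, Jane Quincy"]),
    AuthorityRecord("Doe, Jane R., 1972-", ["Doe, Jane Rose"]),
]


def find_headings(name):
    """Return each authorized heading whose heading or variants begin with name."""
    needle = name.lower()
    return [
        record.heading
        for record in AUTHORITY_FILE
        if any(form.lower().startswith(needle)
               for form in [record.heading, *record.variants])
    ]
```

A search on the shared name returns both headings (disambiguation), while a search on a variant form leads to a single authorized heading (grouping of variations).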
Digression 2: Power catalog searching
To increase the precision in library catalog
searches, avoid keyword searching.
Instead, search the appropriate authority file first,
then search using the preferred heading. Magic!
Searching the authority file typically necessitates
proper query formation (e.g., last name, first
name for author searches).
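The difference between a keyword search and a search on the preferred heading can be sketched as follows. The two catalog records, the simple name-inversion rule, and all function names are assumptions made for this illustration; the second record is invented.

```python
def invert_name(name):
    """Turn 'First [Middle] Last' into 'Last, First [Middle]'."""
    parts = name.split()
    if len(parts) < 2:
        return name
    return f"{parts[-1]}, {' '.join(parts[:-1])}"


# Hypothetical catalog records; the second one is invented.
CATALOG = [
    {"title": "Moby Dick", "author": "Melville, Herman"},
    {"title": "A Study of Herman Melville", "author": "Doe, Jane"},
]


def keyword_search(query):
    """Match records in which every query word appears in the title or author."""
    words = query.lower().split()
    return [r["title"] for r in CATALOG
            if all(w in f"{r['title']} {r['author']}".lower() for w in words)]


def heading_search(heading):
    """Match only records whose author heading equals the given heading."""
    return [r["title"] for r in CATALOG if r["author"] == heading]
```

Here `keyword_search("Herman Melville")` also returns the book about Melville, while `heading_search(invert_name("Herman Melville"))` returns only the work by him.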
Digression 3: Pseudonyms in the catalog
Pseudonymous identities are maintained in
AACR2 (in older catalogs, everything went
under the author’s real name).
For example, “Carolyn Keene,” the name used
by multiple people as the author for the Nancy
Drew novels, is maintained as an author entity in
the authority file.
Controlled subject vocabularies
Subject vocabularies have varying amounts of
structure (e.g., relationships between terms).
Thesauri may include equivalence, hierarchical,
and associative relationships. Thesauri can also
be faceted (that is, represent multiple aspects of a
subject; we will discuss facets in depth later in
the course).
Example thesaurus entry
Dark chocolate
BT Chocolate
RT Single-origin chocolate
UF Semisweet chocolate
UF Baker’s chocolate
UF Sweet chocolate
SN Chocolate without milk solids and with less than 70 percent chocolate mass.

BT: broader term, one level up in a hierarchy
RT: related term, in another facet or hierarchical branch
UF: use for; synonyms, or nonpreferred terms
SN: scope note; definitions or usage guidelines
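One way to hold an entry like this in a program is sketched below, assuming one field per relationship type. The class name, field names, and `render` method are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str
    broader: list = field(default_factory=list)   # BT
    related: list = field(default_factory=list)   # RT
    use_for: list = field(default_factory=list)   # UF
    scope_note: str = ""                          # SN

    def render(self):
        """Lay the entry out in the conventional BT/RT/UF/SN form."""
        lines = [self.term]
        for label, values in (("BT", self.broader),
                              ("RT", self.related),
                              ("UF", self.use_for)):
            lines += [f"  {label} {value}" for value in values]
        if self.scope_note:
            lines.append(f"  SN {self.scope_note}")
        return "\n".join(lines)


dark = ThesaurusEntry(
    term="Dark chocolate",
    broader=["Chocolate"],
    related=["Single-origin chocolate"],
    use_for=["Semisweet chocolate", "Baker's chocolate", "Sweet chocolate"],
    scope_note="Chocolate without milk solids and with less than "
               "70 percent chocolate mass.",
)
```

Calling `dark.render()` reproduces the entry with one relationship per line.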
Equivalence in thesauri
Similar concepts may be treated as equivalents as
judged appropriate by the thesaurus designer.
Examples:
Beer
UF ale, porter, stout, pilsner, bock, IPA. . .
Cartography
UF maps
Disambiguation in thesauri
Polysemous terms are often identified by adding
qualifying terms in parentheses.
Mercury (element)
Mercury (god in Roman mythology)
Search engines may ask users to select the
sense they want.
Using controlled vocabularies:
MeSH and PubMed
The Medical Subject Headings (MeSH) are used to index
journal articles for the PubMed database.
Keyword searches in PubMed are automatically
expanded with MeSH. Searches can also be explicitly
limited to MeSH terms, which can increase precision.
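The general idea of expanding a keyword with a controlled vocabulary can be sketched as below. This is a toy illustration using the Beer and Cartography examples from the earlier slide, not PubMed’s actual term-mapping procedure; `THESAURUS`, `ENTRY_INDEX`, and `expand` are hypothetical names.

```python
# Toy thesaurus: preferred term -> nonpreferred (UF) terms.
THESAURUS = {
    "Beer": ["ale", "porter", "stout", "pilsner", "bock", "IPA"],
    "Cartography": ["maps"],
}

# Index every term, preferred or not, back to its preferred term.
ENTRY_INDEX = {}
for preferred, variants in THESAURUS.items():
    ENTRY_INDEX[preferred.lower()] = preferred
    for variant in variants:
        ENTRY_INDEX[variant.lower()] = preferred


def expand(keyword):
    """Return the set of terms to search for a given keyword."""
    preferred = ENTRY_INDEX.get(keyword.lower())
    if preferred is None:
        return {keyword}          # no control available: search as typed
    return {preferred, *THESAURUS[preferred]}
```

A search for "stout" is expanded to the preferred term "Beer" and all of its equivalents, which raises recall; restricting the search to the preferred term alone corresponds to the explicit limit that can raise precision.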
The comparison to a system like Google Scholar is
illuminating.
Standards for controlled vocabularies
There are a number of standards for thesaurus
construction: ISO, NISO, and British Standards.
These can be quite detailed, but they provide
mostly syntactic guidance: e.g., terms should
take noun form.
Summary
• Controlled vocabularies increase precision and recall
in searching by identifying equivalent terms.
• Authority files are a type of controlled vocabulary
that describes preferred forms of author names and
names of works.
• Thesauri are subject-based controlled vocabularies
that include hierarchical and associative relationships
in addition to equivalence relationships. Thesauri can
also be used as browsing interfaces.
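For reference, precision and recall can be computed as in the sketch below. The document identifiers in the usage note are made up for illustration; only the two standard definitions are assumed.

```python
def precision_recall(retrieved, relevant):
    """Precision = relevant retrieved / all retrieved.
    Recall = relevant retrieved / all relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall
```

Suppose four documents (1 to 4) are about bathrooms, but only documents 3 and 4 use the word "bathroom". A keyword search that retrieves {3, 4} has precision 1.0 and recall 0.5; a search expanded with equivalent terms that retrieves {1, 2, 3, 4} has precision 1.0 and recall 1.0.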