Building Taxonomies - Access Innovations, Inc.

Building
Taxonomies
Alice Redmond-Neal
Access Innovations, Inc.
Enterprise Search Summit
New York City, May 21, 2006
Copyright © 2006
Access Innovations, Inc.
So what’s a taxonomy?
• Words – controlled vocabulary
• Used as labels for indexing – descriptive
metadata
• Attached to documents, digital objects,
or physical objects
• Organized to aid retrieval – hierarchical
structure
– Hierarchical presentation of a thesaurus
Perspectives on taxonomies
• Taxonomist (aka Lexicographer, Thesaurus
builder)
• Information architect
• Indexer
• Searcher
Each has a different view and need for
words in retrieving information.
Each need relates to using a taxonomy
for indexing.
Taxonomies for
information retrieval online
• Conceptual framework for web content –
reflects organization of knowledge in a
domain
• Foundation for information architecture
• Often 3 levels deep – depends on domain
• May be hidden or displayed
Info retrieval starts with a
knowledge organization system
• Uncontrolled list
• Name authority file
• Synonym set/ring
• Controlled vocabulary
• Taxonomy
• Thesaurus
• Ontology
• Semantic network
(Listed from not complex to highly complex)
LOTS OF OVERLAP!
Structure of
controlled vocabularies
INCREASING COMPLEXITY →
• List of words: Ambiguity control
• Synonyms: Synonym control
• Taxonomy: Ambiguity control, Synonym control, Hierarchical rel’s
• Thesaurus: Ambiguity control, Synonym control, Hierarchical rel’s, Associative rel’s
Controlled vocabulary
construction standards
• ANSI (American National Standards
Institute)
• NISO (National Information Standards
Organization)
• ISO (International Organization for Standardization)
• BS (British Standards Institution)
Differences are minor and diminishing.
ANSI/NISO Z39.19-2005 revision approved.
Taxonomy defined –
ANSI/NISO Z39.19-2005*
“A controlled vocabulary
consisting of preferred terms
all of which are connected in a hierarchy
or polyhierarchy.”
Missing:
equivalence, homographic, and associative relationships
and notes – features of a THESAURUS.
* http://www.niso.org/standards/resources/Z39-19-2005.pdf
Taxonomy as
an organization system
• Controlled vocabulary
• Hierarchical format
– Parent-child relationships
• Specific items appear as final leaves
on hierarchy branches
• Common on websites
– Pick list
– Browsable directory
– Other variations
Thesaurus as
an organization system
• Controlled vocabulary
• Focus on conceptual classes, not specifics
• Hierarchy – implicit if not displayed
– Parent-child relationships
• Various display formats may be available
• Network of relationships between terms helps user to find information
– Cousins, friends, aliases
• Scope notes, term history
• More elaborate and informative
• Long established standards
Thesaurus defined –
ANSI/NISO Z39.19-1993, -2005
“A controlled vocabulary of terms in natural
language that are designed for
postcoordination...”
“Terms are arranged…so that various
relationships are displayed clearly…”
“The controlled vocabulary is established by
information specialists or lexicographers
and is generally employed in indexing.”
Thesaurus defined –
ANSI/NISO Z39.19-2005
“A controlled vocabulary arranged in a known
order in which equivalence, homographic,
hierarchical, and associative relationships among
terms are clearly displayed and identified by
standardized relationship indicators, which must be
employed reciprocally.
Its purposes are to promote consistency in the
indexing of content objects, especially for
postcoordinated information storage and retrieval
systems, and to facilitate browsing and
searching by linking entry terms with terms.
Thesauri may also facilitate the retrieval of content
objects in free text searching.”
Standards and pragmatism
• Standards are your friends
– Lead to richer, more informative product
– Promote interoperability -- Allow you to adopt
or adapt other controlled vocabularies
– Promote predictability
– Allow repurposing within your organization and
by other organizations
• Follow standards for taxonomy building
– Incorporate authority files / final nodes as
needed
• Your taxonomy or thesaurus must
meet your needs
Your taxonomy / thesaurus
end product
• Reflects
– scope of your concern
– degree of precision you need
• Facilitates
– data storage and retrieval by vocabulary control
– discovery of ideas
• Promotes learning
– preferred terminology
– relationships among concepts
– organized guide to your field
Talk about terms and
taxonomies
• How to choose terms
• How to ensure term clarity, avoid
ambiguity
– Vocabulary control—why and how
• How to format terms
• Terms within a taxonomy—the big picture
How do you choose terms?
• Importance in the subject area
• Use in the literature, by the
organization or community
• Necessary degree of specificity or detail
• Relationship with other controlled
vocabularies
Vocabulary control – why?
“The need for vocabulary control arises
from two basic features of natural
language, namely:
two or more words or terms can be used
to represent a single concept, and
two or more words that have the same
spelling can represent different
concepts.”
ANSI/NISO Z39.19-2005
Vocabulary control
through disambiguation
Synonyms – de-duplicate meanings
• Multiple words for the same concept
– President of the United States, POTUS
– Biological technology, Biotech
Homographs (polysemes) – eliminate
ambiguity
• Same written word used for multiple
meanings
– Balloon—which kind?, Box—which kind?
– Cells, Mercury, Records, Bridge/Bridges, Bush
Vocabulary control – how?
Organize terms
• to show which of two or more
synonymous terms is preferred or
authorized for use
• to distinguish between homographs
• to indicate hierarchical and associative
relationships among terms
Vocabulary control – in practice
• Use unambiguous terms, clear to the user group
• Distinguish between terms that appear similar
• Use Scope Notes when necessary
• Use terms as elements that can be coordinated in a flexible manner
• Create compound terms (noun+modifier) when necessary
One term / one concept
• “Terms in a thesaurus should represent
simple or unitary concepts…” (ISO standard)
• “Each descriptor included in a thesaurus
should represent a single concept (or unit
of thought).
…frequently expressed by a single-word
term but in many cases a multiword term
is required.”
(ANSI/NISO Z39.19-2005)
A “term” synonym ring
Term
Descriptor
Node
Category
Subject heading
So what’s a concept?
• “A unit of thought, formed by mentally
combining some or all of the characteristics
of a concrete or abstract, real or imaginary
object. Concepts exist in the mind as
abstract entities independent of terms used
to express them.”
• Three main categories
– Abstract concepts
– Concrete entities
– Proper nouns
Concrete entities as terms
• Things and their physical parts
– primates
• head
– buildings
• floors
• Materials
– cement
– wood
– lead
Abstract concepts as terms
• Actions and events
– evolution, skating, management, ceremonies
• Abstract entities
– law, theory
• Properties of things, materials, and actions
– strength, efficiency
• Disciplines and sciences
– physics, meteorology, mathematics
• Units of measurement
– pounds, kilograms, miles, meters, nanoseconds
Proper nouns as terms
• Individual entities – “classes of one” –
expressed as proper nouns
– San Francisco, Lake Michigan
Thesaurus standards prefer to exclude proper
names, persons, and trade names.
Extensive lists → authority files.
Taxonomies include them as final nodes.
Pop quiz – which qualify as
terms?
• rooms
• living rooms
• living room furniture
• schools
• public schools
• public school curricula
• marketing and advertising
• societal issues
• information ethics, plagiarism, credibility
• information literacy, lifelong learning
Criterion: “single unit of thought”
The term record
• Main Term (MT) = subject term, heading, node, category, descriptor, class
• Top Term (TT)
• Broader Terms (BT)
• Narrower Terms (NT)
• Related Terms (RT)
– See also (SA)
• Scope Note (SN)
• History (H)
• NonPreferred Term (NP)
– Used for (UF), See (S)
TAXONOMY: MT, TT, BT, NT
THESAURUS: all of the above
See the Lexicographer’s lexicon
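A term record can be pictured as a small data structure. The sketch below is illustrative only: the class name `TermRecord`, its field names, and the helper `is_taxonomy_only` are assumptions made for this example, not part of any standard or product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TermRecord:
    """One term record; field names are hypothetical, mirroring the slide's labels."""
    main_term: str                                       # MT
    top_term: str = ""                                   # TT
    broader: List[str] = field(default_factory=list)     # BT
    narrower: List[str] = field(default_factory=list)    # NT
    related: List[str] = field(default_factory=list)     # RT / See also
    scope_note: str = ""                                 # SN
    history: str = ""                                    # H
    used_for: List[str] = field(default_factory=list)    # NonPreferred Terms (UF)

    def is_taxonomy_only(self) -> bool:
        """True when only hierarchical fields are filled in (no thesaurus features)."""
        return not (self.related or self.scope_note or self.history or self.used_for)


record = TermRecord(
    main_term="Elections",
    top_term="Politics",
    broader=["Politics"],
    narrower=["Presidential elections", "Gubernatorial elections", "Mayoral elections"],
)
print(record.is_taxonomy_only())
```

Adding Related Terms, a Scope Note, or NonPreferred Terms to the same record is what moves it from taxonomy territory toward a thesaurus.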
Build a taxonomy – simple steps
• Get paper and pencil
– Sharpen pencil
• Define subject field
• Collect terms
• Organize terms
• Fill in gaps
• Flesh out and interrelate terms
You’re done!
Define subject field
• Review representative collection of content
• Determine:
– Core areas
– Peripheral topics
• Scope can be modified later
(Diagram labels: Sociology, Psychology, Education, Law)
Before you go on: Build or buy?
• Survey existing thesaurus/taxonomy
resources for your domain
• Test for
– Scope
– Depth
• Make-or-break terms
– Cost
Don’t reinvent the wheel!
Collect terms
• Your documents and databases
• Departmental terminology
• Text books and their indexes (indices)
• Book tables of contents and indexes
• Journal quarterly indexes
• Encyclopediae
• Lexicons, glossaries on the topic
• Web resources
• Users and experts
• Search logs
Gather terms from search logs
Beyond the Spider: The Accidental Thesaurus
(Richard Wiggins, Information Today, Oct 2002)
Top ~100 search terms from search logs
Match to web site with appropriate answer
Basis for favorites or best bets, presented at the top
of results list.
(AKA behavior-based taxonomy)
Not a thesaurus or taxonomy,
but still a useful source of terms.
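Pulling the top search terms from a log can be sketched in a few lines. This is a minimal illustration, assuming the log has already been reduced to a plain list of query strings; the function name `top_search_terms` and the sample queries are invented for the example.

```python
from collections import Counter


def top_search_terms(queries, n=100):
    """Return the n most frequent queries after light normalization."""
    # Lowercase and collapse whitespace so trivial variants count together.
    normalized = (" ".join(q.lower().split()) for q in queries)
    counts = Counter(q for q in normalized if q)
    return counts.most_common(n)


# Hypothetical log extract.
sample_log = ["Campus map", "campus  map", "parking", "Parking", "campus map", "tuition"]
top = top_search_terms(sample_log, n=2)
print(top)
```

The resulting list is only a source of candidate terms and best bets; deciding which entries become preferred terms remains editorial work.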
Organize terms – roughly
• Sort terms into several major categories –
logical groups of similar concepts as
Top Terms
– Identify core areas and peripheral topics
– 10 – 20 to start
– Consider moving proper names to authority files
• Result: loose collection of terms under
several main headings
– Rough and tentative – see how it fits as you go
– Initial gap analysis
– Add / modify / delete as needed
Labelling a concept –
cognitive linguistics
• Most-used labels are middle in range from
abstract to specific --- relates to search
• Linguistic universal – true across cultures
• Unique beginner
• Life form
• Generic: Insurance
• Specific: Health insurance
• Varietal: Group health insurance
Practical application?
Craft the Top Terms
• Toughest job and most important step!
• Dictates further organization
• Determines how browsers/searchers
perceive the taxonomy
– Coverage
– Formality
• Establish the concept first, tweak the
wording later
Usefulness of a term –
the “duh” factor
• Some terms are so basic for a domain that
they have little or no value
– “Sports” in Sports Illustrated
– “Technology” in Technology Review
– “Golf” in Golf Magazine
• How useful will the term be for indexing?
– Apply to everything in the domain?
– Distinguish important concepts?
– If term is needed, specify limited use conditions
in Scope Note
Hierarchy structures –
variations on a theme
• Not pre-determined
– Wines → type → variety → region → cost
– Or Wines → cost → type….
• Varies by user group and needs
– May have multiple views of same content
– Standard alpha view or customized notation
• Affects information architecture, i.e. how
web site functions
How do terms relate?
• Hierarchical relationships -- Parents and their children (TAXONOMY)
• Equivalence relationships -- Aliases
• Associative relationships -- Cousins
(THESAURUS: all three)
Hierarchical relationships
• Broader Term represents the category
• Narrower Term represents the specific
• Three types:
– Generic relationship (BTG/NTG)
– Whole-part relationship (BTP/NTP)
– Instance relationship (BTI/NTI)
• BTs/NTs have a reciprocal relationship
Broader to Narrower Terms
Politics
  Elections
    Presidential elections
    Gubernatorial elections
    Mayoral elections
Levels, broader to narrower: Generic, Specific, Varietal
Hierarchy – Generic
(genus-species) relationship
• Inheritance or inclusion – what’s true of the
parent (BT) is true for all children (NTs)
• Applies to entities, actions, properties,
agents – not just biological taxonomies
Value
  Cultural value
  Economic value
  Moral value
  Social value
Teachers
  Adult educators
  School teachers
  Special ed teachers
  Student teachers
Thinking
  Contemplation
  Divergent thinking
  Lateral thinking
  Reasoning
Generic relationship test – 1
• Both terms in same fundamental category
• “All-and-some” test
– Rodents / Squirrels: SOME rodents are squirrels, ALL squirrels are rodents
– Pests / Squirrels: SOME pests are squirrels, NOT ALL squirrels are pests
Generic relationship test – 2
(Diagram: Rodents, Pests, Squirrels)
✓ ALL squirrels are rodents
x NOT ALL squirrels are pests
x NOT ALL pests are rodents
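The “all-and-some” test can be expressed with sets. The sketch below is an illustration under stated assumptions: the member lists are toy data invented for the example, and `is_generic_child` is a hypothetical helper, not a function from any tool.

```python
def is_generic_child(child_members, parent_members):
    """All-and-some test: ALL children belong to the parent, and only SOME
    of the parent's members are children (a proper subset)."""
    child, parent = set(child_members), set(parent_members)
    return bool(child) and child < parent


# Toy membership data, assumed for illustration only.
squirrels = {"gray squirrel", "red squirrel"}
rodents = squirrels | {"house mouse", "beaver"}
pests = {"house mouse", "gray squirrel", "cockroach"}

print(is_generic_child(squirrels, rodents))  # ALL squirrels are rodents
print(is_generic_child(squirrels, pests))    # NOT ALL squirrels are pests
print(is_generic_child(pests, rodents))      # NOT ALL pests are rodents
```

In practice the test is applied by editorial judgment about the concepts, since nobody enumerates every squirrel; the set version only makes the logic explicit.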
Hierarchy – Whole-part relationship
• Also known as meronymy or partonomy
• Four types allowed in thesaurus standards
– Body systems and organs
• Ear → Middle ear
– Geographical locations
• Bernalillo County → Albuquerque
– Fields of study
• Geology → Physical geology
– Hierarchical organizational/corporate/social/political structures
• Diocese → Parish
Hierarchy – Instance relationship
• General category (common noun) = BT
• Individual example (proper noun) = NT
Seas
  Baltic Sea
  Caspian Sea
  Mediterranean Sea
New York museums
  Guggenheim Museum
  Museum of Modern Art
  Museum of Natural History
Essentially identical to “final node” in taxonomies.
Best practice: long list → move to authority file
Polyhierarchical relationship
• Term can logically fit under more than one
Broader Term – can have Multiple Broader
Terms (MBT)
• New to ANSI/NISO standards
Spoons
  Sporks
Forks
  Sporks
Nurses
  Nurse administrators
Health administrators
  Nurse administrators
Finance
  Accounting
Careers
  Accounting
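Polyhierarchy and the reciprocal BT/NT rule can be sketched with a plain mapping. This is a minimal illustration using the example terms from the slide; the variable and function names (`broader`, `derive_narrower`) are assumptions for the example.

```python
from collections import defaultdict

# Each term lists its Broader Terms; more than one BT means polyhierarchy.
broader = {
    "Sporks": ["Spoons", "Forks"],
    "Nurse administrators": ["Nurses", "Health administrators"],
    "Accounting": ["Finance", "Careers"],
}


def derive_narrower(broader_map):
    """Build the reciprocal NT links from the BT links."""
    narrower = defaultdict(list)
    for term, parents in broader_map.items():
        for parent in parents:
            narrower[parent].append(term)
    return dict(narrower)


narrower = derive_narrower(broader)
polyhierarchical = sorted(t for t, parents in broader.items() if len(parents) > 1)
print(narrower["Forks"])
print(polyhierarchical)
```

Deriving one direction from the other, rather than typing both by hand, is one way to keep BT/NT pairs reciprocal.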
Equivalence relationship
• Preferred Term
– Thesaurus term and valid for indexing
– Thesaurus notation: USE
• NonPreferred Term
– Not valid for indexing
– An alias or imposter
– Entry point, directs user to Preferred Term
– Thesaurus notation: UF or NPT
Spiders
  UF Arachnids
Plant pathology
  USE Phytopathology
Equivalence – when to use
• Synonyms, slang, quasi-synonyms
• Scientific and trade names
– Ibuprofen UF Motrin™
• Lexical variants
– Fiber optics UF Fibre optics
– Mouse UF Mice
• Upward posting of narrow concepts not specified in taxonomy or thesaurus
– Social class UF Elite, Middle class, Working class
Get equivalent terms from search logs, brainstorming…
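Equivalence relationships behave like a lookup from entry terms to Preferred Terms. The sketch below assumes a simple in-memory table built from the examples on these slides; the names `USE`, `resolve`, and `used_for` are hypothetical.

```python
# NonPreferred entry term (lowercased) -> Preferred Term.
USE = {
    "arachnids": "Spiders",
    "plant pathology": "Phytopathology",
    "motrin": "Ibuprofen",
    "fibre optics": "Fiber optics",
    "mice": "Mouse",
    "elite": "Social class",
    "middle class": "Social class",
    "working class": "Social class",
}
PREFERRED = set(USE.values())


def resolve(entry):
    """Return the Preferred Term for an entry term, or None if unknown."""
    key = " ".join(entry.lower().split())
    if key in USE:
        return USE[key]
    for term in PREFERRED:
        if term.lower() == key:
            return term  # already a Preferred Term
    return None


def used_for(preferred):
    """Reciprocal view: the NonPreferred Terms that point at a Preferred Term."""
    return sorted(k for k, v in USE.items() if v == preferred)


print(resolve("Fibre  Optics"))
print(used_for("Social class"))
```

The same table serves indexers (which term is valid?) and searchers (entry points that redirect to the valid term).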
Associative relationship
• Related Terms (RTs) ~ cousins
• “…terms related conceptually but not
hierarchically, and are not part of an
equivalence set” (i.e. not synonyms)
– Should siblings be Related Terms??
• Both terms are valid thesaurus terms for
indexing, and have reciprocal relationship
• Expands user’s awareness, reflects
thesaurus coverage of unanticipated areas
• Standards describe specific types (see Lexicon)
Sibling rivalry and facets
• Format and sense of sibling terms should be consistent
• If siblings don’t coexist well, separate them
• Subdivide large groups of terms into facets, mutually exclusive subcategories
• Growing demand with faceted navigation
• Facet examples
– Properties, Materials, Agents, Actions, Influence
– Objects, Styles and periods, Color, Shape (Art & Architecture Thesaurus)
Faceted classification
• Pharmaceuticals
– (by action)
• Anti-inflammatory agents…
– (by chemical structure)
• Alkaloids…
– (by indication)
• Pain…
– (by use)
• Immunosuppression…
Facet indicators (aka Node labels), not to be used for indexing
Faceting challenge
Propose facet indicators and subgroup these paint varieties into facets.
• Paint
– Oil paint
– High-gloss paint
– Interior paint
– Matte paint
– Latex paint
– Semi-gloss paint
– Exterior paint
Scope Notes (SN)
• Indicate meaning of the term in the context of this thesaurus, for this audience
– Stress – Metal, Psychological, Physiological
• Indicate any restriction in meaning
• Indicate range of topics covered
• Provide direction for indexers; for terms often confused, may suggest an alternative term
• Use only as needed – not for every term
• Establish and stick with consistent format
• Be concise
Evaluating terms
• Do terms represent all necessary
concepts?
– Gap analysis
• Do terms capture necessary details?
– Level of granularity
• Are terms understood by users?
– Domain expert vs. common user
Talk about terms
• Term format
• Grammatical issues
• Singular and plural forms
• Spelling
• Abbreviations and acronyms
• Capitalization
• Other punctuation
• Consistency
Term format
• KISS – Keep it short and simple
– 1-2-3 words
• Effect on search
• Factoring, Postcoordination (coming)
• Grammatical issues
– Nouns and noun phrases
– Verbish things
– Adjectives
– Adverbs
– Initial articles
Most terms are nouns
• Nouns or simple noun phrases
(phrase = compound or bound term)
– Adj + Noun – Art history (ANSI/NISO standard)
– Noun + Prep + Noun – History of art (ISO standard)
– Exceptions – Burden of proof, Coats of arms, Prisoners of war, Birds of prey, etc.
Other parts of speech
• Verbs
– Gerund form: Fishing
• Adjectives
– Not used in isolation
– Very rare (lots in Art & Architecture Thesaurus)
– OK when combined with another term –
Dental bridges
• Adverbs
– No, except as part of proper name –
Very Large Array
• Articles
– No, except as part of proper name –
El Salvador, Le Mans
Singular and plural forms
• Plural form for count nouns
– “how many” clouds, animals, highways
• Singular form for mass nouns
– “how much” security, oxygen, rain
• Exceptions
– Body parts in medicine → singular (heart, foot)
– Unique entities → singular (Brooklyn Bridge)
– User warrant → plural/singular (fishes)
stocks? fishes? monies?
Term spelling
• Preferred spelling depends on audience
– Multinational company may need alternative
spellings in same taxonomy
• Use most widely accepted spelling
• Use secondary spelling as NonPreferred
Term (synonym)
• Exception:
– Proper names – Labour Party
Abbreviations and acronyms
• Use only when full form is rarely seen –
SCUBA, LASER, DNA, LASIK
• Use full form if abbreviation is not widely
used and understood
– Automated teller machines – for ATM
– Driving while intoxicated – for DWI
• Alternative becomes NonPreferred Term
• Use and acceptance always shifting
• Be consistent
Capitalization
• Standards: use all lower case
– Exceptions:
• Initialisms – DNA
• Proper names – Queen Mary
• Trade names – Thesaurus Master™
• Taxonomic names – Homo sapiens
• Much variation in practice
Parentheses
• Use only for
– Parenthetical qualifiers to disambiguate
homographs
• Bridges (Dentistry), Bridges (Roadways), Bridges
(Music)
– Different meanings for singular / plural word forms
• Bridges [all the above] vs. Bridge (Card game)
• Wood (Material) vs. Woods (Forest)
• Damage (Injury) vs. Damages (Law)
– Facet indicators – Paint (by finish)
– Part of the term – benzo(a)pyrene
– Trademark indicator (tm) becomes ™
Hyphens
• Generally avoid -- nonfiction
• Use only if
– Omitting the hyphen would be ambiguous
• cocitation vs. co-occurrence
– The hyphen is part of the term
• n-body problem
• p-benzoquinone
• CD-ROM
Other punctuation bits
• Apostrophes
– Keep for possessive case
• Diacritical marks
– Keep if possible – Québec
• Other random marks
– Keep if part of a proper name – A&W Root Beer, Standard & Poor’s
Compound terms (aka bound terms)
and factored terms
• Term consisting of more than one word
that represents a single concept
• Keep compound term or factor out (split)?
Compound terms
are precoordinated
• Elements are bound together to specify a
concept at the indexing stage
• Can’t change the parts
Water pollution
Library science
Television influence on preschoolers
Chicken dinner with turnips and rutabagas – no substitutions of menu items!
Factored terms
can be Postcoordinated
• Elements can be strung together to
specify a concept at the search stage
• Elements can be mixed and combined as
needed
– Few clothing pieces → several outfits
• The sum of the elements reflects the
concept (usually)
To factor or not to factor
Is each factor a single concept?
Is each factor in your thesaurus?
If YES, break term down to factors:
California highway construction → California + Highways + Construction
If NO, or if factoring would be confusing, retain the compound term
Children’s television → Television + Children ??
Science library → Library + Science ??
Precoordination positives
• User expectations – Rapid transit
– Occurs commonly in data
– Splitting would be odd
– Reflects a single concept for the audience
• Better accuracy – captures specific concepts
precisely
• Fewer false drops
• Term information is retained
(Related Terms, NonPreferred Terms, Scope Notes, …)
Precoordination negatives
• Poorer total recall
• Term proliferation
– Combinations and permutations increase
thesaurus size
• Higher cost
• Limited flexibility in expressing new
concepts
Postcoordination pros and cons
✓ Higher recall
✓ Lower cost
✓ Greater flexibility – enables expression of new concepts through novel combinations
x Lower accuracy, some false drops
– Library science NOT = Library + Science
– Art museums NOT = Art + Museums
• Postcoordination is implicit in most online
searches (implied AND between search
words)
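The recall/accuracy trade-off can be shown with a toy index and an implied-AND search. Everything in this sketch is assumed for illustration: the two documents, their index terms, and the `search` helper.

```python
def search(index, *terms):
    """Postcoordinated search: a document matches when it carries ALL terms."""
    wanted = set(terms)
    return sorted(doc for doc, assigned in index.items() if wanted <= assigned)


# doc_a is about library science; doc_b is about the library of a science museum.
factored = {
    "doc_a": {"Library", "Science"},
    "doc_b": {"Library", "Science", "Museums"},
}
precoordinated = {
    "doc_a": {"Library science"},
    "doc_b": {"Library", "Science", "Museums"},
}

# Factored terms: both documents come back, so doc_b is a false drop.
print(search(factored, "Library", "Science"))
# Compound term: only the document about library science is returned.
print(search(precoordinated, "Library science"))
```

The factored index retrieves more (higher recall) at the price of the false drop; the compound term is precise but adds one more term to maintain.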
About “and”
• Avoid “and” in terms – not a single
concept
Instead of: Children and television
Factor and postcoordinate
USE Media influence + Television + Children
• “and” OK when both elements are
members of a broader class
Vessels
Ships and boats
Your need for granularity may dictate your choice
So far you’ve got
• Hierarchy
– Broader and Narrower Terms
• Polyhierarchies when needed
• Complete term records
– Preferred/NonPreferred Terms (equivalence relationships)
– Related Terms (associative relationships)
– Scope Notes
– Correct term format
– Compound terms when needed
Notation
• Symbols (numbers, letters, hyphens, colons…)
– 1: Apples
• 1.1: Granny Smith
• 1.2: Winesap
• Another kind of ordering (non-alphabetic)
– Chronological, positional, numeric sequence, or
other logical sequence for user group
– Same terms presented differently
– Different user groups, different purposes
• Adjunct to verbal expression of term
• Secondary to verbal concept organization
Review, edit, test, edit,
use, edit, and maintain, i.e. edit
• Review
– Users
– Expert reviewers
• Test
– Index 500+ documents (more for variable writing style; fewer for strict style)
– Monitor search log
• Edit and maintain
– Add term
– Change existing term
– Change term status
– Delete term
– Add term relationship
– Delete term relationship
– Add/modify Scope Note
– Change overall structure
Consider machine automated / assisted indexing software
Automatic taxonomy construction
• Words and phrases from documents
• Based on frequency and co-occurrence of
words
• No semantic analysis
• Produces list of possible terms
• Requires editorial analysis
– hierarchical and conceptual organization
– association of related concepts
– identifying and deduplicating equivalent
concepts
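The frequency and co-occurrence approach can be sketched briefly. This is a simplified illustration, not how any particular product works: the stopword list, the `min_docs` threshold, the function names, and the sample texts are all assumptions.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "a", "of", "and", "in", "for", "to", "is", "are"}


def tokenize(text):
    """Lowercase words, minus a tiny assumed stopword list."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def candidate_terms(documents, min_docs=2):
    """Words and word pairs that occur in at least min_docs documents."""
    doc_freq, pair_freq = Counter(), Counter()
    for text in documents:
        words = sorted(set(tokenize(text)))
        doc_freq.update(words)
        pair_freq.update(combinations(words, 2))
    words = [w for w, n in doc_freq.most_common() if n >= min_docs]
    pairs = [p for p, n in pair_freq.most_common() if n >= min_docs]
    return words, pairs


docs = ["Health insurance for groups", "Group health insurance costs", "Insurance fraud"]
words, pairs = candidate_terms(docs)
print(words, pairs)
```

Note that “group” and “groups” are counted as different words: with no semantic analysis, merging variants and arranging the candidates into a hierarchy is left to the editor.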
Show ‘em what you’ve got –
displays for every user
• Thesaurus/taxonomy views and functions
depend on audience and purpose
– taxonomists
– indexers
– corporate workers
– public searchers
For the taxonomist
• Hierarchy view
• Alphabetic view
• Permuted (KWIC) view
• Single term record view
• Graphical view
• Notational view
• Deleted terms
• Candidate terms
• Retrieve term record
• Find term in hierarchy view
Taxonomists NEED MOST and WANT even MORE!
(Screenshot panels: Hierarchy, Alphabetical, Permuted (KWIC), Term record, Notation view)
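A permuted (KWIC) view lists every term under each of its words. The sketch below is a minimal illustration; the function name `permuted_index` and the sample terms are chosen for the example.

```python
def permuted_index(terms):
    """Return (keyword, term) pairs, sorted so each word groups its terms."""
    entries = []
    for term in terms:
        for word in term.split():
            entries.append((word.lower(), term))
    return sorted(entries)


terms = ["Health insurance", "Group health insurance", "Public schools"]
kwic = permuted_index(terms)
for keyword, term in kwic:
    print(f"{keyword:<10} {term}")
```

This view lets a taxonomist find “Group health insurance” under “insurance” as well as under “group”, which an alphabetic list alone would not show.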
For the indexer
• Search to retrieve term record
• Access to Scope Notes, Related Terms,
NonPreferred Terms
• Hierarchy view for the big picture
• Automated proposal of indexing terms
For the searcher
• Browsable directory (Yahoo.com, MediaSleuth.com)
• Faceted navigation (MOMA.org, LandsEnd.com)
• Alpha term list or terms grouped by letter
• Drop down list with selected terms
• Portal view – complete or partial taxonomy
– Display terms may be identical to taxonomy terms
– Display terms may be variants, mapped to taxonomy terms
• Taxonomy may not be accessible – requires random guessing
Display taxonomy categories
Results from sample of 1,100 documents (not all categories are populated)
Reveal Narrower Terms
Select taxonomy category to display titles
Access full bibliographic record
Faceted navigation
SLA website and thesaurus
SLA search
Concept indexing – effect on retrieval
Search query: THESAURUS
Precision search based on M.A.I. indexing: 3 hits
Free text, no indexing → 0 hits
Search: kangaroo
Broader Terms
Narrower Terms
Related Terms
Use (synonyms)
Leverage taxonomy term information to aid search
(Screenshot labels: Indexing rule, Term record)
What we’ve covered
• Taxonomy – from different perspectives
• Collecting and organizing concepts
• Term choice and vocabulary control
• Taxonomy structure
• Term relationships
• Term format
• Factored and compound terms
• Constructing a simple taxonomy
• Display variations for different users
“The Computer and the Poet”
“The biggest single need in computer technology
is not for improved circuitry,
or enlarged capacity,
or prolonged memory,
or miniaturized containers,
but for better questions and better use of answers.”
Norman Cousins, editorial in The Saturday Review,
July 23, 1966 special issue on “The New Computer Age”
Through taxonomies, effectively applied through
indexing, we aim to efficiently connect
the questions and the answers.
Questions?
Comments?
Thanks for your attention!
Alice Redmond-Neal
ared@accessinn.com
Access Innovations, Inc.
www.AccessInn.com
Data Harmony software
www.DataHarmony.com