ASAIB Conference 2014 Presentation Du Preez_~

Taxonomies, folksonomies, ontologies? What are they and how do
they support information retrieval?
Madely du Preez
INTRODUCTION
Until fairly recently thesauri, lists of subject headings and taxonomies were the most important
controlled vocabularies that were used in formal information retrieval systems. The development of
more sophisticated information retrieval tools triggered debates on the usefulness of controlled
vocabularies and classification systems. Despite these debates, people involved in information
organisation and retrieval have become acutely aware of the value these tools have for information
retrieval.
When the World Wide Web (WWW) first started, only a few experts were able to use it to distribute
information while the majority of users dealt with the WWW as consumers (Stock & Stock, 2013, p.
611). However, technological advances and the development of Web 2.0 or the Semantic Web (that
is, a web of linked data) have made it possible for digital libraries or digital repositories to provide
interfaces that allow endless information access (Hwang, Yang, & Ting, 2010, p. 297). These digital
repositories or libraries still use thesauri, ontologies and taxonomies to organise their resources and
to assist users in locating the information they require (Garcia, Martin-Moncunill, Sanchez-Alonso, &
Garcia, 2014, p. 285).
Apart from the availability of numerous digital libraries and online databases, Web 2.0 has also made
social media possible and terms like blogging, wikis, social tagging, and cloud computing have
become part of our communication technology related vocabularies. Furthermore, it seems as if
social media sites such as Facebook and YouTube, as well as social catalogues such as LibraryThing.com
and Amazon, have also contributed to the development of folksonomies to support information retrieval.
The purpose of this presentation is to learn a bit more about what taxonomies, folksonomies, ontologies
and thesauri are and about their roles in information retrieval.
CONTROLLED VOCABULARIES
Two indexing languages are generally used when indexing or searching for information in retrieval
systems such as databases and the Internet. These are natural language and controlled vocabularies.
Fourie and Burger (2005, p. 54) explain that when the indexer or the user uses the words of the
author or words that are generally used within a subject field to index a document or search a
database, they are using natural language terms. However, when they select words from a list of
possible indexing terms, they are using a controlled vocabulary.
When viewed from an information organisation point of view, the objective of controlled
vocabularies is to ensure consistency in indexing, tagging or categorising (Hedden, 2010, p. 135).
However, when viewed from an information retrieval point of view, controlled vocabularies also
support users when they search for information. This means that the user does not need to think of
all the possible terms (including spelling variations and synonyms) that would retrieve the required
information. By using terms from a specific information system’s controlled vocabulary, users
are assured of retrieving some information that is relevant to their search.
Ontologies, thesauri, taxonomies and folksonomies are four examples of controlled vocabularies
that are relevant to this presentation. The following discussion will therefore focus on learning more
about these controlled vocabularies and on discovering how they facilitate information retrieval in an
online environment.
THESAURI
Thesauri are described by Mamassion (2010, p. 103) as “controlled vocabularies that contains index
terms that are used to describe the contents of a document”. Currás (2010, p. 72) extends this
definition when he defines a thesaurus as a “controlled and dynamic vocabulary of terms that share
semantic and generic relationships, and that are applied in a particular field of knowledge.” From a
functional point of view, he views a thesaurus as “an instrument for the control of terminology, used
to transmit, in a more strict language, the language used in documents.”
Based on these definitions, Currás (2010, p. 72) identified certain conditions that a thesaurus must
fulfil:
• It must be a specialised language
• It must be normalised in a post-controlled process
• It must be possible to convert the indexing terms in a thesaurus into keywords which could
assist in determining the theme of the document
• The keywords have a hierarchical relationship
• The indexing languages are terminological
• They must allow for the introduction or suppression of terms so that the thesaurus can be
updated
• They must convert natural language into a controlled vocabulary
• They must serve as a nexus of union between the document and the user.
The conditions identified by Currás (2010) are supported by Burger (2005, p. 160) when she identifies
the following characteristics of thesauri:
• formally structured (e.g. alphabetically or hierarchically)
• indicate a variety of indexing terms. These terms are either preferred terms (UF – used for)
or non-preferred terms (U – Use) in the thesaurus
• refer indexers/users from the terms not suitable for indexing to the preferred terms in the
thesaurus
• indicate relationships with other terms by means of special codes such as BT (broader
terms), NT (narrower terms) and RT (related terms).
The hierarchical relationships among terms in a thesaurus are based on a logical progression from
broader terms to narrower terms. For example:
BT  Transport
NT  Road transport
    Air transport
    Shipping
In turn, the narrower terms can also be broader terms for other terms. For example:
BT  Road transport
NT  Buses
    Motor cars
    Trucks
In addition to having a hierarchical relationship with each other, thesaurus terms can also have an
“equal” relationship with other terms. In the above example, buses, motor cars and trucks are three
different types of motorised vehicles that are used in road transportation. Although all three terms
have a hierarchical relationship with road transport, they are also related terms as all three
appear on the same hierarchical level.
Furthermore, not all terms that are listed in a thesaurus have some form of relationship with other
terms.
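The broader/narrower relationships in the transport example can be sketched as a small data structure. This is only a minimal illustration in Python, not how any particular thesaurus software stores its terms; the names NARROWER, BROADER, siblings and all_narrower are my own assumptions.

```python
# Narrower-term (NT) links copied from the transport example above.
NARROWER = {
    "Transport": ["Road transport", "Air transport", "Shipping"],
    "Road transport": ["Buses", "Motor cars", "Trucks"],
}


def broader_index(narrower):
    """Invert the NT links so that each term maps to its broader term (BT)."""
    broader = {}
    for bt, nts in narrower.items():
        for nt in nts:
            broader[nt] = bt
    return broader


BROADER = broader_index(NARROWER)


def siblings(term):
    """Terms on the same hierarchical level, i.e. terms sharing the same BT."""
    bt = BROADER.get(term)
    if bt is None:
        return []
    return [t for t in NARROWER[bt] if t != term]


def all_narrower(term):
    """Every term below `term`, at any depth of the hierarchy."""
    found = []
    for nt in NARROWER.get(term, []):
        found.append(nt)
        found.extend(all_narrower(nt))
    return found
```

Here BROADER["Buses"] gives "Road transport", and siblings("Buses") lists the terms that appear on the same level. A term with no entry in either mapping simply has no recorded relationships.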
Another feature of thesauri which has not been previously mentioned is the inclusion of scope
notes. Not all terms in a thesaurus include a scope note, but scope notes are used to clarify certain terms in
the thesaurus which could be misinterpreted or used wrongly.
Thesauri, according to Currás (2010, p. 74), started with the increase in themes that emerged from
the literature and where neither hierarchical nor faceted systems could provide adequate responses
to the demand for information. What everybody thought was a new idea was in fact something
already in use. The earliest thesauri emerged from the use of concepts taken from existing
documents that were not necessarily related to each other. The only problem was that no one knew for
certain whether they could be applied to information processes. With the introduction of
computers, the first indexes were developed and the first formally constructed thesauri began to
appear in the 1960s. Two different classes of thesauri developed: general thesauri and
specialised thesauri. The ERIC thesaurus and the ISAP thesaurus are examples of general thesauri.
These thesauri are multidisciplinary. Thesauri that were developed to describe specialised
collections such as an art collection, or historical photograph collection can also cover more than
one discipline at different levels of importance.
Currás (2010, p. 26) uses the following figure to illustrate how thesauri developed, first from
information and then through the field of computing.
However, thesauri are not only used to describe the contents of a document. They are also used by
users of information retrieval systems to identify terms they could use to retrieve information that is
relevant to their individual information needs. For example, some multidisciplinary databases such
as ISAP (Index for Southern African Periodicals) allow users to search the database while using
keywords (natural language terms) and thesaurus terms. In such a database, the keywords are the
most specific terms and the thesaurus terms are the broader terms. The purpose of the thesauri in
such databases is to categorise the information sources according to subject and they therefore
narrow an information search. The following record from the ISAP database illustrates the use of
keywords and thesaurus terms.
RE 1528
LN Afrikaans
TI Mikrorekenaarmatige persoonlike inligtingstelsels.
AU Burger, M.
AB Refers to the general characteristics and application of personal information systems on microcomputer.
Aims to stimulate professionals. Briefly discusses indexing, computer hardware and the advantages of a
computerised system.
SO Mousaion series 3
VO 8 IS 1 MO Sep YE 1990
PA 32-47
SN 0027-2639
TH Information services
KE Computerised information retrieval
   Indexing
   Information management
   Information systems
   Microcomputers
LO 10/12/2013
DD 10/12/2013
PN P13194
Since thesauri are generally developed for specific information systems, thesauri not only narrow an
information search, but also ensure that some information will be retrieved when terms listed in the
thesaurus are used to search for information in the specific information system for which the
thesaurus was developed. Since ISAP is an indexing and abstracting database, users need to request
the relevant information from the National Library, or repeat the information search in a
full-text database such as the SA E-Publications database, which is available through Sabinet.
The thesaurus of the Centre for African Studies (University of Leiden) (http://thesaurus.ascleiden.nl/) is a
good example of a thesaurus which was developed to facilitate the organisation and retrieval of
documents that form part of the Centre’s collection. The thesaurus has a search box as well as an
alphabetical search option. I clicked on “R” and the following thesaurus entry was revealed.
African Studies Thesaurus
rights of the accused
Search catalogue
Scope note
A class of rights that apply to a person in the time period between when they are
formally accused of a crime and when they are either convicted or acquitted, generally
based on the maxim of 'innocent until proven guilty' and including the right to a fair trial,
the right to counsel and the right to communicate.
Used for
defendants' rights
habeas corpus
Broader terms
civil and political rights
Related terms
legal procedure
offenders
presumption of innocence
Subject category
10.05 CRIMINAL LAW, CRIMINAL PROCEDURE
This thesaurus entry shows an example of a scope note, the terms the entry term is used for, and a
broader as well as some related terms. It also places the term within a hierarchical subject category
in the library collection. By clicking on the “search catalogue” link, I could retrieve all the documents
that were linked to this thesaurus entry in the library catalogue.
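The way such an entry refers a searcher from a non-preferred term to the preferred term can be sketched as follows. This is a minimal Python illustration under my own assumptions: the class ThesaurusEntry and the functions build_use_index and preferred are hypothetical names, the scope note is abbreviated, and the sketch does not describe how the Leiden thesaurus is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class ThesaurusEntry:
    term: str                                      # preferred term
    scope_note: str = ""
    used_for: list = field(default_factory=list)   # non-preferred terms (UF)
    broader: list = field(default_factory=list)    # BT
    related: list = field(default_factory=list)    # RT


# Values copied from the entry shown above; the scope note is abbreviated.
entry = ThesaurusEntry(
    term="rights of the accused",
    scope_note="Rights that apply between formal accusation and conviction or acquittal.",
    used_for=["defendants' rights", "habeas corpus"],
    broader=["civil and political rights"],
    related=["legal procedure", "offenders", "presumption of innocence"],
)


def build_use_index(entries):
    """Map every preferred and non-preferred term to its preferred term."""
    index = {}
    for e in entries:
        index[e.term.lower()] = e.term
        for non_preferred in e.used_for:
            index[non_preferred.lower()] = e.term
    return index


USE = build_use_index([entry])


def preferred(term):
    """Return the preferred term for `term`, or None if it is not in the thesaurus."""
    return USE.get(term.strip().lower())
```

A user who types "habeas corpus" is then led to the preferred term "rights of the accused", which is the term under which the documents were indexed.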
Thesauri are not necessarily used to describe the contents of written documents. In her new book,
Titangos (2013) described the Santa Cruz Public Library, California’s Local History Photograph Project
(LHPP) which aimed at making more than 979 historical photographs in the Santa Cruz Library
collection available online. To organise the photographs in this new digital collection, a Thesaurus for
Graphic Materials (TGM) team was appointed. The team came up with a list of terms that could be
used for this purpose, but they soon found that more information was needed to make the
photographs accessible. One photograph, no. 0125, had three words written on its face: “Laurel bull
donkey.” No thesaurus term or subject heading term could adequately describe this photograph. It
was then decided to compose a footnote to explain the photograph in more detail. The following
illustration shows the database entry for the “Laurel bull donkey”. Note that the assigned keyword
resembles a subject heading term. The descriptions that are added to the entries for photographs in
this database now support the retrieval of these entities in that they provide the user with more
information on the photograph. Based on the additional information, the user can then decide on
whether the retrieved photograph is relevant to the information search or not.

Title: 0125

Summary: Laurel bull donkey.

Keywords: Industries--Lumber

Description: Laurel bull donkey. A donkey was an engine used to hoist logs onto flatcars bound for
the mill. Cables could be extended on spools run through the woods for a distance of up to several
miles.
Date: 1890's
Place: Laurel
Sources of Information: Notes on back of photo; Article on this website, see link below
Related Articles:
o Felling the Giants, [Lumbering in the Mountains]
o Industrial Development: Lumber; Lime; Fishing

aa-001
Ansley Kullman Salz holding a sample of the leather made by his company, the A. K. Salz
Tannery in Santa Cruz, California.
Subject Headings and Keywords: Industries--Tanneries--Salz Tannery, Portraits--Men, Leather
Industry, Leather Garments, Leather Goods, Hides and Skins, Hats, Clothing and Dress, Eyeglasses

aa-002
The tanoak used to dye leather was stored in drying sheds across Highway 9 from the main Salz
Tannery complex.
Subject Headings and Keywords: Industries--Tanneries--Salz Tannery, Leather Industry, Drying
Sheds, Barns, Storage Facilities, Smokestacks, Equipment
The above two photographs are further examples of how additional information was provided to
describe a photograph, and they illustrate the use of subject headings and keywords to describe the
photographs.
ONTOLOGIES
Currás (2010, pp. 20-22) cites a number of wide-ranging definitions which describe
ontologies as catalogues; as a means to capture human knowledge based on common sense; as groups
of concepts; as a general framework which can display coherent organisation; and as the marriage of symbols
used in natural language and the entities that they represent in the real world. However, he found
the definition by Marco the most comprehensible: an ontology “is the systematic description of a
specific domain in accordance with the entities and processes that allow the description of ‘all’
things and processes”. This definition is supported by Hedden (2010, p. 12) when she describes
ontologies as “a level of abstraction of data models, analogous to hierarchical and relational
models.”
The following figure is used by Currás (2010, p. 23) to illustrate the differences and similarities
between ontologies and thesauri. An analysis of the figure shows that the main difference lies in the
structure and the existing relationships among terms. Thesaurus terms are hierarchically arranged
whereas the arrangement of terms in ontologies takes certain characteristics and properties of the
terms into consideration.
Hedden (2010, p. 12) explains that there can be any number of domain-specific types of relationship
pairs. The examples she gives include owns/belongs to, produces/is produced by, and has
members/is a member of. Currás (2010, p. 22) uses the MICROKOSMOS system as an example to
show how principal and subordinate classes are established:
• Objects
  o Physical order
  o Mental order
  o Social order
• Events
  o Physical order
  o Mental order
  o Social order
• Properties
  o Attributes (objects or events)
  o Relationships (with each other)
The structure of ontologies is therefore aimed at providing an order and a relation of terms which
are based on certain characteristics and properties (Currás, 2010, p. 23).
Currás observed that ontologies are useful when they are applied to machine translation systems since they
serve as a nexus between the words of the languages involved in order to find similarities or
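The relationship pairs mentioned by Hedden can be sketched as subject-relation-object statements in which every relation has a named inverse. The following Python fragment is only an illustration under my own assumptions: the entities (Publisher X, Journal Y, Society Z, Indexer A) are invented, and the names INVERSE and with_inverses are hypothetical rather than part of any ontology language.

```python
# Relationship pairs from Hedden's examples; each relation names its inverse.
INVERSE = {
    "owns": "belongs to",
    "produces": "is produced by",
    "has members": "is a member of",
}
# Make the lookup work in both directions.
INVERSE.update({inverse: relation for relation, inverse in list(INVERSE.items())})


def with_inverses(triples):
    """Return the given (subject, relation, object) triples plus their inverse statements."""
    result = set(triples)
    for subject, relation, obj in triples:
        result.add((obj, INVERSE[relation], subject))
    return result


# Invented example statements, used for illustration only.
facts = {
    ("Publisher X", "produces", "Journal Y"),
    ("Society Z", "has members", "Indexer A"),
}
```

Calling with_inverses(facts) adds, for example, the statement that Journal Y is produced by Publisher X. Typed relationships of this kind are what a purely hierarchical BT/NT structure does not express.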
equivalencies. This view is supported by Hedden (2010, p. 14) when she observes that ontologies are
becoming very important in semantic search engine deployment in specialised industries.
TAXONOMIES
The concept “taxonomy” means the science of classifying things. Hedden (2010, pp. 137-138)
explains that the concept was traditionally used for the classification of plants and animals, such as
the Linnaean classification system. However, it has lately become the preferred term for any
hierarchical classification or categorisation system. According to Hedden (2010, p. 138), the main
difference between a thesaurus and a taxonomy lies in the hierarchical relationships among terms.
For example, a given term in a thesaurus may or may not have a broader/narrower term relationship
with another term whereas all terms in taxonomies belong to a single large hierarchy that
encompasses all concepts of a certain class, category, or aspect. Furthermore, terms in a thesaurus
can have an equal relationship with other terms, e.g. dog breeds and cat breeds. Considering
taxonomies’ strict hierarchical structure, there can be no equal relationships in taxonomies.
Hedden (2010, p. 138) explains that taxonomies’ structures are sometimes referred to as “trees” and
the terms that are included in the taxonomy as “nodes”.
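The “tree” and “node” vocabulary can be illustrated with a few lines of Python. This is a minimal sketch in which the class name Node and the category labels are my own assumptions; it shows only that every node in a taxonomy has exactly one parent and therefore one path back to the top of the hierarchy.

```python
class Node:
    """One term in a taxonomy; every node except the root has exactly one parent."""

    def __init__(self, label, parent=None):
        self.label = label
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Labels from the root of the tree down to this node."""
        node, labels = self, []
        while node is not None:
            labels.append(node.label)
            node = node.parent
        return list(reversed(labels))

    def depth(self):
        """Number of levels from the root to this node (the root is level 1)."""
        return len(self.path())


# Hypothetical labels, used for illustration only.
root = Node("Animals")
mammals = Node("Mammals", root)
dogs = Node("Dogs", mammals)
cats = Node("Cats", mammals)
```

Here dogs.path() walks up through "Mammals" to "Animals", and depth() gives the number of levels a user would have to click through to reach the node.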
A Google search for taxonomies revealed the existence of taxonomies in different subject fields.
Examples of subject related taxonomies that were described in Wikipedia include science, education,
business and economics, information science and safety.
Science
Two scientific taxonomies that are discussed in Wikipedia are the Linnaean classification system
and a more modern biological classification system which is based on the Linnaean system.
(http://en.wikipedia.org/wiki/File:Linnaeus_-_Regnum_Animale_(1735).png)
Biological_classification_L_Pengo_vflip.svg
Wikipedia also grouped folk taxonomies under science taxonomies. These taxonomies are vernacular
naming systems. They represent the way in which people describe and organise their natural
surroundings and are generated from social knowledge and are used in everyday speech.
Education
In the field of education, Bloom’s Taxonomy of learning in action seems to be an important
taxonomy. His taxonomy of “learning in action” standardises learning objectives in an educational
environment. These are then subdivided into three “domains”, namely the cognitive, affective and
psychomotor domains. Through this division, Bloom had hoped to motivate educators to focus on all
three domains in their teaching. There seem to be a number of depictions based on his taxonomy. I
found his “wheel” in which he depicted “learning in action” quite different and interesting. In his
wheel, he listed a number of verbs and grouped them according to different types of assessment.
http://en.wikipedia.org/wiki/Bloom%27s_Taxonomy
Taxonomies in business and economics
In the business and economics field, corporate taxonomies are increasingly being used in business
information management systems, especially in their content management and knowledge
management systems. According to Wikipedia (2014), corporate taxonomies reflect the hierarchical
classification of entities of interest of an enterprise and they are used to classify documents,
products, processes, knowledge fields and human groups. Hedden (2010, p. 138) noted that these
taxonomies may or may not have the hierarchical structure that is generally associated with
traditional taxonomies such as the science taxonomies that were discussed above. She also found
that the taxonomies that are found on public websites should not have more than three or four
levels. The reason she gives is that users who are unfamiliar with a site typically only have the patience
to search through that many levels.
Safety
Safety taxonomies are standardised sets of terminologies which are used by safety and health care
workers. These taxonomies aim at standardising the terminology in these fields to avoid confusion
among safety and health care workers. Wikipedia (2014) indicates that numerous safety
taxonomies exist which analyse and classify human error and accident causes. One example which is
discussed in depth in Wikipedia is the Human Factors Analysis and Classification System (HFACS).
This system identifies the human causes of an accident, provides a tool to assist in the
investigation process, and is used in accident prevention training. Four different levels of analysis are
reflected in the HFACS taxonomy: unsafe acts; preconditions for unsafe acts; unsafe supervision; and
organisational influences. Each of these levels is then further subdivided into more categories.
Information and computer science
This is the last category of subject related taxonomy that was identified in Wikipedia that will be
discussed. Of the four different taxonomies in this category, I found “Taxonomies for search
engines” to be extremely relevant to this discussion. Currás (2010, pp. 46-48) refers to such taxonomies
as virtual taxonomies and cybernetic taxonomies. He describes a virtual taxonomy as an “intelligent
agent or an intelligent meta-search engine to be used for web pages”. As explained by Vicient,
Sánchez and Moreno (2013), taxonomies, thesauri and concept
hierarchies are crucial components of any information retrieval system. Currás (2010, p. 47) uses the
following figure to illustrate taxonomies in computing.
Hedden (2010, p. 203) identified two different ways in which controlled vocabularies or taxonomies
support Internet searches. These are
• through nonpreferred terms or synonym rings, or
• as browsable taxonomies.
As explained by Hedden (2010, pp. 201-202), ordinary search engines generally don’t make use of
taxonomies as it is basically impossible to create and maintain taxonomies that would organise all
the information that is available on the web. However, the search engine software for single sites,
such as web directories, may incorporate taxonomies. The directory for Microbial Life Educational
Resources (http://serc.carleton.edu/microbelife/resources/index.html) is an example of a site which
uses a browsable taxonomy.
Refine the Results
Subject: Biology
287 matches General/Other
• Astrobiology 92 matches
• Biogeochemistry 125 matches
• Diversity 141 matches
• Ecology 613 matches
• Evolution 211 matches
• Microbiology 814 matches
• Molecular Biology 174 matches
Resource Type
• Activities 123 matches
• Assessments 12 matches
• Course Information 20 matches
• Datasets and Tools 29 matches
• Audio/Visual 154 matches
• Computer Applications 18 matches
• Pedagogic Resources 61 matches
• Scientific Resources 700 matches
• Biographical Resources 4 matches
• Policy Resources 14 matches
Extreme Environments
• Alkaline 54 matches
• Acidic 56 matches
• Extremely Cold 53 matches
• Extremely Hot 116 matches
• Hypersaline 63 matches
• High Pressure 57 matches
• High Radiation 24 matches
• Anhydrous 32 matches
• Anoxic 66 matches
• Altered by Humans 66 matches
Ocean Environments
• Coastal and Estuarine 170 matches
• Shallow Sea Floor/Continental Shelf 30 matches
This taxonomy uses broad categories to organise website information and thereafter lists the taxonomy
terms that are used within each category. Each term is followed by the number of matches that are
available for it. These match counts are hypertext links, and clicking on
one takes the user to the actual sources of information.
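How a browsable taxonomy could arrive at its match counts can be sketched in Python. The resources below are invented and the function names facet_counts and refine are my own; this is an assumption about one simple way of producing such a display, not a description of how the Microbial Life directory works.

```python
from collections import Counter

# Invented resources, each tagged with taxonomy terms in two of the categories.
RESOURCES = [
    {"title": "Resource A", "Subject": ["Ecology"], "Resource Type": ["Activities"]},
    {"title": "Resource B", "Subject": ["Ecology", "Microbiology"], "Resource Type": ["Datasets and Tools"]},
    {"title": "Resource C", "Subject": ["Microbiology"], "Resource Type": ["Activities"]},
]


def facet_counts(resources, category):
    """Count how many resources carry each taxonomy term in one category."""
    counts = Counter()
    for resource in resources:
        counts.update(resource.get(category, []))
    return counts


def refine(resources, category, term):
    """Keep only the resources indexed with `term`, as clicking a match count would."""
    return [r for r in resources if term in r.get(category, [])]
```

Calling facet_counts(RESOURCES, "Subject") yields the numbers that would be displayed next to each term, and refine() returns the resources behind one of those numbers.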
The synonym rings for each concept in a controlled vocabulary include terms that are likely to be
searched and terms that are likely to appear in the content. Hedden (2010, p. 203) explains that
synonym rings do not display the taxonomy terms in the user interface, whereas taxonomies that do
distinguish between preferred and nonpreferred terms usually display the preferred terms. Hedden
(2010, p. 204) provides the following example of how a synonym ring supports an information
search for a concept:
Users might enter:
Oil industry
Oil & gas industry
Oil & gas industries
Petroleum industry
Synonym ring contains all:
Oil industry
Oil & gas industry
Oil and gas industry
Oil & gas industries
Oil and gas industries
Petroleum industry
Oil companies
Big oil
Oil producers
Petroleum companies
Text may contain:
Oil and gas industry
Oil companies
Big oil
Oil producers
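The synonym ring above can be sketched as a simple query expansion step. This Python fragment is only an illustration: the ring terms are taken from Hedden's example, but the documents, the function names expand and search, and the plain substring matching are my own assumptions and are far simpler than what a real search engine does.

```python
# All terms in the ring are treated as equivalent; none of them is "preferred".
SYNONYM_RING = frozenset(term.lower() for term in [
    "Oil industry", "Oil & gas industry", "Oil and gas industry",
    "Oil & gas industries", "Oil and gas industries", "Petroleum industry",
    "Oil companies", "Big oil", "Oil producers", "Petroleum companies",
])


def expand(query, rings=(SYNONYM_RING,)):
    """Return every term in the ring that contains the query, or just the query itself."""
    q = query.strip().lower()
    for ring in rings:
        if q in ring:
            return set(ring)
    return {q}


def search(query, documents):
    """Return the ids of documents whose text contains any variant of the query."""
    variants = expand(query)
    return [doc_id for doc_id, text in documents.items()
            if any(variant in text.lower() for variant in variants)]


# Invented documents, used for illustration only.
DOCUMENTS = {
    "d1": "Big oil reports record profits.",
    "d2": "Wind farms expand along the coast.",
    "d3": "New rules for the oil and gas industry.",
}
```

A search for "Petroleum industry" then retrieves d1 and d3 even though neither document contains that phrase, which is the support a synonym ring gives to the user.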
However, due to modern automated indexing technologies, search engines have become more
sophisticated. According to Hedden (2010, p. 205), automated indexing technologies generally
follow two basic approaches: information extraction and auto-categorisation.
Information extraction, also known as web mining, is a technique that is used by search engine
crawlers to collect data from sources and add them to the search engines’ indexes (Web Data
Mining.net, 2014). The data is typically collected from the metadata in the websites’ headers
(these are not visible to the users) and other hyperlinks within a website. The data mining software
focuses on identifying which key names, concepts, and data in the metadata and text of the
documents are significant in comparison with those with a mere passing mention (Hedden, 2010,
p. 205). Hedden (2010, p. 205) compares this process to book indexing as it, according to her, seeks
to identify significant names and concepts within chunks of texts. She also notes that data extraction
or data mining does not necessarily use a taxonomy but when it does, it usually uses a simple
synonym ring.
Auto-categorisation, on the other hand, seeks to categorise each document based on what it is
fundamentally about. In order to do so, Rouse (2005) noted that web mining software uses data
patterns from the information that was retrieved for a specific query to identify similar data patterns
in other sources which could also be relevant to the search. The data patterns identified in this
manner then form the parameters which are applied to new information searches. By
continuously “learning” from new data patterns that can be linked to data patterns already
identified in its databases, search engines build their own web taxonomies. Hedden (2010, p. 205)
compares this process of auto-categorisation to database indexing where one or more taxonomy or
thesaurus terms are assigned to describe the subject content of a document.
Similarities and differences between taxonomies and thesauri
Currás (2010, pp. 51-52) identified some similarities and differences between taxonomies and
thesauri. According to him, the most obvious similarities lie in the fact that they are both controlled
vocabularies which are used in an information retrieval system. They are also used for the
systematisation of knowledge using scientific, logical and coherent methods that are established
through a set of predetermined rules. Both thesauri and taxonomies are pre- and post-coordinate
systems made up of terms that are derived from documents.
The first and primary difference between thesauri and taxonomies that was identified by Currás
(2010, pp. 51, 53) is that information technologies are almost exclusively used to develop taxonomies
whereas thesauri can be constructed manually. In instances where computer programs are used to
construct thesauri, an in-depth knowledge of the mechanisms, techniques, theory and practice of
thesaurus construction is still needed. The differences can also be seen in the use that is made of
thesauri and taxonomies. Thesauri are used by information specialists whereas taxonomies are
almost exclusively used by computer specialists working within the business field.
The existing differences and similarities between taxonomies and thesauri are illustrated in the
following figure (Currás, 2010, p. 52).
FOLKSONOMIES
The term “folksonomies” actually stands for “folk taxonomies” which suggests that they are created
by ordinary information users as opposed to experts in a subject field (Goh, 2012, p. 75). This is why
Hedden (2010, p. 193) could explain folksonomies as being created and used by authors and/or
users of information content. Stock and Stock (2013, p. 611) contributed to this description when
they described “folksonomies” as the free allocation of keywords by anyone and everyone in an
information system. Goh (2012, p. 75) and Hassan-Montero and Herrero-Solana (2006) use the
concepts “social tagging” and “collaborative tagging” to describe the process of assigning keyword
tags to documents. This phenomenon is also known as social bookmarking,
social classification, social indexing, or ethnoclassification (Hedden, 2010, pp. 194-195).
According to Hassan-Montero and Herrero-Solana (2006), folksonomies are a form of crowdsourced
(meta) data which provides a different mode of access to the content in digital libraries. The
inclusion of tags in an information system therefore facilitates the taggers’ future access to
resources (Macgregor & McCulloch, 2006).
The main difference between thesauri, ontologies, taxonomies and folksonomies lies in the
development of these vocabularies, in who creates them and their structure. Whereas thesauri,
ontologies and taxonomies are created by experts in the field of information organisation,
folksonomies are created by users of information and the language of the user becomes important
in their development. Thesauri are mainly developed by humans whereas ontologies, modern
taxonomies and folksonomies are computer generated. Lastly, folksonomies reflect no hierarchical
structure and there are no directly specified parent-child-like relationships. They are merely a set of
terms that are used by a group of people to describe information sources.
Three different aspects seem to be important in the development of folksonomies:
• the tags (i.e. the words that are used to describe a document);
• the documents that need to be described; and
• the users who perform the indexing task.
Social tagging and the creation of social networks
Stock and Stock (2013, p. 613) explain that the users and the documents are connected in a
social network where the documents are thematically linked if they have been indexed via the same
tags and the documents are also coupled via shared users. The users are similarly thematically
connected if they use the same tags and are coupled via shared documents. They explain the
development of social networks through social tagging as follows:
• Documents are generally indexed via several tags and with differing degrees of frequencies.
When two tags co-occur in a single document, they are regarded to be interlinked. By using
interlinked tags, information systems compile tag clusters which represent networks of
folksonomies.
• Personomies develop from indexed documents that were tagged by the same person and
these personomies then support folksonomy-based recommender systems where the
information system (search engine) makes recommendations to their users for documents,
for users and for tags. In these instances the personomy is chosen as a point of reference
and the recommended tags are tags that were previously used by the current user.
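The two mechanisms described above, interlinked tags and personomies, can be sketched from a list of (user, document, tag) assignments. The following Python fragment is a minimal illustration under my own assumptions: the users, documents and tags are invented, and the function names are hypothetical rather than taken from Stock and Stock.

```python
from collections import Counter
from itertools import combinations

# Invented (user, document, tag) assignments, used for illustration only.
ASSIGNMENTS = [
    ("ann", "doc1", "taxonomy"),
    ("ann", "doc1", "indexing"),
    ("ben", "doc1", "taxonomy"),
    ("ben", "doc2", "tagging"),
    ("ann", "doc2", "indexing"),
]


def tags_by_document(assignments):
    """Collect the set of distinct tags assigned to each document."""
    documents = {}
    for _user, doc, tag in assignments:
        documents.setdefault(doc, set()).add(tag)
    return documents


def co_occurrence(assignments):
    """Count, for each pair of tags, in how many documents the two tags co-occur."""
    counts = Counter()
    for tags in tags_by_document(assignments).values():
        for pair in combinations(sorted(tags), 2):
            counts[pair] += 1
    return counts


def personomy(assignments, user):
    """The tags one user has assigned, with the number of times each was used."""
    return Counter(tag for u, _doc, tag in assignments if u == user)
```

Pairs with a non-zero count in co_occurrence() are the interlinked tags from which tag clusters can be compiled, and personomy() returns the tags a recommender could offer to the same user again.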
Websites or services that make use of social tagging include social bookmarking management sites
such as Delicious (delicious.com), Connotea (www.connotea.org) and Diigo (www.diigo.com), as well as Flickr
(www.flickr.com) and Facebook (facebook.com). The option in social library catalogues such as
LibraryThing.com and online vendors such as Kalahari.net to review a book or an information
resource online that you have read or bought is nothing other than a form of social tagging. These
reviews are then used to “promote” the book to other possible readers or consumers. However, it is
not only commercial and social catalogues that allow for the use of social tagging. The following is an
edited version of the entry in the Unisa Library’s catalogue for the book by Hedden:
The accidental taxonomist / Heather Hedden
Printed Material
Hedden, Heather.
Medford, N.J. : Information Today, c2010
Community Tags
Add a Tag
When I clicked on the “Add a Tag” button, the system required me to identify myself before I could
tag the record and submit my tag to the system.
Advantages and disadvantages of folksonomies
Stock and Stock (2013, p. 617) listed some of the advantages of folksonomies that were identified by
Peterson (2006) and Shirky (2005). These include:
tags represent user-specific interpretations of documents, which in turn allow users to
interpret documents from various points of view. These could be scientific, ideological or
cultural.
tagging can represent a kind of quality control: the more people tag a document, the more
important the content appears to be. When used as a quality control measure, folksonomies
support users in two ways: first, when they retrieve documents that are relevant
to an information search and, secondly, when they browse an information system to see what
is available or to exploit serendipitous information discovery.
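The quality-control idea can be sketched in a few lines of Python. The sketch assumes that "the more people tag a document" is measured as the number of distinct users who tagged it. That measure, the sample data and the function names are illustrative assumptions and are not prescribed by the sources cited above.

```python
from collections import defaultdict

# Hypothetical tagging data: (user, document, tag) triples.
TAGGINGS = [
    ("ann", "doc1", "taxonomy"),
    ("ben", "doc1", "taxonomy"),
    ("cat", "doc1", "indexing"),
    ("ann", "doc2", "taxonomy"),
]


def tagger_counts(taggings):
    """Count how many distinct users tagged each document."""
    taggers = defaultdict(set)
    for user, doc, _tag in taggings:
        taggers[doc].add(user)
    return {doc: len(users) for doc, users in taggers.items()}


def search(taggings, query_tag):
    """Retrieval: documents carrying query_tag, most widely tagged first."""
    counts = tagger_counts(taggings)
    hits = {doc for _user, doc, tag in taggings if tag == query_tag}
    return sorted(hits, key=lambda doc: (-counts[doc], doc))


def browse(taggings):
    """Browsing: all documents, most widely tagged first."""
    counts = tagger_counts(taggings)
    return sorted(counts, key=lambda doc: (-counts[doc], doc))
```

Here `search(TAGGINGS, "taxonomy")` places doc1 ahead of doc2 because three users tagged doc1 and only one tagged doc2. The same counts drive `browse`, which corresponds to the second, exploratory use mentioned above.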
In addition to the advantages that were identified by Peterson and Shirky, Hedden (2010, p. 195)
also identified some advantages folksonomies have over taxonomies. These include:
folksonomies reflect trends, are up to date, and can monitor change and popularity
folksonomies are cheaper and quicker to develop than taxonomies are to build and maintain
they are responsive to user needs
they facilitate democracy (as in votes for popular content and popular tags), the distribution
of tasks, and the building of virtual communities of shared interest and knowledge.
The disadvantages of folksonomies listed by Peterson, Shirky and Hedden include:
lack of precision: different word forms and abbreviations are used for the same concept.
There is no control of synonyms and homonyms, and typos are frequent.
users have different tasks and approach documents with different motives and the
documents are located in different cognitive contexts, but they do not share a common
indexing level.
tags can be biased as users may disagree with prior tagging
users may index documents in their own language (e.g. Cape Town versus Kaapstad) without
bothering to translate.
homonyms that span different languages are not separated, e.g. Gift in German (poison) and
English (present).
users don’t always distinguish between content indexing and formal descriptions.
tags could contain value judgments (e.g. stupid, or nice)
tags could describe planned activities (e.g. to read) rather than being evaluative of the content
syncategorematic tags, which only make sense in relation to the tagger, e.g. tagging a photo in Facebook with “me”
spam tags which have nothing to do with the contents of the document but which are
intended to mislead users.
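Some of these problems can be reduced by cleaning tags before they are used for retrieval. The Python sketch below is a hedged illustration only: the word lists are hypothetical and are built from the examples given in the list above, and the function names are my own.

```python
# Hypothetical word lists, built only from the examples in the list above.
VALUE_JUDGMENTS = {"stupid", "nice"}
PLANNED_ACTIVITIES = {"to read"}
SELF_REFERENCES = {"me"}

# Hypothetical mapping from language variants to one preferred form.
PREFERRED_FORMS = {"kaapstad": "cape town"}


def normalise(tag):
    """Lower-case the tag, collapse whitespace and apply the preferred form."""
    cleaned = " ".join(tag.lower().split())
    return PREFERRED_FORMS.get(cleaned, cleaned)


def classify(tag):
    """Label a tag as a known problem type, or as a content tag."""
    cleaned = normalise(tag)
    if cleaned in VALUE_JUDGMENTS:
        return "value judgment"
    if cleaned in PLANNED_ACTIVITIES:
        return "planned activity"
    if cleaned in SELF_REFERENCES:
        return "syncategorematic"
    return "content"


def content_tags(tags):
    """Keep only content tags, normalised and de-duplicated."""
    return sorted({normalise(tag) for tag in tags if classify(tag) == "content"})
```

With these lists, `content_tags(["Kaapstad", "Cape  Town", "nice", "to read", "me"])` reduces five user tags to the single content tag "cape town". Simple lists of this kind cannot separate cross-language homonyms such as Gift, and they cannot detect spam tags, which is one reason why folksonomies are usually combined with other methods of knowledge representation.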
Peters (2006, in Stock & Stock, 2013, p. 618) is of the view that the exclusive use of folksonomies in
professional environments such as corporate knowledge management systems cannot be
recommended. However, if folksonomies are combined with other methods of knowledge
representation, their advantages outweigh the disadvantages. Furthermore, to be effective, social
tagging requires a critical mass of user involvement.
CONCLUSION
In this presentation, I addressed a number of controlled vocabularies and some not so controlled
vocabularies and discussed the role each of these has in information organisation and retrieval. The
discussion on modern taxonomies and folksonomies hopefully also helped to create an
understanding of how intelligent search engines come up with all the suggested sources that could
possibly be relevant to an information search.
Bibliography
Benzon, W. (1996). Culture as an evolutionary arena. Journal of Social and Evolutionary Systems,
19(4), 321-362.
Burger, M. (2005). Thesaurus construction. In J. A. Kalley, E. Schoeman, & M. Burger (Eds.), Indexing
for southern Africa: a manual compiled in celebration of ASAIB's first decade 1994-2004 (pp.
159-188). Pretoria: University of South Africa.
Curras, E. (2010). Ontologies, taxonomies and thesauri in systems science and systematics. Oxford:
Chandos.
Fourie, I., & Burger, M. (2005). Verbal subject description. In J. A. Kalley, E. Schoeman, & M. Burger
(Eds.), Indexing for southern Africa: a manual compiled in celebration of ASAIB's first decade
1994-2004 (pp. 53-67). Pretoria: University of South Africa.
Garcia, P. A., Martin-Moncunill, D., Sanchez-Alonso, S., & Garcia, A. F. (2014). A usability study of
taxonomy visualisation user interfaces in digital repositories. Online Information Review,
38(2), 284-304.
Goh, D. H. (2012). Collaborative search and retrieval in digital libraries. In G. G. Chowdhury, & S. Foo
(Eds.), Digital libraries and information access: research perspectives (pp. 69-82). Chicago:
Neal-Schuman.
Hassan-Montero, Y., & Herrero-Solana, V. (2006). Improving tag-clouds as visual information
retrieval interfaces. Proceedings of the International Conference on Multidisciplinary Information
Sciences and Technologies.
Hedden, H. (2010). Controlled vocabularies, thesauri, and taxonomies. In J. Perlman, & E. L. Zafran
(Eds.), Index it right: advice from the experts (pp. 135-154). Medford, N.J.: Information
Today.
Hedden, H. (2010). The accidental taxonomist. Medford, NJ: Information Today.
Hwang, S. Y., Yang, W. S., & Ting, K. D. (2010). Automatic index construction for multimedia digital
libraries. Information Processing and Management, 46, 295-307.
Macgregor, G., & McCulloch, E. (2006). Collaborative tagging as a knowledge organisation and
resource discovery tool. Library Review, 55(5), 291-300.
Mamassion, L. (2010). Through the looking glass: a freelance perspective on database indexing. In J.
Perlman, & E. L. Zafran (Eds.), Index it right!: advice from experts. Vol. 2 (pp. 99-110).
Medford, NJ: Information Today in association with the American Society for Indexing.
Rouse, M. (2005). Web mining. Retrieved March 22, 2014, from
http://searchcrm.techtarget.com/definition/Web-mining
Stock, W. G., & Stock, M. (2013). Handbook of information science. Berlin: De Gruyter.
Titangos, H. L. (2013). Local community in the era of social media technologies: a global approach.
Oxford: Chandos.
Vicient, C., Sanchez, D., & Moreno, A. (2013). An automatic approach for ontology-based extraction
from heterogeneous textual resources. Engineering Applications of Artificial Intelligence,
26(3), 1092-1106.
Web data mining. (n.d.). Retrieved March 22, 2014, from Web Data Mining.net: http://www.webdatamining.net/