Indexing in Database Systems

Databases as a tool for the content specialist
Applying databases to indexing in the information specialists’ world
Johan van Wyk (M.Bibl.; BA Hons History, THED)
When you look into the literature on indexing in databases, you are confronted with terms
such as B-tree, TFSG parsing, clustered indexes, filtered indices etc. Those are all techniques
and tools used to achieve retrieval of information in database systems. But this is not at all
our concern here today. Let’s just say that we accept this as a given (like we do with most
technology we don’t understand!). As information professionals in the information industry,
what is our concern with database systems?
Database systems can be arranged on a continuum between DBMS’s and full text systems.
At the one extreme are the DBMS’s: very structured, with good sorting and report facilities,
but awful with textual information. They were never built to handle it. Examples:
DBMS’s: Oracle, SQL
At the other extreme are the full text systems. They were specifically built for textual
documents. Examples of those systems are BRS Search, Brainware and you could include
Internet search engines such as AltaVista.
The distinguishing factors are:
DBMS’s:
• Structure
• Report facility to manipulate the output by sorting the structure elements
• Field lengths often limited
• Searchable fields are limited
• Complicated search
Full text systems:
• The full document is the unit of information
• Every word is searchable
• No formal structure
Then we find a third category in the industry: text retrieval systems. Text retrieval systems
were developed to get the best of both worlds: the structured elements of DBMS’s and the
ability to handle full text. Thus we find that text retrieval systems have the following
characteristics or features:
• Structure facilitated by means of fields
• Report facility
• All fields of variable length
• Full text indexing of all fields
Interestingly, though, text retrieval systems were developed before full text systems. Some
of the reasons were computing power, retrieval techniques and (to us “obviously”) the needs
of information specialists. One is always amazed at the searching functionality of systems like
IBM’s “STAIRS” (1972) and the online systems used by BRS, Medline and Dialog in the same
period.
To be able to retrieve information in these systems, programming techniques such as the B-tree
were developed. These techniques have evidently been very successful: the volume of data is no
longer a problem. Today, as you all know, we search the Internet and are amazed at the
amount of information retrieved. And then we are confused and irritated by the overdose.
And that is exactly what the core function of this profession is: to make sense of the masses
of information. The user is not interested in being the world champion in the number of items
retrieved. The question is whether his need is being addressed: the retrieval of relevant
information is what matters.
The IT sector has now reached a point where the indexing of masses of data is “solved”. What
more can you want when you have full text indexing? Why is the user now confused or
irritated? The reason is that the information need was not met: finding relevant information.
Relevance is a very difficult concept to tie down. When testing information retrieval
performance, this is the crucial variable to define, because it could skew your research totally.
Retrieval performance is measured using two parallel measures, both relying on
the judgement of relevance:
• Recall (the ability to retrieve relevant information)
• Precision (the ability to withhold non-relevant information from the information
retrieved)
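
As a minimal sketch of the two measures, recall can be read as the share of the relevant items that were retrieved, and precision as the share of the retrieved items that are relevant. The function name and the sample item numbers below are illustrative assumptions, not taken from any particular system:

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one query.

    retrieved: set of item ids the system returned
    relevant:  set of item ids a person judged relevant
    """
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical query: 4 relevant items exist, the system returns 10 items,
# of which 3 are relevant (item 42 was missed).
retrieved = set(range(1, 11))
relevant = {2, 5, 9, 42}
r, p = recall_precision(retrieved, relevant)
print(f"recall={r:.2f} precision={p:.2f}")
```

Both figures depend entirely on the relevance judgements that go in, which is why the definition of relevance is the crucial variable.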
The technology used for full text indexing greatly enhances recall, to the detriment of
precision. Information retrieval research therefore started focusing on techniques for
manipulating the output. Here we find techniques such as “relevance ranking”, widely used
in, for example, Internet search engines. These techniques have been quite successful since
the necessary computing power became available.
In these text retrieval systems we work with words, word stems or phrases. Almost all of the
relevance ranking techniques are based upon the work of Karen Spärck Jones in Cambridge
(UK). Spärck Jones brought us what became known as the term frequency–inverse document
frequency theory. The tf–idf weight is a statistical measure used to evaluate how important
a word is to a document within a collection of information items. The
importance increases proportionally to the number of times a word appears in the document
but is offset by the frequency of the word in the collection. An unpublished study at Syracuse
University (USA) in 1984 tested 29 variations of the tf–idf weighting scheme to determine the
difference between these variations and came to the conclusion that there was no significant
difference.
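
The weighting idea can be sketched as follows. This is only one of the many variations of the scheme (raw term count multiplied by the logarithm of the number of documents divided by the document frequency); the function name and the three sample documents are assumptions made for illustration:

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    documents: list of word lists. Weight = tf * log(N / df), where tf is
    the count of the term in the document, N the number of documents and
    df the number of documents that contain the term.
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))   # count each term once per document

    weights = []
    for words in documents:
        term_freq = Counter(words)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights


# Hypothetical mini collection.
docs = [
    "index terms describe the index".split(),
    "the thesaurus controls index terms".split(),
    "the user searches the collection".split(),
]
weights = tf_idf(docs)
for w in weights:
    print(sorted(w.items(), key=lambda kv: -kv[1])[:3])
```

A word such as “the”, which occurs in every document of this small collection, receives a weight of zero, while a word that is frequent in one document but rare in the collection is weighted up.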
Another technique used for retrieval is “fuzzy set theory” or “fuzzy logic”. This technique
attempts to retrieve information without being bound to specific words or spelling. Here too
the result is greatly enhanced recall rather than precision.
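
The sketch below does not implement fuzzy set retrieval as such. It shows only the simpler, related idea of spelling-tolerant lookup, using approximate string matching from Python’s standard `difflib` module as a stand-in. The vocabulary and the function name are assumptions made for illustration:

```python
import difflib

# Hypothetical index vocabulary.
vocabulary = ["thesaurus", "thesauri", "indexing", "indexer", "precision"]


def fuzzy_lookup(query, vocabulary, cutoff=0.8):
    """Return index words that are similar to a (possibly misspelt) query."""
    return difflib.get_close_matches(query.lower(), vocabulary, n=5, cutoff=cutoff)


print(fuzzy_lookup("thesarus", vocabulary))   # misspelt "thesaurus"
print(fuzzy_lookup("indexng", vocabulary))    # misspelt "indexing"
print(fuzzy_lookup("indexng", vocabulary, cutoff=0.5))  # looser match
```

Lowering the `cutoff` lets more words through, which illustrates the trade-off mentioned above: more is retrieved, but more of it is unwanted.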
But all of these techniques are on the output side. In the indexing profession we are on the
input side. We would like to see that the entries created for retrieval improve the ability to
retrieve relevant information.
Words and textual “strings” (multi-word sequences or phrases) are used to index and search
these systems. Words are obviously just representatives of concepts. The user actually needs
information about a concept. This is exactly what the information professional tries to achieve,
namely to identify what the information item is “about”. Hutchins (1978) coined the term
“aboutness” to describe this phenomenon. This is exactly where the indexer performs his
skill: to create retrieval elements for optimising relevance. It is all about the process of
identifying and creating index entries to represent the “aboutness” of a concept.
One of the areas investigated since the 1970s was the application of linguistic theory. The
main problem here was that linguistic theory only gave answers about the function of words up
to the unit of a sentence. Very little applicable theory was available for working with whole
documents or even collections of textual items.
Today we want to look at what database systems can offer us on the input side.
Database systems (specifically text retrieval systems) give us the following features:
• The ability to house and work with a collection of information items
• The structure for organising or categorising the meta-data of a collection
• The ability to retrieve full text: every word
• The ability to retrieve items specific to individual meta-data categories
• Searching functionality: Boolean, word stem and proximity searching
• Word and string indexes related to the collection as a whole or within a field
• Batch modification of indexes
• Some even give us a thesaurus capability
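
To make the searching functionality concrete, here is a minimal sketch of a word index that records the position of every word, with a Boolean AND search, a word stem (truncation) search and a proximity search built on top of it. The function names and sample records are assumptions; real text retrieval systems implement these features far more efficiently:

```python
from collections import defaultdict


def build_index(records):
    """Map each word to {record_id: [positions]} (a positional word index)."""
    index = defaultdict(lambda: defaultdict(list))
    for record_id, text in records.items():
        for position, word in enumerate(text.lower().split()):
            index[word][record_id].append(position)
    return index


def search_and(index, *words):
    """Boolean AND: records that contain every word."""
    postings = [set(index.get(w, {})) for w in words]
    return set.intersection(*postings) if postings else set()


def search_stem(index, stem):
    """Word stem search: records with any word starting with `stem`."""
    found = set()
    for word, postings in index.items():
        if word.startswith(stem):
            found.update(postings)
    return found


def search_near(index, first, second, distance):
    """Proximity: records where the two words occur within `distance` words."""
    found = set()
    for record_id in search_and(index, first, second):
        if any(abs(p - q) <= distance
               for p in index[first][record_id]
               for q in index[second][record_id]):
            found.add(record_id)
    return found


# Hypothetical records.
records = {
    1: "subject indexing of full text documents",
    2: "full reports on indexing and text retrieval",
    3: "indexers build a thesaurus",
}
index = build_index(records)
print(search_and(index, "full", "text"))      # both words anywhere
print(search_near(index, "full", "text", 1))  # the words next to each other
print(search_stem(index, "index"))            # indexing, indexers, ...
```

The proximity search is the more precise of the two: record 2 contains both “full” and “text”, but not as a phrase.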
Because text retrieval database systems build indexes, we can use these to identify more
successful index terms. Here I would suggest using the elements of the term frequency–
inverse document frequency theory: the word frequencies. This theory acknowledges the
distinctive roles of a word in a document and in a collection, and of the documents related
to a word. Automatic indexing theory identifies them as:
• Within document frequency (WDF)
• Collection frequency (CF)
• Document frequency (DF)
WDF gives us the frequency of a word in a document. Frequencies that are too low or too high
indicate non-significant words. We need to look at the medium to higher frequencies. The focus
here is the significance and role of a term within a document.
CF gives you the frequency in a collection. Again, the medium to higher frequencies are
significant. The focus here is the significance and role of a term within a document collection.
DF shows the number of documents linked to a word. Again, the medium to higher frequencies are
significant. The focus here is the significance of a document in a collection with regard to a
specific term or concept.
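
The three frequencies can be sketched in a few lines. The function name and the small sample collection are assumptions made for illustration:

```python
from collections import Counter


def frequency_profile(collection):
    """Compute the three frequencies for a collection of documents.

    collection: {doc_id: list of words}
    Returns (wdf, cf, df) where
      wdf[doc_id][word] = occurrences of the word in that document
      cf[word]          = occurrences of the word in the whole collection
      df[word]          = number of documents containing the word
    """
    wdf = {doc_id: Counter(words) for doc_id, words in collection.items()}
    cf = Counter()
    df = Counter()
    for counts in wdf.values():
        cf.update(counts)          # add the per-document counts
        df.update(counts.keys())   # one per document containing the word
    return wdf, cf, df


# Hypothetical collection of three short documents.
collection = {
    "a": "mining law mining rights".split(),
    "b": "mining history".split(),
    "c": "water law".split(),
}
wdf, cf, df = frequency_profile(collection)
print(wdf["a"]["mining"], cf["mining"], df["mining"])
```

For the word “mining” this gives a WDF of 2 in document “a”, a CF of 3 and a DF of 2.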
Text retrieval systems, because they build indexes, can give us these word frequencies.
Through the knowledge of word frequency theory one can assess the significance of terms.
Text retrieval systems also allow us to incorporate full text into one of the fields alongside
the meta-data.
Indexers tend to focus on index terms related to a specific item of information, that is, on the
“within document frequency”. One can therefore say that this is the element best catered for,
namely the role and significance of a term within an information item.
The other two elements should be a concern.
The significance in a collection is often taken care of by:
• Using an indexer that “knows the subject area”
• The indexer’s knowledge of the collection
• The indexer’s knowledge of the users, company or environment.
All three of these are applied subconsciously and are often not really taken care of.
The third element, namely the role or significance of a document in a collection with regard
to a specific term, is often not even considered. An analysis of the subject coverage of the
documentation used in a countrywide HSRC study in 1988 gave alarming results. This is not
apparent from the study’s reports.
We can use text retrieval database systems to determine better index terms in the following
way:
With single documents one can do an analysis of the term frequencies within a document.
Most text retrieval systems can load the full text of a single document into a record in the
database, generating the word frequencies in the index. Stop word lists can successfully be applied.
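
Outside a text retrieval system, the same single-document analysis can be sketched as follows. The stop word list, the frequency band and the function names are assumptions made for illustration; a real stop word list would be much longer:

```python
import re
from collections import Counter

# A tiny illustrative stop word list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def term_frequencies(text, stop_words=STOP_WORDS):
    """Count the words of one document, ignoring case and stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in stop_words)


def candidate_terms(frequencies, low, high):
    """Keep words whose frequency lies in the band low..high (inclusive)."""
    return sorted(w for w, n in frequencies.items() if low <= n <= high)


# Hypothetical document text.
text = ("The thesaurus is a tool. The indexer uses the thesaurus to choose "
        "terms, and the thesaurus links terms to concepts.")
freq = term_frequencies(text)
print(freq.most_common(3))
print(candidate_terms(freq, 2, 3))
```

Where to set the band is a judgement for the indexer. The sketch only produces the list of candidates.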
With collections of documents the text retrieval database systems can be used to:
• Determine collection frequencies of terms in a collection, within or across fields, before
indexing a new document
• Determine document frequencies of terms (the number of documents relating to
the index term) in a collection when indexing a document.
The structure of database systems can successfully be used to:
• Assign index terms to a specific meta-data element, increasing the specific role of
a term
• Create database fields that would enhance the role of index terms. An example
here is where the subject field is divided into a field of broad terms from a controlled
vocabulary and a separate field for free text indexing, with perhaps a further separate
field for names of persons, places and companies
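
The effect of such a field structure can be sketched as an index keyed on field and term together, so that the same word plays a different role depending on the field it was assigned to. The field names, records and function names are assumptions made for illustration:

```python
from collections import defaultdict

# Hypothetical records: each subject-related field holds its own kind of term.
records = {
    1: {"controlled": ["mining"], "free_text": ["acid mine drainage"],
        "names": ["Limpopo"]},
    2: {"controlled": ["water supply"], "free_text": ["mining towns"],
        "names": ["Johannesburg"]},
}


def build_field_index(records):
    """Map (field, term) to the set of record ids carrying that term."""
    index = defaultdict(set)
    for record_id, fields in records.items():
        for field, terms in fields.items():
            for term in terms:
                index[(field, term.lower())].add(record_id)
    return index


def search_field(index, field, term):
    """Return records where `term` was assigned in `field` specifically."""
    return index.get((field, term.lower()), set())


index = build_field_index(records)
print(search_field(index, "controlled", "mining"))  # only record 1
print(search_field(index, "names", "Limpopo"))
```

A search on the controlled field finds only the record that is about mining as a broad subject, not the record that merely mentions mining towns in its free text terms.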
Most text retrieval systems have the ability to do batch changes in indexes. When Northern
Province became Limpopo, how many of the information services made that change?
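
A batch change of this kind can be sketched as follows. The record layout, the field name “places” and the function name are assumptions made for illustration; each system has its own facility for this:

```python
def batch_rename(records, field, old_term, new_term):
    """Replace old_term with new_term in one field of every record.

    Returns the number of records changed. Records are modified in place.
    """
    changed = 0
    for fields in records.values():
        terms = fields.get(field, [])
        if old_term in terms:
            # Replace, and avoid a duplicate if new_term is already there.
            updated = [new_term if t == old_term else t for t in terms]
            fields[field] = list(dict.fromkeys(updated))
            changed += 1
    return changed


# Hypothetical records with a field for place names.
records = {
    1: {"places": ["Northern Province", "Polokwane"]},
    2: {"places": ["Gauteng"]},
    3: {"places": ["Limpopo", "Northern Province"]},
}
changed = batch_rename(records, "places", "Northern Province", "Limpopo")
print(changed, records)
```

Record 3 shows why a batch change needs some care: the new term was already present, so the old one is dropped rather than duplicated.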
As mentioned, some text retrieval database systems have a thesaurus feature. Some are linked
to an existing thesaurus, some allow the user or indexer to build a thesaurus. Building your
own thesaurus can be of great value in narrower subject fields. This could be a very valuable
tool for the indexer to use on the input side.
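
A self-built thesaurus can be sketched as a simple table of relationships. The entries, the relationship labels (USE for a reference to the preferred term, NT for narrower terms) and the function names are assumptions made for illustration:

```python
# Hypothetical thesaurus entries: USE points a non-preferred term to the
# preferred one; NT lists the narrower terms of a preferred term.
THESAURUS = {
    "Northern Province": {"USE": "Limpopo"},
    "Limpopo": {"NT": ["Polokwane"]},
    "mining": {"NT": ["gold mining", "coal mining"]},
}


def preferred_term(term, thesaurus=THESAURUS):
    """Follow a USE reference, if any, to the preferred term."""
    return thesaurus.get(term, {}).get("USE", term)


def expand(term, thesaurus=THESAURUS):
    """Return the preferred term plus its narrower terms."""
    preferred = preferred_term(term, thesaurus)
    return [preferred] + thesaurus.get(preferred, {}).get("NT", [])


print(preferred_term("Northern Province"))
print(expand("mining"))
```

On the input side `preferred_term` keeps the indexer consistent. The same table can serve the searcher through `expand`.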
Lastly, we can use text retrieval systems to test the effectiveness of our indexing, using the
performance measures of recall and precision. Information services should regularly
reflect on the effectiveness of the indexes of a collection by analysing and rectifying
problems.
Database systems, and specifically text retrieval database systems, have developed to a
standard where they are very efficient, with a whole collection of features relevant to the
input side of indexing. They are therefore a powerful tool in the hands of the indexer.
Bibliography
Bertino, E., B. C. Ooi, R. Sacks-Davis, K-L Tan, J. Zobel, B. Shidlovsky, and B. Catania. 1997.
Indexing Techniques for Advanced Database Systems. Kluwer Academic Publishers.
Elmasri, R. and S. Navathe. 2000. Fundamentals of Database Systems. Addison-Wesley.
Graf, P. 1996. Term Indexing. Springer.
Hutchins, W. J. 1978. The concept of 'aboutness' in subject indexing. Aslib Proceedings 30 (5),
May 1978, p. 172-181.
Manning, C. D., P. Raghavan, and H. Schütze. 2008. Introduction to Information Retrieval.
Cambridge University Press. http://informationretrieval.org/
http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html
Ramesh et al. 2001. In:
http://www.cs.toronto.edu/~mcosmin/publications/thesis/node61.html
Salton, G. and M. J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.