Databases as a tool for the content specialist
Applying databases to indexing in the information specialists’ world
Johan van Wyk (M.Bibl.; BA Hons History, THED)

When you look into the literature on indexing in databases, you are confronted with terms such as BTree, TFSG parsing, clustered indexes, filtered indices and so on. These are all techniques and tools used to achieve retrieval of information in database systems. But this is not our concern here today. Let’s just say that we accept this as a given (as we do with most technology we don’t understand!).

As information professionals in the information industry, what is our concern with database systems?

Database systems can be arranged on a continuum between DBMSs and full text systems. At the one extreme are the DBMSs: very structured, with good sorting and report facilities, but awful with textual information. They were never built to handle it. Examples of DBMSs: Oracle, SQL.

At the other extreme are the full text systems. They were specifically built for textual documents. Examples of those systems are BRS Search and Brainware, and one could include Internet search engines such as Alta Vista.

The distinguishing factors are:

DBMSs:
- Structure
- Report facility to manipulate the output by sorting the structure elements
- Field lengths often limited
- Searchable fields are limited
- Complicated search

Full text systems:
- The full document is the unit of information
- Every word is searchable
- No formal structure

Then we find a third category in the industry: text retrieval systems. Text retrieval systems were developed to get the best of both worlds: the structured elements of DBMSs and the ability to handle full text.
Thus we find that text retrieval systems have the following characteristics or features:
- Structure facilitated by means of fields
- Report facility
- All fields of variable length
- Indexing of all fields, full text

Interestingly, though, text retrieval systems were developed before full text systems. Some of the reasons were computing power, retrieval techniques and (to us “obviously”) the needs of information specialists. One is always amazed at the searching functionality of systems like IBM’s “STAIRS” (1972) and the online systems used by BRS, Medline and Dialog in the same period. To be able to retrieve information in these systems, programming techniques such as BTree were developed. These techniques have evidently been very successful: the volume of data is no longer a problem.

Today, as you all know, we search the internet and are amazed at the amount of information retrieved. And then we are confused and irritated by the overdose. And that is exactly the core function of this profession: to make sense of the masses of information. The user is not interested in being the world champion in the number of items retrieved. The question is whether his need is being addressed: the retrieval of relevant information is what matters.

The IT sector has now reached a point where the indexing of masses of data is “solved”. What more can you want when you have full text indexing? Why is the user now confused or irritated? The reason is that the information need was not met: finding relevant information.

Relevance is a very difficult concept to pin down. When testing information retrieval performance, it is the crucial variable to define, because a poor definition could skew your research totally.
Retrieval performance is measured using two parallel measures, both relying on the judgement of relevance:
- Recall (the ability to retrieve relevant information)
- Precision (the ability to withhold non-relevant information from the information retrieved)

The technology used for full text indexing greatly enhances recall, to the detriment of precision. Information retrieval research then started focussing on techniques for manipulating the output. Here we find techniques such as “relevance ranking”, widely used in, for example, internet search engines. These techniques have been quite successful since the computing power became available.

In these text retrieval systems we work with words, word stems or phrases. Almost all of the relevance ranking techniques are based upon the work of Karen Spärck Jones in Cambridge (UK). Spärck Jones brought us what became known as the term frequency-inverse document frequency theory. The tf-idf weight is a statistical measure used to evaluate how important a word is within a document and, in turn, within a collection of information items. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the collection. An unpublished study at Syracuse University (USA) in 1984 tested 29 variations of the tf-idf weighting scheme and came to the conclusion that there was no significant difference between them.

Another technique used for retrieval is “fuzzy set theory” or “fuzzy logic”. This technique attempts to retrieve information without being bound to specific words or spelling. Here too the results greatly enhance recall rather than precision.

But all of these techniques are on the output side. In the indexing profession we are on the input side.
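The tf-idf weighting described above can be illustrated with a minimal sketch. It assumes one common variant (raw term count multiplied by the logarithm of the number of documents divided by the document frequency); the function name `tf_idf`, the whitespace tokenisation and the three sample documents are invented for illustration and are not taken from any particular system.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dictionary per document.

    Assumed variant: weight = tf * log(N / df), where tf is the raw
    count of the term in the document, N the number of documents and
    df the number of documents that contain the term.
    """
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenised:
        df.update(set(tokens))

    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical three-document collection.
docs = [
    "indexing of documents in a database",
    "database systems and database indexes",
    "retrieval of relevant documents",
]
weights = tf_idf(docs)
```

In this sketch a term that occurs in every document receives a weight of zero, which reflects the idea that a word found everywhere in a collection does little to distinguish one item from another.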
We would like to see that the entries created for retrieval improve the ability to retrieve relevant information. Words and textual “strings” (multi-word sequences or phrases) are used to index and search these systems. Words are obviously just representatives of concepts. The user actually needs information about a concept. This is exactly what the information professional tries to achieve, namely to identify what the information item is “about”. Hutchins (1978) coined the phrase “aboutness” to describe this phenomenon.

This is exactly where the indexer performs his skill: creating retrieval elements to optimise relevance. It is all about the process of identifying and creating index entries to represent the “aboutness” of an information item.

One of the areas investigated since the 1970s was the application of linguistic theory. The main problem here was that linguistic theory only gave answers on the function of words up to the unit of a sentence. Very little applicable theory was available for working with whole documents or even collections of textual items.

Today we want to look at what database systems can offer us on the input side. Database systems (specifically text retrieval systems) give us the following features:
- The ability to house and work with a collection of information items
- The structure for organising or categorising meta-data of a collection
- The ability to retrieve full text: every word
- The ability to retrieve items specific to individual meta-data categories
- Searching functionality: Boolean, word stem and proximity searching
- Word and string indexes related to the collection as a whole or within a field
- Batch modification of indexes
- Some even give us a thesaurus capability

Because text retrieval database systems build indexes, we can use these to identify more successful index terms. Here I would suggest using the elements of the term frequency-inverse document frequency theory: the word frequencies.
This theory acknowledges the distinctive roles of words in a document, words in a collection, and the documents related to a word. Automatic indexing theory identifies them as:
- Within document frequency (WDF)
- Collection frequency (CF)
- Document frequency (DF)

WDF gives us the frequency of a word in a document. Frequencies that are too low or too high point to non-significant words; we need to look at the medium to higher frequencies. The focus here is the significance and role of a term within a document.

CF gives us the frequency of a word in a collection. Again, the medium to higher frequencies are significant. The focus here is the significance and role of a term within a document collection.

DF shows the number of documents linked to a word. Again, the medium to higher frequencies are significant. The focus here is the significance of a document in a collection with regard to a specific term or concept.

Text retrieval systems, because they build indexes, can give us these word frequencies. Through knowledge of word frequency theory one can assess the significance of terms. Text retrieval systems also allow us to incorporate full text into one of the fields alongside the meta-data.

Indexers tend to focus on index terms related to a specific item of information, which corresponds to the “within document frequency”. One can therefore say that this is the element best catered for, namely the role and significance of a term within an information item. The other two elements should be a concern.

The significance in a collection is often taken care of by:
- Using an indexer who “knows the subject area”
- The indexer’s knowledge of the collection
- The indexer’s knowledge of the users, company or environment

All three of these operate subconsciously and are often not really taken care of.

The third element, namely the role or significance of a document in a collection with regard to a specific term, is often not even considered.
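The three frequencies defined above (WDF, CF and DF) can be derived from a word index in a few lines. The following is a minimal sketch only: the function names, the tiny stop word list and the sample collection are hypothetical, and a real text retrieval system would take these counts from its own index rather than recompute them.

```python
from collections import Counter

# Illustrative stop word list only; a real list would be far longer.
STOP_WORDS = {"the", "of", "a", "in", "and", "is", "to"}


def tokenize(text):
    """Lower-case the text, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def frequency_tables(documents):
    """Return (wdf, cf, df) for a list of document strings.

    wdf: one Counter per document (within document frequency)
    cf:  Counter of total occurrences in the collection
    df:  Counter of the number of documents containing each term
    """
    wdf = [Counter(tokenize(d)) for d in documents]
    cf = Counter()
    df = Counter()
    for counts in wdf:
        cf.update(counts)          # add the occurrences per term
        df.update(counts.keys())   # add one per document per term
    return wdf, cf, df


# Hypothetical three-document collection.
collection = [
    "the index of the database is a word index",
    "a thesaurus is a tool to index a collection",
    "the database and the thesaurus",
]
wdf, cf, df = frequency_tables(collection)
```

An indexer could then sort `cf` or `df` by count and inspect the medium to higher frequency terms as candidate index terms, in line with the suggestion above.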
An analysis of the subject coverage of the documentation used in a countrywide HSRC study in 1988 gave alarming results. This is not apparent from the study’s reports.

We can use text retrieval database systems to determine better index terms in the following ways.

With single documents one can do an analysis of the term frequencies within a document. Most text retrieval systems can load the full text of a single document into a record in the database, generating the word frequencies in the index. Stop word lists can successfully be applied.

With collections of documents the text retrieval database systems can be used to:
- Determine the collection frequencies of terms, within or across fields, before indexing a new document
- Determine the document frequencies of terms (the number of documents relating to the index term) in a collection when indexing a document

The structure of database systems can successfully be used to:
- Assign index terms to a specific meta-data element, sharpening the specific role of a term
- Create database fields that enhance the role of index terms. An example is where the subject field is divided into a field of broad terms from a controlled vocabulary and a separate field for free text indexing, possibly with a further field for names of persons, places and companies

Most text retrieval systems have the ability to do batch changes in indexes. When Northern Province became Limpopo, how many of the information services made that change?

As mentioned, some text retrieval database systems have a thesaurus feature. Some are linked to an existing thesaurus; some allow the user or indexer to build a thesaurus. Building your own thesaurus can be of great value in narrower subject fields. This could be a very valuable tool for the indexer to use on the input side.

Lastly, we can use text retrieval systems to test the effectiveness of our indexing, using the performance measures of recall and precision.
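As a minimal sketch of such a test, recall and precision can be computed from the set of items a search retrieved and the set of items judged relevant. The function name, the document identifiers and the two sample searches below are hypothetical; the relevance judgements themselves must still come from a person.

```python
def recall_precision(retrieved, relevant):
    """Return (recall, precision) for one search.

    recall    = relevant items retrieved / all relevant items
    precision = relevant items retrieved / all items retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    return recall, precision


# Hypothetical relevance judgements for one query.
relevant_docs = {"d1", "d2", "d3", "d4"}

# A broad full text search: many items, few of them relevant.
broad = recall_precision(
    {"d1", "d2", "d3", "d5", "d6", "d7", "d8", "d9"}, relevant_docs
)

# A narrow search on an assigned index term: few items, all relevant.
narrow = recall_precision({"d1", "d2"}, relevant_docs)
```

Here the broad search gives a recall of 0.75 with a precision of 0.375, while the narrow search gives a recall of 0.5 with a precision of 1.0. Running the same set of test queries before and after a change to the index terms gives a simple way to see whether the change helped.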
Information services should regularly reflect on the effectiveness of the indexes of a collection by analysing and rectifying problems.

Database systems, and specifically text retrieval database systems, have developed to a standard where they are very efficient, with a whole collection of features relevant to the input side of indexing. They are therefore a powerful tool in the hands of the indexer.

Bibliography

Bertino, E., B. C. Ooi, R. Sacks-Davis, K-L. Tan, J. Zobel, B. Shidlovsky and B. Catania. 1997. Indexing Techniques for Advanced Database Systems. Kluwer Academic Publishers.

Elmasri, R. and S. Navathe. 2000. Fundamentals of Database Systems. Addison-Wesley.

Graf, P. 1996. Term Indexing. Springer.

Hutchins, W. J. 1978. The concept of 'aboutness' in subject indexing. Aslib Proceedings 30 (5), May 1978, p. 172-181.

Manning, C. D., P. Raghavan and H. Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press. http://informationretrieval.org/ http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Ramesh et al. 2001. In: http://www.cs.toronto.edu/~mcosmin/publications/thesis/node61.html

Salton, G. and M. J. McGill. 1983. Introduction to Modern Information Retrieval. McGraw-Hill.