LIS618 lecture 0
Thomas Krichel
2004-01-25

today's lecture
• A look at the course home page http://wotan.liu.edu/home/krichel/lis618p04s
• administrative stuff
• historical matters about the course
• about me
• business of database searching
• indexes
• the Boolean information retrieval model
• practice example on Dialog

Organization
• homepage http://wotan.liu.edu/home/krichel/lis618p04s
• Contents to be discussed today.
• Send mail to krichel@openlib.org
  – Your name
  – Your secret word for grades delivery
• Interrupt me with as many questions as possible!
• Ask for breaks!

Proposed Organization
• Normal lecture
• Quiz at the beginning of every lecture
  – Factually oriented, around 15 minutes
  – Remove worst performance
  – Average to form 50%
• Search exercise 50%
• I may make some adjustments to the syllabus this week.

Search exercise
• Find a victim of an information need
• Best to take someone you know in a professional capacity
• Conduct an interview about an information need experienced by the victim, write down expectations
• Search in a formal database and on the web
• Discuss results with the victim
• Write an essay, no longer than 5 pages.

about the course
• This course is new wine in an old bottle
• Officially a merger of
  – lis566 information resources on the Internet
    • mailing lists
    • usenet news
    • web searching
  – lis618 database searching
    • access and use of commercial databases

mix of theory and practice
• I am not a database search practitioner.
• Each database is different, practical skills are not easily transferable.
• Thus my emphasis in the course is more on theory.
• In the past, I did theory first, then practice.
• These days I mix: some theory and some practice in every session.

What online retrieval systems?
• Dialog has been the traditional database covered.
  – They were the market leaders in online databases in the past.
  – Nowadays the field is much more open.
  – They remain a very good teaching tool for command-based database searching.
• Nexis: a news database I have covered every year.
• Google: a well-known search engine that I started to cover two years ago.

other stuff
• Other online IR systems that I have covered in the past
  – OCLC FirstSearch
  – Factiva (briefly)
  – WestLaw (external speaker)
• New developments
  – Peer-to-peer networks
  – an introduction to reference linking using OpenURL
• Old developments with library potential
  – relational databases

About me
• Born 1965, in Völklingen (Germany)
• Studied economics and social sciences at the Universities of Toulouse, Paris, Exeter and Leicester.
• PhD in theoretical macroeconomics
• Lecturer in Economics at the University of Surrey, 1993 to 2001
• Since 2001 assistant professor at the Palmer School

Why?
• During my research assistantship period (1990 to 1993), I was constantly frustrated with difficult access to scientific literature.
• At the same time, I discovered easy access to freely downloadable software over the Internet.
• I decided to work towards downloadable scientific documents. This led to my library career (eventually).

Steps taken I
• 1993 founded the NetEc project at http://netec.mcc.ac.uk, later available at http://netec.ier.hit-u.ac.jp as well as at http://netec.wustl.edu.
• These are networking projects targeted to the economics community. The bulk is
  – Information about working papers
  – Downloadable working papers
  – Journal articles were added later

Steps taken II
• Set up RePEc, a digital library for economics research. It catalogs
  – Research documents
  – Collections of research documents
  – Researchers themselves
  – Organizations that are important to the research process
• Decentralized collection, model for the Open Archives Initiative

Steps taken III
• Co-founder of the Open Archives Initiative
• Work on the Academic Metadata Format
• Co-founded rclis, a RePEc clone for Research in Computing, Library and Information Science
• Currently working on the Konz project.
  It uses a database of titles of papers published in journals and tries to find them on the Internet.

my interest in databases
• An important emphasis of the course is still on commercial databases.
• From my point of view I have two interests in database searching
  – As a provider, I must understand how people search in order to provide data that they can use and will use.
  – As an economist, I have a strong interest in information as a commodity. The database market is an important market place.

online information retrieval
• This subject can be thought of as a subset of information retrieval (IR). Most IR is online or digital.
• IR concentrates on textual data.
• We can think of online IR as falling into two categories
  – database IR
  – web IR

database / web IR
• Database IR looks at systems that have
  – a controlled set of records
  – low heterogeneity
  – use that requires authentication
  – advanced search features
• Web IR has the opposite characteristics

traditional social model
• User goes to a library
• Describes problem to the librarian
• Librarian does the search
  – without the user present
  – with the user present
• Hands over the result to the user
• User fetches the full text or asks a librarian to fetch it.

economic rationale for traditional model
• In olden days the cost of telecommunication was high.
• Database use costs
  – cost of communication
  – cost of access time to the database
• The traditional model puts an upper limit on the costs.

disintermediation
• With the cost of access time gone, the traditional model is under threat
• There is disintermediation, where the librarian loses her role of doing the search.
• But that may not be good news for information retrieval results
  – user knows subject matter best
  – librarian knows searching best

Web searching
• IR has received a lot of impetus through the web, which poses unprecedented search challenges.
• With more and more data appearing on the web, database searching (DS) may be a subject in decline
  – It is primarily concerned with non-web databases
  – There are more and more web-based methods of searching

Public access vs quality
• Now the public at large is able to do online searching.
• At the same time the need for quality answers has grown.
• Quality-filtered services will become more important.
• In the current databases, a lot of material that is already available for free is mixed with quality-controlled material.
• Publishers have direct offerings and intermediated vending is in decline.

main theory part
• Literature:
  – "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto.
  – "Information Representation and Retrieval in the Digital Age" by Heting Chu.
• You don't need to buy the books. You had better spend your time practicing on databases rather than reading books.

components of the IR process
• provider
  – define data that is available
    • documents that can be used
    • document operations
    • document structure
  – index
• user
  – user need
  – IR system familiarity

the IR process
• Query expresses user need in a query language
• Processing of query yields retrieved documents
• Calculation of relevance ranking
• Examination of retrieved documents
• Possible return to the start, another query.

main problem
• User is not an expert at the formulation of a query
• Garbage in, garbage out: the retrieval yields poor results
• Ways around that problem
  – design a very intuitive interface for the query
  – give expert guidance

taxonomy of classic IR models
• Boolean, or set-theoretic
  – fuzzy set models
  – extended Boolean
• Vector, or algebraic
  – generalized vector model
  – latent semantic indexing
  – neural network model
• Probabilistic
  – inference network
  – belief network

summary
• There are three basic types of models in classic information retrieval.
• Extensions of these types are a matter of research concern and require good mathematical skills.
• All classic models treat documents as individual pieces.

key aid: index
• An index is a list of terms, each with a list of locations where the term is to be found.
• The way to express locations usually depends on the form that the indexed data takes.
  – for a book, it is usually the page number, e.g. "shmoo 34, 75"
  – for computer files it is usually the name of the file plus the number of the byte where the indexed term starts, e.g. "krichel index.html 34, cv.html 890 1209"
• There is usually more than one location for a term.

key aid: index terms
• The index term is a part of the document that has a meaning on its own.
• It is usually a noun.
• Retrieval based on index terms raises questions
  – semantics in the query or document are lost
  – matching is done in the imprecise space of index terms
• One way out is to specify several terms and require that they be close to each other.

basic concept: weight of index term
• Of all the nouns in a text, not all appear to have the same relevance to the text
• Sometimes, we can have a simple measure of the importance of a term, example?
• More generally, for each index term and each document we can associate a weight with the term and the document.
• Usually, if the document does not contain the term, its weight is zero

Boolean model
• In the Boolean model, the weight of every index term for any document is 1 if the term appears in the document. It is 0 otherwise.
• This allows query terms to be combined with the Boolean operators AND, OR, and NOT
• Thus powerful queries can be written

Classic implementation: Dialog
• The documents that I have used
  – http://training.dialog.com/sem_info/courses/pdf_sem/dlg1.pdf
  – http://training.dialog.com/sem_info/courses/pdf_sem/dlg2.pdf
  – http://training.dialog.com/sem_info/courses/pdf_sem/dlg3.pdf
  – http://training.dialog.com/sem_info/courses/pdf_sem/dlg4.pdf
• I am also told that there are others at http://gep.dialog.com/instruction/

Dialog is a databank
• over 500 databases
• these are also known as files and cover
  – references and abstracts for published literature
  – business information and financial data
  – complete text of articles and news stories
  – statistical tables
  – directories
• DIALOG uses the Boolean model

DIALOG interface
• It is still rooted in "traditional" database systems.
• It has been dismissed as "dial-a-dog".
• It uses a command-driven interface.
• It is very complicated to learn fully.
• It is not suitable for the end-user.
• It therefore offers a valuable skill to the information professional.

Accessing DIALOG
• On the web, go to http://www.dialogweb.com/
• Enter username and password
• Forget about subaccount
• Then click on logon
• On the next screen go to command search
• "continue" at the next screen

two steps in DIALOG
• Step one: select databases (aka files) to look at
• Step two: perform searches on the selected databases
• You may wonder why one does not have one single step like in a search engine. Discuss.

sample search
• We want to know something about "current awareness in digital libraries"
• From dialogweb command search:
  – databases
  – social sciences and humanities
  – library and information science
• This leads you to http://www.dialogweb.com/cgi/logoff?mode=guided&url=/cgi/dwframe?href=search.html

This is database selection…
• At that screen you see a number of "files" with their numbers.
• You can select those that you want to search
• Then you click "begin database"
• and you get back to the command search
• "b numbers" it will say. That is the command to begin working with files.

Boolean search
• Do a number of searches
  – s current(N)awareness
  – s digital(N)library
  – s digital(N)libraries
• Each search retrieves a set of documents
• The sets can be combined
  – s s1 and (s2 or s3)

What is the deal?
• There are two stages.
• At stage two we make Boolean queries.
• Each query splits the records into matching and non-matching records.
• The set of matching records is returned.
• It can be further searched or combined with other sets using Boolean operators.
• Try this at home.

http://openlib.org/home/krichel
Thank you for your attention!
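
For trying this at home without a Dialog account, the Boolean search slide can be imitated in a few lines of Python. This is only a minimal sketch, not how Dialog is implemented: the four records are made up, the names `build_index` and `near` are hypothetical, and `(N)` is assumed to mean "the two terms are adjacent, in either order". The index stores word positions rather than byte offsets.

```python
import re
from collections import defaultdict

# Hypothetical toy records; real Dialog files are not reproduced here.
RECORDS = {
    1: "Current awareness services in digital libraries",
    2: "Building a digital library for economics",
    3: "Current awareness for law firms",
    4: "Library digital collections and awareness of current trends",
}


def build_index(records):
    """Map each term to {record id: [word positions of the term]}."""
    index = defaultdict(dict)
    for rec_id, text in records.items():
        for pos, term in enumerate(re.findall(r"[a-z]+", text.lower())):
            index[term].setdefault(rec_id, []).append(pos)
    return index


def near(index, a, b):
    """Return the set of records where a and b are adjacent, in either order.

    This imitates the assumed meaning of a(N)b on the slide.
    """
    hits = set()
    # Only records that contain both terms can match.
    for rec_id in index.get(a, {}).keys() & index.get(b, {}).keys():
        positions_b = set(index[b][rec_id])
        if any(p + 1 in positions_b or p - 1 in positions_b
               for p in index[a][rec_id]):
            hits.add(rec_id)
    return hits


index = build_index(RECORDS)

s1 = near(index, "current", "awareness")   # s current(N)awareness
s2 = near(index, "digital", "library")     # s digital(N)library
s3 = near(index, "digital", "libraries")   # s digital(N)libraries

# s s1 and (s2 or s3): AND is set intersection, OR is set union.
result = s1 & (s2 | s3)

print("s1:", sorted(s1))
print("s2:", sorted(s2))
print("s3:", sorted(s3))
print("s1 and (s2 or s3):", sorted(result))
```

Each search yields a set of record numbers, and combining sets is plain set algebra, which is why the Boolean model is also called set-theoretic. NOT would correspond to set difference against the records of the selected files.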