
Class 9 - Search and Retrieval
Exercise overview
In classes 1-7 we explored information systems with a focus on the role of structured and semistructured data (e.g. web-pages, metadata schemas and relational databases) in supporting the
information lifecycle. We identified different types of metadata that support aspects of the lifecycle
including descriptive (e.g. discovery phase), administrative (e.g. acquisition, appraisal phases),
technical (e.g. preservation phase) and structural (e.g. ingest/management phases). We found that
these types of metadata are essential in enabling long-term management and preservation of
resources.
While we learned about the value of metadata, we also found that manual metadata creation
techniques are not always ideal and may not scale as needed in certain situations. In addition to this
problem of scale, there is a widespread school of research and practice that asserts that descriptive
metadata is poorly suited for certain types of discovery, a process also called "Information Retrieval" or IR. In
this class we explore alternative approaches to IR from both the systems-processing
perspective and the user-engagement perspective. In doing so we will build on our understanding of
information seeking and use models by thinking about new types of discovery.
Suggested readings
1. Mitchell, E. (2015). Chapter 7 in Metadata Standards and Web Services in Libraries, Archives, and Museums.
Libraries Unlimited. Santa Barbara, CA.
2. Watch: How Search Works: http://www.youtube.com/watch?v=BNHR6IQJGZs
3. Watch: The evolution of search: http://www.youtube.com/watch?v=mTBShTwCnD4
4. Kernighan, B. (2011). D is for Digital. Chapter 4: Algorithms
5. Read: How Search Works. http://www.google.com/intl/en_us/insidesearch/howsearchworks/thestory/
6. Pirolli, Peter. (2009). Powers of Ten. http://www.parc.com/content/attachments/powers-of-ten.pdf
Optional Readings
7. Read/Skim: Michael Lesk, The Seven Ages of Information Retrieval, Conference for the 50th Anniversary of As We
May Think, 1995. http://archive.ifla.org/VI/5/op/udtop5/udtop5.htm
8. Baeza-Yates, Ricardo and Berthier Ribeiro-Neto, Modern Information Retrieval, Addison Wesley Longman, 1999,
Chapter 1
9. Watch: Kenning Arlisch talk about SEO in libraries - http://www.oclc.org/en-US/events/2015/CI_SFPL_Feb_2015.html
(Last talk in the first block of speakers)
10. Explore: http://www.wikidata.org/wiki/Wikidata:Main_Page
11. Read/skim: http://news.cnet.com/8301-17938_105-57619807-1/the-web-at-25-i-was-a-teenage-dial-up-addict/
What is Information Retrieval?
This week we are watching a short video on Google Search, browsing a companion Google Search
website, learning about algorithms and finding out more about alternatives to metadata-based search.
Let's start by watching the "How Search Works" video
(http://www.youtube.com/watch?v=BNHR6IQJGZs), The Evolution of Search
(http://www.youtube.com/watch?v=mTBShTwCnD4) and browsing the "How Search Works" website
http://www.google.com/intl/en_us/insidesearch/howsearchworks/thestory/.
Question 1. Using these resources as well as your own Google searches, define the
following terms:
a. Computer Index
b. PageRank
c. Algorithm
d. Free-text Search
e. Universal Search
f. Real-time or Instant Search
While indexes exist in database design and are very important for database performance, we did not discuss
them in depth in Class 7. It turns out that indexes come in multiple forms and serve multiple goals,
but in short they all exist to facilitate access to a large dataset. Indexes accomplish this by taking a
slice of data and re-sorting it (e.g. indexing all of the occurrences of a word in a document, or sorting
words in alphabetical order). Indexes and approaches to indexing are one of the building blocks of IR.
On top of indexes, IR systems use search algorithms to find, process, and display information to the
searcher in unique ways.
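To make the idea concrete, here is a minimal sketch of an inverted index, assuming a tiny in-memory collection of titles. The build_inverted_index helper and the sample documents are my own illustrations, not part of the readings; production indexes are far more elaborate.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    # Sort terms alphabetically, mirroring the "re-sorted slice of data" idea
    return {term: sorted(doc_ids) for term, doc_ids in sorted(index.items())}

documents = {
    1: "The Adventures of Tom Sawyer",
    2: "The Adventures of Huckleberry Finn",
    3: "Life on the Mississippi",
}

index = build_inverted_index(documents)
print(index["adventures"])   # [1, 2]
print(index["mississippi"])  # [3]
```

The index answers "which documents contain this word?" without scanning every document, which is exactly the kind of access to a large dataset described above.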
IR is a broad field that includes the entire process of document processing, index creation, algorithm
application and document presentation situated within the context of a search. The following figure
shows a sample model of IR and the relationships between resources, the search process and
document presentation process. This process is broken into three broad areas, Predict, Nominate
and Choose.
Predict
In the Predict cycle, documents are processed and indexes are created that help an IR system make
informed guesses about which documents are related to one another. This can be accomplished via
algorithms like PageRank, term frequency/inverse document frequency (TF-IDF) or n-gram indexing (more
on these methods later), or by other means. The prediction process is largely a back-office, pre-processing activity (i.e. systems complete this work in anticipation of a search).
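As one example of what this pre-processing might look like, here is a back-of-the-envelope sketch of TF-IDF scoring over a toy corpus. The tf_idf function and the sample corpus are assumptions for illustration; real engines use more refined weighting and vastly larger indexes.

```python
import math

def tf_idf(term, doc_tokens, corpus_tokens):
    """Score how characteristic `term` is of one document within a corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    docs_with_term = sum(1 for tokens in corpus_tokens if term in tokens)
    idf = math.log(len(corpus_tokens) / docs_with_term) if docs_with_term else 0.0
    return tf * idf

corpus = [
    "tom sawyer and huckleberry finn".split(),
    "life on the mississippi".split(),
    "the adventures of tom sawyer".split(),
]
# "mississippi" appears in only one document, so it scores higher for that
# document than the very common word "the" does.
print(tf_idf("mississippi", corpus[1], corpus))
print(tf_idf("the", corpus[1], corpus))
```

Terms that are frequent in one document but rare across the corpus score highly, which is what lets a system guess, ahead of any search, which documents a term is "about."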
Nominate
The Nominate process is comparable to the search process that we have explored in previous
classes. Using a combination of words, images, vocabularies or other search inputs the system
applies algorithms to determine the best match for a user's query. This process likely involves a
relevance ranking process in which the system predicts the most relevant documents and pushes
them towards the top of the results. Google implements ranking using a number of factors including
personal/social identity (e.g. Google will show results related to you or your friends first), resource
ranking with PageRank (e.g. the main website of UMD gets listed first because it is a central hub of
links) and timeliness (e.g. using real-time search, Google prioritizes news and other current results).
Relevance Ranking has evolved quickly over the last twenty years and is increasingly the preferred
results display method. At the same time, relevancy is not always the best sorting process.
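As a very rough illustration of the Nominate step, the sketch below ranks documents by how many query terms they contain. The overlap score and the rank function are stand-ins of my own; production relevance ranking blends many signals (PageRank, personalization, freshness) as described above.

```python
def rank(query, documents):
    """Order documents by how many query terms each one shares with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for doc_id, text in documents.items():
        overlap = len(query_terms & set(text.lower().split()))
        scored.append((overlap, doc_id))
    # Highest-scoring documents are pushed towards the top of the results;
    # documents that match nothing are dropped entirely.
    return [doc_id for overlap, doc_id in sorted(scored, reverse=True) if overlap > 0]

documents = {
    "tom": "the adventures of tom sawyer",
    "huck": "the adventures of huckleberry finn",
    "life": "life on the mississippi",
}
print(rank("adventures of huckleberry finn", documents))  # ['huck', 'tom']
```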
Question 2. Can you think of some areas in which relevance ranking is not the ideal
approach to sorting search results?
Choose
In the resource selection process, users are highly engaged with the system: scanning results,
engaging in SenseMaking and other information-seeking behaviors to evaluate the fit of a resource
with their information need, and ultimately selecting documents for use. Here too the
information product delivered may vary, even when the same source documents are used. For example, if we
are seeking information about books we want to be presented with a list of texts, while if we are
seeking information about an idea and its presence across multiple texts we may want to see a
concordance that shows our search terms in context (also known as Keyword in Context, or KWIC).
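A concordance display is easy to sketch in a few lines. The kwic function below, with its three-word context window, is a hypothetical example rather than any particular system's implementation.

```python
def kwic(text, keyword, window=3):
    """Show each occurrence of `keyword` with `window` words of context on each side."""
    words = text.split()
    lines = []
    for i, word in enumerate(words):
        if word.lower().strip(".,") == keyword.lower():
            left = " ".join(words[max(0, i - window):i])
            right = " ".join(words[i + 1:i + 1 + window])
            lines.append(f"{left} [{word}] {right}")
    return lines

text = "The raft drifted down the river at night. By day the river was quiet."
for line in kwic(text, "river"):
    print(line)
# drifted down the [river] at night. By
# By day the [river] was quiet.
```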
Structured data vs. full-text or digital object Information Retrieval
In the structured data world, search and retrieval is based on the idea that our data is highly
structured, predictable and conforms to well-understood boundaries. For example, if we are
supporting search of books and other library materials using traditional library metadata (e.g. MARC),
we can expect that our subject headings will conform to LCSH and we can look to other metadata fields (e.g.
title, author, publication date) to support specific search functions. If, in contrast, we decided to find
books based just on the full content of the text with no supporting metadata, we would not be able to
make such assumptions.
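The contrast can be sketched in code. The toy records, subject strings and text snippets below are illustrative assumptions, not real catalog data: a fielded query can lean on predictable structure and vocabulary, while a full-text query can only match words in the text.

```python
records = [
    {"title": "Adventures of Huckleberry Finn", "author": "Twain, Mark",
     "subjects": ["Mississippi River--Fiction"],
     "fulltext": "a story of a boy and a raft drifting down the river"},
    {"title": "Roughing It", "author": "Twain, Mark",
     "subjects": ["West (U.S.)--Description and travel"],
     "fulltext": "a personal narrative of travels through the mining west"},
]

# Fielded (metadata) search: we can rely on a predictable field and vocabulary.
by_subject = [r["title"] for r in records
              if any("Mississippi" in s for s in r["subjects"])]

# Full-text search: no structural assumptions, just words in the text.
by_text = [r["title"] for r in records if "narrative" in r["fulltext"]]

print(by_subject)  # ['Adventures of Huckleberry Finn']
print(by_text)     # ['Roughing It']
```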
Let's build our understanding of different approaches to indexing and IR by exploring several discovery systems, three of them Google-based. Our overarching question is "Which of Mark Twain's books proved to be
most popular over time? How can we measure this popularity? Have these rankings changed over
time? Which book is most popular today?" For feasibility we will limit our search to the following
books: The Innocents Abroad, The Adventures of Huckleberry Finn, The Adventures of Tom Sawyer,
A Connecticut Yankee in King Arthur's Court, Roughing It, Letters from the Earth, The Prince and the
Pauper and Life on the Mississippi.
In order to answer these questions we are going to explore several information systems. As we
explore each information system you should look for answers to these questions and think about what
type of index was required to facilitate the search and whether or not the index is based on metadata
or "free text." You should also take note of the best index or search engine for this information.
Types of IR systems
Google Search (http://google.com)
The regular Google search interface indexes the web. There is a lot to say about this resource but I
expect we are largely familiar with it. Try a few searches with Google related to each question. A
"Pro-tip" for Google: Look for the "Search Tools" button at the top of the screen just under the search
box. These search tools give you access to some filtering options.
Question 3. What type of index or IR system is most prevalent in this discovery
environment (e.g. Metadata or full-text based)?
Question 4. What search terms or strategies proved to be most useful in this database?
Google Books (http://books.google.com)
Google Books is an index created by a large-scale scanning and metadata harvesting operation
initiated by Google in the early 2000s (http://www.google.com/googlebooks/library/). Google Books
indexes both metadata (e.g. title, author, publication date) and the full text of books. It uses a page-preview approach to show users where in a book their search terms occur.
Question 5. What type of index or IR system is most prevalent in this discovery
environment (e.g. Metadata or full-text based)?
Question 6. What search terms or strategies proved to be most useful in this database?
HathiTrust (http://catalog.hathitrust.org)
The HathiTrust is a library-run cooperative organization that shares scanned books and
OCR data from Google's digitization project as well as from other sources. The main objective of HT is to provide an archive of scanned books for
libraries. One product of this archive is a searchable faceted-index discovery system. In some cases
(e.g. when a book is out of copyright) the digital full text is made available.
Question 7. What type of index or IR system is most prevalent in this discovery
environment (e.g. Metadata or full-text based)?
Question 8. What search terms or strategies proved to be most useful in this database?
GoodReads (http://www.goodreads.com/search)
GoodReads is a social book cataloging and reading platform. GoodReads aggregates bibliographic
metadata and social recommendations by readers and serves as both a resource discovery and
community engagement platform.
Question 9. What type of index or IR system is most prevalent in this discovery
environment (e.g. Metadata or full-text based)?
Question 10. What search terms or strategies proved to be most useful in this database?
Google Ngram Viewer (https://books.google.com/ngrams/)
The Google Ngram Viewer is a specialized slice, or index, of the Google Books project. An n-gram is
an index structure that refers to a combination of words that are related by their proximity to one
another. The letter "n" refers to a variable that can be any whole number (e.g. 1, 2, 3). N-grams are often
referred to according to the number of words that are indexed together. For example, in the sentence
"The quick brown fox jumped over the fence" an index of two-word combinations (or "bi-grams") would
include "The quick," "quick brown," "brown fox," "fox jumped" and so forth. Tri-grams are indexes
of three words together (e.g. "The quick brown"). N-gram indexes are a new take on phrase
searching as applied to full-text resources at a large scale.
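Generating n-grams is mechanically simple. The ngrams helper below is a toy sketch of the idea using the example sentence above; the Ngram Viewer itself works over millions of scanned books and stores yearly counts for each n-gram.

```python
def ngrams(text, n):
    """Return every n-word sequence (n-gram) in the text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The quick brown fox jumped over the fence"
print(ngrams(sentence, 2))  # bi-grams:  ['The quick', 'quick brown', 'brown fox', ...]
print(ngrams(sentence, 3))  # tri-grams: ['The quick brown', 'quick brown fox', ...]
```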
Step 1:
Searching the Google Ngram viewer can be conceptually somewhat difficult so I
recommend you follow the short tutorial below:
a. Go to the Google Ngram Viewer (https://books.google.com/ngrams/).
b. In the Search box type (without the quotes) "Adventures of Huckleberry Finn,
Adventures of Tom Sawyer."
c. Check the case-insensitive box and click the "Search lots of books" button.
d. You will see a graph displayed that shows the relative occurrence of these
n-grams across the entire corpus of books.
e. You should notice that we searched for 4-grams (i.e. four-word phrases) but you can
mix and match n-grams in a single search. You should also notice that we separate
our n-grams with commas. One technical detail: the maximum phrase length you can
search for is 5 words, so you may need to think about this as you search
Google.
Question 11. What type of index or IR system is most prevalent in this discovery
environment (e.g. Metadata or full-text based)?
Question 12. What search terms or strategies proved to be most useful in this database?
Searching and finding
Using these indexing systems, try your hand at answering our questions. Don't be shy about
looking for documentation or other sites!
Question | Type of index (e.g. free-text / metadata) | Best search and resource to answer the question
Which of MT's books proved to be most popular over time? | |
How are rankings of popularity different (e.g. what do they measure, what data sources do they use)? | |
Which book is most popular today? | |
Where can you get an electronic copy of each book? | |
Your findings
Evaluating search results: Precision vs. Recall
In deciding which systems worked best for the questions we were asking you likely made qualitative
decisions about what systems worked best. You may have decided that systems were not useful
based on the initial page of results you looked at or you may have ultimately found that specialized or
unique search strategies helped you identify better results. This process of evaluating relevance is
often expressed as "Precision vs. recall" in the IR community.
Broadly stated, precision is related to whether or not the results retrieved by a search are actually the
ones you wanted. In other words, precision helps you ask the question "How much of what was found
is relevant?" An example of a high-precision search is a known-item search in an online catalog by
title. In this case you know the title of the book and the index to use (e.g. the title index). The search
results are highly precise - the catalog either has the book or it does not. In a well-structured and
perfect search environment, high precision probably best fulfills your information need.
In contrast to precision, recall pertains to how many of the relevant documents in a collection are retrieved during a search. The
Google web search is an example of a high-recall search; result sets often contain tens of
thousands of items! In contrast to precision, recall asks the question "How much of what was
relevant was found?" High recall helps users find the best resource that fits a fuzzy information need.
A good example of this is a search for a website or product on the web where you may remember
qualities of the product (like its function, color or price) but not its name. In a fuzzy search world
where we do not always know what we are looking for, high recall is most likely preferred over high
precision, as we want to expand the number of records we look at.
Precision and recall can be thought of in terms of two intersecting sets of documents: the documents
in an index that are relevant to a search and the documents that the search retrieves. The intersection
of these two sets represents the relevant documents that were retrieved. Precision and recall are each
presented as a ratio, with a minimum value of 0 and a maximum value of 1. This means that we can also
think about precision and recall in terms of a percentage (e.g. 100%).
Precision and recall can be expressed mathematically as:
1. Precision = # of relevant records retrieved / (# relevant retrieved + # irrelevant records retrieved)
2. Recall = # of relevant records retrieved / (# relevant retrieved + # relevant not retrieved)
Ideally, information systems are high recall (e.g. all relevant results retrieved) as well as high
precision (e.g. a high ratio of relevant to irrelevant results). In the real world this is difficult if not
impossible. In practice, as precision approaches 1 (or 100%), recall tends to fall towards 0. Conversely, as our
recall approaches 1 (or 100%), our precision tends to fall towards 0.
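The two formulas above translate directly into code. The numbers in this sketch are made up for illustration and are deliberately different from the figures in the questions that follow.

```python
def precision(relevant_retrieved, irrelevant_retrieved):
    """How much of what was found is relevant?"""
    return relevant_retrieved / (relevant_retrieved + irrelevant_retrieved)

def recall(relevant_retrieved, relevant_not_retrieved):
    """How much of what was relevant was found?"""
    return relevant_retrieved / (relevant_retrieved + relevant_not_retrieved)

# A made-up example: 8 relevant and 12 irrelevant results retrieved,
# with 2 relevant documents left unretrieved in the index.
print(precision(8, 12))  # 0.4 -> 40% of what was found is relevant
print(recall(8, 2))      # 0.8 -> 80% of what was relevant was found
```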
Question 13. You have an index containing 100 documents, 10 of which are relevant to a
given search and 90 which are not. Your search produces 5 good documents and 30
bad documents. Calculate the precision and recall ratios for this search.
Question 14. Suppose you tweaked your indexing or your search and managed to retrieve
all 10 relevant documents but at the same time returned 50 irrelevant documents.
Calculate your precision and recall.
Question 15. Assuming that you would rather return all of the relevant documents than
miss any, what techniques might your IR system need to implement to make the
results more useful?
Recall and precision are just two measures used in system evaluation. In addition there are a number of
affective measures, such as user satisfaction or happiness, and user-generated measures, such as
rate of re-use, number of searches needed to locate a resource, or user judgments about the "best" resource.
Summary
In this class we explored types of indexing and information retrieval systems as we considered the
differences between metadata-based, free-text and social/real-time information retrieval
systems. We learned about a key pair of measures in IR - precision and recall - and became acquainted with
how to calculate both. In doing so we just scratched the surface of IR. If you
were intrigued by some of the search features in Google you may want to try out their
"PowerSearching" course at http://www.google.com/insidesearch/landing/powersearching.html. If the
broader aspects of IR intrigued you, I encourage you to explore the optional readings for this week and to try out
more information retrieval systems.