Information Retrieval (1955–1992)
• Primary users
– Law clerks
– Reference librarians
– (Some) news organizations, product research, congressional committees, medical/chemical abstract searches
• Primary search models
– Boolean keyword searches on abstract, title, and keyword fields
• Vendors
– Mead Data Central (Lexis-Nexis)
– Dialog
– Westlaw
• Total searchable online data: O(10 terabytes)

Information Retrieval (1993+)
• Primary users
– First-time computer users
– Novices
• Primary search modes
– Still Boolean keyword searches, with limited probabilistic models
– But FULL TEXT retrieval
• Vendors
– Lycos, Infoseek, Yahoo, Excite, AltaVista, Google
• Total online data: ???

Growth of the Web
[Figure: exponential growth in the number of web sites (or volume of web traffic?), 1992–1998, with the releases of Mosaic and Netscape marked; volume doubling every 6 months]

Observation
• Early IR systems basically extended library catalog systems, allowing
– Keyword searches
– Limited abstract searches, in addition to Author/Title/Subject, including Boolean combination functionality
• IR was seen as reference retrieval (full documents still had to be ordered/delivered by hand)

In Contrast
Today, IR has a much wider role in the age of digital libraries:
• Full document retrieval (hypertext, PostScript, or optical image (TIFF) representations)
• Question answering

Old View
Function of IR: map queries (Boolean combinations of terms via AND/OR) to relevant documents.

New View
Satisfy the user's information need. Infer goals/information need from:
– the query itself
– past user query history
– user profiling (an aol.com user vs. a CS department user)
– collective analysis of other users' feedback on similar queries
In addition, return information in a format useful/intelligible to the user:
• weighted orderings
• clusterings of documents by different attributes
• visualization tools
** Use text understanding techniques to extract the answer to a question, or at least the relevant subregion of text.
Example: "Who is the current mayor of Columbus, Ohio?"
We don't need the full AP/CNN article on city scandals, just the answer (and an available source for proof).

Boolean Systems
Function #1: provide a fast, compact index into the database (of documents or references).
• Index granularity options
– Document number
– Page number within document
– Actual word offset
• Data structure: inverted file (e.g., postings lists for "Chihuahua" and "Nanny")

Boolean Operations
• Chihuahua AND Nanny → join (intersection of postings lists)
• Chihuahua OR Nanny → union (of postings lists)
• Proximity searches: Chihuahua W/3 Nanny (the terms within 3 words of each other)

Vector IR Model
Find an optimal mapping f( ) from documents to vectors, f(d1) = V1 and f(d2) = V2, such that
• Sim(Vi, VQ) ≈ Sim'(Di, Q)
• Sim(V1, V2) ≈ Sim'(d1, d2)
where Sim is a vector similarity such as cosine distance, and Sim' is the true (semantic) similarity.

Query Vector Models
• V1 is a bit vector capturing the essence/meaning of D1
• Given a query vector Q1, compute Sim(V1, Q1)
• Retrieval: find the document Di maximizing Sim(Vi, Q1)

Dimensionality Reduction
• f(d1) = V1 is the initial (term) vector representation
• Dimensionality reduction (SVD/LSI) maps V1 to V̂1, a more compact/reduced-dimensionality model of d1
• Example: a document D1 containing Japanese, Japan, Nippon, Nihon — four surface terms, one underlying concept
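The inverted-file index and the Boolean/proximity operations above can be sketched in a few lines. This is a toy illustration, not a production index: the corpus, function names, and the choice to store actual word offsets (the finest granularity option listed above) are all assumptions for the example; `within` implements a W/k-style proximity test.

```python
from collections import defaultdict

def build_index(docs):
    """Positional inverted file: term -> {doc_id: [word offsets]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for offset, term in enumerate(text.lower().split()):
            index[term][doc_id].append(offset)
    return index

def boolean_and(index, t1, t2):
    return set(index[t1]) & set(index[t2])   # join: docs containing both terms

def boolean_or(index, t1, t2):
    return set(index[t1]) | set(index[t2])   # union: docs containing either term

def within(index, t1, t2, k):
    """Proximity search (t1 W/k t2): docs where the terms occur
    within k word positions of each other."""
    hits = set()
    for doc_id in boolean_and(index, t1, t2):
        if any(abs(p1 - p2) <= k
               for p1 in index[t1][doc_id]
               for p2 in index[t2][doc_id]):
            hits.add(doc_id)
    return hits

docs = {
    1: "the chihuahua needs a nanny",
    2: "a nanny arrived today the chihuahua barked much later",
    3: "the nanny left",
}
idx = build_index(docs)
print(boolean_and(idx, "chihuahua", "nanny"))  # {1, 2}
print(boolean_or(idx, "chihuahua", "nanny"))   # {1, 2, 3}
print(within(idx, "chihuahua", "nanny", 3))    # {1}
```

Storing word offsets rather than just document numbers costs more space but is what makes the W/3-style proximity operator possible.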
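The SVD/LSI reduction can be sketched with a truncated SVD of a toy term-document matrix. The terms, counts, and choice of k = 2 here are illustrative assumptions; the point is that co-occurring surface terms (Japan/Nippon/Nihon) collapse into one latent dimension, so documents sharing the concept end up close even when they share no surface word.

```python
import numpy as np

# Toy term-document count matrix (rows = terms, cols = documents).
# "japan", "nippon", "nihon" co-occur, so LSI should merge them.
terms = ["japan", "nippon", "nihon", "opera", "soap"]
A = np.array([
    [2, 1, 0],   # japan
    [1, 1, 0],   # nippon
    [1, 2, 0],   # nihon
    [0, 0, 2],   # opera
    [0, 0, 1],   # soap
], dtype=float)

# Full SVD, then keep only the top k singular values/vectors
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
V_hat = np.diag(s[:k]) @ Vt[:k, :]   # k-dimensional document representations

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

d1, d2, d3 = V_hat.T  # reduced vectors for the three documents
# d1 and d2 share the Japan-related concept, d3 does not
print(cos(d1, d2) > cos(d1, d3))  # True
```

Dropping the smallest singular values discards the dimensions that distinguish near-synonymous terms, which is exactly the "meaning-preserving compression" the lecture is after.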
Clustering Words
Term vector index options (offset k into the vector):
– k = hash(w)
– k = hash(cluster(w))
– k = hash(cluster(stem(w)))
Example: in the raw term vector, Japan, Japanese, Nippon, and Nihon each get their own entry (e.g., Japan: 5, the: 192); in the condensed vector they collapse into a single entry for the cluster "Japan".

Stemming
• books → book
• computer, computation → comput

Collocations (Phrasal Terms)
• d1: "The soap opera" — contains the collocation "soap opera"
• d2: "The soap residue; an opera by Verdi" — contains "soap" and "opera" separately, but NOT the collocation
A bag-of-words vector cannot distinguish d1 from d2; a separate dimension for the phrasal term "soap opera" can.

Vectors as Compressed Documents
Abstractly, f(di) = mi is a compressed, meaning-preserving representation of the document: a meaning or context vector.
• Compression: m1 = m2 iff d1 = d2; f( ) must be invertible
• Summarization: m1 = m2 iff d1 and d2 are about the same thing (mean the same thing)
What is the optimal method for meaning-preserving compression?

Issues
• Size of representation (ideally size(Vi) << size(Di))
• Cost of computing the vectors
– a one-time cost at model creation
• Cost of the similarity function
– must be computed for each query
– crucial to speed that this be minimized
• Header processing: retain/model cross references (e.g., V1 → ref(V2), ref(V3))

Handling Function Words
1. Remove (most) function words — though some, such as NOT and OR, may carry meaning
2. Downweight terms by frequency
3. Use text analysis to decide which function words carry meaning
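The clustering and vector-model ideas above can be combined into a small sketch: a toy cluster map collapses Japan/Japanese/Nippon/Nihon into one index term, and cosine similarity compares the resulting condensed term vectors. The cluster table and example sentences are assumptions for illustration, not from the lecture.

```python
import math
from collections import Counter

# Toy cluster map: collapse surface forms into one index term,
# i.e., the hash(cluster(w)) option above
CLUSTER = {"japan": "japan", "japanese": "japan",
           "nippon": "japan", "nihon": "japan"}

def term_vector(text):
    """Condensed term-frequency vector: cluster -> count."""
    terms = (CLUSTER.get(w, w) for w in text.lower().split())
    return Counter(terms)

def cosine(v1, v2):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(v1[t] * v2[t] for t in v1 if t in v2)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

d1 = term_vector("Nippon exports rose as Japanese factories expanded")
q  = term_vector("Japan")
# d1 and the query now share the "japan" dimension even though
# the surface word "Japan" never appears in d1
print(cosine(d1, q) > 0)  # True
```

Without the cluster map, the query "Japan" and document d1 would have zero overlap; clustering (like stemming) trades away some precision of the raw term vector for better recall and a smaller representation.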
Supervised Learning/Training
Collective discrimination: trained recognizers route an input data stream into labelled output classes, e.g.
• A — Chihuahua Breeding Club
• B — Project #1
• C — Personal
• J — Junk mail
Training happens offline; recognition runs in real time (ongoing) over the stream, producing labelled (routed) output.

Other Related Problems: Mail/News Routing and Filtering
Messages from the data stream (e.g., #119, 121, 125, 131) are routed into prioritized inboxes:
• Project #1 at work
• Project #2 at work
• Chihuahua breeding
• Scuba club
• Personal
• Junk mail
These systems typically model long-term information needs (people put effort into training and user feedback that they aren't willing to invest for single query-based IR).

Features for Classification
• Subject line
• Source/sender
• X-annotations
• Date/time
• Length
• Other recipients
• Message content (regions weighted differently)

Probabilistic IR Models — Intermediate Topic Models/Detectors
• Topic detectors (topic models) TD_A, TD_B, …, TD_E map a document into topic space
• f(d1) = V1 and f(d2) = V2 are then compared with the query Q as topic-indicator vectors (e.g., V1 = 000100, V2 = 010100, Q = 100000)
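The routing/filtering setup above could be prototyped with a simple Naive Bayes classifier over message words. This is a hedged sketch under assumptions: the class labels, the toy training messages, and the idea of weighting the subject-line region more heavily than the body (one way to realize "regions weighted differently") are illustrative, not prescribed by the lecture.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesRouter:
    """Route messages to inboxes using word features; subject-line
    words are repeated to weight that region more heavily."""
    def __init__(self, subject_weight=3):
        self.subject_weight = subject_weight
        self.word_counts = defaultdict(Counter)  # label -> word counts
        self.label_counts = Counter()
        self.vocab = set()

    def _features(self, subject, body):
        words = subject.lower().split() * self.subject_weight
        return words + body.lower().split()

    def train(self, subject, body, label):
        self.label_counts[label] += 1
        for w in self._features(subject, body):
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def route(self, subject, body):
        def log_score(label):
            counts = self.word_counts[label]
            total = sum(counts.values())
            score = math.log(self.label_counts[label])
            for w in self._features(subject, body):
                # Laplace smoothing over the shared vocabulary
                score += math.log((counts[w] + 1) / (total + len(self.vocab)))
            return score
        return max(self.label_counts, key=log_score)

router = NaiveBayesRouter()
router.train("cheap pills now", "click here to buy pills", "junk")
router.train("project #1 status", "the milestone report is attached", "work")
print(router.route("pills discount", "buy now"))  # junk
```

Because users invest in ongoing training and feedback for these long-term needs, even a simple model like this keeps improving as more routed messages accumulate — an investment a user would not make for a single ad-hoc query.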