Course Introduction

SIMS 202: Information Organization and Retrieval
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2004
Credits to Marti Hearst for some of the slides in this lecture

IS 202 - Fall 2004 2004.08.31 - SLIDE 1

Today
• Introductions
• Course Overview
• Administrivia

IS202 Teaching Team
• Professor Ray Larson
• Professor Marc Davis
• TA Allison Billings
• TA Tran Tu

Who Am I?
• Professor and Associate Dean at SIMS
• Here from the founding of SIMS, faculty member of the “previous school”

What Do I Do?
• Research
  – Design, development and evaluation of information retrieval systems and digital libraries
  – Cheshire II and III
  – Bibliometrics of the WWW
  – Geographic information retrieval (GIR)
  – XML Retrieval
  – Applications of Grid computing to (large-scale) IR and Digital Libraries
  – Distributed search and retrieval
• Teaching
  – Information Retrieval
  – Database Management

Who Am I?
• Assistant Professor at SIMS (School of Information Management and Systems)
• Background
  – 1980–1984: B.A. from Wesleyan University in the College of Letters
  – 1984–1987: M.A. from the University of Konstanz in Literary Theory and Philosophy
  – 1990–1995: Ph.D. from MIT Media Laboratory in Media Arts and Sciences
  – 1993–1998: Member of the Research Staff and Project Coordinator at Interval Research Corporation
  – 1999–2002: Chairman and CTO of Amova

What Do I Do?
• Create technology and applications that will enable daily media consumers to become daily media producers
• Research and teaching in the theory, design, and development of digital media systems for creating and using media metadata to automate media production and reuse
  – Research
    • Director of Garage Cinema Research
    • Projects in Media Metadata, Active Capture, Adaptive Media, Mobile Media Metadata, and Social Uses of Personal Media
    • Executive Committee Member and Co-Founder of the Center for New Media
    • Affiliated Faculty Member of the Berkeley Institute of Design
  – Teaching
    • Multimedia Information
    • Digital Media Design Studio
    • Foundations of New Media

Student Introductions
• Who are you?
  – Name
  – Undergrad degree
  – Special areas of expertise and interest
• Why are you here?
  – What you want to learn from the course

Today
• Introductions
• Course Overview
• Administrivia

Goals of the Course
• Learn about
  – Design, development, and use of information organization and retrieval systems
  – Practical and theoretical foundations of information organization and analysis
  – Evaluation of information access systems
  – Cognitive and user-centric considerations
  – Hands-on experience with information systems

Two Main Themes
• Information Retrieval and the Search Process
• Information Organization and Design

Information Organization and Retrieval
• To organize is to (1) furnish with organs, make organic, make into living tissue, become organic; (2) form into an organic whole; give orderly structure to; frame and put into working order; make arrangements for.
• Knowledge is knowing, familiarity gained by experience; person’s range of information; a theoretical or practical understanding of; the sum of what is known.
• To retrieve is to (1) recover by investigation or effort of memory, restore to knowledge or recall to mind; regain possession of; (2) rescue from a bad state, revive, repair, set right.
• Information is (1) informing, telling; thing told, knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley

(Approximate) Course Schedule
• Retrieval
  – Overview
  – Introduction to the Search Process
  – Boolean Queries and Text Processing
  – Web Search Issues and Architecture
  – Statistical Properties of Text and Vector Representation
  – Probabilistic Ranking & Relevance Feedback
  – Evaluation
  – Interfaces for Information Retrieval
  – Database Design
• Organization
  – Phone Project Introduction
  – Categorization
  – Knowledge Representation
  – Lexical Relations and WordNet
  – Metadata Introduction
  – Controlled Vocabularies Introduction
  – Facetted Classification
  – Thesaurus Design and Construction
  – Semantic Web
  – Multimedia Information Organization and Retrieval
  – Metadata for Media
  – Phone Project Presentations

Information Properties
• Information can be communicated electronically
  – Broadcasting
  – Networking
• Information can be easily duplicated and shared
  – Problems of ownership
  – Problems of control
Adapted from ‘Silicon Dreams’ by Robert W. Lucky

Information Hierarchy
Wisdom
Knowledge
Information
Data

Information Hierarchy
• Data
  – The raw material of information
• Information
  – Data organized and presented by someone
• Knowledge
  – Information read, heard, or seen and understood
• Wisdom
  – Distilled and integrated knowledge and understanding

Information
Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
-- T.S. Eliot, “The Rock”
Where is the information we have lost in data?

Information Life Cycle
[Figure: information life cycle diagram. Stages: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating. Other labels: Creation, Utilization, Searching, Retention/Mining, Disposition, Discard, Active, Semi-Active, Inactive.]

Authoring/Modifying
• Converting data+information+knowledge to new information
• Creating information from observation, thought
• Editing and publication
• Gatekeeping

Organizing/Indexing
• Collecting and integrating information
• Affects data, information, and metadata
• “Metadata” describes data and information
  – More on this later
• Organizing information
  – Types of organization?
• Indexing

Storing/Retrieving
• Information storage
  – How and where is information stored?
• Retrieving information
  – How is information recovered from storage?
  – How do we find needed information?
  – Linked with accessing/filtering stage

Distribution/Networking
• Transmission of information
  – How is information transmitted?
• Networks vs.
broadcast

Accessing/Filtering
• Using the organization created in the O/I stage to:
  – Select desired (or relevant) information
  – Locate that information
  – Retrieve the information from its storage location (often via a network)

Using/Creating
• Using information
• Transformation of information to knowledge
• Knowledge to new data and new information

Key Issues in This Course
• How to find the appropriate information resources for someone’s (or your own) needs
  – Retrieving
• How to describe information resources in ways so that they may be effectively used by those who need to use them
  – Organizing

Key Issues
[Figure: information life cycle diagram. Stages: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating. Other labels: Creation, Utilization, Searching, Retention/Mining, Disposition, Discard, Active, Semi-Active, Inactive.]

(Approximate) Course Schedule
• Retrieval
  – Overview
  – Introduction to the Search Process
  – Boolean Queries and Text Processing
  – Web Search Issues and Architecture
  – Statistical Properties of Text and Vector Representation
  – Probabilistic Ranking & Relevance Feedback
  – Evaluation
  – Interfaces for Information Retrieval
  – Database Design
• Organization
  – Phone Project Introduction
  – Categorization
  – Knowledge Representation
  – Lexical Relations and WordNet
  – Metadata Introduction
  – Controlled Vocabularies Introduction
  – Facetted Classification
  – Thesaurus Design and Construction
  – Semantic Web
  – Multimedia Information Organization and Retrieval
  – Metadata for Media
  – Phone Project Presentations

Web Search Questions
• What do people search for?
• How do people use search engines?
  – How often do people find what they are looking for?
  – How difficult is it for people to find what they are looking for?
• How can search engines be improved?

What Do People Search for on the Web?
• Study by Spink et al., Oct 98
  – www.shef.ac.uk/~is/publications/infres/paper53.html
  – Survey on Excite, 13 questions
  – Data for 316 surveys

What Do People Search for on the Web?
• Topics
  – Genealogy/Public Figure: 12%
  – Computer related: 12%
  – Business: 12%
  – Entertainment: 8%
  – Medical: 8%
  – Politics & Government: 7%
  – News: 7%
  – Hobbies: 6%
  – General info/surfing: 6%
  – Science: 6%
  – Travel: 5%
  – Arts/education/shopping/images: 14%
• Something is missing…

What Do People Search for on the Web?
50,000 queries from Excite, 1997
Most frequent terms:
• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes

Why Do These Differ?
• Self-reporting survey
• The nature of language
  – Only a few ways to say certain things
  – Many different ways to express most concepts
    • UFO, flying saucer, space ship, satellite
    • How many ways are there to talk about history?

What is on the Web?
• 65002930 the
• 62789720 a
• 60857930 to
• 57248022 of
• 54078359 and
• 52928506 in
• 50686940 s
• 49986064 for
• 45999001 on
• 42205245 this
• 41203451 is
• 39779377 by
• 35439894 with
• 35284151 or
• 34446866 at
• 33528897 all
• 31583607 are
• 30998255 from
• 30755410 e
• 30080013 you
• 29669506 be
• 29417504 that
• 28542378 not
• 28162417 an
• 28110383 as
• 28076530 home
• 27650474 it
• 27572533 i
• 24548796 have
• 24420453 if
• 24376758 new
• 24171603 t
• 23951805 your
• 23875218 page
• 22292805 about
• 22265579 com
• 22107392 information
Source: http://elib.cs.berkeley.edu/docfreq/index.html

Intranet Queries (Aug 2000)
• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid

Intranet Queries
• Summary of sample data from 3 weeks of UCB queries
  – 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
  – 6.7% Schedule of classes or final exams (6222)
  – 5.4% Summer Session (5041)
  – 3.2% Extension (2932)
  – 3.1% Academic Calendar (2846)
  – 2.4% Directories (2202)
  – 1.7% Career Center (1588)
  – 1.7% Housing (1583)
  – 1.5% Map (1393)
• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page

Queries as Zeitgeist
From: http://www.google.com/press/zeitgeist.html

IR Issues in the Course
• What metadata is collected
• How the indexes are created
• How queries are formed
• How documents are ranked
• How shortest paths are computed
• How the system is built
  – … among
other things!
  – This is just an introduction! Much more on these issues in the first half of the course

IO Issues in the Course
• How do people categorize and represent information?
• What types of metadata are there and how do we construct and use them?
• How do we create ontologies for representing information, especially opaque data like photographs?
• What new uses and applications will metadata enable, especially for mobile media?
  – … among other things!
  – This is just an introduction! Much more on these issues in the second half of the course

Course Format
• Most classes will be lecture/discussion sessions
  – Lecture ~55 minutes
  – Discussion ~25 minutes
• For each class students will prepare discussion questions for each reading and help lead discussion
• Active participation is essential to your learning
• Some classes will be working sessions
  – Phone Project Presentations
  – Final Review
• Some classes will be exams
  – Midterm Exam
  – Final Exam

IS202 Course Project

Moore’s Law for Cameras
[Figure labels: 2000, 2002; $400: Kodak DC40, Kodak DX4900; $40: Nintendo GameBoy Camera, SiPix StyleCam Blink]

Capture+Processing+Interaction+Network

Camera Phones as Platform
• Media capture (images, video, audio)
• Programmable processing using open standard operating systems, programming languages, and APIs
• Wireless networking
• Personal information management functions
• Rich user interaction modalities
• Time, location, and user contextual metadata

Camera Phones as Platform
• In the first half of 2003, more camera phones were sold worldwide than digital cameras
• By 2008, the average camera phone is predicted to have 5 megapixel resolution
• Last month Casio and Samsung introduced 3.2 megapixel camera phones
with optical zoom and photo flash
• There are more cell phone users in China than people in the United States (300 million)
• For 90% of the world their “computer” is their cell phone

Phone Project Goals
• Experience the actual process of information organization and retrieval
  – Especially as regards mobile media metadata creation, sharing, and (re)use
• Work in small, focused teams performing a variety of tasks
  – Mobile image capture and sharing
  – Ontology creation
  – Image annotation
  – Mobile media application design
• Explore and design new applications for an emerging information organization and retrieval platform
• Develop an ongoing resource for SIMS (an annotated photo database) for
  – Internal research and teaching
  – External promotional and informational purposes

Phone Project Requirements
• Create engaging and useful application scenarios and photos
• Create a shared, reusable resource of annotated photos
  – All photos will be stored in one directory
  – Design your metadata
    • So that all photos would be accessible from all applications
    • Not only for the needs of your particular application, but also for the reusability of your photos and metadata

Assignments and Exams
• Approximately 12 assignments
  – Most due within one week to ten days
  – In second half, most related to the Phone Project
  – Sometimes “checked”, sometimes graded
• Final exam (during finals week)
• Grading
  – Assignments: 60%
    • Not evenly weighted
  – Final: 25%
  – Class Participation: 15%

Today
• Introductions
• Course Overview
• Administrivia

Readings
• Course Reader Part I of II
  – Should be available today at Copy Central on Bancroft
• Textbooks
  – Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
  – The Organization of Information, 2nd Edition. Arlene G.
Taylor, Libraries Unlimited, 1999

Recommended Course
• INFOSYS 290 / Section 16: XML Foundations
• Instructor: Bob Glushko
• Units: 1
• W 12:30-2, Th 3:30-5 (5 weeks only: Sept 8 - Oct 7), 110 South Hall

For Next Time (!)
• Readings
  – Borges, Dennett, and Reddy (in reader, Borges is also online via the class web site)
• On-Line Questionnaire
  – Information about you
  – Assignment 1 on “What is information, according to your background or area of expertise?”
  – Due this Thursday, Sept 2

Next Time
• More on what is information?
• And how much of it is out there?
• Discussion Questions for:
  – Borges?
  – Dennett?
  – Reddy?