Course Overview

Course Introduction
SIMS 202:
Information Organization
and Retrieval
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2004
Credits to Marti Hearst for some of the slides in this lecture
IS 202 - Fall 2004
2004.08.31 - SLIDE 1
Today
• Introductions
• Course Overview
• Administrivia
2004.08.31 - SLIDE 2
IS202 Teaching Team
• Professor Ray Larson
• Professor Marc Davis
• TA Allison Billings
• TA Tran Tu
2004.08.31 - SLIDE 4
Who Am I? (Ray Larson)
• Professor and Associate Dean at SIMS
• Here from the founding of SIMS, faculty
member of the “previous school”
2004.08.31 - SLIDE 5
What Do I Do?
• Research
– Design, development and evaluation of information
retrieval systems and digital libraries
– Cheshire II and III
– Bibliometrics of the WWW
– Geographic information retrieval (GIR)
– XML Retrieval
– Applications of Grid computing to (large-scale) IR and
Digital Libraries
– Distributed search and retrieval
• Teaching
– Information Retrieval
– Database Management
2004.08.31 - SLIDE 6
Who Am I? (Marc Davis)
• Assistant Professor at SIMS (School of
Information Management and Systems)
• Background
– 1980 – 1984: B.A. from Wesleyan University in the College of Letters
– 1984 – 1987: M.A. from the University of Konstanz in Literary Theory and Philosophy
– 1990 – 1995: Ph.D. from MIT Media Laboratory in Media Arts and Sciences
– 1993 – 1998: Member of the Research Staff and Project Coordinator at Interval Research Corporation
– 1999 – 2002: Chairman and CTO of Amova
2004.08.31 - SLIDE 7
What Do I Do?
• Create technology and applications that will enable daily
media consumers to become daily media producers
• Research and teaching in the theory, design, and
development of digital media systems for creating and
using media metadata to automate media production
and reuse
– Research
• Director of Garage Cinema Research
• Projects in Media Metadata, Active Capture, Adaptive Media, Mobile
Media Metadata, and Social Uses of Personal Media
• Executive Committee Member and Co-Founder of the Center for
New Media
• Affiliated Faculty Member of the Berkeley Institute of Design
– Teaching
• Multimedia Information
• Digital Media Design Studio
• Foundations of New Media
2004.08.31 - SLIDE 8
Student Introductions
• Who are you?
– Name
– Undergrad degree
– Special areas of expertise and interest
• Why are you here?
– What you want to learn from the course
2004.08.31 - SLIDE 9
Today
• Introductions
• Course Overview
• Administrivia
2004.08.31 - SLIDE 10
Goals of the Course
• Learn about
– Design, development, and use of information
organization and retrieval systems
– Practical and theoretical foundations of
information organization and analysis
– Evaluation of information access systems
– Cognitive and user-centric considerations
– Hands-on experience with information
systems
2004.08.31 - SLIDE 11
Two Main Themes
• Information Retrieval and the Search Process
• Information Organization and Design
2004.08.31 - SLIDE 12
Information Organization and Retrieval
• To organize is to (1) furnish with organs, make organic,
make into living tissue, become organic; (2) form into an
organic whole; give orderly structure to; frame and put
into working order; make arrangements for.
• Knowledge is knowing, familiarity gained by experience;
person’s range of information; a theoretical or practical
understanding of; the sum of what is known.
• To retrieve is to (1) recover by investigation or effort of
memory, restore to knowledge or recall to mind; regain
possession of; (2) rescue from a bad state, revive, repair,
set right.
• Information is (1) informing, telling; thing told,
knowledge, items of knowledge, news.
The Oxford English Dictionary, cf. Rowley
2004.08.31 - SLIDE 13
(Approximate) Course Schedule
• Retrieval
– Overview
– Introduction to the Search Process
– Boolean Queries and Text Processing
– Web Search Issues and Architecture
– Statistical Properties of Text and Vector Representation
– Probabilistic Ranking & Relevance Feedback
– Evaluation
– Interfaces for Information Retrieval
– Database Design
• Organization
– Phone Project Introduction
– Categorization
– Knowledge Representation
– Lexical Relations and WordNet
– Metadata Introduction
– Controlled Vocabularies Introduction
– Facetted Classification
– Thesaurus Design and Construction
– Semantic Web
– Multimedia Information Organization and Retrieval
– Metadata for Media
– Phone Project Presentations
2004.08.31 - SLIDE 14
Information Properties
• Information can be communicated
electronically
– Broadcasting
– Networking
• Information can be easily duplicated and
shared
– Problems of ownership
– Problems of control
Adapted from ‘Silicon Dreams’ by Robert W. Lucky
2004.08.31 - SLIDE 15
Information Hierarchy
Wisdom
Knowledge
Information
Data
2004.08.31 - SLIDE 16
Information Hierarchy
• Data
– The raw material of information
• Information
– Data organized and presented by someone
• Knowledge
– Information read, heard, or seen and
understood
• Wisdom
– Distilled and integrated knowledge and
understanding
2004.08.31 - SLIDE 17
Information
Where is the Life we have lost in living?
Where is the wisdom we have lost in knowledge?
Where is the knowledge we have lost in information?
-- T.S. Eliot, “The Rock”
Where is the information we have lost in data?
2004.08.31 - SLIDE 18
Information Life Cycle
[Figure: information life cycle diagram. Stage labels: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating. Other labels: Creation, Active, Semi-Active, Inactive, Utilization, Disposition, Retention/Mining, Searching, Discard.]
2004.08.31 - SLIDE 19
Authoring/Modifying
• Converting data+information+knowledge
to new information
• Creating information from observation,
thought
• Editing and publication
• Gatekeeping
2004.08.31 - SLIDE 20
Organizing/Indexing
• Collecting and integrating information
• Affects data, information, and metadata
• “Metadata” describes data and information
– More on this later
• Organizing information
– Types of organization?
• Indexing
2004.08.31 - SLIDE 21
Storing/Retrieving
• Information storage
– How and where is information stored?
• Retrieving information
– How is information recovered from storage?
– How do we find needed information?
– Linked with accessing/filtering stage
2004.08.31 - SLIDE 22
Distribution/Networking
• Transmission of information
– How is information transmitted?
• Networks vs. broadcast
2004.08.31 - SLIDE 23
Accessing/Filtering
• Using the organization created in the O/I
stage to:
– Select desired (or relevant) information
– Locate that information
– Retrieve the information from its storage
location (often via a network)
2004.08.31 - SLIDE 24
Using/Creating
• Using information
• Transformation of information to
knowledge
• Knowledge to new data and new
information
2004.08.31 - SLIDE 25
Key Issues in This Course
• How to find the appropriate information
resources for someone’s (or your own)
needs
– Retrieving
• How to describe information resources in
ways so that they may be effectively used
by those who need to use them
– Organizing
2004.08.31 - SLIDE 26
Key Issues
[Figure: information life cycle diagram, repeated. Stage labels: Authoring/Modifying, Organizing/Indexing, Storing/Retrieval, Distribution/Networking, Accessing/Filtering, Using/Creating. Other labels: Creation, Active, Semi-Active, Inactive, Utilization, Disposition, Retention/Mining, Searching, Discard.]
2004.08.31 - SLIDE 27
(Approximate) Course Schedule
• Retrieval
– Overview
– Introduction to the Search Process
– Boolean Queries and Text Processing
– Web Search Issues and Architecture
– Statistical Properties of Text and Vector Representation
– Probabilistic Ranking & Relevance Feedback
– Evaluation
– Interfaces for Information Retrieval
– Database Design
• Organization
– Phone Project Introduction
– Categorization
– Knowledge Representation
– Lexical Relations and WordNet
– Metadata Introduction
– Controlled Vocabularies Introduction
– Facetted Classification
– Thesaurus Design and Construction
– Semantic Web
– Multimedia Information Organization and Retrieval
– Metadata for Media
– Phone Project Presentations
2004.08.31 - SLIDE 28
Web Search Questions
• What do people search for?
• How do people use search engines?
– How often do people find what they are
looking for?
– How difficult is it for people to find what they
are looking for?
• How can search engines be improved?
2004.08.31 - SLIDE 29
What Do People Search for on the Web?
• Study by Spink et al., Oct 98
– www.shef.ac.uk/~is/publications/infres/paper53.html
– Survey on Excite, 13 questions
– Data for 316 surveys
2004.08.31 - SLIDE 30
What Do People Search for on the Web?
• Topics
– Genealogy/Public Figure: 12%
– Computer related: 12%
– Business: 12%
– Entertainment: 8%
– Medical: 8%
– Politics & Government: 7%
– News: 7%
– Hobbies: 6%
– General info/surfing: 6%
– Science: 6%
– Travel: 5%
– Arts/education/shopping/images: 14%
• Something is missing…
2004.08.31 - SLIDE 31
What Do People Search for on the Web?
50,000 queries from Excite, 1997
Most frequent terms:
• 4660 sex
• 3129 yahoo
• 2191 internal site admin check from kho
• 1520 chat
• 1498 porn
• 1315 horoscopes
• 1284 pokemon
• 1283 SiteScope test
• 1223 hotmail
• 1163 games
• 1151 mp3
• 1140 weather
• 1127 www.yahoo.com
• 1110 maps
• 1036 yahoo.com
• 983 ebay
• 980 recipes
2004.08.31 - SLIDE 32
Why Do These Differ?
• Self-reporting survey
• The nature of language
– Only a few ways to say certain things
– Many different ways to express most concepts
• UFO, flying saucer, space ship, satellite
• How many ways are there to talk about history?
2004.08.31 - SLIDE 33
What is on the Web?
• 65002930 the
• 62789720 a
• 60857930 to
• 57248022 of
• 54078359 and
• 52928506 in
• 50686940 s
• 49986064 for
• 45999001 on
• 42205245 this
• 41203451 is
• 39779377 by
• 35439894 with
• 35284151 or
• 34446866 at
• 33528897 all
• 31583607 are
• 30998255 from
• 30755410 e
• 30080013 you
• 29669506 be
• 29417504 that
• 28542378 not
• 28162417 an
• 28110383 as
• 28076530 home
• 27650474 it
• 27572533 i
• 24548796 have
• 24420453 if
• 24376758 new
• 24171603 t
• 23951805 your
• 23875218 page
• 22292805 about
• 22265579 com
• 22107392 information
Source: http://elib.cs.berkeley.edu/docfreq/index.html
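The source URL suggests that the counts above are document frequencies, that is, the number of pages containing each term. A minimal sketch of how such a table could be computed follows; the three sample pages, the tokenizer, and the function name are illustrative assumptions, not the method used for the Berkeley data.

```python
import re
from collections import Counter

def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for text in docs:
        # Lowercase, split on anything that is not a letter or digit, and
        # use a set so each term is counted at most once per document.
        terms = set(re.findall(r"[a-z0-9]+", text.lower()))
        df.update(terms)
    return df

# Hypothetical three-page collection (not the Berkeley data).
pages = [
    "Welcome to the home page of the department",
    "Information about the new course",
    "Bob's home page",
]
df = document_frequencies(pages)
for term, count in df.most_common(5):
    print(count, term)
```

Note that this tokenizer splits "Bob's" into "bob" and "s". Splitting on apostrophes in this way is one plausible reason that single letters such as "s" and "t" rank so highly in the table, although the slide does not say how the original text was tokenized.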
2004.08.31 - SLIDE 34
Intranet Queries (Aug 2000)
• 3351 bearfacts
• 3349 telebears
• 1909 extension
• 1874 schedule+of+classes
• 1780 bearlink
• 1737 bear+facts
• 1468 decal
• 1443 infobears
• 1227 calendar
• 989 career+center
• 974 campus+map
• 920 academic+calendar
• 840 map
• 773 bookstore
• 741 class+pass
• 738 housing
• 721 tele-bears
• 716 directory
• 667 schedule
• 627 recipes
• 602 transcripts
• 582 tuition
• 577 seti
• 563 registrar
• 550 info+bears
• 543 class+schedule
• 470 financial+aid
2004.08.31 - SLIDE 35
Intranet Queries
• Summary of sample data from 3 weeks of UCB queries
– 13.2% Telebears/BearFacts/InfoBears/BearLink (12297)
– 6.7% Schedule of classes or final exams (6222)
– 5.4% Summer Session (5041)
– 3.2% Extension (2932)
– 3.1% Academic Calendar (2846)
– 2.4% Directories (2202)
– 1.7% Career Center (1588)
– 1.7% Housing (1583)
– 1.5% Map (1393)
• Average query length over last 4 months: 1.8 words
• This suggests what is difficult to find from the home page
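A summary like the one above can be derived from a raw query log. The sketch below assumes the log is a list of query strings with "+" separating words, as in the Aug 2000 listing; the sample log, the normalization rule, and the function names are hypothetical and only illustrate the arithmetic.

```python
from collections import Counter

def normalize(query):
    """Merge spelling variants such as 'bear+facts' and 'bearfacts'.

    Used only for grouping; it deliberately removes word boundaries.
    """
    return query.lower().replace("+", "").replace("-", "")

def average_length(log):
    """Mean number of words per query; '+' separates words in the log."""
    return sum(len(q.split("+")) for q in log) / len(log)

def shares(log):
    """Percentage of all queries accounted for by each normalized form."""
    counts = Counter(normalize(q) for q in log)
    return {q: 100.0 * n / len(log) for q, n in counts.items()}

# Hypothetical five-query log, not the UCB data.
sample_log = [
    "bearfacts",
    "bear+facts",
    "telebears",
    "tele-bears",
    "schedule+of+classes",
]
print(average_length(sample_log))
print(shares(sample_log))
```

Grouping variants before counting matters here: in the raw listing, "bearfacts", "bear+facts", "telebears", and "tele-bears" appear as separate entries, while the summary reports them as a single category.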
2004.08.31 - SLIDE 36
Queries as Zeitgeist
From: http://www.google.com/press/zeitgeist.html
2004.08.31 - SLIDE 37
IR Issues in the Course
• What metadata is collected
• How the indexes are created
• How queries are formed
• How documents are ranked
• How shortest paths are computed
• How the system is built
– … among other things!
– This is just an introduction! Much more on
these issues in the first half of the course
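To make three of these questions concrete (index creation, query formation, and ranking), here is a toy sketch of an inverted index with Boolean AND queries ranked by raw term frequency. It is an illustration under simplifying assumptions, not the approach of any particular system covered in the course; the documents and function names are made up.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index

def search(index, query):
    """Boolean AND over the query terms, ranked by summed term frequency."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [index.get(t, {}) for t in terms]
    # Keep only documents that contain every query term.
    matches = set(postings[0])
    for p in postings[1:]:
        matches &= set(p)
    scored = [(sum(p[d] for p in postings), d) for d in matches]
    # Highest score first; ties broken by document id.
    return [d for score, d in sorted(scored, key=lambda s: (-s[0], s[1]))]

# Hypothetical documents.
docs = {
    1: "campus map and campus directory",
    2: "academic calendar",
    3: "map of the campus",
}
index = build_index(docs)
print(search(index, "campus map"))
```

Raw term frequency is the simplest possible ranking signal. The statistical and probabilistic ranking topics listed in the course schedule replace it with better-motivated weights.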
2004.08.31 - SLIDE 38
IO Issues in the Course
• How do people categorize and represent
information?
• What types of metadata are there and how do
we construct and use them?
• How do we create ontologies for representing
information, especially opaque data like
photographs?
• What new uses and applications will metadata
enable, especially for mobile media?
– … among other things!
– This is just an introduction! Much more on these
issues in the second half of the course
2004.08.31 - SLIDE 39
Course Format
• Most classes will be lecture/discussion sessions
– Lecture ~55 minutes
– Discussion ~25 minutes
• For each class students will prepare discussion questions for
each reading and help lead discussion
• Active participation is essential to your learning
• Some classes will be working sessions
– Phone Project Presentations
– Final Review
• Some classes will be exams
– Midterm Exam
– Final Exam
2004.08.31 - SLIDE 40
IS202 Course Project
2004.08.31 - SLIDE 41
Moore’s Law for Cameras
[Figure: camera comparison for 2000 and 2002 at the $400 and $40 price points. Cameras shown: Kodak DC40, Kodak DX4900, Nintendo GameBoy Camera, SiPix StyleCam Blink.]
2004.08.31 - SLIDE 42
Capture+Processing+Interaction+Network
2004.08.31 - SLIDE 43
Camera Phones as Platform
• Media capture (images, video,
audio)
• Programmable processing using
open standard operating
systems, programming
languages, and APIs
• Wireless networking
• Personal information
management functions
• Rich user interaction modalities
• Time, location, and user
contextual metadata
2004.08.31 - SLIDE 44
Camera Phones as Platform
• In the first half of 2003, more
camera phones were sold
worldwide than digital cameras
• By 2008, the average camera
phone is predicted to have 5
megapixel resolution
• Last month Casio and Samsung
introduced 3.2 megapixel
camera phones with optical
zoom and photo flash
• There are more cell phone users
in China than people in the
United States (300 million)
• For 90% of the world their
“computer” is their cell phone
2004.08.31 - SLIDE 45
Phone Project Goals
• Experience the actual process of information
organization and retrieval
– Especially as regards mobile media metadata creation, sharing,
and (re)use
• Work in small, focused teams performing a variety of
tasks
– Mobile image capture and sharing
– Ontology creation
– Image annotation
– Mobile media application design
• Explore and design new applications for an emerging
information organization and retrieval platform
• Develop an ongoing resource for SIMS (an annotated
photo database) for
– Internal research and teaching
– External promotional and informational purposes
2004.08.31 - SLIDE 46
Phone Project Requirements
• Create engaging and useful application
scenarios and photos
• Create a shared, reusable resource of
annotated photos
– All photos will be stored in one directory
– Design your metadata
• So that all photos would be accessible from all
applications
• Not only for the needs of your particular
application, but also for the reusability of your
photos and metadata
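One way to think about reusable metadata is as a shared record format that every application reads and writes. The sketch below is only an illustration: the field names are assumptions loosely based on the time, location, and user context mentioned earlier in the deck, not a schema prescribed for the project.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class PhotoRecord:
    """Hypothetical shared metadata record for one photo."""
    filename: str
    captured_at: str      # ISO 8601 timestamp, e.g. "2004-08-31T10:45:00"
    location: str
    photographer: str
    tags: list = field(default_factory=list)

def find_by_tag(records, tag):
    """Return the filenames of all photos annotated with the given tag."""
    return [r.filename for r in records if tag in r.tags]

# Made-up example records.
records = [
    PhotoRecord("img001.jpg", "2004-08-31T10:45:00", "South Hall",
                "student_a", ["lecture", "south hall"]),
    PhotoRecord("img002.jpg", "2004-09-02T15:10:00", "Sproul Plaza",
                "student_b", ["campus"]),
]

# Serializing to JSON keeps the record readable by any application.
print(json.dumps(asdict(records[0]), indent=2))
print(find_by_tag(records, "campus"))
```

Keeping application-specific annotations in a general field such as `tags`, alongside a small set of common fields, is one possible way to serve a particular application while leaving the photos reusable by others.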
2004.08.31 - SLIDE 47
Assignments and Exams
• Approximately 12 assignments
– Most due within one week to ten days
– In second half, most related to the Phone Project
– Sometimes “checked”, sometimes graded
• Final exam (during finals week)
• Grading
– Assignments: 60%
• Not evenly weighted
– Final: 25%
– Class Participation: 15%
2004.08.31 - SLIDE 48
Today
• Introductions
• Course Overview
• Administrivia
2004.08.31 - SLIDE 49
Readings
• Course Reader Part I of II
– Should be available today at Copy Central on
Bancroft
• Textbooks
– Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds.), Addison Wesley, 1999
– The Organization of Information, 2nd Edition, Arlene G. Taylor, Libraries Unlimited, 1999
2004.08.31 - SLIDE 50
Recommended Course
• INFOSYS 290 / Section 16 XML
Foundations
• Instructor: Bob Glushko
• Units: 1
• W 12:30-2 and Th 3:30-5 (5 weeks only: Sept 8 - Oct 7)
• 110 South Hall
2004.08.31 - SLIDE 51
For Next Time (!)
• Readings
– Borges, Dennett, and Reddy (in reader,
Borges is also online via the class web site)
• On-Line Questionnaire
– Information about you
– Assignment 1 on “What is information,
according to your background or area of
expertise?”
– Due this Thursday, Sept 2
2004.08.31 - SLIDE 52
Next Time
• More on what is information?
• And how much of it is out there?
• Discussion Questions for:
– Borges?
– Dennett?
– Reddy?
2004.08.31 - SLIDE 53