CS246 - UCLA.edu

CS246:
Web Information Systems
Junghoo “John” Cho
Spring 2015
CS246 by John Cho
Course Information




Web page: http://oak.cs.ucla.edu/cs246/
Topic: Web information management
Time: MW 10:00-11:50 am
Instructor: Junghoo “John” Cho



office: 3531H Boelter Hall
email: [email protected]
 please use subject “CS246: …”
office hours: Mon 1-2 pm.
Who is this class for?



Strong interest in research
Interest in Web information systems
Time commitment:

Around 2-3 papers every week


Typically one full day of paper reading
One independent project

Similar to paper writing


In fact we read papers from past student projects!
Or interesting application implementation
Today’s Topics


Overview of the course topics
Course logistics


Paper reading assignments
Class project
Prerequisite

Introductory database, e.g., CS143



Basic algorithms and data structures
Basic probability and statistics


P(A|C), Bayes rule, …
Design and implementation experience


e.g.: query? SQL?
Basic C++
Quick test: Grab a sample paper

See if you can read, understand and build it
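
As a quick refresher on the notation in the probability prerequisite, Bayes' rule relates the two conditional probabilities of events A and C. The statement below assumes P(C) > 0; the event names follow the P(A|C) notation used on this slide.

```latex
P(A \mid C) = \frac{P(C \mid A)\, P(A)}{P(C)}
```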
Tell Us About You






Name
Department & Program
Before coming to UCLA
Brief history at UCLA
Technical/research interests
Expectation from the class
Information Galore
Biblio server
Legacy database
Plain text files
Central Problem


How to manage/access information on the Web?
Three major approaches

Central indexing
E.g., Web search engine
Dynamic integration
E.g., comparison shopping services
Data extraction
E.g., spamming companies
Topic: Web Search (Central Indexing)
Central Index
Topic: Web Search (Central Indexing)

Web: collection of passive HTML pages


Traditional Information Retrieval:



Find Web pages relevant to a query
Web = collection of HTML pages
HTML page = a bag of words
More than that?



Links, structure of the Web
User access patterns
HTML tags (markups)
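
The "HTML page = a bag of words" view can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the two sample pages, the crude regex tag stripping, and the scoring rule (sum of query-word counts) are not taken from the course material. Note that this view throws away exactly the signals listed above: links, access patterns, and markup.

```python
from collections import Counter
import re


def bag_of_words(html: str) -> Counter:
    """Strip tags, lowercase the text, and count each word."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag removal, enough for a sketch
    words = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(words)


def score(query: str, bag: Counter) -> int:
    """Hypothetical relevance score: total occurrences of the query words."""
    return sum(bag[word] for word in query.lower().split())


# Two made-up pages standing in for a Web collection.
pages = {
    "page1": "<html><body><h1>Web search</h1>"
             "<p>Search engines index the Web.</p></body></html>",
    "page2": "<html><body><p>Cooking pasta at home.</p></body></html>",
}

bags = {name: bag_of_words(html) for name, html in pages.items()}
ranked = sorted(bags, key=lambda name: score("web search", bags[name]), reverse=True)
```

Here `ranked` lists the pages from most to least relevant under this toy score; word order and the `<h1>` emphasis play no role in the ranking.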
Topic: Dynamic Integration
Amazon.com
Cars.com
Apartments.com
401carfinder.com
Topic: Dynamic Integration
Mediator
Wrapper
Source 1
Wrapper
Source 2
Wrapper
Source n
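
The mediator/wrapper picture can be sketched in a few lines. This is only an illustration under stated assumptions: the `Listing` record, the `make_wrapper` helper, the in-memory catalogs standing in for real sites, and the merge-by-price policy are all hypothetical and not part of the course material.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Listing:
    """Shared record format that every wrapper returns (hypothetical)."""
    source: str
    title: str
    price: float


# A wrapper takes the mediator's keyword query and returns the matching
# records of one source, converted into the shared Listing format.
Wrapper = Callable[[str], List[Listing]]


def make_wrapper(source_name: str, catalog: Dict[str, float]) -> Wrapper:
    """Build a wrapper over an in-memory catalog standing in for a real site."""
    def wrapper(keyword: str) -> List[Listing]:
        return [
            Listing(source_name, title, price)
            for title, price in catalog.items()
            if keyword.lower() in title.lower()
        ]
    return wrapper


class Mediator:
    """Forwards one query to every wrapper and merges the answers."""

    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def search(self, keyword: str) -> List[Listing]:
        results: List[Listing] = []
        for wrapper in self.wrappers:
            results.extend(wrapper(keyword))
        return sorted(results, key=lambda item: item.price)  # cheapest first


mediator = Mediator([
    make_wrapper("source1", {"Used sedan": 9000.0, "Used truck": 15000.0}),
    make_wrapper("source2", {"Sedan, low mileage": 8500.0}),
])
offers = mediator.search("sedan")
```

The user talks only to the mediator; adding Source n means adding one more wrapper, with no change to the query interface.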
Topic: Data Extraction
Structured data
Beatles $10
Madonna $20
NSync $20

WWW
How can we extract “structured data” from
free text automatically?
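
To make the target concrete, here is a minimal sketch that pulls (name, price) records out of free text with a hand-written pattern. The sample sentence, the `PRICE_PATTERN` regex, and the `extract_prices` helper are assumptions made for illustration. A hand-written rule like this is the easy part; doing it automatically, without a person writing the pattern for each site, is the problem the slide poses.

```python
import re
from typing import List, Tuple

# Hypothetical rule: a capitalized word followed by a dollar amount.
PRICE_PATTERN = re.compile(r"([A-Z][A-Za-z]*)\s+\$(\d+)")


def extract_prices(text: str) -> List[Tuple[str, int]]:
    """Return (name, price) records found in free text."""
    return [(name, int(price)) for name, price in PRICE_PATTERN.findall(text)]


# Made-up page text containing the records shown on the slide.
page_text = "Today only: Beatles $10, Madonna $20 and NSync $20 while supplies last."
records = extract_prices(page_text)
```

The rule is brittle by design: a multi-word artist name or a price written as "20 dollars" would already be missed.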
Main Course Workload

Paper reading




Paper reading assignments
Class discussion
We mainly focus on “central indexing”
Independent projects
High-Level Goal

Learn core ideas and techniques



Some of the techniques can be useful for other fields
Learn how to read papers
Hopefully learn what it is like to do research

Sometimes very frustrating but often very rewarding
Paper Reading

Why:
Something that you will do all the time as a researcher
Learn to be critical and communicate well
Acquire knowledge to conduct research/project
About 20 papers from
Conferences: SIGMOD, VLDB, WWW, and …
Before the class:
Everyone: read and review the paper
During the class:
Instructor: present his own understanding and lead class discussion
Everyone: participate!!!
How to Get Papers

From the class homepage
http://oak.cs.ucla.edu/cs246/
Some of the materials are password protected
User name: cs246
Password: papers
Let me know if you have any problems
How to Read Papers






Understand the “Big Picture”
What is the problem?
Why is it important?
Why is it difficult?
What has this paper done?
What have others done?
Paper Reviews (1)

Due by the preceding Sunday


Submit through our Web submission interface on the class Web page
Required components: at most 3 paragraphs

Summary (1 paragraph): your own words
This paper discusses how to optimize queries with...

Comments/criticisms (1-2 paragraphs): the good & the bad
It addresses a real problem and the solution is interesting …
But I feel the experiments are not realistic because...

Optional: questions, as many as you want
Why do the authors assume that queries are independent?
Paper Reviews (2)


May skip 3 paper summaries without penalty
Most reviews will get full score unless they are written extremely poorly
Class Project

Why:





Work on a specific problem and learn to find a solution
40% of the class grade
Team of up to 3
Topic: any problem related to the general problem
Open style


Rigorous study of a research problem or
Any interesting system implementation
Class Project Schedule

Important Milestones







Group formation: 4/08 (2nd week Wed)
Project proposal: 4/10 (3rd week Sun)
Project progress: 5/06 (6th week Wed)
Final report: 5/20 (8th week Sun)
Project presentation: 9th and 10th weeks
You are responsible for staying on track
Make appointments with instructor as needed
Project: Please Remember



Aim high, but be realistic
Expect to read at least 4-5 papers along the way
Start early




Don’t do it right before the deadline
There are always unexpected obstacles
Some students could not finish in previous quarters
 Please, please start early
You are responsible for staying on track
Grading



Midterm: 40%
Paper reviews: 20%
Project: 40%
Announcements

First review due Sunday 4/05

Three papers for class 3 and 4



Graph structure in the Web
The Anatomy of a Large-Scale Hypertextual …
Authoritative sources in a hyperlinked environment