Semex: A Platform for Personal
Information Management and
Integration
Xin (Luna) Dong
University of Washington
June 24, 2005
Is Your Personal Information a Mine or a Mess?
[Figure labels: Intranet, Internet]
Questions Hard to Answer
• Where are my SEMEX papers and presentation slides (maybe in an attachment)?
  → Index Data from Different Sources, e.g., Google, MSN desktop search
• Who is working on SEMEX?
• What are the emails sent by my PKU alumni?
• What are the phone numbers and emails of my coauthors?
  → Organize Data in a Semantically Meaningful Way
• Which of the SIGMOD’05 authors do I know?
  → Integrate Organizational and Public Data with Personal Data
[Figure labels: Intranet, Internet]
[Figure: example associations among personal data items: AttachedTo, Recipient, ConfHomePage, ExperimentOf, CourseGradeIn, PublishedIn, Cites, EarlyVersion, ArticleAbout, PresentationFor, Sender, ComeFrom, FrequentEmailer, CoAuthor, BudgetOf, OriginatedFrom, HomePage, AddressOf]
SEMEX (SEMantic EXplorer) – I. Provide a Logical View of Data
[Figure: logical view over the data sources Mail & calendar, Papers, Presentations, HTML, and Files. Labels: Event; Organizer, Participants; Message; Sender, Recipients; Person; Homepage; Web Page; Cached; Document; Author; Paper; Cites; Softcopy; Presentation]
SEMEX (SEMantic EXplorer) – II. On-the-fly Data Integration
[Figure: the same logical view. Labels: Event; Organizer, Participants; Message; Sender, Recipients; Person; Homepage; Web Page; Cached; Document; Author; Paper; Cites; Softcopy; Presentation]
How to Find Alon’s Papers on My Desktop?
[Screenshots: Google search results for “Alon Halevy”; the results include emails such as “Send me the semex demo slides again?” and “Ignore previous request, I found them”]
Semex Goal
• Build a Personal Information Management (PIM) system prototype that provides a logical view of personal information
  • Build the logical view automatically
    • Extract object instances and associations
    • Remove instance duplications
  • Leverage the logical view for on-the-fly data integration
  • Exploit the logical view for information search and browsing to improve people’s productivity
  • Be resilient to the evolution of the logical view
An Ideal PIM is a Magic Wand
Outline
• Problem definition and project goals
• Technical issues:
  • System architecture and instance extraction [CIDR’05]
  • Reference reconciliation [Sigmod’05]
  • On-the-fly data integration
  • Association search and browsing
  • Domain model personalization and evolution [WebDB’05]
• Interleaved with Semex demo [Best demo in Sigmod’05]
• Overarching PIM Themes
System Architecture
[Figure: system architecture. Components shown: Domain Management Module (Domain Manager, Domain Model); Data Analysis Module (Searcher, Browser, Analyzer); Data Collection Module (Extractors, Integrator, Reference Reconciliator, Indexer); Association DB (Objects, Associations); Index. Data sources: Word, PPT, PDF, Latex, Email, Webpage, Excel, DB]
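Reference Reconciliation in Semex
[Figure: the many references to one person found in personal data, shown as Names and Emails: “Xin (Luna) Dong”, “Lab-#dong xin”, “dong xin luna”, “董欣”, “xinluna dong”, “luna”, “x. dong”, “dongxin”, “xin dong”]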
Semex Without Reference Reconciliation
Search results for luna: 23 persons, for example
• luna dong: SenderOfEmails(3043), RecipientOfEmails(2445), MentionedIn(94)
• Xin (Luna) Dong: AuthorOfArticles(49), MentionedIn(20)
[Screenshot: “A Platform for Personal Information Management and Integration” shown with 9 Persons, among them “dong xin” and “xin dong”]
Semex NEEDS Reference Reconciliation
Reference Reconciliation
• A very active area of research in Databases, Data Mining and AI (surveyed in [Cohen, et al. 2003])
• Traditional approaches assume matching tuples from a single table
  • Based on pair-wise comparisons
• Harder in our context
Challenges
Article:
a1=(“Bounds on the Sample Complexity of Bayesian Learning”, “703-746”, {p1,p2,p3}, c1)
a2=(“Bounds on the sample complexity of bayesian learning”, “703-746”, {p4,p5,p6}, c2)
Venue:
c1=(“Computational learning theory”, “1992”, “Austin, Texas”)
c2=(“COLT”, “1992”, null)
Person:
p1=(“David Haussler”, null)
p2=(“Michael Kearns”, null)
p3=(“Robert Schapire”, null)
p4=(“Haussler, D.”, null)
p5=(“Kearns, M. J.”, null)
p6=(“Schapire, R.”, null)
p7=(“Robert Schapire”, “schapire@research.att.com”)
p8=(null, “mkearns@cis.uppen.edu”)
p9=(“mike”, “mkearns@cis.uppen.edu”)
Challenges illustrated by the example:
1. Multiple Classes
2. Limited Information
3. Multi-value Attributes
Intuition
• Complex information spaces can be considered as networks of instances and associations between the instances
• Key: exploit the network, specifically, the clues hidden in the associations
I. Exploiting Richer Evidence
• Cross-attribute similarity – Name&email
  • p5=(“Stonebraker, M.”, null)
  • p8=(null, “stonebraker@csail.mit.edu”)
• Context Information I – Contact list
  • p5=(“Stonebraker, M.”, null, {p4, p6})
  • p8=(null, “stonebraker@csail.mit.edu”, {p7})
  • p6=p7
• Context Information II – Authored articles
  • p2=(“Michael Stonebraker”, null)
  • p5=(“Stonebraker, M.”, null)
  • p2 and p5 authored the same article
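The cross-attribute comparison of a name against an email address can be sketched as follows. This is only an illustrative heuristic, not the matcher used in Semex: the functions `name_tokens` and `name_matches_email` and the three-letter threshold are assumptions made for this example.

```python
import re

def name_tokens(name):
    """Lower-case alphabetic tokens of a name such as 'Stonebraker, M.'."""
    return [t for t in re.split(r"[^a-z]+", name.lower()) if t]

def name_matches_email(name, email):
    """Assumed heuristic: some name token of at least three letters
    appears inside the local part of the email address."""
    local = email.split("@", 1)[0].lower()
    return any(len(t) >= 3 and t in local for t in name_tokens(name))

# The two references from the slide: one has only a name, the other only an email.
p5 = ("Stonebraker, M.", None)
p8 = (None, "stonebraker@csail.mit.edu")
same_person_candidate = name_matches_email(p5[0], p8[1])
```

Attribute-wise comparison alone finds nothing to compare between p5 and p8, because they have no attribute in common; comparing across attributes gives the pair a chance to match.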
Considering Only Attribute-wise Similarities Cannot Merge Persons Well
[Bar chart: #(Person Partitions) versus Evidence. With attribute-wise similarity only, 3159 partitions remain, 1409 more than the gold standard]
Person references: 24076
Real-world persons (gold-standard): 1750
Considering Richer Evidence Improves the Recall
[Bar chart: #(Person Partitions) versus Evidence. Data labels: Attr-wise 3159, Name&Email 2169, Article 2169, Contact 2096. Gap annotations: 1409, 346]
Person references: 24076
Real-world persons: 1750
II. Propagate Information between
Reconciliation Decisions

Article:
a1=(“Distributed Query Processing”,“169-180”, {p1,p2,p3}, c1)
a2=(“Distributed query processing”,“169-180”, {p4,p5,p6}, c2)

Venue:
c1=(“ACM Conference on Management of Data”, “1978”, “Austin, Texas”)
c2=(“ACM SIGMOD”, “1978”, null)

Person:
p1=(“Robert S. Epstein”, null)
p2=(“Michael Stonebraker”, null)
p3=(“Eugene Wong”, null)
p4=(“Epstein, R.S.”, null)
p5=(“Stonebraker, M.”, null)
p6=(“Wong, E.”, null)
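One way to picture propagation on this example is the sketch below. It is not the algorithm from the talk: the union-find bookkeeping, the `surname` helper, and the rule that a merged article lets weak evidence (same surname, same year) merge its authors and venues are all assumptions chosen to keep the example small.

```python
class UnionFind:
    """Tracks which references have been reconciled into one partition."""
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

# (normalized title, pages, author references, venue reference)
articles = {
    "a1": ("distributed query processing", "169-180", ["p1", "p2", "p3"], "c1"),
    "a2": ("distributed query processing", "169-180", ["p4", "p5", "p6"], "c2"),
}
persons = {
    "p1": "Robert S. Epstein", "p2": "Michael Stonebraker", "p3": "Eugene Wong",
    "p4": "Epstein, R.S.", "p5": "Stonebraker, M.", "p6": "Wong, E.",
}
venue_year = {"c1": "1978", "c2": "1978"}

def surname(name):
    """'Stonebraker, M.' and 'Michael Stonebraker' both give 'stonebraker'."""
    if "," in name:
        return name.split(",")[0].strip().lower()
    return name.split()[-1].lower()

uf = UnionFind()
t1, pages1, authors1, venue1 = articles["a1"]
t2, pages2, authors2, venue2 = articles["a2"]
if (t1, pages1) == (t2, pages2):
    uf.union("a1", "a2")  # decision 1: the two articles are the same
    # Propagation (assumed rule): once the articles are merged, weak
    # evidence is enough to merge their authors and their venues.
    for x in authors1:
        for y in authors2:
            if surname(persons[x]) == surname(persons[y]):
                uf.union(x, y)
    if venue_year[venue1] == venue_year[venue2]:
        uf.union(venue1, venue2)
```

Without the article decision, “ACM Conference on Management of Data” and “ACM SIGMOD” share little string similarity; the merged articles are what make the venue pair worth merging.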
Propagating Information between Reconciliation Decisions Further Improves Recall
[Bar chart: #(Person Partitions) versus Evidence (Attr-wise, Name&Email, Article, Contact). Data labels: Traditional 3159, 2169, 2169, 2096; Propagation 3159, 2146, 2135, 2022]
Person references: 24076
Real-world persons: 1750
III. Reference Enrichment
p2=(“Michael Stonebraker”, null, {p1,p3})
p8=(null, “stonebraker@csail.mit.edu”, {p7})
p9=(“mike”, “stonebraker@csail.mit.edu”, null)
p8-9=(“mike”, “stonebraker@csail.mit.edu”, {p7})
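The merge of p8 and p9 into the enriched reference p8-9 can be sketched as below. The `merge_refs` helper and the choice to keep the first non-null value per attribute are assumptions for illustration; the slide only shows the inputs and the merged result.

```python
def merge_refs(a, b):
    """Merge two person references (name, email, contacts) into one
    enriched reference that keeps every value either of them knows."""
    name = a[0] or b[0]
    email = a[1] or b[1]
    contacts = (a[2] or set()) | (b[2] or set())
    return (name, email, contacts)

p8 = (None, "stonebraker@csail.mit.edu", {"p7"})
p9 = ("mike", "stonebraker@csail.mit.edu", None)

# p8 and p9 share an email address, so they are reconciled and merged.
if p8[1] is not None and p8[1] == p9[1]:
    p8_9 = merge_refs(p8, p9)
```

The enriched reference carries a name, an email, and a contact list at once, so a later comparison against p2 has more evidence to work with than either p8 or p9 offered alone.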
Reference Enrichment Improves Recall More than Information Propagation
[Bar chart: #(Person Partitions) versus Evidence (Attr-wise, Name&Email, Article, Contact). Series: Traditional, Enrichment, Propagation. Data labels: Traditional 3159, 2169, 2169, 2096; Enrichment 2036, 2036, 1910 for Name&Email, Article, Contact; one further label 3169]
Person references: 24076
Real-world persons: 1750
Applying Both Information Propagation and Reference Enrichment Gets the Highest Recall
[Bar chart: #(Person Partitions) versus Evidence (Attr-wise, Name&Email, Article, Contact). Series: Traditional, Enrichment, Propagation, Full. Data labels: Traditional 3159, 2169, 2169, 2096; Full 2002, 1990, 1873 for Name&Email, Article, Contact; one further label 3169. Gap annotations: 1409, 346, 125]
Person references: 24076
Real-world persons: 1750
Importing External Data Sources
[Figure: the logical view. Labels: Event; Organizer, Participants; Message; Sender, Recipients; Person; Homepage; Web Page; Cached; Document; Author; Paper; Cites; Softcopy; Presentation]
Intuition: Explore associations in schema mapping
• Traditional approaches: proceed in two steps
  • Step 1. Schema matching (surveyed in [Rahm&Bernstein, 2001])
    • Generate term matching candidates
    • E.g., “paperTitle” in table Author matches “title” in table Article
  • Step 2. Query discovery [Miller et al., 2000]
    • Take term matching as input, generate mapping expressions (typically queries)
    • E.g., SELECT Article.title as paperTitle, Person.name as author
      FROM Article, Person
      WHERE Article.author = Person.id
• User’s input is needed to fill in the gap between Step 1 output and Step 2 input
• Our approach: check association violations to filter inappropriate matching candidates
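A minimal sketch of filtering matching candidates by association violations is given below. The slides do not define a violation precisely, so this example assumes one reading: a combination of candidate matches is kept only if the domain-model classes it touches are connected through associations. The class and attribute names follow the talk's integration example; `connected` and `consistent_mappings` are hypothetical helpers.

```python
from itertools import product

# Associations of the domain model, as unordered class pairs.
ASSOCIATIONS = {
    ("Book", "Person"),         # authoredBy
    ("Article", "Person"),      # authoredBy
    ("Article", "Conference"),  # publishedIn
}

def connected(classes):
    """True if the classes form one connected group under ASSOCIATIONS."""
    classes = set(classes)
    seen, stack = set(), [next(iter(classes))]
    while stack:
        c = stack.pop()
        if c in seen:
            continue
        seen.add(c)
        for a, b in ASSOCIATIONS:
            if a == c and b in classes:
                stack.append(b)
            if b == c and a in classes:
                stack.append(a)
    return seen == classes

# Step 1 output for Webpage-item(title, author, conf): term matching candidates.
candidates = {
    "title": [("Book", "title"), ("Article", "title")],
    "author": [("Person", "name")],
    "conf": [("Conference", "name")],
}

def consistent_mappings(candidates):
    """Keep only candidate combinations that violate no association."""
    attrs = sorted(candidates)
    kept = []
    for combo in product(*(candidates[a] for a in attrs)):
        if connected(cls for cls, _ in combo):
            kept.append(dict(zip(attrs, combo)))
    return kept

mappings = consistent_mappings(candidates)
```

Under this reading, matching `title` to Book.title is rejected because no association links Book to Conference, which leaves Article.title as the only consistent choice without asking the user.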
Integration Example
[Figure: domain model classes Person(name, email), Book(title, year), Article(title, page), Conference(name, year), linked by the associations authoredBy, authoredBy, and publishedIn; the external source to integrate is Webpage-item(title, author, conf, year)]
Explore the association network – 1. Find the relationship between two instances
• Example: How did I know this person?
• Solution: Lineage
  • Find an association chain between two object instances
  • Shortest chain?
  • “Earliest” chain OR “Latest” chain
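Finding a shortest association chain can be sketched with a breadth-first search over the association network. The sample edges below are invented for illustration, and treating every association as traversable in both directions is an assumption of this sketch.

```python
from collections import deque

def shortest_chain(edges, start, goal):
    """Return a shortest chain [obj, assoc, obj, ...] from start to goal,
    or None if the two instances are not connected."""
    graph = {}
    for a, label, b in edges:
        graph.setdefault(a, []).append((label, b))
        graph.setdefault(b, []).append((label, a))  # assumed: undirected
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for label, nxt in graph.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(path + [label, nxt])
    return None

# Hypothetical association instances: (object, association, object).
edges = [
    ("me", "Recipient", "msg1"),
    ("msg1", "Sender", "Alon"),
    ("me", "Author", "paper1"),
    ("paper1", "Cites", "paper2"),
    ("paper2", "Author", "Alon"),
]
chain = shortest_chain(edges, "me", "Alon")
```

An “earliest” or “latest” chain would need timestamps on the associations and a different search criterion; breadth-first search only answers the shortest-chain variant.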
Explore the association network – 2. Find all instances related to a given keyword
• Example: Who is working on “Schema Matching”?
• Solution:
  • Naive approach: index object instances on attribute values
  • Our approach: index objects on the attributes of associated objects
    • A list of papers on schema matching
    • A list of emails on schema matching
    • A list of persons working on schema matching
    • A list of conferences for schema-matching papers
    • A list of institutes that conduct schema-matching research
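The difference between the two indexing approaches can be sketched as follows. The objects, their text values, and the one-hop neighbourhood are assumptions made for this example; the talk does not say how far along the associations the index reaches.

```python
from collections import defaultdict

# Hypothetical object instances: id -> (class, text of its attribute values).
objects = {
    "paper1": ("Paper", "A survey of schema matching"),
    "person1": ("Person", "Alon"),
    "conf1": ("Conference", "SIGMOD"),
}
# Hypothetical association instances: (object, association, object).
associations = [
    ("paper1", "Author", "person1"),
    ("paper1", "PublishedIn", "conf1"),
]

def build_index(objects, associations):
    """Inverted index: word -> ids of objects reachable through that word."""
    index = defaultdict(set)

    def add(obj_id, text):
        for word in text.lower().split():
            index[word].add(obj_id)

    for obj_id, (_, text) in objects.items():
        add(obj_id, text)            # naive part: the object's own values
    for a, _, b in associations:
        add(a, objects[b][1])        # plus the values of associated objects
        add(b, objects[a][1])
    return index

def search(index, phrase):
    """Ids of objects indexed under every word of the phrase."""
    sets = [index.get(w, set()) for w in phrase.lower().split()]
    return set.intersection(*sets) if sets else set()

hits = search(build_index(objects, associations), "schema matching")
```

With the naive index only `paper1` would be returned; indexing on associated objects also returns the author and the conference, which is what a question like "who is working on schema matching" needs.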
Explore the association network – 3. Rank returned instances in a keyword search
• Example: What are important papers on “schema matching”?
• Solution:
  • Naive approach: rank by TF/IDF metric
  • Our approach: ranking by
    • Significance score: PageRank measure
    • Relevance score: TF/IDF metric
    • Usage score: last visit time and modification time
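The slide names three scores but not how they are combined. The sketch below assumes a weighted sum over scores already normalized to the range 0 to 1; the weights and the sample values are invented for illustration.

```python
def combined_score(significance, relevance, usage, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three scores (the weights are an assumption)."""
    w_sig, w_rel, w_use = weights
    return w_sig * significance + w_rel * relevance + w_use * usage

def rank(candidates):
    """Sort (id, significance, relevance, usage) tuples, best first."""
    return sorted(candidates, key=lambda c: combined_score(*c[1:]), reverse=True)

# Hypothetical, pre-normalized scores for three papers.
papers = [
    ("paperC", 0.1, 0.2, 0.1),
    ("paperB", 0.2, 0.6, 0.9),
    ("paperA", 0.9, 0.5, 0.1),
]
ranked_ids = [p[0] for p in rank(papers)]
```

Ranking by relevance alone would put `paperB` first; adding the significance score moves the heavily linked `paperA` ahead of it.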

Explore the association network – 4. Fuzzy Queries
• Queries we pose today: something we can describe
  • Find me something with (related to) keyword X
  • Find me the co-authors of Person Y
• Fuzzy queries:
  • Q: What do I want to know?
  • A: In this webpage, 5 papers are written by your friends
  • Q: What significant things have happened today?
  • A: The President wrote an email to you!!
The Domain Model
• The logical view is described with a domain model
• Semex provides very basic classes and associations as a default domain model
• Users can personalize the domain model
[Figure: default domain model. Labels: Event; Organizer, Participants; Message; Sender, Recipients; Person; Homepage; Web Page; Cached; Document; Author; Paper; cite; Softcopy; Presentation]
Problems in Domain Model Personalization
• Problem: hard to precisely model a domain
  • At a certain point we are not able to give a precise domain model
    • Not enough knowledge of the domain
    • Inherent evolution of a domain
    • Non-existence of a precise model
  • Overly detailed models may be a burden to users
    • Modeling every detail of the information on one’s desktop is often overwhelming
  • We may want to leave part of the domain unstructured
    • Extract descriptions at different levels of granularity
    • Address vs. street, city, state, zip
Malleable Schemas
• Key idea: capture the important aspects of the domain model without committing to a strict schema
[Figure labels: Unstructured data sources, Structured data sources, Malleable Schema, Clean Schema]
Malleable Schema
• Introduce “text” into schemas
  • Phrases as element names, e.g., “InitialPlanningPhaseParticipant”
  • Regular expressions as element names, e.g., “*Phone”, “State|Province”
  • Chains as element names, e.g., “name/firstName”
• Introduce imprecision into queries
  SELECT S.~name, S.~phone
  FROM Student as S, ~Project as P
  WHERE (S ~initialParticipant P) AND (P.name = “Semex”)
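Matching concrete element names against pattern-style element names such as “*Phone” and “State|Province” can be sketched as below. The interpretation is an assumption: `|` is read as separating alternatives and `*` as a wildcard, handled with Python's `fnmatch` module rather than a full regular-expression engine. `element_matches` is a hypothetical helper, not part of any malleable-schema system.

```python
from fnmatch import fnmatchcase

def element_matches(pattern, name):
    """True if a concrete element name fits a malleable element pattern.

    Assumed syntax: '|' separates alternatives, '*' matches any run of
    characters, and matching is case-sensitive.
    """
    return any(fnmatchcase(name, alt) for alt in pattern.split("|"))

# Concrete element names found in some hypothetical data source.
names = ["homePhone", "officePhone", "phoneNumber", "Province"]
phone_elements = [n for n in names if element_matches("*Phone", n)]
region_elements = [n for n in names if element_matches("State|Province", n)]
```

A query term such as `S.~phone` would need a looser, similarity-based match than this; the sketch covers only the pattern-style element names.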
Overarching PIM Themes
• It is PERSONAL data!
  • How to build a system supporting users in their own habitat?
  • How to create an ‘AHA!’ browsing experience and increase users’ productivity?
• There can be any kind of INFORMATION
  • How to combine structured and unstructured data?
• We are pursuing life-long data MANAGEMENT
  • What is the right granularity for modeling personal data?
  • How to manage data and schema that evolve over time?
Related Work
• Personal Information Management Systems
  • Indexing
    • Stuff I’ve Seen (MSN Desktop Search) [Dumais et al., 2003]
    • Google Desktop Search [2004]
  • Richer relationships
    • MyLifeBits [Gemmell et al., 2002]
    • Placeless Documents [Dourish et al., 2000]
    • LifeStreams [Freeman and Gelernter, 1996]
  • Objects and associations
    • Haystack [Karger et al., 2005]
Summary
• 60 years have passed since the personal Memex was envisioned
  • It’s time to get serious
  • Great challenges for data management
• Deliverables of the project
  • An approach to automatically build a database of objects and associations from personal data
  • An algorithm for on-the-fly integration
  • Algorithms for data analysis for association search and browsing
  • The concept of malleable schema as a modeling tool
  • A PIM system incorporating the above
Association Network for Semex
[Figure: association network around Project: Semex. Nodes: Project: Semex; Person: Alon; Person: Luna; Person: Michelle; Person: Yuhan; Person: Jayant; CIDR. Association labels: publishedIn, ArticleAbout, projectLeader, participant, Advice-giver, advisor, co-worker]