Presented By – Yogesh A. Vaidya

Introduction
 What are Structured Web Community Portals?
 Advantages of SWCP
 powerful capabilities for searching, querying and monitoring
community information.
 Disadvantages of current approaches
 No standardization of process.
 Domain specific development.
 Current approaches in research community
 Top Down, Compositional and Incremental Approach
 CIMPLE Workbench
 Toolset for development
 DBLife case study
What are Community Portals?
 These are portals that collect and integrate relevant
data obtained from various sources on the web.
 These portals enable members to discover, search,
query, and track interesting community activities.
What are Structured Web
Community Portals?
 They extract and integrate information from raw Web
Pages to present a unified view of entities and
relationships in the community.
 These portals can provide users with powerful
capabilities for searching, querying, aggregating,
browsing, and monitoring community information.
 They can be valuable for communities in a wide variety
of domains ranging from scientific data management
to government agencies and are an important aspect of
Community Information Management.
Example
Current Approaches towards SWCP
used in Research Community
 These approaches maximize coverage by looking at the
entire web to discover all possible relevant web sources.
 They then apply sophisticated extraction and
integration techniques to all data sources.
 They expand the portal by periodically running the
source discovery step.
 These solutions achieve good coverage and incur
relatively little human effort.
 Examples: Citeseer, Cora, Rexa, Deadliner
Disadvantages of Current
Approaches
 These techniques are difficult to develop, understand, and debug.
 That is, they require builders to be well versed in
complex machine learning techniques.
 Due to the monolithic nature of the employed techniques,
it is difficult to optimize the run-time and accuracy of
these solutions.
Top-Down Compositional and
Incremental Approach
 First, select a small set of important community sources.
 Next, create plans that extract and integrate data from these sources to generate entities and relationships.
 These plans can act on different sources and use different extraction and integration operators, and are hence not monolithic.
 Executing these operators yields an initial structured portal.
 Then expand the portal by monitoring certain sources for mentions of new sources.
Step 1: Selecting Initial Data
Sources
 The basis for selecting a small set of initial data sources is the 80-20
phenomenon that applies to Web community data sources.
 That is, 20% of sources often cover 80% of interesting
community activities.
 Thus the sources used should be highly relevant to the
community.
 Example: For the database research community, the portal
builder can select the homepages of top conferences (SIGMOD,
PODS, VLDB, ICDE) and of the most active researchers (such
as PC members of top conferences, or those with many
citations).
Selection of Sources contd…
 To assist the portal builder B in selecting sources,
a tool called RankSource was developed that provides B
with the most relevant data sources.
 Here, B first collects as many community sources as
possible (using methods such as focused crawling, querying
search engines, etc.).
 B applies RankSource to these sources to obtain them in
decreasing order of relevance to the community.
 B examines the ranked list, starting from the top, and
selects the truly relevant data sources.
RankSource principles:
 Three relevance ranking strategies are used by the RankSource
tool:
 PageRank only
 PageRank + Virtual Links
 PageRank + Virtual Links + TF-IDF
PageRank only
 This version exploits the intuition that
 Community sources often link to highly relevant sources
 Sources linked to/by relevant sources are also highly
relevant.
 Formula used for computing the rank:
P(u) = (1 - d) + d * Σ_{i=1}^{n} P(v_i) / c(v_i),
where v_1, …, v_n are the sources linking to u, c(v_i) is the number of outgoing links of v_i, and d is the damping factor.
 The PageRank-only version achieves limited accuracy
because some highly relevant sources are often not
linked to by other sources.
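
A minimal sketch of the ranking formula above, assuming the hyperlink graph between candidate sources is given as an adjacency list; the example URLs, the damping factor d = 0.85, and the iteration count are illustrative choices, not taken from the paper:

```python
# Minimal PageRank over community sources, following
# P(u) = (1 - d) + d * sum_i P(v_i) / c(v_i),
# where v_1..v_n link to u and c(v) is v's out-degree.

def pagerank(links, d=0.85, iterations=50):
    """links: dict mapping each source URL to the list of sources it links to."""
    nodes = set(links) | {v for outs in links.values() for v in outs}
    rank = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        new_rank = {u: 1.0 - d for u in nodes}
        for v, outs in links.items():
            if not outs:
                continue
            share = d * rank[v] / len(outs)   # each out-link receives an equal share
            for u in outs:
                new_rank[u] += share
        rank = new_rank
    return rank

# Rank candidate sources and present them in decreasing order of relevance.
links = {
    "sigmod.org": ["dblp.org", "vldb.org"],
    "vldb.org": ["dblp.org"],
    "dblp.org": [],
}
for url, score in sorted(pagerank(links).items(), key=lambda x: -x[1]):
    print(url, round(score, 3))
```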
PageRank + Virtual Links
 This is based on the assumption that if a highly
relevant source discusses an entity, then all other
sources which discuss that entity may also be highly
relevant.
 Here, first create virtual links between sources that
mention overlapping entities, and then do PageRank.
 Results show that accuracy actually decreases after adding
virtual links to PageRank.
 The reason for reduction in accuracy is that virtual
links are created even with those sources for which a
particular entity may not be their main focus.
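
A sketch of the virtual-link idea, assuming the entities mentioned by each source are already available; it reuses the pagerank() routine from the previous sketch, and the link graph and entity sets below are illustrative:

```python
# Connect every pair of sources that mention a common entity with a virtual
# link, then rank with the pagerank() routine from the previous sketch.

from itertools import combinations

def add_virtual_links(links, entities_per_source):
    augmented = {u: set(v) for u, v in links.items()}
    for u, v in combinations(entities_per_source, 2):
        if entities_per_source[u] & entities_per_source[v]:   # overlapping entities
            augmented.setdefault(u, set()).add(v)
            augmented.setdefault(v, set()).add(u)
    return {u: sorted(v) for u, v in augmented.items()}

links = {"sigmod.org": ["dblp.org"], "dblp.org": [], "jimgray.home": []}
entities_per_source = {
    "sigmod.org": {"Jim Gray"},
    "dblp.org": {"Jim Gray", "SIGMOD 2007"},
    "jimgray.home": {"Jim Gray"},   # an unlinked page becomes reachable via virtual links
}
ranked = pagerank(add_virtual_links(links, entities_per_source))
```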
PageRank + Virtual Links + TF-IDF
 Building on the previous case, we want to ensure that a
virtual link is added only if both sources are relevant to the
concerned entity.
 Hence, in this case the TF-IDF metric is used to measure the
relevance of an entity to a source.
 TF is the term frequency, given by the number of times the entity
occurs in a source divided by the total number of entities
(including duplicates) in that source.
 IDF is the inverse document frequency, given by the logarithm
of the total number of sources divided by the number of
sources in which the entity occurs.
Contd…
 TF-IDF score is given by TF*IDF.
 Next, for each source, filter out all entities whose
TF-IDF score is below a certain threshold θ.
 After this filtering, apply PageRank + Virtual Links on
the results obtained.
 This approach gives better accuracy than both of the
previously discussed techniques.
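
A sketch of the TF-IDF filtering step described above; the threshold value and the input format (per-source entity occurrence counts) are assumptions for illustration:

```python
# TF-IDF filtering before creating virtual links: keep an entity in a source
# only if TF * IDF is at least theta (the value 0.05 is illustrative).

import math

def tfidf_filter(entity_counts, theta=0.05):
    """entity_counts: source -> {entity: number of occurrences in that source}."""
    n_sources = len(entity_counts)
    doc_freq = {}                                  # sources containing each entity
    for counts in entity_counts.values():
        for e in counts:
            doc_freq[e] = doc_freq.get(e, 0) + 1
    filtered = {}
    for source, counts in entity_counts.items():
        total = sum(counts.values())               # all entity occurrences, with duplicates
        kept = set()
        for e, c in counts.items():
            tf = c / total
            idf = math.log(n_sources / doc_freq[e])
            if tf * idf >= theta:
                kept.add(e)
        filtered[source] = kept
    return filtered
```

The filtered entity sets can then replace the raw mention sets used when creating virtual links in the previous step.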
Step 2: Constructing the E-R graph
 After selecting the sources, B first defines an E-R schema G that captures the entities and relationships of interest to the community.
 This schema consists of many types of entities (e.g., paper, person) and relationships (e.g., write-paper, co-author).
 B then crawls daily (the actual frequency depends on the domain/developer) to first create a snapshot W of all the concerned web pages.
 B then applies plan Pday, which is composed of other plans, to extract entities and relations from this snapshot and create a daily E-R graph.
 B then merges these daily graphs (using plan Pglobal) to create a global E-R graph, on which it offers various services such as querying and aggregating.
Create Daily Plan Pday
 Daily Plan (Pday ) takes daily snapshot W (of web pages
to work on) and E-R schema G as input and produces
as output a daily E-R graph D.
 Here, for each entity type e in G, B creates a plan Pe
that discovers all entities of type e from W.
 Next, for each relation type r, B creates a plan Pr that
discovers all instances of relation r connecting the
entities discovered in the first step.
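
A hypothetical sketch of how Pday could be composed from the per-entity plans Pe and per-relation plans Pr; representing plans as plain callables is an assumption for illustration:

```python
# Compose the daily plan Pday: run one plan per entity type in the schema G,
# then one plan per relation type, over the daily snapshot W.

def run_daily_plan(W, entity_plans, relation_plans):
    """W: daily snapshot of pages; *_plans: type name -> callable (illustrative)."""
    entities = {}
    for etype, plan in entity_plans.items():        # Pe for each entity type e
        entities[etype] = plan(W)
    relations = {}
    for rtype, plan in relation_plans.items():      # Pr for each relation type r
        relations[rtype] = plan(W, entities)
    return {"entities": entities, "relations": relations}   # daily E-R graph D
```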
Workflow of Pday in database
domain:
Plans to discover entities
 Two plans, Default Plan and Source Aware Plan, are used
for discovering entities.
 Default Plan (Pdefault) uses three operators
 ExtractM to find all mentions of type e in web pages in W,
 MatchM to find matching mentions, that is, those referring
to the same real world entity, thus forming groups g1,g2,…,gk
 and CreateE to create an entity for each group of mentions
 The problem of extracting and matching mentions, as
encapsulated by ExtractM and MatchM, is known to be
difficult, and complex implementations are available.
Default Plan contd…
 But in our case, since we are looking at community-specific,
highly relevant sources, a simple dictionary-based solution
that matches a collection of entity names N against the pages
in W to find mentions gives accurate results in most cases.
 The assumption behind this is that, within a community, entity
names are often designed to be as distinct as possible.
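
A minimal sketch of such a dictionary-based default plan (ExtractM, MatchM, CreateE), assuming a small name dictionary N and simple lowercase normalization for matching; the pages and names below are illustrative:

```python
# Dictionary-based default plan: find mentions of known entity names in the
# snapshot, group mentions that normalize to the same name, and create one
# entity per group.

import re

def extract_mentions(pages, names):
    """ExtractM: return (page_id, matched name) for each occurrence."""
    mentions = []
    for page_id, text in pages.items():
        for name in names:
            for _ in re.finditer(re.escape(name), text, flags=re.IGNORECASE):
                mentions.append((page_id, name))
    return mentions

def match_and_create(mentions):
    """MatchM + CreateE: group mentions by normalized name, one entity per group."""
    groups = {}
    for page_id, name in mentions:
        groups.setdefault(name.lower(), []).append((page_id, name))
    return [{"entity": key, "mentions": group} for key, group in groups.items()]

pages = {"p1": "Talk by Jim Gray at SIGMOD.", "p2": "Jim Gray received the award."}
entities = match_and_create(extract_mentions(pages, ["Jim Gray"]))
```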
Source Aware Plan
 Simple dictionary-based extraction and matching may not
be appropriate for all cases especially where ambiguous
sources are possible.
 In DBLife, DBLP is an ambiguous source since it contains
information pertaining to other communities too.
 To match entities in DBLP sources, a source-aware plan is
used, which is a stricter version of the default plan described earlier.
Source Aware Plan Contd…
 First, apply the simple default plan Pname (only the Extract and
Match operators) to all unambiguous sources to obtain a result set
R, which consists of groups of related mentions.
 Add to this result set R the mentions extracted from DBLP to
form a result set U.
 For matching mentions in U, not just names are used, but
also their context in the form of related persons.
 Two mentions are matched only if they have similar names
and share at least one related person.
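
A sketch of this stricter, source-aware matching, assuming each group and each ambiguous mention carries a name plus a set of related persons; the name-similarity test here is only a placeholder:

```python
# Source-aware matching sketch: a mention from an ambiguous source (e.g. DBLP)
# is merged into a group only if the names match AND the two sides share at
# least one related person.

def same_name(a, b):
    return a.lower() == b.lower()          # placeholder for a fuzzier name matcher

def merge_ambiguous(groups, ambiguous_mentions):
    """groups / ambiguous_mentions: lists of {'name': str, 'related': set}."""
    for m in ambiguous_mentions:
        for g in groups:
            if same_name(m["name"], g["name"]) and (m["related"] & g["related"]):
                g["related"] |= m["related"]       # merge the mention into the group
                break
        else:
            groups.append(m)                       # otherwise keep it as a new entity
    return groups
```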
Plans for extracting and matching
mentions:
Plans to find Relations:
 Plans for finding relations are generally domain-specific and vary from relation to relation.
 Hence, we try to identify types of relations that commonly occur in community portals and then create plan templates for each one.
 The first of these relation types is co-occurrence relations.
 For co-occurrence relations, we compute the CoStrength between the concerned entities and register a match if the CoStrength score is greater than a certain threshold.
 CoStrength is calculated using the ComputeCoStrength operator, whose input is an entity pair (e, f) and whose output is a number that quantifies how often and how closely mentions of e and f co-occur.
 Example co-occurrence relations: write-paper(person, paper), co-author(person, person)
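
A sketch of a ComputeCoStrength-style score; the slides only require that it reflect how often and how closely mentions co-occur, so the 1/(1 + distance) weighting and the threshold value are assumptions:

```python
# CoStrength sketch: sum a proximity weight over every pair of mentions of the
# two entities that appear on the same page.

def co_strength(mentions_e, mentions_f):
    """Each argument: list of (page_id, position) mentions of one entity."""
    score = 0.0
    for page_e, pos_e in mentions_e:
        for page_f, pos_f in mentions_f:
            if page_e == page_f:
                score += 1.0 / (1 + abs(pos_e - pos_f))   # closer mentions weigh more
    return score

THRESHOLD = 0.1  # illustrative
if co_strength([("p1", 10)], [("p1", 14)]) > THRESHOLD:
    print("register relation, e.g. co-author(person, person)")
```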
Label Relations
 Second type of relations we focus on are Label
Relations.
 Example: A label relation is served(person,committee)
 We use the ExtractLabel operator to find an instance
of a label immediately adjacent to a mention of an entity
(e.g., a person).
 This operator can thus be used by B to make plans that
find label relations.
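
A sketch of an ExtractLabel-style check that looks for a known label immediately next to a person mention; the pattern, label list, and example text are illustrative:

```python
# Label-relation sketch: register served(person, committee) when a committee
# label appears immediately next to a person mention.

import re

def extract_label_relations(text, persons, labels):
    relations = []
    for person in persons:
        for label in labels:
            # label immediately following the person mention (possibly after a comma)
            if re.search(re.escape(person) + r"\s*,?\s*" + re.escape(label), text):
                relations.append(("served", person, label))
    return relations

text = "Jim Gray, PC member, SIGMOD 2007"
print(extract_label_relations(text, ["Jim Gray"], ["PC member"]))
```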
Plans for finding relations:
Neighborhood Relations:
 Third type of relations we focus on are Neighborhood
relations.
 Example: A neighborhood relation is
talk(person, organization)
 These relations are similar to label relations, with the
difference that here we consider a window of words
instead of immediate adjacency.
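
A sketch that widens the label check to a window of words, as described above; the window size of 10 words and the whitespace tokenization are assumptions:

```python
# Neighborhood-relation sketch: register talk(person, organization) when an
# organization name appears within a window of words around a person mention.

def in_neighborhood(words, person_tokens, org_tokens, window=10):
    def positions(target):
        n = len(target)
        return [i for i in range(len(words) - n + 1) if words[i:i + n] == target]
    for p in positions(person_tokens):
        for o in positions(org_tokens):
            if abs(p - o) <= window:
                return True
    return False

words = "Jim Gray gave an invited talk at Microsoft Research yesterday".split()
print(in_neighborhood(words, ["Jim", "Gray"], ["Microsoft", "Research"]))
```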
Decomposition of Daily Plan
 For a real-world community portal, the E-R schema is very
large, and hence it would be overwhelming to cover all
entities and relations in a single plan.
 Decomposition helps split tasks across people and also
allows the schema to evolve smoothly.
 Decomposition is followed by the merge phase in which
E-R fragments from individual plans are merged into a
complete E-R graph.
 We use the MatchE and EnrichE operators to find matching
entities across different graph fragments and merge them
into a single daily E-R graph.
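
A sketch of merging E-R fragments with MatchE/EnrichE-style steps, assuming entities from different fragments are matched simply by normalized name; the fragment format is illustrative:

```python
# Fragment-merging sketch: MatchE finds entities from different fragments that
# refer to the same real-world entity (here: same normalized name), and EnrichE
# folds their attributes and relations into a single node of the merged graph.

def merge_fragments(fragments):
    """fragments: list of {entity_key: {'attrs': dict, 'relations': list}}."""
    merged = {}
    for fragment in fragments:
        for key, node in fragment.items():
            canonical = key.lower()                          # MatchE (assumed: by name)
            target = merged.setdefault(canonical, {"attrs": {}, "relations": []})
            target["attrs"].update(node["attrs"])            # EnrichE: merge attributes
            target["relations"].extend(node["relations"])    # and relation edges
    return merged
```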
Decomposition and Merging of
Plans:
Creating Global plan
 Generating the global plan is similar to constructing a
daily plan from individual E-R fragments.
 Here too, the MatchE and EnrichE operators are used.
Step 3: Maintaining and Expanding
 The Top-down, compositional and incremental
approach makes maintaining the portal relatively easy.
 The main assumption is that relevance of data sources
changes very slowly over time and new data sources
would be mentioned within the community in certain
sources.
 In expanding, we add new relevant data sources.
 The approach used for finding new sources is to look for
them only at certain locations (e.g., looking for
conference announcements on DBWorld).
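
A minimal sketch of this expansion step, assuming the monitored announcement page has already been fetched as plain text; the URL pattern is a rough illustration:

```python
# Expansion sketch: scan a monitored page (e.g. a DBWorld-style announcement
# list) for URLs that are not yet in the portal's source list.

import re

def find_new_sources(announcement_text, known_sources):
    urls = re.findall(r"https?://\S+", announcement_text)
    return [u for u in urls if u not in known_sources]

announcements = "CFP: ICDE 2008, details at http://www.icde2008.org"
print(find_new_sources(announcements, {"http://www.sigmod.org"}))
```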
CIMPLE Workbench
 This workbench consists of an initial portal “shell”
(with many built-in administrative controls), a
methodology for populating the shell, a set of operator
implementations, and a set of plan optimizers.
 The workbench facilitates easy and quick development
since a portal builder can make use of built-in
facilities.
 The workbench provides implementations of operators
and also allows builders to plug in their own
implementations.
Case Study: DBLife
 DBLife is a structured community portal for the
database community.
 It was developed as a proof of concept for developing
community portals using the top-down approach and
the CIMPLE workbench.
 The portal demonstrates that an SWCP can achieve
high accuracy with the current set of relatively simple
operators.
DBLife lifecycle:
Conclusion & Future Work
 Experience with developing DBLife suggests that this
approach can effectively exploit common characteristics of
Web communities to build portals quickly and accurately
using simple extraction/integration operators, and to
evolve them over time efficiently with little human effort.
 Still a lot of research problems need to be studied:
 How to develop a better compositional framework?
 How to make such a framework as declarative as possible?
 What technologies (XML, relational, etc) need to be used to
store portal data?