
Presented by: Zohreh Raghebi
Fall 2015
Foteini Katsarou
Nikos Ntarmos
Peter Triantafillou
University of Glasgow, UK

Graph data management systems have become very popular

One of the main problems for these systems is subgraph query processing

Given a query graph, return all graphs that contain the query.

The naive approach performs a subgraph isomorphism test against each graph in the dataset

This does not scale, as subgraph isomorphism is NP-complete

Many indexing methods have been proposed to reduce the number of candidate graphs for the
subgraph isomorphism test
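
Most of these indexes follow a filter-then-verify pattern: extract small features (e.g., label paths) from every dataset graph offline, prune at query time any graph that is missing some feature of the query, and run the expensive subgraph isomorphism test only on the surviving candidates. A minimal sketch of this pattern, assuming path features and using networkx for the verification step (names and parameters are illustrative, not taken from any of the compared systems):

# Hedged sketch of filter-then-verify subgraph query processing.
# Features are label paths up to a fixed length, in the spirit of
# path-based indexes; all names and parameters here are illustrative.
from collections import defaultdict

import networkx as nx
from networkx.algorithms import isomorphism


def label_paths(g, max_len=3):
    # Enumerate label sequences of simple paths with up to max_len edges.
    feats = set()

    def walk(path):
        feats.add(tuple(g.nodes[v]["label"] for v in path))
        if len(path) > max_len:
            return
        for nbr in g[path[-1]]:
            if nbr not in path:
                walk(path + [nbr])

    for v in g:
        walk([v])
    return feats


def build_index(graphs, max_len=3):
    # Inverted index: feature -> ids of dataset graphs containing it.
    index = defaultdict(set)
    for gid, g in enumerate(graphs):
        for f in label_paths(g, max_len):
            index[f].add(gid)
    return index


def query(graphs, index, q, max_len=3):
    # Filtering: any answer graph must contain every feature of the query.
    candidates = set(range(len(graphs)))
    for f in label_paths(q, max_len):
        candidates &= index.get(f, set())
    # Verification: run the expensive test only on surviving candidates.
    answers = []
    for gid in candidates:
        gm = isomorphism.GraphMatcher(
            graphs[gid], q,
            node_match=lambda a, b: a["label"] == b["label"])
        if gm.subgraph_is_isomorphic():   # induced subgraph of the dataset graph matches q
            answers.append(gid)
    return answers


# Toy usage: two small labeled graphs and a one-edge query.
def labeled(edges, labels):
    g = nx.Graph(edges)
    nx.set_node_attributes(g, labels, "label")
    return g

dataset = [labeled([(0, 1), (1, 2), (2, 0)], {0: "C", 1: "C", 2: "O"}),
           labeled([(0, 1), (1, 2)], {0: "C", 1: "N", 2: "O"})]
q = labeled([(0, 1)], {0: "C", 1: "O"})
print(query(dataset, build_index(dataset), q))   # only graph 0 survives filtering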

A set of key factors (parameters) that influence the performance of the related
methods:

the number of nodes per graph

the graph density

the number of distinct labels

the number of graphs in the dataset

and the query graph size

First, to derive conclusions about the algorithms' relative performance

Second, to highlight how both performance and scalability depend on the above
factors

Six well-established indexing methods are evaluated, namely:

Grapes, CT-Index, GraphGrepSX, gIndex, Tree+∆, and gCode

Most related works are tested against the AIDS antiviral dataset and synthetic
datasets


Grapes alone used several real datasets


These datasets are formed of many small graphs
The authors did not evaluate scalability
The iGraph comparison framework compared the performance of older
algorithms (up to 2010).

Since then, several, more efficient algorithms have been proposed

A linear increase in the number of nodes results in a quadratic increase
in the number of edges;

The number of edges increases linearly with the graph density

Increasing either of the above two factors leads to a detrimental increase
in the indexing time

Increasing the number of graphs in the dataset increases the overall complexity only linearly

The frequent mining techniques are more sensitive because more
features have to be located across more graphs

The increase in the number of distinct labels leads to:

An easier dataset to index

1. It results in fewer occurrences of any given feature

2. A decrease in the false positive ratio of the various algorithms

Our findings give rise to the following adage: “Keep It Simple and
Smart”.

The simpler the feature structure and extraction process, the
faster the indexing and query processing

Frequent mining algorithms (gIndex, Tree+∆) may be competitive
for small/sparse datasets

Techniques using exhaustive enumeration (Grapes, GGSX, CT-Index) are the clear winners

Especially those indexing simple features (paths; i.e., Grapes, GGSX)

rather than more complex features (trees, cycles; i.e., CT-Index)
Industrial paper
Avery Ching
Sergey Edunov
Maja Kabiljo

Presented by: Zohreh Raghebi

Fall 2015

Analyzing real-world graphs at the scale of hundreds of billions of edges with available
software is very difficult

Graph processing engines tend to have additional challenges in
scaling to larger graphs

Apache Giraph is an iterative graph processing system designed
to scale to process trillions of edges

Used at Facebook

Giraph was inspired by Pregel, the graph processing system developed at
Google

Initially, Giraph did not scale to Facebook's needs

With over 1.39B users and hundreds of billions of social connections, the platform had to be improved to support these workloads

Giraph's graph input model was only vertex centric

Giraph's parallelization infrastructure relied on MapReduce's task-level parallelism and did not have multithreading support

Giraph's flexible types were initially implemented using native Java objects, which consumed excessive memory and garbage collection time

The aggregator framework was inefficiently implemented in ZooKeeper

There was a need to support very large aggregators

We modified Giraph to allow loading vertex data and edges from separate sources

Parallelization support:

Use worker local multithreading to take advantage of additional CPU cores

Memory optimization :

By default, the edges of every vertex are serialized into a byte array

rather than instantiated as native Java objects
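
The point of the byte-array representation is to keep a vertex's whole adjacency list in one flat buffer instead of one object per edge, so the garbage collector tracks a single object rather than millions. Giraph does this with Java byte arrays; the sketch below mirrors the idea in Python with struct packing, purely for illustration:

# Hedged sketch: keep a vertex's out-edges in one packed byte buffer instead
# of a list of per-edge objects, mirroring Giraph's byte-array optimization.
import struct

EDGE_FMT = struct.Struct("<if")  # (target vertex id: int32, edge value: float32)


class ByteArrayEdges:
    def __init__(self):
        self._buf = bytearray()  # every edge of this vertex lives in this buffer

    def add(self, target, value):
        self._buf += EDGE_FMT.pack(target, value)

    def __len__(self):
        return len(self._buf) // EDGE_FMT.size

    def __iter__(self):
        # Deserialize on demand; only the flat buffer stays resident in memory.
        for off in range(0, len(self._buf), EDGE_FMT.size):
            yield EDGE_FMT.unpack_from(self._buf, off)


edges = ByteArrayEdges()        # one buffer per vertex, not one object per edge
edges.add(42, 0.5)
edges.add(7, 1.25)
print(len(edges), list(edges))  # 2 [(42, 0.5), (7, 1.25)]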

Aggregator architecture:

Each aggregator is now randomly assigned to one of the workers

Aggregation responsibilities are balanced across all workers

Not bottlenecked by the Master
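
A rough sketch of the sharded-aggregator idea: each aggregator name is hashed to an owning worker, workers send their partial values only to that owner, and the owner combines them, so no single master handles all aggregator traffic. The names and the combine function below are illustrative, not Giraph's actual API:

# Hedged sketch of sharded aggregators: ownership by hash instead of a master.
from collections import defaultdict


def owner(agg_name, num_workers):
    # Deterministically assign each aggregator to one worker (stable within a run).
    return hash(agg_name) % num_workers


def aggregate(partials, num_workers, combine=lambda a, b: a + b):
    # partials: (worker_id, aggregator name, partial value) from one superstep.
    per_owner = defaultdict(dict)
    for _, name, value in partials:
        shard = per_owner[owner(name, num_workers)]
        shard[name] = combine(shard[name], value) if name in shard else value
    # Each owner would broadcast its combined values to all workers; merged here.
    result = {}
    for shard in per_owner.values():
        result.update(shard)
    return result


print(aggregate([(0, "sum_degree", 10), (1, "sum_degree", 7), (2, "vertex_count", 5)],
                num_workers=4))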
Xiaofei Zhang†, Hong Cheng‡, Lei Chen†
†Department of Computer Science & Engineering, HKUST
Presented by: Zohreh Raghebi
Fall 2015

The vertex set bonding (VSB) query extracts the most prominent vertices

Returns a minimum set of vertices with:

The maximum importance, in terms of total betweenness and shortest-path
reachability, in connecting two sets of input vertices

Applications: logistics planning, social community bonding

In social network studies: to understand information propagation and hidden
correlations

To find the "bonding" of communities, an ideal bonding agent would:

1. Reside on as many cross-group pairwise shortest paths as possible

2. Connect as large a portion of the two groups as possible

Such agents could best serve the message passing between two groups
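
A minimal sketch of criterion 1, scoring each vertex by how many cross-group pairwise shortest paths pass through it; this brute-force BFS-based version only illustrates the metric, not the paper's optimized evaluation:

# Hedged sketch: score vertices by the number of cross-group shortest paths
# they lie on (criterion 1 above), via brute-force search on an unweighted graph.
from collections import Counter

import networkx as nx


def bonding_scores(g, group_a, group_b):
    scores = Counter()
    for s in group_a:
        for t in group_b:
            try:
                for path in nx.all_shortest_paths(g, s, t):
                    for v in path[1:-1]:      # interior vertices only
                        scores[v] += 1
            except nx.NetworkXNoPath:
                continue
    return scores


g = nx.karate_club_graph()
print(bonding_scores(g, group_a={0, 1, 2}, group_b={32, 33}).most_common(3))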

The VSB query ranks a vertex’s prominence with two factors:

betweenness and shortest path connectivity

Minimum cut finds the minimum set of edges to remove to split
a graph into two disjoint subgraphs

It does not reveal how other vertices contribute to the connection
between the sets

Top-k betweenness computation is employed to find important
vertices in a network

However, due to the local dominance property of the
betweenness metric

such queries cannot serve vertex set bonding properly

Two novel building blocks for efficient VSB query evaluation:

Guided graph exploration with a vertex filtering scheme

to reduce redundant vertex accesses

The minimum set of vertices with the highest accumulated
betweenness is returned as the bonding vertices

Betweenness ranking on-exploration: instead of computing exact betweenness values,

rank the betweenness of vertices during graph exploration

to save computation cost.
Daniel Margo, Harvard University, Cambridge, Massachusetts (dmargo@eecs.harvard.edu)
Margo Seltzer, Harvard University, Cambridge, Massachusetts (margo@eecs.harvard.edu)
Presented by: Zohreh Raghebi
Fall 2015

Sheep is capable of handling graphs that far exceed main memory

It produces high-quality edge partitions

Graph partitioning is an important problem that affects many graph-structured systems

Partitioning quality greatly impacts the performance of distributed graph
analysis frameworks

METIS: the gold standard for graph partitioning

A multi-level graph partitioning algorithm

These approaches do not scale to today’s large graphs

Streaming partitioners

A graph loader reads serial graph data from disk onto a cluster

It must make a decision about the location of each node as it is loaded

The goal is to find an optimal balanced partitioning with as little computation
as possible

They are sensitive to the stream order, which can affect performance

Streaming algorithms are difficult to parallelize
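
For illustration, here is a minimal sketch of a generic streaming partitioner in the linear-deterministic-greedy style (place each arriving vertex with the partition holding most of its already-placed neighbors, penalized by partition fullness); it shows why decisions are irreversible and order-sensitive, and it is not Sheep's method:

# Hedged sketch of a streaming vertex partitioner (linear-deterministic-greedy
# style): each vertex is placed once and irrevocably as it streams in.
def stream_partition(vertex_stream, k, capacity):
    # vertex_stream yields (vertex, neighbors); k partitions of given capacity.
    assignment, loads = {}, [0] * k
    for v, neighbors in vertex_stream:
        def score(p):
            placed = sum(1 for u in neighbors if assignment.get(u) == p)
            return placed * (1.0 - loads[p] / capacity)  # favor neighbors, penalize fullness
        best = max(range(k), key=score)
        assignment[v] = best
        loads[best] += 1
    return assignment


# Toy usage: a 6-cycle streamed in vertex order; the result depends on that order.
adj = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
print(stream_partition(adj.items(), k=2, capacity=4))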

Sheep partitions by a method that does not vary with how the input
graph is distributed

Sheep can arbitrarily divide the input graph for parallelism and fit
tasks in memory

Sheep reduces the input graph to a small elimination tree

Sheep’s tree transformation is a distributed map-reduce
operation

Using simple degree ranking, Sheep creates competitive edge
partitions faster than other partitioners
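
For context, an elimination tree for a given vertex ordering can be built with the classic ancestor-pointer construction; the sketch below uses a degree-based ordering as in Sheep's simple ranking, but it is a single-machine illustration, not Sheep's distributed map-reduce transformation:

# Hedged sketch: elimination tree of a graph under a given vertex ordering,
# built with the classic ancestor-pointer (path-compression) construction.
def elimination_tree(adj, order):
    # adj: vertex -> iterable of neighbors; order: elimination order of vertices.
    rank = {v: i for i, v in enumerate(order)}
    parent = {v: None for v in order}
    ancestor = {v: None for v in order}   # path-compressed partial forest
    for v in order:
        for u in adj[v]:
            if rank[u] >= rank[v]:
                continue                  # only neighbors eliminated before v
            r = u
            while ancestor[r] is not None and ancestor[r] != v:
                nxt = ancestor[r]
                ancestor[r] = v           # compress the path toward v
                r = nxt
            if ancestor[r] is None:
                ancestor[r] = v
                parent[r] = v             # v becomes r's parent in the tree
    return parent


# Toy usage: order vertices by increasing degree, as in a simple degree ranking.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1, 3], 3: [1, 2]}
order = sorted(adj, key=lambda v: len(adj[v]))
print(elimination_tree(adj, order))       # parent pointers; the root maps to None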
Type:
Research Paper
Authors:
Tomohiro Manabe, Keishi Tajima
Presented by: Siddhant Kulkarni
Term:
Fall 2015

How is logical structure different from the mark-up
structure?
 Difference between Human understanding and Browser
interpretation

“Mark-up structure does not necessarily always
correspond to the logical hierarchy”

Basic idea for webpages with improper tag usage

HTML5 solves this problem, but we cannot port all web pages to
HTML5

Other techniques for document segmentation


Based on margins between blocks

Based on text density

Based on identification of important blocks

Etc.
Most rely on tags (so does this paper, but not entirely)

The authors define Blocks and Headings for their own structure extraction

Logical hierarchy extraction using:

Preprocessing

Heading-based page segmentation (see the sketch below)
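
A rough sketch of the heading-based idea: walk the page's blocks in document order and nest each block under the most recent heading with a smaller (more important) level, recovering a logical hierarchy even when the tag nesting is flat. The block representation below is invented for illustration and is not the paper's formal definition:

# Hedged sketch: nest each block under the most recent heading with a smaller
# (more important) level, independent of how the HTML tags happen to be nested.
def build_hierarchy(blocks):
    # blocks: list of (heading_level or None, text) in document order.
    root = {"heading": None, "level": 0, "children": []}
    stack = [root]                               # currently open sections
    for level, text in blocks:
        if level is None:                        # plain content block
            stack[-1]["children"].append({"heading": None, "text": text})
            continue
        while len(stack) > 1 and stack[-1]["level"] >= level:
            stack.pop()                          # close equal or deeper sections
        node = {"heading": text, "level": level, "children": []}
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


page = [(1, "Products"), (2, "Laptops"), (None, "model list"),
        (2, "Phones"), (None, "model list"), (1, "Support")]
print(build_hierarchy(page))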
Dataset: Web snapshot ClueWeb09 Category B document
collection

Calculate accuracy based on precision and recall of extracted
relationships

Types of relationships: parent, ancestor, sibling, child,
descendant
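
Accuracy here reduces to set-based precision and recall of the extracted relationship triples against a gold standard; a tiny sketch with invented example relations:

# Hedged sketch: precision/recall over extracted relationship triples
# (parent, ancestor, sibling, child, descendant) against a gold standard.
def precision_recall(extracted, gold):
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)                       # correctly extracted relations
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


extracted = {("Products", "Laptops", "parent"), ("Laptops", "Phones", "sibling")}
gold = {("Products", "Laptops", "parent"), ("Products", "Phones", "parent")}
print(precision_recall(extracted, gold))             # (0.5, 0.5)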
Type:
Industry Paper
Authors:
Daniel Haas, Jason Ansel, Lydia Gu, Adam Marcus
Presented by: Siddhant Kulkarni
Term:
Fall 2015

What is Macrotask Crowdsourcing?

What is the problem with it?

The related work focuses on Macrotasking and Crowdsourcing
frameworks

Argonaut

Predictive models to identify trustable workers who can perform
reviews

And a model to identify which tasks need review

Evaluates the trade-off between single and multiple phases of review
based on budget
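
A rough sketch of the "which tasks need review" step: given a per-task error estimate from some predictive model and a fixed review budget, send the riskiest tasks to review and accept the rest. The model scores and names below are placeholders, not Argonaut's actual components:

# Hedged sketch: spend a limited review budget on the tasks that a predictive
# model flags as most likely to contain errors.
def select_for_review(tasks, predicted_error, budget):
    # tasks: task ids; predicted_error: task id -> estimated error probability.
    ranked = sorted(tasks, key=lambda t: predicted_error[t], reverse=True)
    to_review = set(ranked[:budget])
    accepted = [t for t in tasks if t not in to_review]
    return to_review, accepted


tasks = ["t1", "t2", "t3", "t4"]
predicted_error = {"t1": 0.05, "t2": 0.40, "t3": 0.75, "t4": 0.10}
print(select_for_review(tasks, predicted_error, budget=2))  # review t3 and t2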
Presented by: Shahab Helmi
VLDB 2015 Paper Review Series
Fall 2015
Authors:

Moria Bergman, Tel-Aviv University

Tova Milo, Tel-Aviv University

Slava Novgorodov, Tel-Aviv University

WangChiew Tan, University of California, Santa Cruz
Publication:

VLDB 2015
Type:

Demonstration Paper
It is important for the database to be as complete (no missing values) and correct (no
wrong values) as possible. For this reason, many data cleaning tools have been developed
to automatically resolve inconsistencies in databases. However, these tools:
are not able to remove all erroneous data (e.g., 95% accuracy in the YAGO database), and it is
impossible to correct all errors manually in big datasets;
are not usually able to determine what information is missing from a database.
QOCO: a novel query-oriented data cleaning system with oracle crowds.
Materialized views (i.e., query-oriented views defined through user queries) are used as a
trigger for identifying incorrect or missing information.
If an error (i.e., a wrong or missing tuple) in the materialized view is detected,
the system interacts minimally with a crowd of oracles by asking only pertinent
questions.
Answers to a question help identify the next pertinent questions to ask,
and ultimately a sequence of edits is derived and applied to the underlying database.
Cleaning the entire database is not the goal of QOCO. It cleans parts of the database as
needed. Hence, it could be used as a complementary tool alongside other
cleaning tools.

Data cleaning techniques:
QOCO uses the crowd to correct query results.
QOCO propagates updates back to the underlying database.
QOCO discovers and inserts true tuples that are missing from the input database.

Crowdsourcing is a model where humans perform small tasks to help solve challenging
problems such as

Entity/conflict resolution.

Duplicate detection.

Schema matching.
Correct tuples
Wrong tuples
Missing tuples
Consider a user query which searches for European teams that won the World Cup
at least twice.
The result will contain ESP (which is wrong) and ITA will be missing.
Tuples that produce the wrong answer ESP (grouped into witnesses):
Witness 1: t1 = Game(11:07:10; ESP; NED; final; 1:0), t2 = Game(17:07:94; ESP; NED; final; 3:1), t3 = Team(ESP; EU)
Witness 2: t2 = Game(17:07:94; ESP; NED; final; 3:1), t4 = Game(12:07:98; ESP; NED; final; 4:2), t3 = Team(ESP; EU)
Witness 3: t4 = Game(12:07:98; ESP; NED; final; 4:2), t1 = Game(11:07:10; ESP; NED; final; 1:0), t3 = Team(ESP; EU)
1. QOCO finds the most frequent tuple (t3) and asks the oracle whether it is true.

2. t3 is correct, so the remaining candidates will be {t1, t2}, {t2, t4}, {t4, t1}.
The rest of the tuples have the same frequency, so QOCO chooses one of them randomly, say t1.

3. t1 is correct, so the remaining candidates are {t2}, {t2, t4}, {t4}. ESP won the World Cup only once;
hence both t2 and t4 are wrong and should be deleted!
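
The question-selection loop in this example is greedy: ask the oracle about the tuple occurring in the most remaining witnesses, drop a correct tuple from all witnesses, and delete a wrong tuple together with every witness it breaks. A minimal sketch of that loop, with a stand-in ask_oracle function in place of the real crowd (not QOCO's full algorithm):

# Hedged sketch of the greedy questioning loop: it assumes every witness of
# the wrong answer contains at least one wrong tuple that can be deleted.
from collections import Counter


def clean_wrong_answer(witnesses, ask_oracle):
    # witnesses: sets of tuples that jointly produce the wrong answer;
    # ask_oracle(t) -> True if tuple t is correct, False if it is wrong.
    witnesses = [set(w) for w in witnesses]
    deletions = []
    while witnesses:
        counts = Counter(t for w in witnesses for t in w)
        candidate = counts.most_common(1)[0][0]    # most frequent remaining tuple
        if ask_oracle(candidate):
            # Correct tuple: it cannot explain the error, drop it from all witnesses.
            witnesses = [w - {candidate} for w in witnesses]
        else:
            # Wrong tuple: delete it; every witness containing it is broken.
            deletions.append(candidate)
            witnesses = [w for w in witnesses if candidate not in w]
    return deletions


# The World Cup example: the oracle knows t2 and t4 are the wrong Game tuples.
witnesses = [{"t1", "t2", "t3"}, {"t2", "t4", "t3"}, {"t4", "t1", "t3"}]
print(clean_wrong_answer(witnesses, ask_oracle=lambda t: t in {"t1", "t3"}))  # t2 and t4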
More details can be found in the original research paper:
M. Bergman, T. Milo, S. Novgorodov, and W. Tan. Query-oriented data cleaning with
oracles. In ACM SIGMOD, 2015.
Presented by: Shahab Helmi
VLDB 2015 Paper Review Series
Fall 2015
Authors:

Weimo Liu, The George Washington University

Md Farhadur Rahman, University of Texas at Arlington

Saravanan Thirumuruganathan, University of Texas at Arlington

Nan Zhang, The George Washington University

Gautam Das, University of Texas at Arlington
Publication:

VLDB 2015
Type:

Research Paper
Location-based services (LBS)

Location-returned services (LR-LBS): these services return the location of the k returned
tuples.


Google Maps.
Location-not-returned services (LNR-LBS): these services do not return the location of
the k tuples, but return other attributes such as ID, ranking, etc.

WeChat

Sina Weibo
K-nearest-neighbor (kNN) queries: return the k nearest tuples to the query point
according to a ranking function (Euclidean distance in this paper).
LBS with a kNN interface: third-party applications and/or end users do not have complete and
direct access to this entire database. The database is essentially “hidden”, and access is typically
limited to a restricted public web query interface or API.
These interfaces impose some constraints:

Query limit: e.g., 10,000 queries per user per day in Google Maps

Maximum coverage limit: e.g., results at most 5 miles away from the query point
Aggregate Estimations: For many interesting third-party applications, it is important to collect
aggregate statistics over the tuples contained in such hidden databases, such as the sum, count, or
distribution of the tuples satisfying certain selection conditions.

A hotel recommendation application would like to know the average review scores for
Marriott vs Hilton hotels in Google Maps;

A cafe chain startup would like to know the number of Starbucks restaurants in a certain
geographical region;

A demographics researcher may wish to know the gender ratio of users of social networks in
China, etc.
Aggregate information can be obtained by:

Entering into data sharing agreements with the location-based service providers, but
this approach can often be extremely expensive, and sometimes impossible if the data
owners are unwilling to share their data.

Downloading the whole dataset through the limited interface, but this would take too long.
Goals:
Approximate estimates of such aggregates by only querying the database via its restrictive
public interface.

Minimizing the query cost (i.e., ask as few queries as possible) in an effort to adhere to
the rate limits or budgetary constraints imposed by the interface.

Making the aggregate estimations as accurate as possible.


Analytics and Inference over LBS:

Estimating COUNT and SUM aggregates.

Error reduction, such as bias correction
Aggregate Estimations over Hidden Web Repositories:

Unbiased estimators for COUNT and SUM aggregates for static databases.

Efficient techniques to obtain random samples from hidden web databases that can then be
utilized to perform aggregate estimation.

Estimating the size of search engines.


For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG), for estimating COUNT
and SUM aggregates, represents a significant improvement over prior work along
multiple dimensions:

a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations;

top-k returned tuples are leveraged rather than only the top-1; several innovative techniques
are developed for reducing error and increasing efficiency.
For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel
problem with no prior work.

The algorithm is not bias-free, but the bias can be controlled to any desired
precision.
In a Voronoi diagram, for each point, there is a corresponding region consisting of all
points closer to that point than to any other.
[Figures: Top-1 Voronoi diagram vs. Top-2 Voronoi diagram]
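
The Voronoi cell is what enables unbiased COUNT estimation over an LR-LBS: issue a top-1 query at a uniformly random point and weight the returned tuple by the total region area divided by its Voronoi cell area; since a tuple is returned with probability proportional to its cell area, the weighted average is an unbiased estimate of the count. A minimal Monte Carlo sketch (cell areas are approximated by sampling here, whereas the paper computes them precisely):

# Hedged sketch: unbiased COUNT estimation over an LR-LBS top-1 interface by
# weighting each returned tuple with the inverse of its Voronoi cell area.
# Cell areas are approximated by nearest-neighbor sampling in this sketch.
import random


def top1(points, q):
    # Stand-in for the LBS top-1 kNN interface: returns the nearest tuple to q.
    return min(points, key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def cell_area(points, t, trials=2000):
    # Fraction of the unit square whose nearest tuple is t (its Voronoi cell).
    hits = sum(top1(points, (random.random(), random.random())) == t
               for _ in range(trials))
    return max(hits, 1) / trials          # crude guard against zero-area estimates


def estimate_count(points, queries=50):
    total = 0.0
    for _ in range(queries):
        q = (random.random(), random.random())   # uniform random query point
        t = top1(points, q)                      # what the LBS would return
        total += 1.0 / cell_area(points, t)      # weight = region area / cell area
    return total / queries


hidden_db = [(random.random(), random.random()) for _ in range(200)]
print(estimate_count(hidden_db))                 # close to 200 on average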
1. Precisely Compute Voronoi Cells

Faster Initialization

Leverage history on Voronoi cell computation

2. Error Reduction

Bias error removal/reduction

Variance reduction

...
Datasets:

Offline Real-World Dataset (OpenStreetMap, USA Portion): to verify the correctness of
the algorithm.

Online LBS Demonstrations: to evaluate efficiency of the algorithm.

Google Maps

WeChat

Sina Weibo