Type:
Tutorial Paper
Authors:
Arun Kejariwal (Machine Zone Inc.), Sanjeev Kulkarni, Karthik Ramasamy (Twitter Inc.)
Presented by: Siddhant Kulkarni
Term:
Fall 2015

In-depth overview of streaming analytics:
 Applications
 Algorithms
 Platforms

Description of various types of data contributing to the field of Big Data:
 Social Media
 IoT
 Healthcare
 Machine Data (cloud)
 Connected Vehicles
Type:
Demo Paper
Authors:
Xu Chu, John Morcos, Ihab Ilyas, Paolo Papotti, Mourad Ouzzani, Nan Tang (Qatar Computing Research Institute), Yin Ye (Google)
Presented by: Siddhant Kulkarni
Term:
Fall 2015

Issues with Data Cleaning

What are the external sources?

Problems with External Sources
Presented by: Omar Alqahtani
Term:
Fall 2015
Type:
Demonstration Paper
Authors:
Yunyao Li (IBM Research Almaden), Elmer Kim (Treasure Data, Inc.), Marc A. Touchette (IBM Silicon Valley Lab), Ramiya Venkatachalam (IBM Silicon Valley Lab), Hao Wang (IBM Silicon Valley Lab)

Extractor development remains a major bottleneck in satisfying the increasing
demands of real-world applications based on IE.

Lowering the barrier to entry for extractor development becomes a critical
requirement.

Previous work has focused on reducing the manual effort involved in extractor
development.

WizIE is a promising wizard-like environment, but it still requires learning a non-trivial rule language.

Other previous work consists of special-purpose systems.
VINERY, a Visual INtegrated Development Environment for Information
extRaction, consists of:

The foundation of VINERY is VAQL, a visual programming language for information extraction.

VINERY embeds VAQL in a web-based visual IDE for constructing extractors, which are translated into AQL and executed.

VINERY includes a rich set of easily customizable pre-built extractors to help jump-start
extractor development.

VINERY provides features to support the entire life cycle of extractor development.
Presented by:
Ranjan_KY
Fall 2015

Web scraping (or wrapping) is a popular means of acquiring data from the web.

Modern techniques have made scalable wrapper generation possible and enabled data acquisition processes involving thousands of sources.

However, no scalable tools exist that support these tasks.

Modern wrapper-generation systems leverage a number of features
ranging from HTML and visual structures to knowledge bases and microdata.

Nevertheless, automatically generated wrappers often suffer from errors resulting in under- or over-segmented data, together with missing or spurious content.

Under- and over-segmentation of attributes is commonly caused by irregular HTML markup or by multiple attributes occurring within the same DOM node.

Incorrect column types are instead associated with the lack of domain
knowledge, supervision, or micro-data during wrapper generation.

The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions produce cleaner data.

WADaR takes as input a (possibly incorrect) wrapper and a target
relation schema, and iteratively repairs both the generated
relations and the wrapper by observing the output of the
wrapper execution.

A key observation is that errors in the extracted relations are
likely to be systematic as wrappers are often generated from
templated websites.
WADaR’s repair process consists of three steps:

(i) Annotating the extracted relations with standard entity
recognizers,

(ii) Computing Markov chains describing the most likely
segmentation of attribute values in the records, and

(iii) Inducing regular expressions which re-segment the input
relation according to the given target schema and that can possibly
be encoded back into the wrapper.
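
To make the three steps concrete, here is a minimal end-to-end sketch in Python. The data, the recognizers, and the majority-vote stand-in for the Markov-chain step are all hypothetical simplifications, not WADaR's actual implementation:

```python
import re
from collections import Counter

# Hypothetical under-segmented column that should be (model, year, price).
rows = ["Honda Civic 2012 $8995", "Ford Focus 2014 $10500", "Fiat 500 2013 $7200"]

# (i) Annotate tokens with simple entity recognizers (stand-ins for the
#     standard annotators WADaR uses).
def label(token):
    if re.fullmatch(r"(?:19|20)\d\d", token):
        return "YEAR"
    if re.fullmatch(r"\$\d+", token):
        return "PRICE"
    return "WORD"

# (ii) WADaR fits Markov chains over annotation sequences to find the most
#      likely segmentation; as a crude stand-in, take the most frequent
#      label sequence across the (systematically erroneous) rows.
seqs = [tuple(label(tok) for tok in row.split()) for row in rows]
best = Counter(seqs).most_common(1)[0][0]   # ('WORD', 'WORD', 'YEAR', 'PRICE')

# (iii) Induce a regular expression from the winning sequence; runs of WORD
#       tokens collapse into one free-text attribute.
piece = {"YEAR": r"((?:19|20)\d\d)", "PRICE": r"(\$\d+)"}
parts, i = [], 0
while i < len(best):
    if best[i] == "WORD":
        while i < len(best) and best[i] == "WORD":
            i += 1
        parts.append(r"([\w ]+?)")
    else:
        parts.append(piece[best[i]])
        i += 1
wrapper_regex = re.compile(r"\s+".join(parts) + r"$")

for row in rows:
    print(wrapper_regex.match(row).groups())
# ('Honda Civic', '2012', '$8995'), ('Ford Focus', '2014', '$10500'), ...
```

The induced expression can then be applied to future extractions or, as in the paper, encoded back into the wrapper itself, so that cleaner output is produced at the source.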
In this paper, related work is not evaluated in detail:
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of
partially overlapping web sources. PVLDB, 6(10):805–816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB,
6(13):1486–1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. Tegra: Table extraction by global record
alignment. In SIGMOD, pages 1713–1728. ACM, 2015.
Presented by: Zohreh Raghebi
Fall 2015

We propose graph-pattern association rules (GPARs) for social media marketing

Extending association rules for itemsets, GPARs help us discover regularities between
entities in social graphs

We study the problem of discovering top-k diversified GPARs

We also study the problem of identifying potential customers with
GPARs

A graph-pattern association rule (GPAR) R(x, y) is defined as Q(x, y) ⇒ q(x, y),

where Q(x, y) is a graph pattern in which x and y are two designated nodes,

q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q
are imposed

We refer to Q and q as the antecedent and consequent of R

We model R(x, y) as a graph pattern PR, by extending Q with a (dotted) edge q(x, y).

We treat q(x, y) as pattern Pq, and q(x, G) as the set of matches of x in G by Pq

We are interested in GPARs for a particular event q(x, y)

However, this often generates an excessive number of rules, which often pertain to the
same or similar people

This motivates us to study a diversified mining problem, to discover GPARs that are
both interesting and diverse

Problem. Based on the objective function, the diversified GPAR mining problem (DMP) is stated as follows.
◦ Input: A graph G, a predicate q(x, y), a support bound σ, and positive integers k and d.

◦ Output: A set Lk of k nontrivial GPARs pertaining to q(x, y) such that (a) F(Lk) is maximized; and (b) for each GPAR R ∈ Lk, supp(R, G) ≥ σ.

DMP is a bi-criteria optimization problem to discover GPARs for a particular event q(x,
y) with high support and a balanced confidence and diversity.
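
The slides do not reproduce the paper's objective function F or its mining algorithm. Purely to illustrate the bi-criteria flavor, here is a hedged Python sketch of greedy top-k selection, assuming each candidate rule carries a precomputed support, confidence, and identified-entity set, and a toy F that linearly trades confidence against pairwise diversity (the pattern-size bound d is ignored; every name here is hypothetical):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GPAR:
    name: str
    support: int         # supp(R, G), assumed precomputed
    confidence: float    # conf(R, G), assumed precomputed
    entities: frozenset  # the x's the rule identifies, assumed precomputed

def diversity(a, b):
    """Jaccard distance between identified-entity sets (one possible measure)."""
    union = a.entities | b.entities
    return 1.0 - len(a.entities & b.entities) / len(union) if union else 0.0

def greedy_dmp(candidates, k, sigma, lam=0.5):
    """Greedily build Lk, maximizing a stand-in for F(Lk):
    (1 - lam) * confidence + lam * diversity gain w.r.t. rules chosen so far."""
    pool = [r for r in candidates if r.support >= sigma]  # supp(R, G) >= sigma
    chosen = []
    while pool and len(chosen) < k:
        best = max(pool, key=lambda r: (1 - lam) * r.confidence
                                       + lam * sum(diversity(r, c) for c in chosen))
        chosen.append(best)
        pool.remove(best)
    return chosen

rules = [GPAR("R1", 120, 0.9, frozenset({"u1", "u2"})),
         GPAR("R2", 80, 0.8, frozenset({"u1", "u2"})),
         GPAR("R3", 60, 0.6, frozenset({"u3"}))]
print([r.name for r in greedy_dmp(rules, k=2, sigma=50)])  # ['R1', 'R3']
```

Greedy selection is the standard strategy for such max-sum diversification objectives (it yields constant-factor approximations for submodular variants); the paper's actual objective and algorithm differ in their details.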

In practice, users can freely specify the q(x, y) of interest

proper parameters (e.g., support, confidence, diversity) can be estimated from query
logs or recommended by domain experts

Consider a set Σ of GPARs pertaining to the same q(x, y), i.e., their consequents are the
same event q(x, y).

We define the set of entities identified by Σ in a (social) graph G with confidence η

Problem. We study the entity identification problem (EIP):
◦ Input: A set Σ of GPARs pertaining to the same q(x, y), a confidence bound η > 0, and
a graph G.

◦ Output: Σ(x, G, η), the set of potential customers x of y in G identified by at least one GPAR in Σ, with confidence of at least η.
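
Under the same hypothetical representation as in the sketch above, EIP reads as a filtered union:

```python
def identify_entities(rules, eta):
    """Sigma(x, G, eta): all x identified by at least one GPAR in Sigma
    whose confidence is at least eta (an illustrative reading only)."""
    result = set()
    for r in rules:
        if r.confidence >= eta:
            result |= r.entities
    return result

# With the rules from the previous sketch: R3 (conf 0.6) is filtered out.
# identify_entities(rules, eta=0.75) -> {'u1', 'u2'}
```

The hard part in practice, which this sketch hides, is computing each rule's match set q(x, G) over a large graph; that is where the paper's algorithms do their work.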

Authors: Wenfei Fan (University of Edinburgh, Beihang University), Zhe Fan (Hong Kong Baptist University), Chao Tian (University of Edinburgh, Beihang University), Xin Luna Dong (Google Inc.)
{wenfei@inf., chao.tian@}ed.ac.uk, zfan@comp.hkbu.edu.hk, lunadong@google.com

Keys for graphs aim to uniquely identify entities represented by vertices in a graph.

We propose a class of keys that are recursively defined in terms of graph patterns, and
are interpreted with subgraph isomorphism.

Extending conventional keys for relations and XML, these keys find applications in:
object identification, knowledge fusion and social network reconciliation.

As an application, we study the entity matching problem: given a graph G and a set Σ of keys, find all pairs of entities (vertices) in G that are identified by keys in Σ.

We provide two parallel scalable algorithms for entity matching: one in MapReduce and one in a vertex-centric asynchronous model.

Entity resolution is to identify records that refer to the same real-world entity.

Keys for graphs yield a deterministic method to provide an invariant connection
between vertices and the real-world entities

Although keys help reduce false positives, the quality of matches highly depends on which keys are discovered and used.

We defer the topic of key discovery to another paper and focus primarily on the efficiency of applying such constraints.

Finally, we remark that entity resolution is just one of the applications of keys for graphs; others include, e.g., digital citations and knowledge base expansion.

Entity matching is different from record matching, which identifies tuples in relations and does not enforce topological constraints in the matching process.

Consider a graph G and an entity e in G.

We say that G matches Q(x) at e if there exists a set S of triples in G and a valuation ν of Q(x) in S such that ν(x) = e and ν is a bijection between Q(x) and S.

We refer to S as a match of Q(x) in G at e under ν.

Intuitively, ν is an isomorphism from Q(x) to S when Q(x) and S are depicted as graphs.

That is, we adopt subgraph isomorphism for the semantics of graph pattern matching

Example 4: Consider Q4(x) and G2, and the set S1 of triples in G2: {(com1, name of, “AT&T”), (com4, name of, “AT&T”), (com1, parent of, com4), (com3, parent of, com4)}.

Then S1 is a match of Q4(x) in G2 at com4, which maps variable x to com4, name* to “AT&T”, the wildcard company to com1, and company to com3.
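
A brute-force Python sketch of this subgraph-isomorphism semantics on the data of Example 4; the pattern triples for Q4(x) below are reconstructed from the valuation just described, so they are an assumption, not the paper's figure:

```python
from itertools import permutations

# Triples of G2 that appear in Example 4.
G2 = {
    ("com1", "name_of", "AT&T"),
    ("com4", "name_of", "AT&T"),
    ("com1", "parent_of", "com4"),
    ("com3", "parent_of", "com4"),
}

# Reconstructed pattern Q4(x); strings starting with "?" are variables.
Q4 = [
    ("?z1", "name_of", "?name"),
    ("?x",  "name_of", "?name"),
    ("?z1", "parent_of", "?x"),
    ("?z2", "parent_of", "?x"),
]

def is_var(term):
    return term.startswith("?")

def matches(pattern, triples):
    """Yield valuations mapping the pattern's triples one-to-one onto
    graph triples (the bijection required by the definition above)."""
    for image in permutations(triples, len(pattern)):
        nu, ok = {}, True
        for p_triple, g_triple in zip(pattern, image):
            for p, g in zip(p_triple, g_triple):
                if is_var(p):
                    ok = ok and nu.setdefault(p, g) == g
                else:
                    ok = ok and p == g
            if not ok:
                break
        if ok:
            yield nu

for nu in matches(Q4, G2):
    print(nu)  # {'?z1': 'com1', '?name': 'AT&T', '?x': 'com4', '?z2': 'com3'}
```

The only valuation sets ν(x) = com4, matching the slide. A real engine replaces this factorial-time enumeration with backtracking and indexing; the sketch only pins down the semantics.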

Keys for Graphs:

Keys. A key for entities of type τ is a graph pattern Q(x),
where x is a designated entity variable of type τ

Authors: Amr El-Helw (Pivotal Inc., Palo Alto, CA, USA), Venkatesh Raghavan (Pivotal Inc.), Mohamed A. Soliman (Pivotal Inc.), George Caragea (Pivotal Inc.), Zhongxian Gu (Datometry Inc., San Francisco, CA, USA), Michalis Petropoulos (Amazon Web Services, Palo Alto, CA, USA)

Presented by: Zohreh Raghebi

Fall 2015

Big Data analytics is becoming increasingly common in many business domains, including financial corporations, government agencies, and insurance providers.

Big Data analytics often include complex queries with similar or
identical expressions

Massively Parallel Processing (MPP) databases address these challenges by distributing
storage and query processing across multiple nodes and processes

Common Table Expressions (CTEs) are commonly used in complex analytical queries
that often have many repeated computations

A CTE can be seen as a temporary table that exists just for one query.

The purpose of CTEs is to avoid re-execution of expressions referenced more than
once within a query.

CTEs may be defined explicitly, or generated implicitly by the query optimizer

CTEs follow a producer/consumer model: the data is produced by the CTE definition and consumed at all locations where that CTE is referenced.
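
For a concrete feel of the producer/consumer model, the snippet below runs a one-producer, two-consumer CTE through Python's built-in sqlite3 module (purely as a self-contained illustration; the paper's setting is an MPP optimizer, not SQLite):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE sales (region TEXT, amount REAL);
    INSERT INTO sales VALUES ('east', 10), ('east', 20), ('west', 5);
""")

# One producer ("v" is defined once), two consumers (the outer FROM clause
# and the scalar subquery both reference v).
query = """
    WITH v AS (
        SELECT region, SUM(amount) AS total
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM v
    WHERE total = (SELECT MAX(total) FROM v);
"""
print(con.execute(query).fetchall())  # [('east', 30.0)]
```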

One possible approach to executing CTEs is to expand (inline) all CTE consumers, rewriting the query internally to replace each reference to the CTE with the CTE definition.

This approach simplifies query execution logic, but may incur performance overhead due to executing the same expression multiple times.

In the alternative approach, the CTE expression is separately optimized and executed only once; the results are kept in memory, or written to disk if the data does not fit in memory.

The data is then read whenever the CTE is referenced.

This approach avoids the cost of repeated execution of the same expression,

although it may incur an overhead of disk I/O.

The impact of this approach on query optimization time is rather limited,

since the optimizer chooses one plan to be shared by all CTE consumers.

However, important optimization opportunities could be missed due to fixing one
execution plan for all consumers

MPP systems leverage parallel query execution, where different parts of the query plan execute simultaneously as separate processes, possibly running on different machines.

In some cases, a process has to wait until another process produces the data
it needs.

For complicated queries involving multiple CTEs, the optimizer needs to guarantee that
no two or more processes could be waiting on each other during query execution.

CTE constructs need to be cleanly abstracted within the query optimization framework to guarantee deadlock-free plans.

The approaches of always inlining CTEs, or never inlining CTEs, can be easily proven to
be sub-optimal

The query optimizer needs to efficiently enumerate and cost plan alternatives that
combine the benefits of these approaches

CTEs should not be optimized in isolation without taking into account the context in
which they occur.

Isolated optimization can easily miss several optimization opportunities. Figure 1 illustrates three alternatives:
1. The CTE is materialized once and shared by all consumers.
 This approach avoids repeated computation.
 However, it does not take advantage of the index on i_color.
2. The opposite approach: all occurrences of the CTE are replaced by the expansion of the CTE.
 This allows the optimizer to utilize the index on i_color.
 However, it suffers from the repeated computation.
3. Figure 1(c) depicts a possible plan in which one occurrence of the CTE is expanded,
 allowing the use of the index,
 while the other two occurrences are not inlined, to avoid recomputing the common expression.

A novel framework for the optimization of CTEs in MPP database systems.

Our framework extends and builds upon our optimizer infrastructure to allow
optimization of CTEs within the context where they are used in a query

A new technique in which a CTE does not get re-optimized for every reference in the
query, but only when there are optimization opportunities, e.g. pushing down filters or
sort operations.

This ensures that the optimization time does not grow exponentially with the number
of CTE consumers

A cost-based approach for deciding whether or not to expand CTEs in a given query.

The cost model takes into account disk I/O as well as the cost of repeated CTE execution (see the toy comparison after this list).

A query execution model that guarantees that the CTE producer is always executed
before the CTE consumer(s).

In MPP settings, this is crucial for deadlock-free execution
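
A back-of-the-envelope illustration of the inline-vs-materialize decision; the constants and formulas here are invented for the example, and the paper's cost model is far richer:

```python
def choose_cte_strategy(expr_cost, rows, n_consumers,
                        write_per_row=0.01, read_per_row=0.005):
    """Toy cost comparison: inlining re-executes the CTE expression once per
    consumer; materializing pays it once plus spill and re-read I/O."""
    inline = n_consumers * expr_cost
    materialize = (expr_cost
                   + rows * write_per_row                 # spill once
                   + n_consumers * rows * read_per_row)   # each consumer reads
    return ("inline", inline) if inline <= materialize else ("materialize", materialize)

print(choose_cte_strategy(expr_cost=100.0, rows=1_000, n_consumers=3))
# ('materialize', 125.0)
```

The paper's optimizer makes this kind of choice per consumer and cost-based, which is how mixed plans like Figure 1(c) fall out naturally.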

Ben Kimmett, Venkatesh Srinivasan, Alex Thomo
University of Victoria, Canada
{blk,srinivas,thomo}@uvic.ca
Presented by: Zohreh Raghebi
Fall 2015

We report experimental results for the MapReduce algorithms

proposed by Afrati, Das Sarma, Menestrina, Parameswaran, and Ullman in ICDE’12 (“Fuzzy Joins Using MapReduce”)

to compute fuzzy joins of binary strings using Hamming distance.

Their algorithms come with a complete theoretical analysis;

however, no experimental evaluation is provided.

Several algorithms have been proposed for performing “fuzzy join”

(an operation that finds pairs of similar items) in MapReduce;

this work concentrates on binary strings and Hamming distance.

The algorithms proposed are:

Naive, which compares every string in the set with every other

 Ball-Hashing, which sends each string to a ‘ball’ of all ‘nearby’ strings within the distance threshold

 Anchor Points, a randomized algorithm that selects a set of anchor strings and compares any pair of strings that are close enough to the same anchor

 Splitting, an algorithm that splits the strings into pieces and compares only strings with matching pieces (sketched below)
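
A minimal Python sketch of the splitting idea (a sequential simulation of the map/shuffle/reduce phases, not the authors' Hadoop code): by pigeonhole, two equal-length strings within Hamming distance d must agree exactly on at least one of d+1 segments, so only strings sharing a (segment index, segment value) key need to be compared.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def splitting_join(strings, d):
    """All pairs within Hamming distance d, comparing only strings that
    share at least one of d + 1 segments."""
    n = len(strings[0])
    bounds = [round(i * n / (d + 1)) for i in range(d + 2)]
    buckets = defaultdict(list)                  # map + shuffle phase
    for s in strings:
        for i in range(d + 1):
            buckets[(i, s[bounds[i]:bounds[i + 1]])].append(s)
    pairs = set()
    for group in buckets.values():               # one "reducer" per key
        for a, b in combinations(group, 2):
            if a != b and hamming(a, b) <= d:
                pairs.add(tuple(sorted((a, b))))
    return pairs

print(splitting_join(["0000", "0001", "0011", "1111"], d=1))
# {('0000', '0001'), ('0001', '0011')}
```

In this toy form, the communication cost shows up as the total bucket size (each string is replicated d + 1 times) and the processing cost as the pairwise comparisons inside each bucket, which is exactly the tradeoff discussed next.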

It is argued in the ICDE’12 paper that there is a tradeoff between communication cost and processing cost,

and that there is a skyline of the proposed algorithms; i.e., none dominates another.

One of our objectives is to see whether we can observe this
skyline in practical terms.

We observe via experiments that some algorithms are almost always
preferable to others.

Splitting is a clear winner

Ball-Hashing suffers for all distance thresholds except the very small ones
Presented by: Shahab Helmi
Fall 2015
Authors:

Yael Amsterdamer , Tel Aviv University

Anna Kukliansky, Tel Aviv University

Tova Milo, Tel Aviv University
Publication:

VLDB 2015
Type:

Research Paper
Many real-life scenarios (queries) require the joint analysis of general knowledge, which
includes facts about the world, with individual knowledge, which relates to the opinions
or habits of individuals.

“What are the most interesting places near Forest Hotel, Buffalo, we should visit in the fall?”

Locations and opening hours: general knowledge.

Interesting locations: individual knowledge, depending on people’s opinions or habits.
Existing platforms require users to specify their information needs in a formal, declarative language, which may be too complicated for naive users.

Hence, a question in natural language should be translated into a well-formed query.

The NL-to-query translation problem has been previously studied for queries over
general data (knowledge), including SQL/ XQuery/SPARQL queries.

Crowdsourcing: asking users to refine the translated query.

NL tools for parsing and detecting the semantics of NL sentences.
The mix of general and individual knowledge needs leads to unique challenges:

Distinguishing the individual and general part of the question (query).

The crowd information regarding the individual part of the NL question may not be in the knowledge base.


Most of the current techniques, which are based on aligning questions to the knowledge base, do not apply.

Integrating the queries generated for the individual and general parts of the question into a single well-formed query.

The modular design of a translation framework, to solve the challenges mentioned in
the previous slide.

The development of new modules.
Knowledge representation must be expressive enough

to account both for general knowledge, to be queried from an ontology,

and for individual knowledge, to be collected from the crowd.

RDF: publicly available knowledge bases such as DBPedia and Linked-GeoData.

{Buffalo, NY inside USA}.

{Buffalo, NY has Label "interesting"}.

{I visit Buffalo, NY}.
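
The general-knowledge triples can be stored and queried with standard RDF tooling; here is a small illustration using the rdflib Python library and a made-up ex: namespace (the actual system targets DBpedia/Linked-GeoData vocabularies):

```python
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Buffalo_NY, EX.inside, EX.USA))
g.add((EX.Buffalo_NY, EX.hasLabel, Literal("interesting")))

# SPARQL over the toy graph; individual facts like {I visit Buffalo, NY}
# are *not* here -- they must be collected from the crowd at query time.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?place WHERE {
        ?place ex:inside ex:USA ;
               ex:hasLabel "interesting" .
    }
""")
for row in results:
    print(row.place)   # http://example.org/Buffalo_NY
```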
The query language to which NL questions are translated should naturally match the knowledge representation.

The OASSIS-QL query language, which extends SPARQL, the RDF query language, with crowd-mining capabilities.

Distinguishes the individual and general part of the question (query) according to the
grammatical roles.
Dependency Parser: This tool parses a given text into a standard structure called a
dependency graph. This structure is a directed graph (typically, a tree) with labels on the
edges. It exposes different types of semantic dependencies between the terms of a
sentence (grammatical role of the words).
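
For instance, with spaCy (one off-the-shelf dependency parser, not necessarily the one used in the paper), the individual part of the running example surfaces through grammatical roles such as the first-person subject of “visit”:

```python
# pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("What are the most interesting places near Forest Hotel we should visit?")
for token in doc:
    # Each labeled edge of the dependency graph: head --dep--> token
    print(f"{token.head.text:>10} --{token.dep_:<9}--> {token.text}")
```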
It is left to perform the translation from the NL representation to the query language
representation.

Limit

Threshold
In this experiment, we arbitrarily chose the first 500 questions from the Yahoo! Answers repository.
Presented by: Shahab Helmi
Fall 2015
Authors:

 Weimo Liu, The George Washington University

 Md Farhadur Rahman, University of Texas at Arlington

 Saravanan Thirumuruganathan, University of Texas at Arlington

 Nan Zhang, The George Washington University

 Gautam Das, University of Texas at Arlington
Publication:

VLDB 2015
Type:

Research Paper

Location-returned services (LR-LBS): these services return the locations of the k returned tuples.

 Google Maps
Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, but return other attributes such as ID and ranking.

WeChat

Sina Weibo
A k-nearest-neighbors (kNN) query: return the k nearest tuples to the query point according to a ranking function (Euclidean distance in this paper).
LBS with a kNN interface: hidden databases with limited access, usually through a public web query interface or API.
These interfaces impose some constraints:

Query limit: e.g., 10,000 queries per user per day in Google Maps

Maximum coverage limit: e.g., only results within 5 miles of the query point
Aggregate Estimations: For many applications, it is important to collect aggregate statistics over such hidden databases, such as the sum, count, or distribution of the tuples satisfying certain selection conditions.

A hotel recommendation application would like to know the average review scores for
Marriott vs Hilton hotels in Google Maps;

A cafe chain startup would like to know the number of Starbucks restaurants in a certain
geographical region;

A demographics researcher may wish to know the gender ratio of users of social networks in China.
Aggregate information can be obtained by:

Entering into data sharing agreements with the location-based service providers, but
this approach can often be extremely expensive, and sometimes impossible if the data
owners are unwilling to share their data.

Crawling the entire database through its limited interface, which would take far too long.
The goal is to obtain approximate estimates of such aggregates by querying the database only via its restrictive public interface, while:

Minimizing the query cost (i.e., ask as few queries as possible)

Making the aggregate estimations as accurate as possible.


Analytics and Inference over LBS:

Estimating COUNT and SUM aggregates.

Error reduction, such as bias correction
Aggregate Estimations over Hidden Web Repositories:

Unbiased estimators for COUNT and SUM aggregates for static databases.

Efficient techniques to obtain random samples from hidden web databases that can then be
utilized to perform aggregate estimation.

Estimating the size of search engines.


For LR-LBS interfaces: the developed algorithm (LR-LBS-AGG), for estimating COUNT
and SUM aggregates, represents a significant improvement over prior work along
multiple dimensions:

 a novel way of precisely computing Voronoi cells leads to completely unbiased estimations;

 top-k returned tuples are leveraged rather than only the top-1;

 several innovative techniques are developed for reducing error and increasing efficiency.
For LNR-LBS interfaces: the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work.

The algorithm is not bias-free, but the bias can be controlled to any desired precision.
In a Voronoi diagram, for each point, there is a corresponding region consisting of all
points closer to that point than to any other.
[Figure: top-1 and top-2 Voronoi diagrams]

Precisely computing Voronoi cells gives each tuple's selection probability: a uniformly random query point returns t as its top-1 answer with probability p(t) = V(t) / V, where V(t) is the area of t's Voronoi cell and V the total area. Hence 1 / p(t) is an unbiased (Horvitz-Thompson) estimate of Count(*), since summing p(t) · (1 / p(t)) over all tuples yields exactly the count.
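
The paper computes V(t) exactly from the Voronoi cell; the sketch below conveys only the Horvitz-Thompson idea, replacing the exact cell area with a crude Monte Carlo approximation over a synthetic database (which re-introduces a little bias that the real algorithm avoids):

```python
import random

random.seed(7)
tuples = [(random.random(), random.random()) for _ in range(100)]  # hidden DB

def top1(q):
    """Simulated LR-LBS top-1 interface: nearest tuple to query point q."""
    return min(tuples, key=lambda t: (t[0] - q[0]) ** 2 + (t[1] - q[1]) ** 2)

def estimate_count(n_samples=30, n_cell=1500):
    total = 0.0
    for _ in range(n_samples):
        t = top1((random.random(), random.random()))
        # p(t) = V(t)/V, here approximated by the fraction of random points
        # whose top-1 answer is t (the paper computes this area exactly).
        hits = sum(top1((random.random(), random.random())) == t
                   for _ in range(n_cell))
        total += 1.0 / max(hits / n_cell, 1.0 / n_cell)  # Horvitz-Thompson term
    return total / n_samples

print(estimate_count())  # roughly len(tuples) = 100
```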
Extensions:
 Computing Voronoi cells faster
 Error reduction
Datasets:

Offline Real-World Dataset (OpenStreetMap, USA Portion): to verify the correctness of
the algorithm.

Online LBS Demonstrations: to evaluate efficiency of the algorithm.

Google Maps

WeChat

Sina Weibo