Type: Tutorial Paper
Authors: Arun Kejriwal (Machine Zone Inc.), Sanjeev Kulkarni, Karthik Ramasamy (Twitter Inc.)
Presented by: Siddhant Kulkarni
Term: Fall 2015

In-depth overview of streaming analytics:
- Applications
- Algorithms
- Platforms
Description of the various types of data contributing to the field of Big Data:
- Social media
- IoT
- Healthcare
- Machine data (cloud)
- Connected vehicles

Type: Demo Paper
Authors: Xu Chu, John Morcos, Ihab Ilyas, Paolo Papotti, Mourad Ouzzani, Nan Tang (Qatar Computing Research Institute), Yin Ye (Google)
Presented by: Siddhant Kulkarni
Term: Fall 2015

- The issue with data cleaning
- What are the external sources?
- Problems with external sources

Type: Demonstration Paper
Authors: Yunyao Li (IBM Research Almaden), Elmer Kim (Treasure Data, Inc.), Marc A. Touchette (IBM Silicon Valley Lab), Ramiya Venkatachalam (IBM Silicon Valley Lab), Hao Wang (IBM Silicon Valley Lab)
Presented by: Omar Alqahtani
Term: Fall 2015

Extractor development remains a major bottleneck in satisfying the increasing demands of real-world applications based on information extraction (IE), so lowering the barrier to entry for extractor development has become a critical requirement. Previous work has focused on reducing the manual effort involved: WizIE is a promising wizard-like environment, but it requires knowledge of a non-trivial rule language; other systems are special-purpose.

VINERY, a Visual INtegrated Development Environment for Information extRaction, consists of:
- VAQL, a visual programming language for information extraction, which is the foundation of VINERY.
- A web-based visual IDE that embeds VAQL for constructing extractors, which are translated into AQL and executed.
- A rich set of easily customizable pre-built extractors to help jump-start extractor development.
- Features supporting the entire life cycle of extractor development.

Presented by: Ranjan_KY
Term: Fall 2015

Web scraping (or wrapping) is a popular means of acquiring data from the web. The current generation of tools has made scalable wrapper generation possible and has enabled data acquisition processes involving thousands of sources, yet no scalable tools exist to support the repair of the resulting data. Modern wrapper-generation systems leverage a number of features, ranging from HTML and visual structure to knowledge bases and microdata. Nevertheless, automatically generated wrappers often suffer from errors resulting in under- or over-segmented data, together with missing or spurious content. Under- and over-segmentation of attributes are commonly caused by irregular HTML markup or by multiple attributes occurring within the same DOM node; incorrect column types are instead associated with a lack of domain knowledge, supervision, or microdata during wrapper generation. The degraded quality of the generated relations argues for means to repair both the data and the corresponding wrapper, so that future wrapper executions produce cleaner data.

WADaR takes as input a (possibly incorrect) wrapper and a target relation schema, and iteratively repairs both the generated relations and the wrapper by observing the output of the wrapper execution. A key observation is that errors in the extracted relations are likely to be systematic, since wrappers are often generated from templated websites. WADaR's repair process:
(i) annotates the extracted relations with standard entity recognizers,
(ii) computes Markov chains describing the most likely segmentation of attribute values in the records, and
(iii) induces regular expressions that re-segment the input relation according to the given target schema and that can possibly be encoded back into the wrapper.
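The loop below is a minimal, self-contained sketch of steps (i) and (iii): the records, schema, and regular expressions are invented for illustration, and simple per-attribute recognizers stand in for WADaR's entity annotators and Markov-chain segmentation.

```python
import re

# Toy stand-ins for WADaR's entity recognizers; the real system uses
# standard annotators plus Markov chains over annotation sequences.
RECOGNIZERS = {
    "title":  re.compile(r"^[A-Za-z .']+?(?=\s*\(|\s*\d|$)"),
    "year":   re.compile(r"\((\d{4})\)"),
    "rating": re.compile(r"(\d\.\d)\s*$"),
}

def annotate(value):
    """Annotate one mis-segmented wrapper output value with
    (attribute, matched_text) spans."""
    spans = []
    for attr, rx in RECOGNIZERS.items():
        m = rx.search(value)
        if m:
            spans.append((attr, m.group(1) if m.groups() else m.group(0).strip()))
    return spans

def repair(records, target_schema):
    """Re-segment each raw record according to the target schema,
    mimicking one round of the annotate-and-resegment loop."""
    repaired = []
    for raw in records:
        found = dict(annotate(raw))
        repaired.append(tuple(found.get(attr) for attr in target_schema))
    return repaired

# Under-segmented output of a hypothetical wrapper: title, year, and
# rating were all extracted from a single DOM node into one column.
rows = ["The Third Man (1949) 8.1", "Vertigo (1958) 8.3"]
print(repair(rows, ("title", "year", "rating")))
# [('The Third Man', '1949', '8.1'), ('Vertigo', '1958', '8.3')]
```

Because the errors are systematic, the induced segmentation can be encoded back into the wrapper so that future executions emit clean records directly.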
In this paper, related work was not evaluated in detail:
[1] M. Bronzi, V. Crescenzi, P. Merialdo, and P. Papotti. Extraction and integration of partially overlapping web sources. PVLDB, 6(10):805-816, 2013.
[2] L. Chen, S. Ortona, G. Orsi, and M. Benedikt. Aggregating semantic annotators. PVLDB, 6(13):1486-1497, 2013.
[3] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. TEGRA: Table extraction by global record alignment. In SIGMOD, pages 1713-1728. ACM, 2015.

Presented by: Zohreh Raghebi
Term: Fall 2015

We propose graph-pattern association rules (GPARs) for social media marketing. Extending association rules for itemsets, GPARs help us discover regularities between entities in social graphs. We study the problem of discovering top-k diversified GPARs, and the problem of identifying potential customers with GPARs.

A graph-pattern association rule (GPAR) R(x, y) is defined as Q(x, y) ⇒ q(x, y), where Q(x, y) is a graph pattern in which x and y are two designated nodes, and q(x, y) is an edge labeled q from x to y, on which the same search conditions as in Q are imposed. We refer to Q and q as the antecedent and consequent of R. We model R(x, y) as a graph pattern PR, by extending Q with a (dotted) edge q(x, y). We treat q(x, y) as pattern Pq, and q(x, G) as the set of matches of x in G by Pq.

We are interested in GPARs for a particular event q(x, y). However, mining often generates an excessive number of rules, which often pertain to the same or similar people. This motivates us to study a diversified mining problem: discovering GPARs that are both interesting and diverse (a greedy selection sketch follows the problem statements below).

Problem. Based on the objective function, the diversified GPAR mining problem (DMP) is stated as follows.
- Input: a graph G, a predicate q(x, y), a support bound σ, and positive integers k and d.
- Output: a set Lk of k nontrivial GPARs pertaining to q(x, y) such that (a) F(Lk) is maximized, and (b) for each GPAR R ∈ Lk, supp(R, G) ≥ σ.
DMP is a bi-criteria optimization problem: discover GPARs for a particular event q(x, y) with high support and a balance of confidence and diversity. In practice, users can freely specify the q(x, y) of interest; proper parameters (e.g., support, confidence, diversity) can be estimated from query logs or recommended by domain experts.

Consider a set Σ of GPARs pertaining to the same q(x, y), i.e., their consequents are the same event q(x, y). We define the set of entities identified by Σ in a (social) graph G with confidence η.
Problem. We study the entity identification problem (EIP):
- Input: a set Σ of GPARs pertaining to the same q(x, y), a confidence bound η > 0, and a graph G.
- Output: Σ(x, G, η), i.e., the potential customers x of y in G identified by at least one GPAR in Σ with confidence at least η.
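To make the confidence/diversity trade-off in DMP concrete, here is a minimal greedy selection sketch. The rule representation, the Jaccard-style diversity measure, and the weight `lam` are assumptions for illustration; the paper's actual objective F and mining algorithm differ.

```python
def greedy_diversified_topk(rules, k, sigma, lam=0.5):
    """Greedy sketch of diversified GPAR selection (DMP). `rules` is a
    list of dicts with precomputed 'supp', 'conf', and 'matches' (the
    set of entities x the rule identifies in G)."""
    candidates = [r for r in rules if r["supp"] >= sigma]  # support bound

    def diversity(a, b):
        # Fraction of non-shared matches (Jaccard distance): rules that
        # identify different people count as more diverse.
        inter = len(a["matches"] & b["matches"])
        union = len(a["matches"] | b["matches"]) or 1
        return 1 - inter / union

    chosen = []
    while candidates and len(chosen) < k:
        # Marginal gain mixes the rule's own confidence with its
        # diversity against rules already picked.
        def gain(r):
            return (1 - lam) * r["conf"] + lam * sum(diversity(r, c) for c in chosen)
        best = max(candidates, key=gain)
        chosen.append(best)
        candidates.remove(best)
    return chosen
```

Greedy selection is a natural fit here because diversified top-k objectives of this shape are typically approximated rather than solved exactly.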
Authors: Wenfei Fan (1,2), Zhe Fan (3), Chao Tian (1,2), Xin Luna Dong (4) — (1) University of Edinburgh, (2) Beihang University, (3) Hong Kong Baptist University, (4) Google Inc.
{wenfei@inf., chao.tian@}ed.ac.uk, zfan@comp.hkbu.edu.hk, lunadong@google.com

Keys for graphs aim to uniquely identify entities represented by vertices in a graph. We propose a class of keys that are recursively defined in terms of graph patterns and are interpreted with subgraph isomorphism. Extending conventional keys for relations and XML, these keys find applications in object identification, knowledge fusion, and social network reconciliation. As an application, we study the entity matching problem: given a graph G and a set Σ of keys, find all pairs of entities (vertices) in G that are identified by keys in Σ. We provide two parallel scalable algorithms for entity matching, one in MapReduce and one in a vertex-centric asynchronous model.

Entity resolution is the task of identifying records that refer to the same real-world entity. Keys for graphs yield a deterministic method that provides an invariant connection between vertices and the real-world entities they represent. The quality of matches identified by keys depends highly on the keys discovered and used, although keys help us reduce false positives; we defer the topic of key discovery to another paper and focus primarily on the efficiency of applying such constraints. Finally, we remark that entity resolution is just one of the applications of keys for graphs, besides, e.g., digital citations and knowledge base expansion. Entity matching also differs from record matching, which identifies tuples in relations and does not enforce topological constraints in the matching process.

Consider a graph G and an entity e in G. We say that G matches Q(x) at e if there exist a set S of triples in G and a valuation ν of Q(x) in S such that ν(x) = e and ν is a bijection between Q(x) and S. We refer to S as a match of Q(x) in G at e under ν. Intuitively, ν is an isomorphism from Q(x) to S when Q(x) and S are depicted as graphs; that is, we adopt subgraph isomorphism as the semantics of graph pattern matching.

Example 4: Consider Q4(x) and G2, and a set S1 of triples in G2: {(com1, name_of, "AT&T"), (com4, name_of, "AT&T"), (com1, parent_of, com4), (com3, parent_of, com4)}. Then S1 is a match of Q4(x) in G2 at com4, which maps variable x to com4, name* to "AT&T", the wildcard company to com1, and company to com3.

Keys for graphs: a key for entities of type τ is a graph pattern Q(x), where x is a designated entity variable of type τ.

Authors: Amr El-Helw (Pivotal Inc., Palo Alto, CA, USA), Venkatesh Raghavan (Pivotal Inc.), Mohamed A. Soliman (Pivotal Inc.), George Caragea (Pivotal Inc.), Zhongxian Gu (Datometry Inc., San Francisco, CA, USA), Michalis Petropoulos (Amazon Web Services, Palo Alto, CA, USA)
Presented by: Zohreh Raghebi
Term: Fall 2015

Big Data analytics are becoming increasingly common in many business domains, including financial corporations, government agencies, and insurance providers, and often involve complex queries with similar or identical expressions. Massively Parallel Processing (MPP) databases address these challenges by distributing storage and query processing across multiple nodes and processes.

Common Table Expressions (CTEs) are commonly used in complex analytical queries that have many repeated computations. A CTE can be seen as a temporary table that exists just for one query; its purpose is to avoid re-execution of expressions referenced more than once within a query. CTEs may be defined explicitly or generated implicitly by the query optimizer. CTEs follow a producer/consumer model: the data is produced by the CTE definition and consumed at all the locations where that CTE is referenced, as in the toy analogy below.
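A toy Python analogy for the two execution options the producer/consumer model enables (this is not the paper's system; `expensive_cte` and the timing numbers are illustrative):

```python
import functools
import time

def expensive_cte():
    """Stand-in for a CTE producer: an expensive common expression."""
    time.sleep(0.1)                       # pretend this scans a large table
    return [i * i for i in range(10)]

# "Inlined" execution: every consumer re-runs the producer (3 executions).
def query_inlined():
    return sum(expensive_cte()) + max(expensive_cte()) + len(expensive_cte())

# "Materialized" execution: the producer runs once; consumers read the
# cached result (in memory here; spilled to disk in a real system).
@functools.lru_cache(maxsize=1)
def cte_producer():
    return tuple(expensive_cte())

def query_materialized():
    t = cte_producer()                    # first consumer triggers the producer
    return sum(t) + max(t) + len(t)       # remaining consumers reuse the result

for q in (query_inlined, query_materialized):
    start = time.perf_counter()
    q()
    print(q.__name__, round(time.perf_counter() - start, 2), "s")
# query_inlined ~0.3 s, query_materialized ~0.1 s
```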
One possible approach to executing CTEs is to expand (inline) all CTE consumers, rewriting the query internally to replace each reference to the CTE with its definition. This approach simplifies query execution logic, but may incur performance overhead due to executing the same expression multiple times. In the opposite approach, the CTE expression is optimized separately and executed only once; the results are kept in memory, or written to disk if the data does not fit in memory, and then read wherever the CTE is referenced. This approach avoids the cost of repeatedly executing the same expression, although it may incur disk I/O overhead. Its impact on query optimization time is rather limited, since the optimizer chooses one plan to be shared by all CTE consumers; however, important optimization opportunities can be missed by fixing one execution plan for all consumers.

MPP systems leverage parallel query execution: different parts of the query plan execute simultaneously as separate processes, possibly running on different machines. In some cases, a process has to wait until another process produces the data it needs. For complicated queries involving multiple CTEs, the optimizer needs to guarantee that no two or more processes can be waiting on each other during query execution; CTE constructs therefore need to be cleanly abstracted within the query optimization framework to guarantee deadlock-free plans.

The approaches of always inlining CTEs or never inlining CTEs are easily shown to be sub-optimal. The query optimizer needs to efficiently enumerate and cost plan alternatives that combine the benefits of both approaches, and CTEs should not be optimized in isolation, without taking into account the context in which they occur; isolated optimization can easily miss several optimization opportunities. Consider the three plans of the running example:
1. No occurrence of the CTE is inlined. This avoids repeated computation, but does not take advantage of the index on i_color.
2. The opposite approach: all occurrences of the CTE are replaced by its expansion. This allows the optimizer to utilize the index on i_color, but suffers from repeated computation.
3. Figure 1(c) depicts a possible plan in which one occurrence of the CTE is expanded, allowing the use of the index, while the other two occurrences are not inlined, to avoid recomputing the common expression.

Contributions:
- A novel framework for the optimization of CTEs in MPP database systems. The framework extends and builds upon our optimizer infrastructure to allow optimization of CTEs within the context in which they are used in a query.
- A new technique in which a CTE is not re-optimized for every reference in the query, but only when there are optimization opportunities, e.g., pushing down filters or sort operations. This ensures that optimization time does not grow exponentially with the number of CTE consumers.
- A cost-based approach for deciding whether or not to expand CTEs in a given query. The cost model takes into account disk I/O as well as the cost of repeated CTE execution (sketched below).
- A query execution model that guarantees that the CTE producer is always executed before the CTE consumer(s). In MPP settings, this is crucial for deadlock-free execution.
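A back-of-the-envelope sketch of such a cost-based decision, with invented cost parameters; the paper's optimizer enumerates and costs full plan alternatives within its plan space rather than applying a closed formula like this:

```python
def choose_cte_strategy(expr_cost, n_consumers, write_cost, read_cost):
    """Illustrative per-CTE decision:
    - inline: the common expression runs once per consumer;
    - materialize: one execution, one write, and one read per consumer."""
    inline_cost = n_consumers * expr_cost
    materialize_cost = expr_cost + write_cost + n_consumers * read_cost
    return "inline" if inline_cost <= materialize_cost else "materialize"

# An expensive expression with many consumers favors materialization...
print(choose_cte_strategy(expr_cost=100, n_consumers=3,
                          write_cost=40, read_cost=10))   # materialize
# ...while a cheap expression with few consumers favors inlining.
print(choose_cte_strategy(expr_cost=10, n_consumers=2,
                          write_cost=40, read_cost=10))   # inline
```

The interesting cases are the mixed plans (like Figure 1(c)), where inlining some consumers unlocks consumer-specific optimizations such as index use.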
Authors: Ben Kimmett, Venkatesh Srinivasan, Alex Thomo (University of Victoria, Canada) — {blk,srinivas,thomo}@uvic.ca
Presented by: Zohreh Raghebi
Term: Fall 2015

We report experimental results for the MapReduce algorithms proposed by Afrati, Das Sarma, Menestrina, Parameswaran, and Ullman in ICDE '12 (fuzzy joins using MapReduce) for computing fuzzy joins of binary strings under Hamming distance. Their algorithms come with a complete theoretical analysis, but no experimental evaluation was provided. Several algorithms have been proposed for performing a "fuzzy join" (an operation that finds pairs of similar items) in MapReduce; this work concentrates on binary strings and Hamming distance. The algorithms evaluated are:
- Naive, which compares every string in the set with every other;
- Ball-Hashing, which sends each string to a "ball" of all nearby strings within a certain similarity;
- Anchor Points, a randomized algorithm that selects a set of anchor strings and compares any pair of strings that are close enough to the same member of that set;
- Splitting, which splits the strings into pieces and compares only strings with matching pieces (see the sketch below).

It is argued in the original paper that there is a tradeoff between communication cost and processing cost, and that the proposed algorithms form a skyline, i.e., none dominates another. One of our objectives is to see whether this skyline can be observed in practical terms. We observe via experiments that some algorithms are almost always preferable to others: Splitting is a clear winner, while Ball-Hashing suffers at all distance thresholds except the very small ones.
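A single-process sketch of Splitting under Hamming distance: by the pigeonhole principle, two strings within distance d must agree exactly on at least one of d + 1 pieces, so only strings sharing a (piece index, piece) bucket need to be verified. The bucketing mirrors the map/reduce phases; the data is invented.

```python
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def fuzzy_join_splitting(strings, d):
    """Find all pairs of equal-length binary strings within Hamming
    distance d. In the MapReduce version, the (piece index, piece)
    buckets are the reduce keys."""
    n = len(strings[0])
    cuts = [round(i * n / (d + 1)) for i in range(d + 2)]
    buckets = defaultdict(set)                       # "map" phase
    for s in strings:
        for i in range(d + 1):
            buckets[(i, s[cuts[i]:cuts[i + 1]])].add(s)
    pairs = set()                                    # "reduce" phase
    for group in buckets.values():
        for a, b in combinations(sorted(group), 2):
            if hamming(a, b) <= d:                   # verify candidates
                pairs.add((a, b))
    return pairs

print(fuzzy_join_splitting(["0000", "0001", "0011", "1111"], 1))
# {('0000', '0001'), ('0001', '0011')}
```

The tradeoff is visible even in the sketch: fewer, larger buckets mean less communication but more pairwise verification, and vice versa.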
Presented by: Shahab Helmi
Term: Fall 2015
Authors: Yael Amsterdamer (Tel Aviv University), Anna Kukliansky (Tel Aviv University), Tova Milo (Tel Aviv University)
Publication: VLDB 2015
Type: Research Paper

Many real-life queries require the joint analysis of general knowledge, which includes facts about the world, with individual knowledge, which relates to the opinions or habits of individuals. For example, "What are the most interesting places near Forest Hotel, Buffalo, that we should visit in the fall?" combines general facts (locations, opening hours) with individual knowledge (which locations are interesting depends on people's opinions or habits).

Existing platforms require users to specify their information needs in a formal, declarative language, which may be too complicated for naive users; hence, a question in natural language should be translated into a well-formed query. The NL-to-query translation problem has been studied previously for queries over general data (knowledge), including SQL/XQuery/SPARQL queries; related techniques include crowdsourcing (asking users to refine the translated query) and NL tools for parsing and detecting the semantics of NL sentences. The mix of general and individual knowledge leads to unique challenges:
- Distinguishing the individual and general parts of the question. The crowd information for the individual part of an NL question may not be in the knowledge base, so most current techniques, which are based on aligning questions to the knowledge base, do not apply.
- Integrating the queries generated for the individual and general parts of the question into one well-formed query.

Contributions: the modular design of a translation framework that addresses the challenges above, and the development of new modules. The knowledge representation must be expressive enough to account both for general knowledge, to be queried from an ontology, and for individual knowledge, to be collected from the crowd. For general knowledge, RDF offers publicly available knowledge bases such as DBpedia and LinkedGeoData. Example triples: {Buffalo, NY inside USA}, {Buffalo, NY hasLabel "interesting"}, {I visit Buffalo, NY}.

The query language to which NL questions are translated should naturally match the knowledge representation: QASSIS-QL is a query language that extends SPARQL, the RDF query language, with crowd-mining capabilities. The framework distinguishes the individual and general parts of the question according to grammatical roles, using a dependency parser. This tool parses a given text into a standard structure called a dependency graph: a directed graph (typically a tree) with labels on the edges, exposing the different types of semantic dependencies between the terms of a sentence (the grammatical roles of the words). What remains is to perform the translation from the NL representation to the query-language representation; a toy illustration of the role-based split follows below.

Experiments (limit threshold): we arbitrarily chose the first 500 questions from the Yahoo! Answers repositories.
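A toy illustration of a grammatical-role split over a dependency graph. The edge labels follow common dependency-parser conventions, but the example graph and the crowd-subject heuristic are invented for illustration; they are not the paper's actual modules.

```python
# Toy dependency graph for "What are the interesting places near Forest
# Hotel we should visit in the fall?" -- edges are (head, label, dependent).
EDGES = [
    ("places", "amod", "interesting"),
    ("places", "prep_near", "Forest Hotel"),
    ("visit", "dobj", "places"),
    ("visit", "nsubj", "we"),
    ("visit", "prep_in", "fall"),
]

# Invented heuristic: edges governed by a verb whose subject refers to
# the crowd ("we", "people", "I") are individual knowledge; the rest is
# general knowledge answerable from an ontology.
CROWD_SUBJECTS = {"we", "people", "I"}

def split_parts(edges):
    crowd_verbs = {h for h, lbl, d in edges
                   if lbl == "nsubj" and d in CROWD_SUBJECTS}
    individual = [(h, lbl, d) for h, lbl, d in edges if h in crowd_verbs]
    general = [e for e in edges if e not in individual]
    return individual, general

ind, gen = split_parts(EDGES)
print("individual:", ind)   # "visit"-rooted facts -> ask the crowd
print("general:", gen)      # location facts -> query the ontology
```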
Presented by: Shahab Helmi
Term: Fall 2015
Authors: Weimo Liu (The George Washington University), Md Farhadur Rahman (University of Texas at Arlington), Saravanan Thirumuruganathan (University of Texas at Arlington), Nan Zhang (The George Washington University), Gautam Das (University of Texas at Arlington)
Publication: VLDB 2015
Type: Research Paper

Location-returned services (LR-LBS) return the locations of the k returned tuples (e.g., Google Maps). Location-not-returned services (LNR-LBS) do not return the locations of the k tuples, only other attributes such as ID and ranking (e.g., WeChat, Sina Weibo). A k-nearest-neighbors (kNN) query returns the k tuples nearest to the query point according to a ranking function (Euclidean distance in this paper). An LBS with a kNN interface is a hidden database with limited access, usually through a public web query interface or API. These interfaces impose constraints such as query limits (e.g., 10,000 queries per user per day in Google Maps) and maximum coverage limits (e.g., only tuples within 5 miles of the query point).

Aggregate estimation: for many applications, it is important to collect aggregate statistics over such hidden databases, such as SUM, COUNT, or distributions of the tuples satisfying certain selection conditions. For example, a hotel recommendation application would like to know the average review scores of Marriott vs. Hilton hotels in Google Maps; a cafe chain startup would like to know the number of Starbucks locations in a certain geographical region; a demographics researcher may wish to know the gender ratio of social network users in China. Aggregate information could be obtained by entering into data-sharing agreements with the location-based service providers, but this can be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data; retrieving the whole database through the limited interface would take too long. Instead, we compute approximate estimates of such aggregates by querying the database only via its restrictive public interface, with two goals: minimizing the query cost (i.e., asking as few queries as possible) and making the aggregate estimations as accurate as possible.

Related work. Analytics and inference over LBS: estimating COUNT and SUM aggregates; error reduction, such as bias correction. Aggregate estimation over hidden web repositories: unbiased estimators for COUNT and SUM aggregates over static databases; efficient techniques to obtain random samples from hidden web databases that can then be used for aggregate estimation; estimating the size of search engines.

For LR-LBS interfaces, the developed algorithm (LR-LBS-AGG) for estimating COUNT and SUM aggregates represents a significant improvement over prior work along multiple dimensions: a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations; the top-k returned tuples are leveraged rather than only the top-1; and several innovative techniques are developed for reducing error and increasing efficiency. For LNR-LBS interfaces, the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work; it is not bias-free, but the bias can be controlled to any desired precision.

In a Voronoi diagram, each point has a corresponding cell consisting of all points closer to it than to any other point (top-1 Voronoi; the notion generalizes to top-2 and top-k Voronoi cells). The key idea is to precisely compute Voronoi cells: a uniformly random query point falls into the cell of tuple t with probability p(t) = area(V(t)) / area(V), where V(t) is t's Voronoi cell and V the query region, so 1/p(t) is an unbiased single-sample estimator of COUNT(*); a sketch of this estimator follows below. Extensions include computing Voronoi cells faster and error reduction.

Datasets:
- Offline real-world dataset (OpenStreetMap, USA portion), to verify the correctness of the algorithm.
- Online LBS demonstrations (Google Maps, WeChat, Sina Weibo), to evaluate the efficiency of the algorithm.
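A Monte Carlo sketch of why the Voronoi idea yields an unbiased COUNT estimator. Everything here (the simulated LBS, the probe counts) is illustrative: the actual LR-LBS-AGG computes cell areas exactly from kNN answers instead of probing, which is why it is bias-free and far cheaper in queries.

```python
import random

def nearest(points, q):
    """Black-box stand-in for a top-1 kNN query to the LBS."""
    return min(points, key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)

def estimate_count(points, region=1.0, trials=200, probes=400):
    """Draw a random query point, take its top-1 tuple t, estimate
    p(t) = area(V(t)) / area(V) by probing with more random points,
    and average the single-sample estimator 1/p(t)."""
    est = 0.0
    for _ in range(trials):
        q = (random.uniform(0, region), random.uniform(0, region))
        t = nearest(points, q)
        hits = sum(nearest(points, (random.uniform(0, region),
                                    random.uniform(0, region))) == t
                   for _ in range(probes))
        p = max(hits / probes, 1 / probes)   # avoid division by zero
        est += 1 / p
    return est / trials

pts = [(random.random(), random.random()) for _ in range(50)]
print(estimate_count(pts))   # roughly the true count (50), up to noise
```

Since a random point lands in t's cell with probability p(t), the estimator has expectation Σ_t p(t) · (1/p(t)) = COUNT(*); the residual bias in this sketch comes only from estimating p(t) by probing rather than computing it exactly.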