Authors: Foteini Katsarou, Nikos Ntarmos, Peter Triantafillou (University of Glasgow, UK)
Presented by: Zohreh Raghebi, Fall 2015

Graph data management systems have become very popular. One of the main problems for these systems is subgraph query processing: given a query graph, return all graphs in the dataset that contain the query. The naive approach performs a subgraph isomorphism test against each graph in the dataset, which does not scale, as subgraph isomorphism is NP-complete. Many indexing methods have therefore been proposed to reduce the number of candidate graphs that must undergo the subgraph isomorphism test (a minimal sketch of this filter-then-verify pipeline follows this summary).

The paper identifies a set of key factors/parameters that influence the performance of the related methods:
- the number of nodes per graph
- the graph density
- the number of distinct labels
- the number of graphs in the dataset
- the query graph size
The goals are, first, to derive conclusions about the algorithms' relative performance and, second, to highlight how both performance and scalability depend on the above factors. Six well-established indexing methods are compared: Grapes, CT-Index, GraphGrepSX (GGSX), gIndex, Tree+∆, and gCode.

Related work: most prior methods were tested only against the AIDS antiviral dataset and synthetic datasets; Grapes alone used several real datasets, formed of many small graphs, and its authors did not evaluate scalability. The iGraph comparison framework compared the performance of older algorithms (up to 2010); since then, several more efficient algorithms have been proposed.

Findings:
- A linear increase in the number of nodes results in a quadratic increase in the number of edges, and the number of edges increases linearly with the graph density; increasing either factor leads to a detrimental increase in indexing time.
- The number of graphs in the dataset increases the overall complexity only linearly; frequent-mining techniques are more sensitive to it because more features have to be located across more graphs.
- An increase in the number of distinct labels makes the dataset easier to index, because (1) any given feature occurs less often and (2) the false-positive ratio of the various algorithms decreases.

These findings give rise to the following adage: "Keep It Simple and Smart". The simpler the feature structure and extraction process, the faster the indexing and query processing. Frequent-mining algorithms (gIndex, Tree+∆) may be competitive for small/sparse datasets, but techniques using exhaustive enumeration (Grapes, GGSX, CT-Index) are the clear winners, especially those indexing simple features such as paths (Grapes, GGSX) rather than more complex features such as trees and cycles (CT-Index).
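All of these methods share a filter-then-verify structure: extract features from every dataset graph at indexing time, prune graphs that lack some feature of the query, and run the expensive subgraph isomorphism test only on the survivors. The following is a minimal sketch of that pipeline, not any specific paper's algorithm; it assumes networkx, uses label-path features of bounded length as the index, and expects each node to carry a "label" attribute.

```python
import networkx as nx
from networkx.algorithms.isomorphism import GraphMatcher

def path_features(g, max_len=3):
    """Label sequences of simple paths with up to max_len nodes (simplified feature set)."""
    feats = set()
    for start in g.nodes:
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            feats.add(tuple(g.nodes[v]["label"] for v in path))
            if len(path) < max_len:
                stack.extend((w, path + [w]) for w in g.neighbors(node) if w not in path)
    return feats

def build_index(graphs, max_len=3):
    """graphs: dict graph_id -> nx.Graph. Returns graph_id -> feature set."""
    return {gid: path_features(g, max_len) for gid, g in graphs.items()}

def subgraph_query(query, graphs, index, max_len=3):
    qfeats = path_features(query, max_len)
    # Filtering: a graph that is missing any feature of the query cannot contain it.
    candidates = [gid for gid, feats in index.items() if qfeats <= feats]
    # Verification: exact (NP-complete) subgraph matching on the surviving candidates.
    def matches(g):
        gm = GraphMatcher(g, query, node_match=lambda a, b: a["label"] == b["label"])
        return gm.subgraph_is_monomorphic()
    return [gid for gid in candidates if matches(graphs[gid])]
```

Real systems store the features in an inverted index or trie rather than one set per graph; the per-graph sets simply keep the sketch short.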
Industrial paper
Authors: Avery Ching, Sergey Edunov, Maja Kabiljo
Presented by: Zohreh Raghebi, Fall 2015

Analyzing real-world graphs at the scale of hundreds of billions of edges with available software is very difficult, and graph processing engines face additional challenges in scaling to larger graphs. Apache Giraph is an iterative graph processing system designed to scale to processing trillions of edges; it is used at Facebook and was inspired by Pregel, the graph processing system developed at Google. Initially, however, Giraph did not scale to Facebook's needs, with over 1.39B users and hundreds of billions of social connections, so the platform had to be improved to support these workloads.

Original limitations:
- Giraph's graph input model was only vertex-centric.
- Parallelizing the Giraph infrastructure relied on MapReduce's task-level parallelism; there was no multithreading support.
- Giraph's flexible types were initially implemented using native Java objects, which consumed excessive memory and garbage collection time.
- The aggregator framework was inefficiently implemented in ZooKeeper, yet very large aggregators needed to be supported.

Improvements:
- Flexible input: Giraph was modified to allow loading vertex data and edges from separate sources.
- Parallelization support: worker-local multithreading takes advantage of additional CPU cores.
- Memory optimization: by default, the edges of every vertex are serialized into a byte array rather than instantiated as native Java objects.
- Aggregator architecture: each aggregator is now randomly assigned to one of the workers, so aggregation responsibilities are balanced across all workers and no longer bottlenecked by the master.

Authors: Xiaofei Zhang†, Hong Cheng‡, Lei Chen† (†Department of Computer Science & Engineering, HKUST)
Presented by: Zohreh Raghebi, Fall 2015

The VSB (vertex set bonding) query extracts the most prominent vertices: it returns a minimum set of vertices with the maximum importance, measured by total betweenness and shortest-path reachability, in connecting two sets of input vertices. Applications include logistic planning and social community bonding. In social network studies, the aim is to understand information propagation and hidden correlations and to find the "bonding" of communities. An ideal bonding agent would:
1. reside on as many cross-group pairwise shortest paths as possible, and
2. connect as large a portion of the two groups as possible.
Such agents could best serve message passing between the two groups, so the VSB query ranks a vertex's prominence with two factors: betweenness and shortest-path connectivity.

Why existing tools fall short:
- Minimum cut finds the minimum set of edges whose removal splits a graph into two disjoint subgraphs, but it does not show how other vertices contribute to the connection between the two sets.
- Top-k betweenness computations are employed to find important vertices in a network; however, due to the local dominance property of the betweenness metric, such queries cannot serve vertex set bonding properly.

Two novel building blocks enable efficient VSB query evaluation:
- Guided graph exploration with a vertex filtering scheme to reduce redundant vertex accesses; the minimum set of vertices with the highest accumulative betweenness is returned as the bonding vertices.
- Betweenness ranking on exploration: instead of computing exact betweenness values, the betweenness of vertices is ranked during graph exploration to save computation cost.
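To make the two ranking factors concrete, the brute-force sketch below scores every vertex outside the two input sets by (a) how many cross-group pairs it helps connect and (b) how many cross-group shortest paths it lies on, then returns the top-k. It is only an illustration of the bonding notion, assuming networkx; it recomputes exact shortest paths and has none of the paper's guided exploration or on-exploration ranking.

```python
import itertools
import networkx as nx

def bonding_scores(g, set_a, set_b):
    """Score vertices outside A and B by (#cross-group pairs covered, #cross-group shortest paths)."""
    candidates = [v for v in g.nodes if v not in set_a and v not in set_b]
    pair_cover = {v: set() for v in candidates}
    path_count = {v: 0 for v in candidates}
    for a, b in itertools.product(set_a, set_b):
        if not nx.has_path(g, a, b):
            continue
        for path in nx.all_shortest_paths(g, a, b):
            for v in path[1:-1]:              # interior vertices of this shortest path
                if v in path_count:
                    path_count[v] += 1
                    pair_cover[v].add((a, b))
    return {v: (len(pair_cover[v]), path_count[v]) for v in candidates}

def top_bonding_vertices(g, set_a, set_b, k=3):
    scores = bonding_scores(g, set_a, set_b)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```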
Authors: Daniel Margo (Harvard University, Cambridge, Massachusetts, dmargo@eecs.harvard.edu), Margo Seltzer (Harvard University, Cambridge, Massachusetts, margo@eecs.harvard.edu)
Presented by: Zohreh Raghebi, Fall 2015

Sheep is a partitioner capable of handling graphs that far exceed main memory while producing high-quality edge partitions. Graph partitioning is an important problem that affects many graph-structured systems, and partitioning quality greatly impacts the performance of distributed graph analysis frameworks.

Existing approaches:
- METIS, the gold standard for graph partitioning, is a multi-level graph partitioning algorithm, but such approaches do not scale to today's large graphs.
- Streaming partitioners: a graph loader reads serial graph data from disk onto a cluster and must decide the location of each node as it is loaded; the goal is an optimal balanced partitioning with as little computation as possible. These partitioners are sensitive to the stream order, which can affect performance, and streaming algorithms are difficult to parallelize.

Sheep partitions by a method that does not vary with how the input graph is distributed, so it can arbitrarily divide the input graph for parallelism and fit tasks in memory. Sheep reduces the input graph to a small elimination tree, and this tree transformation is a distributed map-reduce operation. Using simple degree ranking, Sheep creates competitive edge partitions faster than other partitioners.

Type: Research paper
Authors: Tomohiro Manabe, Keishi Tajima
Presented by: Siddhant Kulkarni
Term: Fall 2015

How is the logical structure different from the mark-up structure? There is a gap between human understanding and browser interpretation: "Mark-up structure does not necessarily always correspond to the logical hierarchy." The basic idea targets web pages with improper tag usage; HTML5 solves this problem, but not all existing web pages can be ported to HTML5.

Other techniques for document segmentation are based on margins between blocks, on text density, on identification of important blocks, etc. Most rely on tags (so does this paper, but not entirely). The authors define blocks and headings for their own structure extraction and extract the logical hierarchy using preprocessing followed by heading-based page segmentation.

Evaluation: the dataset is the ClueWeb09 Category B web snapshot document collection; accuracy is calculated from the precision and recall of the extracted relationships, where the relationship types are parent, ancestor, sibling, child, and descendant.
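As a toy baseline for heading-based segmentation, the sketch below nests content under h1-h6 tags only: each heading becomes the child of the nearest preceding heading of a higher level. This is a deliberate simplification; the paper's own block and heading definitions go beyond explicit heading tags precisely because real pages misuse markup.

```python
import re

HEADING = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)

def heading_hierarchy(html):
    """Return (level, title, parent_index) triples forming a crude logical hierarchy."""
    nodes, stack = [], []                      # stack: indices of currently open ancestors
    for m in HEADING.finditer(html):
        level = int(m.group(1))
        title = re.sub(r"<[^>]+>", "", m.group(2)).strip()
        while stack and nodes[stack[-1]][0] >= level:
            stack.pop()                        # close headings at the same or deeper level
        parent = stack[-1] if stack else None
        nodes.append((level, title, parent))
        stack.append(len(nodes) - 1)
    return nodes

nodes = heading_hierarchy("<h1>Site</h1><h2>News</h2><h3>Today</h3><h2>About</h2>")
for level, title, parent in nodes:
    print(level, title, "parent:", None if parent is None else nodes[parent][1])
```

The printed parent links correspond to the parent/child relationships that the paper's evaluation measures with precision and recall.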
Type: Industry paper
Authors: Daniel Haas, Jason Ansel, Lydia Gu, Adam Marcus
Presented by: Siddhant Kulkarni
Term: Fall 2015

What is macrotask crowdsourcing, and what are its problems? The related work focuses on macrotasking and crowdsourcing frameworks. Argonaut uses predictive models to identify trustworthy workers who can perform reviews, plus a model to identify which tasks need review, and evaluates the trade-off between single and multiple phases of review under a budget.

Presented by: Shahab Helmi, VLDB 2015 Paper Review Series, Fall 2015
Authors: Moria Bergman (Tel-Aviv University), Tova Milo (Tel-Aviv University), Slava Novgorodov (Tel-Aviv University), Wang-Chiew Tan (University of California, Santa Cruz)
Publication: VLDB 2015
Type: Demonstration paper

It is important for a database to be as complete (no missing values) and correct (no wrong values) as possible. For this reason, many data cleaning tools have been developed to automatically resolve inconsistencies in databases. However, these tools:
- are not able to remove all erroneous data (e.g., 95% accuracy on the YAGO database), and it is impossible to correct all errors manually in big datasets;
- are usually not able to determine what information is missing from a database.

QOCO is a novel query-oriented data cleaning system with oracle crowds. Materialized views (query-oriented views defined through user queries) are used as a trigger for identifying incorrect or missing information. If an error (a wrong tuple or a missing tuple) is detected in a materialized view, the system interacts minimally with a crowd of oracles by asking only pertinent questions. Answers to one question help identify the next pertinent questions to ask, and ultimately a sequence of edits is derived and applied to the underlying database. Cleaning the entire database is not the goal of QOCO: it cleans parts of the database as needed, so it can be used as a complementary tool alongside other cleaning tools.

Relation to data cleaning techniques: QOCO uses the crowd to correct query results, propagates updates back to the underlying database, and discovers and inserts true tuples that are missing from the input database. Crowdsourcing is a model in which humans perform small tasks to help solve challenging problems such as entity/conflict resolution, duplicate detection, and schema matching.

QOCO deals with correct tuples, wrong tuples, and missing tuples. Example: consider a user query that searches for European teams that won the World Cup at least twice. The result will contain ESP (which is wrong), and ITA will be missing.

The wrong answer ESP is derived by three sets of tuples (witnesses):
- {t1, t2, t3}
- {t2, t4, t3}
- {t4, t1, t3}
where
t1 = Game(11:07:10; ESP; NED; final; 1:0)
t2 = Game(17:07:94; ESP; NED; final; 3:1)
t3 = Team(ESP; EU)
t4 = Game(12:07:98; ESP; NED; final; 4:2)

Question-asking process:
1. Find the most frequent tuple (t3) and ask the oracle whether it is true. t3 is correct, so the remaining candidates become {t1, t2}, {t2, t4}, {t4, t1}.
2. The remaining tuples all have the same frequency, so QOCO chooses one of them at random, say t1. t1 is correct, so the remaining candidates become {t2}, {t2, t4}, {t4}. ESP won the World Cup only once; hence both t2 and t4 are wrong and should be deleted.

More details can be found in the original research paper: M. Bergman, T. Milo, S. Novgorodov, and W. Tan. Query-oriented data cleaning with oracles. In ACM SIGMOD, 2015.
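The question-asking process in the example can be mimicked by a simple greedy loop: repeatedly ask the oracle about the tuple occurring in the most remaining witness sets; if it is wrong, delete it and drop every witness that contained it, and if it is correct, remove it from the witnesses since it cannot be the culprit. This is only a sketch of that greedy strategy, assuming consistent oracle answers (each witness contains at least one erroneous tuple); QOCO's actual edit-derivation machinery, including handling of missing tuples, is richer.

```python
def resolve_wrong_answer(witnesses, ask_oracle):
    """witnesses: iterable of sets of base tuples, each set jointly deriving the wrong answer.
    ask_oracle(t) -> True if tuple t is correct, False if it is erroneous.
    Returns the set of base tuples to delete so that no witness survives intact."""
    witnesses = [set(w) for w in witnesses]
    to_delete = set()
    while witnesses:
        counts = {}
        for w in witnesses:
            for t in w:
                counts[t] = counts.get(t, 0) + 1
        candidate = max(counts, key=counts.get)        # most frequent remaining tuple
        if ask_oracle(candidate):                      # correct: cannot be the tuple to remove
            witnesses = [w - {candidate} for w in witnesses]
        else:                                          # erroneous: delete it, witnesses broken
            to_delete.add(candidate)
            witnesses = [w for w in witnesses if candidate not in w]
    return to_delete

# The worked example above: t1 and t3 are correct, t2 and t4 are erroneous.
truth = {"t1": True, "t2": False, "t3": True, "t4": False}
witnesses = [{"t1", "t2", "t3"}, {"t2", "t4", "t3"}, {"t4", "t1", "t3"}]
print(resolve_wrong_answer(witnesses, lambda t: truth[t]))   # {'t2', 't4'}
```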
Presented by: Shahab Helmi, VLDB 2015 Paper Review Series, Fall 2015
Authors: Weimo Liu (The George Washington University), Md Farhadur Rahman (University of Texas at Arlington), Saravanan Thirumuruganathan (University of Texas at Arlington), Nan Zhang (The George Washington University), Gautam Das (University of Texas at Arlington)
Publication: VLDB 2015
Type: Research paper

Location-based services (LBS):
- Location-returned services (LR-LBS): these services return the locations of the k returned tuples (e.g., Google Maps).
- Location-not-returned services (LNR-LBS): these services do not return the locations of the k tuples, only other attributes such as ID and ranking (e.g., WeChat, Sina Weibo).

k-nearest-neighbor (kNN) queries return the k tuples nearest to the query point according to a ranking function (Euclidean distance in this paper). In an LBS with a kNN interface, third-party applications and end users do not have complete and direct access to the entire database; the database is essentially "hidden", and access is typically limited to a restricted public web query interface or API.

These interfaces impose constraints:
- query rate limits, e.g., 10,000 queries per user per day in Google Maps;
- a maximum coverage limit, e.g., only results within 5 miles of the query point.

Aggregate estimation: for many interesting third-party applications, it is important to collect aggregate statistics over the tuples contained in such hidden databases, such as the sum, count, or distribution of the tuples satisfying certain selection conditions. Examples:
- A hotel recommendation application would like to know the average review scores of Marriott vs. Hilton hotels in Google Maps.
- A cafe chain startup would like to know the number of Starbucks restaurants in a certain geographical region.
- A demographics researcher may wish to know the gender ratio of social network users in China.

Aggregate information could be obtained by entering into data sharing agreements with the location-based service providers, but this approach can be extremely expensive, and sometimes impossible if the data owners are unwilling to share their data; retrieving the whole database through the limited interface would take far too long.

Goals:
- Produce approximate estimates of such aggregates by querying the database only via its restrictive public interface.
- Minimize the query cost (ask as few queries as possible) to adhere to the rate limits or budgetary constraints imposed by the interface.
- Make the aggregate estimates as accurate as possible.

Related work:
- Analytics and inference over LBS: estimating COUNT and SUM aggregates; error reduction, such as bias correction.
- Aggregate estimation over hidden web repositories: unbiased estimators for COUNT and SUM aggregates over static databases; efficient techniques to obtain random samples from hidden web databases that can then be used for aggregate estimation; estimating the size of search engines.

Contributions:
- For LR-LBS interfaces, the developed algorithm (LR-LBS-AGG) for estimating COUNT and SUM aggregates represents a significant improvement over prior work along multiple dimensions: a novel way of precisely calculating Voronoi cells leads to completely unbiased estimations; the top-k returned tuples are leveraged rather than only the top-1; and several innovative techniques reduce error and increase efficiency.
- For LNR-LBS interfaces, the developed algorithm (LNR-LBS-AGG) addresses a novel problem with no prior work. The algorithm is not bias-free, but the bias can be controlled to any desired precision.

In a Voronoi diagram, each point has a corresponding cell consisting of all locations closer to that point than to any other point; the paper uses both top-1 and top-2 Voronoi cells.

Key techniques:
1. Precisely compute Voronoi cells; faster initialization; leverage history on Voronoi cell computation.
2. Error reduction: bias error removal/reduction and variance reduction.

Evaluation datasets:
- Offline real-world dataset (OpenStreetMap, USA portion) to verify the correctness of the algorithms.
- Online LBS demonstrations (Google Maps, WeChat, Sina Weibo) to evaluate the efficiency of the algorithms.
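The key idea behind the LR-LBS estimator can be shown with top-1 queries alone: if a query point q is drawn uniformly from the region of interest, the tuple returned by a 1-NN query at q is the one whose Voronoi cell contains q, so it is sampled with probability cell_area / region_area; weighting each sample by the inverse of that probability yields an unbiased COUNT estimate. The sketch below shows only that weighting step and stubs out the hard parts (in the paper, the cell area is obtained by issuing additional queries through the same interface, and top-k results, history reuse, and error reduction improve on this considerably); all function parameters here are hypothetical placeholders.

```python
def estimate_count(region_area, sample_size, rand_point, knn_top1, cell_area):
    """Inverse-probability (Horvitz-Thompson style) COUNT estimate over a hidden kNN interface.

    rand_point()  -> a uniformly random point in the region of interest
    knn_top1(q)   -> the tuple returned by the LBS for query point q
    cell_area(t)  -> area of t's top-1 Voronoi cell within the region (stubbed here; the paper
                     derives it precisely from further queries against the interface)
    """
    total = 0.0
    for _ in range(sample_size):
        q = rand_point()
        t = knn_top1(q)                       # hit with probability cell_area(t) / region_area
        total += region_area / cell_area(t)   # inverse-probability weight: 1 per tuple in expectation
    return total / sample_size
```

For a SUM aggregate, the same loop accumulates measure(t) * region_area / cell_area(t) instead of the unit weight.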