>> K. Shriraghav: Hello. It's my pleasure to welcome Stephen Whang from Stanford University. His thesis is on entity resolution and, more broadly, data analytics, and he's going to be talking to us about it.
>> Stephen Whang: Thanks, Shriraghav, for the introduction. So I'm Stephen Whang and I'm a PhD student at Stanford University. Today I'll be talking about my thesis work on data analytics: integration and privacy. Nowadays the amount of data in the world is exploding and companies are scrambling to make sense of all of this information. There are lots of recent articles about analyzing large amounts of data, and here is a cover article from the Economist. You can see that there are lots of zeros and ones raining from the sky, and there is this gentleman in the middle trying to gather some of the rain and water this flower, which signifies the useful information being extracted from all of this data. Within data analytics I have been working on two problems, data integration and data privacy. Even before you start analyzing data, it's very important to combine data from different sources, so data integration plays an extremely important role in data analytics. For example, a few years ago we had the devastating earthquake in Haiti where a lot of people died, and people around the world came as volunteers to help out the Haitians. The following figure shows the different types of data that were integrated and analyzed in order to help our Haitian friends. At first there were SMS broadcasts, then people were texting each other, then the local media was generating a lot of data about what was happening after the earthquake, and finally, since there weren't any official maps of Haiti for navigation, people geo-tagged their own locations and crowdsourced the maps, which were then used by volunteers to drive around Haiti. So you can see that data integration played a very important role in helping the Haitians. Now MSR has been a pioneer in data integration, so I really don't need to motivate this problem for this audience.
The second topic I worked on is data privacy. The following image is from a recent Wall Street Journal article and it depicts how insurance companies nowadays are sifting through the web, collecting personal information, and analyzing it in order to predict the lifespans of their customers. Let's say that we have Sarah on the top here in green, and let's say that she has a lot of good health indicators. For instance, she only has a one-mile commute, she does a lot of exercise, she reads a lot of travel magazines, and she does not watch much television. On the other hand, let's say that we have Beth on the bottom in red, and this time let's say that she has a lot of bad health indicators. For example, she has a long commute of 45 miles, she buys a lot of fast food, she has foreclosure activity on her record, and she watches a lot of TV. All of these types of information can be found on social networks, blogs, and homepages, and in this example the insurance company may conclude that Beth is probably less healthy than Sarah and therefore more likely to have a shorter lifespan.
Now although this type of analytics may be quite useful for insurance companies, it's very disturbing from the customers' side, and there is actually a study that shows a high correlation between the analysis of this data and the actual medical tests that the insurance companies perform on their customers. So I have been studying the problem of data analytics from the data privacy point of view as well, and there is an interesting connection between data integration and data privacy. The better you are at integrating data, the more likely it is that someone's information is leaked to the public, so you have worse privacy; and vice versa, the better the privacy measures you take, the harder it is to integrate the information. So it's no coincidence that I've been studying these two problems.
The following slide summarizes my PhD work. First, I've written a number of papers on a topic called entity resolution, which is a data integration problem. Second, I have recently written a number of papers on data privacy, and finally, I published a paper called Indexing Boolean Expressions, which was done while I was interning at Yahoo Research, and the techniques there were used in the display advertisement prototype of the Yahoo webpage. For this talk I'm going to focus on the following works. During part one I will elaborate on a work called evolving rules, and during part two, which is going to be shorter than part one, I will mainly talk about a work called managing information leakage and briefly talk about newer work on disinformation techniques for entity resolution.
Entity resolution is the problem of identifying which records in a database refer to the same real-world entity. For example, you might have a list of customers where some of these records refer to the same person, and you might want to remove the duplicates efficiently. Entity resolution can be a very challenging problem. In general, you're not simply given IDs like Social Security numbers that you can just compare. Instead, many records contain different types of information, and other records may have incorrect information like typos, which makes entity resolution an extremely challenging problem. There are many applications for entity resolution. In our Haiti example you can think of the scenario where there are lots of people sick in bed, so the hospitals may have lists of patients. At the same time, other people may be posting photos of loved ones that are missing, so now you have an entity resolution problem where you want to identify which patients in the hospital lists refer to the same people as those in the photos of the missing. In addition, there are many other applications for entity resolution, and this makes entity resolution an extremely important problem as well.
So to summarize my research focus, I'm not trying to solve the entire data analytics problem. That is a huge problem that requires a lot of research to completely solve. Instead, I'm focusing on a subproblem called data integration. Within that I am solving an entity resolution problem, and for this presentation I will focus on scalability. My work on evolving rules was motivated by my interactions with a company called Spock.com. This is a people search engine that collects information from various social networks like Facebook and MySpace, blogs, and Wikipedia, and the goal is to resolve the hundreds of millions of records to create one profile per person.
Now whenever I visited Spock, I heard discussions about how to improve the logic of the comparison rule for comparing people records. At one point they were talking about comparing the names, addresses, and zip codes of people, but by the time I made another visit they had realized that the zip code attribute is not a good attribute for comparing people, so they decided to compare the names, addresses, and phone numbers of people instead. Every time I made a visit they were making these incremental changes to their comparison rule, and that made me think about how to produce an updated ER result when you change your comparison rule. Let's say that starting with an input set of records I, we use a comparison rule to run entity resolution and produce a resulting set of records. Now if you change your comparison rule, the naïve approach is to simply start from scratch and produce a new result using the new comparison rule. The naïve approach obviously produces a correct result, but it is not necessarily the most efficient approach, and if you are a company like Spock.com that's resolving hundreds of millions of records, then you don't really want to start from scratch every time you make a small incremental change to your comparison rule. So in our paper we proposed an incremental technique that can convert an old ER result into a new ER result. We call this process rule evolution, and hopefully rule evolution is much more efficient than simply starting from scratch. So for example, let's say that we want to resolve four records, yes please?
>>: [inaudible] operation to go through and run the rules again. I'm trying to understand the motivation for trying to reduce the computation.
>> Stephen Whang: I'm going to illustrate that problem; I am going to show our solutions using this simple example, so bear with me. Let's say that we are trying to resolve R1 through R4, and each record contains the attributes name, zip code, and phone number. Of course in the real world there may be many more attributes and many more records to resolve, but I'm only showing you a small example for illustration purposes. Let's say that we use a simple ER algorithm where we first perform pairwise comparisons of all the records and then cluster together the records that match each other. Initially, let's say that our comparison rule compares the names and zip codes of people: if two records have the same name and zip code, the comparison rule returns true, and false otherwise. In this example the only matching pair of records is R1 and R2 because they have the same name, John, and the same zip code, 54321. There are no other matching pairs of records. So after clustering the matching records, we get our first ER result where R1 and R2 are in the same cluster but R3 and R4 are in separate clusters. The total amount of work we performed here is six pairwise comparisons, because we were comparing four records in a pairwise fashion. Now notice that record R4 has a null value for its zip code, so this may be an indicator that the zip code attribute is not an appropriate attribute for comparison because it seems to be very sparse. So let's say that we now change our comparison rule and compare the names and phone numbers of people instead of the names and zip codes.
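A minimal sketch, in Python, of the simple pairwise ER algorithm described here; the record values and helper names are illustrative assumptions, not from the talk:

from itertools import combinations

# Toy records matching the R1-R4 example: R4 has a null zip code.
records = {
    "r1": {"name": "John", "zip": "54321", "phone": "123"},
    "r2": {"name": "John", "zip": "54321", "phone": "987"},
    "r3": {"name": "John", "zip": "11111", "phone": "987"},
    "r4": {"name": "Mary", "zip": None,    "phone": "555"},
}

def make_rule(attrs):
    # Conjunctive comparison rule: two records match iff they agree on every attribute.
    def rule(a, b):
        return all(a[k] is not None and a[k] == b[k] for k in attrs)
    return rule

def resolve(record_ids, rule):
    # Simple match-based ER: pairwise comparisons, then transitive clustering (union-find).
    parent = {rid: rid for rid in record_ids}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    comparisons = 0
    for a, b in combinations(list(record_ids), 2):
        comparisons += 1
        if rule(records[a], records[b]):
            parent[find(a)] = find(b)
    clusters = {}
    for rid in record_ids:
        clusters.setdefault(find(rid), set()).add(rid)
    return list(clusters.values()), comparisons

name_zip = make_rule(["name", "zip"])
result, n = resolve(records.keys(), name_zip)
print(result, n)   # only R1 and R2 merge; 6 pairwise comparisons in total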
So using the naïve approach, we again compare all of the records in a pairwise fashion, and in this case the only matching pair of records is R2 and R3 because they have the same name, John, and the same phone number, 987. After clustering the records that match each other, we get our new ER result where R2 and R3 are in the same cluster and R1 and R4 are in separate clusters. The total amount of work that we've done here is again six pairwise comparisons because we started from scratch. Yes please?
>>: [inaudible] comparisons, wouldn't the comparisons be a function of the key that you're using? Wouldn't the number of comparisons you're doing for each case be a function of which fields you are using, of which rule you are using? You're saying you are always doing pairwise, but it seems like you know what the rules are, so the number of comparisons would be a function of what the rule matches.
>> Stephen Whang: Right. Here I'm using a simple ER algorithm that just blindly does pairwise comparisons, but you can think of more advanced ER algorithms that reduce the number of record comparisons. Again, I'm illustrating using a simple ER algorithm that always performs pairwise comparisons among all of the records. That's why I'm saying it is six comparisons. If you use a different ER algorithm, then we can get a case where it does fewer than six comparisons, but just bear with me, because I'm not saying that this is the state-of-the-art ER algorithm at all. Now while six comparisons here may not seem like a big deal, if you're resolving hundreds of millions of records, then it is a big deal, so we would like to reduce this number as much as possible. Before I give you the answer of how to efficiently produce the name and phone number result, let me go through an easier example where I would like to perform rule evolution from name and zip code to name and zip code and phone number. In this case there is an interesting relationship between the two comparison rules. The second comparison rule is stricter than the first comparison rule: two records that do not match according to name and zip code will not match according to name, zip code, and phone number either. Therefore, starting from the first result, we don't have to compare records across different clusters because we know that they will never match according to the new comparison rule. Instead we only need to compare the records within the same cluster. So in this case we will only compare R1 and R2, and it turns out that R1 and R2 have different phone numbers, so they end up in different clusters. Just by performing one record comparison we've produced our new ER result for name and zip code and phone number, which is much more efficient than the six record comparisons the naïve approach would have performed. Yes?
>>: So this [inaudible] any other rule [inaudible] combinations?
>> Stephen Whang: Right. So I'm assuming that the comparison rule is in conjunctive normal form of arbitrary predicates in general.
>>: [inaudible] greater combinations of [inaudible] without the [inaudible].
>> Stephen Whang: Right. So later on I also have rule evolution techniques for distance-based clustering ER algorithms, where you are clustering records based on their relative distances, so I'll talk about that later on.
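When the new rule is strictly stricter than the old one, as in this easy case, the evolution amounts to re-resolving each old cluster on its own. A minimal sketch that reuses the hypothetical resolve() and make_rule() helpers from the earlier snippet:

def evolve_stricter(old_result, new_rule):
    # Records in different old clusters can never match under a stricter rule,
    # so we only re-resolve within each old cluster.
    new_clusters, comparisons = [], 0
    for cluster in old_result:
        sub_clusters, n = resolve(cluster, new_rule)
        new_clusters.extend(sub_clusters)
        comparisons += n
    return new_clusters, comparisons

name_zip_phone = make_rule(["name", "zip", "phone"])
new_result, n = evolve_stricter(result, name_zip_phone)
print(new_result, n)   # R1 and R2 split apart; only 1 comparison instead of 6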
So going back to our original problem, notice that you can't use the exact same technique to produce the name and phone number result, because there is no special strictness relationship between name and zip code and name and phone number. For example, although R2 and R3 are in different clusters in the first result, they might have the same name, and if they also had the same phone number, then they might end up in the same cluster in the name and phone number result. So we can't use that technique, and our solution for this problem is to use what we call materialization. In addition to producing the ER result for name and zip code, the idea is to produce an ER result for the name comparison rule and another result for the zip code comparison rule as well. Starting from this materialization, notice that we can exploit the name result to efficiently produce the name and phone number result using the exact same technique as in the previous slide. Again, there is the strictness relationship where name and phone number is stricter than name, so we only compare the records that are within the same cluster in the name result. Just by doing three pairwise record comparisons we can now produce the name and phone number result, which is much more efficient than the six record comparisons the naïve approach would have performed.
Here rule evolution appears to be very efficient, but an immediate question you can ask is, what are the costs of using this technique? There are time and space overheads for creating these materializations. For example, it seems like we are running entity resolution three times, which basically defeats the purpose of reducing runtime. In our paper, for the time overhead, we have various amortization techniques where the idea is to share expensive operations as much as possible when we are producing multiple ER results. For example, when you're initializing the records you have to read the records from disk, parse them, and do some more processing, and this only needs to be done once even if you are producing multiple ER results. We also have ER-algorithm-specific amortization techniques: one of the ER algorithms that we implemented in our work always sorted the records before doing any resolution, and the sorting was the bottleneck of the entire ER operation. In this case we only need to sort the records once, even when we are producing ER results multiple times. With all of these amortization techniques combined, it turns out that the time overhead is reasonable in practice, and I'll show you some experimental results later on. Now for the space overhead, notice that we need to store more partitions of the records, but the space complexity of storing partitions is linear in the number of records, and in addition we can use compression techniques where we simply store the record IDs in the partitions instead of copying the entire contents of the records over and over again. So again, in our paper we show that the space overhead of materialization is reasonable as well. So we have considered many different… Yes, Arvin?
>>: How do you know [inaudible]?
>> Stephen Whang: In general, we view the comparison rule as a conjunctive normal form and the strategy is to materialize on each of the conjuncts.
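A sketch of that materialization strategy on top of the earlier hypothetical helpers: resolve once per conjunct up front, and when the rule changes, evolve from a materialized result whose rule the new rule is stricter than (here simply the candidate with the smallest clusters). Again, this is an illustration, not the paper's implementation:

conjuncts = ["name", "zip", "phone"]
materialized = {c: resolve(records.keys(), make_rule([c]))[0] for c in conjuncts}

def evolve(new_attrs):
    # Any materialized conjunct of the new rule gives a dominating starting point;
    # pick the one whose largest cluster is smallest to minimize re-resolution work.
    base = min((materialized[c] for c in new_attrs if c in materialized),
               key=lambda clusters: max(len(cl) for cl in clusters))
    return evolve_stricter(base, make_rule(new_attrs))

new_result, n = evolve(["name", "phone"])
print(new_result, n)   # R2 and R3 merge; far fewer comparisons than the 6 of the naive approach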
Now this is the most general case, because we don't know how the new comparison rule is going to change, but you can actually improve this approach if you have some more information. For example, if some of the conjuncts always go together, then you don't have to materialize on each of those conjuncts separately. But in the most general approach we materialize on one conjunct at a time. Yes, please?
>>: [inaudible] for future use? So how does your technique compare against [inaudible]?
>> Stephen Whang: Yeah, there's a great deal of work on materialized views. This is in the same spirit as materialized views, but here I'm working on clusters of records, so this is an ER-specific technique. I'll just keep on explaining and hopefully this will become clearer. We considered many different ER algorithms in the literature and categorized them according to the desirable properties they satisfy. If an ER algorithm satisfies one or more of these properties, then we have efficient rule evolution techniques. The first property we identified is called rule monotonicity, and here you can see a number of ER algorithms in the literature that satisfy this property in the red circle. The second property is called context free, and you can see again a number of ER algorithms in the literature that satisfy this property in the blue circle. In addition, we used two other properties from the literature and further categorized ER algorithms according to the properties they satisfy, and depending on where you are in this Venn diagram, we have different rule evolution techniques that you can use. In this talk I will only focus on the purple area, which is the intersection of rule monotonicity and context free. I'll define these two properties and then illustrate our rule evolution techniques, but before that let me define two preliminary notions. Yes please?
>>: So a lot of the real systems would just couple the [inaudible], this is just blocking, pairwise comparison and [inaudible]. Does this characterization apply to the individual components of this pipeline, or does it apply to the system as a whole, which you describe as a single-stage algorithm?
>> Stephen Whang: Yes. Here I am assuming all of the steps to be one ER algorithm, so we assume that an ER algorithm receives a set of records and returns a partition of the records. This can include blocking and resolution and post-processing, so everything is included in this black-box ER algorithm.
>>: [inaudible] individual stages, even though the systems tend to be more piecemeal. They want to plug in some blockers and then they want to play with some clusterings.
>> Stephen Whang: Certainly. You can view blocking as a separate process from entity resolution, so it's kind of an orthogonal issue. In order to scale entity resolution you can use blocking separately, but when you're resolving each of the blocks, you can use the rule evolution techniques here. A comparison rule B1 is said to be stricter than B2 if every pair of records that matches according to B1 also matches according to B2. For example, name and ZIP code is stricter than name, because any two records that have the same name and ZIP code also have the same name. The second preliminary notion is domination. An ER result is dominated by another result if all of the records that are in the same cluster in the first result are also in the same cluster in the second result.
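Concretely, the two notions can be rendered as simple checks; in this hypothetical sketch, strictness is tested only for conjunctive equality rules (by attribute containment) and domination is tested directly on two partitions of the records:

def stricter(attrs1, attrs2):
    # For conjunctive equality rules, the rule over attrs1 is stricter than the
    # rule over attrs2 if it compares at least the attributes of attrs2.
    return set(attrs2) <= set(attrs1)

def dominated(result1, result2):
    # result1 is dominated by result2 if every pair of records that is
    # co-clustered in result1 is also co-clustered in result2.
    cluster_of = {rid: i for i, cl in enumerate(result2) for rid in cl}
    return all(len({cluster_of[rid] for rid in cl}) <= 1 for cl in result1)

print(stricter(["name", "zip"], ["name"]))        # True: name+zip is stricter than name
print(dominated([{"r1", "r2"}, {"r3"}, {"r4"}],   # True: the name+zip result is dominated
                [{"r1", "r2", "r3"}, {"r4"}]))    #       by the name result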
In this example, if you look at the first ER result, the only records that are in the same cluster are R1 and R2, and since these two records are in the same cluster in the second ER result as well, the first ER result is dominated by the second ER result. But the third ER result is not dominated by the second result, because although R3 and R4 are in the same cluster there, they are not in the same cluster in the second result. Using strictness and domination I can now define the rule monotonicity property, which says that if a comparison rule B1 is stricter than B2, then the ER result produced by B1 must always be dominated by the ER result produced by B2. So for example, let's say that we produce an ER result using the name and ZIP code comparison rule, and we produce another ER result using the name comparison rule. If the ER algorithm satisfies rule monotonicity, then it must be the case that the first ER result is dominated by the second ER result. The second property is called context free, and the formal definition is shown on the screen. Intuitively, it says that if we can divide all of the records into two sets such that we know that none of the records in these two different sets will ever end up in the same cluster, then we can resolve each set independently. For example, let's say that we have four records, R1 through R4, and let's say that R1 and R2 refer to children while R3 and R4 refer to adults. From the start we know that none of the children will ever end up in the same cluster as the adults, so we can divide the records into the following two sets. If the ER algorithm satisfies context free, then we are allowed to resolve each set independently. We can first resolve R1 and R2 together without caring what happens to R3 and R4 and produce clusters. Then we can resolve R3 and R4 without caring what happens to R1 and R2 and produce another set of clusters. The property guarantees that by simply collecting the resulting clusters, we are guaranteed to have a correct ER result. Notice that I have this cute diagram of a thick black wall in the middle which is basically stopping information from flowing from left to right or from right to left. Yes?
>>: [inaudible] B [inaudible]?
>> Stephen Whang: B is the rule. Yes. As an example where context free is violated, let's say that R3 is the father of R1 and R4 is the father of R2. Let's say that the ER algorithm first compares R1 and R2 and clusters them together, and compares R3 and R4 and puts them in separate clusters. Now the fact that R1 and R2 are the same child may be strong evidence that R3 and R4 are actually the same father, so the ER algorithm may change its mind and end up merging R3 with R4. That's an example where some of the information is flowing from left to right, which is a violation of the context free property. So using rule monotonicity and context free, I will illustrate a rule evolution algorithm that exploits these properties to do efficient evolution. Let's say that we are evolving from the comparison rule name, address, and ZIP code to name, address, and phone number. First of all I am going to materialize on all of the conjuncts. In this example I will produce an ER result for the name comparison rule, another result for the address comparison rule, and another one for the ZIP code comparison rule. I don't show the ZIP code result due to space constraints.
So during step one, the algorithm first identifies the common conjuncts of the old and new comparison rules. In this example the common conjuncts are the name and address comparison rules. It then performs what we call a Meet operation among the ER results of the common conjuncts. In this example we perform a Meet operation between the name result and the address result. Intuitively, we are taking the intersecting clusters, so in this Meet result you can see that R2 and R3 are in the same cluster; the reason is that these two records are in the same cluster in the name result and they are also in the same cluster in the address result. But R1 and R2 are not in the same cluster in the Meet result because although they are in the same cluster in the name result, they are not in the same cluster in the address result. Now using rule monotonicity we can prove that this Meet result always dominates the final ER result, which means that none of the records in different clusters of the Meet result are ever going to end up in the same cluster in the final result. During step two the algorithm exploits the context free property, and since it knows that none of the records can match across different clusters, it resolves each cluster independently. The first cluster only contains R1, so we return R1 as a singleton cluster. For the second cluster we compare R2 and R3, and let's say that it turns out that R2 and R3 refer to different entities, so they are placed in different clusters. Finally, the third cluster only contains R4, so we return R4 as a singleton cluster. Simply by collecting the final clusters we are guaranteed to have arrived at a correct ER result. The amount of work we have done here is only one record comparison, which is much better than the six record comparisons that would have been performed by the naïve approach. So in summary, I defined the rule monotonicity and context free properties and I've illustrated an efficient rule evolution algorithm that works well for this purple area. Again, depending on where you are in this Venn diagram, we have different rule evolution techniques that work for those regions as well. Yes?
>>: [inaudible] example you actually rely on the [inaudible] name and address so suppose [inaudible] found like name [inaudible] is it possible that you [inaudible] the constraints [inaudible] each other columns actually was changed [inaudible]?
>> Stephen Whang: I'm assuming that the comparison rule is in conjunctive normal form of arbitrary predicates, so within a predicate you can use whatever similarity function you like. Here I was just showing you an easy example where we only perform equality comparisons.
>>: I'm not saying that the approach [inaudible] obvious [inaudible] predicate. Like when you have such conjuncts are like you realize [inaudible] predicate [inaudible] name [inaudible] each predicate [inaudible].
>> Stephen Whang: In this model, if you are doing a similarity comparison between names with a high threshold, and then you have another predicate that uses a lower threshold, we consider these two predicates to be totally different. But you can do more optimizations by taking into account that the two predicates are actually doing the same string comparison, just with different thresholds. Yes?
>>: What if [inaudible] threshold?
>> Stephen Whang: It depends on the ER algorithm.
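The two steps can be sketched on top of the earlier hypothetical pieces (resolve(), make_rule(), and the materialized per-conjunct results); this only illustrates the meet-then-resolve-independently idea, not the paper's implementation:

def meet(result_a, result_b):
    # Intersect two partitions: records stay together only if they are together
    # in both results.
    index_b = {rid: i for i, cl in enumerate(result_b) for rid in cl}
    out = {}
    for i, cl in enumerate(result_a):
        for rid in cl:
            out.setdefault((i, index_b[rid]), set()).add(rid)
    return list(out.values())

def evolve_with_meet(old_attrs, new_attrs):
    common = [c for c in new_attrs if c in old_attrs]
    base = materialized[common[0]]          # assumes at least one common conjunct
    for c in common[1:]:
        base = meet(base, materialized[c])  # step 1: a partition that dominates the final result
    new_rule = make_rule(new_attrs)
    clusters, comparisons = [], 0
    for cl in base:                         # step 2: resolve each meet cluster independently
        sub, n = resolve(cl, new_rule)
        clusters.extend(sub)
        comparisons += n
    return clusters, comparisons

print(meet([{"r1", "r2", "r3"}, {"r4"}],
           [{"r2", "r3"}, {"r1"}, {"r4"}]))                   # R2 and R3 stay together; R1 and R4 are alone
print(evolve_with_meet(["name", "zip"], ["name", "phone"]))   # R2 and R3 merge after 3 comparisons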
If you are using, for example, single-link hierarchical clustering and you relax the threshold, which means you reduce the threshold, then all you have to do is build on top of the previous ER result. If you go the other way around, then this approach does not work. Yes?
>>: So you [inaudible] changing the rules, right? What if the data changes? Will the [inaudible] still work?
>> Stephen Whang: That is a very important problem, and this work is mainly focused on when the comparison rule itself changes. It is an orthogonal problem, and in the future I would like to do research in both directions, but this work does not consider the case where you have a new stream of records that you would like to add to your ER result. So far I have been talking about rule evolution techniques for what I've been calling match-based clustering algorithms. In our paper we also have rule evolution techniques for what we call distance-based clustering algorithms, where records are clustered based on their relative distances. I won't have enough time to go through all of the details, but to give you the high-level idea, let's say that we have four records, R, S, T, and U, and let's say that the record S is close to R because S is within the dotted circle around R, while the records T and U are far away from R because they are outside of that dotted circle. Now, whenever the distance function changes, all of these pairwise distances are going to change as well. Although we don't know exactly how those distances will change, let's say we at least know the lower and upper bounds of those distances, shown as the red brackets. So, for example, the record S may suddenly become closer to R or it may end up further away from R, but we know that S will still be inside that dotted circle, so S will still be close to R. For record U, although the distance from R is going to change, we still know that U is going to be outside that dotted circle. The interesting case is record T: T may end up within the dotted circle or outside it. In that case we are going to be optimistic and cluster R and T together, but later on we revisit this cluster and check whether R and T are indeed close to each other; if not, we split T out of R's cluster. I'm glossing over a lot of details, but that is the high-level idea for rule evolution for distance-based clustering algorithms.
The following slide shows all the data sets I've been using for my ER works. The first data set contains shopping items provided by Yahoo Shopping, where items were collected from different online shopping malls; each record represents an item and contains attributes including the title, price, and category. Since these records are collected from different data sources, there are naturally many duplicates that need to be resolved. The second data set is a hotel data set provided by Yahoo Travel, where records were collected from different travel sites like Orbitz and Expedia, and again different records may refer to the same hotel. While there aren't too many records here, there are many attributes per record, and the data is very rich. Finally, we have a person data set provided by Spock.com, where people records are collected from social networks like Facebook and MySpace, blogs, and Wikipedia. We have a lot of records here, and each record contains the name, gender, state, school, and job information about a person.
Among these three data sets, for my evolving rules work I only performed experiments on the shopping and hotel data sets, and in this talk I will only show you experimental results using the shopping data set. Yes, Ravi?
>>: [inaudible] data sets [inaudible] how expensive is it [inaudible]?
>> Stephen Whang: It's extremely expensive. You don't want to run the naïve approach.
>>: [inaudible].
>> Stephen Whang: So the naïve approach is defined as starting from scratch. Your question is how long it takes to resolve hundreds of millions of records. I was interacting with Spock.com and I think it takes on the order of hours, or maybe days, but I'm not sure about the exact number because it's kind of a secret that they have, so they weren't revealing all of the numbers to me.
>>: [inaudible] to do this or?
>> Stephen Whang: Not that I know of. Not Spock.com. This company recently got acquired by a company called Intelius.com, so since then they may have changed their strategy, but when I was interacting with Spock.com they were just using a regular DBMS, like MySQL.
>>: [inaudible] the gold standard sets for each of these?
>> Stephen Whang: That's a good question. For the hotel data set we had a gold standard which was provided by a Yahoo employee that we knew, but for the other two data sets we don't have a gold standard. It's very hard to generate a gold standard for these data sets. I'm going to show you two representative experimental results of my work, and both of these are runtime experiments. The reason I'm not going to show you any accuracy experiments is that our techniques are guaranteed to return 100% accurate ER results all the time, so we're not trading off accuracy for scalability; we are only improving the scalability of rule evolution, and that's a feature of our work. This plot shows the results of running our techniques on 3,000 shopping records, and the X axis shows the strictness of the common comparison rule. We tested many different comparison rules, and in some of these experiments we evolved from title and category to title and price. In that example, the common comparison rule is the title comparison rule, and for this rule we extract the titles from two different records, compute the string similarity of the two titles, and then compare this value against a threshold. If we increase the threshold, then the title comparison rule becomes stricter in the sense that fewer records match according to their titles, and in that case we move to the left side of the X axis. On the other hand, if we decrease the threshold, then the title comparison rule becomes less strict in the sense that more records match with each other, and in that case we move to the right-hand side of the X axis. So just think of the X axis as representing different application semantics. The Y axis shows the runtime improvement of rule evolution versus the naïve approach. Here I did not include the time overhead for the materialization, but in the next plot I will compare the total runtime of rule evolution, including the time overhead, with that of the naïve approach. I implemented four different ER algorithms from the literature, and you can see that in the best case we get over three orders of magnitude of improvement, while even in the worst case we still improve on the naïve approach. Yes please?
>>: [inaudible].
>> Stephen Whang: The naïve approach is to simply run ER from scratch without exploiting the materialization: given a set of records, simply run your ER algorithm using the new comparison rule.
>>: [inaudible].
>> Stephen Whang: Oh, here? So these are the four different ER algorithms. The red plot is the sorted neighborhood technique, where we sort the records and then use a sliding window and compare the records within the same window. These are all different techniques. HCB means we merge records that match with each other in a greedy fashion, and HCBR is similar but has some desirable properties that guarantee that the ER result is going to be unique. Finally, HCDS is a distance-based clustering algorithm; it's simply the single-link clustering algorithm.
>>: Have you tried other clustering algorithms [inaudible] K-Means [inaudible]?
>> Stephen Whang: K-Means did not satisfy any of the properties, so it was not a good algorithm to demonstrate our techniques. When you're comparing the total runtimes, here is what you should expect. Let's say that for rule evolution you perform materialization once and then you perform rule evolution one or more times. The X axis here shows the number of evolutions and the Y axis shows the total runtime. Initially, for rule evolution we're paying an overhead for the materialization, so rule evolution is slower than the naïve approach, but as we perform more rule evolutions, the incremental cost paid by rule evolution is smaller than the cost of the naïve approach, where we simply run ER from scratch. So at some point you can imagine rule evolution is going to outperform the naïve approach. Now the following slide shows the result of one scenario we've considered. The X axis shows the number of shopping records that were resolved and the Y axis shows the total runtime in hours on a log scale. I'm only going to explain the top two plots, which are colored. The red plot shows the total runtime of the naïve approach, while the blue plot shows the total runtime when we use rule evolution with the exact same ER algorithm. Here we performed materialization once followed by one rule evolution. You can see that the blue plot still improves on the red plot, which means that in this case rule evolution was saving enough time to compensate for the time overhead of materialization. Although this gap seems very small, the unit of the Y axis is hours, so the runtime improvement is actually significant.
>>: It's also a log scale.
>> Stephen Whang: Yes, it's also a log scale, which makes the improvement even more significant. Now the important thing to understand here is that this result is only for one scenario that we've considered. We can think of many other scenarios where you could get results that are much better than these or much worse. For example, if you're performing multiple evolutions instead of just this one, then you're probably going to have much better runtime improvements than those shown in this plot. On the other hand, you can think of pathological cases as well. For example, if the old and new comparison rules are totally different, then there is no point in performing rule evolution because there is nothing to exploit, so in that case simply running the naïve approach is the most efficient technique.
If you run rule evolution here you are probably slower than the naïve approach because you still have to pay the time overhead for materialization. The takeaway of this plot is that we now have a general framework for rule evolution that you can use to evaluate whether rule evolution makes sense for your application or not. So in conclusion, we've proposed a rule evolution framework that exploits ER properties and materialization, and we have shown that rule evolution can significantly improve on the runtime of the naïve approach. That was the first part of my talk, and the second part will be much shorter. Yes?
>>: Can you talk about rules that are linear to the [inaudible] combinations?
>> Stephen Whang: Linear combinations. I think that's the case of distance-based clustering. An example distance function adds the name similarity to the address similarity, and you can do a weighted sum, so at the end you get some distance value. Our rule evolution techniques for distance-based clustering algorithms work for your case.
>>: And in non-metric cases, where there is a nonlinear sum or the distance is not in a metric space?
>> Stephen Whang: As long as you return a distance, it's fine. You don't have to satisfy the triangle inequality. The only assumption we make is that you have to have an idea of how much each distance can change in the next evolution, so we have to have some information that, for example, each distance is only going to change by at most 10% or by some constant amount like 5. As long as we have that information you can use our techniques. Yes?
>>: So [inaudible]?
>> Stephen Whang: Right. In this work we assume conjunctive normal forms, so if you only have a disjunction, everything is going to be considered one predicate.
>>: [inaudible] junction [inaudible]?
>> Stephen Whang: Okay. We assume that you can convert this into a conjunctive normal form, and then the idea is to materialize on each of the conjuncts. Yes please?
>>: [inaudible]?
>> Stephen Whang: I mentioned that briefly: we're only saving multiple partitions of the records, so the space complexity is linear in the number of records, and you can use compression techniques where you save a partition of record IDs instead of a partition of the actual records. This saves a lot of space, and it turns out that the space overhead is reasonable in practice. It doesn't…
>>: [inaudible].
>> Stephen Whang: It's not more than 10%.
>>: [inaudible] lose your advantage [inaudible] normalizing [inaudible]?
>> Stephen Whang: Yeah. That's an issue. That's a concern that we have.
>>: [inaudible] lose already [inaudible] assumptions [inaudible]?
>> Stephen Whang: At least for the experiments that we performed, a lot of these comparison rules were just conjunctions of predicates, so we believe it is reasonable to assume that you have a conjunction of predicates. The complexity of converting a DNF expression to a CNF expression is exponential in the worst case, but once you do that, you can use our rule evolution techniques. Yes, please?
>>: So your speedup is a function of the sequence of rules that you have, and currently what do you find in terms of these rule sequences that you see people using [inaudible] resolution?
>> Stephen Whang: My understanding is, I don't have the--again, I interacted with Spock.com but they didn't give me all of the details that they have.
But my impression is that some of these predicates are always used, so you always compare the names and addresses of people, and afterwards you may compare the zip codes or phone numbers or some other attribute like gender, for example. In that case you can actually improve this approach: you can materialize on the name and address combination, and if you have a zip code then you may or may not materialize on the zip code, so this is a very application-specific optimization. You have to see which conjuncts are used together and which conjuncts are likely to change or not, and my understanding is that certain conjuncts never really change and it's only the tail that is changing, but I can't confirm that.
>>: A follow-up question is how many iterations would someone typically go through in trying to resolve [inaudible] resolution system?
>> Stephen Whang: When I was visiting Spock.com, every time I made a visit they were discussing how to improve their comparison rules, so this always happens. I can't give a definite number, but it's always a continuous improvement; it's very rare that you have complete information about your data and your applications. In the real world, while you're constructing your system, you're getting a deeper understanding of your application and you make improvements to your comparison rule, and then it's a back-and-forth thing: you evaluate the rule, you realize you missed something, you improve your comparison rule, and this repeats, so it's pretty frequent. Yes, please?
>>: [inaudible] about the scalability of the naïve [inaudible] something about the naïve, for example, if you look at your ER techniques, you can say there are two parts. There is pairwise matching and taking things I [inaudible] and then there are other clusterings [inaudible] both of these if I'm not mistaken [inaudible] so [inaudible] against a well-designed ER as a baseline or [inaudible] what [inaudible] used for [inaudible] system then how much you can use [inaudible]?
>> Stephen Whang: For my work I only considered the algorithms in this Venn diagram. I know these labels are very cryptic, but this is the sorted neighborhood technique, and the HC's are all hierarchical clustering. ME is the technique proposed by Monge and Elkan, where you have a queue and you only compare each new record with the clusters in the queue. [inaudible] was mentioning K-Means, but we didn't consider K-Means. We didn't consider your case where you have this very scalable way of decoupling the identification of candidates from the clustering. These are the algorithms we used in the paper. It would be very interesting to see whether your algorithm actually satisfies any of these properties. Yes?
>>: How do you think [inaudible] experiments [inaudible]?
>> Stephen Whang: It was just hand-picked. The comparison rule that I used contained two predicates and I was changing one of them, so I wasn't doing an entire survey of rule evolution; I was just trying to demonstrate how rule evolution saves time and compensates for the materialization overhead. But I agree that there is a lot of future work that can be done from here; basically, there are a lot of unanswered questions.
I'll move on to the second part of my talk, where I will mainly talk about a work called managing information leakage and then briefly mention more recent work on disinformation techniques for ER. At the beginning of my talk I motivated the data privacy problem by explaining how insurance companies are sifting through the web and collecting and analyzing your personal information to predict your lifespan. Here is another interesting example for my work. Quora.com is a Q&A site where people can ask questions and other people can answer them. Not too long ago there was a person who posted a question basically challenging anyone out there to find all of his information on the web. If you look at this question you can see that there is no clue about the identity of the person who asked it. You can't find his name; there's no ID of the person who posted the question, so just by looking at the text you can't get any clue. Now it turns out that once you post a question on Quora.com you become the only follower of your own question, so there was this anonymous user who came along and was lucky enough to discover the only follower of this question. He clicked on the profile, got the name, and started searching through various social networks, blogs, and homepages to extract all the information about this guy, and the anonymous user posted 26 bullet points of personal information about this person. One bullet point says that you are 24 and will be turning 25 very soon. Another bullet point says your mother's maiden name starts with the letter P. Another says you can program in C++, another says that you are related to some incident in LA back in 2009, and finally another one says that you are a Democrat. So Joseph, who turns out to be the person who posted this question, was very impressed and actually categorized the 26 bullet points as follows. Some of them were categorized as correct and downright scary. Apparently Joseph did not anticipate that someone would figure out his exact age. If someone figured out my mother's maiden name, I would be freaked out as well. Some of the bullet points were categorized as deliberately public, so the claim that Joseph can program in C++ was probably information that could be found in his online resume. Interestingly, there were six incorrect bullet points. The claim that Joseph is somehow related to that incident back in LA in 2009 turns out to be false information. Joseph had a Facebook profile image that was related to that incident, but it turns out that this image was a deliberate obfuscation that Joseph made in order to confuse any adversary that wanted to find his information. Also, the claim that Joseph is a Democrat turns out to be wrong as well. In this case the anonymous user confused this Joseph with some other Joseph with the same name, and since this other Joseph was clearly a Democrat, the anonymous user thought that the first Joseph was a Democrat as well. But here Joseph clarifies that no party can handle his idiosyncrasies. After this experiment Joseph concludes as follows, in bold font: it seems to be far more effective to allow incorrect information to propagate than to try to stem the tide. So my research in data privacy is twofold.
First, I'd like to quantify just how much of your personal information is being leaked to the public, and second, I'd like to propose management techniques for information leakage. The interesting connection between this work and my previous works on entity resolution is that I'm assuming that the adversary is performing entity resolution to piece together your information. In this example the anonymous user was piecing together Joseph's information to learn more about him. So I'm basically using the ER models that I developed in my previous works to simulate the adversary and then quantify the information leakage. That is the interesting connection between this work and my previous ER works. To formalize this problem a little further, I'm going to assume that all of the pieces of information on the web are simply records in a database, shown as the blue boxes. And I'm going to assume that there is a notion of the full information of Joseph, shown at the top of the screen. Again, the adversary is going to perform entity resolution and piece together the relevant records that refer to Joseph, and intuitively the information leakage is defined as the similarity of this blob of information to the complete information. I won't have enough time to go through all of the details of my measure, but here are the key features. First, we incorporate entity resolution. Second, we don't assume that privacy is an all-or-nothing concept. If you look at many previous privacy works, especially in data publishing, the assumption is that you somehow have complete control of your data before you release it, so once you make it perfectly private you can just give it to someone else. For example, a hospital may be trying to release a set of medical records of its patients to the public for research reasons. There are lots of anonymization techniques that you can use to make it very hard to figure out which patient has which disease, and once you've made the data entirely anonymous, you are allowed to publish this information to the public. In comparison, our work makes a fundamentally different assumption: we assume that some of our information is already public. Whenever you want to interact with your friends through Facebook, you have to give out some of your personal information, like your birth date. If you want to buy a product from Amazon.com, you have to give out your credit card information. So in order to go about our everyday lives, we are continuously exposing ourselves to the public. Our view of privacy is that there is no such thing as perfect privacy, that privacy should be measured on a continuum from 0 to 1, and that we are just trying to be as private as possible. In addition, our information leakage measure incorporates the uncertainty that the adversary has about his data and the correctness of the data itself. Once you release your information to the public, we assume that it's very hard to delete it: even if you attempt to remove some information, other people may have made copies of it and companies may have backup files of it, so if you delete a photo from Facebook, who knows if that photo is still floating around the web. In that setting, the only way to reduce information leakage is to add what we call disinformation records, shown in red, where the idea is to dilute information by producing records that are realistic but incorrect.
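As an illustration of this kind of leakage score (a simplified stand-in, not the exact measure from the paper), one can treat the adversary's view as the set of records its entity resolution linked to the target and score how much of the target's full information those records recover; a merged-in disinformation record then drags the score down. All record contents here are made up:

from collections import Counter

full_info = {"name": "Joseph", "age": "24", "party": "none", "language": "C++"}

adversary_cluster = [                         # records the adversary's ER linked to Joseph
    {"name": "Joseph", "age": "24", "language": "C++"},
    {"name": "Joseph", "age": "24", "party": "none"},
    {"name": "Joseph", "party": "Democrat"},  # linkage disinformation: records of
    {"name": "Joseph", "party": "Democrat"},  # a different Joseph merged in by mistake
]

def leakage(full, cluster):
    # Fraction of the true attribute values the adversary recovers by majority
    # vote over the linked records: a value in [0, 1], not all-or-nothing.
    recovered = 0
    for attr, true_value in full.items():
        votes = Counter(r[attr] for r in cluster if attr in r)
        if votes and votes.most_common(1)[0][0] == true_value:
            recovered += 1
    return recovered / len(full)

print(leakage(full_info, adversary_cluster))  # 0.75: the merged records hide the true party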
Now disinformation is not a technique that we're proposing as a new idea; this strategy has been used since the dawn of mankind. During World War II the Allied forces generated a great deal of disinformation to trick the Germans into thinking that they would land at Calais, when they actually landed at Normandy, and this was a major turning point of the war. So we're adapting this ancient strategy to the realm of information management. In our paper we propose two types of disinformation records. The first type is called self disinformation, where the disinformation record snaps onto one of the correct records and lowers the information leakage by itself. The second type is called linkage disinformation, where the disinformation record connects an existing record that is not correct to one of the correct records, and the contents of the existing record, YYY, are used to lower the information leakage. These two strategies turn out to be very effective in practice, and we already saw examples of both of them in our Quora.com example. We saw that the anonymous user mistakenly thought that Joseph was related to some incident in LA back in 2009, and that turned out to be incorrect because Joseph's Facebook profile image was a deliberate obfuscation. That's an example of self disinformation. The anonymous user also mistakenly thought that some other Joseph was the same person as this Joseph and therefore thought that the original Joseph was a Democrat. That's an example of linkage disinformation, and in our paper we discuss these two strategies in more detail. In a more recent paper that I submitted to VLDB, I focus on the linkage disinformation problem in more depth. Here I assume that the adversary is running entity resolution to cluster records that refer to the same entity. Once we have that, I assume that for each pair of clusters there is a way to generate disinformation that will trick the adversary into thinking that these two clusters refer to the same entity and therefore should be merged with each other. The cost of generating that disinformation is what we call the merge cost, and the costs are written in black numbers for each pair of clusters. Now for each cluster that we successfully merge with our target cluster, which is Joseph's cluster in this example, we assume that there is a benefit that we obtain, because we are diluting the information of Joseph, which is a good thing. For each of the clusters I indicate the benefit in red numbers. Given this setting, we can now formulate an optimization problem where the goal is to merge clusters in a pairwise fashion, starting from Joseph's cluster, so that we obtain the maximum benefit while paying a total cost that is within a budget B. In this example, if the budget B is three, then the optimal disinformation plan turns out to be the green tree where we merge the left cluster with the top cluster and simultaneously merge the left cluster with the bottom cluster; in that case the total merge cost is 1 + 2, which is exactly 3, and the total benefit we attain is 1 + 1, which is 2. Now I won't have enough time to go into all of the details of our algorithms, but it turns out that this problem is strongly NP-hard, which means that you can't come up with a pseudo-polynomial algorithm, and there are no approximation algorithms either.
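As a rough illustration of this optimization (not the paper's algorithm), consider the restricted case where candidate clusters can only be merged directly with the target cluster; choosing which ones to merge then looks like a 0/1 knapsack selection over integer merge costs and benefits. A minimal pseudo-polynomial sketch under that assumption:

def best_disinformation_plan(costs, benefits, budget):
    # 0/1 knapsack DP over integer merge costs; returns (best total benefit,
    # indices of the clusters to merge with the target).
    n = len(costs)
    best = [[0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for spent in range(budget + 1):
            best[i][spent] = best[i - 1][spent]             # skip cluster i-1
            if costs[i - 1] <= spent:                       # or pay to merge it
                with_i = best[i - 1][spent - costs[i - 1]] + benefits[i - 1]
                best[i][spent] = max(best[i][spent], with_i)
    chosen, spent = [], budget                              # recover the chosen clusters
    for i in range(n, 0, -1):
        if best[i][spent] != best[i - 1][spent]:
            chosen.append(i - 1)
            spent -= costs[i - 1]
    return best[n][budget], chosen

# The slide's toy numbers: two candidate clusters with merge costs 1 and 2,
# each worth benefit 1, and budget B = 3.
print(best_disinformation_plan([1, 2], [1, 1], 3))   # (2, [1, 0]): merge both clusters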
So in the most general case we only have heuristics. However, if we restrict the height of the plans to be at most one, where we can only merge clusters with the target cluster directly, then this problem becomes weakly NP-hard and we do have an exact pseudo-polynomial algorithm and a 2-approximation algorithm. I'll be happy to talk about more details if you are interested. Yes, please?
>>: [inaudible] more traditional metrics [audio begins] why do you not come up with [inaudible]?
>> Stephen Whang: Here I am talking about the disinformation problem; I think your question refers to the measure of information leakage. Compared to K-anonymity: the K-anonymity measure is a zero-or-one measure where you are either completely safe or you are not safe at all; your data is considered safe or not safe. It is a black-and-white thing. In comparison, with an information leakage measure we can quantify more fine-grained notions of privacy. So that was about the measure, but this slide is more about how to generate disinformation records that can maximize the benefit for the target.
>>: [inaudible].
>> Stephen Whang: Oh, this is the budget we can use for generating the disinformation records. There is a cost for creating a brand-new record; you can't simply generate whatever you want, so this is a way to limit the total amount of disinformation records that you can produce to confuse the adversary.
>>: I have more of a problem with the benefit, how you quantify the benefit.
>> Stephen Whang: This is one way to do it. There are many ways to model this problem.
>>: How did you come up with the numbers for the benefit?
>> Stephen Whang: One approach is to simply define the benefit as the number of records within the cluster. But that is not the only way you can do it. You can, for example, compute the benefit using an arbitrary function: after you merge the clusters, you can apply a general benefit function to all of the records that have been mistakenly merged. But in this problem we assume that you can add these benefits by summing them together, and this makes the problem easier to solve analytically, so we're not claiming that this is the only way you can define benefit. This is actually an open problem that has to be studied more. Yes?
>>: Is there a risk that somebody might take this and use it in a different way to smear you?
>> Stephen Whang: Yes. These disinformation techniques can actually be very harmful for companies that are doing data analytics on big data, so in that case, if you are trying to extract profiles from social media data, then…
>>: I was thinking about things like Obama wasn't really born in Hawaii; he was born in Kenya [laughter].
>> Stephen Whang: Yes, that's a kind of misinformation. Here I am focusing on generating data that can confuse the adversary into thinking that different records refer to the same person or object, so it's along the same lines as producing misinformation about President Obama. Yes?
>>: [inaudible] have some notion I know general unknown [inaudible] where if you don't know [inaudible] the picture on the way then [inaudible].
>> Stephen Whang: This is very case-by-case.
In general, you cannot assume that you know where all of your information is out in the public, but there are certain applications where you do know where your information is. A good example that I always use: let's say you are the camera company Nikon. You have your latest and greatest product and you don't want to leak the specs of your camera. There is a site called NikonRumors.com which is dedicated to spreading rumors about the next camera produced by Nikon, and people actually post rumors and information that they think is correct; for example, someone might guess that the next Nikon product is going to have 40 megapixels and a certain shutter speed, and so on. You can also add a confidence value which reflects how confident you are in this information. There aren't too many rumor sites, so in this case Nikon can pretty much figure out where all of the rumors about its cameras are coming from, and you might only be interested in these rumor sites anyway. So the bottom line is, I'm not claiming that you have perfect information about all of the data, but for certain real-world applications you do have a good idea of where your information is located. I hope that answered your question. >>: [inaudible] it seems like this strategy, when you get right down to it, is really for situations where other people are disseminating information about you. The kinds of information you would disseminate about yourself in places like Facebook and Twitter, if you are putting it out there in the first place, the assumption is that you want people to know it. So would you say that it's correct to characterize the utility of this as really being about [inaudible] people trying to disseminate information about you that you don't want disseminated? >> Stephen Whang: I think you're getting at the issue of reputation, so you don't want to post information that gives you a bad name. >>: [inaudible] something about yourself that you just don't want people to know about you. That is the camera example, right? As a company, you don't want someone else to put the specs of your latest camera out there, so you are going to go on this disinformation campaign to spam what people think they know. >> Stephen Whang: Exactly. >>: That probably doesn't apply so much to the things that you put out there yourself, because the assumption is if you put it out there, you put it out there because you want people to know. Is that a fair assumption to make? >> Stephen Whang: So here the problem is focused on trying to hide certain information and lower information leakage; it's not about trying to reveal information. There is an interesting startup called Reputation.com which tries to solve the reverse problem, where they want to promote some of your positive information to the public, so they have these interesting web spam techniques to make sure that some of your information appears in the top search results of Bing. Yeah? >>: I just wanted you to keep an eye on the time. >> Stephen Whang: Yeah, I will just move on. These are all interesting questions and this is kind of an open problem.
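(Going back to the height-one restriction mentioned earlier: with integer merge costs and additive benefits, that restricted problem has the shape of a 0/1 knapsack, so a pseudo-polynomial dynamic program is one natural way to solve it exactly. The sketch below is only an illustration under those assumptions; the function name and the toy numbers are made up for this example, and it is not the algorithm from the paper.)

```python
def max_benefit_height_one(costs, benefits, budget):
    """Pseudo-polynomial DP for height-one disinformation plans, assuming
    integer merge costs and additive benefits (a 0/1 knapsack formulation).
    dp[b] = best total benefit achievable with total merge cost at most b."""
    dp = [0] * (budget + 1)
    for cost, benefit in zip(costs, benefits):
        # Iterate budgets downwards so each candidate cluster is merged at most once.
        for b in range(budget, cost - 1, -1):
            dp[b] = max(dp[b], dp[b - cost] + benefit)
    return dp[budget]

# The toy instance again: merge costs 1 and 2, benefits 1 and 1, budget B = 3.
print(max_benefit_height_one([1, 2], [1, 1], budget=3))  # -> 2
```

The running time is O(n * B), i.e., polynomial in the numeric value of the budget rather than in its bit length, which is exactly what "pseudo-polynomial" means here.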
So in summary, I've proposed a new measure for quantifying information leakage, and I proposed disinformation techniques that can be used to reduce information leakage and thus manage it. That was the second part of my talk. For related work, there has been a lot of previous work on entity resolution and privacy, and instead of listing all of the works on one screen I thought it would be a good idea to just talk about the high-level ideas. There has been a lot of work on entity resolution that focuses on accuracy and relatively little that focuses on scalability. MSR has been a pioneer in entity resolution research; I am well aware of the data cleaning project by the DMX group and I regularly cite papers from [inaudible], [inaudible], and [inaudible]. My work has mostly focused on the scalability aspect of ER, and has also proposed new functionalities for ER; for example, my rule evolution work solves a new problem for ER where I want to update an ER result incrementally when the comparison rule changes frequently. For privacy, there has been a lot of work on data publishing in the past, and again, most of these works assume that you somehow have complete control of your data and that there are anonymization techniques you can use to make the data set private before you release it to the public. In comparison, we assume that there is no such thing as perfect privacy and we're just trying to be as private as possible. There have been a lot of privacy measures as well. Again, just to address [inaudible]'s comment, most of the measures assume that privacy is a zero-or-one concept; in comparison, our techniques assume that privacy can be anywhere between zero and one, and this kind of flexibility enables us to capture more fine-grained notions of privacy. I am also aware that MSR has produced a state-of-the-art privacy measure called differential privacy, from the Silicon Valley research lab.
For future work I am interested in many ideas. A direct extension of my work is to study data analytics in a more distributed setting. You might end up performing data analytics on many machines, either because you simply have too much data to run on a single machine, or because you have privacy constraints where companies are not willing to share their information, so you're forced to run analytics from separate nodes. The issue here is to exploit parallelism as much as possible while also performing analytics in an accurate fashion. I'm also interested in social networks: nowadays you find a lot of graphs about people, and it's very important to analyze this information and identify interesting trends among people. A few months ago I got an e-mail from LinkedIn, the professional social network, telling me that they are solving a problem where they would like to resolve user profiles, user skills, and companies so that they can connect users with certain skills to companies that want people with those skills. In order to do this mapping correctly, you really need to resolve the people, the skill sets, and the companies correctly, so you can immediately see that there are entity resolution problems here. Currently I'm working on a fascinating topic called crowdsourcing, where the idea is to use the wisdom of the crowd to solve problems that are hard to solve using computer algorithms alone. For example, you might want to resolve a set of photos where the goal is to figure out which photos refer to the same person.
Now if I show you two photos of [inaudible] you can probably immediately see that the two photos refer to the same person, but if you try to do this with a computer algorithm it's going to be very challenging, because it involves sophisticated image analysis, and even with those algorithms the computer is probably not going to do a good job. So the challenge is to ask humans just the right questions and use those answers to do the right clustering. Now humans can be very slow, expensive, and error-prone, and they make mistakes both intentionally and unintentionally, so a significant challenge is to incorporate this erroneous behavior of humans when doing crowdsourcing. In conclusion, data analytics is a critical problem. I have worked on two problems within analytics, data integration and data privacy, and I've mentioned that there is an interesting connection between these two topics: the better you are at integration, the worse you are at privacy, and vice versa. So thanks for listening to my talk and I will open up for questions. [applause]. >>: You got all of the questions along the way. >> Stephen Whang: Okay. Thank you. [applause].