>> K. Shriraghav: Hello. It's my pleasure to welcome Stephen Whang from Stanford University.
His thesis is on entity resolution and, more broadly, data analytics, and he's going to be talking to us
about it.
>> Stephen Whang: Thanks Shriraghav for the introduction. So I'm Stephen Whang and I'm a
PhD student from Stanford University. Today I'll be talking about my thesis work on data analytics, integration and privacy. So nowadays the amount of data in the world is exploding
and companies are just scrambling to make sense out of all of this information. There are lots
of recent articles talking about analyzing large amounts of data and here is one cover article
from the Economist. You can see that there are lots of zeros and ones raining from
the sky and there is this gentleman in the middle trying to gather some of the rain and he's
watering this flower which signifies useful information that is being extracted from all of this
information. So within data analytics I have been working on two problems, data integration
and data privacy. Even before you start analyzing data, it's very important to combine data
from different sources and so data integration plays an extremely important role in data
analytics. For example, a few years ago we had this devastating earthquake in Haiti where a lot
of people died and people around the world came as volunteers to help out the Haitians. The
following figure shows different types of data that were integrated and analyzed in order to help
out our Haitian friends. So at first there were SMS broadcasts and then people were texting
each other and then the local media was generating a lot of data about what was happening
after the earthquake and finally there weren't any official maps in Haiti for navigation so people
geotagged their own locations and crowdsourced the maps, and these were used by volunteers to drive around Haiti. So you can see that data integration played a very important
role in helping out the Haitians. Now MSR has been a pioneer in data integration so I really
don't need to motivate this problem for this audience. The second topic I worked on is data
privacy. The following image is from a recent Wall Street Journal article and it is depicting how
insurance companies nowadays are sifting through the web and collecting personal information
and analyzing it in order to predict the lifespans of their customers. So let's say that we have
Sarah on the top here in green color and let's say that she has a lot of good health indicators. For instance, she only has a one mile commute
distance and let's say she does a lot of exercise and she reads a lot of travel magazines and she
does not watch much television. On the other hand, let's say that we have Beth on the bottom
in the red color and this time let's say that she has a lot of bad health indicators. For example,
here she has a long commute distance of 45 miles; she buys a lot of fast food, and she has bad
foreclosure activities and she watches a lot of TV. All of these types of information can be
found from social networks, blogs and homepages and in this example the insurance company
may conclude that Beth is probably less healthy than Sarah and therefore is more likely to have a shorter lifespan. Now although this type of analytics may be quite
useful for the insurance companies, it's very disturbing for the customer side, and there is
actually a study that shows that there is a high correlation between the analysis of this data and the actual medical tests that the insurance companies perform on their patients. So I have been studying the problem of data analytics from the data privacy point of view as well, and
there is an interesting connection between data integration and data privacy. The better you
are at integrating data, the more likely someone's information is leaked to the public, so you
have worse privacy and vice versa. The better privacy measures that you take, the harder it is
to integrate the information. So it's no coincidence that I've been studying these two problems.
So the following slide summarizes my PhD work. First, I've written a number of papers on a
topic called entity resolution which is a data integration problem. Secondly I have recently
written a number of papers on data privacy and finally I published a paper called Indexing
Boolean Expressions which was done while I was interning at Yahoo Research and the
techniques here were used in the display advertisement prototype of the Yahoo webpage. So
for this talk I'm going to focus on the following works. During part one I will elaborate on a
work called evolving rules and during part two which is going to be shorter than part one I will
mainly talk about a work called managing information leakage and briefly talk about a newer
work called disinformation techniques for entity resolution. Entity resolution is a problem
where you want to identify which records in a database refer to the same real-world entity. For
example, you might have a list of customers where some of these records refer to the same
person and you might want to remove the duplicates efficiently. Entity resolution can be a very
challenging problem. In general, you're not simply given IDs like Social Security numbers that
you can just compare. Instead many records contain different types of information and other
records may have incorrect information like typos, which makes entity resolution an extremely
challenging problem. There are many applications for entity resolution. In our Haiti example
you can think of the scenario where there are lots of people sick in bed, and so the hospitals
may have these lists of patients. At the same time other people may be posting photos of loved
ones that are missing, so now you have an entity resolution problem where you want to
identify which patients in these hospital lists refer to the same people as those in these photos
of missing ones. In addition, there are many other applications for entity resolution and this
makes entity resolution an extremely important problem as well. So to summarize my research
focus, I'm not trying to solve the entire data analytics problem. This is a huge problem that
requires a lot of research to completely solve. Instead, I'm focusing on a sub problem called
data integration. Within that I am solving an entity resolution problem and for this
presentation I will focus on scalability. So my work on evolving rules was motivated through my
interactions with a company called Spock.com. So this is a people search engine that collects
information from various social networks like Facebook and MySpace, blogs and Wikipedia and
the goal is to resolve the hundreds of millions of records to create one profile per person. Now
whenever I was going to Spock, I was hearing these discussions of how to improve the logic of the
comparison rule for comparing people records and at one point they were talking about
comparing the names, addresses and zip codes of people, but when I made another visit they
realized that the zip code attribute is not a good attribute for comparing people, so they then
decided to compare the names, addresses and phone numbers of people. So every time I made
a visit they were always making these incremental changes of their comparison rule and that
made me think of how to produce an updated ER result if you change your comparison rule. So
let's say that starting with an input set of records i we use a comparison rule to run entity
resolution and produce a resulting set of records. Now if you change your comparison rule,
then the naïve approach is to simply start from scratch and produce a new result using the new
comparison rule. The naïve approach obviously produces a correct result, but it is not
necessarily the most efficient approach, and if you are a company like Spock.com that's resolving
hundreds of millions of records, then you don't really want to start from scratch every time you
make a small incremental change on your comparison rule. So in our paper we proposed an incremental technique that can basically convert an old ER result into a new ER
result, and we call this process rule evolution and hopefully rule evolution is much more
efficient than simply starting from scratch. So for example, let's say that we want to resolve
four records, yes please?
>>: [inaudible] operation to go through and run the rules again. I'm trying to understand the
motivation for trying to reduce the computation.
>> Stephen Whang: I'm going to illustrate that problem; I am going to show our solutions using
this simple example, so bear with me. For example, let's say that we are trying to resolve R1
through R4 and each record contains the attributes name, zip code and phone number. Now
of course in the real world there may be many more attributes and many more records to
resolve, but I'm only showing you a small example for illustration purposes. Let's say that we
use a simple ER algorithm where we first perform pairwise comparisons of all the records and
later on cluster the records that match with each other together. Initially, let's say that our
comparison rule compares the names and zip codes of people and if two records have the same
name and zip code, the comparison rule returns true and false otherwise. So in this example
the only matching pair of records is R1 and R2 because they have the same name John and the
same zip code 54321. There are no other matching pairs of records. So after clustering the
matching records, we get our first ER result where R1 and R2 are in the same cluster but R3 and
R4 are in separate clusters. The total amount of work we performed here is six pairwise
comparisons, because we were comparing four records in a pairwise fashion. Now notice that
record R4 has a null value for its zip code, so this may be an indicator that the zip code attribute
is not an appropriate attribute for comparison because it seems to be very sparse, so let's say
that we now change our comparison rule and compare the names and phone numbers of
people instead of the names and zip codes. So using the naïve approach, again, we are going to
compare all of the records in a pairwise fashion and in this case the only matching pair of
records is R2 and R3 because they have the same name John and the same phone number 987.
After clustering the records that match with each other we get our new ER result where R2 and
R3 are in the same cluster and R1 and R4 are in separate clusters. The total amount of work
that we've done here again is six pairwise comparisons because we started from scratch. Yes
please?
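To make the naïve approach concrete, here is a minimal Python sketch of pairwise compare-and-cluster ER, assuming a conjunctive equality rule; the record values mirror the R1 through R4 example above and are purely illustrative.

```python
from itertools import combinations

# Illustrative records mirroring R1-R4 above; None models R4's missing zip code.
records = [
    {"id": "r1", "name": "John", "zip": "54321", "phone": "123"},
    {"id": "r2", "name": "John", "zip": "54321", "phone": "987"},
    {"id": "r3", "name": "John", "zip": "11111", "phone": "987"},
    {"id": "r4", "name": "Mary", "zip": None,    "phone": "555"},
]

def make_rule(attrs):
    """Conjunctive equality rule: two records match if they agree on every attribute."""
    def rule(r, s):
        return all(r[a] is not None and r[a] == s[a] for a in attrs)
    return rule

def naive_er(records, rule):
    """Compare every pair of records once, then cluster matching records together
    (connected components via union-find)."""
    parent = {r["id"]: r["id"] for r in records}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    comparisons = 0
    for r, s in combinations(records, 2):
        comparisons += 1
        if rule(r, s):
            parent[find(r["id"])] = find(s["id"])
    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    return list(clusters.values()), comparisons

print(naive_er(records, make_rule(["name", "zip"])))    # r1 and r2 cluster; 6 comparisons
print(naive_er(records, make_rule(["name", "phone"])))  # r2 and r3 cluster; 6 comparisons again
```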
>>: [inaudible] comparisons wouldn't the comparisons be the function of the key that you're
using. Wouldn't the number of comparisons you're doing for each case be a function of which
fields you are using, of which rule you are using? You're saying you are always doing pairwise,
but it seems like you know what the rules are, so the number of comparisons would be a
function of what the rule matches.
>> Stephen Whang: Right. So here I'm using a simple ER algorithm that just blindly does
pairwise comparisons, but you can think of more advanced ER algorithms that reduce the
number of record comparisons. Again, I'm illustrating using a simple ER algorithm that always
performs pairwise comparisons among all of the records. That's why I'm saying it is six
comparisons. If you use a different ER algorithm, then we can get a case where it does fewer
than six comparisons, but just bear with me because I'm not saying that this is the state-of-the-art ER algorithm at all. So now while six comparisons here may not seem like a big deal, if
you're resolving hundreds of millions of records, then this is a big deal, so we would like to
reduce this number as much as possible. So before I give you the answer of how to efficiently produce the name and phone number result, let me go through an easier example where I would like to perform rule evolution from name and zip code to name and zip code and
phone number. In this case there is an interesting relationship between the two comparison
rules. The second comparison rule seems to be stricter than the first comparison rule and you
can see that two records that do not match according to name and zip code will not match according to name and zip code and phone number either. Therefore, starting from the first result, we don't have to compare the records across
different clusters because we know that they will never match according to the new
comparison rule. Instead we only need to compare the records within the same cluster. So in
this case we will only compare R1 and R2. And it turns out that R1 and R2 have different phone
numbers so they end up in different clusters. Just by performing one record comparison we've
produced our new ER result for name and zip code and phone number, which is much more
efficient than the six record comparisons that would have been produced using the naïve
approach. Yes?
>>: So this [inaudible] any other rule [inaudible] combinations?
>> Stephen Whang: Right. So I'm assuming that the comparison rule is in a conjunctive normal
form of arbitrary predicates in general.
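As a hedged sketch of the within-cluster re-resolution just described (reusing the naive_er, make_rule, and records helpers from the earlier sketch): when the new rule is stricter than the old one, records in different old clusters can never match, so only records that share an old cluster are compared.

```python
def evolve_to_stricter_rule(old_clusters, records_by_id, new_rule):
    """Re-resolve each old cluster independently; under a stricter rule, records in
    different old clusters can never end up together, so cross-cluster pairs are skipped."""
    new_clusters, comparisons = [], 0
    for cluster in old_clusters:
        members = [records_by_id[rid] for rid in cluster]
        sub_clusters, sub_comparisons = naive_er(members, new_rule)
        new_clusters.extend(sub_clusters)
        comparisons += sub_comparisons
    return new_clusters, comparisons

# Evolving from (name AND zip) to the stricter (name AND zip AND phone):
records_by_id = {r["id"]: r for r in records}
old_clusters, _ = naive_er(records, make_rule(["name", "zip"]))
print(evolve_to_stricter_rule(old_clusters, records_by_id,
                              make_rule(["name", "zip", "phone"])))
# Only r1 and r2 get compared: 1 comparison instead of 6.
```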
>>: [inaudible] greater combinations of [inaudible] without the [inaudible].
>> Stephen Whang: Right. So later on I also have these rule evolution techniques for distance-based clustering ER algorithms where you are clustering records based on their relative
distances, so I'll talk about that later on. So going back to our original problem, notice that you
can't use the exact same technique used here to produce the name and phone number result
because there is no special strictness relationship between name and zip code and name and
phone number, so for example, although R2 and R3 are in different clusters in the first result,
they might have the same name and if they also had the same phone number, then they might
end up in the same cluster in the name and phone number result. So we can't use this technique, and our solution for this problem is to use what we call materialization. In
addition to producing the ER result for name and zip code, the idea is to produce an ER result
for the name comparison rule and another result for the zip code comparison rule as well.
And starting from this materialization, notice that we can exploit the name result to efficiently
produce the name and phone number result using the exact same technique used in the
previous slide. So again, there is this interesting strictness relationship where name and phone
number is stricter than name, so we are only going to compare the records that are within the
same cluster in the name result. And just by doing three pairwise record comparisons we can
now produce the name and phone number result, which is much more efficient than the six
record comparisons that would have had to have been produced by the naïve approach. So
here rule evolution appears to be very efficient, but an immediate question you can ask is, what
are the costs to be paid for using this technique? There are time and space
overheads for creating these materializations. For example, it seems like we are running
entity resolution three times, which basically defeats the purpose of reducing runtime. In our
paper for the time overhead, we have various amortization techniques where the idea is to
share expensive operations as much as possible when we are producing multiple ER results.
For example, when you're initializing the records you have to read the records from disk and you have to parse them and do some more processing, and this only needs to be done once
even if you are producing ER results multiple times. We also have ER algorithm specific
amortization techniques, so one of the ER algorithms that we implemented in our work always
sorted the records before doing any resolution and the sorting here was the bottleneck of the
entire ER operation. So in this case we only need to sort the records once, even when we are
producing ER results multiple times. So with all of these amortization techniques combined, it turns
out that the time overhead is reasonable in practice and I'll show you some experimental
results later on. Now for the space overhead, notice that we need to store more partitions of the records, but the space complexity of storing partitions is linear in the number of records, and in addition we can use compression techniques where we simply store the record IDs in partitions instead of copying the entire contents of the records over and over again. So
again, in our paper we show that the space overhead of materialization is reasonable as well.
So we have considered many different… Yes, Arvin?
>>: How do you know [inaudible]?
>> Stephen Whang: In general, we viewed the comparison rule as a conjunctive normal form
and the strategy is to materialize on each of the conjuncts. Now this is the most general case
because we don't know how the new comparison rule is going to change, but you can actually
improve this approach by, if you know some more information. For example, if some of the
conjuncts always go together, then you don't have to materialize on each of the conjuncts. But
in the most general approach we materialize on one conjunct at a time. Yes, please?
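A rough sketch of the per-conjunct materialization with the amortization idea mentioned above, continuing the earlier helpers; the one-time preparation (reading, parsing, and any sorting) is shared across all runs, and each materialized result stores only record IDs.

```python
def materialize_per_conjunct(records, conjuncts):
    """Run ER once per conjunct while sharing the expensive initialization.
    Each materialized result keeps only record IDs, not full record contents."""
    # Shared, one-time work: in a real system this would be disk reads, parsing, sorting, ...
    prepared = sorted(records, key=lambda r: r["name"] or "")
    materialized = {}
    for attr in conjuncts:
        clusters, _ = naive_er(prepared, make_rule([attr]))
        materialized[attr] = clusters  # lists of record IDs only
    return materialized

views = materialize_per_conjunct(records, ["name", "zip", "phone"])
for attr, clusters in views.items():
    print(attr, clusters)
```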
>>: [inaudible] for future use? So how does your technique compare against [inaudible]?
>> Stephen Whang: Yeah, there's a great deal of work done on materialized views. This is in
the same spirit as materialized views, but here I'm mainly working on clusters of records, so this is an ER-specific technique. So I'll just keep on explaining and hopefully this
will be clearer. I considered many different ER algorithms in the literature and we've
categorized them according to desirable properties that they satisfy. If an ER algorithm satisfies
one or more of these properties then we have efficient rule evolution techniques. So the first
property we identified is called rule monotonicity, and here you can see a number of ER
algorithms in the literature that satisfy this property in the red circle. The second property is
called context free, and you can see again a number of ER algorithms in the literature that
satisfy this property in the blue circle. In addition, we used two other properties in the
literature and further categorized ER algorithms according to the properties they satisfy and
depending on where you are in this Venn diagram, we have different rule evolution techniques
that you can use. In this talk I will only focus on the purple area which is the intersection of rule
monotonicity and context free. I'll define these two properties and then illustrate our rule
evolution techniques, but before that let me define two preliminary notations. Yes please?
>>: So a lot of the real systems they would just couple the [inaudible] this is just blocking
pairwise comparison and [inaudible] does this character spread to the individual components of
this pipeline or it fully applies to a system which you describe as a single stage algorithm?
>> Stephen Whang: Yes. Here I am assuming all of the steps to be one ER algorithm, so we
assume that an ER algorithm basically receives a set of records and returns a partition of
records. This can include blocking and resolution and post-processing, so everything is included
in this blackbox ER algorithm.
>>: [inaudible] individual stages, even though the systems tend to be more piecemeal. They want to plug in some blockers and then they want to play with some clusterings.
>> Stephen Whang: Certainly. You can view blocking to be a separate process from entity
resolution, so it's kind of an orthogonal issue. In order to scale entity resolution you can use blocking separately, but when you're resolving each of the blocks, you can use the rule
evolution techniques here. A comparison rule B1 is said to be stricter than B2 if all of the pairs of records that match according to B1 also match according to B2. For example,
name and ZIP code is stricter than name, because any two records that have the same name
and ZIP code will also have the same name as well. Now the second preliminary notation is
domination. An ER result is dominated by another result if for all of the
records that are in the same cluster in the first result, they are also in the same cluster in the
second result as well. In this example if you look at the first ER result, the only records that are
in the same cluster are R1 and R2 and since these two records are in the same cluster in the
second ER result as well, the first ER result is dominated by the second ER result. But the third ER
result is not dominated by the second result because although R3 and R4 are in the same
cluster, they are not in the same cluster in the second result, so this is not dominated. Using
the strictness and domination I can now define the rule monotonicity property which is defined
as follows: if a comparison rule B1 is stricter than B2, then the ER result
produced by B1 must always be dominated by the ER result produced by B2. So for example,
let's say that we are producing an ER result using the name and ZIP code comparison rule, and
we produce another ER result using the name comparison rule. If the ER algorithm satisfies rule
monotonicity, then it must be the case that the first ER result is dominated by the second ER
result. So the second property is called context free and the formal definition is shown on the
screen. Intuitively it is saying that if we can divide all of the records into two sets such that we
know that none of the records in these two different sets will ever end up in the same cluster,
then we can resolve each set independently. So for example, let's say that we have four
records, R1 through R4 and let's say that R1 and R2 refer to children while R3 and R4 refer to
adults. So from the start we know that none of the children will ever end up in the same cluster
with the adults. So we can divide the records into the following two sets. So if the ER algorithm
satisfies context free then we are allowed to do the following where we resolve each set
independently. We can first resolve R1 and R2 together without caring whatever happens to R3
and R4 and produce clusters. Then we can resolve R3 and R4 without caring whatever happens
to R1 and R2 and produce another set of clusters. The property guarantees that by simply
collecting the resulting clusters, we obtain a correct ER result. Now notice that
I have this cute diagram of a thick black wall in the middle which is basically stopping
information from flowing from left to right or from right to left. Yes?
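The domination notion lends itself to a direct check. A small sketch, assuming ER results are given as lists of record-ID lists; the helper name and the example shapes are mine, loosely following the talk's R1 through R4 example.

```python
def dominated_by(result_a, result_b):
    """True if every pair of records sharing a cluster in result_a also shares a
    cluster in result_b, i.e. result_a is dominated by result_b."""
    cluster_of_b = {rid: i for i, cluster in enumerate(result_b) for rid in cluster}
    for cluster in result_a:
        for i in range(len(cluster)):
            for j in range(i + 1, len(cluster)):
                if cluster_of_b[cluster[i]] != cluster_of_b[cluster[j]]:
                    return False
    return True

first  = [["r1", "r2"], ["r3"], ["r4"]]
second = [["r1", "r2", "r3"], ["r4"]]
third  = [["r1", "r2"], ["r3", "r4"]]
print(dominated_by(first, second))   # True: r1 and r2 stay together in second
print(dominated_by(third, second))   # False: r3 and r4 are split in second
```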
>>: [inaudible] B [inaudible]?
>> Stephen Whang: B is the rule. Yes. As an example where context free is violated, let's say that R3 is the father of R1 and R4 is the father of R2. Let's say that the ER algorithm first
compares R1 and R2 and clusters them together and compares R3 and R4 and puts them in
separate clusters. Now the fact that R1 and R2 are the same child may be strong evidence that
R3 and R4 are actually the same father, so the ER algorithm may change its mind and end up
merging R3 with R4. So that's an example where some of the information is flowing from left to
right, so that is a violation of this context free property. So using rule monotonicity and context
free I will illustrate a rule evolution algorithm that exploits these properties to do efficient
evolution. So let's say that we are evolving from the comparison rule name address and ZIP
code to name address and phone number. First of all I am going to materialize on all of the
conjuncts. In this example I will produce an ER result for the name comparison rule and
another result for the address comparison rule and another one for the ZIP code rule. I don't show the
ZIP code result due to space constraints. So during step one, the algorithm is going to first
identify the common conjuncts of the old and new comparison rules. In this example the
common conjuncts are the name and address comparison rules. It's going to perform what we
call a Meet operation among the ER results of the common conjuncts. In this example we are going to perform a Meet operation between the name result and the address
result. Intuitively, we are taking the intersecting clusters, so in this Meet result you can see that
R2 and R3 are in the same cluster, and the reason is that these two records are in the same
cluster in the name result and they are also in the same cluster in the address result as well. But
R1 and R2 are not in the same cluster in the Meet result because although they are in the same
cluster in the name result, they are not in the same cluster in the address result. Now using
rule monotonicity we can prove that this Meet result always dominates the final ER result,
which means that none of the records in different clusters are ever going to end up in the same
cluster in the end result. During step two the algorithm exploits the context free property and
since it knows that none of the records can match across different clusters, it is going to resolve
each cluster independently. So for the first cluster it only contains R1 so we are going to return
R1 as a singleton cluster. For the second cluster we are going to compare R2 and R3 and let's
say that it turns out that R2 and R3 refer to different entities so they are placed in different
clusters. Finally for the third cluster, it only contains R4 so we return R4 as a singleton cluster.
And simply by collecting the final clusters we are guaranteed to have arrived at a correct ER
result. The amount of work we have done here is only one record comparison which is much
better than the six record comparisons that would have been performed by the naïve approach.
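A minimal sketch of the two steps just walked through, continuing the earlier helpers: step one takes the Meet (cluster intersection) of the materialized results for the common conjuncts, and step two resolves each Meet cluster independently under the new rule, which is safe because the Meet dominates the final result.

```python
def meet(result_a, result_b):
    """Intersect two clusterings: records stay together only if they share a cluster in both."""
    cluster_a = {rid: i for i, c in enumerate(result_a) for rid in c}
    cluster_b = {rid: i for i, c in enumerate(result_b) for rid in c}
    groups = {}
    for rid in cluster_a:
        groups.setdefault((cluster_a[rid], cluster_b[rid]), []).append(rid)
    return list(groups.values())

def evolve_with_meet(materialized, common_conjuncts, new_rule, records_by_id):
    """Step 1: meet the materialized results of the common conjuncts.
    Step 2: resolve each meet cluster independently (context free)."""
    meet_result = materialized[common_conjuncts[0]]
    for attr in common_conjuncts[1:]:
        meet_result = meet(meet_result, materialized[attr])
    return evolve_to_stricter_rule(meet_result, records_by_id, new_rule)

# Using the smaller name/zip/phone example records from before: evolving from
# (name AND zip) to (name AND phone), the common conjunct is name, so only records
# sharing a name cluster are compared under the new rule (3 comparisons, not 6).
print(evolve_with_meet(views, ["name"], make_rule(["name", "phone"]), records_by_id))
```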
So in summary, I defined rule monotonicity and context free properties and I've illustrated an
efficient rule evolution algorithm that works well for this purple area. Again, depending on
where you are in this Venn diagram, we have different rule evolution techniques that work for those
regions as well. Yes?
>>: [inaudible] example you actually rely on the [inaudible] name and address so suppose
[inaudible] found like name [inaudible] is it possible that you [inaudible] the constraints
[inaudible] each other columns actually was changed [inaudible]?
>> Stephen Whang: I'm assuming that the comparison rule is a conjunctive normal form of
arbitrary predicates so within the predicate you can use whatever similarity function you like.
Here I was just showing you an easy example where we only perform equality constraints.
>>: I'm not saying that the approach [inaudible] obvious [inaudible] predicate. Like when you
have such conjuncts are like you realize [inaudible] predicate [inaudible] name [inaudible] each
predicate [inaudible].
>> Stephen Whang: In this model if you are doing a similarity comparison between names and
you are comparing with a high threshold, then you have another predicate that uses a lower
threshold, we consider these two predicates to be totally different. But you can do more optimizations by taking into account that these two predicates are actually doing the same string comparison, just with different thresholds. Yes?
>>: What if [inaudible] threshold?
>> Stephen Whang: It depends on the ER algorithm. If you are using, for example, the single
link hierarchical clustering and you relax the threshold, which means you reduce the threshold,
then all you have to do is build on top of the previous ER result. Now if you go the other way
around then this approach does not work. Yes?
>>: So you [inaudible] changing the rules, right? What if the data changes? Will the [inaudible]
still work?
>> Stephen Whang: That is a very important problem and this work is mainly focused on when
the comparison rule itself changes. It is an orthogonal problem and in the future I would like to
do research in both directions, but this work does not consider the case where you're
having a new stream of records which you would like to add to your ER result. So far I have
been talking about rule evolution techniques for what I've been calling match-based clustering
algorithms. In our paper we also have rule evolution techniques for what we call distance-based clustering algorithms, so here records are clustered based on their relative distances. I
won't have enough time to go through all of the details, but just to give you the high-level idea,
let's say that we have four records. R, S, T and U and let's say that the record S is close to R
because S is within that dotted circle around R, but the records T and U are far away from R
because they are outside of that dotted circle. Now, whenever the distance function changes,
all of these pairwise distances are going to change as well. Although we don't know how
exactly those distances will change, let's say we at least know the lower and upper bounds of
those distances, shown as the red brackets. So, for example, the record S may suddenly become closer
to R or it may end up further away from R but we still know that S will still be inside that dotted
circle. So we know that S will still be close to R. For record U, although the distance from
R is going to change, we still know that U is going to be outside that dotted circle. Now the
interesting case is record T. Now T may end up within that dotted circle or outside the circle
and in that case we are going to be optimistic and cluster R and T together, but later on we visit
this cluster and check and see if R and T are indeed close to each other. If not we are going to
split T out of R's cluster. So I'm glossing over a lot of details, but that is the high-level idea for
rule evolution for distance-based clustering algorithms. So the following slide shows all the
data sets I've been using for all of my ER works. The first data set contains shopping items
provided by Yahoo Shopping where items were collected from different online shopping malls
and each record represents an item and it contains attributes including the
title, price and category. Now since these records are collected from different data sources,
there are naturally many duplicates that need to be resolved. The second data set is a hotel
data set provided by Yahoo Travel and again records were collected from different travel sites
like Orbitz and Expedia and again, different records may refer to the same hotel. And while
there aren't too many records here, there are many attributes per record, and the data is very
rich here. Finally, we have a person data set provided by Spock.com where people records are
collected from social networks like Facebook and MySpace and blogs and Wikipedia and we
have a lot of records here, and each record contains the name, gender, state, school and job
information about a person. Among these three data sets, for my evolving rules work I only performed experiments on the shopping and the hotel data sets, and in this talk I will only show
you experimental results using the shopping data set. Yes, Ravi?
>>: [inaudible] data sets [inaudible] how expensive is it [inaudible]?
>> Stephen Whang: It's extremely expensive. You don't want to run the naïve approach.
>>: [inaudible].
>> Stephen Whang: So the naïve approach is defined to be starting from scratch. So your
question is how long does it take to resolve hundreds of millions of records? So I was interacting with Spock.com and I think it takes on the order of hours, or maybe days. But
I'm not sure about the exact number because it's kind of a secret that they have so they weren't
revealing all of the numbers to me.
>>: [inaudible] to do this or?
>> Stephen Whang: Not that I know of. Not Spock.com. This company recently got acquired
by a company called Intelius.com, so since then they may have changed their strategy, but
when I was interacting with Spock.com they were just using a regular DBMS, like MySQL.
>>: [inaudible] the gold standard sets for each of these?
>> Stephen Whang: That's a good question. For the hotel data set we had a gold standard
which was provided by a Yahoo employee that we knew, but for the other two data sets we
don't have a gold standard. It's very hard to generate a gold standard for these data sets. I'm
going to show you two representative experimental results of my work, and both of these are
going to be runtime experiments. The reason I'm not going to show you any accuracy
experiments is that our techniques are guaranteed to return 100% accurate ER results all the
time, so we're not trading off accuracy with scalability; we are only improving the scalability of
the rule evolution, so that's the feature of our work. This plot shows the results of gathering
our techniques on 3000 shopping records and the X axis shows the strictness of the common
comparison rule. We tested on many different comparison rules, and in some of these
experiments we evolved from title and category to title and price. In that example, the
common comparison rule is the title comparison rule and for this rule we were extracting the
title from two different records and we were computing the string similarity of the two titles
and then we compared this value with the threshold. If we increase the threshold, then the
title comparison rule become stricter in the sense that fewer records match according to their
titles, so in that case we moved to the left side of the X axis. On the other hand, if we
decreased the threshold then the title comparison rule becomes less restricted in a sense that
fewer records, more records match with each other and so in that case we moved to the right
hand side of the X axis. So just think of the X axis to represent different application semantics.
The y-axis shows the runtime improvement of rule evolution versus the naïve approach, but
here I did not include the time overhead for the materialization, but in that next plot I will
compare the total runtime of rule evolution including the time overhead with that of the naïve
approach. So I implemented four different ER algorithms in the literature and you can see that
in the best case we get over three orders of magnitude of improvement, while even in the
worst case we still improve the runtime over the naïve approach. Yes please?
>>: [inaudible].
>> Stephen Whang: The naïve approach is to simply run the ER from scratch without exploiting
the materialization. Given a set of records, simply run your own ER algorithm using the new
comparison rule.
>>: [inaudible].
>> Stephen Whang: Oh, here? So these are the four different ER algorithms. The red plot is
the sorted neighborhood technique where we sort the records and then we use a sliding window
and compare the records within the same window. These are all different techniques. HCB
means we are merging records that match with each other in a greedy fashion and HCBR is
similar but with some desirable properties that guarantee that the ER result is going to be unique. Finally, HCDS is a distance-based clustering algorithm; it is simply the single link clustering algorithm.
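For reference, a hedged sketch of the sorted neighborhood idea behind the red plot, reusing the illustrative records from earlier; the sorting key and window size are arbitrary choices, not the settings used in the experiments.

```python
from itertools import combinations

def sorted_neighborhood_pairs(records, key, window=3):
    """Sort records by a key, slide a fixed-size window over the sorted list, and
    emit only the record pairs that fall inside the same window."""
    ordered = sorted(records, key=key)
    pairs = set()
    for start in range(len(ordered) - window + 1):
        for r, s in combinations(ordered[start:start + window], 2):
            pairs.add((r["id"], s["id"]))
    return pairs

# Candidate pairs to feed into the comparison rule, instead of all O(n^2) pairs.
candidates = sorted_neighborhood_pairs(records, key=lambda r: (r["name"] or "", r["zip"] or ""))
print(len(candidates), sorted(candidates))
```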
>>: Have you tried other clustering algorithms [inaudible] K-Means [inaudible]?
>> Stephen Whang: So K-Means did not satisfy any of the properties, so it was not a good
algorithm to demonstrate our techniques. When you're comparing the total runtimes, here is
what you should expect. Let's say that for rule evolution you are performing materialization
once and then you perform rule evolution one or more times. The X axis here shows the
number of evolutions and the Y axis shows the total runtime. Initially, for rule evolution we're paying an overhead for the materialization, so rule evolution is slower than the naïve approach, but as we perform more rule evolutions, the incremental cost paid by rule evolution is smaller than the cost of the naïve approach where we simply run ER from scratch. So at
some point you can imagine rule evolution is going to outperform the naïve approach. Now the
following slide shows a result of one scenario we've considered. The X axis shows a number of
shopping records that were resolved and the Y axis shows the total runtime in hours on the log
scale. I'm only going to explain the top two plots, which are colored. So the red plot shows
the total runtime of the naïve approach, while the blue plot shows the total runtime of when
we are using rule evolution using the exact same ER algorithm. And here we performed
materialization once followed by one rule evolution. You can see that the blue plot still
improves on the red plot, which means that in this case the rule evolution was saving enough time to compensate for the time overhead paid for materialization. Although this gap seems very small, the unit of the Y axis is in hours so the runtime improvement is
actually significant.
>>: It's also a log scale.
>> Stephen Whang: Yes, it's also a log scale, as well, so that makes the improvement even
more significant. Now the important thing to understand here is that this result is only
for one scenario that we've considered. We can think of many other scenarios where you could
either get results that are much better than these results or much worse. For example, if you're
comparing multiple evolutions instead of just this one, then you're probably going to have
much better runtime improvements than those shown in this plot. On the other hand, you can
think of pathological cases as well. For example, if the old and new comparison
rules are totally different, then there is no point in performing rule evolution because there is
nothing to exploit, so in this case simply running the naïve approach is the most efficient
technique. If you run rule evolution here you are probably slower than the naïve approach
because you still have to pay the time overhead for materialization. The take away of this plot
is that we now have this general framework for rule evolution that you can use to evaluate if
rule evolution makes sense for your application or not. So in conclusion, we've proposed a rule
evolution framework that exploits ER properties and materialization, and we have shown that
rule evolution can significantly enhance the runtime of the naïve approach. So that was the
first part of my talk and the second part will be much shorter. Yes?
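Before moving on, the break-even intuition from the runtime discussion above can be sketched with made-up numbers: materialization is paid once, each evolution is cheaper than a scratch run, and evolution wins once enough rule changes have occurred.

```python
def breakeven_evolutions(t_materialize, t_evolve, t_scratch):
    """Smallest number of evolutions k for which materializing once and evolving k times
    is faster than running ER from scratch k times."""
    assert t_evolve < t_scratch, "evolution must be cheaper per run to ever break even"
    k = 1
    while t_materialize + k * t_evolve >= k * t_scratch:
        k += 1
    return k

# Hypothetical timings in hours: materialization costs 3x a scratch run,
# each evolution is 10x cheaper than a scratch run.
print(breakeven_evolutions(t_materialize=6.0, t_evolve=0.2, t_scratch=2.0))  # -> 4
```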
>>: Can you talk about rules that are linear to the [inaudible] combinations?
>> Stephen Whang: Figure combinations. I think that's the case of distance-based clustering,
so an example distance function is where you add the name similarity and the address
similarity and you can do a weighted sum, so at the end you get some distance value, so our
distance-based rule evolution techniques for distance-based clustering algorithms work for
your case.
>>: And in non-metric cases in the case where there is a nonlinear sum or if the distance is not
in the metric space?
>> Stephen Whang: As long as you return a distance, it's fine. You don't have to satisfy the triangle inequality. The only assumption we make is that you have to have an idea of how
much this distance can change in the next evolution, so we have to have some information that,
for example, each distance is only going to change at most by 10% or by some constant amount
like 5. As long as we have that information you can use our techniques. Yes?
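A minimal sketch of the bound-based idea for distance-based algorithms, under the assumption just stated that each distance changes by at most a known fraction; pairs are classified as surely close, surely far, or uncertain, and the uncertain ones are optimistically clustered and rechecked afterwards. The 10% bound and the threshold are illustrative.

```python
def classify_pair(old_distance, threshold, max_change=0.10):
    """Classify a record pair when the distance function evolves, given that the new
    distance stays within +/- max_change (a fraction) of the old distance."""
    lower = old_distance * (1 - max_change)
    upper = old_distance * (1 + max_change)
    if upper <= threshold:
        return "surely close"   # like S: guaranteed to stay inside the dotted circle
    if lower > threshold:
        return "surely far"     # like U: guaranteed to stay outside the circle
    return "uncertain"          # like T: optimistically cluster, then verify and split if needed

for d in (0.5, 0.95, 2.0):
    print(d, classify_pair(d, threshold=1.0))
# 0.5 -> surely close, 0.95 -> uncertain, 2.0 -> surely far
```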
>>: So [inaudible]?
>> Stephen Whang: Right. In this work we assume conjunctive normal forms, so if you only
have disjunction everything is going to be considered as one predicate.
>>: [inaudible] junction [inaudible]?
>> Stephen Whang: Okay. We assume that you can convert this into a conjunctive normal
form and then the idea is to materialize on each of the conjuncts. Yes please?
>>: [inaudible]?
>> Stephen Whang: I mentioned that briefly, so we're only saving multiple partitions of
records. The space complexity is linear in the number of records, but notice
that you can use a lot of compression techniques, so you can save a partition of record IDs
instead of a partition of the actual records, so this saves a lot of space and it turns out that the
space overhead is reasonable in practice. It doesn't…
>>: [inaudible].
>> Stephen Whang: It's not more than 10%.
>>: [inaudible] lose your advantage [inaudible] normalizing [inaudible]?
>> Stephen Whang: Yeah. That's an issue. That's a concern that we have.
>>: [inaudible] lose already [inaudible] assumptions [inaudible]?
>> Stephen Whang: At least for the experiments that we performed, a lot of these comparison
rules were just conjunctions of predicates, so we believe that it is reasonable to assume that
you have a conjunction of predicates. The complexity of converting a DNF expression to a CNF expression is exponential in the worst case, but once you do that then you can use our rule
evolution techniques. Yes, please?
>>: So your speedup is a function of a sequence of rules that you have, and currently what do
you find in terms of these rule sequences that you see people using [inaudible] resolution?
>> Stephen Whang: My understanding is, I don't have the--again, I interacted with Spock.com
but they didn't give me all of the details that they have. But my impression is that some of
these predicates are always used, so you always compare the names and addresses of people
and afterwards you may compare the zip codes or phone numbers or some other attribute like
gender, for example. So in that case you can actually improve this approach by materializing on the name and address combination, and if
you have a zip code then you may or may not materialize on the zip code, so this is a very
application-specific optimization. You have to see which conjuncts are used together and which
conjuncts are likely to change or not, and my understanding is that certain conjuncts never
really change and it's kind of only like the tail that is changing, but I can't confirm that.
>>: A follow-up question is how many iterations would someone typically go through in trying
to resolve [inaudible] resolution system?
>> Stephen Whang: When I was visiting Spock.com, every time I made a visit they were
discussing how to improve their comparison rules, so this always happens. I can't give a
definite number, but it's always a continuous improvement, so it's very rare that you have
a complete understanding of your data and your applications. In the real world, while you're constructing your system, you're getting a deeper understanding of your application and you make improvements to your match function, your comparison rule, and then it's a back-and-forth thing, and so you evaluate this rule and you
realize you missed something and you improve your comparison rule and this kind of repeats,
so it's pretty frequent. Yes, please?
>>: [inaudible] about the scalability of the naïve [inaudible] something about the naïve, for
example, if you look at your ER techniques, you can say there are two parts. There is pairwise
matching and taking things I [inaudible] and then there are other clustering's [inaudible] both of
these if I'm not mistaken [inaudible] so [inaudible] against a well-designed ER as a baseline or
[inaudible] what [inaudible] used for [inaudible] system then how much you can use
[inaudible]?
>> Stephen Whang: For my work I only considered the algorithms in this Venn diagram so I
know these are very cryptic, but this is sort of the sorted neighborhood technique, and the HC's are all hierarchical clustering. The ME is the technique proposed by Monge and Elkan, where you have a queue and you want to only compare each new record with clusters in the queue, and so [inaudible] was mentioning K-Means, but we didn't consider K-Means. We didn't consider your case where you have this very scalable way of decoupling
the identification of candidates followed by a clustering. So these are the algorithms we used
in the paper. It would be very interesting to see whether your algorithm
actually satisfies any of these properties. Yes?
>>: How do you think [inaudible] experiments [inaudible]?
>> Stephen Whang: It was just hand-picked. A comparison rule that I used contained two predicates and then I was changing one of them, so I wasn't doing an entire survey of rule evolution; I was just trying to demonstrate how rule evolution saves time and compensates for the materialization overhead. But
I agree that there is a lot of future work that can be done from here; basically, there are a lot of
unanswered questions. I'll move on to the second part of my talk where I will mainly talk about
a work called managing information leakage and then briefly I'll mention more recent work
called disinformation techniques for ER. So at the beginning of my talk I motivated you about
the data privacy problem by explaining how insurance companies are sifting through the web
and are collecting and analyzing your personal information to predict your lifespans. Here is
another interesting example for my work. So Quora.com is a Q&A site where people can ask
questions and other people can answer those questions. Not too long ago there was this
person who posted a question basically challenging anyone out there to find all of his
information on the web. If you look at this question you can see that there is no indicator of
the identity; there's no clue about the identity of the person who asked the question. You can't
find his name. There's no ID of the person who posted this question, so just by looking at the
text you can't get any clue. Now it turns out that once you post a question on Quora.com you
become the only follower of your own question, so there is this anonymous user that came
along and he was lucky enough to discover the only follower of this question, and he
clicked on the profile, got the name and started searching through various social networks,
blogs and homepages to extract all the information about this guy, and so the anonymous user
posted 26 bullet points of personal information about this person. One bullet point says that
you are 24 and you will be turning 25 very soon. Another bullet point says your mother's
maiden name starts with the letter P. Another bullet point said you can program in C++ and
another one says that you are related to some incident in LA back in 2009, and finally another
one says that you are a Democrat. So Joseph, who turns out to be the person who posted this question, was very impressed and actually categorized the 26 bullet points as
follows. Some of them were categorized as correct and downright scary. Apparently Joseph
did not anticipate that someone would figure out his exact age. If someone figures out my
mother’s maiden name, I would be freaked out as well. Now some of the bullet points were
categorized as deliberately public, so the claim that Joseph can program in C++ was probably
information that could be found in his online resume. Interestingly, there were six incorrect
bullet points. The claim that Joseph is somehow related to this incident back in LA in 2009
turns out to be false information. So Joseph had this Facebook profile image that was related
to that incident but it turns out that this image was a deliberate obfuscation that Joseph made
in order to confuse any adversary that wanted to find his information. Also the
claim that Joseph is a Democrat turns out to be wrong as well. It turns out that in this case the anonymous user confused this Joseph with some other Joseph with the
same name and since this other Joseph was clearly a Democrat, the anonymous user thought
that the first Joseph was a Democrat as well. But here Joseph clarifies that no party can handle
his idiosyncrasies. So after this experiment Joseph concludes as follows in bold font. He's
saying that it seems to be far more effective to allow incorrect information to propagate than
to try to stem the tide. So my research in data privacy is twofold. First, I'd like to quantify just how much of your personal information is being leaked to the public and second, I'd like to
propose management techniques for information leakage. Now the interesting connection
between this work and the previous works on entity resolution is that I'm assuming that the
adversary is performing entity resolution to piece together your information. In this example
the anonymous user was piecing together Joseph’s information to know more about him. So
I'm basically using the ER models that I developed in my previous works to simulate the
adversary to then quantify the information leakage. That is the interesting connection between
this work and my previous ER works. To formalize this problem a little bit further, I'm going to
assume that all of the pieces of information on the web are simply records in a database as
shown as the blue boxes. And I'm going to assume that there is a notion of full information of
Joseph as shown on the top of the screen. Again, the adversary is going to perform entity
resolution and piece together the relevant records that refer to Joseph and intuitively the
information leakage is defined as the similarity of this blob of information with that complete
information. So I won't have enough time to go through all of the details of my measure, but
here are the key features. So first we incorporate entity resolution. Second, we don't assume
that privacy is an all or nothing concept. If you look at many previous privacy works, especially in data publishing, the assumption is that you somehow have complete control of your data before you release it, so once you make it perfectly private then you can just give it to someone else. For example, a hospital may be trying to release a set of medical records of its patients to the public for research reasons. There are lots of anonymization techniques that you can use to make it very hard to figure out which patient has
which disease, so once you've made the data entirely anonymous, then you are allowed to
publish this information to the public. In comparison, our works make a fundamentally
different assumption, where we assume that some of our information is already in the public,
so whenever you want to interact with your friends through Facebook, you have to give out
some of your personal information like your birth date. If you want to buy a product from
Amazon.com, you have to give out your credit card information. So in order to do our everyday
lives, we are continuously exposing ourselves to the public. Our view of privacy is that there is
no such thing as perfect privacy and that privacy should be within a continuum from 0 to 1 and
we are just trying to be as private as possible. In addition, our information leakage measure
incorporates the uncertainty that the adversary has about his data and the correctness of the data itself. So once you release your information to the
public, we assume that it's very hard to delete that information, so even if you attempt to
remove some information, some other people may have made copies of that information and
companies may have backup files of that information, so if you delete a photo from Facebook,
who knows if that photo is still floating around the web. So in that setting the only way to
reduce information leakage is to add what we call disinformation records, as shown in red,
where the idea is to dilute information by producing records that are realistic but incorrect.
Now disinformation is not a technique that we're proposing as a new idea. This strategy has
been used since the dawn of mankind. So during World War II the Allied forces generated a
great deal of disinformation to trick the Germans into thinking that they would land at Calais, but they actually landed in Normandy, and this was a major turning point of the war. So we're
adapting this ancient strategy to the realm of information management. So in our paper we
propose two types of disinformation records. The first type is called self
disinformation where the disinformation record snaps onto one of the correct records and
lowers the information leakage by itself. The second type of disinformation is called linkage
disinformation, and here the disinformation record is connecting an existing record that is not
correct to one of the correct records and here the contents of the existing record, YYY is being
used to lower the information leakage. These two strategies turn out to be very effective in
practice and we already saw examples for both of these strategies in our Quora.com example.
We saw that the anonymous user mistakenly thought that Joseph was related to some incident
in LA back in 2009, and that turned out to be incorrect because the Facebook profile image of Joseph was kind of a deliberate obfuscation. That's an example of
self disinformation. The anonymous user also mistakenly thought that some other Joseph was
the same person as this Joseph and mistakenly thought that the original Joseph was a
Democrat. That's an example of linkage disinformation and in our paper we discuss these two
strategies in more detail. In a more recent paper that I submitted to VLDB, I focus on the
linkage disinformation problem in more depth. Here I assume that the adversary is running
entity resolution to cluster records that refer to the same entity. Once we have that, I assume
that for each pair of the clusters there is a way to generate disinformation that will trick the
adversary into thinking that these two clusters refer to the same entity and therefore should be
merged with each other. The cost for generating that disinformation is what we call the
merge cost and the costs are written in black numbers for each pair of clusters. Now for each
cluster that we successfully merge with our target cluster, which is Joseph's cluster in this
example, we assume that there is a benefit that we obtain, because we are diluting the
information of Joseph which is a good thing. For each of the clusters I indicate the benefit in
red numbers. So given this setting, we can now formulate an optimization problem where the goal is to merge clusters in a pairwise fashion starting from Joseph's cluster so that we obtain the maximum benefit while paying a total cost that is within a budget B.
In this example, if the budget B is three then the optimal disinformation plan
turns out to be that green tree where we first merge the left cluster with the top cluster and
simultaneously merge the left cluster with the bottom cluster, and in that case the total merge
cost is 1 + 2 = 3, which exactly meets the budget, and the total benefit we attain is 1 + 1 = 2. Now I won't have enough time to go into all of the details of our algorithms, but it turns out that this problem is strongly NP-hard, which means that you can't come up with a pseudo-polynomial algorithm, and there are no approximation algorithms either. So in the most general case we only have heuristics. However, if we restrict the height of the plans to be at
most one, where we can only merge clusters into the target cluster directly, then this problem becomes weakly NP-hard and we do have an exact pseudo-polynomial algorithm and a 2-approximation algorithm. I'll be happy to talk about more details if you are interested. Yes,
please?
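For the height-one restriction, choosing which clusters to merge directly into the target under a budget behaves like a 0/1 knapsack, so here is a pseudo-polynomial dynamic-programming sketch in the spirit of the exact algorithm mentioned; it is my own illustration, assuming integer merge costs, and the numbers are hypothetical.

```python
def best_height_one_plan(clusters, budget):
    """clusters: list of (merge_cost, benefit) pairs, one per cluster that could be merged
    directly into the target cluster. Returns (max total benefit, chosen cluster indices)."""
    dp = [0] * (budget + 1)                       # dp[b] = best benefit with total cost <= b
    choice = [[] for _ in range(budget + 1)]
    for i, (cost, benefit) in enumerate(clusters):
        for b in range(budget, cost - 1, -1):     # iterate downward so each cluster is used once
            if dp[b - cost] + benefit > dp[b]:
                dp[b] = dp[b - cost] + benefit
                choice[b] = choice[b - cost] + [i]
    return dp[budget], choice[budget]

# Hypothetical clusters given as (merge cost into the target, benefit of the mistaken merge).
clusters = [(1, 1), (2, 1), (3, 2)]
print(best_height_one_plan(clusters, budget=3))   # (2, [0, 1]): merge the first two clusters
```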
>>: [inaudible] more traditional metrics [audio begins] why do you not come up with
[inaudible]?
>> Stephen Whang: Here I am talking about the disinformation problem, but I think your question refers to the measure of information leakage. So compared to K-anonymity, the K-anonymity measure is a zero-or-one measure where you are either completely safe or you are not safe at all.
Your data is considered either safe or not safe; it is a black-and-white thing. In comparison, with an information leakage measure we can quantify more fine-grained notions of privacy. So that was about the measure, but this slide is more about how to generate disinformation records that can maximize the benefit for the target record.
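To make the zero-to-one continuum concrete, here is a purely illustrative toy leakage score; it is my own instantiation, not the measure from the paper: compare the attribute-value pairs the adversary has pieced together for the target against the target's full information, so that incorrect (disinformation) values dilute the score.

```python
def toy_information_leakage(full_profile, adversary_cluster):
    """Toy 0-to-1 leakage score: Jaccard similarity between the target's full set of
    attribute-value pairs and the pairs the adversary pieced together via ER."""
    full_pairs = set(full_profile.items())
    gathered = set()
    for record in adversary_cluster:
        gathered.update(record.items())
    if not gathered:
        return 0.0
    return len(full_pairs & gathered) / len(full_pairs | gathered)

full = {"name": "Joseph", "age": "24", "party": "none", "skill": "C++"}
correct_record = {"name": "Joseph", "age": "24"}
wrong_joseph   = {"name": "Joseph", "party": "Democrat"}   # linkage disinformation
print(round(toy_information_leakage(full, [correct_record]), 2))                 # 0.5
print(round(toy_information_leakage(full, [correct_record, wrong_joseph]), 2))   # 0.4: diluted
```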
>>: [inaudible].
>> Stephen Whang: Oh, this is the budget we can use for generating the disinformation records, so there is a cost for creating a brand-new record. You can't simply generate whatever you want, so this is a way to limit the total amount of disinformation records that you can produce to confuse the adversary.
>>: I have more of a problem with the benefit, how you quantify the benefit.
>> Stephen Whang: This is one way to do it. There are many ways to model this problem.
>>: How did you come up with the numbers for the benefit?
>> Stephen Whang: So one approach is to simply define the benefit as the number of records within the cluster, but that is not the only way you can do it. For example, after you merge the clusters you can apply a general benefit function to all of the records that have been mistakenly merged. In this problem we assume that you can add these benefits by summing them together, and this makes the problem easier to solve analytically, so we're not claiming that this is the only way to define benefit. This is actually an open problem that has to be studied more. Yes?
>>: Is there a risk that somebody might take this and use it in a different way to smear you?
>> Stephen Whang: Yes. These disinformation techniques can actually be very harmful to companies that are doing data analytics on big data, so in that case, if you are trying to extract profiles from social media data, then…
>>: I was thinking about things like Obama wasn't really born in Hawaii; he was born in Kenya
[laughter].
>> Stephen Whang: Yes, that's a kind of misinformation. Here I am focusing on generating data that can confuse the adversary into thinking that different records refer to the same person or object, so it's along the same lines as producing misinformation about President Obama. Yes?
>>: [inaudible] have some notion I know general unknown [inaudible] where if you don't know
[inaudible] the picture on the way then [inaudible].
>> Stephen Whang: This is very case-by-case. In general, you cannot assume that you know about all of your information out in the public, but there are certain applications where you do know where your information is. A good example that I always use is: let's say you are the camera company Nikon. You have your latest and greatest product and you don't want to leak the specs of your camera. There is a site called NikonRumors.com which is dedicated to spreading rumors about the next camera produced by Nikon, and people actually post rumors and information that they think is correct. For example, you might guess that the next Nikon product is going to have 40 megapixels, a shutter speed of a few seconds, and so on, and you can also add a confidence value which reflects how confident you are in this information. There aren't too many rumor sites, so in this case Nikon can figure out where all of the rumors about its cameras are coming from. Plus, you might only be interested in these rumor sites anyway. So the bottom line is, I'm not claiming that you have perfect information about all of the data, but in real-world applications, for certain applications, it's enough that you have a good idea of where your information is located. I hope that answered your question.
>>: [inaudible] it seems like, when you get right down to it, this strategy is really for situations where other people are disseminating information about you. For the kinds of information you would disseminate about yourself in places like Facebook and Twitter, if you are putting it out there in the first place, the assumption is that you want people to know it. So would you say that it's correct to characterize the utility of this as really being about [inaudible] people trying to disseminate information about you that you don't want disseminated?
>> Stephen Whang: I think you're getting at the issue of reputation, so you don't want to post information that gives you a bad name.
>>: [inaudible] something about yourself that you just don't want people to know about you. That is the camera example, right? I mean, you as a company don't want someone else to put the specs for your latest camera out there, so you are going to go on this disinformation campaign to spam what people think they know.
>> Stephen Whang: Exactly.
>>: That probably doesn't apply so much to the things that you put out there yourself, because the assumption is that if you put it out there, you did so because you want people to know. Is that a fair assumption to make?
>> Stephen Whang: So here the problem is focused on trying to hide certain information and lower information leakage; it's not about trying to reveal information. There is an interesting startup called reputation.com which actually tries to solve the reverse problem: they want to promote some of your positive information to the public, so they have these interesting web spam techniques where you can make sure that some of your information appears in the top search results of Bing. Yeah?
>>: I just wanted you to keep an eye on the time.
>> Stephen Whang: Yeah, I will just move on. These are all interesting questions, and this is kind of an open problem. So in summary, I've proposed a new measure for quantifying information leakage, and I've proposed disinformation techniques that can be used to reduce information leakage and thus manage it. That was the second part of my talk. For related work, there has been a lot of previous work on entity resolution and privacy, and instead of just listing all of the works on one screen I thought it would be a good idea to talk about the high-level ideas. There has been a lot of work on entity resolution that focuses on accuracy and relatively less work that focuses on scalability. MSR has been a pioneer in entity resolution research, and I am well aware of the data cleaning project by the DMX group; I regularly cite papers from [inaudible], [inaudible], and [inaudible]. My work has mostly focused on the scalability aspect of ER, and has also proposed many new functionalities for ER; for example, my rule evolution work solves a new problem for ER where I want to update an ER result incrementally when the comparison rule changes frequently. For privacy,
there has been a lot of work on data publishing in the past, and again, most of that work assumes that you somehow have complete control of your data and that there are anonymization techniques you can use to make the data set private before you release it to the public. In comparison, we assume that there is no such thing as perfect privacy, and we're just trying to be as private as possible. There have been a lot of privacy measures as well. Again, to address [inaudible]'s comment, most of the measures assume that privacy is a zero-or-one concept; in comparison, our techniques assume that privacy can be anywhere between zero and one, and this kind of flexibility enables us to measure more fine-grained notions of privacy. I am also aware that MSR's Silicon Valley research lab has produced a state-of-the-art privacy measure called differential privacy. For future work, I am interested in many ideas. A
direct extension of my work is to study data analytics in a more distributed setting. You might end up performing data analytics on many machines, either because you simply have too much data to run on a single machine, or because you have privacy constraints where companies are not willing to share their information, so you're forced to run analytics on separate nodes. The issue here is to exploit parallelism as much as possible while also performing analytics in an accurate fashion. I'm also interested in social networks: nowadays you find a lot of graphs about people, and it's very important to analyze this information and identify interesting trends among people. A few months ago I got an e-mail from LinkedIn, which is a professional social network, telling me that they are solving a problem where they would like to resolve user profiles, user skills, and companies so that they can connect users with certain skills to companies that want people with those skills. In order to do this mapping correctly, you really need to resolve the people, skill sets, and companies correctly, and so
you can immediately see that there are entity resolution problems here. Currently I'm working on a fascinating topic called crowdsourcing, where the idea is to use the wisdom of the crowd to solve problems that are hard to solve using computer algorithms alone. For example, you might want to resolve a set of photos, where the goal is to figure out which photos refer to the same person. Now, if I show you two photos of [inaudible], you can probably immediately see that the two photos refer to the same person, but if you try to do this with a computer algorithm it is going to be very challenging, because it involves sophisticated image analysis, and even with those algorithms the computer is probably not going to do a good job. So the challenge is to ask humans just the right questions and use those answers to do the right clustering. Now, humans can be very slow, expensive, and error-prone, and they make mistakes both intentionally and unintentionally, so a significant challenge is to incorporate this erroneous behavior of humans when doing crowdsourcing.
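As a toy illustration of that idea (not the system from the talk), the sketch below asks several crowd workers whether each pair of photos shows the same person, takes a majority vote to absorb worker mistakes, and then clusters the pairs judged to be the same with union-find. The function name crowd_cluster and the ask_crowd callable are hypothetical; ask_crowd stands in for a real crowdsourcing call that returns one worker's yes/no answer.

```python
from collections import Counter
from itertools import combinations

def crowd_cluster(items, ask_crowd, votes_per_pair=5):
    """Cluster items using noisy yes/no answers from crowd workers.
    ask_crowd(a, b) returns one worker's answer (True if 'same entity')."""
    parent = {x: x for x in items}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    # Ask several workers about every pair and keep the majority answer.
    for a, b in combinations(items, 2):
        votes = Counter(ask_crowd(a, b) for _ in range(votes_per_pair))
        if votes[True] > votes[False]:
            union(a, b)

    # Group items by their union-find representative.
    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return list(clusters.values())

# Example with a fake, error-free crowd: photo1 and photo2 show the same person.
same = {frozenset({"photo1", "photo2"})}
print(crowd_cluster(["photo1", "photo2", "photo3"],
                    lambda a, b: frozenset({a, b}) in same))
# [['photo1', 'photo2'], ['photo3']]
```

In a real setting one would also choose which pairs to ask about rather than asking about every pair, and weight workers by their estimated reliability, which is exactly the kind of challenge mentioned above.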
So in conclusion, data analytics is a critical problem. I solved two problems within analytics, data integration and data privacy, and I've mentioned that there is an interesting connection between these two topics: the better you are at integration, the worse you are at privacy, and vice versa. So thanks for listening to my talk, and I will open it up for questions. [applause].
>>: You got all of the questions along the way.
>> Stephen Whang: Okay. Thank you. [applause].