>> Phil Bernstein: Well, welcome everybody. It's a pleasure to be introducing Luna Dong from AT&T Labs. Luna is a local. She got her PhD at the University of Washington four years ago and has been at AT&T since then, working on various problems in the area of data integration, particularly with uncertainty, where the integration is fuzzy for some reason.

She's done work on personal information management, on pay-as-you-go scenarios, where you kind of integrate as you learn more about the data, and on the record linkage problem, and is currently working on web-based integration of data sources based on observing copies of the same data propagating around, which is the project she's going to be talking about this morning. Luna.

>> Xin Luna Dong: Thank you. Hi there. This is Luna from AT&T Labs Research. I'm very happy to come back to Seattle and come here for a visit. I think I did an internship here in 2005 with Raga [phonetic].

And today I'm going to talk about our Solomon project, where the goal is to find the truth via copying detection.

So we live in an information era. There is a lot of information on the web, and we are surrounded by data. Especially after Web 2.0, we can easily know what's happening out there from blogs, from Twitter, from social network websites such as Facebook, LinkedIn, et cetera.

In a sense, we are lucky that when we have a question, we just ask the web.

However, a recent study by the British Library shows that the speed of young people's web searching means that little time is spent in evaluating information, either for relevance, accuracy, or authority.

And actually another study shows that a lot of teenagers think that as long as a web page is listed by Yahoo, not even Google or Bing, it must be authoritative.

And this is certainly not true. On the web we have a lot of low quality information: inaccurate, erroneous, out of date, and so on. And what makes it even worse is that web technology has made it extremely easy to propagate such low quality information. Everything is digital. There is definitely no need to talk face to face or make phone calls. Propagating some information is as easy as making a copy. And so rumors quickly spread.

So let me tell you several stories. The first story is an old story about United Airlines. In 2002, the Chicago Tribune had a news article about UA's bankruptcy.

And then a couple of years ago, Sun-Sentinel.com listed this article as one of its most popular articles for some reason. And then the Google News robot looked at this list and sent it to UAL alerts clients. And after that, Bloomberg.com looked at the timestamp from Google News and listed it as news confirmed that day. And this caused the UAL stock to drop from $12 to $3.

The second story is about a musician. His name is Jarre, and he was a French conductor and composer. He died two years ago, and a lot of newspapers reported his death with the following quote: One could say my life itself has been one long soundtrack. Music was my life, music brought me to life, and music is how I will be remembered long after I leave this life. When I die, there will be a final waltz playing in my head, and that only I can hear.

So this is very touching. But it was not said by Jarre. It was not said by any musician. It was said by an overcurious psychology college student from UC Davis. So close, right? And he wanted to find out what the impact of Wikipedia is on people's daily life. So he uploaded this quote after he heard of Jarre's death. And even he was shocked that a lot of quality newspapers with very good reputations, from North America, from Britain, from India, to Australia, cited his words without verifying or even referencing them.

So Wikipedia is a good place to spread rumors. And Twitter, which gained its popularity in the last couple of years, is a perfect place for rumors. So after the Japan earthquake and the tsunami, there were numerous rumors on Twitter about the toxic rain, about the deaths of some famous Japanese people, about some fake donation accounts. And this one is the most creative: Please help Japan. Earthquake weapons caused the tsunami.

And the last story also happened recently. Shirley Sherrod was an official in the state of Georgia. I guess you still remember this possibly. She gave a talk last April. And in the summer, Andrew Breitbart posted a video excerpt of her talk on his blog, showing that she once discriminated against a white farmer. This was a big thing in the country. And so the video was quickly included and commented on by major newspaper websites, and it was quickly picked up by the blogosphere and Twitter. And this time even the U.S. Government was fooled. They called Sherrod and forced her to resign.

But after they watched the full video, they realized that not only did she not discriminate, she actually helped the white farmer. So eventually Obama had to call her and apologize. Obama said that we now live in this media culture where something goes up on YouTube or a blog and everybody scrambles.

Also, Tim Berners-Lee, the inventor of the World Wide Web, said the Internet needs a way to help people separate rumor from real science.

And, by the way, I got both quotes from the web. So you decide if you want to believe it.

So not only can text be copied; structured data can also be copied. Here I'm showing 18 websites about U.S. weather. And by investigating the websites' explicitly claimed clients, partnerships, and resources, and by checking their code for citations, we found the following copy relationships, or data derivations. Here an edge from A to B means that A copies or derives its data from B.

So we can see that most of the sources derive their data directly or indirectly from weather.gov. In addition, such copying can be large scale. Here I'm showing a map of about 900 data sources. They are about book information extracted from AbeBooks.com, which is an aggregator of online book stores. So in this map, each node represents a data source, a book store, and an arrow corresponds to a copy relationship detected using our techniques. And we have some countries; each country is a cluster of data sources according to the copy relationships.

We actually found some interesting countries. For example, this country which we call Departmentstoria has several department book stores, such as thebookcom, a1books, powell's books.

And this country, which we call Textbookistan has several textbook stores such as lgtextbooks.com, textbooksrus, xpresstext, and so on.

And when we integrate data from all these data sources, we will want to be aware of the copying relationships, so we won't be biased by the copied data.

For this purpose, we worked on this Solomon project. As you may know, Solomon was a wise king of Israel who was good at telling the truth from the false.

And the goal of our project is twofold. First, we want to discover copying relationships between structured data sources. And second, we want to leverage the detected copying to improve various components of data integration. There are certainly many other benefits of detecting copying. For example, we can use it for business purposes: data are valuable, and data providers might want to protect their own rights. And we can also use it for in-depth data analysis, to understand how information is disseminated and how rumors are spread.

So our project has three components, copying detection, applications of the copying in data integration, and also how to visualize the copying relationships and how to explain our decisions.

In my talk, I will focus on copy discovery. And I will briefly talk about a couple of applications in data integration. And at the end I will show a demo for the detected copying and the discovered truth.

Okay. So let's get started with copy detection. We consider a set of objects, each of which represents a real-world entity and is described by a set of attributes. And here, for each attribute we have a single true value reflecting the real world. And our input is a set of data sources, each providing data for a subset of objects.

So in this example we have four data sources, each providing data for two books.

And we can see that there can be some missing values, and there can be some incorrect or partially correct values. For example, here instead of Peter, it should be Pete, P-e-t-e.

And for the same value, there can be different formats. So here I'm showing three different formats for the same name, Jonathan Lazar.

Actually, this pie chart shows that there is a high diversity of formatting for author lists.

And our goal is to find the copying between each pair of sources. We want to decide the probability that S1 copies directly from S2, or the other way around.

And here a copier can copy all or a subset of the data. It can add some values, or verify and even modify some of the copied values. And such values are considered independent contributions by the copier.

A copier can also reformat some copied values. But such values are still considered as copied.

So in this example, S3 copies book 1 from S1 and copies book 2 from source 2.

S4 copies from S3 but reformats the author lists.
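Just to make the input model concrete, here is a minimal sketch of how such sources might be represented; the source names follow the running example, but the actual values are hypothetical stand-ins for the slide's data:

```python
# Each source maps a data item (object, attribute) to the value it provides;
# missing items are simply absent.  Values here are illustrative only.
sources = {
    "S1": {("book1", "author"): "Pete Smith"},
    "S2": {("book2", "author"): "Jonathan Lazar"},
    "S3": {("book1", "author"): "Pete Smith",       # copied from S1
           ("book2", "author"): "Jonathan Lazar"},  # copied from S2
    "S4": {("book1", "author"): "Smith, Pete",      # copied from S3,
           ("book2", "author"): "Lazar, Jonathan"}, # but reformatted
}

def shared_items(s1: str, s2: str) -> set:
    """Data items provided by both sources -- the raw material for
    deciding whether one copies from the other."""
    return sources[s1].keys() & sources[s2].keys()

print(shared_items("S3", "S4"))  # both books
```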

Note that there are a lot of challenges in copy detection. First, sharing a lot of data does not in itself necessarily indicate copying, because two accurate sources can share a lot of correct values.

Second, not sharing a lot of data does not necessarily indicate no copying because a copier can only copy a small fraction of the data.

And, third, in many applications we have only a snapshot of the data, so it is hard to decide which source is a copier.

And finally, the shared data can also be caused by co-copying or transitive copying. So we want to distinguish those from direct copying.

For example, here S4 transitively copies from S1 and S2, and we want to distinguish that from the direct copying from S3. So far so good. Any questions? Yeah?

>>: So the notion of copying that you have is a purely syntactical [inaudible] so if I copy something and modify something, say, append the letter A [inaudible] so do you consider that a copy or not?

>> Xin Luna Dong: So if we just append a letter A, that is more of a, I mean, kind of value reformatting.

>>: But then you mentioned that you can modify the [inaudible] and independent contribution?

>> Xin Luna Dong: Yes, but if I -- so that is considered as reformatting, and reformatting is considered as copied. But if I modify the information, if I try to find out -- if I copied that Luna is from AT&T and I verified, I looked for her web page and tried to verify whether she is really from AT&T, then this verified information would be considered as independent contribution from the copier. Okay?

>>: Assume that there is a truth behind all this data.

>> Xin Luna Dong: Yes.

>>: But you're not making any assumptions that you know what that truth is, you just have the data that you find off the web in.

>> Xin Luna Dong: Yes. That's a perfect question. And yes. So, I mean, the copy detection techniques so far assume knowledge of the correct information. But in the later part of my talk, I will talk about how to find the truth by leveraging the copying relationships, and how to do this iteratively when at the beginning we know neither the copy relationships nor the truths. Yes?

>>: [inaudible] translate into a different language [inaudible].

>> Xin Luna Dong: So this is a very, very good question. And so far we haven't considered this. And I would consider translation more as a reformatting. So it basically means I got the data from you and I transform it. But I didn't really look at the real world and get the information from the real world. There is no observation from the copier. Yeah.

>>: Perhaps the word reformatting isn't ideal. It seems to be a broader concept.

>> Xin Luna Dong: That's true. That's true. I agree. Yeah.

Okay. So let me present some high-level intuitions for copying detection. Intuitively, if the probability of our observation of S1's data conditioned on S1 being a copier of S2 is much higher than the probability conditioned on S1 being independent of S2, then according to Bayes' rule, S1 is a copier of S2.
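In symbols, writing Φ for the observed data, the Bayes-rule computation behind this intuition is roughly the following (a sketch, not the paper's exact notation, with some prior probability of copying assumed):

```latex
\Pr(S_1 \rightarrow S_2 \mid \Phi) =
  \frac{\Pr(\Phi \mid S_1 \rightarrow S_2)\,\Pr(S_1 \rightarrow S_2)}
       {\Pr(\Phi \mid S_1 \rightarrow S_2)\,\Pr(S_1 \rightarrow S_2)
        + \Pr(\Phi \mid S_1 \perp S_2)\,\Pr(S_1 \perp S_2)}
```

So whenever Pr(Φ | S1 → S2) is much larger than Pr(Φ | S1 ⊥ S2), the posterior probability of copying approaches one.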

But when will this inequality hold? First, it holds when the probability under independence is very small. Consider two sources sharing some values that are very unpopular, for example, some particular incorrect values. Then this would indicate copying.

So to illustrate this point, let us do a quiz. I think a couple of you have seen this before. So don't tell the answers.

So consider two data sources that list the USA presidents, and ignore the formatting issue for now, because I put them into the same format so that you can easily compare. We see that they share the same data everywhere, and that's actually the correct list of USA presidents. So do you think one has to copy from the other? Yes. No. No. Okay. So, yeah, the two sources can both be accurate sources, and so copying is not necessary.

How about this time? This time we actually see some different data, but on the other hand we see a lot of shared values and a lot of common mistakes. So what do you think now? Yeah. Making a lot of common mistakes independently is a rare event, and so copying is likely in this case.

Okay. So second, this inequality can hold when the data from S1 is inconsistent. Yes?

>>: [inaudible] does your model assume that I actually am aware of both sources? Because in all of these cases you can never distinguish between two people having copied from each other or having copied from an unseen source.

>> Xin Luna Dong: Yes, yes, yes. So far we basically make a closed-world assumption. Yeah. Okay.

So consider a value that is provided only by S1 and not by S2. When we compute the probability under one hypothesis, we use the profile of S1's own data, but when we compute it under the other, we use the profile of S1's overall data. So if S1's data is inconsistent, then we will compute different probabilities in this case, and the inequality can hold.

So, in other words, if we know some property function of the data, such as the accuracy of the data, and the property of the overlapping data differs much more from the property of S1's own data than from that of S2's own data, then S1 is more likely to be the copier. And this is a way for us to detect the copy direction.
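As a rough sketch of this direction test, suppose (only for illustration; the real setting does not assume this) that we can label each value true or false, and that the property we profile is accuracy:

```python
def accuracy(labels):
    """Fraction of correct values; labels is a list of booleans."""
    return sum(labels) / len(labels)

def likely_copier(own1, own2, overlap):
    """own1/own2: correctness labels of values unique to S1/S2;
    overlap: correctness labels of the shared values.
    The source whose own-data profile differs more from the overlap
    profile is the more likely copier."""
    gap1 = abs(accuracy(overlap) - accuracy(own1))
    gap2 = abs(accuracy(overlap) - accuracy(own2))
    return "S1" if gap1 > gap2 else "S2"

# S2's own data is far less accurate than the data it shares with S1,
# so S2 looks like a copier that guessed the rest of its values:
print(likely_copier(own1=[True, True, False],
                    own2=[False, False, False, True],
                    overlap=[True, False, True]))  # -> S2
```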

So coming back to our quiz, consider this example where the two sources -- whoops, sorry -- these two values are actually false. So the two sources share some data and make some common mistakes, so copying is likely.

On the other hand, we see that S1 also has some correct data and some mistakes in its own data, but S2 provides very bad quality, inaccurate data for its own data. So which one looks more like a copier, S1 or S2? How many people think S1? How many people think S2? One for each. Okay. So in this example, the answer is actually S2, because it might copy the data from S1 and just randomly guess some of the values for the rest of the data items. And we see that the accuracy of its own data is very different from the accuracy of the shared data.

And how about this time? Again, they share some data and make a lot of common mistakes, so the accuracy of the shared data is very low. On the other hand, we see that S1 has all correct data for its own values, and S2 consistently makes a lot of mistakes. Which one is the copier this time? I guess it's easy this time. It has to be S1. So note that it's not the accuracy of S1 that tells us it is a copier; it is the inconsistency of S1's data that tells us it is a copier.

Okay. So let's now look at the technical details. Basically, our goal is to compute the probability of independence or dependence conditioned on our observation, and the sum of the probabilities should be one. According to Bayes' rule, we need to know the inverse probability, that is, the probability of our observation conditioned on independence or dependence. And if we assume copying on each data item is independent, then the key becomes computing the conditional probability for each data item.

So we consider the data items shared by S1 and S2, and we can categorize them into three classes: those on which the two sources provide different values, those on which they provide the same true value, and those on which they provide the same false value.

And now we consider each category. Yes?

>>: [inaudible].

>> Xin Luna Dong: Yes. We assume the same schema. We assume we have applied record linkage or we have some key values so we know they are describing the same real-world entity. Yeah.

>>: That's an assumption for now or [inaudible].

>> Xin Luna Dong: I will talk about that later. Yeah.

So let's first consider the independence case. And in this case, let's for now assume the two sources have the same accuracy, or error rate, epsilon. So the probability for the two sources to provide the same true value is the probability that each of them provides the true value, which is one minus the error rate, so (1 - epsilon) squared jointly. The probability that the two sources provide the same false value, if we assume there are n false values and they are uniformly distributed, is the probability that each source provides a particular false value, times the number of false values, which gives epsilon squared over n.

And the probability that they provide different values is one minus the above two.

Now, consider the condition of copying. And here let's assume that for each data item the copier has probability c of copying that data item. And in this case, the probability that they provide the same true value is the probability that the copier copies, times the probability that the original source provides the true value, plus the probability that the copier does not copy, times the probability that the two sources independently provide the same true value. And similarly for the same false value.

And finally, the probability that they provide different values is the probability that the copier does not copy times the probability that they independently provide different values.
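Putting these pieces together, here is a minimal sketch of the per-item likelihoods just described and the resulting Bayes update; the equal error rate ε, the n uniformly distributed false values, the copy probability c, and the 0.5 prior are all simplifying assumptions for illustration:

```python
def item_likelihoods(eps: float, n: int, c: float):
    """Per-item probabilities of the three observation classes,
    under independence and under 'S1 copies from S2'."""
    same_true_ind  = (1 - eps) ** 2          # both independently correct
    same_false_ind = eps ** 2 / n            # both pick the same wrong value
    diff_ind       = 1 - same_true_ind - same_false_ind

    same_true_cp  = c * (1 - eps) + (1 - c) * same_true_ind
    same_false_cp = c * eps       + (1 - c) * same_false_ind
    diff_cp       = (1 - c) * diff_ind       # differ only if the copier didn't copy

    return (same_true_ind, same_false_ind, diff_ind), \
           (same_true_cp,  same_false_cp,  diff_cp)

def copy_probability(k_true: int, k_false: int, k_diff: int,
                     eps=0.2, n=10, c=0.8, prior=0.5):
    """Posterior Pr(copying) given counts of same-true, same-false,
    and different shared items."""
    (ti, fi, di), (tc, fc, dc) = item_likelihoods(eps, n, c)
    l_ind = ti**k_true * fi**k_false * di**k_diff
    l_cp  = tc**k_true * fc**k_false * dc**k_diff
    return l_cp * prior / (l_cp * prior + l_ind * (1 - prior))

# A single shared false value is already strong evidence:
print(copy_probability(k_true=2, k_false=1, k_diff=1))  # about 0.92
```

With these example numbers, the likelihood ratio for a shared false value (0.16 versus 0.004) is far larger than for a shared true value (0.77 versus 0.64), which is exactly the next point.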

So this inequality shows that sharing the same values serves as positive evidence for copying, and providing different values serves as negative evidence for copying. And actually we can prove that sharing common false values is much stronger evidence for copying than sharing common true values.

So if we remove the assumption that the two sources have the same error rate, then this is what we have. So note that this time we have different probabilities for different copy directions. And that's how we decide the copy direction.

For our example, we suspect copying between S1 and S3 because they share the same false author. And we suspect copying between S3 and S2 because they share the same false book name, and similarly for S3 and S4.

Note that the correctness of the data in itself is not enough in this example to show the copy direction. Okay. So this is the core technique, and we have made several extensions. First, we extended it by considering additional evidence.

In addition to correctness, we also consider the formatting of the data and the coverage of the data. In our example, we find that S3 uses inconsistent formats for the author lists, so it is more likely to be a copier. And, on the other hand, the author lists provided by S4 are subvalues of those provided by S3, so it is more likely to be a copier of S3.

Second, we extended this by considering correlated copying. Recall that we made the assumption that copying on different data items is independent. But this is seldom true in practice. In practice, a copier either copies by object, copying all attributes of the objects it copies, or copies by attribute, copying a couple of attributes for all of the objects it provides. So we call these by-object copying and by-attribute copying. And we can actually benefit from this correlated copying.

Consider these two cases, where in both cases we have five objects, the key attribute, and four other attributes. So here S indicates the two sources provide the same value, and D indicates they provide different values. And in both cases they provide 17 same values and 8 different values.

So which one looks more like copying? We see that in the first case, S2 seems to provide the same data for all attributes of objects O1 to O3, and for the other two objects it provides different values. But in the second case, the shared values are dispersed among the objects. So the first case is more likely to involve copying.
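One crude way to see the difference (just an illustration of the intuition; the actual approach folds it into the Bayesian analysis, as described next) is to check how much the agreement concentrates into complete objects:

```python
# Each row is one object; True marks an attribute on which S1 and S2 agree.
# Both layouts have 17 agreements and 8 disagreements, but in the first
# the agreement clusters into complete rows, which independent sources
# would rarely produce.
def fully_shared_objects(agreement_rows):
    """Number of objects on which the two sources agree on every attribute."""
    return sum(all(row) for row in agreement_rows)

clustered = [[True] * 5,
             [True] * 5,
             [True] * 5,
             [True, False, False, False, False],
             [True, False, False, False, False]]
dispersed = [[True, True, True, False, True],
             [True, False, True, True, False],
             [True, True, False, True, True],
             [True, True, True, False, False],
             [True, False, True, True, False]]

print(fully_shared_objects(clustered), fully_shared_objects(dispersed))  # 3 vs 0
```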

So we massaged these intuitions into the Bayesian analysis. And on the other hand, we also extended this by considering updates, and we use a Hidden Markov Model to reason about the update patterns. Okay.

So far, all this is about reasoning between a pair of sources. Recall that we said there can be co-copying and transitive copying, and we want to distinguish those from direct copying. And this requires some global analysis. So let me first show you why this is hard.

Consider these three cases, in all of which we have three sources, where S1 provides values V1 to V100, and where V81 to V100 are popular values.

On the other hand, S2 copies V1 to V50 from S1 in all three cases.

In case one, S2 independently provides V101 to V130, and S3 copies V51 to V100 from S1 and V101 to V130 from S2. So it is multi-source copying.

In the second case, S3 copies V21 to V70 from S1. So it is co-copying.

In the third case, S3 copies V21 to V50 from S2 and independently provides V81 to V100. So it is transitive copying.

So these are three different cases, but local copying detection would obtain exactly the same results. So how can we distinguish them? An obvious thought is to look at the copy probabilities we compute. But unfortunately that does not work here, because once two sources share a lot of data, if there is copying, it is very likely that the Bayesian analysis will compute a probability of one.

How about counting the number of shared values? It doesn't work either, because the numbers are exactly the same in these three cases. How about comparing the sets of shared values? Again, it doesn't work, because V21 to V50 are shared in both the co-copying and the transitive-copying cases.

So we say that reasoning about aggregated information alone is not enough, and we need to reason about each data item in a principled way.

So our global copying detection has two steps. The first step relies on the observation that transitive copying and co-copying typically can be inferred from direct copyings. So it tries to find a set of copyings, R, that can significantly influence the rest of the copyings. And here the influence is measured by the accumulated difference between the probability computed by local analysis and the probability conditioned on this set R.

In the second step, for the rest of the copyings, instead of directly saying no, this cannot be direct copying, we actually adjust the copying probability to the probability conditioned on this R. And according to this, we decide if it is direct copying or indirect copying.

And let's see how it works for our examples. So in the first case -- I will skip the details of how we find R -- let's assume R contains S3 copying from S1. And now, if we reexamine the copying between S3 and S2, we find that among the thirty shared values, S3 cannot have copied them from S1. So the probability conditioned on this R is the same as the original probability, and so we consider that there is still copying between S2 and S3.

In the second example, assume the same R. But now, for the thirty shared values, S3 can actually have copied them from S1. So the conditional probability is much larger than the original probability, meaning they are not strong evidence indicating copying. So we remove this copy relationship.

In the third case, suppose R contains S3 copying from S2. Then we find that for values V21 to V50 shared between S1 and S3, S3 can copy from S2. And for values V81 to V100, S3 can actually provide them independently, because they are popular. So again, we lack strong evidence indicating copying, and we remove the copy relationship.
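A very simplified sketch of that second step, assuming we can mark each shared value as attributable to the copyings already in R (copyable through R, or popular enough to be provided independently):

```python
def adjusted_copy_evidence(shared_values, attributable_via_R):
    """Keep, as evidence for direct copying, only the shared values that the
    suspected copier could NOT have obtained through the copyings in R.
    Illustrative only: the real algorithm adjusts Bayesian probabilities
    rather than simply filtering values."""
    return [v for v in shared_values if v not in attributable_via_R]

# Case 1: S2 and S3 share V101-V130; none are attributable via R = {S3->S1},
# so the evidence stands and we keep the copying between S2 and S3.
print(len(adjusted_copy_evidence(range(101, 131), set())))             # 30
# Case 2: S2 and S3 share V21-V50; all are attributable via R = {S3->S1},
# so no strong evidence remains and we remove that copy relationship.
print(len(adjusted_copy_evidence(range(21, 51), set(range(21, 51)))))  # 0
```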

Okay. Let's see some experimental results using the weather dataset, extracted from 18 weather websites for 30 major USA cities and collected every 45 minutes for a day. And from this golden standard we generated a silver standard. For instance, there are two sources whose data are not available, for technical reasons or for business reasons -- they sell the data -- so we removed the copy relationships to them, and we added some copyings between the co-copiers and the transitive copiers of those two sources.

And then we added some copy relationships for the data sources that did not provide any information about this, like popular shape, et cetera, by comparing the data where we saw it as most appropriate. And we removed some of the partnerships that are not supported by the data.

We actually tried to do better by e-mailing or calling the websites. But as you can imagine, they didn't tell us any more information. And this actually shows the importance of the research.

So we measure the precision, recall, and F-measure of our results. Precision measures, among the detected copyings, how many are correct. Recall measures, among the real copyings, how many are detected. And the F-measure is the harmonic mean of precision and recall.
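For reference, a minimal sketch of these three measures over sets of detected and real copy relationships:

```python
def precision_recall_f(detected: set, real: set):
    """Precision, recall, and F-measure (harmonic mean) for detected
    copyings against the gold set of real copyings."""
    correct = len(detected & real)
    precision = correct / len(detected) if detected else 0.0
    recall = correct / len(real) if real else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall > 0 else 0.0)
    return precision, recall, f
```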

And for this example, there are 11 real copyings in the silver standard, and we detected 8 of them, plus three more copyings. So both precision and recall are .79 for our global detection.

Local detection on the other hand has a much lower precision because it does not remove the transitive or co-copying.

If we do not consider the correlated copying, then we have a very low recall. On the other hand, note that if we only consider the correctness of the data, we have better results than when considering the other quality measures. Why? Because in this particular dataset, for values like temperature and humidity, there is not really a true-false notion for the values; it's more a notion of popularity. But this method actually enforces that only one value is correct and all other values are false, and that's why it has a higher recall. But for some synthetic data where we do have a true-false notion, we observed that the enriched method actually has a higher F-measure than considering only accuracy.

Okay. So let's now consider how we can leverage the detected copying for various aspects of data integration. So when we integrate data from several different data sources, we face three challenges. First, different sources can describe the same domain using different schemas. This is like when two persons organize their stationery drawers: they can organize the same things in very different ways. So to resolve structural heterogeneity, we want to find the one-to-one mapping between the component elements from different sources.

Second, different sources can describe the same real-world entity using very different attribute values. This is like one person calling this paper-cutting tool scissors and another one calling it paper scissors. And we want to be aware that they actually describe the same real-world entity.

Third, different sources can provide conflicting data. This is like one person calling this scissors and another one calling it glue. So we need to decide which value is correct among the conflicting values.

So before Solomon, existing solutions all had an important assumption that all of the data sources are independent. But knowing the dependence relationships can actually add one more dimension to different aspects of data integration. For resolving data conflicts, we now can discover the truth with awareness of the copying and ignore the copied data. We can also integrate probabilistic data by removing the unrealistic assumption that all the data sources are independent. For resolving instance heterogeneity, we can improve record linkage by distinguishing between wrong values and alternative representations.

And for resolving structural heterogeneity, we can optimize query answering by ignoring the copied data, or the sources that are widely copied, and we can also improve schema matching with awareness of the copying.

And finally, we can recommend trustworthy, up-to-date and independent sources.

In this talk, I will focus on two applications for resolving data conflicts. Recall the earlier question: when we detect copying, we consider sharing false values as a strong indicator of copying. However, in practice we often do not know which values are true and which values are false, and this is especially tricky when we have copiers.

So in this example, we have three data sources, each providing affiliation information for five database researchers. And if we do naive voting, taking the value provided by the majority of data sources, we can actually do pretty well: for four out of five database researchers we get the correct value. And note that here S1 provides all correct values.

However, now consider that we have two sources that copy all or most of their data from S3. And now, if we still do naive voting, S3 can actually dominate the results, and we will make mistakes for three out of five database researchers.

So our solution starts with copy detection, assuming that all values have the same probability of being true. And we see that in the first round, we compute a high probability of copying between S3, S4, and S5 because they share a lot of values. And we actually also compute a high probability of copying between S1 and S2, because they also share three values. According to these copy relationships, we can do voting, ignoring the copied data.

So now, let's consider voting for Carey's affiliation. And we see that here each big circle represents a value; the larger the circle, the higher the vote count. Each small circle represents a data source, a provider of that value; the larger the circle, the more accurate the source.

And note that even in the first round, because of the copying relationships, the vote count of BEA is not three anymore. How do we compute the vote count? We consider the sources in a particular order; let's say now it's S3, S4, S5. And for S3, let's say its vote count is one. For S4, the probability that it provides the value independently is one minus the copying probability times the probability that the source copies each particular data item; so it's .2. For S5, the probability that it provides the value independently of S3 and S4 is .2 squared. So the vote count adds up to 1.24. But it still wins the voting.

And according to the voting results for all of the data items, we can compute the accuracy of each data source. And in later rounds, for more accurate sources we can assign higher vote counts. And we can also recompute the copying probabilities. Actually, in the second round we reduce the probability of copying between S1 and S2, because they share only correct values. And we note that in later rounds the accuracy of S1 gradually increases and that of S3, S4, S5 gradually decreases. So starting from round four, the vote count of UCI starts to be larger than the vote count of BEA. Yes?

>>: [inaudible] slightly weird question. In theory this is all fine. But wouldn't the fact that there's one source that's heavily copied by others, in fact, be additional evidence for the accuracy of that source in practice? I mean, it sounds like it would.

>> Xin Luna Dong: But actually, if the source is heavily copied, even from the first round we will compute low vote counts for the copied values. And then later on, if there are some wrong values, then the accuracy of that source will be quite low. And then in later rounds the accuracy will just decrease gradually, and eventually, exactly as shown here, the accumulated vote count will not be that high. And in our experiments, we actually didn't observe such cases affecting the results.

>>: Do you have examples of an inaccurate [inaudible] but heavily counted and copied sources?

>> Xin Luna Dong: Yes, a lot. A lot. Yeah. And I can show you later.

>>: Okay.

>> Xin Luna Dong: Yeah. Okay. And this trend basically continues, and then later on we see that the vote count of UCI keeps increasing, until in the last round we compute zero probability of copying between S1 and S2, we compute an accuracy of nearly one for S1 and a very low accuracy for S3, S4, S5, and we are able to find the correct affiliations for all of the database researchers.

Note that there is an interdependence between truth discovery, accuracy computation, and copying detection. Our solution conducts copy detection, truth discovery, and source-accuracy computation iteratively until the results converge. And we can prove that if we do not consider accuracy, then this process will definitely converge. And if we do consider accuracy, our observation is that it converges when the number of objects is much higher than the number of data sources.
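Structurally, the iteration might be sketched as below; the three component functions stand for the copy detection, voting, and accuracy computation described above and are left abstract here:

```python
def detect_copying(sources, accuracy):
    """Placeholder: pairwise copy probabilities (the Bayesian analysis above)."""
    ...

def vote_with_copying(sources, accuracy, copying):
    """Placeholder: per-item truth selection with discounted vote counts."""
    ...

def compute_accuracy(sources, truths):
    """Placeholder: fraction of each source's values matching the truths."""
    ...

def solomon_fusion(sources, max_rounds=100):
    """Iterate copy detection, truth discovery, and source-accuracy
    computation until the selected truths stop changing.  This is a
    sketch of the control flow only, not the actual implementation."""
    accuracy = {s: 0.5 for s in sources}   # uninformative starting point
    truths_prev = None
    for _ in range(max_rounds):
        copying = detect_copying(sources, accuracy)
        truths = vote_with_copying(sources, accuracy, copying)
        accuracy = compute_accuracy(sources, truths)
        if truths == truths_prev:          # converged
            break
        truths_prev = truths
    return truths, accuracy, copying
```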

Okay. So far this process is good, but it is offline. And it is not appropriate for web data, given the sheer volume of the data and the frequent updates to the data. And now the question is whether or not we can do this data fusion during query answering, in an online fashion. And we propose the Solaris system, which I'm going to illustrate with this example, where we are trying to answer the question: where is AT&T Shannon Research Labs?

And here we assume there are nine data sources, providing three different values: New Jersey, New York, and Texas. And we assume we know the accuracy of the sources and the copying relationships as input.

So Solaris, as it probes each new source, will return the currently selected true value according to the probed sources. And it will return a probability that the value is correct, and the minimum and the maximum probability of this value.

And we see that as it probes more sources it might change its mind. So now the answer is not Texas anymore but becomes New Jersey as it probes more sources. And we also see that the probability of the value gradually increases.

And at this point, after we have probed eight sources, we are confident enough that New Jersey must be the correct value. So we will stop here even without probing all of the sources.

So there are many challenges in building this system. First, we want to quickly find answers as we probe new sources. Second, we want to be able to compute the probabilities quantifying our confidence that this value is correct. And third, we want to order the sources in such a way that we can quickly find the correct answers and can quickly terminate.
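The control flow of such online fusion might be sketched as follows; the source ordering and the probability bounds are the hard parts and are left abstract here (this is an illustration of the idea, not the actual Solaris algorithm):

```python
def fuse(probed):
    """Placeholder for the fusion computation over the probed sources,
    given their accuracies and the copying relationships.  Returns
    (best value, probability, min possible, max possible); dummy
    numbers here, purely for illustration."""
    return "New Jersey", 0.95, 0.91, 0.99

def online_answer(sources_in_order, confidence=0.9):
    """Probe sources one at a time, reporting the current best value and
    its probability bounds, and stop as soon as we are confident enough,
    without probing the remaining sources."""
    probed = []
    for source in sources_in_order:
        probed.append(source)
        value, prob, lo, hi = fuse(probed)
        yield value, prob, lo, hi       # e.g. Texas at first, then New Jersey
        if lo >= confidence:            # confident after, say, eight sources
            return
```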

And this all becomes extremely hard, much more tricky, when we have copying between the data sources. So you are welcome to check out the details in our paper. And if you go to VLDB, you are welcome to come to -- I think it's Thursday's session on data integration -- for our talk on this project.

Okay. Next let me give a demo of our system. So here I will demo not only the copy detection results and the truth discovery results, but also how we are trying to provide appropriate visualization to help users understand the results, and how we can explain our decisions to the users. And the motivation is actually that originally we used this [inaudible] to represent the results, and as you can see, it is very hard to figure out what's happening out there. And I gave it to my colleague, and he generated a map for me, as you saw at the beginning of the talk.

So this is the entry page of Solomon, and it shows a map of the AbeBooks data sources. And now let's consider three scenarios. In the first scenario, let's assume you are a data analyst, and you want to understand the quality of the data sources and the copying relationships between the data sources. So in the map you can see that each node is a data source: the larger the font, the higher the coverage, and the darker the color, the more accurate the source. And you can search the sources by quality. For example, we can search for high coverage sources, let's say.

So here it returns six sources. And we can see that they are all on this map.

And a lot of them are in this Departmentstoria country. And if we click on one of them -- actually, from the table, we can see different quality measures of the data sources. And if we click on one of them, we can see more details. And we also use a face to show the quality of the source. For example, here we see a quite friendly face, showing that this is actually a high quality source. And for this one, we see a kind of evil face, because of the low accuracy of the source.

We can also search the sources by name. For example, we can search textbook -- whoops. And here we see that there are a bunch of sources on the map, and a lot of them are from the Textbookistan country. However, textbook data sources do not have to copy from each other.

So we also see some from this country, we also see some from the islands, and we see there are even some outside this map, because they are independent.

And we can search for a particular data source, for example thebookcom. And here it shows three data sources with thebookcom in the name: thebookcom, free postage at the book company, and free US air shipping at the book company. Intuitively, they seem to be different accounts for the same company. And actually, we do find copying relationships between them. So here, if we click on one of them, we can see that copying is likely and the copying probability is .5 in each direction. And here we are showing the data items that are likely to be copied.

And note that we didn't find copying between free postage and free air shipping. And if we wonder why, we can search the copy relationship. Uh-oh. Thebookcom. Okay. So here there are three pairs of data sources with thebookcom, and for the pair that we are interested in, we can see that it says they are independent and they share very few common values. And actually, by investigating their data, we found they are disjoint. And so that's actually why we don't know the copy direction: we don't know if they split the data from thebookcom, or if thebookcom actually unioned the data from the two data sources.

Okay. Now, in the second scenario, let's assume you are a data provider, and you want to look for the copiers of your data sources. So let's search -- oh. Uh-oh. Okay. So we find this book store. And here -- okay. So let me click on one of them, and we can see the quality measures of the source. And we can also find all of the copiers, direct or indirect.

And so you might wonder why we think one of the sources is your copier. And here, let's click on one of them. Note that it's actually interesting that we have [inaudible] books, we have owls books, and then we have beagle books. So a lot of animal book stores here. And so if we click on one of them, we can see, okay, here are the copied items, and we can ask for an explanation of the decision.

And if this is too hard to understand, or too long to understand, we also show a visualization. And here each big square corresponds to a data source, and the overlapping area corresponds to the overlapping data items. And here we see three squares. The inner one shows the data items on which they provide the same value in the same format. The middle one shows the data items on which they provide the same value with different formats. And we divide the popular values and the unpopular values. And the outside one shows those on which they provide different values.

And here we see that among the overlapping data items, for a lot of them the two sources provide the same value, and there are some common mistakes, and that's why we decide there is copying.

And also we see that the coloring of the squares shows the quality of the data sources. And these two sources actually have very similar quality, and we actually do not quite know the copy direction. Okay.

In the third scenario, let's assume you are an information user and you want to find information about books. So let's say we want to search for books by Bernstein. And it shows three of them: two by Arthur Bernstein, one, two, and one by Phil Bernstein. And if you click on one of them, you can actually see the copying between sources on this particular book, and you can understand how the information about this book has propagated. And you might wonder why this particular author list for this book is the correct one. Is that the correct one?

>>: Yes.

>> Xin Luna Dong: Okay. So if you click on one of them, it actually sort of describes -- explains why this is considered the correct answer. For this one, it is easy, because the correct value is provided by a lot of data sources. Let's try another one. And let's try this one, actually. Uh-oh. Okay.

So you see that here, this is considered the correct author list, and it is represented by this blue circle. And we see that there are a lot of other author lists provided. For example, this one has only one author, and it has six providers. And why do we consider this value, which is provided by more sources, as being wrong? This is because we see that among those six sources, five are considered copiers. And that's why its vote count is only 6.5, while this value actually has vote count 23.5.

Okay. So for related work, there has been a lot of work on copy detection for texts, for software programs, for images and videos, and also for structured sources.

We gave a tutorial at this year's SIGMOD on such work. And our work also seems very related to data provenance. However, data provenance assumes knowledge of the provenance or lineage and focuses on how to effectively present and retrieve such information, whereas we consider detecting such copying, such lineage information, and how to apply it in various aspects of data integration.

And several take-aways. Copying is common on the web, as we have seen from all the examples. Copying can be detected using statistical approaches, as we have shown. And knowing the copying relationships can benefit various aspects of data integration.

So this concludes my talk. And I would like to thank all the contributors to this project. I found that they have all spent more or less time at AT&T, varying from like 35 years to like 35 hours, and I'm ranking them according to the time they have spent at AT&T. And this concludes my talk. You're welcome to try our demo. And thanks a lot.

[applause].

>> Phil Bernstein: So you didn't say anything about the time relationship between copies. If I'm crawling the web on a regular basis, then of course -- and if I'm clever about crawling, where I crawl websites that change frequently --

>> Xin Luna Dong: Yes.

>>: More often than I crawl websites that are static.

>> Xin Luna Dong: That's true.

>>: Then I can see version histories.

>> Xin Luna Dong: Yes, yes, yes.

>>: And that would seem to be a very strong hint.

>> Xin Luna Dong: Yes, yes, yeah. We actually have some work on this. So what I presented here is about static information. And we have extended it to consider updates of the data. And we use the HMM model to reason about such information.

>>: And does that provide better --

>> Xin Luna Dong: Yes.

>>: Better information about what the direction of the --

>> Xin Luna Dong: Yeah, it provides not only better information but can actually even improve the accuracy of the detected copying. Because later on we might find that, although several data sources share data, actually one copies from the other and the second one copies from the third, and we realize the copying is transitive.

>>: I see.

>> Xin Luna Dong: Yeah.

>>: So in this example with the book company [inaudible] a copying license here and there [inaudible] these because they're [inaudible].

>> Xin Luna Dong: Yes, yes.

>>: And I think at that point you said it's not clear whether this is the primary source --

>> Xin Luna Dong: Yeah.

>>: Or whether this is the primary sources --

>> Xin Luna Dong: That's true. That's true.

>>: But doesn't the fact that they're disjoint indicate that there was a splitting? If they were developed independently and then put together, you would expect that there would be at least some overlap between the two.

>> Xin Luna Dong: That's actually a very good point. Yes, I think this would have been an important intuition, yeah. We didn't think about that yet. So far we basically considered the quality of the information. And it happens that the two sources have the same accuracy -- certainly different coverage, but the same accuracy, the same formatting pattern. Yeah. Yes?

>>: [inaudible].

>> Xin Luna Dong: Yes.

>>: All your technical data seem to be the structured data.

>> Xin Luna Dong: That's true, yeah.

>>: So all [inaudible].

>> Xin Luna Dong: So there has actually been a lot of work on copy detection for texts. For example, Hector's group at Stanford has done a lot of work there. And what I found is that for texts, the emphasis so far is on how to do this quickly: how to find fingerprints of the texts and how to find the copying very quickly.

So the current techniques, a lot of them are scalable, but they are not very robust. Meaning that if I copy and then reword, then it is hard to detect the copying.

>>: [inaudible] the example you mentioned [inaudible].

>> Xin Luna Dong: Yes. Yeah, yeah. It definitely makes it possible to find the copying. But on the other hand, there are also some extra challenges. For example, how can we find the original place for the rumor?

>>: [inaudible].

>> Xin Luna Dong: It's a little tricky because, I mean, we can find some texts with high overlaps, but not necessarily with copying. Yeah. But definitely time stamps help. Yes.

>>: There seems to be a combinatorial problem there, though, because when you're looking at text fragments, how big are the fragments?

>> Xin Luna Dong: Yes. Yeah. Yeah. That's true. Yeah.

>>: But you say it's scalable.

>> Xin Luna Dong: They are -- it's scalable because they use an index. And so, for example, they will generate fingerprints for the -- so it's scalable for two reasons.

>>: They're indexing very small fragments.

>> Xin Luna Dong: Yes.

>>: Once they find overlap, then they can expand?

>> Xin Luna Dong: That's true. That's true. They index on fingerprints, which are typically a summary of a subset of substrings, and then they index on those. And they will explore a pair of sources only if they share at least a few fingerprints.

>>: I see. So this all depends on how good the fingerprinting is.

>> Xin Luna Dong: Exactly.

>>: And the date matches.

>> Xin Luna Dong: Yes, yes.

>>: Then you can drill deep as long as they have a reason to believe --

>> Xin Luna Dong: That's true. That's true. Yeah. And also, I mean, currently, as I said, the fingerprints are all syntactic. So if there is any rewording, then it's very hard. And also, I mean, if I just randomly change some of the letters -- I mean, I think currently they can still do a pretty good job on that. But it makes it much harder.

>>: I think the whole [inaudible] with real text if you randomly change a few of the letters, it becomes useless, right? This will not look like readable text any more.

>> Xin Luna Dong: I mean --

>>: [inaudible].

>> Xin Luna Dong: It depends. It sort of depends because it could be some like misspellings. And with some misspellings you can still sort of read it.

>>: These are random?

>> Xin Luna Dong: Yeah. Well, it could be random. But I mean, I don't change the whole word; I change a couple of letters.

>>: Instead of changing letters you just transpose a couple of letters. You can do that [inaudible].

>> Xin Luna Dong: Yes. Yes. Yes.

>>: That people will read it with no problem.

>>: [inaudible].

>>: [inaudible].

>>: In practice, yes. However, unless you do that on a large scale, the fingerprints that happen [inaudible] are actually rather robust. Because those are all based on Jaccard overlap, they tend to perform very well if you only change a small portion of [inaudible].

>> Xin Luna Dong: So actually, I want to answer your question about whether there were some low quality sources whose data are copied. Unfortunately I don't have the slides here, but for the AbeBooks data sources, we found the 10 sources that are most copied. And we found their accuracy is actually not that high. And if we do naive voting only on these 10 sources, our results have only like 50 percent accuracy.

>>: So which dataset was this?

>> Xin Luna Dong: AbeBooks.

>>: Okay. Got it.

>> Xin Luna Dong: Yeah.

>>: Can you give an additional example [inaudible].

>> Xin Luna Dong: That's true. That's true. Yeah. Yes?

>>: [inaudible] students plagiarizing --

>> Xin Luna Dong: Huh-uh.

>>: [inaudible].

>> Xin Luna Dong: Actually, don't do that anymore, because I heard -- I mean, there is a software tool from Stanford for finding plagiarism in code. Sorry. And even for essays. And I know a lot of faculty members are actually aware of the tools and actually run those tools. And there is copy detection for software, for software programs. And because software code is kind of semi-structured, they try to leverage the structure or the data flow or the parse tree of the code.

>>: Again making it robust against certain kinds of obvious transformations.

>> Xin Luna Dong: Yes. Yes.

>>: [inaudible]. [laughter].

>> Xin Luna Dong: Like renaming -- I mean, they can perfectly handle that. But, on the other hand, the more robust -- I mean, the more tolerant you are of some changes, the more likely you are to have some false positives. Yeah.

>>: Have you found commercially valuable applications of this work at AT&T?

>> Xin Luna Dong: So two parts. One part is that at AT&T we actually buy data, purchase data, from some data sources. And this actually motivated the project, where we want to recommend good data sources. I mean, a lot of sources purchase data from another source, and it's questionable whether it's worthwhile to purchase data from all of the sources. And the second application is that at AT&T we have a lot of data, and there is actually a lot of copying and derivation of the data. And, as you can imagine, we have a lot of low-quality data, and the question is how we can use the copy relationships, how we can apply the truth discovery techniques, to find the true information about, like, the companies, about some clients, about some customers, and so on.

>>: So, I mean, it's a well known problem in companies with regard to data [inaudible], people pulling data from all kinds of places --

>> Xin Luna Dong: That's true.

>>: And then you don't really know which is authoritative and which --

>> Xin Luna Dong: That's true.

>>: And then people annotate them and they start adding things that are authoritative for a small amount of the data, but the rest of it is actually derived.

>> Xin Luna Dong: Yes, yes, exactly.

>>: [inaudible].

>> Xin Luna Dong: Yes.

>>: Which is the annotations and which is the stuff that's been copied?

>> Xin Luna Dong: Yeah, yeah, yeah. A lot of the databases or tables are duplicates.

>>: Those are good examples.

>> Xin Luna Dong: Yeah.

>> Phil Bernstein: Okay. Well, again, thank you.

>> Xin Luna Dong: Thank you for coming.

[applause]
