>> Christian Konig: Hi. It's my pleasure to introduce Daisy Wang from the University of California, Berkeley. Her thesis is on probabilistic databases and how you manage them in a relational database, and she's going to be telling us more about it. So over to you.
>> Daisy Zhe Wang: Thanks, everyone, for coming. Today I'm going to talk about extracting and querying probabilistic information in BayesStore. In the past few years the number of applications that need to deal with large amounts of data has grown remarkably. The data underlying these applications is often uncertain, as is the analysis, which usually involves probabilistic inference and modeling. One such application is information extraction. The amount of text data in the world is significant and growing fast, both in enterprises and over the Web. Information extraction is one type of text analysis which extracts entities such as person names and companies from a piece of text such as news articles or e-mails. So this is one piece of text from a New York Times article. Information extraction uses probabilistic models to extract entities such as a person, here Harold, or a company, such as McGraw-Hill, from this piece of text. Moreover, these probabilistic extraction models produce probabilities and uncertainties for possible extractions. For example, they might generate alternatives of extracting Big Apple as a company with some probability, or Big Apple as a city with a different probability. So this kind of extraction generates probabilistic data or entities. And one possible query over this probabilistic data could be: which New York Times articles mention Apple as a company with the top-k highest probability?
Another application that also generates a large amount of probabilistic data is sensor networks. The sensor readings are probabilistic. They are full of missing values and erroneous values because of, for example, interference in the signal, low battery life, and so on. If you are to model these uncertainties using a Gaussian distribution, like the bell curve here, one possible query you can ask on top of this probabilistic data is: what is the Gaussian distribution of the average temperature in a certain area? Other applications that require this kind of probabilistic data analysis include data integration systems, where the entity resolution and schema mapping are usually probabilistic, and social networks, where probabilistic inference and modeling are used to classify users and predict user behaviors and so on.
In this talk I will first describe the state of the art: what is the current approach in industry to perform probabilistic data analysis, and what are the problems. Then I will describe my approach to performing probabilistic data analysis, which is implemented in a probabilistic database system called BayesStore. I will go into the details of the new algorithms that we invented for probabilistic relational queries, and then talk about techniques to scale up these algorithms. I will finally conclude and talk about future work.
So the standard approach in industry to perform probabilistic data analysis looks like this. You extract uncertain data from different real-life systems, such as from text or sensor networks, and put it into a relational database. At data analysis time, all the raw data is extracted from the relational database, put into a file, massaged into the right format, and fed into statistical machine learning packages, such as R and MATLAB.
Inside of these packages, probabilistic models are learned on top of the uncertain data, and a number of data analysis tasks are performed, such as inference, data cleaning, and aggregation. And the result is put back into the relational database as analytic result tables, which the user is then going to query.
There are at least two problems with this standard approach. The first one is the performance problem. As you can see, the data is stored inside of the database, whereas the data processing and analysis is done outside of the database, so there is an expensive data transfer cost when the dataset is large. Moreover, because all the computation is done outside of the database, all the benefits of the database such as optimization, parallelization, and indexing are not utilized, because the database is only used as data storage. So this is the performance problem. A second problem is information loss. In a standard database, only deterministic data can be stored. So in the standard approach, only the top-1 highest probability analytic result is stored in the database for people to query, which prematurely disregards all the uncertainties and probabilities. Thus the information loss. I will have one example illustrating this point.
So this is the top-1 extraction of this piece of text from the New York Times. As we can see, this top-1 extraction mistakenly missed the entity Big Apple as the city. But fortunately these probabilistic extraction models give alternative possible extractions, and in the top-k extractions Big Apple is correctly extracted as the city. For different documents this correct extraction comes at a different k -- the k is different for different documents -- but usually it comes within the first few top extractions. So imagine that you have a deterministic database storing only the top-1 extraction and you run a query on top of it, for example, return all the articles with city equal to Big Apple. Because the top-1 extraction missed Big Apple as the city, it will return no documents. But if you are to consider the uncertainties and probabilities, then this query will return a set of articles that have Big Apple as a city, in descending order of probability. Now, these results might not be the top-1 extraction. But they might well be the correct answer and what the user is asking for. In fact, it's what a search engine provides: a ranked list of results ordered by probability. So this example illustrates that only querying over the top-1 extraction is not sufficient. What we want is to query over the full distribution of probabilistic data.
And the problem of storing and querying over probabilistic data has been the focus of the probabilistic database literature for the past ten years. Before the BayesStore project, or before 2007, a number of probabilistic data representations had been proposed. One is from Dalvi and Suciu from the University of Washington here, where the probability is attached to the rows in each table. For example, the probability that this tuple exists is .8. Another representation was proposed by the Trio project at Stanford, where they assume a much finer granularity of probability, attaching the probability to the attribute values instead of the rows. But both of these early representations assume independence between the uncertain values.
So it's very hard for them to represent a high-dimensional distribution with complex correlations between these uncertain values. In fact, you could try, for example, to represent the distribution of all possible extractions from information extraction using this kind of representation by storing each possible extraction with its associated probability, one by one, for all possible extractions. But if you have a sentence with 30 tokens and each token has four possible labels -- here person, company, location or other -- the number of possible extractions is 4 to the power of 30. And if you have lots of sentences in a corpus, it's impossible for you to store all possible extractions.
Now, modeling the uncertainty and reasoning with it has been the focus of the statistical machine learning literature. And they came up with probabilistic graphical models to compactly represent high-dimensional distributions with correlations. So graphical models have been studied as an efficient representation of these high-dimensional distributions. This is an example of a graphical model. A graphical model contains random variables, which are these nodes in a graph, and the correlations between the random variables, which are the edges between these nodes. So here these random variables are over the piece of text that we've seen before in the New York Times article. And this correlation is basically saying that the label of a specific token is correlated with the token itself and the previous label. So this is a set of local correlations, which is converted or factorized from the high-dimensional distribution that we've seen earlier, based on the factorization and conditional independence that exists between these random variables. So if now we have the same number of tokens in a sentence and the same number of possible labels, the size of the graphical model is merely 30 factors, and each of them is a 4-by-4 two-dimensional matrix. So this graphical model encodes the full distribution of 4 to the power of 30, which is exponential in the number of tokens, and reduces the size to linear in the number of tokens. So the graphical model, as this example shows, is a very efficient representation of the distribution.
So in my thesis I designed and developed a probabilistic database system called BayesStore that natively supports these probabilistic graphical models and natively supports inference algorithms on top of these models. So the picture now looks like this. The uncertain data is directly stored in BayesStore, in which both a relational query engine and the graphical models and inference engine are supported. The graphical models include both directed and undirected models. Between these two engines, the query constraints and probabilities are passed back and forth. On top of BayesStore the user can ask a query over the uncertain data and the model and get distributions and probabilities as answers.
>>: I'm getting the feeling that understanding this graphical model is going to be important to understand what follows. Is that right?
>> Daisy Zhe Wang: Yes.
>>: Can you back up one slide? I'm not sure I got it. So you've got one -- is it true you've got one node per token and one node per label? Is that what's going on?
>> Daisy Zhe Wang: Yes.
>>: Now, apple or Big Apple -- I mean, Big Apple is a token as well as apple in this model?
>> Daisy Zhe Wang: So in this model big is one token and apple is one token.
>>: And -- but then you want to assign a label to Big Apple?
Or you're going to --
>> Daisy Zhe Wang: I want to assign a label to big and apple independently. So I can label both as company, or apple as a city.
>>: Okay. But then here each of these tokens -- this is undoubtedly a simplified graph, but each of these tokens has only got an edge to one label. Is that just a simplification? I mean, it seems like they're going to have edges to multiple labels because there's different choices within the probabilities.
>> Daisy Zhe Wang: Um-hmm. So the linear chain CRF model only has a correlation between a token and its own label. But this label is also correlated with the previous label. So in the Big Apple case, if apple has some confidence of being labeled as company, and if the previous label is also company, then its confidence is --
>>: Let's do the [inaudible].
>> Daisy Zhe Wang: So the CRF not only models the top-1, it models the whole distribution.
>>: In this example, what you've shown is the [inaudible].
>> Daisy Zhe Wang: So this is a random variable. This random variable could be given any of the possible labels; it's a random variable that could be given multiple values, different kinds of values. The top-1 can only give one value. But to Phil's question, because the current label is correlated with the previous label, we can capture the correlation between a pair of tokens being labeled the same as company, or only one token being labeled as city. So there is a carry-on correlation. Although no token is directly correlated with the previous label, because the labels are correlated, the correlation is captured using this tweak.
>>: And so you have [inaudible] 4 by 4, right? For each one there are four possible labels, and the four possible labels are a function of the four possible labels of the previous two.
>> Daisy Zhe Wang: Yes, yes. So this is a two-dimensional matrix. Given a particular label, this matrix says what the confidence of this pair of labels is.
>>: Is there anything magic about doing only one [inaudible] pairs as opposed to triples or things of that sort?
>> Daisy Zhe Wang: So there are more complicated models than the linear chain CRF, called semi-CRF models, that, rather than just looking at unigrams, try to group tokens together and assign a label to these groups. But it's just not studied here because the linear chain CRF is kind of the state of the art, and more complicated models are used for more complicated corpora. There's no reason why we cannot support them here, because BayesStore is designed for general graphical models. It's just that for this example we're not --
>>: So let me ask you a really dumb question. So you have Big Apple. Why would you ever put a probability -- how reasonable [inaudible] probability that big refers to a city or a location?
>> Daisy Zhe Wang: Sorry?
>>: So you have Big Apple and you're trying to identify Big Apple as having some probability of being a location, city, New York. And so you -- so I don't know exactly how you proceed here, but you've got the word big and you want to put some sort of label or probability distribution of what tag it goes with. Why would the tag for big ever be likely to look like a location?
>> Daisy Zhe Wang: Yes. So big in this case might not be directly labeled as company in the top-1 extraction, but it's coupled with apple, together, because we are computing the joint distribution and the top-1 for the whole sentence -- we're not doing classification for each token at a time.
And I agree with you that maybe for this kind of model, dealing with multi-token entities is a little bit hard. But it's still possible, because when you look at apple, a lot of the time it is labeled as the city instead of the company. And then coupled with the big --
>>: I don't think you'd ever see apple by itself without the word big in front of it referring to anything other than the fruit or the company. I don't think you -- people don't refer to New York as the apple, as far as I know.
>> Daisy Zhe Wang: You're right. Okay.
>>: But that's not really part of your system. You're just assuming that you have a reasonable extraction system that gives you this information, and then your job is to take it from then on.
>> Daisy Zhe Wang: Yeah. So that's more of what the CRF can do. And I agree that the linear CRF has its shortcomings, but this is not the focus of the thesis. So okay. So I talked about how the graphical model is an efficient implementation or representation of the high-dimensional distribution. My thesis is about how to natively support these graphical models in general, and their inference algorithms, inside of the relational database. It stores the uncertain data natively inside the relational database and supports both a relational query engine and an inference query engine. And the user can ask queries to get probabilities and distributions as answers. So BayesStore is basically an integrated system that combines the power of large-scale data processing from relational databases and the power of advanced analytics from the statistical machine learning literature in a single system, to provide a framework for users to ask queries over uncertain data and do probabilistic analysis. BayesStore can solve the problem of information loss, where uncertainties and probabilities are lost, by using these graphical models to capture the probabilities and uncertainties and using the inference engine on top of them; and it can solve the performance problem, because the relational engine here has indexing and optimization and all these benefits. We implemented BayesStore on top of Postgres 8.4.
So my technical contribution consists of three parts. The first one is efficient implementation of graphical models as first-class objects inside of the relational database and efficient implementation of inference using SQL in the database. The second one is inventing new algorithms for probabilistic relational queries, both to compute top-k results as well as result distributions. Because these algorithms are probabilistic -- instead of deterministically querying over the top-1 extraction, we are doing probabilistic queries beyond the top-1, over the whole distribution -- the results show that we can reduce the false negatives by as much as 80 percent. And lastly, I came up with a set of techniques to scale up these algorithms, for example query-driven extraction and so on, and achieve orders-of-magnitude speedup compared to the standard approach, which does all the computation outside of the database, exhaustively, agnostic of what the user is asking for. So the BayesStore system is general in the sense that it supports both directed and undirected models. It can support applications such as text analysis as well as sensor networks. But for the rest of the talk, for simplicity, I'm only going to use information extraction as the driving example.
>>: One question.
>> Daisy Zhe Wang: Yeah.
>>: So there's been some work on taking data mining models and applying them to databases efficiently. So would you say that your work can be classified exactly, or is it something --
>> Daisy Zhe Wang: So there are two sides of the coin. You can see my work from two different angles. One is definitely data mining and statistical methods and supporting them inside of the database. And I'm not aware of a lot of work that supports text analysis algorithms or text mining inside of the database using probabilistic extraction models. And the other side, which is more novel, or where we originally came from, is probabilistic databases. So we deal with probabilities in addition to the statistical methods. We support probabilistic relational operators on top of these distributions. And I don't think those works deal with that.
Okay. So that's the introduction. Now some basics of BayesStore: what graphical model do we support for information extraction, what is the data model, what queries do we support, and what are the query semantics. The graphical model that we support for information extraction is called conditional random fields. It's a state-of-the-art information extraction model, very much like an HMM. This is a six-token address string, and a conditional random field model rendered on top of it looks like this, very similar to what we've seen before. These nodes are associated with the tokens in the text, and these are observed; their values are fixed. And these are random variables whose values are to be inferred. So we don't know the values of these. They could be street name or country or state and so on. So the CRF model encodes the distribution of all possible extractions over this piece of text. Two possible extractions over this piece of text look like this. The extraction is over the entire string, and the probability is on the right-hand side. Each extraction gives a specific label to each token. And different extractions can give different labels to a specific token. And there are many more such possible extractions, which we are omitting here.
So the BayesStore system extends the relational data model to support probabilistic data and probabilistic graphical models. In the example of information extraction, the text is stored in a probabilistic table called TokenTable, which has one probabilistic attribute, label, whose value is to be inferred. And each row in the TokenTable is a unique occurrence of a token, identified by the document ID and the position where the token is taken from. So the possible worlds over this probabilistic table are encoded in a CRF model. And the random variables in this CRF model are mapped to the attribute values in this table like this. It's not actually stored, but the mapping is like this. The CRF model, as we said, consists of a set of local correlations or a set of factors. And they're stored in a relational format, in a relation that contains the factors of all the unique tokens in the corpus. And each factor is represented as a set of rows. For example, the token Berkeley being labeled city preceded by street name has a high confidence of 25. So it's a set of such entries. So the query on top of such a probabilistic database is over the text in the TokenTable.
>>: Can you go back one slide, please.
So in the original CRF model the factors aren't just functions of the label itself, [inaudible] token and the next token; they have these features which are functions of --
>> Daisy Zhe Wang: Yeah, definitely.
>>: So how do we capture [inaudible]?
>> Daisy Zhe Wang: You mean if we have, for example, a long edge between this label and that one?
>>: No, no. It's basically a function of any value of the input [inaudible]. This label of the first token, it could be a function of the [inaudible]. So how to capture such features.
>> Daisy Zhe Wang: So you're still talking about this linear chain conditional random field.
>>: I'm still talking about this linear chain --
>> Daisy Zhe Wang: So --
>>: [inaudible] condition variables of extract. So this label of a particular random variable [inaudible] function of any of the [inaudible], which is why people do conditional random fields [inaudible]. So how do we capture such features?
>> Daisy Zhe Wang: So one thing is that the local correlation is still only concerned with three random variables, right: given a specific token, the relationship is between its label and the previous label. The correlation is modeled by a set of features, and these features are general. These features are used to generate these factor tables, and the factor table is just a materialization of these features on different tokens.
So the queries are on top of the text in the TokenTable, and they include both relational operators and inference operators. The relational operators include not only the deterministic selection, project, join, and aggregates, but also their probabilistic counterparts: probabilistic selection, project, join, and aggregates. Moreover, we support inference operators, which include top-k inference and marginal inference. Marginal inference is over a small subset of the random variables in a graphical model. So one query on top of such text is: give me all the tokens and the top-k labels for the first ten documents. We will also talk about probabilistic selection, join, and marginal inference later in the discussion.
So these kinds of queries follow the possible worlds semantics. The possible worlds semantics say that, starting from the probabilistic database DP, it can be expanded to a set of possible worlds D1 to DN, and the probabilistic query Q can be applied individually to these possible worlds, generating new worlds Q(D1) to Q(DN), which represent the resulting probabilistic database Q(DP). However, we cannot execute the query through this path, because the number of possible worlds could be exponential in the number of random variables. So a major part of the work I've done is how to directly allow probabilistic queries over the probabilistic database and get the resulting probabilistic data as an answer without expanding to the possible worlds. So with that, I'm going into the details of the algorithms. I will go into the details of an example of how to compute the top-k query results, and briefly go over how to compute the marginal distribution using different inference algorithms.
>>: So what class of queries are you [inaudible]?
>> Daisy Zhe Wang: I will have slides later talking about different kinds of queries. But basically in terms of inference, we can support top-k or marginal. In terms of probabilistic queries, we can support different kinds, like project, join, and aggregation.
>>: The biggest [inaudible] queries computing on the probabilistic databases [inaudible].
>> Daisy Zhe Wang: Has -- yeah. [inaudible] hard.
>>: So those classes of queries, you can also not handle [inaudible].
>> Daisy Zhe Wang: I do handle aggregates, and the reason -- I think the reason why it's #P-hard is that the inference algorithm is not a poly-time algorithm; these are NP-hard algorithms. That doesn't mean the computation bottleneck goes away. It's just that these graphical models more efficiently use conditional independence; in practical cases there are a lot of conditional independencies. So they factorize the large distribution into local distributions, which makes the problem much more efficient.
Okay. So this example includes probabilistic join and top-k inference. It basically computes the top-k join results between two e-mail corpora that contain the same company name. And the probability of the join result has to be bigger than a threshold T. So imagine that you have two input documents represented as two distributions of possible extractions using the CRF model. A probabilistic join on top of them generates a new distribution. Say that we simplify the query and only compute the top-1 probabilistic join result. This top-1 join result is not necessarily computed from the top-1 extractions of the input documents, because the top-1 extractions of the input documents might not join with each other. In fact, it might be computed from the top two, top three, top ten, or top hundred extractions from the input documents. Only that for the later ones, the join results might not be bigger than the threshold T, so they will be filtered. But the insight here is that for different pairs of documents, in order to compute the top-1 join result, we cannot deterministically specify in advance the top-k extractions that we need from each document and then compute the top-1 or top-k join results. We have to incrementally fetch the next highest probability extraction as we are computing the top-k join results.
The algorithm to compute this query involves three parts. The first one is the Viterbi dynamic programming algorithm, which computes the top-k extractions. The second part is the incremental Viterbi, which gives you incremental access to the next best extraction, giving you a ranked list of extractions. The third part is a probabilistic rank-join algorithm, which takes the output of incremental Viterbi and computes the top-k join incrementally.
>>: So it seems like the [inaudible] correlation you're modeling is the correlation of different extractions of the same document. If there are different documents, those extractions are [inaudible].
>> Daisy Zhe Wang: Yes.
>>: And even if you have different CRF models on the same document, the different models, those extractions are also being made independently.
>> Daisy Zhe Wang: Can you say that again?
>>: [inaudible] one is for extracting cities, the other is for extracting countries, with different models, and just feeding them through different [inaudible].
>> Daisy Zhe Wang: Yes, if they are different models, yes, they are independent. But usually you model it in the same -- if you are extracting several entities, you usually train the same model to extract them.
>>: [inaudible] have an address extractor and have a citation extractor, I just run the same [inaudible].
>> Daisy Zhe Wang: But you might have more accuracy if you train the integrated model. Because maybe the person and address appear one before the other, or the person and telephone number one after the other, things like that. So these kinds of correlations do appear in the text.
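As a reference for the first of these three parts, here is a hedged sketch of the standard top-1 Viterbi recurrence for a linear-chain CRF; the notation (the matrix V, weights \lambda_k, and feature functions f_k) is generic CRF notation and is not taken from the slides:

    V(i, y) = \max_{y'} \Big[ V(i-1, y') + \sum_k \lambda_k f_k(y, y', x, i) \Big], \qquad V(1, y) = \sum_k \lambda_k f_k(y, \mathrm{start}, x, 1)

The top-1 extraction is recovered by backtracking from \max_y V(T, y), where T is the number of tokens; keeping the top k entries per cell instead of only the maximum yields the top-k extractions, which is the version discussed next.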
So I will go into a little bit of detail for each of them. The Viterbi dynamic programming algorithm is a standard algorithm over CRFs to compute the top-k, and the incremental Viterbi and probabilistic rank-join are new algorithms. For the Viterbi algorithm, the contribution here is that I implemented it natively inside of the database using SQL and measured the performance compared to a Java implementation outside of the database. So Viterbi is a dynamic programming algorithm that computes a V matrix. It's a two-dimensional matrix. Each cell in V, V(i,y), stores the top-k partial segmentations ending at position i with label y. As you can see, it's a recursive algorithm, because V(i,y) is recursively computed from V(i-1,y'), adding this additional step with the weighted sum of features and taking the maximum on top of everything. So the Viterbi implementation inside of Postgres looks like this query plan. We use a WITH RECURSIVE join inside of Postgres, joining over the TokenTable and the factor table, which are the relational representations of the text and the model. So we are computing V(i) recursively from V(i-1), followed by a group-by aggregation, which basically computes the maximum. After comparing the performance of this implementation with the open source Java CRF implementation, we find it's five times slower. And the reason is the way that we are representing the factors: we are representing each factor as a set of rows -- yeah.
>>: One question. So how come sort of [inaudible].
>> Daisy Zhe Wang: Oh, yes. So this is only part of the main loop. You still have to backtrack.
>>: [inaudible] without computing [inaudible]?
>> Daisy Zhe Wang: For the normalization, you still have to use the sum-product algorithm, which is very similar to this recursive algorithm, to compute the normalization factor. And then you just apply the normalization factor to the value that you compute for the top-1, and you get the probability.
>>: But this normalization is [inaudible] if I understand correctly.
>> Daisy Zhe Wang: Normalization is the same --
>>: [inaudible].
>> Daisy Zhe Wang: No, no, it's the same complexity as this one, because you are basically summing over -- replacing max with sum.
>>: So can [inaudible].
>> Daisy Zhe Wang: Yes. Yes, yes. It's also [inaudible] in a very similar fashion. So as we said, this is five times slower because the factor is represented as a set of rows. And what we did is replace this representation using the array data type, which is supported in Postgres. And we also developed a number of aggregation functions on top of the array data type. So this is a new implementation which has a similar structure, using a recursive join, but uses array data types and aggregation functions, which results in a more compact representation of the factors as arrays, better main memory behavior, and more efficient joins. The result of this is as efficient as the Java implementation and sometimes even more efficient. So this exercise tells us that using SQL inside of the database we can efficiently implement inference algorithms, complicated inference algorithms over graphical models.
>>: But you're comparing a Postgres [inaudible].
>> Daisy Zhe Wang: Yes, they are not the same, but we are not trying to beat them; we are just trying to say that with set-oriented processing inside of the database, we can achieve similar performance to -- it's not --
>>: But when you applied sample, you could have taken a different [inaudible] a new operator, which is not [inaudible].
>> Daisy Zhe Wang: Yes, yes, yes, yes. So if we --
>>: [inaudible].
>> Daisy Zhe Wang: I agree. So the other way to implement it is as a new operator: instead of posing this as a query, we implement it as an operator. But there are two major benefits of implementing it this way. One is that this entire query plan, which represents an inference algorithm, can then be optimized with the rest of the relational algebra in a larger query. So based on different statistics over the data, we can optimize the inference implementation more nicely with the rest of the relational algebra. And second, because it's on top of the database instead of inside of the database, it's [inaudible] to other relational databases.
>>: [inaudible] examples where you want to combine this with other relational operators?
>> Daisy Zhe Wang: Oh, exactly. We'll get to that. And this is -- so this kind of native implementation of the inference algorithm makes it possible for you to co-optimize the relational operators with these inference operators, whereas if you hard-code everything, it's a black box. You cannot co-optimize.
>>: Do you have some [inaudible] examples where you can --
>> Daisy Zhe Wang: Yeah.
>>: [inaudible]?
>> Daisy Zhe Wang: Yeah. Okay. So the second part of the algorithm is incremental Viterbi, which is a new variation of the Viterbi algorithm over CRFs. The input is the top-1 extraction from Viterbi and the state, which is the V matrix. So here, instead of deterministically saying I want the top-k extractions, it can get the extractions incrementally. The algorithm basically incrementally computes new elements in some of the cells of the V matrix, and uses this V matrix to extract the next best extraction, resulting in a list of extractions ranked by probability. The complexity of this incremental Viterbi -- big O in T, the number of tokens, Y, the number of labels, and K, the extraction depth, with terms like Y plus KY and log of Y plus K -- is interestingly lower, or more efficient, than the Viterbi algorithm, which means that every time we fetch a new extraction, it's cheaper than computing the top-1.
So the result of incremental Viterbi is fed into the rank-join algorithm, which is computed between each pair of joinable documents. Each document can be represented as a set of possible extractions. It has its join keys, and the possible extractions are listed in decreasing order of probability. So this is the probability, and it's decreasing. And K is the extraction depth. The rank-join is computed between those two tables, or lists of extractions. And what it does is to compute the next best extraction incrementally while it's computing the top-k join result. And as soon as it computes the top-k join result, it stops the incremental inference and returns the results.
>>: So your output only has the token labels, right? [inaudible].
>> Daisy Zhe Wang: Yes.
>>: If I want to say [inaudible] which is probably the most natural use [inaudible] I want Big Apple as a single segment [inaudible].
>> Daisy Zhe Wang: So it's -- so what you have is apple as a city, Big Apple as a city. And there is a way to [inaudible].
>>: Don't label, right?
So I don't just label big city and apple city [inaudible].
>> Daisy Zhe Wang: Yes. So when you want to extract the city as segments, there is --
>>: [inaudible].
>> Daisy Zhe Wang: Yeah, there is SQL for it. I forget -- I actually used it. But you can actually concatenate adjacent tokens of the same label and put them into the same segment.
>>: [inaudible].
>> Daisy Zhe Wang: Yeah, but then -- yeah, I agree. You can return segments. I forgot the name of the operator that I used, but you can still do that.
So the rank-join algorithm only joins between two individual documents. But if you want to join two sets of documents, then you have to compute a set of rank-joins simultaneously, between one outer document and a set of joinable inner documents at the same time. So there are details on how to share computation, how to maintain state, and so on. This is a new algorithm that's based on the rank-join algorithm.
So this was a detailed look at one type of probabilistic query. The reason why we study probabilistic queries is to look beyond the top-1 extraction, to look at how to query over the full distribution of probabilistic data. So this is an evaluation to look at how these probabilistic queries can improve answer quality. The queries are probabilistic selection and join, and the corpus that we run on is 200 hand-labeled signature blocks from the Enron corpus. It's hand labeled, so we have the ground truth. We are comparing the deterministic selection over the top-1 extraction with the result of probabilistic selection. The query is a number of selection conditions over different attributes. So basically, for example, we want all the articles where company equals apple. And we are measuring the false negative rate, meaning the missing results. So we are comparing this baseline with the probabilistic selection, and we see that false negative errors can be reduced by as much as 80 percent.
>>: [inaudible].
>> Daisy Zhe Wang: It's because the standard approach before is only querying over the top-1 extraction. If we are able to query over more than the top-1 -- over the top-k extractions or the whole distribution -- and still use the threshold, probability bigger than T, to control the false positives, then we get better answers. So in this experiment we show that we can reduce the false negatives while keeping the false positives roughly the same.
>>: [inaudible] to go beyond top-1?
>> Daisy Zhe Wang: The reason why it's -- no. So the standard approach -- I'm comparing to the standard approach where only the top-1 --
>>: [inaudible].
>> Daisy Zhe Wang: Sure. But the reason is you don't know which K you should be storing. And I have -- I think -- yeah, in the next example we have to query over the whole distribution instead of just the top-k.
>>: I mean, it's sort of top-1 [inaudible] do the same comparison with top ten that's [inaudible].
>> Daisy Zhe Wang: Yeah. So there are two aspects. One is that you don't know what K you should get for different kinds of queries. Second, you might need the full distribution, as I will show you in the next example.
>>: [inaudible] is these queries, I guess. Are they hand-picked? I mean, they could be rare queries, which doesn't make too much sense.
>> Daisy Zhe Wang: So the queries that we picked here are on the company attribute, over different company entities. We picked all the entities that exist in this corpus.
So basically company equals a number of different values, and these values are picked from what exists in this corpus. So it's a set of queries that we computed, and we aggregate the values here, the false negatives.
>>: So it's just over [inaudible] it's not like the example where you can just kind of join and stuff.
>> Daisy Zhe Wang: So this is probabilistic selection. Yes. For probabilistic join here, we are also comparing the baseline of a join over the deterministic top-1 with the probabilistic join, and the query is to find pairs of signature blocks that are from the same company. The x-axis is increasing from 1 to 30, looking more and more beyond the top-1 extraction, and we are measuring the false negative rate. As we can see, as the probabilistic join looks more and more beyond the top-1 extraction, into the full distribution, it reduces the false negative rate from .4 to .25, a 30 percent drop in false negatives, while the false positives remain roughly the same.
>>: What is the false positive rate?
>> Daisy Zhe Wang: False positive means that you --
>>: No, I know what it is. What's the number?
>> Daisy Zhe Wang: I think it's a pretty decent number, around 90, I think. I don't remember. But I think it's a pretty decent number, and it doesn't change whether we use probabilistic selection or not.
So in the last part we talked about how to compute top-k results over probabilistic joins -- over probabilistic queries. This example is computing a marginal distribution over a probabilistic aggregation. The aggregation in this query is to compute the distribution of the number of companies that we can extract from the articles that also contain the keyword apple. So this is the query plan. We first do an exact join to find all the articles that contain the keyword apple. And the documents are represented as a distribution of possible extractions --
>>: [inaudible] in English, you want to find pairs of documents, let's say one of them has a token called document [inaudible].
>> Daisy Zhe Wang: No, I want to --
>>: [inaudible] label called company?
>> Daisy Zhe Wang: So because we are representing this as a TokenTable, right, it's basically: just give me all the documents that contain apple.
>>: [inaudible].
>> Daisy Zhe Wang: Yes, yes. So this is just a self-join, say give me all the -- first compute --
>>: [inaudible] join the reconstructor.
>> Daisy Zhe Wang: Yes. So -- yeah. We first compute the set of documents that contain the keyword apple, and they are probabilistic -- distributions of possible extractions -- and then this goes through probabilistic selection and count. And the result is basically a histogram of the possible counts. It's a count distribution. We cannot answer this query by using the Viterbi algorithm, because Viterbi computes the top-k extractions. So the inference algorithm that we used is a Markov chain Monte Carlo inference algorithm called the Gibbs sampler, which basically samples a small subset of possible extractions according to the full distribution; this small subset of extractions goes through probabilistic selection and count, which results in a set of counts, which are then turned into a count histogram. As we see in this example, different inference algorithms are needed to compute different probabilistic queries. So this is the Gibbs sampler, and we implemented it efficiently inside of the database using window functions.
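To make the aggregation step concrete, here is a minimal SQL sketch of how a count histogram could be computed from Gibbs samples. It is an illustration only, not BayesStore's actual implementation: the table and column names (token_table, extraction_sample, sample_id, doc_id, pos, label) are hypothetical, each token labeled COMPANY is counted as one company for simplicity, and the zero-count bucket is omitted for brevity.

-- Hypothetical schema: token_table(doc_id, pos, token) holds the text;
-- extraction_sample(sample_id, doc_id, pos, label) holds one row per token
-- per sampled extraction (one Gibbs sample = one fully labeled possible world).
WITH matching_docs AS (
  SELECT DISTINCT doc_id
  FROM token_table
  WHERE token = 'apple'                      -- articles that contain the keyword
),
per_sample_counts AS (
  SELECT s.sample_id,
         COUNT(*) AS company_count           -- tokens labeled COMPANY in this sampled world
  FROM extraction_sample s
  JOIN matching_docs d USING (doc_id)
  WHERE s.label = 'COMPANY'
  GROUP BY s.sample_id
)
SELECT company_count,
       COUNT(*)::float
         / (SELECT COUNT(DISTINCT sample_id) FROM extraction_sample) AS est_probability
FROM per_sample_counts
GROUP BY company_count
ORDER BY company_count;                      -- histogram over samples approximates the count distribution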
And in fact I implemented four inference algorithms, including the Viterbi algorithm, sum-product, the MCMC Gibbs sampler, and MCMC Metropolis-Hastings. And the stars show which inference algorithm can be applied to which type of query. It can also be noted that the same type of query, for example a marginal over a probabilistic join, can be answered by different inference algorithms, including sum-product and the two MCMC algorithms. But there is a tradeoff between accuracy and runtime, because sum-product is an exact algorithm but it cannot deal with general graphs, for example.
So those are the algorithms, and these algorithms solve the problem of information loss: we can query over the probability distribution. The next section deals with the performance problem, which is about a set of techniques to scale up these algorithms. The gist, or the main intuition, of these scale-up techniques is how to replace exhaustive extraction with query-driven extraction. The traditional approach relies on exhaustive extraction, because the analysis is done outside of the database, agnostic of what the query is asking for. So you have to exhaustively compute all possible extractions over all the documents. But if you imagine you have a BayesStore system that has an integrated solution of both a query engine and an extraction engine, then you know at extraction time what the query -- what the user -- is asking for. So you can prune away a lot of unrelated information at extraction time. That's what we rely on.
So this query is very similar, trying to extract the companies in the articles that also mention apple as a company. There are three techniques. The first one is very simple, an inverted index: because we have the documents inside of the database and the extraction inside of the database, we are at liberty to throw away a bunch of documents that don't mention the keyword that is in the query. The second technique is called minimizing the models: when the graphical model is rendered over the entire corpus, it's a huge model, but the query that we are asking only asks about a very small subset of the random variables inside of this graphical model. So we want to prune away the rest of the graphical model that is unrelated to these query nodes. This is called minimizing the models. The third part is the general notion of trying to push relational operators such as join and selection into the inference process. For answering this query, we have an algorithm called early-stopping Viterbi, which tries to push the probabilistic selection into the Viterbi inference algorithm.
So a little bit more detail. For the inverted index, we basically take the predicate token equals apple and use an inverted index to quickly select the set of documents that include this keyword. That's very simple. The second one is minimizing the models. Again, the key insight is that the query nodes are only a very small subset, so we want to prune away the large parts of the graphical model that are unrelated. For example, in this Bayesian network, suppose that node A is the query node, for which we want to compute the distribution or top-k, and B is a random variable that is observed or has evidence. Then according to the Bayesian network, node A is independent of nodes C, D, and E given B. So we can prune away the rest of the graphical model and only do inference over these two random variables.
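As a minimal illustration of the first technique, query-driven filtering via the inverted index, here is a hedged SQL sketch; the names (token_table, doc_id, pos, token) are the same hypothetical schema as before, and the actual BayesStore plan is not shown in the talk:

-- Keep only the tokens of documents that contain the keyword 'apple', so the
-- downstream CRF inference is instantiated over far fewer documents.
WITH matching_docs AS (
  SELECT DISTINCT doc_id
  FROM token_table
  WHERE token = 'apple'            -- this lookup is what the inverted index serves
)
SELECT t.doc_id, t.pos, t.token
FROM token_table t
JOIN matching_docs m USING (doc_id)
ORDER BY t.doc_id, t.pos;          -- this pruned token stream is what inference would consume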
Very similarly, in the CRF model that I study, the CRF is rendered over each sentence independently in a single document, so between sentences there are no correlations. If you want to ask a query over one specific label in one sentence, then you can very safely prune away the rest of the sentences in the document. I implemented this model minimization using graph traversal -- a transitive closure over the graph -- in SQL.
The third algorithm is the Viterbi early-stopping algorithm, which takes the probabilistic selection and tries to push it into the Viterbi algorithm, which is computing this two-dimensional matrix. These are the tokens in a sentence -- Big, Apple, lands, year 14, Super Bowl -- and these are the possible labels. So when you're computing the candidate partial segmentations, as denoted by these arrows, at position 2 you see that none of these candidate partial segmentations points back to labeling apple as the company. All of them say that Big Apple is labeled as city. So you can stop the inference there and toss this document away, because it doesn't satisfy the probabilistic condition that the label has to be company, without doing the full inference.
So with these scale-up techniques, this is an evaluation over the New York Times dataset, which is 7 gigabytes -- a million New York Times articles. If we are to run one of the queries that we saw on a previous slide without the inverted index or any of these scale-up techniques, by exhaustively computing all the possible extractions and querying on top of them, then we need around 13 hours to process the whole corpus.
>>: [inaudible].
>> Daisy Zhe Wang: Yes, it will. But when you do query-driven extraction, you can also maintain the results of the previous queries that you executed, just like view maintenance. So basically you don't have to do it if you're not asked. If we add in the scale-up techniques that we talked about earlier, this graph shows you the runtime as we add these techniques one at a time. If we have the inverted index to filter away all the documents that don't contain the keyword apple, then it reduces the runtime from 13 hours to 14 minutes. If we add in minimizing the models to prune away the part of the graphical model that is not related to the query nodes, then it reduces from 14 minutes to 1.4 minutes, another factor of 10. And if we further add the early-stopping algorithm, then the runtime reduces from 1.4 minutes to 16 seconds. So we can query and get answers over a large dataset in a matter of seconds. These techniques are very effective in pruning away the information unrelated to the queries. This graph shows different kinds of queries with different selectivity. On the left-hand side is higher selectivity; on the right-hand side is lower selectivity. It shows that if the selectivity is high, then the speedup is more significant, whereas otherwise the speedup is smaller. This -- yep.
>>: So is this a fair comparison? This 13-hour runtime is supposed to only need to be run once, right? [inaudible] can be used by all the subsequent queries. So if you compare this to one query [inaudible] and your techniques like the inverted index, and maybe even the early stopping, maybe can also apply on top of the extractions. Meaning you first do all the extractions using these 13 hours, and then you can apply these techniques over these fully extracted [inaudible].
>> Daisy Zhe Wang: Yeah, I agree. So these are two modes of computation.
One is precomputation, batch-oriented, compute everything; and the other is query-driven. And we argue that there are scenarios for both modes of extraction. For example, you might not have full access to the entire document collection. For example, with legal documents, you kind of have to do query-driven extraction, because the number of documents that you can get out is limited, restricted. Or small companies don't have the computational power to preprocess all the documents, and they want to do pointed queries. For example, you have the e-mail of the whole corporation, but then you are doing some investigation and you only want the documents that are specific to David, for example. You don't want to compute all the extractions; you only want the ones for David, you don't want to process everything. So there are scenarios where you do want to do this exploratory, iterative, interactive, query-driven extraction, rather than precomputing everything. But I agree that precomputation does have its application areas.
So this is the -- I don't know if I have time. It's already --
>> Christian Konig: We started a little late.
>> Daisy Zhe Wang: Okay. So this is one optimization technique that we submitted -- and that was accepted -- in a recent SIGMOD paper. It's called hybrid inference. I will just go over at a very high level what it is. As you already know, I implemented different inference algorithms inside of the database, and a single query can actually be answered by different inference algorithms. Hybrid inference basically, at runtime, tries to figure out which inference algorithm is appropriate for which subset of the documents and applies it accordingly, according to the structural properties. What do I mean by structural properties of the graphical model? We start out with two documents. When we apply a join on this pair of documents, it can result in a tree-shaped graphical model, and if we do inference on it, we should use the sum-product algorithm, because it's exact and it's polynomial. On the other hand, if we change the first document a little bit, the join on top of it actually results in a cyclic graph, for which we should use the MCMC algorithm. So hybrid inference basically, after the model is instantiated, forks into two execution paths depending on whether the structure is a tree or cyclic, uses the different inference algorithms, and then unions the results in the end. Using this kind of hybrid inference to optimize the query execution, we find that it's five to ten times more efficient than monolithically applying the general inference algorithm over the entire corpus.
So as a conclusion, I have described a number of applications that require probabilistic data analysis. I have described BayesStore, a probabilistic database system that efficiently implements graphical models and inference algorithms. I described new algorithms for answering probabilistic queries, improving query answers and reducing false negatives by as much as 80 percent. And finally I talked about a number of scale-up techniques that can achieve orders-of-magnitude speedup for interactive queries. This is a number of related works that I'm not going into.
I have other collaborations with different research labs during my Ph.D., other than my thesis work. But this is related. The inference algorithms that I described, like the Viterbi algorithm and so on, are incorporated into the MAD Library. It's a collaboration with EMC.
So the MAD Library is basically a statistical data analysis library. It's open source. You can check it out. It's supposed to do text analysis and clustering and classification and so on. It's in-database statistical analysis. And I also collaborated with IBM Almaden and Google on various projects.
For future work, I'm really excited about the domain that I'm in, which is trying to use advanced statistical methods to analyze large amounts of data, and I think there are a lot of challenges in this domain, both in building infrastructure and systems to deal with this large-scale data analysis, and in looking at different applications, both in social sciences and in health informatics, to apply these algorithms or even invent new algorithms. And finally I think there is a need for a high-level language or interface or visualization for people to manipulate these statistical models and apply them to large amounts of data.
And with that -- so this is a pointer to the BayesStore project where you can get more detail, and this is a pointer to the MAD Library, which is an open source in-database statistical library. And that's all. Thank you very much.
[applause].
>> Christian Konig: Questions?
>>: So I have a very practical question. You have this information extraction as your main driving application. I have personally used a few of these open source information extraction tools. Not only do they only output their most confident extraction result, they seldom output the probabilities. And it's even more rare to see them output a model, a graphical model, in their output. So how do you see making this [inaudible] database so that you can utilize these graphical models?
>> Daisy Zhe Wang: So these models do -- it depends on which information extraction toolkit you use. There is rule-based information extraction, which doesn't generate probabilities, and there are other tools from the machine learning community that are probabilistic, but you're not seeing the probabilities because they don't think people are going to use them. So the probabilities are inherent in these probabilistic models. They are not generating them because their mindset is that the output will be put into a database for people to query, so there is no need to generate probabilities or uncertainties. But they are there for this set of models. Now, as to whether these probabilities are important, I did have an example earlier of why these uncertainties are important: because the top-1 extraction might not be correct. And you want to involve humans, or involve statisticians, to look into the uncertainties. Especially in some cases -- like I was talking to a national security person who really wants the uncertainties in the text, because the top-1 doesn't capture what they are looking for, and they want to combine weak evidence into alarms and so on. So for these cases you do want alternatives rather than just the top-1.
>>: I guess the question was really what motivates these [inaudible] to expose their underlying models, which may be built on a lot of training data they have, to the outside world.
>> Daisy Zhe Wang: I think we have to come up with -- I think right now there is really no killer app that really utilizes these probabilities and generates value out of them. And part of it is a chicken-and-egg problem, because there are no tools. So it's a matter of coming up with the killer app and having the tools to deal with them.
>>: So a very simple example: you're looking for a match -- trying to match against a city name that's in a document, and you want to only match documents where the probability of the city name being correct is greater than .8 from the extraction. So my understanding is that -- so to get that -- the graphical model will probably give you the sequence of things. To get the probability of an individual entity or subsequence being extracted, you've got to compute the sum [inaudible]. So my understanding is that in your current [inaudible] you deal with something like that. So the question is, during the process of that sort of forward/backward-like thing, you could introduce any number of positive or negative constraints, right?
>> Daisy Zhe Wang: Right.
>>: Is that -- are those -- is the ability to introduce constraints reflected in the query language?
>> Daisy Zhe Wang: So the probabilistic selection actually is a conditional Viterbi. It does -- so in a probabilistic selection, if you say that --
>>: If I want -- if I have specific requirements -- so when you're summing over all these different paths to compute the probability in C, you could constrain [inaudible].
>> Daisy Zhe Wang: Yes. You can specify evidence in the label. In the TokenTable, there is a probabilistic column called label. Before, it was all empty; it needs to be inferred. There is no evidence in it. But if you want to pose some constraints, say that Obama is always a person, for example, then you can do that by specifying the labels.
>>: Or, for example, I want to do this probabilistic matching but I want to ignore sequences where the previous thing was posted.
>> Daisy Zhe Wang: Yeah, if you have more general constraints, then there has to be a better way to do that.
>>: Is there a way in the query language currently to express these constraints?
>> Daisy Zhe Wang: Not in general, but there are ways to do simple things. Like I said, if you want to fix that a certain token is labeled in a certain way, you can do that. So, for example, if you want David to always be a person, or just this occurrence of David to be a person and this occurrence of Big Apple to be a city, you can do that.
>>: Seems like there's a rich space of possible constraints you could describe in the query, and there's a particular point in this matching process where you could stick them in. I'm thinking of, for example, the work of Aaron Cubana [phonetic], right?
>> Daisy Zhe Wang: Yes, yes. That's part of our related work. And we had that in mind, but in that paper, the kind of constraint is that the user specifies specifically that this token should be labeled this way. So that's the kind of constraint we do support. But more general constraints, like, I don't know -- like before an adjective there always has to be a noun, or after an adjective -- that's the kind of more first-order constraint. We do support what is supported in that paper, which is the evidence.
>>: So what happens if you go from this Markov model to the semi-Markov models? How much of your thing has to change? Because I saw most of the examples exploited the Markov assumptions, just one token, and really rely on the tokens and so forth.
>> Daisy Zhe Wang: How do I extend to HMF?
>>: Markov model.
>> Daisy Zhe Wang: Markov model.
>>: It's a segment, not a token. Seems like the model is different.
>> Daisy Zhe Wang: I read the semi-Markov paper. I think their inference is also an extension of the Viterbi algorithm.
And I don't quite get what the differences are there. Do you -- like, I think what they said is their computation is similar to Viterbi, and it's a variation of the Viterbi. I didn't look very closely at that.
>> Christian Konig: Okay. Thank you.
[Applause]