>> Christian Konig: Hi. It's my pleasure to introduce Daisy Wang from University of California,
Berkeley. Her thesis is on probabilistic databases and how you manage them in a relational database, and she's going to be telling us more about it. So over to you.
>> Daisy Zhe Wang: Thanks, everyone, for coming. Today I'm going to talk about extracting and
querying probabilistic information in BayesStore.
In the past few years the number of applications that need to deal with large amounts of data has grown remarkably. The data underlying these applications are often uncertain, as is the analysis, which usually involves probabilistic inference and modeling.
One such application is information extraction. The amount of text data in the world is significant and is growing at an ever faster speed, both in enterprises and over the Web. Information extraction is one type of text analysis which extracts entities such as person names and companies from a piece of text such as a news article or an e-mail.
So this is one piece of text from a New York Times article. Information extraction uses probabilistic models to extract entities such as a person, here Harold, or a company, such as McGraw-Hill, from this piece of text.
Moreover, these probabilistic extraction models produce probabilities and uncertainties for the possible extractions. For example, a model might generate the alternative of extracting Big Apple as a company with one probability, or Big Apple as a city with a different probability. So this kind of extraction generates probabilistic data or entities.
And one possible query over this probabilistic data could be: which New York Times articles mention Apple as a company, with the top-k highest probabilities?
Another application that also generates a large amount of probabilistic data is sensor networks. The sensor readings are probabilistic: they are full of missing values and erroneous values because of, for example, interference in the signal, low battery life and so on.
If you model these uncertainties using a Gaussian distribution, like the bell curve here, one possible query you can ask on top of this probabilistic data is: what is the Gaussian distribution of the average temperature in a certain area?
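As a hedged aside on what such a query returns (purely illustrative, and assuming independent Gaussian readings, which the talk does not specify): the average of independent Gaussians is again Gaussian, with mean the average of the means and variance the sum of the variances divided by n squared.

```python
import math

def gaussian_of_average(readings):
    """readings: list of (mean, std_dev) for independent Gaussian sensor
    readings in one area.  Returns (mean, std_dev) of their average,
    which is itself Gaussian under the independence assumption."""
    n = len(readings)
    mean = sum(m for m, _ in readings) / n
    var = sum(s * s for _, s in readings) / (n * n)
    return mean, math.sqrt(var)

# Example with three (hypothetical) temperature readings.
print(gaussian_of_average([(21.0, 0.5), (22.5, 1.0), (20.8, 0.7)]))
```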
Other applications that require this kind of probabilistic data analysis include data integration systems, where entity resolution and schema mapping are usually probabilistic, and social networks, where probabilistic inference and modeling are used to classify users, predict user behaviors and so on.
In this talk I will first describe the state of the art, the current approach in industry to perform probabilistic data analysis, and its problems. Then I will describe my approach to probabilistic data analysis, which is implemented in a probabilistic database system called BayesStore. I will go into the details of the new algorithms that we invented for probabilistic relational queries, and then talk about techniques to scale up these algorithms. I will finally conclude and talk about future work.
So the standard approach in industry to perform probabilistic data analysis looks like this. You extract uncertain data from different real-life systems, such as from text or sensor networks, and put it into a relational database. At data analysis time, all the raw data is extracted from the relational database, put into a file, massaged into the right format, and fed into statistical machine learning packages, such as R and MATLAB.
Inside these packages, probabilistic models are learned on top of the uncertain data, and a number of data analysis tasks are performed, such as inference, data cleaning, and aggregation. And the result is put back into the relational database as analytic result tables, which the user then queries.
There are at least two problems with this standard approach. The first one is a performance problem. Because the data is stored inside the database while the data processing and analysis is done outside of it, there is an expensive data transfer when the dataset is large. Moreover, because all the computation is done outside of the database, all of the database's benefits such as optimization, parallelization and indexing are not utilized; the database is only used as data storage.
So this is the performance problem. The second problem is information loss. In a standard database, only deterministic data can be stored. So in the standard approach, only the top-1 highest-probability analytic result is stored in the database for people to query, which prematurely disregards all the uncertainties and probabilities. Thus the information loss. I will give one example illustrating this point.
So this is the top-1 extraction of this piece of text from the New York Times. As we can see, this top-1 extraction mistakenly missed the entity Big Apple as the city. But fortunately these probabilistic extraction models give alternative possible extractions, and among the top-k extractions, Big Apple is correctly extracted as the city.
For different documents this correct extraction comes at a different k; the k differs across documents, but usually it appears within the first few top extractions.
So imagine that you have a deterministic database storing only the top-1 extraction and you run a query on top of it, for example: return all the articles with city equal to Big Apple. Because the top-1 extraction missed Big Apple as the city, it will return no documents. But if you consider the uncertainties and probabilities, then this query will return a set of articles that have Big Apple as a city, in descending order of probability. Now, these results might not come from the top-1 extraction, but they might well be the correct answer and what the user is asking for. In fact, it's what a search engine provides: a ranked list of results ordered by probability.
So this example illustrates that querying only over the top-1 extraction is not sufficient. What we want is to query over the full distribution of probabilistic data. And the problem of storing and querying probabilistic data has been the focus of the probabilistic database literature for the past ten years.
Before the BayesStore project, or before 2007, a number of probabilistic data representations had been proposed. One is from Dalvi and Suciu at the University of Washington here, where a probability is attached to each row in a table; for example, the probability that this tuple exists is .8. Another representation was proposed by the Trio project at Stanford, which assumes a much finer granularity of probability, attaching probabilities to the attribute values instead of the rows. But both of these early representations assume independence between the uncertain values, so it's very hard for them to represent a high-dimensional distribution with complex correlations between these uncertain values.
In fact, you could try to represent the distribution of all possible extractions from information extraction using this kind of representation, by storing each possible extraction with its associated probability, one by one, for all possible extractions. But if you have a sentence with 30 tokens and each token has four possible labels -- here person, company, location or other -- the number of possible extractions is 4 to the power of 30. And if you have lots of sentences in a corpus, it's impossible to store all possible extractions.
Now, modeling uncertainty and reasoning with it has been the focus of the statistical machine learning literature, which came up with probabilistic graphical models to compactly represent high-dimensional distributions with correlations.
So graphical models have been studied as an efficient representation of these high-dimensional distributions. This is an example of a graphical model. A graphical model contains random variables, which are the nodes in the graph, and the correlations between the random variables, which are the edges between these nodes. Here the random variables are over the piece of text that we've seen before in the New York Times article. And the correlation basically says that the label of a specific token is correlated with the token itself and with the previous label.
So this is a set of local correlations which is factorized from the high-dimensional distribution we saw earlier, based on the factorization and conditional independence that exist between these random variables.
So if we have the same number of tokens in a sentence and the same number of possible labels, the size of the graphical model is merely 30 factors, each of them a 4-by-4 two-dimensional matrix. So this graphical model encodes the full distribution over 4 to the power of 30 extractions, which is exponential in the number of tokens, while reducing the representation size to linear in the number of tokens. So a graphical model, as this example shows, is a very efficient representation of the distribution.
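As a rough illustration of the size argument just made (not code from the talk; the numbers are simply the example on the slide), this Python sketch contrasts the explicit distribution over labelings with the linear-chain factor representation:

```python
# Size of the explicit distribution over all labelings of a 30-token
# sentence with 4 possible labels, versus the linear-chain factorization
# described in the talk (one 4-by-4 factor per token).
NUM_TOKENS = 30
NUM_LABELS = 4          # person, company, location, other

# Explicit distribution: one probability per possible extraction.
explicit_entries = NUM_LABELS ** NUM_TOKENS              # 4^30, about 1.15e18

# Factorized representation: one factor per token, each a
# NUM_LABELS x NUM_LABELS table over (previous label, current label).
factor_entries = NUM_TOKENS * NUM_LABELS * NUM_LABELS    # 30 * 16 = 480

print(f"explicit distribution entries:   {explicit_entries:,}")
print(f"factorized representation entries: {factor_entries:,}")
```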
So in my thesis I designed and developed a probabilistic database system called BayesStore that natively supports these probabilistic graphical models and natively supports inference algorithms on top of these models.
So the picture now looks like this. The uncertain data is directly stored in BayesStore, in which both a relational query engine and graphical models with their inference engines are supported. The graphical models include both directed and undirected models. Between these two engines, query constraints and probabilities are passed back and forth. On top of BayesStore the user can ask a query over the uncertain data and the model and get distributions and probabilities as answers.
>>: I'm getting the feeling that understanding this graphical model is going to be important to
understand what follows. Is that right?
>> Daisy Zhe Wang: Yes.
>>: Can you back up one slide? I'm not sure I got it. So you've got one -- is it true you've got one
node for token and one node per label? Is that what's going on?
>> Daisy Zhe Wang: Yes.
>>: Now, apple or Big Apple -- I mean, Big Apple is a token as well as apple in this model?
>> Daisy Zhe Wang: So in this model big is one token and apple is one token.
>>: And -- but then you want to assign a label to Big Apple? Or you're going to --
>> Daisy Zhe Wang: I want to assign a label to big and apple independently. So I can assign both as company, or apple as a city.
>>: Okay. But then here each of these tokens -- this is undoubtedly a simplified graph, but each
of these tokens has only got an edge to one label. Is that just a simplification? I mean, it seems
like they're going to have edges to multiple labels because there's different choices within the
probabilities.
>> Daisy Zhe Wang: Um-hmm. So the linear chain CRF model only has a correlation between a token and its own label. But this label is also correlated with the previous label. So in the Big Apple case, if apple has some confidence of being labeled as company, and if the previous label is also company, then its confidence is --
>>: Let's do the [inaudible].
>> Daisy Zhe Wang: So the CRF not only models the top-1, it models the whole distribution.
>>: In this example, what you've shown is the [inaudible].
>> Daisy Zhe Wang: So this is a random variable. This random variable could be given any of the possible labels; it's a random variable that could be given multiple values, different kinds of values. The top-1 can only give one value.
But to Phil's question, because the current label is correlated with the previous label, we can capture the correlation between a pair of tokens being labeled the same, as company, versus only one token being labeled as city. So there is a carried-over correlation. Although no token is directly correlated with the previous label, because the labels are correlated with each other, the correlation is captured using this trick.
>>: And so you have [inaudible] 4 by 4, right? For each one there are four possible labels, and
the four possible labels are a function of the four possible labels of the previous two.
>> Daisy Zhe Wang: Yes, yes. So this is a two-dimensional matrix. Given a particular token, this matrix says what the confidence of each pair of labels is.
>>: Is there anything magic about doing only one [inaudible] pairs as opposed to triples or things
of that sort?
>> Daisy Zhe Wang: So there are more complicated models than the linear chain CRF, called semi-CRF models, that, rather than just looking at unigrams, try to group tokens together and assign a label to these groups. But that's just not studied here, because the linear chain CRF is kind of the state of the art and more complicated models are used for more complicated corpora. There's no reason why we cannot support them here, because BayesStore is designed for general graphical models. It's just that for this example we're not --
>>: So let me ask you a really dumb question. So you have Big Apple. Why would you ever put a probability -- how reasonable [inaudible] probability that big refers to a city or a location?
>> Daisy Zhe Wang: Sorry?
>>: So you have Big Apple and you're trying to identify Big Apple as some probability being a
location, city, New York. And so you -- so I don't know exactly how you proceed here, but you've
got the word big and you want to put some sort of label or probabilistic distribution of what tag it
goes with. Why would the tag for big ever be likely to look like a location?
>> Daisy Zhe Wang: Yes. So big in this case might not be directly labeled as company in the top-1 extraction, but it's coupled with apple, because we are computing the joint distribution and the top-1 for the whole sentence; we're not doing classification for each token at a time. And I agree with you that it might be a little bit hard for this kind of model to deal with multi-token entities. But it's still possible, because when you look at apple, a lot of the time it is labeled as the city instead of the company. And then coupled with big --
>>: I don't think you'd ever see apple by itself without the word big in front of it referring to
anything other than the fruit or company. I don't think you -- people don't refer to New York as
the apple, as far as I know.
>> Daisy Zhe Wang: You're right. Okay.
>>: But that's not really part of your system. You just are assuming that you have a reasonable
extraction system that gives you this information and then your job is to take it from then on.
>> Daisy Zhe Wang: Yeah. So that's more of a question about what the CRF can do. And I agree that the linear CRF has its shortcomings, but this is not the focus of the thesis.
So, okay. I talked about how the graphical model is an efficient representation of the high-dimensional distribution. My thesis is about how to natively support these graphical models in general inside of the relational database, along with their inference algorithms. So the system stores the uncertain data natively inside the relational database and supports both a relational query engine and an inference query engine. And the user can ask queries and get probabilities and distributions as answers.
So BayesStore is basically an integrated system that combines the power of large-scale data processing from relational databases and the power of advanced analytics from the statistical machine learning literature in a single system, to provide a framework for users to ask queries over uncertain data and do probabilistic analysis. BayesStore solves the problem of information loss, where uncertainties and probabilities are lost, by using these graphical models to capture the probabilities and uncertainties and running an inference engine on top of them; and it solves the performance problem because the relational engine brings indexing, optimization and all those benefits.
We implemented BayesStore on top of Postgres 8.4.
So my technical contribution consists of three parts. The first is an efficient implementation of graphical models as first-class objects inside the relational database, and an efficient implementation of inference using SQL in the database. The second is inventing new algorithms for probabilistic relational queries, both to compute top-k results and to compute result distributions. Because these algorithms are probabilistic -- instead of deterministically querying over the top-1 extraction, we are doing probabilistic queries beyond the top-1, over the whole distribution -- the results show that we can reduce the false negatives by as much as 80 percent.
And lastly, I came up with a set of techniques to scale up these algorithms, for example query-driven extraction and so on, and achieved orders-of-magnitude speedups compared to the standard approach, which does all the computation outside of the database, exhaustively, agnostic about what the user is asking for.
So the BayesStore system is general in the sense that it supports both directed and undirected models. It can support applications such as text analysis as well as sensor networks. But for the rest of the talk, for simplicity, I'm only going to use information extraction as the driving example.
>>: One question.
>> Daisy Zhe Wang: Yeah.
>>: So there's been some work on taking data mining models and applying them to databases efficiently. So would you say that your work can be classified exactly there, or is it something --
>> Daisy Zhe Wang: So there are two sides of the coin; you can see my work from two different angles. One is definitely the data mining and statistical methods side, supporting them inside of the database. And I'm not aware of a lot of work that supports text analysis or text mining inside of the database using probabilistic extraction models.
And the other side, which is more novel and where we originally came from, is probabilistic databases. So we deal with probabilities in addition to the statistical methods: we support probabilistic relational operators on top of these distributions. I don't think those works deal with that.
Okay. So that's the introduction. Now some basics of BayesStore: which graphical model we support for information extraction, what the data model is, which queries we support, and what the query semantics are.
So the graphical model that we support for information extraction is called conditional random fields. It's a state-of-the-art information extraction model, very much like an HMM. This is a six-token address string, and a conditional random field model rendered on top of it looks like this, very similar to what we've seen before. These nodes are associated with the tokens in the text, and they are observed: their values are fixed. And these are the random variables whose values are to be inferred; we don't know their values. They could be street name or country or state and so on. So the CRF model encodes the distribution of all possible extractions over this piece of text.
So two possible extractions over this piece of text look like this. Each extraction is over the entire string, and the probability is on the right-hand side. Each extraction gives a specific label to each token, and different extractions can give different labels to a specific token. And there are many more such possible extractions, which we are omitting here.
So the BayesStore system extends the relational data model to support probabilistic data and probabilistic graphical models. In the information extraction example, the text is stored in a probabilistic table called the TokenTable, which has one probabilistic attribute, label, whose value is to be inferred. Each row in the TokenTable is a unique occurrence of a token identified by the document ID and the position where the token is taken from. The possible worlds over this probabilistic table are encoded in a CRF model, and the random variables in this CRF model are mapped to the attribute values in this table like this. The mapping is not actually stored, but it looks like this.
The CRF model, as we said, consists of a set of local correlations, or a set of factors, and they're stored in relational format: in a relation that contains the factors for all the unique tokens in the corpus, where each factor is represented as a set of rows. For example, the token Berkeley being labeled city, preceded by street name, has a high confidence of 25. So it's a set of such entries. And the query on top of such a probabilistic database is over the text in the TokenTable.
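To make the relational representation just described concrete, here is a small hypothetical Python sketch (the actual BayesStore schema lives in Postgres and may differ; the specific tokens, labels and scores below are invented for illustration):

```python
from collections import namedtuple

# One row per unique occurrence of a token; `label` is the probabilistic
# attribute whose value is to be inferred (None = not yet known).
TokenRow = namedtuple("TokenRow", ["doc_id", "pos", "token", "label"])

token_table = [
    TokenRow(doc_id=1, pos=0, token="2181", label=None),
    TokenRow(doc_id=1, pos=1, token="Shattuck", label=None),
    TokenRow(doc_id=1, pos=2, token="Berkeley", label=None),
]

# Relational encoding of the CRF factors: one row per
# (token, previous label, label) with its confidence score, e.g.
# "Berkeley labeled city, preceded by street name, has confidence 25".
FactorRow = namedtuple("FactorRow", ["token", "prev_label", "label", "score"])

factor_table = [
    FactorRow("Berkeley", "street name", "city", 25.0),
    FactorRow("Berkeley", "city", "city", 4.0),
    FactorRow("Shattuck", "street num", "street name", 18.0),
]
```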
>>: Can you go back one slide, please. So in the original CRF model the factors are not just -- aren't just functions of the label itself, [inaudible] token and a next token, have these features which are functional --
>> Daisy Zhe Wang: Yeah, definitely.
>>: So how do we capture [inaudible]?
>> Daisy Zhe Wang: You mean if we have further on what -- for example, this label has a long
edge between this one and this one?
>>: No, no. It's basically a function of any value of the input [inaudible]. This label of the first
token, it could be a function of the [inaudible]. So how to capture such pictures.
>> Daisy Zhe Wang: So you're still talking about this linear chain conditional random field.
>>: I'm still talking about this linear chain --
>> Daisy Zhe Wang: So --
>>: [inaudible] condition variables of extract. So this label of a particular random variable
[inaudible] function of any of the [inaudible], which is why people do conditional random fields
[inaudible]. So how do we capture such pictures?
>> Daisy Zhe Wang: So the thing is that each local correlation is still only concerned with three random variables, right: given a specific token, the relationship is between its label and the previous label. The correlation is modeled by a set of features, and these features are general. These features are used to generate these factor tables, and the factor table is just a materialization of these features on different tokens.
So the queries are over the text in the TokenTable, and they include both relational operators and inference operators. The relational operators include not only the deterministic selection, projection, join and aggregates, but also their probabilistic counterparts: probabilistic selection, projection, join and aggregates. Moreover, we support inference operators, including top-k inference and marginal inference. Marginal inference is over a small subset of the random variables in a graphical model.
So one query on top of such text is: give me all the tokens and their top-k labels for the first ten documents. We will also talk about probabilistic selection, join, and marginal inference later in the discussion.
These queries follow the possible-worlds semantics. The possible-worlds semantics says that starting from the probabilistic database D^p, it can be expanded into a set of possible worlds D1 to Dn, and the query can be applied individually to each of these possible worlds, generating new worlds Q(D1) to Q(Dn), which together represent the resulting probabilistic database Q(D^p). However, we cannot execute the query through this path, because the number of possible worlds can be exponential in the number of random variables. So a major part of the work we've done is how to directly apply probabilistic queries over the probabilistic database and get the resulting probabilistic data as an answer, without expanding to the possible worlds.
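The possible-worlds semantics can be illustrated with a toy sketch (illustrative only -- enumerating worlds is exactly what BayesStore avoids, and the independence assumption below is just to keep the example small):

```python
from itertools import product

# A tiny probabilistic "database": one uncertain label per token,
# each with a distribution over possible values.
uncertain_labels = {
    ("doc1", 0): {"company": 0.7, "city": 0.3},   # e.g. "Apple"
    ("doc1", 1): {"city": 0.9, "other": 0.1},     # e.g. "Berkeley"
}

def possible_worlds(db):
    """Enumerate every deterministic world D_i with its probability
    (independence is assumed here only to keep the toy example small)."""
    keys = list(db)
    for choice in product(*(db[k].items() for k in keys)):
        world = {k: v for k, (v, _) in zip(keys, choice)}
        prob = 1.0
        for _, p in choice:
            prob *= p
        yield world, prob

def query(world):
    """Deterministic query Q: which tokens are labeled 'city'?"""
    return frozenset(k for k, label in world.items() if label == "city")

# Apply Q to each world and sum probabilities of identical answers:
# the result is the probabilistic answer Q(D^p).
answers = {}
for world, prob in possible_worlds(uncertain_labels):
    answers[query(world)] = answers.get(query(world), 0.0) + prob

for ans, prob in sorted(answers.items(), key=lambda kv: -kv[1]):
    print(sorted(ans), round(prob, 3))
```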
So with that, I'm going into the details of the algorithms. I will go through an example of how to compute top-k query results and briefly cover how to compute marginal distributions using different inference algorithms.
>>: So what class of queries are you [inaudible]?
>> Daisy Zhe Wang: I will have slides later talking about different kinds of queries. But basically, in terms of inference, we can support top-k or marginal. In terms of probabilistic queries, we can support different kinds, like projection, join and aggregation.
>>: The biggest [inaudible] queries computing on the probabilistic databases [inaudible].
>> Daisy Zhe Wang: Has -- yeah. [inaudible] hard.
>>: So those class of queries, you can also not handle [inaudible].
>> Daisy Zhe Wang: I do handle aggregates. And I think the reason is that the inference algorithm is not a poly-time algorithm -- it's #P-hard, NP-hard. It doesn't mean that the computational bottleneck goes away. It's just that these graphical models more efficiently use conditional independence: in practical cases there are a lot of conditional independencies, so they factorize the large distribution into local distributions, which makes the problem much more efficient.
Okay. So this example includes a probabilistic join and top-k inference. It basically computes the top-k join results between two e-mail corpora that contain the same company name, and the probability of each join result has to be bigger than a threshold T.
So imagine that you have two input documents represented as two distributions of possible extractions using CRF models. A probabilistic join on top of them generates a new distribution. Say we simplify the query and only compute the top-1 probabilistic join result. This top-1 join result is not necessarily computed from the top-1 extractions of the input documents, because the top-1 extractions of the input documents might not join with each other. In fact, it might be computed from the top-two, top-three, top-ten or top-hundred extractions of the input documents; it is only that the later join results might not be bigger than that threshold T, so they will be filtered.
But the insight here is that, for different pairs of documents, in order to compute the top-1 join result we cannot deterministically specify in advance the top-k extractions that we need from each document. We have to incrementally fetch the next-highest-probability extraction as we are computing the top-k join results.
The algorithm to compute this query involves three parts. The first is the Viterbi dynamic programming algorithm, which computes the top-k extractions. The second part is incremental Viterbi, which gives you incremental access to the next best extraction, giving you a ranked list of extractions. The third part is a probabilistic rank-join algorithm, which takes the output of incremental Viterbi and computes the top-k join incrementally.
>>: So seems like the [inaudible] correlation you're modeling is correlation of different
extractions of the same document. There are different documents, those extractions are
[inaudible].
>> Daisy Zhe Wang: Yes.
>>: And even if you have different CRF models on the same document, the different models,
those extractions also being made independently.
>> Daisy Zhe Wang: Can you say that again.
>>: [inaudible] one is for extracted cities, the other is for extracting countries with the different
models and just feeding them through different [inaudible].
>> Daisy Zhe Wang: Yes, if they are different models, then yes, they are independent. But usually, if you are extracting several entities, you train the same model to extract them.
>>: [inaudible] have an accuracy extractor and have a citation extractor, I just run the same
[inaudible].
>> Daisy Zhe Wang: But you might have more accuracy if you train an integrated model, because maybe the person and the address appear one before the other, or the person and the telephone number one after the other, things like that. So those kinds of correlations do appear in the text.
So I will go into a little bit of detail for each of them. The Viterbi dynamic programming algorithm is a standard algorithm over CRFs to compute the top-k extractions; the incremental Viterbi and the probabilistic rank-join are new algorithms.
For the Viterbi algorithm, the contribution here is that I implemented it natively inside of the database using SQL and measured its performance against a Java implementation outside of the database.
The Viterbi algorithm is a dynamic programming algorithm that computes a V matrix, a two-dimensional matrix. Each cell V(i, y) stores the top-k partial segmentations ending at position i with label y. As you can see, it's a recursive algorithm, because V(i, y) is recursively computed from V(i-1, ·) by adding one additional step with a weighted sum of features and taking the maximum over everything.
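A minimal Python sketch of the dynamic program just described (the k-best-per-cell bookkeeping follows the slide; the `factor` function is a stand-in for the CRF's weighted sum of features, not BayesStore's actual code):

```python
import heapq

def viterbi_topk(tokens, labels, factor, k=1):
    """V[i][y] holds the top-k partial segmentations ending at position i
    with label y, as (score, label_sequence) pairs.  `factor(i, prev, y)`
    returns the (log-)score of labeling token i with y given previous
    label `prev` (prev is None at position 0)."""
    T = len(tokens)
    V = [{y: [] for y in labels} for _ in range(T)]
    for y in labels:
        V[0][y] = [(factor(0, None, y), (y,))]
    for i in range(1, T):
        for y in labels:
            candidates = [
                (score + factor(i, prev, y), path + (y,))
                for prev in labels
                for score, path in V[i - 1][prev]
            ]
            V[i][y] = heapq.nlargest(k, candidates)   # keep top-k per cell
    # Merge the last column to get the overall top-k extractions.
    final = [entry for y in labels for entry in V[T - 1][y]]
    return heapq.nlargest(k, final)
```

For k = 1 this reduces to the standard Viterbi recurrence of taking the maximum over the previous labels.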
The Viterbi implementation inside of Postgres looks like this query plan. We use a recursive query (WITH RECURSIVE) inside of Postgres, joining the TokenTable and the factor table, which are the relational representations of the text and the model. So we are computing V(i, y) recursively from V(i-1), followed by a group-by aggregation, which basically computes the maximum.
After measuring the performance of this implementation against a Java open-source CRF implementation, we found it was five times slower. And the reason is the way we were representing the factors: we were representing each factor as a set of rows -- yeah.
>>: One question. So how come sort of [inaudible].
>> Daisy Zhe Wang: Oh, yes. So this is only part of the main loop. You still have to backtrack.
>>: [inaudible] without computing [inaudible]?
>> Daisy Zhe Wang: For normalization, you still have to use the sum-product algorithm, which is very similar to this recursive algorithm, to compute the normalization factor. Then you just apply the normalization factor to the score that you compute for the top-1, and you get the probability.
>>: But this normalization is [inaudible] if I understand, correctly.
>> Daisy Zhe Wang: Normalization is the same --
>>: [inaudible].
>> Daisy Zhe Wang: No, no, it's the same complexity as this one, because you are basically
summing over -- replacing max with sum.
>>: So can [inaudible].
>> Daisy Zhe Wang: Yes. Yes, yes. It's also [inaudible] in a very similar fashion.
So as we said, this was five times slower because each factor was represented as a set of rows. What we did is replace this representation with the array data type, which is supported in Postgres, and we also developed a number of aggregate functions over the array data type.
So this is the new implementation, which has a similar structure and uses a recursive join, but uses array data types and aggregate functions. This results in a more compact representation of the factors, better main-memory behavior and a more efficient join.
The result is that it is as efficient as the Java implementation, and sometimes even more efficient. So this exercise tells us that using SQL inside of the database we can efficiently implement complicated inference algorithms over graphical models.
>>: But you're comparing a Postgres [inaudible].
>> Daisy Zhe Wang: Yes, they are not the same, but we are not trying to beat them; we are just trying to say that with set-oriented processing inside of the database we can achieve similar performance -- it's not --
>>: But when you applied sample, you could have taken a different [inaudible] a new operator,
which is not [inaudible].
>> Daisy Zhe Wang: Yes, yes, yes, yes. So if we --
>>: [inaudible].
>> Daisy Zhe Wang: I agree. So the other way to implement it is as a new operator: instead of posing this as a query, we implement it as an operator. But there are two major benefits of implementing it this way. One is that this entire query plan, which represents an inference algorithm, can then be optimized together with the rest of the relational algebra in a larger query. So based on different statistics over the data, we can more nicely optimize the inference implementation with the rest of the relational algebra. And second, because it's on top of the database instead of inside of the database, it's [inaudible] to other relational databases.
>>: [inaudible] examples where you want to combine this with other relational operators?
>> Daisy Zhe Wang: Oh, exactly. We'll get to that. So this kind of native implementation of the inference algorithm makes it possible to co-optimize the relational operators with the inference operators, whereas if you hard-code everything it's a black box; you cannot co-optimize.
>>: Do you have some [inaudible] examples where you can --
>> Daisy Zhe Wang: Yeah.
>>: [inaudible]?
>> Daisy Zhe Wang: Yeah. Okay. So the second part of the algorithm is incremental Viterbi, which is a new variation of the Viterbi algorithm over CRFs. The input is the top-1 extraction from Viterbi and its state, which is the V matrix. Here, instead of deterministically saying I want the top-k extractions, we can get extractions incrementally. The algorithm basically computes new elements in some of the cells of the V matrix incrementally, and uses this V matrix to extract the next best extraction, resulting in a list of extractions ranked by probability.
The complexity of this incremental Viterbi -- in terms of T, the number of tokens, Y, the number of labels, and K, the extraction depth -- is interestingly lower than that of the full Viterbi algorithm, which means that every time we fetch a new extraction, it is more efficient than computing the top-1 from scratch.
The results of incremental Viterbi are fed into the rank-join algorithm, which is computed between each pair of joinable documents. Each document is represented as a set of possible extractions: it has its join keys, and the possible extractions are listed in decreasing order of probability. So this is the probability, and it's decreasing; and K is the extraction depth.
So the rank-join is computed between those two lists of extractions. What it does is fetch the next best extraction incrementally while it's computing the top-k join results, and as soon as it has computed the top-k join results, it stops the incremental inference and returns the results.
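A simplified sketch of the rank-join idea, in the style of the classic rank-join (HRJN) threshold test. This is not BayesStore's code: it materializes the two ranked lists for simplicity (the real algorithm pulls them lazily from incremental Viterbi), and it assumes the join score is some monotone combination of the two extraction probabilities.

```python
import heapq
from itertools import count

def rank_join(left, right, joinable, combine, k):
    """left, right: lists of (probability, extraction) sorted by decreasing
    probability.  `joinable(x, y)` tests the join condition (e.g. same
    company); `combine(p, q)` merges the two probabilities into a join
    score and must be monotone in both arguments.
    Returns the top-k join results while fetching as shallowly as possible."""
    results = []                  # min-heap of (score, tiebreak, lx, rx)
    tiebreak = count()
    li = ri = 0                   # extraction depth fetched on each side

    def consider(i, j):
        (lp, lx), (rp, rx) = left[i], right[j]
        if joinable(lx, rx):
            entry = (combine(lp, rp), next(tiebreak), lx, rx)
            if len(results) < k:
                heapq.heappush(results, entry)
            elif entry[0] > results[0][0]:
                heapq.heapreplace(results, entry)

    while li < len(left) or ri < len(right):
        if li < len(left) and (li <= ri or ri == len(right)):
            for j in range(ri):            # new left entry vs. fetched right
                consider(li, j)
            li += 1
        else:
            for i in range(li):            # new right entry vs. fetched left
                consider(i, ri)
            ri += 1
        # Threshold: upper bound on the score of any join result involving
        # an extraction we have not fetched yet.  Stop once k results beat it.
        if li and ri and len(results) == k:
            bound = 0.0
            if li < len(left):
                bound = max(bound, combine(left[li - 1][0], right[0][0]))
            if ri < len(right):
                bound = max(bound, combine(left[0][0], right[ri - 1][0]))
            if results[0][0] >= bound:
                break
    return sorted(results, key=lambda e: -e[0])
```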
>>: So your output format has only the token labels, right? [inaudible].
>> Daisy Zhe Wang: Yes.
>>: If I want to say [inaudible] which is probably the most natural use [inaudible] I want Big
Apple as a single segment [inaudible].
>> Daisy Zhe Wang: So it's -- so what you have is apple as a city, Big Apple as a city. And there
is a way to [inaudible].
>>: Don't label, right? So I don't just label big city and apple city [inaudible].
>> Daisy Zhe Wang: Yes. So when you want to extract city as segments, there is --
>>: [inaudible].
>> Daisy Zhe Wang: Yeah, there is SQL for it; I forget what I actually used. But you can concatenate adjacent tokens with the same label and put them into the same segment.
>>: [inaudible].
>> Daisy Zhe Wang: Yeah, I agree -- you can probably return segments. I forget the name of the operator that I used, but you can still do that.
So the rank-join algorithm only joins between two individual documents. But if you want to join two sets of documents, then you have to run a set of rank-joins simultaneously, between one outer document and a set of joinable inner documents at the same time. There are details on how to share computation, how to maintain state and so on. So this is a new algorithm based on the rank-join algorithm.
So that was a detailed look at one type of probabilistic query. The reason we study probabilistic queries is to look beyond the top-1 extraction, at how to query over the full distribution of probabilistic data.
This is an evaluation of how these probabilistic queries can improve answer quality. The query is a probabilistic selection, and the corpus we run on is 200 hand-labeled signature blocks from the Enron corpus. Since it's hand labeled, we have the ground truth. We are comparing deterministic selection over the top-1 extraction with the result of probabilistic selection.
So the query is a number of selection conditions over different attributes. Basically, for example, we want all the articles where company equals apple. And we are measuring the false negative rate, meaning the missing results, the missing errors. Comparing this baseline with the probabilistic selection, we see that false negative errors can be reduced by as much as 80 percent.
>>: [inaudible].
>> Daisy Zhe Wang: It's because the standard approach only queries over the top-1 extraction. If we are able to query over more than the top-1 -- over the top-k extractions or the whole distribution -- while still using the threshold, probability bigger than T, to control the false positives, then we get better answers. So in this experiment we show that we can reduce the false negatives while keeping the false positives roughly the same.
>>: [inaudible] to go beyond top-1?
>> Daisy Zhe Wang: The reason why it's -- no. So the standard approach -- I'm comparing to the standard approach, where only the top-1 --
>>: [inaudible].
>> Daisy Zhe Wang: Sure. But the reason is you don't know which k you should be storing. And I think -- yeah, the next example shows that we have to query over the whole distribution instead of just the top-k.
>>: I mean, it's sort top-1 [inaudible] do the same comparison as top ten that's [inaudible].
>> Daisy Zhe Wang: Yeah. So there are two aspects. One is you don't know what k you should pick for different kinds of queries. Second, you might need the full distribution, as I will show you in the next example.
>>: [inaudible] is these queries, I guess. Are they hand picked? I mean, they can be rare queries,
which doesn't make too much sense.
>> Daisy Zhe Wang: So the queries we picked here are on the company attribute, over different company entities. We picked all the entities that exist in this corpus, so basically company equals a number of different values, and these values are picked from what exists in the corpus. So it's a set of queries that we computed, and we aggregate the false negative values here.
>>: So it's just over [inaudible] it's not like the example where you can just kind of join and stuff.
>> Daisy Zhe Wang: That was probabilistic selection, yes. For the probabilistic join here, we are also comparing the baseline of a join over the deterministic top-1 against the probabilistic join, and the query is to find pairs of signature blocks that are from the same company. The x axis increases from one to 30, looking more and more beyond the top-1 extraction, and we are measuring the false negative rate. As we can see, as the probabilistic join looks more and more beyond the top-1 extraction into the full distribution, it reduces the false negatives from .4 to .25, roughly a 30 percent drop in false negatives, while the false positives remain roughly the same.
>>: What is the false positive rate?
>> Daisy Zhe Wang: False positive means that you --
>>: No, I know what it is. What's the number?
>> Daisy Zhe Wang: I think it's a pretty decent number, around 90, I think. I don't remember. But I think it's a pretty decent number, and it doesn't change whether we use probabilistic selection or not.
So in the last part we talked about how to compute top-k results for probabilistic queries. This example computes a marginal distribution over a probabilistic aggregation. The aggregation in this query computes the distribution of the number of companies that we can extract from the articles that also contain the keyword apple.
So this is the query plan. We first do an exact join to find all the articles that contain the keyword apple. And the documents are represented as a distribution of possible extractions --
>>: [inaudible] English, you want to find pairs of documents, let's say one of them has a token called document [inaudible].
>> Daisy Zhe Wang: No, I want to --
>>: [inaudible] label called company?
>> Daisy Zhe Wang: So because we are representing this as a TokenTable, right, it's basically -- it just gives me all the documents that contain apple.
>>: [inaudible].
>> Daisy Zhe Wang: Yes, yes. So this is just a self-join, say give me all the -- first compute --
>>: [inaudible] join the reconstructor.
>> Daisy Zhe Wang: Yes. So we first compute the set of documents that contain the keyword apple, and they are probabilistic distributions of possible extractions; these then go through probabilistic selection and count. And the result is basically a histogram of possible counts -- it's a count distribution. So we cannot answer this query using the Viterbi algorithm, because Viterbi computes the top-k extractions.
So the inference algorithm that we used is a Markov chain Monte Carlo inference algorithm called the Gibbs sampler, which basically computes a small subset of possible extractions sampled according to the full distribution. This small subset of extractions goes through the probabilistic selection and count, which results in a set of counts that are then aggregated into a count histogram.
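A toy sketch of the sampling idea over a linear chain (the actual BayesStore sampler is implemented in SQL with window functions; `potential`, the burn-in value and the "company" label here are assumptions for illustration):

```python
import random
from collections import Counter

def gibbs_count_histogram(tokens, labels, potential, num_samples, burn_in=100):
    """Estimate the distribution of COUNT(label == 'company') by Gibbs
    sampling over a linear-chain model.  `potential(i, prev, y)` is the
    non-negative factor value for labeling token i with y given the
    previous label `prev` (prev is None at position 0)."""
    T = len(tokens)
    state = [random.choice(labels) for _ in range(T)]
    counts = Counter()

    def resample(i):
        # Gibbs step: P(y_i | rest) is proportional to the product of the
        # factors touching position i (the one to its left and right).
        prev = state[i - 1] if i > 0 else None
        weights = []
        for y in labels:
            w = potential(i, prev, y)
            if i + 1 < T:
                w *= potential(i + 1, y, state[i + 1])
            weights.append(w)
        state[i] = random.choices(labels, weights=weights)[0]

    for it in range(burn_in + num_samples):
        for i in range(T):
            resample(i)
        if it >= burn_in:
            counts[sum(1 for y in state if y == "company")] += 1

    total = sum(counts.values())
    return {c: n / total for c, n in sorted(counts.items())}
```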
As we see in this example, different inference algorithms are needed to compute different probabilistic queries. So this is the Gibbs sampler, and we implemented it efficiently inside the database using window functions. In fact, I implemented four inference algorithms, including the Viterbi algorithm, sum-product, MCMC Gibbs sampling, and MCMC Metropolis-Hastings. And the stars show which inference algorithm can be applied to which type of query.
It can also be noted that the same type of query, for example a marginal over a probabilistic join, can be answered by different inference algorithms, including sum-product and the two MCMC algorithms. But there is a tradeoff between accuracy and runtime, because sum-product is an exact algorithm but it cannot deal with general graphs, for example.
So those are the algorithms; they solve the problem of information loss, because we can query over the probability distribution. The next section deals with the performance problem and covers a set of techniques to scale up these algorithms.
The gist of -- the main intuition behind these scale-up techniques is to replace exhaustive extraction with query-driven extraction. The traditional approach relies on exhaustive extraction because the analysis is done outside of the database, agnostic of what the query is asking for, so you have to exhaustively compute all possible extractions over all the documents.
But if you imagine you have a BayesStore system, an integrated solution with both a query engine and an extraction engine, then at extraction time you know what the query -- what the user -- is asking for. So you can prune away a lot of unrelated information at extraction time. That's what we rely on. The query here is very similar: extract the companies in the articles that also mention apple as a company.
So there are three techniques. The first one is very simple: an inverted index. Because we have the documents inside of the database and do the extraction inside of the database, we have the luxury of throwing away a bunch of documents that don't mention the keyword in the query. The second technique is called minimizing the model. When the graphical model is rendered over the entire corpus, it's a huge model, but the query we are asking only touches a very small subset of the random variables inside of this graphical model. So we want to prune away the rest of the graphical model that is unrelated to these query nodes. This is called minimizing the model.
The third part is the general notion of pushing relational operators such as joins and selections into the inference process. For answering this query, we have an algorithm called early-stopping Viterbi, which pushes the probabilistic selection into the Viterbi inference algorithm.
In a little bit more detail: for the inverted index, we basically take this token equals apple condition and use an inverted index to quickly select the set of documents that include this keyword. That's very simple. The second technique is minimizing the model.
So, again, the key insight is that the query nodes are only a very small subset, so we want to prune away the large parts of the graphical model that are unrelated. For example, in this Bayesian network suppose that node A is the query node, the one for which we want to compute the distribution or the top-k, and B is a random variable that is observed or has evidence. Then, according to the Bayesian network, node A is independent of nodes C, D, and E given B. So we can prune away the rest of the graphical model and only run inference over these two random variables.
Very similarly, for the CRF model that I study, the CRF is rendered over each sentence independently within a single document, so between sentences there are no correlations. If you want to ask a query over one specific label in one sentence, then you can safely prune away the rest of the sentences in the document. I implement this model minimization using graph traversal -- a transitive closure over the graph -- in SQL.
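A hedged sketch of the model-minimization idea (the real system does this with a transitive-closure query in SQL; this Python version only prunes by connectivity, which is exactly the per-sentence CRF case -- the Bayesian-network example additionally exploits conditional independence given evidence):

```python
from collections import defaultdict, deque

def minimize_model(factors, query_nodes):
    """factors: list of (factor_id, variable_ids) pairs describing the
    instantiated graphical model.  Keep only the factors connected, through
    shared variables, to some query node; for a per-sentence linear-chain
    CRF this keeps exactly the sentences that contain a query node."""
    factors = list(factors)
    factor_vars = {fid: set(vars_) for fid, vars_ in factors}
    var_to_factors = defaultdict(set)
    for fid, vars_ in factors:
        for v in vars_:
            var_to_factors[v].add(fid)

    kept = set()
    seen_vars = set(query_nodes)
    frontier = deque(query_nodes)
    while frontier:
        v = frontier.popleft()
        for fid in var_to_factors[v]:
            if fid in kept:
                continue
            kept.add(fid)                    # factor touches a reachable variable
            for u in factor_vars[fid]:
                if u not in seen_vars:
                    seen_vars.add(u)
                    frontier.append(u)
    return [(fid, factor_vars[fid]) for fid in kept]
```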
The third algorithm is the Viterbi early-stopping algorithm, which takes the probabilistic selection and pushes it into the Viterbi algorithm as it computes this two-dimensional matrix. These are the tokens in a sentence -- Big, Apple, lands, year, 14, Super Bowl -- and these are the possible labels.
So as you're computing the candidate partial segmentations, denoted by these arrows, at position 2 you see that none of the candidate partial segmentations points back to labeling apple as the company; all of them say that Big Apple is labeled as the city. So you can stop the inference there and toss this document away, because it doesn't satisfy the probabilistic condition that the label has to be company, without doing the full inference.
With these scale-up techniques -- this is an evaluation over the New York Times dataset, which is 7 gigabytes, a million New York Times articles. If we run one of the queries from a previous slide without the inverted index or any of these scale-up techniques, by exhaustively computing all the possible extractions and querying on top of them, then we need around 13 hours over the whole corpus.
>>: [inaudible].
>> Daisy Zhe Wang: Yes, it will. But when you do query-driven extraction, you can also maintain the results of the previous queries that you executed, just like view maintenance. So basically you don't have to do the work you're not asked for.
So if we add in the scale-up techniques that we talked about earlier -- this graph shows the runtime as we add these techniques one at a time. If we use the inverted index to filter away all the documents that don't contain the keyword apple, then the runtime drops from 13 hours to 14 minutes. If we add in minimizing the model, to prune away the parts of the graphical model that are not related to the query nodes, then it drops from 14 minutes to 1.4 minutes, another factor of 10. And if we further add the early-stopping algorithm, then the runtime drops from 1.4 minutes to 16 seconds.
So we can query and get answers over a large dataset in a matter of seconds. These techniques are very effective at pruning away the information unrelated to the queries.
This graph shows different kinds of queries with different selectivities. On the left-hand side is higher selectivity; on the right-hand side is lower selectivity. The graph shows that if the selectivity is high, the speedup is more significant, while otherwise the speedup is smaller. This -- yep.
>>: So is this a fair comparison? This 13-hour runtime only needs to be run once, right? [inaudible] can be used by all the subsequent queries. So if you compare this to one query [inaudible], and your techniques -- the inverted index and maybe even the early stopping -- maybe they can also be applied on top of the extractions. Meaning you first do all the extractions using these 13 hours, and then you apply these techniques over the fully extracted [inaudible].
>> Daisy Zhe Wang: Yeah, I agree. So these are two modes of computation. One is precomputation, batch oriented, computing everything; the other is query driven. And we argue that there are scenarios for both modes of extraction. For example, in some cases you don't have full access to the entire document set -- for example, legal documents -- so you kind of have to do query-driven extraction, because the number of documents that you can get out is limited, restricted.
Or small companies don't have the computational power to preprocess all the documents, and they want to do pointed queries. For example, again, you have the e-mail of the whole corporation, but you are doing some investigation and you only want the documents that are specific to David, for example. You don't want to do all the extraction; you only want it for David; you don't want to process everything.
So there are scenarios where you do want to do this exploratory, iterative, interactive, query-driven extraction rather than precomputing everything. But I agree that precomputation does have its application areas.
So this is the -- I don't know if I have time. It's already --
>> Christian Konig: We started a little late.
>> Daisy Zhe Wang: Okay. So this is one optimization technique that we submitted and that was accepted in a recent SIGMOD paper. It's called hybrid inference. I will just go over at a very high level what it is.
So as you already know, I implemented different inference algorithms inside of the database, and a single query can actually be answered by different inference algorithms. Hybrid inference basically tries, at runtime, to figure out which inference algorithm is appropriate for which subset of the documents and applies it accordingly, according to the structural properties.
What do I mean by the structural properties of the graphical model? We start out with two documents. When we apply a join on this pair of documents, it can result in a tree-shaped graphical model, and if we do inference on it, we should use the sum-product algorithm, because it is exact and it is polynomial.
On the other hand, if we change the first document a little bit, the join on top of it actually results in a cyclic graph, for which we should use an MCMC algorithm. So hybrid inference basically, after the model is instantiated, forks into two execution paths, deciding whether the structure is a tree or a cyclic graph, uses different inference algorithms accordingly, and then unions the results in the end.
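A minimal sketch of the dispatch logic (assuming `sum_product` and `gibbs` inference routines exist elsewhere; the cycle check here is a simple union-find over the model's variable-to-variable edges, which is one of several ways this could be detected):

```python
def has_cycle(num_vars, edges):
    """edges: iterable of (u, v) pairs between variable ids 0..num_vars-1.
    Returns True if the undirected graphical model contains a cycle."""
    parent = list(range(num_vars))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return True     # joining two already-connected nodes => cycle
        parent[ru] = rv
    return False

def hybrid_inference(model, sum_product, gibbs):
    """model: object with .num_vars and .edges.  Fork per instantiated model:
    exact sum-product on tree-shaped models, MCMC (Gibbs) otherwise,
    mirroring the two execution paths described in the talk."""
    if has_cycle(model.num_vars, model.edges):
        return gibbs(model)
    return sum_product(model)
```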
Using this kind of hybrid inference to optimize the query execution, we find that it is five to ten times more efficient than monolithically applying the general inference algorithm over the entire corpus.
As a conclusion: I have described a number of applications that require probabilistic data analysis. I have described BayesStore, a probabilistic database system that efficiently implements graphical models and inference algorithms. I described new algorithms for answering probabilistic queries that improve query answers, reducing false negatives by as much as 80 percent. And finally, I talked about a number of scale-up techniques that can achieve orders-of-magnitude speedups for interactive queries.
There is a number of related works I'm not going into. I also had other collaborations with different research labs during my Ph.D., beyond my thesis work, but this is related.
So the inference algorithms that I described, like the Viterbi algorithm and so on, are incorporated into the MADlib library, a collaboration with EMC. MADlib is basically a statistical data analysis library. It's open source; you can check it out. It does text analysis, clustering, classification and so on -- in-database statistical analysis. And I also collaborated with IBM Almaden and Google on various projects.
For future work, I'm really excited about the domain that I'm in, which is trying to use advanced statistical methods to analyze large amounts of data. I think there are a lot of challenges in this domain, both in building the infrastructure and systems to deal with this large-scale data analysis, and in looking at different applications, in social sciences and in health informatics, to apply these algorithms or even invent new ones.
And finally, I think there is a need for a high-level language, interface or visualization for people to manipulate these statistical models and apply them to large amounts of data.
And with that -- this is a pointer to the BayesStore project where you can get more detail, and this is a pointer to the MADlib library, which is an open-source, in-database statistical library.
And that's all. Thank you very much.
[applause].
>> Christian Konig: Questions?
>>: So I have a very practical question. You have this information extraction as your main driving application. I personally have used a few of these open source information extraction tools, and they only output their most confident extraction result; seldom do they output probabilities, and it's even more rare to see them output a model, a graphical model, in their output. So how do you see making this [inaudible] database so that you can utilize these graphical models?
>> Daisy Zhe Wang: So these models do -- it depends on which information extraction toolkit you use. There is rule-based information extraction, which doesn't generate probabilities, and there are tools from the machine learning community that are probabilistic; you're just not seeing the probabilities because the toolkit authors don't think people are going to use them. The probabilities are inherent in these probabilistic models; they are not generating them because their mindset is that the output will be put into a database for people to query, so there is no need to generate probabilities or uncertainties.
So the probabilities are there for this class of models. Now, whether these probabilities are important -- I did have an example earlier of why these uncertainties are important, because the top-1 extraction might not be correct, and you want to involve humans or statisticians to look into the uncertainties. For example, I was talking to a national security person who really wants the uncertainties in the text, because the top-1 doesn't capture what they are looking for, and they want to combine weak evidences into an alarm and so on. So for these cases you do want alternatives rather than just the top-1.
>>: I guess the question was really what motivates these [inaudible] to expose their underlying models, which may be built on a lot of training data they have, to the outside world.
>> Daisy Zhe Wang: I think we have to come up with the -- so I think right now there is really no killer app that utilizes these probabilities and generates value out of them. And part of it is a chicken-and-egg problem, because there are no tools. So it's partly about coming up with a killer app and partly about having the tools to deal with them.
>>: So a very simple example: you're looking for a match -- trying to match against a city name that's in a document, and you only want to match documents where the probability of the city name being correct is greater than .8 from the extraction. So my understanding is that to get that, the graphical model will probably give you the sequence of things, and to get the probability of an individual entity or subsequence being extracted, you've got to compute the sum [inaudible]. So my understanding is that in your current [inaudible] you deal with something like that. So the question is, during the process of that sort of forward/backward-like thing, you could introduce any number of positive or negative constraints, right?
>> Daisy Zhe Wang: Right.
>>: Is that -- are those -- is the ability to introduce constraints reflected in the query language?
>> Daisy Zhe Wang: So the probabilistic selection actually is a conditional Viterbi. It does -- so in a probabilistic selection, if you say that --
>>: If I want -- if I have specific requirements -- so when you're summing over all these different
paths to compute the probability in C, you could constrain [inaudible].
>> Daisy Zhe Wang: Yes. You can specify evidence on the labels. So in the TokenTable there is a probabilistic column called label. Before, it was all empty -- it needed to be inferred; there was no evidence in it. But if you want to pose some constraints, say that Obama is always a person, for example, then you can do that by specifying the labels.
>>: Or, for example, I want to view this probabilistic matching but I want to ignore sequences
where the previous thing was posted.
>> Daisy Zhe Wang: Yeah, if you have more general constraints, then there has to be a better way
to do that.
>>: Is there a way in the query language currently to express these constraints?
>> Daisy Zhe Wang: Not in general, but there are ways to do simple things. Like I said, if you want to fix that a certain token is labeled in a certain way, you can do that. But more generally -- so, for example, if you want David to always be a person, or just this occurrence of David to be a person, and this occurrence of Big Apple to be a city, you can do that.
>>: Seems like there's a rich space of possible constraints you could describe in the process from
the query that there's a particular point you could stick it in in this match. I'm thinking of, for
example, the work of Aaron Cubana [phonetic], right?
>> Daisy Zhe Wang: Yes, yes. That's part of our related work, and we had that in mind, but in that paper the kind of constraint is that the user specifies that a particular label should be labeled a particular way. So that's the kind of constraint we do support. But more general constraints -- like, I don't know, before an adjective there always has to be a noun, or after an adjective -- those are more like first-order constraints. We do support what is supported in that paper, which is the evidence.
>>: So what happens if you go from this Markov model to the semi-Markov models? How much of your system has to change? Because I saw most of the examples exploited the Markov dependence, just one token and really on the tokens and so forth.
>> Daisy Zhe Wang: How do I extend to HMF?
>>: Markov model.
>> Daisy Zhe Wang: Markov model.
>>: It's a segment not a token. Seems like the models determine.
>> Daisy Zhe Wang: I read the semi-Markov paper. I think their inference is also an extension of the Viterbi algorithm, and I don't quite recall what the differences are there. I think what they said is that their computation is similar to Viterbi -- it's a variation of the Viterbi algorithm. I didn't look very closely at that.
>> Christian Konig: Okay. Thank you.
[Applause]