>> Tao Cheng: All right. So good morning. Thanks everyone for coming to Jian's talk. Probably
many of you already know Jian, because Jian was an intern here last summer.
So today Jian will talk about his thesis work. Jian has been doing research on probabilistic
databases as well as a number of algorithmic problems, and he has been publishing very well in top
database conferences as well as algorithmic conferences, and I think he won the 2009 VLDB Best
Paper award. Without further ado, let's welcome Jian.
Jian LI: Thanks for the introduction. Good morning, everybody.
So the title of my talk is decision-making under uncertainty.
Again, I'm Jian Li. I'm a student at the University of Maryland. Can you hear me? So I'm a
student at the University of Maryland.
So before defining my problems, I'm going to motivate them a little bit. So today, uncertain
data is everywhere. It can be generated by many different systems, like data integration
systems or information extraction systems, or it can arise for a variety of reasons, for
example noise, errors, and so on.
So let me name a few applications here. This is a data integration system. We have two
different data sources, and we see two tuples here which have the same number but
different names.
So when we try to integrate them into a single table, we don't know which piece of information to
choose. We may as well just put the two tuples together and associate a probability of 0.5 with
each tuple.
So another application here is a sensor network application. Sensor measurements are
inaccurate, so what we do is use a Gaussian random variable to model each sensor measurement.
Just a little notation here. So in the first example, the existence of a particular tuple is
uncertain, so we call this tuple uncertainty.
But in this example, the existence of each tuple is certain, while the attribute values are
uncertain, modeled as random variables. We call this attribute uncertainty.
So another application here is a social network application. We extract the information
from the [inaudible], and we conclude some friendship links, which are sometimes uncertain
because we extract this information from conversations or from tags on pictures.
And another very interesting application I want to talk about is kidney
exchange programs. We have a set of patients and a set of donors, kidney donors, and we want
to find matches between the donors and the patients.
And what we have to do to determine a match between a patient and a donor is to run a
cross-match test. This test is actually very expensive and time-consuming.
So ideally, if we want to maximize the number of matches, we would have to run this cross-match
test between all pairs of donors and patients.
But actually, what we can do is estimate the success probability by using some
easy-to-get information, such as blood type; we can estimate the probability that these
two persons are a match.
So we can use this information to minimize the number of cross-match tests and still get a very
good matching.
So this is where we can also use this probabilistic information.
So another big source of uncertain data is future data. Future data is inherently uncertain.
For example, tomorrow's stock prices are uncertain.
So because of this increasing volume of uncertain data generated every day, there is an increasing
need to analyze and reason over such data.
Actually, making decisions under uncertainty is not a new area. It's a very old and
broad area involving many different players. For example, economics and finance
study problems such as how to invest given an uncertain future; probability
theory and statistics give the mathematical foundation to deal with such problems;
psychology studies people's behavior in the face of an uncertain future; and, last but not least,
computer scientists focus on the computational aspects of decision-making under
uncertainty.
Okay. So actually, many different subareas in computer science have an interest in dealing with
uncertain data, for example networking and machine learning. But in this talk I'm going to stick
to two topics.
One is probabilistic databases. In this domain, researchers incorporate probabilities into
relational tables, and people study the semantics and algorithms for relational queries such as
SQL queries, top-k queries, and [inaudible] queries. In this talk I'll talk about ranking and
top-k queries over probabilistic databases.
The other area I will talk about is stochastic optimization for combinatorial problems. There
is a number of very well-studied models, for example two-stage stochastic
optimization or online stochastic optimization.
But in this talk I'm going to define a new class of problems, which we call expected utility
maximization problems; I'll motivate this problem a little bit later.
And if time allows, I'm going to talk about the stochastic matching problem, which has
interesting applications in online dating and also in the kidney exchange program I just talked about.
Okay. So if you have any questions, just feel free to interrupt me. Let me just review a little
bit about probabilistic databases.
The general goal is to manage probabilistic data and support a declarative query language and
query processing.
Currently there are actually many probabilistic database systems; we can see there are quite
a number of them.
And the problem I'm going to talk about belongs to a project developed at the University of
Maryland.
Our general goal is, again, to support ranking and top-k queries over probabilistic databases,
which has a number of applications.
So first I'm going to talk about the possible worlds semantics, which is the most popular
semantics for query processing over probabilistic databases. For example, we have this
probabilistic table here. We have three tuples, and with each tuple we associate a
probability. So we assume this is the tuple uncertainty model.
With probability 0.2 this tuple exists; with probability 0.8 this tuple doesn't exist.
We assume all the tuples are independent of each other. So, again, this single probabilistic
table actually corresponds to a class of deterministic tables. Each deterministic table can be
thought of as an outcome of this probabilistic table.
We call each deterministic table a possible world. I'm going to use this term, possible world.
So basically, a probabilistic table can be thought of as a probability distribution over possible worlds.
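To make the possible worlds semantics concrete, here is a minimal Python sketch. The table values are hypothetical, not the ones on the slide; it enumerates the 2^n possible worlds of an independent-tuple table together with their probabilities.

```python
from itertools import product

def possible_worlds(tuples):
    """Enumerate all possible worlds of an independent-tuple
    probabilistic table, each paired with its probability.
    `tuples` is a list of (name, prob) pairs; each tuple exists
    independently with its probability."""
    worlds = []
    for mask in product([False, True], repeat=len(tuples)):
        prob = 1.0
        world = []
        for (name, p), present in zip(tuples, mask):
            if present:
                prob *= p
                world.append(name)
            else:
                prob *= 1 - p
        worlds.append((frozenset(world), prob))
    return worlds

# A hypothetical 3-tuple table (made-up probabilities).
table = [("t1", 0.2), ("t2", 0.5), ("t3", 0.9)]
worlds = possible_worlds(table)
print(len(worlds))                   # 8 possible worlds
print(sum(p for _, p in worlds))     # the probabilities sum to 1
```

The empty world here has probability 0.8 * 0.5 * 0.1 = 0.04, matching the product rule for independent tuples.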
Okay. So if we do query processing over such a probabilistic database, it's equivalent to issuing
the same query over this class of possible worlds.
Suppose we are trying to answer a top-k query, say a top-1 query: which tuple do we return to
the user? What we do is issue the top-1 query for each possible world.
We see that we actually return different answers for different possible worlds. So what we get is a
probability distribution over different answers.
But the user only wants one tuple. So which one should we return? We have to have some
systematic way to combine those different answers, which form a probability distribution over
answers.
Okay. So, again, just one thing to note here: for simplicity, we have a score value which is used
to rank the tuples in the deterministic setting.
Now, this is actually not an easy problem, and there is a number of prior works on it. The
simplest one is just to return the k tuples with the highest expected score, which is defined to be
the score times the probability.
And researchers have found this is actually not a very good criterion for top-k, and later on
Soliman et al. proposed two different semantics, U-Top-k and U-Rank. U-Top-k returns the most
probable top-k answer. And U-Rank, for each position i, returns the tuple with the largest
probability of being ranked at position i.
Another one is the probabilistic threshold top-k (PT-k) answer: they look at the probability
that the rank of a tuple is at most k, which means this tuple is in the top-k answer, and they
return the k tuples with the largest probability of being in the top-k answer.
There are a few others. So what I want to say here is that we have many different top-k
semantics, and each of them seems very natural.
You see phrases like "most probable" and "largest probability". So which one are we going to use?
In order to address this question, let me give you a very simple example that answers the
question of which one we should use.
So let's take the probabilistic threshold top-k answer, for example. We return the k tuples
with the largest probability of being in the top-k answer. This is that probability, this
is the table we have with the probability of existence, and this table corresponds to eight possible
worlds.
Each possible world corresponds to a subset of the tuples, and we have to compute this
probability value for each tuple. So we have three tuples, and we compute the probability value here.
For example, for t3, suppose k equals 2, okay? K equals 2. Let's compute the
probability for t3. This probability comes from these three possible worlds: in these three
possible worlds, t3 is a top-2 answer. So we sum up their probabilities, we get this probability,
and we rank according to this probability.
So, again, one question is raised. In order to compute this probability, we have to sum
up the probabilities of possible worlds. But if there are many tuples, the number of possible
worlds is exponential, so we need a more efficient algorithm to compute this probability, right?
So this is the algorithmic challenge here. And let's return to the question: there are many top-k
ranking functions, so which one do we use?
In order to answer this question, let's use the normalized Kendall distance between two
top-k answers to measure the dissimilarity of two top-k answers.
If all those ranking functions return very similar top-k answers, we may as well use any of them.
But if they return very different answers, then there must be some other problem. So we use this
Kendall distance between two top-k answers. Basically, this distance is between 0 and 1: if it's 0,
it means the two answers are the same; if it's 1, it means the two answers are disjoint.
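A simplified set-based version of this distance, normalized so that identical top-k answers give 0 and disjoint ones give 1, can be sketched as follows. The full normalized Kendall distance used in the talk also penalizes ordering disagreements, which this proxy ignores.

```python
def topk_set_distance(a, b, k):
    """Normalized set-based distance between two top-k answers:
    0 when the answers are identical, 1 when they are disjoint.
    (A simplified proxy for the normalized Kendall distance.)"""
    a, b = set(a), set(b)
    assert len(a) == k and len(b) == k
    # symmetric difference has size 0 (identical) up to 2k (disjoint)
    return len(a ^ b) / (2 * k)

print(topk_set_distance(["t1", "t2"], ["t1", "t2"], 2))  # 0.0
print(topk_set_distance(["t1", "t2"], ["t3", "t4"], 2))  # 1.0
print(topk_set_distance(["t1", "t2"], ["t1", "t3"], 2))  # 0.5
```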
So let's look at these two tables here. They are computed from two different datasets; this one is
from a real dataset which has one million tuples. Those are the different ranking functions
listed on the previous slides, and each number here is the Kendall distance between two top-k
answers, the first one obtained by using this ranking function and the other obtained by using
that ranking function. A small number means the two top-k answers are very similar. We can see
those numbers are very close to one, which means the answers are pretty much disjoint.
Now let's see this table here, which is quite different from the first one. We can see that
expected rank is quite close to expected score; this number is very small.
But both of them are quite different from the rest of the ranking functions. So we conclude
that those ranking functions actually return drastically different answers. So there must be
something wrong.
So what do we do? Our proposal is this: different users may have different preferences, and
these different preferences may imply different ranking functions.
Therefore, we define parameterized ranking functions. We identify two classes of
parameterized ranking functions, PRFW and PRFE, which are controlled by parameters; by changing
the parameters we can capture different user preferences.
And also, these parameterized functions generalize most of the previous ranking functions.
We'll see why that is later.
Okay. So, again, one thing to note is that PRFE is more efficient to evaluate. So here's our
overall framework. We start from the user.
If the user knows which ranking function to use, the user has a particular preference. In
some cases this ranking function is a special case of the PRFW function, so we represent it just
as a PRFW function with a particular parameter.
In some cases the user has no idea which ranking function to use. Instead, we give this
user a small table and ask the user to rank this smaller table. We can then use a learning
algorithm to learn the parameters from this sample and derive a PRFW function with those
learned parameters.
Then we use this PRFW function to rank the table, to rank those tuples. And if we feel this
is not efficient enough, we can use PRFE to approximate it, since PRFE can be evaluated much
more efficiently.
Okay? Okay. So now let's see the formal definition of our parameterized ranking function, PRF.
This is the PRFW function. The parameter is a weight function, which maps a rank to a possibly
complex number. The rank is an integer: one, two, three, four, five.
The PRF value of a particular tuple is defined to be the linear combination of the weight
function values multiplied by the positional probabilities. So what's a positional probability?
It is the probability that the rank of this particular tuple is i.
So this is a probability, okay? We just use this linear function. The second
parameterized function is PRFE, which is a special case of PRFW.
The weight here is an exponential function with a parameter alpha, which can be a real number
or a complex number.
Okay. So this is again a linear sum with a weight function that is exponential.
Then we compute these PRF values for each tuple, and we rank the tuples by this value,
highest first. Since it can be a complex number, we take the absolute value here.
Let's see how this generalizes some other ranking functions. For example, if the weight
function is identically one, then the sum over all the positional probabilities is just the
existence probability.
So we just rank the tuples by the existence probability. That's a very special case.
Another example here: if the weight is one when the rank is between one and k, and zero
otherwise, then the PRF value is the probability that the rank of the tuple is at most k, which
means the tuple is in the top-k. This is the PT-k semantics we just discussed, which ranks the
tuples by the probability of being in the top-k.
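The definition and these special cases can be sketched in a few lines of Python. The positional probabilities below are hypothetical inputs; computing them efficiently is the subject of the next part of the talk.

```python
def prf_score(pos_probs, weight):
    """PRFW value of one tuple: sum_i weight(i) * Pr[rank = i],
    where pos_probs[i] = Pr[rank = i + 1] (ranks are 1-based)."""
    return sum(weight(i + 1) * p for i, p in enumerate(pos_probs))

# Hypothetical positional probabilities for one tuple.
pos = [0.1, 0.3, 0.2]   # Pr[rank=1], Pr[rank=2], Pr[rank=3]

# weight(i) = 1  ->  the existence probability (~0.6 here)
print(prf_score(pos, lambda i: 1))

# weight(i) = 1 if i <= k else 0  ->  PT-k, Pr[tuple in top-k] (~0.4)
k = 2
print(prf_score(pos, lambda i: 1 if i <= k else 0))

# PRFE: weight(i) = alpha**i for a single real (or complex) alpha
alpha = 0.5
print(prf_score(pos, lambda i: alpha ** i))   # ~0.15
```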
It also generalizes other ranking functions; I'm not going into the details here.
So now let's see our algorithm to evaluate this. One thing to recall: we cannot evaluate it by
listing all the possible worlds, because there may be an exponential number of possible worlds.
So instead of discussing the --
>>: Can I go back to the earlier point? Because the weight function is a function of i alone,
the rank, how do you do anything that's a function of the score?
Jian LI: We can incorporate the score, because this weight function can be a function of both
the score and the rank. This doesn't affect the evaluation algorithm; it's not going to be a
problem, but just for simplicity I didn't discuss it.
Okay. So instead of discussing how to compute the PRF function directly, I'm going to propose a
much more general framework. This more general framework can compute a variety of
different kinds of probabilities over a probabilistic database model which we call the
probabilistic and/or tree. This probabilistic and/or tree can capture two kinds of correlation:
one is mutual exclusivity and the other is co-existence.
Let's consider this example here. The leaves of the tree are the possible tuples. This is the
key of the tuple, and this is the score of the tuple. Key and score. Those are the tuples. And
these Xor nodes capture mutual exclusivity.
An Xor node means that at most one tuple can exist in its subtree. These are the
probabilities: with probability 0.5 this tuple exists, with probability 0.3 this
tuple exists, and with probability 0.2 nothing exists. And this And node here captures
co-existence: the results of its Xor children co-exist.
For example, take this possible world. What's the probability it occurs? It's the
probability that nothing in this subtree exists, which is this probability, times the
probability that nothing in this subtree exists, which is this probability, times the
probability that this tuple itself exists, which is this probability. In general, the
probabilistic and/or tree can have more than two levels.
And one thing to note is that this generalizes the x-tuple model, a widely used model
which can only capture mutual exclusivity.
Okay. So, to compute -- I said I would introduce a framework to compute a variety of
probabilities.
What we do is use the generating function method. In general, a generating function is a
polynomial, and we say this generating function generates a particular sequence if the
coefficients of the polynomial are the numbers of the sequence we want to generate.
So what we do is, for each leaf of the tree, we associate the leaf with some variable: x, y, or z.
This is the construction of the generating function; I'm going to tell you the result later.
For an And node, the generating function is the product of the generating functions of its
children.
For an Xor node, the generating function is this probability times this generating function,
plus this probability times this generating function, and so on, plus q, where q is the remaining
probability, the probability that nothing in this subtree exists.
We construct the generating function bottom-up, and at the root node we also have a
generating function whose variables are the variables we associated with the leaves. When we
expand this generating function, we get a sum of monomials; these are the coefficients and these
are the monomials.
And the theorem here is: the coefficient of the term x to the power i times y to the power j
(there might be more variables) is the probability of all the possible worlds that have exactly
i tuples associated with variable x and j tuples associated with variable y.
So this is the total probability of having this many x's and this many y's. Is this clear? Any
questions?
Okay. So let's look at an example here. Say we want to compute the probability distribution of
the size of the possible world.
What we do is associate all the leaves with the variable x, because we want to count the number
of tuples in the possible world.
Then we construct the generating function: this times this, plus this, plus 0.2, and so forth.
At the root we have this generating function; we expand it to standard form, and the
coefficients give the probability distribution of the size of the possible world.
Okay. Okay. So this is one simple application of the theorem. Now I'm going to talk about how to
compute the positional probabilities. If we can compute the positional probabilities, we can
evaluate the PRF function, right? Because the PRF function is defined to be a linear sum of the
positional probabilities.
Okay. So let's look at this example here. We want to compute the positional probability
for t4, and one thing to note is that the rank of a tuple is j if and only if j - 1 tuples
with higher score exist, and this tuple itself exists, right?
So, for example, for t4, the construction here is: we associate t4 with the variable y, and for
all the tuples which have a higher score, we associate the variable x. For the other tuples,
with lower score, we associate just the constant 1. Therefore, when we compute the positional
probabilities, the probability that t4 has rank j is going to be the coefficient of x to the
power j - 1 times y. Why is this? This means j - 1 tuples with higher score exist, and y means
the tuple itself exists.
Okay. So therefore we have this generating function with two variables; we expand this
generating function, we get those positional probabilities, and we can compute the PRF score.
And this can all be done in polynomial time for sure, and it's quite fast.
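For independent tuples, the positional probabilities reduce to a one-variable generating function over the higher-score tuples only, since the y-variable contributes just the factor p_t. A sketch with made-up values (the two-variable version in the talk also handles correlated tuples):

```python
def positional_probabilities(tuples, target):
    """Pr[rank(target) = j] for j = 1..  on an independent-tuple table.
    rank(t) = j iff t exists and exactly j-1 higher-score tuples exist,
    so we build the generating function over the higher-score tuples.
    `tuples` maps name -> (prob, score)."""
    p_t, s_t = tuples[target]
    poly = [1.0]   # coefficient of x^i = Pr[i higher-score tuples exist]
    for name, (p, s) in tuples.items():
        if name != target and s > s_t:
            # multiply poly by (1 - p + p*x)
            new = [0.0] * (len(poly) + 1)
            for i, c in enumerate(poly):
                new[i] += c * (1 - p)
                new[i + 1] += c * p
            poly = new
    # Pr[rank = j] = p_t * coefficient of x^(j-1)
    return [p_t * c for c in poly]

# Hypothetical table: name -> (existence probability, score).
table = {"t1": (0.2, 100), "t2": (0.5, 90), "t3": (0.9, 80)}
pp = positional_probabilities(table, "t3")
print(pp)   # [Pr(rank=1), Pr(rank=2), Pr(rank=3)] for t3
```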
We can do it in O(n^2) time, and we have some other results here. So this is a
summary of the results. We have the PRFW function for independent tuples.
Actually, in order to achieve all these results we need some tricks for the polynomial
multiplication, for which we can use the FFT algorithm, and we have to use other tricks to
further reduce the running time.
This is the result we get, n times (h + log n), where h is a parameter here. And this is the
previous result for a very special case; recall that U-Rank and PT-k are special cases of our
PRF function. Ours is actually a factor better than the previous one.
And for and/or trees -- this d is the height, the number of levels, of the tree -- the previous
results are only for special cases: U-Rank for x-tuples, and PT-k for x-tuples, where they got
n squared h. Our result is PRF for and/or trees, which is much more general than x-tuples, and
when d is small it is almost linear, right? Theirs is quadratic; we get near-linear.
For PRFE, we can actually have more efficient algorithms: basically, for independent tuples we
have n log n, and for and/or trees we have n(d + log n).
So for x-tuples, where d equals 2, we have basically n log n, which is the same as ranking a
deterministic table.
>>: And n is the number of tuples?
Jian LI: Yeah.
>>: What is h?
Jian LI: h is a parameter here. I may have forgotten to say that. Recall that for the PRFW
function we have a weight function which maps a rank to a number. h is the parameter such
that if the rank is larger than h, the weight function is 0.
I forgot to say that, sorry. So from the last slides we know that the PRFE function is
particularly easy to evaluate; it can be evaluated in about n log n time. So presumably we
would like to rank using PRFE.
But what if the user really wants to rank using PRFW? What we can do is approximate PRFW by a
linear combination of PRFE functions.
So suppose we can write the weight function as a linear combination of exponential functions.
This is the weight function, written as a linear combination of exponentials, and this is the
definition of PRFW, which is a linear combination of the weight function values.
If you plug this in, you see that this is actually a linear combination of PRFE functions: in
the brackets we have a PRFE function whose weight is an exponential function.
Therefore we can evaluate this efficiently. Previously we needed n squared time to evaluate
this; now, if the weight function decomposes into L exponentials, we have L PRFE functions to
evaluate and sum up.
Right? Okay. So the only thing we need to do is decompose the weight function into a linear
combination of exponential functions. In order to do this, we use an analog of the discrete
Fourier transform, with some modifications to achieve this goal, and it works pretty well for
reasonably smooth weight functions.
Okay.
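Here is a rough sketch of the decomposition step, using the plain inverse discrete Fourier transform rather than the modified variant mentioned in the talk; the weight values below are a hypothetical top-k indicator, and the alphas come out as complex roots of unity:

```python
import cmath

def decompose_weight(w):
    """Write a weight function, given by its values w[0..H-1], as a
    linear combination of exponentials:
        w[i] = sum_l c[l] * alpha[l]**i,  alpha[l] = exp(2*pi*1j*l/H).
    This is exactly an inverse discrete Fourier transform."""
    H = len(w)
    alphas = [cmath.exp(2j * cmath.pi * l / H) for l in range(H)]
    coeffs = [sum(w[i] * cmath.exp(-2j * cmath.pi * l * i / H)
                  for i in range(H)) / H
              for l in range(H)]
    return coeffs, alphas

# Hypothetical step weight: a top-k indicator with H = 8, k = 3.
w = [1, 1, 1, 0, 0, 0, 0, 0]
coeffs, alphas = decompose_weight(w)

# Reconstruct w[i] from the exponentials (exact up to float error).
recon = [sum(c * a ** i for c, a in zip(coeffs, alphas)).real
         for i in range(len(w))]
print([round(abs(x), 6) for x in recon])
```

Each term c[l] * alpha[l]**i is an exponential weight function, i.e. one PRFE instance, which is why the decomposition turns a PRFW evaluation into L PRFE evaluations.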
>>: So the exponentials working in this context -- is that the reason for making this complex?
Jian LI: Yes, that's the reason we need complex numbers. Yeah. So I'm going to show you
some experimental results here. Recall that the PRFE function has only one single
parameter.
So I'm going to show that if we change this parameter from 0 to 1, a single PRFE function can
actually approximate many other ranking functions.
Okay. So we still use the Kendall distance as the measurement. Here, each curve corresponds
to a particular ranking function, and the x-axis corresponds to the parameter alpha, the
single parameter; because of scale problems we actually plot a function of alpha, but
essentially we are changing alpha from 0 to 1.
So let's look at each curve. Each point here is the Kendall distance between two top-k answers:
one is obtained by the ranking function, and the other is obtained by the PRFE function with
that parameter.
We can see that each curve has this shape, a single-valley shape: first non-increasing, then
non-decreasing.
What that means is that at the bottom point, there is a parameter for which the PRFE function
can approximate that ranking function.
Okay. We can see that for each curve there's a lowest point, and this point is pretty low; the
Kendall distance is very small.
So that means there is a PRFE function with a particular parameter that closely approximates
that ranking function.
Okay. And this other figure shows how we use a linear combination of PRFE functions to
approximate a PRFW function. This is the number of terms; recall we decompose the weight
function into a linear --
>>: Sorry, just getting back to your earlier point. The setting of this parameter that I need
to approximate a different ranking function -- to what degree does that depend on the specific
dataset?
Jian LI: It actually depends on the dataset.
>>: Okay. So you can't sort of set that once and for all --
Jian LI: We cannot do that. But what we can do, for a particular application, is sample some
data from the original data and estimate the parameter from the sample.
And the second figure here shows the number of terms. Recall we decompose the weight function
into a linear combination of exponential functions; this is the number of terms. As the number
of terms increases, we see the Kendall distance becomes smaller and smaller.
Even when the number of terms is around 20 or 30, the Kendall distance is already pretty low,
below 0.1. This means that with about 30 times the running time of a single PRFE evaluation,
we can achieve pretty good performance.
And these two figures show some running times here. These are the previous algorithms, like
U-Rank and PT-k, and this is the PRFE function, which is pretty fast here.
And this is the approximation versus the exact algorithm: the exact algorithm takes a pretty
long time, while the approximation, the linear combination of PRFE functions, is much,
much faster.
Okay. So I'm going to show you some other results here. The algorithms I discussed only work
for discrete probability distributions.
For continuous distributions, we have to do something more. Actually, this is the generating
function we are going to use. I'm not going to tell you why we use this generating function,
but you can see we have this integral here, and in order to evaluate this integral we need some
numerical method.
And actually, using this generating function method, we have polynomial-time algorithms for
uniform distributions and piecewise polynomial distributions, i.e., distributions that can be
written as piecewise polynomials; since each piece can be represented as a polynomial, we can
get a polynomial-time algorithm.
For general, arbitrary distributions, we cannot hope for an exact algorithm, but instead we
have an approximation scheme for arbitrary distributions.
And we can show that our scheme converges much faster than, for example, Monte Carlo or
other straightforward methods. It converges much faster.
Okay. So let me return to the original topic, query processing over probabilistic databases.
So, again, recall that a single probabilistic table corresponds to a probability distribution
over possible worlds. If we do query processing on that, we have to have some systematic way
to combine those different answers coming from different possible worlds.
So here is one proposal. We call it the consensus answer.
What we do in this scenario is the following. Recall that we have a collection of possible
worlds, and each possible world may return a different answer.
We see each different answer as a point in some space, and we give this space a metric, some
distance function.
And this distance function between two points measures whether two answers are similar or
dissimilar.
So if the distance is small, then it means those two answers are similar, okay?
And we define the consensus answer to be a single point in the space, a single answer, which
minimizes the following: it has the minimum expected distance to the rest of the answers. This
is the distance function I just talked about. This answer minimizes the expected distance from
itself to the other possible answers.
Okay. So we call this the consensus answer. And this single point, the consensus answer, can
be thought of as the centroid, the center of mass, of this set of possible answers.
Okay?
So we use this consensus idea to give semantics for different queries in probabilistic
databases. And, in particular, if we apply this idea to the top-k answer, we can actually show
that PT-k, the probabilistic threshold top-k answer, the example I gave you at the beginning,
is exactly the consensus top-k answer when the distance function is the symmetric difference.
Given two top-k answers, the symmetric difference measures the number of tuples in exactly one
top-k answer but not both.
So therefore, if it's 0, it means the two answers are identical.
More generally, the PRFW function gives the consensus answer under the weighted symmetric
difference. The weighted symmetric difference generalizes the symmetric difference in that we
give more weight to higher positions.
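This claim can be checked by brute force on a tiny table: the k-subset minimizing the expected symmetric difference to the random top-k answer coincides with the PT-k answer. A sketch with hypothetical probabilities and scores:

```python
from itertools import combinations, product

def topk_of_world(world, k):
    """Top-k names of one world, given as a list of (score, name)."""
    return set(name for _, name in sorted(world, reverse=True)[:k])

def expected_sym_diff(tuples, answer, k):
    """E[ |answer (symmetric difference) top-k(world)| ] over all
    possible worlds of an independent-tuple table."""
    total = 0.0
    for mask in product([False, True], repeat=len(tuples)):
        w_prob = 1.0
        world = []
        for (name, p, s), present in zip(tuples, mask):
            w_prob *= p if present else 1 - p
            if present:
                world.append((s, name))
        total += w_prob * len(set(answer) ^ topk_of_world(world, k))
    return total

# Hypothetical table: (name, existence probability, score).
table = [("t1", 0.2, 100), ("t2", 0.5, 90), ("t3", 0.9, 80)]
k = 2
names = [n for n, _, _ in table]
best = min((frozenset(s) for s in combinations(names, k)),
           key=lambda s: expected_sym_diff(table, s, k))
print(sorted(best))   # {t2, t3}: the two tuples likeliest to be in the top-2
```

Here Pr[in top-2] is 0.2 for t1, 0.5 for t2, and 0.81 for t3, so the PT-2 answer {t2, t3} is exactly the minimizer.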
Okay. So actually, we have some other results for the Jaccard distance and a variant of the
Kendall tau distance, which measure the similarity between two top-k answers.
And we also have some other results for other queries, such as clustering. It's a pretty
general framework that we can apply to different kinds of queries.
And now let's come to the second part of my talk, where I want to discuss stochastic
optimization for combinatorial problems.
Okay. So one thing I want to note is that so far, most of the work in this domain has focused
on optimizing the expected value of some objective: for example, maximizing expected profit,
minimizing expected cost, or something similar.
But my argument is that sometimes the expected value is not a very good criterion,
especially in cases where we want to capture risk-averse and risk-prone
behavior.
So let me give you an example here. Suppose we have two actions. Action one: I give you
$100. Action two: I give you $200 with probability one-half, and I give you nothing
with probability one-half.
So both actions have the same expected value, right? It's easy to see it's 100. But a
risk-averse user usually prefers action one to action two, and a risk-prone user prefers
action two.
Why is that? Because a gambler would like to spend $100 in order to play double or nothing:
action two corresponds to double or nothing, and action one corresponds to keeping the $100,
which means a risk-prone player prefers action two. That means the expected value cannot
differentiate these two actions.
So this is a very simple example. A more extreme but more interesting example is this is a
paradox.
In this paradox, there is a game. You pay X dollars to enter the game, and in the game you repeatedly toss a fair coin until a tail appears, and then there is a payoff. The payoff is 2 to the K, where K is the number of heads. Because you always gain some money from this game, you have to pay to play -- otherwise you would always gain, right?
You have to pay something to play. So how much should you pay? If we use the expected value criterion, the fair price would be the expected payoff, right? You should play if the payment is less than the expected payoff. The expected payoff turns out to be 1 times one-half for no heads, plus 2 times one-quarter for one head, plus 4 times one-eighth for two heads, and so on -- each term contributes exactly one-half.
So if you sum over all these terms, you find it's infinity, which means that no matter how much you pay, you should play, because the expected payoff is infinite. But economists and psychologists have done extensive surveys and found that actually few people would pay even $25 for this game.
What does $25 mean? It means you would need to get at least four or five heads just to break even, which happens with pretty small probability. People are smart, and people don't behave according to the expected value.
So this means there's actually a huge gap between the expected value and people's behavior.
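To make the paradox concrete, here is a small sketch (my own illustration, not part of the talk) that checks the divergent sum and simulates the game defined above:

```python
import random

def st_petersburg_payoff(rng):
    """One play of the game: toss a fair coin until a tail appears;
    the payoff is 2**K, where K is the number of heads seen."""
    heads = 0
    while rng.random() < 0.5:  # heads with probability 1/2
        heads += 1
    return 2 ** heads

# Each term of the expected payoff is Pr[exactly K heads] * payoff
#   = (1/2)**(K+1) * 2**K = 1/2,
# so the sum over all K diverges to infinity.
terms = [0.5 ** (k + 1) * 2 ** k for k in range(20)]

rng = random.Random(0)
payoffs = [st_petersburg_payoff(rng) for _ in range(10_000)]
```

In a simulation like this, the sample mean is dominated by rare huge payoffs while the typical payoff stays small, which matches the observation that few people would pay much to play.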
So to address this, we propose to optimize another, more general function, which we call the utility function. The utility function maps profit to a utility value, and utility measures your satisfaction.
The expected utility maximization principle posits that the decision maker always maximizes the expected utility. A particular user has a particular preference, which implies a particular utility function.
So again, this can capture risk-averse and risk-prone behavior. Here are two examples. Risk-averse behavior corresponds to a concave utility function: its derivative is decreasing, so we have decreasing marginal utility, which is typical risk-averse behavior.
Let's look at the two actions from the earlier example. Action one is $100 for sure, so its expected utility is this point, u(100). Action two is $200 with probability 0.5 and $0 with probability 0.5, so its expected utility is the midpoint of u(0) and u(200). For a concave utility function, u(100) lies above that midpoint, so the expected utility of action one is larger than the expected utility of action two -- which, again, is consistent with risk-averse behavior.
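As a quick sanity check (my own sketch, using the square root as an example concave utility -- the talk does not fix a specific function):

```python
import math

def expected_utility(outcomes, utility):
    """Expected utility of a lottery given as (value, probability) pairs."""
    return sum(p * utility(v) for v, p in outcomes)

u = math.sqrt  # concave: decreasing marginal utility, i.e. risk-averse

action1 = [(100, 1.0)]              # $100 for sure
action2 = [(200, 0.5), (0, 0.5)]    # double or nothing

same_ev = sum(p * v for v, p in action1) == sum(p * v for v, p in action2)
prefers_sure = expected_utility(action1, u) > expected_utility(action2, u)
```

Both actions have expected value 100, but the concave utility strictly prefers the sure $100 -- this is just Jensen's inequality.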
And risk-prone behavior is exactly the opposite, corresponding to a convex utility function. And actually, if we design the utility function carefully, we can resolve the St. Petersburg paradox.
Another nice thing about the expected utility maximization principle is that von Neumann and Morgenstern gave an axiomatization of it. What they did is give a few basic axioms about user preferences -- very simple statements -- and from them they deduced this principle rigorously, mathematically.
So now let's apply this expected utility maximization principle to combinatorial problems.
The combinatorial problem we're going to discuss is a fairly general model. Suppose we have a set of elements, and each element has a weight -- this is the deterministic setting. A solution is a subset of those elements satisfying some property. If we want to find a path, the chosen edges should form a path; if we want a spanning tree, the chosen elements should form a spanning tree -- just some feasible subset of the elements.
And the goal is to minimize the total weight in the deterministic setting. In the stochastic setting, where we have probability, each weight is going to be a positive random variable -- it's not a deterministic value anymore.
And again we have a utility function, which is also given.
Because the problem is to minimize the objective, we assume the utility function is decreasing: if the weight is very large, you get low utility. We also assume the utility function goes to 0 as the value goes to infinity. And we want to maximize the expected utility.
Okay. So this is the general theorem we got. Suppose the following two conditions hold. The first condition is that there's a pseudopolynomial-time algorithm for the exact version of the deterministic problem. Let me explain. The exact version means we're not trying to optimize any objective; we just want to find a solution of exactly weight K, where K is a given value -- a solution of exact weight, not optimizing anything.
And pseudopolynomial time means the algorithm runs in time polynomial in the value K, not just in the size of the input.
The other condition is that the utility function satisfies a smoothness condition -- you don't have to read it in detail; just remember that it means the function is continuous and fairly smooth.
Then we have a polynomial-time algorithm that optimizes the expected utility up to an additive error epsilon, for an arbitrarily small epsilon.
This is our result. Actually, many deterministic problems satisfy condition one: shortest path, minimum spanning tree, matching, and so on and so forth, and our theorem applies to all of them.
And our theorem actually generalizes many previous results. One such problem is the stochastic shortest path problem. We have a network, and each edge has a weight which is uncertain -- a random variable. We want to find a path in this network that maximizes the probability that the total weight of the path is less than some threshold.
These are some previous results: they got approximations for some particular distributions, for example when the weights are normally distributed or exponentially distributed. Our result gives an approximation in general, for any distribution -- we don't care which distribution.
One thing to note is that optimizing this probability can be captured by optimizing a utility function: we just take a step utility function, and the expected utility under this step function is exactly this probability.
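In other words, with the step utility u(w) = 1 for w <= t and 0 otherwise, E[u(W)] equals Pr[W <= t] by definition. A tiny sketch (my own, with a made-up weight distribution chosen only for illustration):

```python
import random

def step_utility(threshold):
    """u(w) = 1 if w <= threshold else 0, so E[u(W)] = Pr[W <= threshold]."""
    return lambda w: 1.0 if w <= threshold else 0.0

rng = random.Random(42)
u = step_utility(10.0)
# Hypothetical random path weights (exponential, purely illustrative).
weights = [rng.expovariate(0.2) for _ in range(50_000)]

mean_utility = sum(u(w) for w in weights) / len(weights)
fraction_under = sum(w <= 10.0 for w in weights) / len(weights)
```

The two quantities coincide exactly, sample by sample -- maximizing expected step utility is the same as maximizing the threshold probability.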
So I'm not going into detail here. Another application of the theorem is the stochastic knapsack problem. In this problem we have a set of elements that we want to put into a knapsack with a particular capacity, and each element has a profit and a size. The size is uncertain. We want to make sure that the probability that the chosen elements fit into the knapsack is large enough. These are some previous results: a logarithmic approximation, or results for some particular distributions.
Our result gives an approximation up to an arbitrarily small error for any distribution. And our algorithm is actually quite simple -- I'm not going into the details, but both the algorithm and the analysis are quite simple.
So I'm not going into this. For the last part, what I want to discuss is a problem related to the kidney exchange program that I mentioned in the beginning. This is the stochastic matching problem, which also has an interesting application in online dating systems.
So I'm going to discuss that. The problem formulation is this: we are given a probabilistic graph, where the existence of each edge is not certain -- each edge exists with a certain probability.
And for each vertex we have a patience level, which is an integer: one, two, three, four, five, something like that. And what we want is to find a large matching.
What we can do is probe the edges. When we probe an edge, we learn whether it exists or not: it exists with probability P and doesn't exist with probability 1 - P. If the edge exists, we have to take it into our matching.
If the edge doesn't exist, we decrease the patience level of both endpoint vertices by one. The objective is to find a strategy for probing the edges that finds a large matching.
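To make the probing process concrete, here is a small simulation of one simple greedy strategy (my own illustration -- this is not the approximation algorithm from the paper, and the instance is made up):

```python
import random

def greedy_probe(edges, patience, rng):
    """Probe edges in decreasing order of expected weight p * w.
    `edges` is a list of (u, v, p, w): endpoints, existence probability,
    edge weight; `patience` maps each vertex to its patience level.
    Returns the total weight of the matching found."""
    patience = dict(patience)
    matched, total = set(), 0.0
    for u, v, p, w in sorted(edges, key=lambda e: -e[2] * e[3]):
        if u in matched or v in matched:
            continue  # an endpoint is already matched
        if patience[u] <= 0 or patience[v] <= 0:
            continue  # an endpoint has run out of patience
        if rng.random() < p:   # the probed edge exists: we must take it
            matched |= {u, v}
            total += w
        else:                  # failed probe: both endpoints lose patience
            patience[u] -= 1
            patience[v] -= 1
    return total

edges = [(0, 1, 0.9, 2.0), (1, 2, 0.5, 3.0), (2, 3, 0.8, 1.0)]
weight = greedy_probe(edges, {0: 2, 1: 2, 2: 2, 3: 2}, random.Random(1))
```

The interesting question, of course, is finding a probing strategy with provable guarantees rather than this naive greedy order.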
Okay. So this --
>>: A large matching is what?
Jian LI: The total weight of the matching. In general, we can have a weight for each edge.
>>: The sum of the weights of the edges that are in the matching.
Jian LI: Yeah, the edges in the matching. So this problem was proposed previously by Chen et al., who solved the unweighted version, in which all the edges have weight one -- so the objective is the cardinality of the matching.
In this paper, we give a good approximation for the weighted version, which generalizes that and resolves an open question posed in the previous work.
Let's not discuss the result yet; let's first motivate the problem -- why are we interested in it at all? There's a very interesting application: online dating.
We have a bunch of people logged into an online dating system. We think of each person as a vertex, and each edge as a potential match between two persons.
And we have this existence probability -- the existence probability of an edge can be estimated from the users' profiles.
A user logs into the online dating system and inputs some personal information, and the system can use this personal information to estimate how likely the two persons are to be a true match, as a probability.
So we use the existence probability to capture this. And what is probing an edge? We probe an edge by sending the two persons on an actual date.
The date can result in two outcomes. One is that the edge exists, which means these two persons are matched. The other outcome is that the edge doesn't exist -- the date didn't end up well -- and then the patience level of each endpoint vertex decreases.
This models the fact that if a particular person experiences too many failed dates, he or she will eventually quit the online dating system.
And the goal of the dating system is to find a probing strategy that maximizes the weight of the matching.
This actually has some other applications -- kidney exchange and the online adwords problem. In both applications we want to find a large matching in an uncertain scenario.
So this is pretty much all I want to say about that. I guess I'm now going to talk about some future work.
The first direction concerns expected utility maximization. I discussed a very general theorem, but it's still not quite general enough, because condition one requires that the deterministic version of the problem have a pseudopolynomial-time algorithm. For strongly NP-hard problems, what can we do? For example, Steiner tree and the McGregor function [phonetic] -- what do we do there? This is future work.
Another direction: psychologists, and especially economists, later found that some people's behavior deviates even from expected utility theory. Sometimes people can be both risk-averse and risk-prone, and to capture these -- I mean, weird -- preferences, you need a very general theory. Kahneman and Tversky proposed prospect theory, for which the Nobel Prize was awarded, and this is a more general framework.
This may have some implications for our particular computational problems -- I don't know; it's future work. Another direction is ranking in more complicated scenarios. For example, the PRF function I just discussed can address one particular user's preference.
But if the system has many users, they have diverse preferences. There's a very classic example here: "jaguar" means two different things to different persons.
So different persons have different preferences, which we would like to address simultaneously. This leads to the idea of diversification of the ranking results: the top results should be diversified.
There are some models for this, and I'm trying to see if there's anything we can do to incorporate uncertainty into those models -- there are some theoretical models here. So I guess this is all I want to say. Thanks. [applause]
>> Tao Cheng: We have time for questions.
>>: You mentioned earlier on that you can learn the ranking from examples. Can you explain?
Jian LI: Actually, because our ranking function is a linear function of the weights, we want to learn the weights, which are the parameters. There are some methods -- SVM-rank can do this task, and there are some other learning methods that can learn the linear weights.
>> Tao Cheng: Okay. I guess if we don't have questions, let's thank the speaker again.
[applause]