CISC453 Winter 2010
AIMA3e Chapter 13: Quantifying Uncertainty

OUTLINE
overview
1. rationale for a new representational language: what logical representations can't do
2. utilities & decision theory
3. possible worlds & propositions
4. unconditional & conditional probabilities
5. random variables
6. probability distributions
7. using the Joint Probability Distribution for inference: by enumeration, for unconditional & conditional probabilities
8. reducing complexity: independence
9. Bayes' Rule: from causal probability to diagnostic probability; conditional independence
10. pits & probability in Wumpus World
11. summary

Quantifying Uncertainty
consider our approach so far
we've handled limited observability &/or non-determinism using belief states that capture all possible world states
but the representation can become large, as can the corresponding contingent plans, and it's possible that no plan can be guaranteed to reach the goal, yet the agent must act
agents should behave rationally
this rationality depends both on the importance of goals and on the chances of, & degree to which, they'll be reached

A Visit to the Dentist
we'll use medical/dental diagnosis examples extensively
our new prototype problem relates to whether a dental patient has a cavity or not
the process of diagnosis always involves uncertainty & this leads to difficulty with logical representations (propositional logic examples)
(1) toothache ⇒ cavity
(2) toothache ⇒ cavity ∨ gumDisease ∨ ...
(3) cavity ⇒ toothache
(1) is just wrong since other things cause toothaches
(2) will need to list all possible causes
(3) tries a causal rule, but it's not always the case that cavities cause toothaches, & fixing the rule requires making it logically exhaustive

Representations for Diagnosis
logic is not sufficient for medical diagnosis, due to
our Laziness: it's too hard to list all possible antecedents or consequents to make the rule have no exceptions
our Theoretical Ignorance: generally, there is no complete theory of the domain, no complete model
our Practical Ignorance: even if the rules were complete, in any particular case it's impractical or impossible to do all the necessary tests, to have all relevant evidence
the example relationship between toothache & cavities is not a logical consequence in either direction
instead, knowledge of the domain provides a degree of belief in diagnostic sentences, & the way to represent this is with probability theory
next slide: recall our discussion of ontological & epistemological commitments from 352

Epistemological Commitment
ontological commitment: what a representational language assumes about the nature of reality
logic & probability theory agree in this, that facts do or do not hold
epistemological commitment: the possible states of knowledge
for logic, sentences are true/false/unknown
for probability theory, there's a numerical degree of belief in sentences, between 0 (certainly false) and 1 (certainly true)

The Qualification Problem
for a logical representation, the success of a plan can't be inferred because of all the conditions that could interfere but can't be deduced not to happen (this is the qualification problem)
probability is a way of dealing with the qualification problem by numerically summarizing the uncertainty that derives from laziness &/or ignorance
returning to the toothache & cavity problem
in the real world, the patient
either does or does not have a cavity
a probabilistic agent makes statements with respect to the knowledge state, & these may change as the state of knowledge changes
for example, an agent initially may believe there's an 80% chance (probability 0.8) that the patient with the toothache has a cavity, but subsequently revises that as additional evidence becomes available

Rational Decisions
making choices among plans/actions when the probabilities of their success differ
this requires additional knowledge of preferences among outcomes
this is the domain of utility theory: every state has a degree of utility/usefulness to the agent, & the agent will prefer those with higher utility
utilities are specific to an agent, to the extent that they can even encompass perverse or altruistic preferences

Rational Decisions
making choices among plans/actions when the probabilities of their success differ
we can combine preferences (utilities) + probabilities to get a general theory of rational decisions: Decision Theory
a rational agent chooses the action that yields the highest expected utility, averaged over all possible outcomes of the action
this is the Maximum Expected Utility (MEU) principle
expected = average of the possible outcomes of an action, weighted by their probabilities
choice of action = the one with highest expected utility

Revising Belief States
belief states: in addition to the possible world states that we included before, belief states now include probabilities
the agent incorporates probabilistic predictions of action outcomes, selecting the one with the highest expected utility
AIMA3e chapters 13 through 17 address various aspects of using probabilistic representations
an algorithmic description of the Decision Theoretic Agent
function DT-AGENT(percept) returns an action
  persistent: belief-state, probabilistic beliefs about the current state of the world
              action, the agent's action
  update belief-state based on action
    and percept
  calculate outcome probabilities for actions, given action descriptions and current belief-state
  select action with the highest expected utility, given probabilities of outcomes and utility information
  return action

Notation & Basics
we should interpret probabilities as describing possible worlds and their likelihoods
the sample space is the set of all possible worlds
note that possible worlds are mutually exclusive & exhaustive
for example, a roll of a pair of dice has 36 possible worlds
we use the Greek letter omega to refer to possible worlds: Ω refers to the sample space, ω to its elements (particular possible worlds)
a basic axiom for probability theory
(13.1) 0 ≤ P(ω) ≤ 1 for every ω, and Σ_{ω ∈ Ω} P(ω) = 1
as an example, for the dice rolls, each possible world is a pair (1, 1), (1, 2), ..., (6, 6), each with a probability of 1/36, all summing to 1

Notation & Basics
assertions & queries in probabilistic reasoning
these are usually about sets of possible worlds
these are termed events in probability theory
for AI, the sets of possible worlds are described by propositions in a formal language
the set of possible worlds corresponding to a proposition contains those in which the proposition holds
the probability of the proposition is the sum over those possible worlds

Propositions
another axiom of probability theory, using the Greek letter phi (φ) for a proposition
(13.2) for any proposition φ, P(φ) = Σ_{ω ∈ φ} P(ω)
so for a fair pair of dice
P(total = 7) = P((1,6)) + P((6,1)) + P((2,5)) + P((5,2)) + P((3,4)) + P((4,3)) = 1/36 + 1/36 + 1/36 + 1/36 + 1/36 + 1/36 = 1/6
asserting the probability of a proposition constrains the underlying probability model without fully determining it

Propositions
propositions: unconditional & conditional probabilities
P(total = 7) from the previous slide & similar probabilities are called unconditional or prior probabilities, sometimes abbreviated as priors
they indicate the degree of belief in propositions without any other information, though in
most cases, we do have other information, or evidence
when we have evidence, the probabilities are conditional or posterior, given the evidence

Conditional Probabilities
calculating conditional probabilities in terms of unconditional probabilities, for propositions a & b
notation & formula: P(a | b) = P(a ∧ b) / P(b), which holds whenever P(b) > 0
intuitively, observing b excludes the possible worlds where b is false, leaving a set of worlds with total probability P(b); within that set, the worlds where a is true satisfy a ∧ b and make up the fraction P(a ∧ b) / P(b)
an alternative formulation of the conditional rule is:
(13.3) P(a ∧ b) = P(a | b) P(b)
this is the product rule form

Random Variables
more terminology & notation
for chapters 13 & 14, propositions for sets of possible worlds use notation that combines aspects of propositional logic & constraint satisfaction: a factored representation in which a possible world is represented as a set of variable + value pairs
for example: Weather = sunny
variables in probability are called random variables
as a convention, their names begin with an uppercase letter, & each has a domain of all its possible values
for the Weather example, say {sunny, rain, cloudy, snow}

Random Variables & Values
propositions in our probability notation
by convention, the values for random variables use lower case letters, for example Weather = rain
each random variable has a domain, its set of possible values
for a Boolean random variable the domain is {true, false}
also by convention, A = true is written as simply a, A = false as ¬a
domains also may be arbitrary sets of tokens, like the {red, green, blue} of the map coloring CSP or {juvenile, teen, adult} for Age
when it's unambiguous, a value by itself may represent the proposition that a variable has that value
for example, using just sunny for Weather = sunny

Random Variables & Values
propositions in our probability notation
more background & notation
domains may be infinite (like the integers)
domains may be continuous (something like
temperature)
for ordered domains inequality notation is allowed: val1 < val2
finally, we use propositional logic connectives to combine elementary propositions: P(cavity | ¬toothache ∧ teen) = 0.1

Distribution Notation
bold is used as a notational coding for the probabilities of all possible values of a random variable
we may list the propositions or we may abbreviate, given an ordering on the domain
as in the ordering (sunny, rain, cloudy, snow) for Weather
then P(Weather) = <0.6, 0.1, 0.29, 0.01>, where the bold P indicates there's a vector of values
this defines a probability distribution for the random variable Weather
we can use a similar shorthand for conditional distributions, for example: P(X | Y) lists the values for P(X = x_i | Y = y_j) for all i, j pairs

Continuous Variables
distributions are the probabilities of all possible values of a random variable
there's alternative notation for continuous variables, where there cannot be an explicit list: instead, express the distribution as a parameterized function of the value
for example, P(NoonTemp = x) = Uniform[18C, 26C](x) specifies a probability density function (pdf) that assigns a density to each value x of NoonTemp; integrating the density over an interval gives the probability of that interval
AIMA3e uses the same notation for discrete distributions & density functions, P, since confusion about what is intended is unlikely
note that while probabilities are unitless, density functions are measured with a unit, reciprocal degrees in the temperature example above

Distribution Notation
for distributions on multiple variables we use commas between the variables: so P(Weather, Cavity) denotes the probabilities of all combinations of values of the 2 variables
for discrete random variables we can use a tabular representation, in this case yielding a 4 × 2 table of probabilities
this gives the joint probability distribution of Weather & Cavity, which tabulates the probabilities for all combinations

Distribution Notation
for distributions on multiple
variables
the notation also allows mixing variables & values
P(sunny, Cavity) is just a 2-vector of probabilities
the distribution notation, P, allows compact expressions
for example, here are the product rules for all possible combinations of Weather & Cavity
P(Weather, Cavity) = P(Weather | Cavity) P(Cavity)
the distribution notation summarizes what otherwise would be 8 separate equations, each of the form
P(W = sunny ∧ C = true) = P(W = sunny | C = true) P(C = true)

Full Joint Distribution
now we fill in some details of the semantics of the probability of a proposition as the sum of probabilities for the possible worlds in which it holds
possible worlds are analogous to those in propositional logic
each possible world is specified by an assignment of values to all of the random variables under consideration
for the random variables Cavity, Toothache & Weather there are 16 possible worlds (2 × 2 × 4)
& the value of a given proposition is determined in the same recursive fashion as for formulas in propositional logic

Full Joint Distribution
semantics of a proposition
the probability model is determined by the joint distribution for all the random variables: the full joint probability distribution
for the Cavity, Toothache, Weather domain, the notation is: P(Cavity, Toothache, Weather)
this can be represented as a 2 × 2 × 4 table
given the definition of the probability of a proposition as a sum over possible worlds, the full joint distribution allows calculating the probability of any proposition over its variables by summing entries in the FJD

Probability Axioms
we can derive some additional relationships for degrees of belief among logically related propositions, from axioms 13.1 & 13.2 & some algebraic manipulation
for example, P(¬a) = 1 - P(a), the relationship between the probability of a proposition & its negation
and also axiom (13.4) P(a ∨ b) = P(a) + P(b) - P(a ∧ b)
this axiom for the probability of a disjunction is referred to as the inclusion-exclusion
principle
(13.1) 0 ≤ P(ω) ≤ 1 for every ω, and Σ_{ω ∈ Ω} P(ω) = 1
(13.4) P(a ∨ b) = P(a) + P(b) - P(a ∧ b)
together, 13.1 & 13.4 are referred to as Kolmogorov's axioms, after the Russian mathematician Andrei Kolmogorov, who showed how to build up the rest of probability theory from them, including issues related to handling continuous variables

Is Probability the Answer?
historically there's been a debate over whether probabilities are the only viable mechanism for describing degrees of belief
the degree of belief in a proposition can be reformulated as betting odds for establishing amounts of wagers on outcomes of events
de Finetti (1931, 1993) proved that if an agent's set of degrees of belief is inconsistent with the probability axioms, then when formulated as bets on outcomes of events, there is a combination of bets by an opposing agent that will cause the agent to lose money every time

Rationality & Probability Axioms
apparently, then, no rational agent will have beliefs that violate the axioms of probability
a common rebuttal to this argument is that betting is a poor metaphor & the agent could just refuse to bet
which itself is countered by pointing out that betting is just a model for the decision-making that goes on, inevitably, all the time
other authors have constructed similar arguments to support those of de Finetti
furthermore, in the "real world", AI reasoning systems based on probability have been highly successful

Don't Mess with the Probability Axioms
from Table 13.2: evidence for the rationality of probability
        Agent 1        |       Agent 2        | Outcomes & Payoffs to Agent 1
Proposition   Belief   |  Bet        Stakes   |  a, b   a, ¬b   ¬a, b   ¬a, ¬b
a             0.4      |  a          4 to 6   |   -6     -6       4       4
b             0.3      |  b          3 to 7   |   -7      3      -7       3
a ∨ b         0.8      |  ¬(a ∨ b)   2 to 8   |    2      2       2      -8
                                      total   |  -11     -1      -1      -1
Agent 1's inconsistent beliefs allow Agent 2 to set up bets to guarantee Agent 1 loses, independent of the outcome of a and b
so, for example, Agent 1's degree of belief in a is 0.4, so Agent 1 will bet "against" it & pay 6 to Agent 2 if a is the outcome, receive 4 from Agent 2 if it is
not, and so on

Inference With Probability
using the full joint distributions for inference
here's the FJD for the Toothache, Cavity, Catch domain of 3 Boolean variables
as required by the axioms, the probabilities sum to 1.0
when available, the FJD gives a direct means of calculating the probability of any proposition
just sum the probabilities for all the possible worlds in which the proposition is true

Full Joint Distribution & Inference
an example of using the FJD for inference
to calculate: P(cavity ∨ toothache)
cavity ∨ toothache holds for 6 possible worlds
the corresponding sum is: 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
so P(cavity ∨ toothache) = 0.28

Full Joint Distribution & Inference
using the full joint distributions for inference
a common task is to state the distribution over a single variable or a subset of variables: sum over the other variables to get the unconditional or marginal probability
for example, P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
the terminology for this is "marginalization" or "summing out": it takes other variables out of the equation
for sets of variables Y and Z: P(Y) = Σ_{z ∈ Z} P(Y, z)
Σ_{z ∈ Z} means to sum over all the possible combinations of values of the set of variables Z

Full Joint Distribution & Inference
using the full joint distributions for inference
a variant considers conditional probabilities instead of joint probabilities
it uses the product rule & is referred to as conditioning: P(Y) = Σ_{z} P(Y | z) P(z)
the common scenario is to want conditional probabilities of some variable given evidence about others
use the Product Rule (13.3) in the form P(a | b) = P(a ∧ b) / P(b) to get an expression in terms of unconditional probabilities, then sum appropriately in the FJD
for example: the probability of a cavity, given evidence of a toothache
P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache) = (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6
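
The three calculations above (a disjunction, a marginal, a conditional) can be sketched in a few lines of Python. This is only an illustrative sketch, not the course's or AIMA's code: the names FJD, prob and conditional are made up here, and the two ¬cavity ∧ ¬toothache entries (0.144 and 0.576) are assumed from AIMA3e Figure 13.3, since the slide text never uses them.

```python
# Full joint distribution over (cavity, toothache, catch), all Boolean.
# Six entries appear in the slide calculations; the two marked entries are
# ASSUMED from AIMA3e Figure 13.3 because the slide text never uses them.
FJD = {
    (True,  True,  True):  0.108,
    (True,  True,  False): 0.012,
    (True,  False, True):  0.072,
    (True,  False, False): 0.008,
    (False, True,  True):  0.016,
    (False, True,  False): 0.064,
    (False, False, True):  0.144,  # assumed (AIMA3e Fig. 13.3)
    (False, False, False): 0.576,  # assumed (AIMA3e Fig. 13.3)
}


def prob(holds):
    """P(proposition): sum over the worlds in which the proposition holds."""
    return sum(p for world, p in FJD.items() if holds(*world))


def conditional(a, b):
    """P(a | b) = P(a and b) / P(b), assuming P(b) > 0."""
    return prob(lambda *w: a(*w) and b(*w)) / prob(b)


# Propositions are written as predicates over one possible world.
def cavity(cav, tooth, catch):
    return cav


def toothache(cav, tooth, catch):
    return tooth


p_or = prob(lambda *w: cavity(*w) or toothache(*w))   # P(cavity or toothache)
p_cavity = prob(cavity)                               # marginal P(cavity)
p_cav_given_tooth = conditional(cavity, toothache)    # P(cavity | toothache)

print(round(p_or, 3), round(p_cavity, 3), round(p_cav_given_tooth, 3))
# prints: 0.28 0.2 0.6
```

Representing a proposition as a predicate over a single world mirrors the definition used in these slides: the probability of a proposition is the sum over the possible worlds in which it holds.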
Full Joint Distribution & Inference
as a check we might compute the probability of no cavity, given a toothache
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache) = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4
as they should, the probabilities sum to 1.0
we note that P(toothache) is the denominator for both, & as part of the calculation of both values for Cavity, can be viewed as a normalization constant for the distribution P(Cavity | toothache)
both terms have P(toothache) as denominator, ensuring they sum to 1

Normalization Constant
note that P(toothache) was the denominator for calculating both conditional probabilities
it functions as a normalization constant for the distribution P(Cavity | toothache), ensuring the probabilities add to 1
in AIMA, this constant is denoted by α, and we use it to mean a normalizing constant wherever probabilities must add to 1
since the sum for the distribution must be 1, we can just sum the raw values obtained and then use 1/sum for α
this may make calculations simpler, and might even allow them when some probability assessment is not available

Normalization Constant
an example of using the normalization constant
P(Cavity | toothache) = P(Cavity, toothache) / P(toothache)
= α P(Cavity, toothache)
= α [P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
= α [<0.108, 0.016> + <0.012, 0.064>]
= α <0.12, 0.08>
= <0.6, 0.4>
since the probabilities must add to 1.0, the calculation can be done without knowing α, just normalizing at the end

Generalization of Inference
given a query, the generalized version of the process for a conditional probability distribution is:
for a single variable X (Cavity in the preceding example), let E be the list of evidence variables (just Toothache in the example) and e the list of observed values for them, and Y the unobserved variables (Catch in the example)
the query: P(X | e) is calculated by summing out over
the unobserved variables
(13.9) P(X | e) = α P(X, e) = α Σ_{y} P(X, e, y)

Inference for Probability
given the full joint distribution & 13.9 we can answer all probability queries for discrete variables
are we left with any unresolved issues?
well, given n variables, and d as an upper bound on the number of values of each, the full joint distribution table size & the corresponding processing of it are O(d^n), exponential in n
since n might be 100 or more for real problems, this is often simply not practical
as a result, the FJD is not the implementation of choice for real systems, but functions more as the theoretical reference point (analogous to the role of truth tables for propositional logic)
the next sections we look at are foundational for developing practical systems

Efficiency Through Independence
consider a new version of our example domain, now defined in terms of 4 random variables: Toothache, Catch, Cavity, Weather
so P(Toothache, Catch, Cavity, Weather) has a FJD with 2 × 2 × 2 × 4 = 32 entries
one way to display it would be as four 2 × 2 × 2 tables, 1 for each value of Weather
how are they related?
for example: P(toothache, catch, cavity, cloudy) & P(toothache, catch, cavity)

Efficiency Through Independence
in the 4-variable domain, what is the relationship between P(toothache, catch, cavity, cloudy) & P(toothache, catch, cavity)?
given what we know about relating probabilities (the product rule)
P(toothache, catch, cavity, cloudy) = P(cloudy | toothache, catch, cavity) P(toothache, catch, cavity)
but we "know" that dental problems don't influence the weather, & weather doesn't seem to influence dental variables
so P(cloudy | toothache, catch, cavity) = P(cloudy)
P(toothache, catch, cavity, cloudy) = P(cloudy) P(toothache, catch, cavity)
& similarly for each entry in P(Toothache, Catch, Cavity, Weather)
thus the 32 element table for 4 variables reduces to an 8 element table & a 4 element table

Independence
the property of independence (also called marginal independence or absolute independence), in terms of propositions or random variables, is:
P(a | b) = P(a) or P(b | a) = P(b) or P(a ∧ b) = P(a) P(b)
P(A | B) = P(A) or P(B | A) = P(B) or P(A, B) = P(A) P(B)
from our knowledge of the domain, we can simplify the full joint distribution, dividing variables into independent subsets with separate distributions
as an example, for the Dentistry-Weather domain

Independence
absolute independence
while very powerful for simplifying probability representation & inference, absolute independence is unfortunately rare
though, for example, for n independent coin tosses P(C_1, ..., C_n), the full joint distribution with 2^n entries becomes n single variable distributions P(C_i)
but this is an artificial example, & the converse is more likely the case for real domains
that is, within a large domain like dentistry there are likely dozens of diseases & hundreds of symptoms, all interrelated

Bayes' Rule
from the Product Rule, for propositions a & b
P(a ∧ b) = P(a | b) P(b), or
alternatively = P(b | a) P(a)
we can derive Bayes' Rule for conditional probabilities
equate the alternative RHSs & divide by P(b) to yield Bayes' rule
P(a | b) = P(b | a) P(a) / P(b)
in the general case of multivalued variables, in distribution form
P(Y | X) = P(X | Y) P(Y) / P(X)
representing the set of equations, each for specific values of the variables
& finally, a version indicating conditionalizing on background evidence e
P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e)

Bayes' Rule
Bayes' rule is the basis of most AI systems of probabilistic inference
we are often able to estimate the 3 RHS probabilities & so compute the LHS
finding diagnostic probability from causal probability
P(effect | cause) specifies the relationship in the causal direction
P(cause | effect) describes the diagnostic direction
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
in the medical domain, it is common to have conditional probabilities on causal relationships: P(symptoms | disease)

Bayes' Rule
Bayes' rule: a medical example
P(cause | effect) = P(effect | cause) P(cause) / P(effect)
here's a medical domain example
a patient presents with a stiff neck, a known symptom of the disease meningitis
the physician "knows" the prior probabilities of stiff neck (P(s) = 0.01) & meningitis (P(m) = 0.00002)
in addition the physician knows that 70% of patients with meningitis have a stiff neck: P(s | m) = 0.7
P(m | s) = P(s | m) P(m) / P(s) = 0.7 × 0.00002 / 0.01 = 0.0014

Bayes' Rule Example
Bayes' rule & the meningitis example
P(m | s) = P(s | m) P(m) / P(s) = 0.7 × 0.00002 / 0.01 = 0.0014
so, we should expect only about 1 in 700 patients with a stiff neck to have meningitis (0.0014 ≈ 1/714), reflecting the much higher prior probability of stiff neck than of meningitis
note: normalization can be applied when using Bayes' Rule
P(Y | X) = α P(X | Y) P(Y)
where α is a normalization constant chosen so that the entries in P(Y | X) sum to 1

Bayes' Rule: n Evidence
Variables
Bayes' rule & the dental diagnosis: scaling up
for the combining of evidence from multiple sources/variables, how does use of Bayes' Rule scale up, compared to using the FJD?
the sample problem: what does the dentist conclude about a cavity when the patient has a toothache & the probe catches in the sore tooth?
(13.16) P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)
there's not an issue with just 2 sources, but if there are n, then we have 2^n possible combinations of observed values & we need to know the conditional probabilities for each (no better than needing the full joint distribution)

Bayes' Rule: n Evidence Variables
Bayes' rule & the dental diagnosis: scaling up
we return to the idea of independence
in the example, Toothache & Catch are not absolutely independent, but are independent given either the presence or absence of a cavity (each is caused by the cavity but otherwise they are independent)
expressing the conditional independence given Cavity we get
(13.17) P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity)
(13.16) P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity)
substituting 13.17 into 13.16 yields the following, reflecting the conditional independence of Toothache and Catch
P(Cavity | toothache ∧ catch) = α P(toothache | Cavity) P(catch | Cavity) P(Cavity)

Conditional Independence
the general form of the conditional independence rule
here are the most general form & the form for the dental diagnosis domain
P(X, Y | Z) = P(X | Z) P(Y | Z)
(13.19) P(Toothache, Catch | Cavity) = P(Toothache | Cavity) P(Catch | Cavity)
conditional independence also allows decomposition
for the dental problem, algebraically, given 13.19, we have
P(Toothache, Catch, Cavity)
= P(Toothache, Catch | Cavity) P(Cavity) (product rule)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity) (13.19)

Conditional Independence
implications of the conditional independence rule
P(Toothache, Catch, Cavity)
= P(Toothache, Catch | Cavity) P(Cavity) (product rule)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity) (13.19)
we decompose the original large table, which has 2^3 - 1 = 7 independent entries, into 3 smaller tables
2 of the tables are of the form P(T | C), with 2 rows, each of which must sum to 1 & so has 1 independent number (4 independent numbers across the 2 tables)
1 table with 1 row for the prior distribution P(C), so having 1 more independent number
for our Toothache, Catch, Cavity domain, we've gone from 7 to 5 independent values in total, a small gain for a small problem
but if there were n symptoms, all conditionally independent given Cavity, the size of the resulting representation would be linear in n instead of exponential

Conditional Independence
summary: conditional independence
allows scaling up to real problems, since the representational complexity can go from exponential to linear
is more often applicable than absolute independence assertions
yields this net gain: the decomposition of large domains into weakly connected subsets
is illustrated in a prototypical way by the dental domain: one cause influences multiple effects, which are conditionally independent, given that cause

Conditional Independence
summary: conditional independence
with multiple effects, which are conditionally independent, given the cause, the full joint distribution then is rewritten as
P(Cause, Effect_1, ..., Effect_n) = P(Cause) Π_{i} P(Effect_i | Cause)
this is called the naïve Bayes model
it makes the simplifying assumption that all effects are conditionally independent, given the cause
it is naïve in that it is applied to many problems although the effect variables are not precisely conditionally independent given the cause variable
nevertheless, such systems often work well in practice

A Return to Wumpus World
recall the Wumpus World agent
the agent explores the grid world to grab the gold while attempting to avoid being eaten by the Wumpus or falling into a bottomless Pit
we used
propositional logic for representation & inference
now we'll explore an example that uses probability in Wumpus World
we'll simplify by restricting our WW hazards only to Pits
recall that
1. the percept of a breeze in a square indicates a pit in a neighbouring square
2. the logical representation allowed some conclusions about whether a square was safe, but not a quantitative measure of risk if not absolutely safe
the "is it safe" problem can be reformulated to use our new probability tools

Wumpus World Revisited
the world: incomplete information about the presence of Pits leads to uncertainty, & the agent should choose the best next move
here are the Random Variables in the problem
one per square, P_{i,j} = true iff [i,j] contains a pit
one per observed square, B_{i,j} = true iff [i,j] is breezy
the agent has visited only [1,1], [1,2], [2,1], so we include only B_{1,1}, B_{1,2}, B_{2,1} in the probability model

Probabilities in Wumpus World
we begin with the full joint distribution P(P_{1,1}, ..., P_{4,4}, B_{1,1}, B_{1,2}, B_{2,1})
applying the product rule yields P(B_{1,1}, B_{1,2}, B_{2,1} | P_{1,1}, ..., P_{4,4}) P(P_{1,1}, ..., P_{4,4})
1st term: the conditional probability of a breeze configuration given a pit configuration (P(Effect | Cause))
its value is 1 if each observed square is breezy exactly when it is adjacent to a pit, 0 otherwise
2nd term: the prior probability of a pit configuration
pits are placed randomly, independent of each other, with probability 0.2 for any square, so
(13.20) P(P_{1,1}, ..., P_{4,4}) = Π_{i,j = 1,1}^{4,4} P(P_{i,j})
which, for a particular configuration that has n pits, is P(P_{1,1}, ..., P_{4,4}) = 0.2^n × 0.8^(16-n)

Probabilities in Wumpus World
in the example, we have observed evidence: a breeze or not in each visited square + no pit in any visited square, abbreviated as b & known
b = ¬b_{1,1} ∧ b_{1,2} ∧ b_{2,1}
known = ¬p_{1,1} ∧ ¬p_{1,2} ∧ ¬p_{2,1}
an example query concerns the safety of other squares: what's the probability of a pit at [1,3], given the
evidence so far? P(P_{1,3} | known, b)
we could answer by summing over cells in the FJD

Probabilities in Wumpus World
to use summation over the FJD
let Unknown be the set of P_{i,j} variables for squares other than the Known squares & [1,3]
so from 13.9 we have P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)
that is, we can just sum over the entries in the Full Joint Distribution
but with 12 unknown squares we have 2^12 = 4096 terms in the summation, so the calculation is exponential in the number of squares
so we'll need to simplify
from insight about independence we note: not all unknown squares are equally relevant to the query

Probabilities in Wumpus World
since summations over the FJD are exponential we need to simplify, given insight about independence
to begin, we note that not all unknown squares are equally relevant to the query
first, some terminology about partitioning the pit variables
frontier are those pit variables (besides the query variable) neighbouring the visited squares
other are the remaining pit variables
with this revision, we see that the observed breezes are conditionally independent of the other variables, given the known, frontier & query variables

Probabilities in Wumpus World
using conditional independence
P(b | P_{1,3}, Known, Unknown) = P(b | P_{1,3}, Known, Frontier)
note that the figures use Fringe, while the text uses Frontier, to name the relevant squares neighbouring the visited squares ([2,2] & [3,1])
then we'll need to manipulate our query into a form where we can use this
the query: P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, unknown, known, b)

Using Conditional Independence
using the conditional independence simplification
P(P_{1,3} | known, b) = α Σ_{unknown} P(P_{1,3}, known, b, unknown) (the query, from 13.9)
then by the product rule
= α Σ_{unknown} P(b | P_{1,3}, known, unknown) P(P_{1,3}, known, unknown)
then partitioning unknown into frontier & other
= α Σ_{frontier} Σ_{other} P(b | known, P_{1,3}, frontier, other) P(P_{1,3}, known, frontier, other)
then using
the conditional independence of b from other, given known, $P_{1,3}$ & frontier (& so dropping other from the 1st term)
  $$= \alpha \sum_{frontier} \sum_{other} \mathbf{P}(b \mid known, P_{1,3}, frontier)\, \mathbf{P}(P_{1,3}, known, frontier, other)$$
since the 1st term now does not depend on other, move the summation inward
  $$= \alpha \sum_{frontier} \mathbf{P}(b \mid known, P_{1,3}, frontier) \sum_{other} \mathbf{P}(P_{1,3}, known, frontier, other)$$

continuing to manipulate the query to get efficient computation
use independence as in 13.20 to factor the prior term
  $$= \alpha \sum_{frontier} \mathbf{P}(b \mid known, P_{1,3}, frontier) \sum_{other} \mathbf{P}(P_{1,3})\, P(known)\, P(frontier)\, P(other)$$
then reorder terms
  $$= \alpha\, P(known)\, \mathbf{P}(P_{1,3}) \sum_{frontier} \mathbf{P}(b \mid known, P_{1,3}, frontier)\, P(frontier) \sum_{other} P(other)$$
fold $P(known)$ into the normalizing constant & use $\sum_{other} P(other) = 1$
  $$= \alpha'\, \mathbf{P}(P_{1,3}) \sum_{frontier} \mathbf{P}(b \mid known, P_{1,3}, frontier)\, P(frontier)$$

Probabilities in Wumpus World
using conditional independence & independence has yielded an expression with just 4 terms in the summation over the frontier variables $P_{2,2}$ & $P_{3,1}$, eliminating the other squares
  $$\mathbf{P}(P_{1,3} \mid known, b) = \alpha'\, \mathbf{P}(P_{1,3}) \sum_{frontier} \mathbf{P}(b \mid known, P_{1,3}, frontier)\, P(frontier)$$
the expression $\mathbf{P}(b \mid known, P_{1,3}, frontier)$ is 1 when the frontier is consistent with the breeze observations, 0 otherwise
so to get each value of $P_{1,3}$ we sum over the logical models for the frontier variables that are consistent with the known facts
  $$\mathbf{P}(P_{1,3} \mid known, b) = \alpha'\, \mathbf{P}(P_{1,3}) \sum_{\text{consistent } frontier} P(frontier)$$
  $$\mathbf{P}(P_{1,3} \mid known, b) = \alpha' \langle 0.2(0.04 + 0.16 + 0.16),\ 0.8(0.04 + 0.16) \rangle \approx \langle 0.31, 0.69 \rangle$$
  (figure omitted: consistent models for frontier variables $P_{2,2}$ & $P_{3,1}$, with $P(frontier)$ for each model, for $P_{1,3} = true$ & $P_{1,3} = false$)

Using Conditional Independence
note that $P_{1,3}$ & $P_{3,1}$ are symmetric, so
by symmetry, [3,1] would contain a pit about 31% of the time:
  $$\mathbf{P}(P_{3,1} \mid known, b) \approx \langle 0.31, 0.69 \rangle$$
& by a similar calculation, [2,2] can be shown to contain a pit with about 0.86 probability:
  $$\mathbf{P}(P_{2,2} \mid known, b) \approx \langle 0.86, 0.14 \rangle$$
it is clear to the probabilistic agent where not to go next

Probabilities in Wumpus World
the logical agent & the probabilistic agent
  strictly logical inferencing can only yield known safe/known unsafe/unknown
  the probabilistic agent knows which move is relatively safer, relatively more dangerous
for efficient probabilistic solutions
  we can use independence & conditional independence among variables to simplify the summations involved
  fortunately, these often match our natural understanding of how the problem should be decomposed
our next topic considers
  formal representations for these relationships
  algorithms to operate on them to do efficient probabilistic inferencing

Summary
uncertainty is due to laziness &/or ignorance, and is unavoidable under nondeterminism &/or partial observability
probabilities describe an agent's inability to decide on the truth of a sentence, summarizing belief relative to evidence
decision theory combines beliefs & preferences, defining an optimal action as one that maximizes expected utility
statements in probability involve prior & conditional probabilities over propositions
the axioms of probability constrain the probabilities of propositions; an agent that violates them behaves irrationally in some cases
the Full Joint Probability Distribution specifies the probability for every assignment of values to random variables, & when available allows summation over possible worlds to answer queries, but has complexity exponential in the number of variables
absolute independence allows decomposition of a problem's random variables into smaller joint distributions, reducing complexity, but is rare
Bayes' rule allows computing probabilities, typically of a cause given an
effect, from known conditional probabilities, but does not scale when there are many evidence variables
conditional independence derives from shared causal relationships in the domain & may allow factoring of the FJD into smaller conditional distributions
a naïve Bayes model assumes that all effect variables are conditionally independent given a single cause variable, so its complexity grows linearly with the number of effects
in Wumpus World, by simplifying calculations via conditional independence, the agent may calculate probabilities for unobserved variables & so do better than a purely logical agent
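
the frontier summation from the Wumpus World example can be checked with a short script. This is a minimal sketch, not code from the course: it assumes the standard AIMA configuration (breezes observed in [1,2] & [2,1], no pits in the visited squares) and a pit prior of 0.2 per square, as implied by the 0.2/0.8 terms in the calculation. The names `PIT_PRIOR`, `BREEZE_NEIGHBOURS`, `CANDIDATES` & `pit_posterior` are hypothetical, chosen for illustration.

```python
from itertools import product

# Assumption: each square independently holds a pit with probability 0.2.
PIT_PRIOR = 0.2

# Assumption: breezes were observed in [1,2] and [2,1]. Each breezy square is
# mapped to its unvisited neighbours (the visited squares are known pit-free).
BREEZE_NEIGHBOURS = {
    (1, 2): [(1, 3), (2, 2)],
    (2, 1): [(2, 2), (3, 1)],
}

# The query square [1,3] plus the two frontier squares [2,2] and [3,1].
CANDIDATES = [(1, 3), (2, 2), (3, 1)]


def prior(assignment):
    """Prior probability of a pit assignment, with pits placed independently."""
    p = 1.0
    for has_pit in assignment.values():
        p *= PIT_PRIOR if has_pit else 1.0 - PIT_PRIOR
    return p


def consistent(assignment):
    """P(b | known, query, frontier): True (1) if every breeze has an adjacent pit."""
    return all(
        any(assignment[sq] for sq in neighbours)
        for neighbours in BREEZE_NEIGHBOURS.values()
    )


def pit_posterior(query):
    """Return (P(pit), P(no pit)) for `query`, summing only over frontier models."""
    others = [sq for sq in CANDIDATES if sq != query]
    weights = {}
    for value in (True, False):
        total = 0.0
        # Enumerate the 4 assignments to the two remaining squares.
        for combo in product((True, False), repeat=len(others)):
            assignment = dict(zip(others, combo))
            assignment[query] = value
            if consistent(assignment):
                total += prior(assignment)  # P(query) * P(frontier)
        weights[value] = total
    norm = weights[True] + weights[False]  # plays the role of 1 / alpha'
    return weights[True] / norm, weights[False] / norm
```

under these assumptions, `pit_posterior((1, 3))` & `pit_posterior((3, 1))` both come out at roughly <0.31, 0.69>, & `pit_posterior((2, 2))` at roughly <0.86, 0.14>, matching the values in the slides. Only 4 models are enumerated per query value, rather than the $2^{12}$ entries needed when summing over the Full Joint Distribution.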