Chapter 3: Sequential Decisions

"Life must be understood backward, but ... it must be lived forward." - Soren Kierkegaard

Terminology
• tree
• terminal node (leaf)
• backward graph (edges reversed)
• decision graph - each arc represents a choice; the graph must be acyclic
• Payoffs can sit at terminal nodes or along the edges
• Theorem: any subpath of an optimal path is itself optimal

Games of Chicken
• A monopolist faces a potential entrant
• The monopolist can accommodate or fight
• The potential entrant can enter or stay out

Payoffs (Potential Entrant, Monopolist):

                            Monopolist
                            Accommodate     Fight
Potential Entrant    In     50, 50          -50, -50
                     Out    0, 100          0, 100

Equilibrium
• Use the best-response method on the payoff table above to find the equilibria.

Importance of Order
• Two equilibria exist: (In, Accommodate) and (Out, Fight)
• Only one makes temporal sense
• Fight is a threat, but not a credible one - because once the entrant has decided to enter, the monopolist loses by fighting
• (Out, Fight) is not sequentially rational
• Simultaneous-move outcomes may not make sense for sequential games

Sequential Games: The Extensive Form
[Game tree: the Entrant (E) chooses In or Out; Out ends the game with payoffs (0, 100); after In, the Monopolist (M) chooses Fight, giving (-50, -50), or Accommodate, giving (50, 50).]

Looking Forward...
• The entrant makes the first move and must consider how the monopolist will respond
• If the entrant enters, the monopolist accommodates: 50 beats -50 for the monopolist

... And Reasoning Back
• Now consider the entrant's move. With the monopolist's fight branch trimmed, In (followed by Accommodate) yields 50 for the entrant, while Out yields 0.
• Only (In, Accommodate) is sequentially rational.

Sequential Rationality
COMMANDMENT: Look forward and reason back. Anticipate what your rivals will do tomorrow in response to your actions today.

Solving Sequential Games
• Start with the last move in the game
• Determine what that player will do
• Trim the tree: eliminate the dominated strategies
• This results in a simpler game
• Repeat the procedure - this is called rollback

Example 3.9
[Figure: a decision graph from start node R to terminal node S, with intermediate nodes A through Q and a cost on each edge.]
Pick the nodes one step up from the leaves, select the best (lowest-cost) choice at each, and reduce the graph. The cost of a path is the sum of the edge costs along it.

Example 3.9 (continued)
Repeat: pick the nodes at the last remaining choice point, select the best (lowest-cost) choice, and reduce the graph again.

Example 3.9 - backwards induction, also called rollback
Repeat until only the optimal path from R to S remains.

Voting
• Majority rule can produce no conclusion. With three voters whose preferences are
  B>G>R, G>R>B, R>B>G:
  B beats G; G beats R; R beats B - a cycle.
• What if you want R to win?
  Hold B vs. G first (B wins), then the winner vs. R: R wins.
• Problem: everyone knows you want R. Proposing "B vs. G, then the winner vs. R"? Good luck!
• Better chance: R vs. G first, then the winner versus B.
• Interesting how the voting order produces a winner in a case with no majority winner!

Extensive Form
• Preferences: B>G>R, G>R>B, R>B>G
[Voting tree for the agenda: first round B vs. G; the winner then faces R.]

Looking Forward
• In the final round against R: a majority prefers R to B.
• In the first round: a majority prefers B to G.

Trim The Tree
[Each final-round contest is replaced by its winner, so the trimmed tree shows the first vote deciding the eventual outcome.]

Rollback in Voting and "Being Political"
• It is not necessarily good to vote your true preferences:
  • Amendments to make bad bills worse
  • Crossing over in open primaries
  • "Centrist" voting in primaries
  • Supporting your second-best option
• STILL - the outcome is predetermined
• AGENDA SETTING!

Predatory Pricing
• An incumbent firm operates in three markets and faces entry in each:
  Market 1 in year 1, Market 2 in year 2, etc.
• Each time, the incumbent can slash prices (fight) or accommodate the new entry.
• What should the incumbent do the first year?
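As a minimal sketch of the rollback procedure, here is backward induction applied to the single-market entry game above, with payoffs written as (entrant, monopolist). The tree encoding and function names are ours, not from the chapter.

# Rollback (backward induction) on the entry game from the table above.
def rollback(node):
    """Return (payoffs, path) chosen by backward induction."""
    if "payoffs" in node:                      # terminal node (leaf)
        return node["payoffs"], []
    best = None
    for action, child in node["choices"].items():
        payoffs, path = rollback(child)
        # the player moving at this node picks the action maximizing her own payoff
        if best is None or payoffs[node["player"]] > best[0][node["player"]]:
            best = (payoffs, [action] + path)
    return best

entry_game = {
    "player": 0,  # 0 = entrant, 1 = monopolist
    "choices": {
        "Out": {"payoffs": (0, 100)},
        "In": {
            "player": 1,
            "choices": {
                "Fight":       {"payoffs": (-50, -50)},
                "Accommodate": {"payoffs": (50, 50)},
            },
        },
    },
}

print(rollback(entry_game))   # ((50, 50), ['In', 'Accommodate'])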
Predatory Pricing
[Game tree: entrants E1, E2, E3 arrive in successive years; after each entry the incumbent M chooses to fight or accommodate.]

Predatory Pricing: the end of the tree, year 3
[Subtree for year 3: if E3 stays out, payoffs are (0, 100) plus the previous years' payoffs; if E3 enters, M can fight for (-50, -50) plus previous, or accommodate for (50, 50) plus previous.]
• In year 3 the outcome is (In, Accommodate).

Predatory Pricing
• Since the incumbent will not fight Entrant 3, he will not fight Entrant 2.
• The same holds for Entrant 1.
• There is only one rollback equilibrium:
  • All entrants play In
  • The incumbent plays Accommodate
• So why do we see predatory pricing in practice?
• Predatory pricing: an anti-competitive measure employed by a dominant company to protect market share from new or existing competitors. It involves temporarily pricing a product low enough to end a competitive threat.

Sophie's Choice
• Sophie has $100 and a long, boring holiday without exciting university lectures.
• She can watch videos or play Nintendo games.
• Videos are $4 each; Nintendo games are $5 each.
• What is Sophie's choice?

Standard price-taker budget set
[Figure: budget line in (Qgames, Qvideos) space, with intercepts at 25 videos and 20 games.]
First find Sophie's choice set and budget line: 4 Qvideos + 5 Qgames <= 100.

Convex, smooth preferences
[Figure: indifference curves for U = 80, 100, 120, 140.]
Then show her preferences. From Sophie's perspective both videos and Nintendo games are 'goods' (desirables). Her utility function is U.

Put them together
[Figure: budget line and indifference curves on the same axes.]
First, note that because both videos and games are goods and there is nothing else for Sophie to spend her money on, she will consume on her budget line - it is her solution space.

Where on the budget line?
• Start with 25 videos and 0 games, a bundle on her budget line. Can she do better? Yes! If she buys fewer videos and uses some of the money to buy games, she moves to a higher indifference curve, so she is better off.
• What if we start with 20 games and no videos? Can Sophie make a better choice for herself? Yes! If she buys fewer games and uses some of her money to buy videos, she again moves to higher indifference curves.
• So she prefers a mixture of videos and games. But what mix is best?

Tangency condition
• The best bundle for Sophie is where her indifference curve is just tangent to her budget line. Here that is 10 videos and 12 games.
[Figure: the budget line and indifference curves magnified around the tangency point.]
• She can move anywhere on her budget line, but if she stops before she reaches the tangency bundle she is not maximising her utility.
• Only when she reaches her tangency bundle is she on her highest attainable indifference curve (U = 95).
• She cannot do better than this bundle. For example, she cannot reach the U = 95.5 indifference curve: she doesn't have enough money.

Summary so far
• Sophie will choose her optimal bundle where her indifference curve is just tangent to her budget line.
• This puts her on her highest possible indifference curve given her budget.
• But why does this make economic sense?
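As a sanity check on the tangency story, here is Sophie's problem as a brute-force search over affordable integer bundles. The Cobb-Douglas utility U = v^0.4 * g^0.6 is an assumed functional form, chosen only because it reproduces the (10 videos, 12 games) optimum in the figures; the slides never write U down explicitly.

# Sophie's choice by exhaustive search over affordable integer bundles.
BUDGET, P_VIDEO, P_GAME = 100, 4, 5

def utility(videos, games):
    # assumed Cobb-Douglas form, consistent with the optimum shown in the slides
    return videos ** 0.4 * games ** 0.6

best = max(
    ((v, g) for v in range(BUDGET // P_VIDEO + 1)
            for g in range(BUDGET // P_GAME + 1)
            if P_VIDEO * v + P_GAME * g <= BUDGET),
    key=lambda bundle: utility(*bundle),
)
print(best, round(utility(*best), 2))   # (10, 12) 11.16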
Founders of Probability Theory
• Blaise Pascal (1623-1662, France)
• Pierre Fermat (1601-1665, France)
They laid the foundations of probability theory in a correspondence about a dice game.

Prior, Joint and Conditional Probabilities
P(A) = prior probability of A
P(B) = prior probability of B
P(A, B) = joint probability of A and B
P(A | B) = conditional (posterior) probability of A given B
P(B | A) = conditional (posterior) probability of B given A

Probability Rules
Product rule: P(A, B) = P(A | B) P(B), or equivalently P(A, B) = P(B | A) P(A)
Sum rule: P(A) = Σ_B P(A, B) = Σ_B P(A | B) P(B)
If A is conditionalized on B, the total probability of A is the sum of its joint probabilities with all values of B.

Statistical Independence
Two random variables A and B are independent iff:
P(A, B) = P(A) P(B)
P(A | B) = P(A)
P(B | A) = P(B)
Knowing the value of one variable yields no information about the value of the other.

Statistical Dependence
When A and B are not independent, knowing one does change the probability of the other; we reason about this with conditional probabilities and Bayes' theorem.

Bayes
Thomas Bayes (1702-1761, England)
"Essay towards solving a problem in the doctrine of chances", published in the Philosophical Transactions of the Royal Society of London in 1764.

Bayes Theorem
P(A | B) = P(A, B) / P(B)
P(B | A) = P(A, B) / P(A)
=> P(A, B) = P(A | B) P(B) = P(B | A) P(A)
=> P(A | B) = P(B | A) P(A) / P(B)

Bayes Theorem and Causality
P(A | B) = P(B | A) P(A) / P(B)
Diagnostic: P(Cause | Effect) = P(Effect | Cause) P(Cause) / P(Effect)
Pattern recognition: P(Class | Feature) = P(Feature | Class) P(Class) / P(Feature)

Bayes Formula and Classification
p(C | X) = p(X | C) p(C) / p(X)
• p(X | C): the likelihood - the conditional probability of the data given the class
• p(C): the prior probability of the class, before seeing anything
• p(C | X): the posterior probability of the class, after seeing the data
• p(X): the unconditional probability of the data

Medical Example
• The probability you have the disease is .002
• If you have the disease, the probability that the test is positive is .97
• If you don't have the disease, the probability that the test is positive is .04
• What is the probability of a positive test?
• p(+test) = .002 * .97 + .998 * .04

Medical example (worked)
p(+disease) = 0.002
p(+test | +disease) = 0.97
p(+test | -disease) = 0.04
p(+test) = p(+test | +disease) * p(+disease) + p(+test | -disease) * p(-disease)
         = 0.97 * 0.002 + 0.04 * 0.998 = 0.00194 + 0.03992 = 0.04186
p(+disease | +test) = p(+test | +disease) * p(+disease) / p(+test)
                    = 0.97 * 0.002 / 0.04186 = 0.00194 / 0.04186 = 0.046
p(-disease | +test) = p(+test | -disease) * p(-disease) / p(+test)
                    = 0.04 * 0.998 / 0.04186 = 0.03992 / 0.04186 = 0.954
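The same arithmetic, wrapped in a small reusable function as a minimal sketch; the function and argument names are ours.

# Posterior probability of disease given a positive test, by Bayes' theorem.
def posterior(prior, p_pos_given_disease, p_pos_given_healthy):
    """P(disease | positive test)."""
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1 - prior)
    return p_pos_given_disease * prior / p_pos

p = posterior(prior=0.002, p_pos_given_disease=0.97, p_pos_given_healthy=0.04)
print(round(p, 3))        # 0.046  -> p(+disease | +test)
print(round(1 - p, 3))    # 0.954  -> p(-disease | +test)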
Bayesian Decision Theory: the fish example
• Each fish is in one of two states: sea bass or salmon.
• Let w denote the state of nature:
  w = w1 for sea bass
  w = w2 for salmon

Bayesian Decision Theory cont.
• The state of nature is unpredictable: w is a variable that must be described probabilistically.
• If the catch produced as much salmon as sea bass, the next fish is equally likely to be sea bass or salmon.
• a priori: before the event; ex post: after the event.
• Define:
  P(w1): the a priori probability that the next fish is sea bass
  P(w2): the a priori probability that the next fish is salmon

Bayesian Decision Theory cont.
• If other types of fish are irrelevant: P(w1) + P(w2) = 1.
• Prior probabilities reflect our prior knowledge (e.g. time of year, fishing area, ...).
• Simple decision rule: decide about the next fish without seeing it.
  Decide w1 if P(w1) > P(w2); otherwise decide w2.
  This is fine for a single fish, but if several fish are caught, all of them get assigned to the same class.
• If we knew something about the fish (like how light it looked), could we make a better decision?

Bayesian Decision Theory cont.
• In general, we will have some features we can use to help us predict.
• Feature: lightness reading = x. Different fish yield different lightness readings, so x is a random variable.

Bayesian Decision Theory cont.
• Define p(x | w1), the class-conditional probability density: the probability density function for x given that the state of nature is w1, i.e. the probability of observing lightness reading x when the fish is w1.
• The difference between p(x | w1) and p(x | w2) describes the difference in lightness between sea bass and salmon.
[Figure: hypothetical class-conditional probability density functions for the two classes; each density is normalized so the area under the curve is 1.0.]

Bayesian Decision Theory cont.
• Suppose we know the prior probabilities P(w1) and P(w2) and the conditional densities p(x | w1) and p(x | w2), and we measure the lightness of a fish: x.
• What is the category of the fish given the lightness reading, P(wj | x)?

Bayes' formula
P(wj | x) = p(x | wj) P(wj) / p(x), where p(x) = Σ_{j=1}^{2} p(x | wj) P(wj)
Posterior = (Likelihood × Prior) / Evidence
(Recall P(A | B) = P(A, B) / P(B).)

Bayes' formula cont.
• p(x | wj) is called the likelihood of wj with respect to x: the category wj for which p(x | wj) is large is more "likely" to be the true category.
• P(wj) is the prior probability that wj is true.
• p(x) is the evidence: how frequently we measure a pattern with feature value x. It is a scale factor that guarantees the posterior probabilities sum to 1.
[Figure: posterior probabilities for the priors P(w1) = 2/3 and P(w2) = 1/3; at every x the posteriors sum to 1.]

Error
P(error | x) = P(w1 | x) if we decide w2, and P(w2 | x) if we decide w1.
For a given x, we minimize the probability of error by deciding w1 if P(w1 | x) > P(w2 | x), and w2 otherwise.

Bayes' Decision Rule (minimizes the probability of error)
Decide w1 if P(w1 | x) > P(w2 | x); otherwise decide w2.
Equivalently: decide w1 if p(x | w1) P(w1) > p(x | w2) P(w2); otherwise decide w2.
Then P(error | x) = min[P(w1 | x), P(w2 | x)].
• If the posterior probabilities are both 1/2, our chance of error is 1/2.
• If the posteriors differ, say [1/4, 3/4], our chance of error is only 1/4: we always pick the most likely class, which means cases belonging to the less likely class will all be diagnosed incorrectly.

Why many spam filter vendors have implemented Bayesian filtering
• Most spam filtering products currently on the market are keyword/keyphrase-based filters.
• These filters were fairly effective in stopping spam two years ago, although they have always exhibited an unacceptably high false-positive rate.
• However, spammers have been busy developing custom software to generate their spam, which hides these keywords and phrases in increasingly sophisticated ways.
• To make matters worse, the spamming community actually publishes these keywords on the Internet, so that spammers can avoid using them. This has made keyword/keyphrase filters virtually ineffective in stopping spam.
• Confronted with the harsh reality that their products' entire infrastructure is built around an outdated, ineffective paradigm, the keyword spam filter vendors decided to hook their wagon to a small portion of Bayesian theory. They theorized that by applying a score to their existing keywords and then aggregating that score based on hits for each keyword, they could prolong the life of their failing products.

Advantages of their pseudo-Bayesian approach
1) Lower false-positive rates than keyword filters alone (it takes more keyword hits to classify a message as spam).
2) Slightly increased spam identification rate over keywords alone.

Problems with their pseudo-Bayesian approach
1) Significantly increased system resource usage (what used to take one pass now takes as many as 10-15 passes) to aggregate the total point value needed to identify a message as spam, or to clear a message as OK.
2) Can't identify cloaked spam (which is generally the most vile spam), such as "v*i(a)g-r-a" or bogus HTML tags, as well as more sophisticated cloaking.
3) Still based on, and dependent upon, having clearly visible and obvious keywords/keyphrases.
4) No method of determining why a particular message was caught by the filter, making it impossible to subsequently, intelligently tune the filter for optimal spam recognition.
5) Blind "training" and retraining of the Bayesian filter usually produces unpredictable results and often negatively impacts the filter's ability to correctly identify future spam.
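For contrast, here is a minimal sketch of genuinely word-based (naive) Bayesian spam scoring, as opposed to the keyword-score approach criticized above. The token counts are invented purely for illustration; real filters add token selection, better smoothing, and much larger training corpora.

# Naive Bayes spam scoring from per-token probabilities (illustrative counts only).
from math import prod

spam_counts = {"viagra": 40, "free": 30, "meeting": 1}    # made-up training counts
ham_counts  = {"viagra": 1,  "free": 10, "meeting": 30}
n_spam, n_ham = 100, 100        # number of training messages of each kind
p_spam = 0.5                    # prior P(spam)

def p_token(token, counts, n):
    # P(token | class) with add-one smoothing so unseen tokens don't give zero
    return (counts.get(token, 0) + 1) / (n + 2)

def p_spam_given(tokens):
    # naive Bayes: treat tokens as independent given the class
    like_spam = p_spam * prod(p_token(t, spam_counts, n_spam) for t in tokens)
    like_ham = (1 - p_spam) * prod(p_token(t, ham_counts, n_ham) for t in tokens)
    return like_spam / (like_spam + like_ham)

print(round(p_spam_given(["free", "viagra"]), 3))    # close to 1: likely spam
print(round(p_spam_given(["meeting", "free"]), 3))   # well below 0.5: likely ham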
Example
• A certain disease is fatal 40% of the time.
• 45% of those cured took radiation.
• 20% of the people who did not survive took radiation.
• Let A: cured; let B: took radiation. We want P(A | B).
• P(A) = .60, P(Ac) = .40
• P(B | A) = .45, P(B | Ac) = .20
• P(A | B) = (.45 * .6) / (.45 * .6 + .2 * .4) = .7714

Example
• In a particular year, forty-five of seventy-four athletes admitted to the university graduate, so P(G | A) ≈ .61.
• Roughly 15% of all students are athletes.
• Suppose the graduation rate for the university is 45%.
• At graduation, if you meet someone, what is the probability he or she is an athlete?
  P(A | G) = P(G | A) P(A) / P(G) = .61 * .15 / .45 ≈ .20

Deductive Reasoning
Consider the propositions:
A = (The sprinklers are on)
B = (The grass is wet)

Major premise: If A is TRUE, then B is TRUE
Minor premise: A is TRUE
Conclusion: Therefore, B is TRUE

Major premise: If A is TRUE, then B is TRUE
Minor premise: B is FALSE
Conclusion: Therefore, A is FALSE
(Aristotle, ~350 BC)

Deductive Reasoning - ii
Major premise: If A is TRUE, then B is TRUE
Minor premise: A is FALSE
Conclusion: Therefore, B is ?

Major premise: If A is TRUE, then B is TRUE
Minor premise: B is TRUE
Conclusion: Therefore, A is ?

Inductive Reasoning
Major premise: If A is TRUE, then B is TRUE
Minor premise: B is TRUE
Conclusion: Therefore, A is more plausible

Major premise: If A is TRUE, then B is TRUE
Minor premise: A is FALSE
Conclusion: Therefore, B is less plausible

Can "plausible" be made precise? Yes!
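As a small numeric check, the two inductive patterns above follow directly from the probability rules. The particular numbers for the sprinkler/grass world are invented; all that matters is that the major premise holds, i.e. P(grass wet | sprinklers on) = 1.

# Inductive reasoning as probability: observing B raises the plausibility of A,
# and learning not-A lowers the plausibility of B.
p_a = 0.3                      # P(sprinklers on), an assumed prior
p_b_given_a = 1.0              # major premise: if A then B
p_b_given_not_a = 0.4          # grass can also be wet for other reasons (assumed)

p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)     # sum rule
p_a_given_b = p_b_given_a * p_a / p_b                     # Bayes' theorem

print(p_a_given_b > p_a)          # True: observing B makes A more plausible
print(p_b_given_not_a < p_b)      # True: learning not-A makes B less plausible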
Bayes (1763), Laplace (1774), Boole (1854), Jeffreys (1939), Cox (1946), Polya (1946), Jaynes (1957)
In 1946, the physicist Richard Cox showed that inductive reasoning follows rules that are isomorphic to those of probability theory.

Probability
[Venn diagram: events A and B with overlap AB.]
Conditional probability:
P(A | B) = P(AB) / P(B)
P(B | A) = P(AB) / P(A)
A theorem:
P(A or B) = P(A) + P(B) - P(AB)

Probability - ii
Product rule: P(AB) = P(B | A) P(A) = P(A | B) P(B)
Sum rule: P(A | B) + P(not-A | B) = 1
Bayes' theorem: P(B | A) = P(A | B) P(B) / P(A)
These rules, together with Boolean algebra, are the foundation of Bayesian probability theory.

Bayes' Theorem
P(Ci Dj | A) = P(A | Ci Dj) P(Ci Dj) / P(A)
If the propositions Ci Dj are exhaustive, i.e. Σ_{i,j} P(Ci Dj | A) = 1, then we can write Bayes' theorem as
P(Ci Dj | A) = P(A | Ci Dj) P(Ci Dj) / Σ_{i,j} P(A | Ci Dj) P(Ci Dj)
We can sum over propositions that are of no interest (marginalization):
P(Ci | A) = Σ_j P(Ci Dj | A)

Bayes' Theorem: Example 1
• Signal/background discrimination: S = signal, B = background.
P(S | Data) = P(Data | S) P(S) / [P(Data | S) P(S) + P(Data | B) P(B)]
• The probability P(S | Data) that an event is signal, given some observed Data, can be approximated in several ways, for example with a feed-forward neural network.

Black and blue taxis
• Consider the witness problem in law courts. Witness reports are notoriously unreliable, which does not stop people being locked away on the basis of little more.
• Consider a commonly cited scenario. A town has two taxi companies: one runs blue taxi-cabs and the other black taxi-cabs. The Blue Company has 15 taxis and the Black Cab Company has 85 vehicles. Late one night, there is a hit-and-run accident involving a taxi. Assume all 100 taxis were on the streets at the time.
• A witness sees the accident and claims that a blue taxi was involved. At the request of the defence, the witness undergoes a vision test under conditions similar to those on the night in question. Presented repeatedly with a blue taxi and a black taxi, in 'random' order, the witness successfully identifies the colour of the taxi 4 times out of 5 (80% of the time). The other 1/5 of the time, he misidentifies a blue taxi as black or a black taxi as blue.
• Bayesian probability theory asks: "If the witness reports seeing a blue taxi, how likely is it that he has the colour correct?"
• As the witness is correct 80% of the time (4 times in 5), he is also incorrect 1 time in 5, on average.
• Of the 15 blue taxis, he would correctly identify 80% as blue, namely 12, and misidentify the other 3 blue taxis as black.
• Of the 85 black taxis, he would incorrectly identify 20% as blue, namely 17.
• Thus, in all, he would have misidentified the colour of 20 of the taxis, and he would have called 29 of the taxis blue when there are only 15 blue taxis in the town.
• In the situation in question, the witness is telling us that the taxi was blue. But in the test he identified 29 taxis as blue: he called 12 blue taxis 'blue' and 17 black taxis 'blue' as well. So when the witness says 'blue', he is correct only 12 times out of 29.
• Thus the probability that a taxi the witness claims to be blue actually is blue, given the witness's identification ability, is 12/29, i.e. about 0.41.
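The same 12/29 answer, computed directly with Bayes' theorem as a minimal sketch; the variable names are ours, the numbers are exactly those in the story.

# Taxi problem: P(taxi is blue | witness says blue).
p_blue = 15 / 100                     # prior: fraction of taxis that are blue
p_says_blue_given_blue = 0.8          # witness accuracy
p_says_blue_given_black = 0.2         # witness error rate

p_says_blue = (p_says_blue_given_blue * p_blue
               + p_says_blue_given_black * (1 - p_blue))       # 0.29
p_blue_given_says_blue = p_says_blue_given_blue * p_blue / p_says_blue

print(round(p_blue_given_says_blue, 2))   # 0.41, i.e. 12/29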
• When the witness said the taxi was blue, he was therefore incorrect nearly 3 times out of every 5: the test showed the witness to be correct less than half the time.
• Bayesian probability takes account of the real distribution of taxis in the town. It considers not just the witness's ability to identify blue taxis correctly (80%), but also his ability to pick out the blue taxis among all the taxis in town. In other words, Bayesian probability takes account of the witness's propensity to misidentify black taxis as well; in the trade, these are called 'false positives'.
• The 'false negatives' were the blue taxis that the witness misidentified as black.
• Bayesian probability statistics (BPS) matters most when calculating comparatively small risks, and in situations where the underlying distribution is far from even, as in this case where there were far more black taxis than blue ones.
• Had the witness called the offending taxi black, the calculation would have been {the 68 taxis the witness correctly named as black} over {the 71 taxis the witness thought were black}, i.e. 68/71 (the difference being the 3 blue taxis the witness thought were black). So nearly 96% of the time, when the witness thought the taxi was black, it was indeed black.
• Unfortunately, most people untrained in the analysis of probability tend to intuit, from the 80% accuracy of the witness, that the witness can identify blue cars among many others with an 80% rate of accuracy. I hope the example above convinces you that this is a very unsafe belief.
• Thus, in a court trial, what matters is not a person's ability to identify someone in a pre-arranged line-up of 8 (with a 1/8th, or 12.5%, chance of guessing 'right' by luck), but their ability to recognise them in a crowded street or a darkened alleyway, in conditions of stress.

Testing for rare conditions
• Virtually every lab-conducted test involves sources of error. Test samples can be contaminated, or one sample can be confused with another. The report on a test you receive from your doctor may belong to someone else, or have been sloppily performed. When the supposed results are bad, such tests can produce fear. But let us assume the laboratory has done its work well, and the medic is not currently drunk and incapable.
• The problem of false positives is still a considerable difficulty. Virtually every medical test designed to detect a disease or medical condition has a built-in margin of error. The size of the margin varies from one test procedure to another, but it is often in the range of 1-5%, and sometimes much greater. "Error" here means that the test will sometimes indicate the presence of the disease even when no disease is present.
• Suppose a lab is using a test for a rare condition, a test with a 2% false-positive rate. This means the test will indicate the disease in 2% of people who do not have the condition.
• Among 1,000 people tested who do not have the disease, the test will suggest that about 20 do have it. If, as we are supposing, the disease is rare (say it occurs in 0.1% of the population, 1 in 1,000), it follows that the large majority (here about 95%, roughly 19 in 20) of the people whom the test reports to have the disease will have been misdiagnosed!
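The counting argument above as a few lines of arithmetic, assuming, as the text implicitly does, that the test essentially never misses a real case.

# False positives vs. true cases for a rare condition in 1,000 people tested.
tested = 1_000
prevalence = 0.001            # 1 in 1,000 actually have the condition
false_positive_rate = 0.02    # the test flags 2% of healthy people

true_cases = tested * prevalence                                     # about 1
false_positives = tested * (1 - prevalence) * false_positive_rate    # about 20
share_misdiagnosed = false_positives / (false_positives + true_cases)

print(round(true_cases), round(false_positives), round(share_misdiagnosed, 2))
# 1 20 0.95  -> about 95% of positive reports are wrong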
Consider a concrete example [5]. Suppose that a woman (a white female who has not recently had a blood transfusion, does not take drugs, and does not have sex with intravenous drug users or bisexuals) goes to her doctor and requests an HIV test. Given her demographic profile, her risk of being HIV-positive is about 1 in 100,000. Even if the HIV test were so good that it had a false-positive rate as low as 0.1% (and it is nothing like that good), approximately 100 women among 100,000 similar women would test positive for HIV, even though only about one of them is actually infected with HIV.
• When you consider both the traumatising effect of such reports on people and the effect on future insurability, employability and the like, it becomes clear that the false-positive problem is much more than an interesting technical flaw.
• If your medic ever reports that you tested positive for some rare disorder, you should be extremely skeptical: there is a considerable likelihood that the diagnosis itself is mistaken. Knowing this, intelligent physicians are very careful in their use of test results and in their subsequent discussion with patients. But not all doctors have the time, or the ability, to treat test results with the skepticism they often deserve.

How bad can it get?
• In general: the rarer the condition, and the less precise the test (or judgement), the more likely (frequent) the error.
• Consider the HIV test above. Many such tests are wrong 5% of the time, or more. Remember that the real risk for our heterosexual white woman was around 1 in 100,000, yet a test with a 5% false-positive rate would indicate positive for about 5,000 of every 100,000 people tested. Applied to a low-risk group like white heterosexual women (who do not inject drugs and do not have sex with members of high-risk groups such as bisexuals, haemophiliacs or drug injectors), a positive HIV result would therefore be incorrect roughly 4,999 times out of 5,000.
• In general, if the risk were even lower and the test method still had a 5% error rate, the rate of false positives among those testing positive would be even greater. It also increases as test accuracy decreases.
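Finally, the same posterior calculation applied to the scenarios discussed above, to show how quickly a positive result loses meaning as the condition gets rarer or the test gets sloppier. The scenario labels are ours, and the test is again assumed never to miss a real case.

# How bad can it get? P(disease | positive) across the scenarios in the text.
def p_disease_given_positive(prevalence, false_positive_rate):
    positives = prevalence + (1 - prevalence) * false_positive_rate
    return prevalence / positives

scenarios = [
    ("rare condition, 2% false positives",   0.001,   0.02),
    ("low-risk woman, 0.1% false positives", 0.00001, 0.001),
    ("low-risk woman, 5% false positives",   0.00001, 0.05),
]
for label, prevalence, fpr in scenarios:
    p = p_disease_given_positive(prevalence, fpr)
    print(f"{label}: P(disease | positive) = 1 in {round(1 / p):,}")
# rare condition, 2% false positives: P(disease | positive) = 1 in 21
# low-risk woman, 0.1% false positives: P(disease | positive) = 1 in 101
# low-risk woman, 5% false positives: P(disease | positive) = 1 in 5,001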