What should an active causal learner value?

Neil R. Bramley¹, Jonathan D. Nelson², Maarten Speekenbrink¹, Vincenzo Crupi³ & David A. Lagnado¹
nrbramley@gmail.com, jonathan.d.nelson@gmail.com, m.speekenbrink@ucl.ac.uk, vincenzo.crupi@unito.it, d.lagnado@ucl.ac.uk
¹University College London  ²Max Planck Institute for Human Development  ³University of Turin

Beyond Shannon: Generalized entropy landscapes

Entropy is expected surprise across possible outcomes. Shannon entropy uses the normal mean (i.e. Σ_x Surprise(x) × p(x)) as expectation and log(1/p(x)) for surprise. Different types of mean, and different types of surprise, can be used (Crupi and Tentori, 2014). Sharma-Mittal entropy (Nielsen and Nock, 2011) provides a unifying framework:

• Sharma-Mittal entropy: H_{α,β}(p) = (1/(1−β)) [ (Σ_i p(x_i)^α)^{(1−β)/(1−α)} − 1 ], where α > 0, α ≠ 1, and β ≠ 1
• Rényi (1961) entropy if degree (β) = 1
• Tsallis (1988) entropy if degree (β) = order (α)
• Error entropy (probability gain; Nelson et al., 2010) if degree (β) = 2, order (α) = ∞

Simulating many strategies in many environments

• Learners based on 5 × 5 entropies in Sharma-Mittal space, α, β ∈ {1/10, 1/2, 1, 2, 10}, plus our 5 weird and wonderful entropy functions.
• Test cases: all 25 of the 3-variable causal graphs, through 8 sequentially chosen interventions.
• Repeated simulations 10 times each in 3 × 3 environments:
  1. Spontaneous activation rates of [.2, .4, .6]
  2. Causal powers of [.4, .6, .8]

[Figure: entropy landscapes on the probability simplex for a = 1/10, b = 1/10 (Tsallis low), a = 1/10, b = 1 (Rényi low), a = 10, b = 10 (Tsallis high), a = 10, b = 1 (Rényi high), a = 1, b = 1 (Shannon), and a = ∞, b = 2 (probability gain); and probability-correct curves by number of interventions performed.]

• Shannon entropy achieves higher accuracy than probability gain
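As a concrete sketch of the Sharma-Mittal family, here is a minimal implementation (my own, not the simulation code from the poster), assuming natural-log units and handling the excluded parameter values by their analytic limits:

```python
import numpy as np

def sharma_mittal(p, alpha, beta, eps=1e-12):
    """Sharma-Mittal entropy H_{alpha,beta}(p) of a probability vector.

    H = ((sum_i p_i**alpha) ** ((1 - beta) / (1 - alpha)) - 1) / (1 - beta)

    Excluded parameter values are handled by their analytic limits:
      alpha -> 1 and beta -> 1 : Shannon entropy
      beta  -> 1               : Renyi entropy of order alpha
      beta == alpha            : Tsallis entropy (the general formula covers it)
    Uses natural logs; divide by log(2) for bits.
    """
    p = np.asarray(p, dtype=float)
    p = p[p > eps]                                   # drop zero-probability outcomes
    if abs(alpha - 1) < eps and abs(beta - 1) < eps:
        return float(-np.sum(p * np.log(p)))         # Shannon
    if abs(beta - 1) < eps:
        return float(np.log(np.sum(p ** alpha)) / (1 - alpha))   # Renyi
    if abs(alpha - 1) < eps:
        # alpha -> 1 limit of the general formula
        return float((np.exp((beta - 1) * np.sum(p * np.log(p))) - 1) / (1 - beta))
    s = np.sum(p ** alpha)
    return float((s ** ((1 - beta) / (1 - alpha)) - 1) / (1 - beta))
```

For a fair coin this gives ln 2 ≈ 0.693 at the Shannon point (α = β = 1), and 1 − Σ_i p_i² = 0.5 at the Tsallis point α = β = 2.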
Results

• many entropy functions achieve similar accuracy as Shannon entropy

[Figure: the Sharma-Mittal space spanned by order (α) and degree (β), with the Rényi (β = 1) and Tsallis (β = α) families and the probability-gain point marked; atomic surprise and elementwise contributions to entropy as a function of the probability of an element; and worked examples (Example 1, Example 2) showing prior and posterior beliefs over three-variable graphs (A, B, C) under generalised entropy.]
What is active causal learning?

1. Reach into the system
2. Choose an intervention
3. Observe the outcome
4. Update beliefs
5. Repeat

• Intervening on a causal system provides important information about causal structure (Pearl, 2000; Spirtes et al., 1993; Steyvers, 2003; Bramley et al., 2014)
• Interventions render intervened-on variables independent of their normal causes: P(A | Do[B], C) ≠ P(A | B, C)
• Which interventions are useful depends on the prior and the hypothesis space
• Goal: Maximize probability of identifying the correct graph

• but the interventions chosen vary widely:

[Figure: intervention choice type proportions (N = 18000); generalised surprise values (Tsallis 0.1, Tsallis 0.5, Tsallis 1 = Shannon surprise, Tsallis 2 = quadratic surprise, Tsallis 10); and probability-correct curves for the generalised entropy simulations, including a = 1, b = 1 (Shannon) and a = ∞, b = 2 (probability gain), and for the 'bad' entropy simulations based on the 'bad' atomic surprise functions.]

Spirtes, P., Glymour, C., and Scheines, R. (1993). Causation, Prediction, and Search. Springer Lecture Notes in Statistics. MIT Press, Cambridge, MA (2nd edition, 2000).
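The independence point above can be made concrete with a two-graph toy example. The noisy-OR style parameterisation, the parameter values, and all names below are illustrative assumptions, not the poster's simulation code:

```python
# Two rival hypotheses about variables A and B: "A->B" or "B->A".
# Assumed illustrative parameters: spontaneous activation rate b and
# causal power w (a noisy-OR parameterisation).
b, w = 0.1, 0.8

def p_b_given_do_a(graph, do_a):
    """p(B = 1 | Do[A = do_a]) under each hypothesis.

    Do[A] cuts A off from its normal causes, so under "B->A" the
    intervention on A carries no information about B: B only fires
    spontaneously. Under "A->B", forcing A on raises B's rate (noisy-OR).
    """
    if graph == "A->B" and do_a == 1:
        return 1 - (1 - b) * (1 - w)     # = b + (1 - b) * w
    return b                             # spontaneous activation only

prior = {"A->B": 0.5, "B->A": 0.5}

# Intervene Do[A = 1], observe B = 1, update beliefs by Bayes' rule:
likelihood = {g: p_b_given_do_a(g, do_a=1) for g in prior}
z = sum(prior[g] * likelihood[g] for g in prior)
posterior = {g: prior[g] * likelihood[g] / z for g in prior}
```

Observing B = 1 after Do[A = 1] shifts belief toward A → B (posterior ≈ .89 under these assumed parameters); a mere observation of A and B would not license the same asymmetric inference.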
[Figure: probability of the correct graph by number of interventions performed for the 'bad' entropy learners (central kink, early kink, late kink, three step, five step), random baselines (three random steps, five random steps), and the Tsallis low/high, Rényi low/high, Shannon, and probability gain learners.]

• when are particular entropy functions cognitively plausible?

References

Bramley, N. R., Lagnado, D. A., and Speekenbrink, M. (2014). Forgetful conservative scholars - how people learn causal structure through interventions. Journal of Experimental Psychology: Learning, Memory and Cognition, doi:10.1037/xlm0000061.
Crupi, V. and Tentori, K. (2014). State of the field: Measuring information and confirmation. Studies in History and Philosophy of Science Part A, 47:81–90.
Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3):261–273.
Nelson, J. D., McKenzie, C. R., Cottrell, G. W., and Sejnowski, T. J. (2010). Experience matters: Information acquisition optimizes probability gain. Psychological Science, 21(7):960–969.
Nielsen, F. and Nock, R. (2011). A closed-form expression for the Sharma-Mittal entropy of exponential families. arXiv preprint arXiv:1112.4221.
Pearl, J. (2000). Causality. Cambridge University Press, New York (2nd edition).
Rényi, A. (1961). On measures of entropy and information. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, pages 547–561.
Strategies for choosing interventions myopically

Probability gain: Choose the intervention i ∈ I that maximises the expected maximum probability φ in the posterior over causal graphs G ∈ G:
E[φ(G|i)] = Σ_{o∈O} φ(G|i, o) p(o|G, i), where φ(G|i, o) = max_{g∈G} p(g|i, o)

Information gain: Choose the intervention that minimises expected Shannon entropy:
E[H(G|i)] = Σ_{o∈O} H(G|i, o) p(o|G, i), where H(G|i, o) = −Σ_{g∈G} p(g|i, o) log₂ p(g|i, o)

Weird and wonderful (bad) entropy functions

To further broaden the space of entropy functions considered, we first created a variety of atomic surprise functions. From these we derived corresponding entropy functions, always defining entropy as Σ_x Surprise(x) × p(x). The resulting entropy functions vary in which entropy axioms (Csiszár, 2008) they satisfy.

[Figure: the 'bad' atomic surprise functions (central kink, early kink, late kink, three step, five step) and their elementwise contributions to entropy, alongside the surprise and elementwise entropy contributions for Shannon and probability gain.]

Can information gain beat probability gain at probability gain? Human data

Bramley et al. (2014): test performance learning all 25 of the 3-variable graphs in an environment with spontaneous activation rate p = .1 and causal power p = .8:

[Figure: performance by simulated strategy on the human data (probability gain, information gain, random interventions): mean probability correct and mean Shannon information gained, broken down by intervention type (observation, one-on, one-on one-off, two-on, one-off, two-off).]

Questions:

• which properties of entropy functions are important for particular situations?
• in which instances do different strategies strongly contradict each other?

Steyvers, M. (2003). Inferring causal networks from observations and interventions. Cognitive Science, 27(3):453–489.
Tsallis, C. (1988). Possible generalization of Boltzmann-Gibbs statistics. Journal of Statistical Physics, 52(1-2):479–487.
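The two myopic selection rules, probability gain and information gain, can be sketched in one scoring loop. Function names and the toy likelihood matrices below are illustrative assumptions, not the poster's simulation code:

```python
import numpy as np

def expected_scores(prior, lik_matrix):
    """Score one candidate intervention under both myopic rules.

    prior[g]         = p(g) over candidate graphs
    lik_matrix[o, g] = p(outcome o | graph g, this intervention)

    Returns (E[max-posterior probability], E[Shannon entropy]): a
    probability-gain learner picks the intervention maximising the first
    value; an information-gain learner minimises the second.
    """
    p_outcomes = lik_matrix @ prior          # p(o | i), marginalised over graphs
    exp_phi, exp_h = 0.0, 0.0
    for o, p_o in enumerate(p_outcomes):
        if p_o == 0:
            continue                         # impossible outcome contributes nothing
        post = prior * lik_matrix[o]
        post = post / post.sum()             # Bayes: p(g | i, o)
        exp_phi += p_o * post.max()
        nz = post[post > 0]
        exp_h += p_o * -(nz * np.log2(nz)).sum()
    return exp_phi, exp_h

# Toy example with hypothetical numbers: 3 candidate graphs, uniform prior,
# a fully diagnostic intervention vs an uninformative one.
prior = np.full(3, 1 / 3)
phi_a, h_a = expected_scores(prior, np.eye(3))              # each outcome pins down a graph
phi_b, h_b = expected_scores(prior, np.full((2, 3), 0.5))   # outcomes independent of graph
```

Both rules agree on this toy case (the diagnostic intervention wins on both scores); the poster's point is that over many environments they, and the wider Sharma-Mittal learners, can come apart.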