Bayesian Networks and Missing-Data Imputation∗
Ran Spiegler†
January 26, 2015
Abstract
A decision maker (DM) tries to learn an objective joint probability distribution over a collection of variables. He gathers many independent observations with (randomly, independently generated) missing values. The DM wishes to extend his incomplete observations into a fully specified, "rectangular" dataset. He employs an iterative imputation procedure, whose individual rounds are akin to the "stochastic regression" method in the literature on statistical inference with missing data. The frequencies in the completed dataset constitute the DM's subjective belief. When the support of the missing-values process satisfies the "running intersection property", the procedure produces a belief with a Bayesian-network representation, which factorizes the objective distribution according to a perfect directed acyclic graph. The result provides a limited-feedback foundation for various models of non-rational expectations. It is also related to the so-called MaxEnt problem.
∗ This paper has benefitted from ESRC grant no. ES/L003031/1. I am grateful to Noga Alon, Yair Antler, Simon Byrne, Philip Dawid, Kfir Eliaz, Ehud Lehrer and workshop participants at Tel Aviv University for helpful conversations and comments.
† Tel Aviv University and University College London. URL: http://www.tau.ac.il/~rani. E-mail: rani@post.tau.ac.il.
1 Introduction
Imagine a fresh business graduate who has just landed a job as a junior analyst in a consulting firm. The analyst is ordered to write a report about a
duopolistic industry. Toward this end, he gathers periodic data about three relevant variables: the product price (denoted $x_1$) and the quantities chosen by the two producers (denoted $x_2$ and $x_3$). The price is public information and its periodic realizations are always available. In contrast, quantities are not publicly disclosed and the analyst must rely on private investigation to obtain data about them. Lacking any assistance, he must do the investigation himself, and he is unable to monitor both quantities at the same time. Specifically, at any given period $t$, the analyst observes either $x_2$ or $x_3$.
After collecting the data, the analyst records his observations in a spreadsheet:
 1 2 3
1
2
3
4
5
6
..
.
X
X
X
X
X
X
..
.
X
X
−
−
X
−
..
.
−
−
X
X
−
X
..
.
where a "X" ("−") sign in a cell indicates that the value is observed (missing). Let us use  to denote the value of  in observation , whenever this
value is not missing. Because of the missing values, the analyst’s spreadsheet is "non-rectangular". He seeks a way to "rectangularize" it - i.e., fill
the missing cells in a systematic way, in order to have a complete, "rectangular" dataset amenable to rudimentary statistical analysis (plot diagrams,
table statistics, regressions) to be included in his report. Ultimately, the
2
frequencies of (1  2  3 ) in the rectangularized spreadsheet will serve as the
analyst’s practical estimate of the joint distribution over prices and quantities
in the industry.
The situation described in this example is quite common. In many real-life situations, we try to learn the statistical regularities in our environment
through exposure to datasets with missing values. Of course, there are professional statistical techniques for imputing missing values (see Little and
Rubin (2002)). And one may go further and question the normative justification for any attempt to extend incomplete datasets to a fully specified
probability distribution (Gilboa (2014)).
However, my approach in this paper is not normative, but descriptive.
How would an ordinary decision maker, like our analyst, who is not a professional statistician but nevertheless wants to perform some kind of data
analysis, approach such a situation? Which intuitive methods would he apply to his limited dataset in order to make it an object of rudimentary data
analysis? Specifically, how would he impute missing values? To the extent
that the output of his method is a probabilistic belief that exhibits a systematic bias relative to the true underlying distribution, the method can be
viewed as an explanation for the bias. In this sense, what I am interested in
can be referred to as "behavioral imputation".
Returning to the example, the following is a plausible principle our analyst might employ to fill the missing cells in his spreadsheet. When the value of $x_i$ ($i = 2, 3$) is missing in some observation $t$, he can use the observed data about the joint distribution of $x_i$ and $x_1$, coupled with the known realization $x_1^t$, to impute a value for $x_i^t$. In particular, he may draw the value of $x_i^t$ from the empirical distribution over $x_i$ conditional on $x_1^t$. In fact, this procedure is a non-parametric version of the so-called "stochastic regression" imputation technique (Little and Rubin (2002, Ch. 4)).
What is the distribution over $x = (x_1, x_2, x_3)$ in the rectangularized spreadsheet induced by this imputation procedure? Suppose that in every period, the profile $x$ is independently drawn from some objective joint distribution $p$. Assume further that the spreadsheet is long: the analyst has managed to gather infinitely many observations of $x_1, x_2$ and of $x_1, x_3$. Thus, through repeated observation, he has effectively learned the marginal distributions $p(x_1, x_2)$ and $p(x_1, x_3)$. Therefore, in the part of the spreadsheet where the values of $x_3$ were originally missing, the frequency of $x$ is $f_1(x) = p(x_1, x_2)p(x_3 \mid x_1)$. And in the part of the spreadsheet where the values of $x_2$ were originally missing, the frequency of $x$ is $f_2(x) = p(x_1, x_3)p(x_2 \mid x_1)$.

The overall frequency of $x$ in the final spreadsheet is thus a weighted average of $f_1$ and $f_2$, where the weights match the relative size of the two parts in the original spreadsheet. However, observe that by the basic rules of conditional probability,

$$p(x_1, x_2)p(x_3 \mid x_1) = p(x_1, x_3)p(x_2 \mid x_1) = p(x_1)p(x_2 \mid x_1)p(x_3 \mid x_1) \quad (1)$$
Thus, both parts in the analyst's rectangularized spreadsheet exhibit the same frequencies of any given $x$. As the R.H.S of (1) makes explicit, these frequencies regard the firms' quantities as statistically independent conditional on the price. This conditional independence property is a direct consequence of the analyst's method of imputation; it need not be satisfied by the objective distribution $p$ itself. Thus, the analyst's constructed statistical description distorts the true underlying distribution.
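To see the arithmetic of this paragraph at work, the following is a minimal Monte Carlo sketch in Python. The joint distribution, the 50/50 missing-values split and all names are illustrative assumptions of mine, not taken from the paper. Each missing value is imputed by a draw from the empirical conditional given $x_1$, and the completed spreadsheet's frequencies are compared with the right-hand side of (1).

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary illustrative joint distribution p over binary (x1, x2, x3),
# chosen so that x2 and x3 are correlated even conditional on x1.
p = np.array([[[0.10, 0.02], [0.03, 0.10]],
              [[0.05, 0.15], [0.15, 0.40]]])
p /= p.sum()

T = 200_000
draws = rng.choice(p.size, size=T, p=p.ravel())
x = np.stack(np.unravel_index(draws, p.shape), axis=1)   # rows are (x1, x2, x3)

# Missing-values process: in each period either x3 or x2 is hidden.
hide3 = rng.random(T) < 0.5            # True -> x3 missing, x2 observed

def cond_given_x1(col, rows):
    """Empirical distribution of column `col` conditional on x1, from `rows`."""
    table = np.zeros((2, 2))
    np.add.at(table, (x[rows, 0], x[rows, col]), 1.0)
    return table / table.sum(axis=1, keepdims=True)

p2_1 = cond_given_x1(1, hide3)         # x2 is observed exactly when x3 is hidden
p3_1 = cond_given_x1(2, ~hide3)

# Stochastic-regression-style imputation: draw each missing value from the
# empirical conditional given the observed realization of x1.
u = rng.random(T)
x_imp = x.copy()
x_imp[hide3, 2] = (u[hide3] < p3_1[x[hide3, 0], 1]).astype(int)
x_imp[~hide3, 1] = (u[~hide3] < p2_1[x[~hide3, 0], 1]).astype(int)

freq = np.zeros((2, 2, 2))
np.add.at(freq, tuple(x_imp.T), 1.0)
freq /= T

# The completed spreadsheet's frequencies approximate p(x1) p(x2|x1) p(x3|x1).
p1 = p.sum(axis=(1, 2))
p2_given_1_true = p.sum(axis=2) / p1[:, None]
p3_given_1_true = p.sum(axis=1) / p1[:, None]
target = p1[:, None, None] * p2_given_1_true[:, :, None] * p3_given_1_true[:, None, :]
print(np.round(freq, 3))
print(np.round(target, 3))
```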
Expression (1) is an instance of a "Bayesian-network representation". In general, suppose that $p$ is a joint probability distribution over a collection of random variables $x = (x_1, \dots, x_n)$, and let $R$ be an asymmetric, acyclic (but not necessarily transitive) binary relation over the set of variable labels $N = \{1, \dots, n\}$. Define, for every $x$,

$$p_R(x_1, \dots, x_n) = \prod_{i=1}^{n} p(x_i \mid x_{R(i)}) \quad (2)$$

where $R(i) = \{j \mid jRi\}$. Thus, the expression $p_R$ "factorizes $p$ according to $R$". It is convenient to represent $R$ as a directed acyclic graph (DAG) over the set of nodes $N$, such that $jRi$ corresponds to the directed edge $j \to i$, and $R(i)$ is the set of "direct parents" of the node $i$. I will therefore refer to (2) as a "Bayesian-network representation" or as a "DAG representation" interchangeably. The R.H.S of (1) is an instance of a DAG representation, where the DAG is $2 \leftarrow 1 \rightarrow 3$.
Bayesian networks - and more generally, probabilistic graphical models - have many uses in statistics and Artificial Intelligence: they help analyze systematic implications of conditional-independence assumptions, devise efficient algorithms for computing Bayesian updating, and systematize reasoning about causal relations that underlie statistical regularities. For
prominent textbooks, see Cowell et al. (1999) and Pearl (2000).
Closer to home, Bayesian-network representations turn out to be relevant
for economic models of non-rational expectations. In recent years, economic
theorists have become increasingly interested in decision making under systematically biased beliefs. In particular, game theorists developed solution
concepts that extend the steady-state interpretation of conventional Nash
equilibrium, while relaxing its rational-expectations property in favor of the
assumption that players’ subjective beliefs systematically distort the objective steady-state distribution (e.g., Osborne and Rubinstein (1998), Eyster
and Rabin (2005), Jehiel (2005), Jehiel and Koessler (2008), Esponda (2008),
Esponda and Pouzo (2014a)).1

1 Piccione and Rubinstein (2003) and Eyster and Piccione (2013) developed ideas of a similar nature in a competitive-equilibrium framework. Macroeconomic theorists examined models in which agents learn misspecified models and thus come to hold biased beliefs (Sargent (2001), Evans and Honkapohja (2001), Woodford (2013)).
In a companion paper (Spiegler (2014)), I demonstrate that many of these
models of subjective belief distortion (as well as other belief biases that had
not been formalized before) can be expressed in terms of a Bayesian-network
representation. The individual agent is characterized by a "subjective DAG" $R$, and his subjective belief factorizes the objective equilibrium distribution
according to . The "true model" is defined by the set of objective distributions that are consistent with the "true DAG" ∗ . We capture the agent’s
systematic bias by performing a basic operation on ∗ (removing, inverting
or reorienting links) to obtain . Thus, the Bayesian-network representation
of subjective beliefs can be viewed as a way to organize part of the literature
on equilibrium models with non-rational expectations.
An argument that is often invoked to justify equilibrium concepts with non-rational expectations is that agents form their beliefs by naively
extrapolating from limited learning feedback. Sometimes, as in Osborne and
Rubinstein (1998), this idea is built formally into the solution concept. More
often, though, it is informal. The observation that in our example, the subjective belief that emerges from the analyst’s imputation procedure has a
Bayesian-network representation is thus of interest, because it provides a
"limited feedback foundation" for a useful representation of non-rational expectations. The main question in this paper is the following: If we extend the
analyst’s imputation procedure to general datasets with (randomly generated)
missing values, under what conditions will the subjective belief that emerges
from the procedure have a Bayesian-network representation?
To address this question, I construct a model in which a decision maker (DM) has access to a dataset consisting of infinitely many observations, independently drawn from an objective distribution $p$ over $x = (x_1, \dots, x_n)$. For each observation, the set of variables whose values are observed is some $A \subset \{1, \dots, n\}$, generated by an independent random process with support $S$. In the "analyst" example, $S = \{\{1,2\}, \{1,3\}\}$. I assume that $S$ is a cover of the set of variables $\{1, \dots, n\}$, and (for convenience) that no member of $S$ contains another. Because there are infinitely many observations, the DM effectively learns the marginal of $p$ over $x_A$, denoted $p_A$, for every $A \in S$.

The DM's objective is to extend his dataset to a fully specified probability distribution over $(x_1, \dots, x_n)$. He performs an iterative version of the imputation method described in the example. His starting point is some $A^0 \in S$. In the first round of the procedure, he searches for another $A^1 \in S$ having the largest intersection with $A^0$ (with arbitrary tie-breaking). The intuition for this "maximal overlap" criterion is that the DM tries to make the most of observed correlations among variables. He then performs the same imputation method as in the example, replacing the known marginal distributions $p(x_{A^0})$ and $p(x_{A^1})$ with a single marginal distribution $q^1 \in \Delta(X_{M^1})$, where $M^1 = A^0 \cup A^1$.
Thus, at the end of the procedure's first round, the DM has produced a new dataset, which rectangularizes the part of the original dataset in which the observed sets of variables were $A^0$ and $A^1$. In the next round, the DM applies the same imputation method, taking $M^1$ as his new starting point, and so forth. The procedure terminates after $K - 1$ rounds, producing a complete, rectangular dataset. The frequencies of $(x_1, \dots, x_n)$ in this dataset serve as the DM's final subjective belief.
The main result in the paper establishes a necessary and sufficient condition on $S$ for the belief produced by the iterative imputation procedure to have a Bayesian-network representation for all objective distributions $p$. The condition is known in the literature as the "running intersection property" (e.g., Cowell et al. (1999, p. 55)), and I demonstrate that it is satisfied by a variety of natural missing-values processes. When the condition holds, the DAG $R$ in the Bayesian-network representation is perfect: for every $i, j \in R(k)$, either $iRj$ or $jRi$. Conversely, every Bayesian-network representation that involves a perfect DAG $R$ can be justified as the output of the iterative imputation procedure, applied to any missing-values process whose support is the set of maximal cliques in $R$.
The results in this paper thus provide a missing-data foundation for
Bayesian-network representations that involve perfect DAGs. Section 3 provides a few economic illustrations of how this foundation works. Section 4
shows how this foundation is related to the so-called MaxEnt problem. I refer the reader to Spiegler (2014) for many other economic examples in which the DM's subjective belief has a Bayesian-network representation that involves a perfect DAG. The results in this paper thus tell a story about the possible
origins of such beliefs.
2 The Model
Let $X = X_1 \times \cdots \times X_n$ be a finite set of states, where $n \geq 2$. I refer to $x_i$ as a variable. Let $N = \{1, \dots, n\}$ be the set of variable labels. Let $p \in \Delta(X)$ be an objective probability distribution over $X$. For every $A \subseteq N$, denote $x_A = (x_i)_{i \in A}$ and $X_A = \times_{i \in A} X_i$, and let $p_A$ denote the marginal of $p$ over $X_A$. Consider a decision maker (DM henceforth) who obtains an infinite sequence of independent draws from $p$. However, for each observation, he only gets to see the realized values of a subset of variables $A \subset N$, where $A$ is independently drawn from a probability distribution $\sigma \in \Delta(2^N)$, referred to as the missing-values process. The pair $(p, \sigma)$ constitutes the DM's dataset. The support of $\sigma$ is denoted $S$. Let $|S| = K \geq 2$, and assume that $S$ is a cover of $N$. In addition, assume (purely for convenience) that there exist no $A, A' \in S$ such that $A \subset A'$. Thanks to the DM's infinite sample, he gets to learn $p_A$ for every $A \in S$.

We could define a dataset more explicitly, as an infinite sequence $(x^t, A^t)$, where $x^t$ is a random draw from $p$; $A^t$ is an independent random draw from $\sigma$; and the value of $x_i^t$ is missing if and only if $i \notin A^t$. However, for our purposes, identifying the dataset with $(p, \sigma)$ is w.l.o.g and more convenient to work with. The more elaborate definition would be appropriate for extensions of the model, e.g. when the DM's sample is finite.
2.1 An Iterative Imputation Procedure
The DM's task is to extend his dataset into a fully specified subjective probability distribution over $X$. He performs this task using an "iterative imputation procedure", which consists of $K - 1$ rounds. The initial condition of round $t = 1, \dots, K - 1$ is a pair $(M^{t-1}, q^{t-1})$, where $M^{t-1} \subseteq N$ and $q^{t-1} \in \Delta(X_{M^{t-1}})$. In particular, $M^0 = A^0$ is an arbitrary member of $S$, and $q^0 = p_{A^0}$. Round $t$ consists of two steps:

Step 1: Select $A^t \in \arg\max_{A \in S - \{A^0, \dots, A^{t-1}\}} |A \cap M^{t-1}|$. Let $M^t = M^{t-1} \cup A^t$.

Step 2: Define two auxiliary distributions over $X_{M^t}$:

$$q_1^t(x_{M^t}) \equiv q^{t-1}(x_{M^{t-1}}) \cdot p(x_{A^t - M^{t-1}} \mid x_{A^t \cap M^{t-1}})$$
$$q_2^t(x_{M^t}) \equiv p(x_{A^t}) \cdot q^{t-1}(x_{M^{t-1} - A^t} \mid x_{A^t \cap M^{t-1}})$$

and then, define $q^t \in \Delta(X_{M^t})$ as follows:

$$q^t \equiv \frac{\sum_{k=0}^{t-1} \sigma(A^k)}{\sum_{k=0}^{t} \sigma(A^k)}\, q_1^t + \frac{\sigma(A^t)}{\sum_{k=0}^{t} \sigma(A^k)}\, q_2^t$$

If $t = K - 1$, the procedure is terminated and $q^{K-1} \in \Delta(X)$ is the DM's final belief. If $t < K - 1$, switch to round $t + 1$.
This procedure describes a process by which the DM gradually completes his dataset, iterating the method for imputing missing values described in the Introduction. By the end of round $t - 1$, the DM has "rectangularized" the part of the original dataset that consisted of observations of $x_A$, $A \in \{A^0, A^1, \dots, A^{t-1}\}$. He has thus transformed the (infinite sets of) observations of the variable sets $A^0, A^1, \dots, A^{t-1}$ into an infinite set of "observations" that induce a fully specified joint distribution over $X_{M^{t-1}}$. In Step 1 of round $t$, the DM looks for a new set of variables $A^t \in S$ having maximal overlap with $M^{t-1}$. The rationale for this "maximal overlap" criterion is that the DM tries to extrapolate as little as possible and make the most of observed correlations.

In Step 2 of round $t$, the DM extends the "rectangularization" of the dataset from the variable set $M^{t-1}$ to the variable set $M^t = M^{t-1} \cup A^t$, using the same method as in the introductory example. It will be useful to describe it in terms of the "spreadsheet" image. First, the DM exploits the observed distribution $p_{A^t}$ over $X_{A^t}$ - when the value of $x_{A^t - M^{t-1}}$ is missing in some row of the spreadsheet, he imputes a random draw from $p(x_{A^t - M^{t-1}} \mid x_{A^t \cap M^{t-1}})$. The resulting frequency of $x_{M^t}$ in this part of the spreadsheet is $q_1^t(x_{M^t}) \equiv q^{t-1}(x_{M^{t-1}}) \cdot p(x_{A^t - M^{t-1}} \mid x_{A^t \cap M^{t-1}})$. Second, the DM exploits the constructed distribution $q^{t-1}$ over $X_{M^{t-1}}$ - when the value of $x_{M^{t-1} - A^t}$ is missing in some row of the spreadsheet, he imputes a random draw from $q^{t-1}(x_{M^{t-1} - A^t} \mid x_{A^t \cap M^{t-1}})$. The resulting frequency of $x_{M^t}$ in this part of the spreadsheet is $q_2^t(x_{M^t}) \equiv p(x_{A^t}) \cdot q^{t-1}(x_{M^{t-1} - A^t} \mid x_{A^t \cap M^{t-1}})$. The DM produces the distribution $q^t$ over $X_{M^t}$ as a weighted average of $q_1^t$ and $q_2^t$, according to the relative number of "observations" in each of the two parts of the dataset.

The DM terminates the procedure when he has "rectangularized" the entire dataset - namely, when all missing values have been imputed.
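The rounds just described can be spelled out as a small exact computation on probability tables. The sketch below is an illustrative implementation of mine, not code from the paper: distributions are stored as $n$-dimensional arrays (with a singleton axis for each variable outside their scope), variable labels are 0-based, and $p$ is assumed to have full support so that every conditional is well defined.

```python
import numpy as np

def marginal(q, keep, n):
    """Sum out every axis not in `keep`; distributions over a subset of the
    variables keep a singleton axis for each variable they do not cover."""
    drop = tuple(i for i in range(n) if i not in keep)
    return q.sum(axis=drop, keepdims=True) if drop else q

def iterative_imputation(p, S, sigma, A0=None):
    """Exact sketch of the iterative imputation procedure.

    p     : objective joint distribution, an n-dimensional array (full support)
    S     : list of frozensets of 0-based variable labels (the support of sigma)
    sigma : dict mapping each member of S to its probability
    A0    : optional initial set A^0 (defaults to the first element of S)
    """
    n = p.ndim
    remaining = list(S)
    A = A0 if A0 is not None else remaining[0]
    remaining.remove(A)
    M, q = set(A), marginal(p, A, n)            # (M^0, q^0)
    weight = sigma[A]                           # mass of the rectangularized part
    while remaining:
        # Step 1: pick a new set with maximal overlap with the covered variables.
        A = max(remaining, key=lambda B: len(B & M))
        remaining.remove(A)
        overlap = frozenset(A & M)
        pA = marginal(p, A, n)
        # Step 2: the two auxiliary distributions over X_{M^t}, averaged
        # according to the sizes of the two spreadsheet parts.
        q1 = q * pA / marginal(pA, overlap, n)  # impute the newly added variables
        q2 = pA * q / marginal(q, overlap, n)   # impute the previously covered variables
        q = (weight * q1 + sigma[A] * q2) / (weight + sigma[A])
        M, weight = M | A, weight + sigma[A]
    return q                                    # the final belief q^{K-1}
```

The division in `q1` uses the observed marginal over $x_{A^t \cap M^{t-1}}$, so `q1` is exactly $q^{t-1}(x_{M^{t-1}})\, p(x_{A^t - M^{t-1}} \mid x_{A^t \cap M^{t-1}})$; the division in `q2` instead uses the constructed belief's marginal.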
Example 2.1. Consider a dataset in which $S$ is a partition of $N$. Then, in any round $t$, $M^{t-1} \cap A = \emptyset$ for every $A \notin \{A^0, A^1, \dots, A^{t-1}\}$. Therefore, $A^t$ can be selected arbitrarily. The auxiliary distributions $q_1^t$ and $q_2^t$, calculated in Step 2, are both equal to $q^{t-1}(x_{M^{t-1}}) p(x_{A^t})$. It follows that in any round $t$, $q^t(x_{M^t}) = p(x_{A^0}) p(x_{A^1}) \cdots p(x_{A^t})$. In the final round $K - 1$, we obtain

$$q^{K-1}(x) = \prod_{A \in S} p(x_A)$$

We see that the belief that emerges from the DM's procedure is a product of marginal objective distributions.
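As a quick sanity check of Example 2.1, one can feed a partition into the `iterative_imputation` sketch above; the particular distribution, the variable sizes and the uniform $\sigma$ below are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(2)
p = rng.random((2, 3, 2)) + 0.05        # arbitrary full-support joint over (x1, x2, x3)
p /= p.sum()

S = [frozenset({0}), frozenset({1}), frozenset({2})]     # a partition of N
q = iterative_imputation(p, S, {A: 1 / 3 for A in S})

product_of_marginals = (p.sum(axis=(1, 2), keepdims=True)
                        * p.sum(axis=(0, 2), keepdims=True)
                        * p.sum(axis=(0, 1), keepdims=True))
print(np.allclose(q, product_of_marginals))              # True
```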
Example 2.2. Let $N = \{1, 2, 3, 4\}$ and assume that the support of $\sigma$ is $S = \{\{1,3\}, \{1,2\}, \{2,4\}\}$. This means that the DM has effectively learned the joint distributions $p(x_1, x_3)$, $p(x_1, x_2)$ and $p(x_2, x_4)$. For an economic story behind this specification, imagine that the DM learns the behavior of two players in a game with incomplete information, where $x_1$ and $x_2$ represent the players' signals, and $x_3$ and $x_4$ represent their actions. The DM's dataset enables him to learn the joint distribution of the players' signals, as well as the joint signal-action distribution for each player.

Let us implement the iterative imputation procedure. In round 1, select the initial condition to be $A^0 = \{1, 3\}$. By the maximal-overlap criterion of Step 1, the only legitimate continuation is $A^1 = \{1, 2\}$, such that $M^1 = \{1, 2, 3\}$. Imputing the missing values of $x_3$ and $x_2$ is done exactly as in the motivating example of the Introduction. By the end of the first round, the DM has "rectangularized" the part of his dataset in which he originally observed $x_1, x_2$ or $x_1, x_3$. He has replaced these observations with "manufactured observations" of the triple $x_1, x_2, x_3$, and the joint distribution over these variables in the manufactured dataset can be written as

$$q^1(x_1, x_2, x_3) = p(x_3) p(x_1 \mid x_3) p(x_2 \mid x_1)$$

In the second and final round, $A^2 = \{2, 4\}$, such that $M^2 = N$, and

$$
\begin{aligned}
q_2^2(x) &= p(x_2, x_4) \cdot q^1(x_1, x_3 \mid x_2) = p(x_2, x_4)\, \frac{q^1(x_1, x_2, x_3)}{q^1(x_2)} \\
&= p(x_2, x_4)\, \frac{p(x_3) p(x_1 \mid x_3) p(x_2 \mid x_1)}{\sum_{x_1'} \sum_{x_3'} p(x_3') p(x_1' \mid x_3') p(x_2 \mid x_1')} \\
&= p(x_2, x_4)\, \frac{p(x_2) p(x_1 \mid x_2) p(x_3 \mid x_1)}{p(x_2) \sum_{x_1'} p(x_1' \mid x_2) \sum_{x_3'} p(x_3' \mid x_1')} \\
&= p(x_2, x_4)\, p(x_1 \mid x_2) p(x_3 \mid x_1) \\
&= p(x_3) p(x_1 \mid x_3) p(x_2 \mid x_1) p(x_4 \mid x_2) \\
&= q^1(x_1, x_2, x_3) \cdot p(x_4 \mid x_2) = q_1^2(x)
\end{aligned}
$$

Thus, $q_1^2 = q_2^2 = q^2$, which can be written as

$$q^2(x) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1) p(x_4 \mid x_2) \quad (3)$$

In the context of the economic story described earlier, this expression for the DM's subjective belief has a simple interpretation: he ends up believing that each player's mixture over his actions is measurable w.r.t. his signal - i.e., the player's action is independent of his opponent's signal and action, conditional on the player's own signal.
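Equation (3) can likewise be verified numerically with the `iterative_imputation` sketch above. The joint distribution below is an arbitrary full-support example of mine; with 0-based labels, the sets $\{1,3\}, \{1,2\}, \{2,4\}$ become $\{0,2\}, \{0,1\}, \{1,3\}$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = rng.random((2, 2, 2, 2)) + 0.05
p /= p.sum()

S = [frozenset({0, 2}), frozenset({0, 1}), frozenset({1, 3})]
q = iterative_imputation(p, S, {A: 1 / 3 for A in S}, A0=frozenset({0, 2}))

# Right-hand side of (3): p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2)
p1  = p.sum(axis=(1, 2, 3), keepdims=True)
p2  = p.sum(axis=(0, 2, 3), keepdims=True)
p12 = p.sum(axis=(2, 3), keepdims=True)
p13 = p.sum(axis=(1, 3), keepdims=True)
p24 = p.sum(axis=(0, 2), keepdims=True)
rhs = p1 * (p12 / p1) * (p13 / p1) * (p24 / p2)

print(np.allclose(q, rhs))      # True
```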
Comment: The role of the "maximal overlap" criterion

Suppose that in Example 2.2, the DM ignored the "maximal overlap" criterion and picked $A^1 = \{2, 4\}$ in the first round. The output of this round would be $M^1 = N$ and $q^1(x) = p(x_1, x_3) p(x_2, x_4)$. In the second round, we would have $q_1^2 = q^1$, and $q_2^2(x) = p(x_1, x_2)\, q^1(x_3, x_4 \mid x_1, x_2)$. As a result, the final output of the procedure would be the belief

$$
\begin{aligned}
q^2(x) &= (\sigma\{1,3\} + \sigma\{2,4\})\, q^1(x) + \sigma\{1,2\}\, q_2^2(x) \\
&= q^1(x_3, x_4 \mid x_1, x_2)\left[\sigma\{1,2\}\, p(x_1, x_2) + (\sigma\{1,3\} + \sigma\{2,4\})\, q^1(x_1, x_2)\right]
\end{aligned}
$$

Consider an objective distribution $p$ under which $x_3$ and $x_4$ are both independently distributed, whereas the variables $x_1$ and $x_2$ are mutually correlated. Then, $q^1(x) = p(x_1) p(x_2) p(x_3) p(x_4)$. Therefore,

$$q^2(x) = p(x_3) p(x_4)\left[\sigma\{1,2\}\, p(x_1, x_2) + (\sigma\{1,3\} + \sigma\{2,4\})\, p(x_1) p(x_2)\right] \quad (4)$$

This expression underestimates the objective correlation between $x_1$ and $x_2$. By comparison, under the same assumptions on $p$, expression (3) would reduce to

$$p(x_3) p(x_4) p(x_1, x_2) \quad (5)$$

which fully accounts for the objective correlation between $x_1$ and $x_2$. This example demonstrates that the maximal-overlap criterion matters for the evolution of the iterative imputation procedure.
2.2 Bayesian Networks
The final belief that emerges from the DM's procedure in each of the examples factorizes $p$ into a product of conditional probabilities. Both are instances of
a "Bayesian-network representation". I now provide a concise exposition of a
few basic concepts in the literature on Bayesian networks. The exposition is
standard - see Cowell et al. (1999) - occasionally using different terminology
that is more familiar to economists.
Let $R$ be an acyclic, asymmetric binary relation over $N$. For every $i \in N$, let $R(i) = \{j \in N \mid jRi\}$. The pair $(N, R)$ can be viewed as a directed acyclic graph (DAG), where $N$ is the set of nodes, and $jRi$ means that there is a link from $j$ to $i$. I will therefore refer to elements in $N$ as "variable labels" or "nodes" interchangeably. Slightly abusing terminology, I will refer to $R$ as a DAG and I will use the notations $iRj$ and $i \to j$ interchangeably. The skeleton of $R$, denoted $\tilde{R}$, is its non-directed version - that is, $i\tilde{R}j$ if $iRj$ or $jRi$. A subset of nodes $C \subseteq N$ is a clique in $R$ if $i\tilde{R}j$ for every $i, j \in C$. A clique is maximal if it is not a strict subset of another clique. A clique $C$ is ancestral if $R(i) \subset C$ for every $i \in C$.
Fix a DAG $R$. For every objective distribution $p \in \Delta(X)$, define

$$p_R(x) \equiv \prod_{i \in N} p(x_i \mid x_{R(i)}) \quad (6)$$

The distribution $p_R$ is said to factorize $p$ according to $R$. In Example 2.1, the final belief factorizes $p$ according to a DAG that consists of disjoint cliques. In Example 2.2, the final belief factorizes $p$ according to the DAG $3 \leftarrow 1 \rightarrow 2 \rightarrow 4$. The DAG $R$ and the set of distributions that can be factorized by $R$ constitute a Bayesian network.
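Formula (6) is mechanical enough to state as code. The sketch below is an illustrative implementation of mine (0-based labels, a full-support $p$ stored as an $n$-dimensional array, and the DAG given as a parent map), not code from the paper.

```python
import numpy as np

def factorize(p, parents):
    """Return p_R, i.e. the product over i of p(x_i | x_{R(i)}), where
    `parents[i]` is the set R(i) of direct parents of node i."""
    n = p.ndim
    pR = np.ones_like(p)
    for i in range(n):
        keep = set(parents[i]) | {i}
        drop = tuple(j for j in range(n) if j not in keep)
        joint = p.sum(axis=drop, keepdims=True)               # p(x_i, x_{R(i)})
        parent_marg = p.sum(axis=drop + (i,), keepdims=True)  # p(x_{R(i)})
        pR = pR * joint / parent_marg
    return pR

# The DAG 2 <- 1 -> 3 from the Introduction (nodes 0, 1, 2 stand for x1, x2, x3):
# factorize(p, {0: set(), 1: {0}, 2: {0}}) returns p(x1) p(x2|x1) p(x3|x1),
# the right-hand side of (1).
```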
Different DAGs can be equivalent in terms of the distributions they factorize. For example, the DAGs $1 \rightarrow 2$ and $2 \rightarrow 1$ are equivalent, since $p(x_1)p(x_2 \mid x_1) \equiv p(x_2)p(x_1 \mid x_2)$.
Definition 1 (Equivalent DAGs) Two DAGs $R$ and $R'$ are equivalent if $p_R(x) \equiv p_{R'}(x)$ for every $p \in \Delta(X)$.
For instance, all linear orderings (i.e., transitive and anti-symmetric binary relations) are equivalent: in this case, the factorization formula (6)
reduces to the textbook chain rule, where the enumeration of the variables
is known to be irrelevant.
Verma and Pearl (1991) provided a complete characterization of equivalent DAGs, which will be useful in the sequel. Define the v-structure of a DAG $R$ to be the set of all triples of nodes $i, j, k$ such that $iRk$, $jRk$, $\neg(iRj)$ and $\neg(jRi)$.
Proposition 1 (Verma and Pearl (1991)) Two DAGs $R$ and $R'$ are equivalent if and only if they have the same skeleton and the same v-structure.
To illustrate this result, the DAGs $1 \rightarrow 3 \leftarrow 2$ and $1 \rightarrow 3 \rightarrow 2$ have identical skeletons but different v-structures. Therefore, these DAGs are not equivalent: there exist distributions that can be factorized by one DAG but not by the other. In contrast, the DAGs $1 \rightarrow 3 \rightarrow 2$ and $1 \leftarrow 3 \leftarrow 2$ are equivalent because they have the same skeleton and the same (vacuous) v-structure.
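Proposition 1 reduces equivalence to a purely mechanical comparison. Here is a small illustrative sketch of mine (directed links as ordered pairs; the representation and names are assumptions, not the paper's):

```python
from itertools import combinations

def skeleton(edges):
    return {frozenset(e) for e in edges}

def v_structures(edges, nodes):
    """Triples i -> k <- j with no link between i and j."""
    parents = {k: {i for (i, kk) in edges if kk == k} for k in nodes}
    skel = skeleton(edges)
    return {(frozenset({i, j}), k)
            for k in nodes
            for i, j in combinations(parents[k], 2)
            if frozenset({i, j}) not in skel}

def equivalent(edges1, edges2, nodes):
    return (skeleton(edges1) == skeleton(edges2)
            and v_structures(edges1, nodes) == v_structures(edges2, nodes))

nodes = {1, 2, 3}
print(equivalent({(1, 3), (3, 2)}, {(2, 3), (3, 1)}, nodes))   # 1->3->2 vs 1<-3<-2 : True
print(equivalent({(1, 3), (3, 2)}, {(1, 3), (2, 3)}, nodes))   # 1->3->2 vs 1->3<-2 : False
```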
A certain class of DAGs will play a special role in the sequel.
Definition 2 (Perfect DAGs) A DAG $R$ is perfect if $R(i)$ is a clique for every $i \in N$.
Thus, a DAG is perfect if and only if it has a vacuous v-structure: for every triple of variables $i, j, k$ for which $iRk$ and $jRk$, it is the case that $i\tilde{R}j$. For instance, the DAG $3 \leftarrow 1 \rightarrow 2 \rightarrow 4$ that factorizes the final belief in Example 2.2 is perfect. Proposition 1 thus implies the following result.
Remark 1 Two perfect DAGs are equivalent if and only if they have the
same set of maximal cliques. In particular, we can set any one of these
cliques to be ancestral w.l.o.g.
In what follows, I refer to $p_R$ as a DAG representation. When $R$ is perfect, I refer to $p_R$ as a perfect-DAG representation.
The DAG representation $p_R$ generally distorts the objective distribution $p$. However, certain marginal distributions are not distorted. The following proposition, which will be useful in the next section, characterizes these cases.2

2 For other properties of Bayesian-network representations, see Dawid and Lauritzen (1993).

Proposition 2 Let $R$ be a DAG and let $C \subseteq N$. Then, $p_R(x_C) \equiv p(x_C)$ for every $p$ if and only if $C$ is an ancestral clique in some DAG in the equivalence class of $R$.
Thus, the marginal distribution over $x_C$ induced by $p_R$ never distorts the true marginal if $C$ is an ancestral clique in $R$, or in some DAG that is equivalent to $R$. The intuition for this result can be conveyed through the causal interpretations of DAGs, and by considering a set $C$ that consists of a single node $i$. When $i$ is an ancestral node, it is a "primary cause". The belief distortions that arise from a misspecified DAG concern variables that are either independent of $x_i$ or (possibly indirect) effects of $x_i$. Therefore, these distortions are irrelevant for calculating the marginal distribution of $x_i$. In contrast, suppose that $i$ cannot be represented as an ancestral node in any DAG that is equivalent to $R$. Then, it can be shown that there must be two other variables $x_j, x_k$ that function as (possibly indirect) causes of $x_i$, and the DAG deems $x_j$ independent of $x_k$ or $x_i$. This failure to account for the full dependencies among $x_i$ and its multiple causes can lead to distorting the marginal distribution over $x_i$.
3 The Main Results
In this section, I show that a perfect-DAG representation can always be
justified as the final output of the iterative imputation procedure. In contrast,
an imperfect-DAG representation lacks this foundation. The analysis will rely
on the following twin definitions.
Definition 3 A sequence of sets $A_1, \dots, A_m$ satisfies the running intersection property (RIP) if for every $k = 2, \dots, m$, $A_k \cap (\cup_{j<k} A_j) \subseteq A_i$ for some $i < k$. The collection $S$ satisfies RIP* if its elements can be ordered in a sequence that satisfies RIP.
Thus, a sequence of sets satisfies RIP if the intersection between any set along the sequence and the union of its predecessors is weakly contained in one of these predecessors. We will be interested in whether $S$ (the support of the missing-values process $\sigma$) satisfies RIP*. The property holds trivially when the sequence consists of two sets. The supports of $\sigma$ in Examples 2.1 and 2.2 satisfy RIP* - vacuously in the former case, and by the sequence $\{1,3\}, \{1,2\}, \{2,4\}$ in the latter. In contrast, the collection $S = \{\{1,3\}, \{1,2\}, \{2,3\}\}$ violates RIP*.
In what kind of missing-values processes should we expect RIP∗ to hold?
The following are two natural cases.
Example 3.1. Consider situations (like the motivating example of the Introduction) in which there is a set of "public variables" that are always observed, as well as "private variables" whose observability is mutually exclusive. Formally, there is a set $B \subset N$ such that $S = \{B \cup \{i\}\}_{i \in N - B}$. Any ordering of $S$ satisfies RIP.
Example 3.2. Suppose that $S$ consists of all "intervals" of length $m$ - that is,

$$S = \{\{1, \dots, m\}, \{2, \dots, m+1\}, \dots, \{n - m + 1, \dots, n\}\}$$

This is a natural assumption when different $x_i$'s indicate values of a single variable at different time periods $i$, and observations of the variable consist of lagged realizations with bounded memory $m - 1$. The above ordering of the sets in $S$ satisfies RIP, because the intersection of each set with the union of all its predecessors is weakly contained in its immediate predecessor.
RIP is a familiar property in the Bayesian-network literature (see Cowell
et al. (1999, p. 54)), because of its relation to the notion of perfect DAGs.
Remark 2 (Cowell et al. (1999, p. 54)) The set of maximal cliques in
any perfect DAG satisfies RIP*.
We are now ready to state the main results of this paper.
Proposition 3 Suppose that S satisfies RIP*. Then, the iterative imputation procedure produces a final belief with a perfect-DAG representation,
where the set of maximal cliques in the DAG is S.
Thus, when S satisfies RIP*, $q^{K-1}$ factorizes any objective distribution $p$ according to a perfect DAG whose set of maximal cliques is $S$. Since all perfect DAGs with the same set of maximal cliques are equivalent, the DAG representation of the DM's final belief is essentially unique. In particular, it is independent of the arbitrary selections that the procedure involves along the way (the initial condition $A^0$, tie breaking when there are multiple ways to respect the "maximal overlap" criterion in Step 1 of some round).
The RIP plays a subtle role in the proof of Proposition 3. Recall that
RIP* states that the sets in S can be ordered in some sequence that will
satisfy RIP. However, the order in which the iterative imputation procedure
considers the sets in S (as the DM "rectangularizes" increasingly large parts
of the dataset) is governed by the maximal-overlap criterion of Step 1 in each
round. A priori, such a sequence need not satisfy RIP. However, a lemma
due to Noga Alon (Alon (2014)) ensures that it does.
Illustrations
Let us illustrate the economic meaning of Proposition 3, by revisiting the two economic examples we provided for datasets that satisfy RIP*. In Example 3.1, the iterative imputation procedure produces a final belief with a perfect-DAG representation, which can be written as follows:

$$q^{K-1}(x) = p(x_B) \cdot \prod_{i \in N - B} p(x_i \mid x_B)$$

Thus, the DM ends up considering all "private variables" to be independently distributed, conditional on the "public variables".

As to Example 3.2, the iterative imputation procedure produces a final belief with a perfect-DAG representation, which can be written as follows:

$$q^{K-1}(x) = p(x_1, \dots, x_m) \cdot \prod_{i=m+1}^{n} p(x_i \mid x_{i-1}, \dots, x_{i-m+1})$$

If we interpret $x_i$ as the value of some variable at time period $i$, then $q^{K-1}$ reads exactly like a serially correlated process with memory $m - 1$.
The two examples show how natural structures of subjective belief can
emerge as extrapolations from equally natural specifications of the DM’s
incomplete dataset. Spiegler (2014) contains numerous additional examples
of perfect-DAG representations in concrete economic settings. Moreover,
some of these examples are abstracted from existing models of non-rational
expectations in the literature. Proposition 3 can thus be viewed as a limited-feedback foundation for these models, too.
The next result is a converse to Proposition 3: if S violates RIP*, the iterative imputation procedure produces a final belief that lacks a DAG representation, in the following sense: there exists no DAG $R$ such that $q^{K-1}$ factorizes all objective distributions according to $R$.

Proposition 4 Suppose that S violates RIP*. Then, for every DAG $R$ there exists a (positive-measure) set of objective distributions $p$ such that $q^{K-1}(x) \neq p_R(x)$ for some $x$.
Propositions 3 and 4 imply that among all DAG representations, only
those that involve perfect DAGs have a "missing data" foundation, as articulated by the following corollary.
Corollary 1 () If  is perfect, then any missing-values process  whose
support consists of the maximal cliques in  has the following property: for
every objective distribution , applying the iterative imputation procedure to
the dataset ( ) will generate the final belief −1 =  . () If  is not
perfect, then for any missing-values process , there is an objective distribution  such that applying the iterative imputation procedure to the dataset
( ) will generate a final belief −1 6=  .
19
The first part of the corollary follows directly from Remark 2: the set of maximal cliques of a perfect DAG $R$ satisfies RIP*. The second part says that imperfect-DAG representations cannot be justified by the iterative imputation procedure - in other words, they cannot be "inferred" from the data. For example, consider the DAG $1 \rightarrow 3 \leftarrow 2$. For any $\sigma$, we can find $p$ such that applying the iterative imputation procedure to the dataset $(p, \sigma)$ will give rise to a final belief $q^{K-1}$ such that $q^{K-1}(x) \neq p(x_1) p(x_2) p(x_3 \mid x_1, x_2)$ for some $x$.
Causal inferences
Pearl (2000) advocated a causal interpretation of DAG representations, and
turned them into a powerful formalism for systematic causal reasoning. According to this view, the DAG $R$ represents a fundamental causal structure that underlies the statistical regularities given by $p_R$, and the directed edge $i \to j$ means that the variable $x_i$ is considered to be a direct cause of the variable $x_j$. The conditional-independence properties captured by $p_R$ are thus a consequence of the underlying causal structure.
In the context of the "analyst" example that opened this paper, imagine
that our analyst is tempted to draw causal inferences from the statistical
relations in his rectangularized spreadsheet. (I say "tempted" because he
should have been wary of imposing causal interpretations on purely statistical
data.) Are firms price takers responding to exogenous price changes? Or is
the product price a consequence of firms’ independent quantity choices, as in
Cournot duopoly? A priori, both theories are plausible, hence causality could
run in both directions. However, the Bayesian-network representation (1)
favors the "price taking" theory: price is a primary cause and quantities are
conditionally independent consequences. This causal theory can be described
by the DAG $R$: $2 \leftarrow 1 \rightarrow 3$. It is empirically distinct from the "quantity competition" theory, described by the DAG $R'$: $2 \rightarrow 1 \leftarrow 3$, in the sense that $R$ and $R'$ are not equivalent. Thus, the analyst's willingness to draw
causal inferences from his incomplete dataset has led him to favor a causal
story that regards the publicly observed variable as a primary cause, and to
dismiss a causal story that regards it as a final consequence.
At the same time, the analyst’s causal inferences in this example are not
robust, because $R$ is equivalent to the DAGs $2 \rightarrow 1 \rightarrow 3$ and $3 \rightarrow 1 \rightarrow 2$.
Thus, the causal links in the Bayesian-network representation of his final
belief are not pinned down. Corollary 1 implies that this is a general property.
In a perfect DAG, we can select any maximal clique to be ancestral, w.l.o.g.
Therefore, a causal interpretation of the DAG means that every variable can
be viewed as a primary cause. If there is a directed path from $i$ to $j$ in some
perfect DAG, then there is an equivalent DAG in which the path is entirely
inverted. In this sense, the beliefs that the iterative imputation procedure
extrapolates from datasets lack a meaningful causal interpretation.
An alternative termination criterion
A natural variant on the imputation procedure would terminate it in the earliest round $t$ for which $M^t = N$. In other words, the DM would stop as soon as his "edited" spreadsheet has an infinite set of rows for which no cell has a missing value, instead of trying to rectangularize the entire spreadsheet. When RIP* is satisfied, this variation would not make any difference (in particular, by the assumption that S does not include sets that contain one another, the alternative procedure would terminate in exactly $K - 1$ rounds). When RIP* is violated, it is possible that under the alternative termination criterion, the imputation procedure will culminate in a belief with a DAG representation. However, the DAG will not be essentially unique, as it will depend on arbitrary selections made at various stages of the procedure. For instance, let $N = \{1, 2, 3\}$, $S = \{\{1,2\}, \{1,3\}, \{2,3\}\}$. The alternative procedure will terminate after one round, leading to a belief with a perfect-DAG representation $i \leftarrow j \rightarrow k$, where any permutation of $i, j, k$ is possible.
4 Relation to the MaxEnt Problem
I have introduced a "behaviorally motivated" procedure for extending a
dataset with missing values into a fully-specified probability distribution over
, and characterized when this procedure generates a DAG representation.
One could consider other, more "normatively motivated" extension criteria.
One such criterion is maximal entropy. Suppose the DM faces the dataset $(p, \sigma)$ and derives from it the marginal distributions of $p$ over $x_A$, $A \in S$. The DM's problem is to find a probability distribution $\mu \in \Delta(X)$ that maximizes entropy subject to the constraint that $\mu(x_A) \equiv p(x_A)$ for every $A \in S$. A more general version of this problem was originally stated by Jaynes (1957) and has been studied in the machine learning literature, where it is known as the MaxEnt problem.
The maximal-entropy criterion generalizes the "principle of insufficient reason" (recall that unconstrained entropy maximization yields the uniform distribution). The idea behind it is that the DM wishes to be "maximally agnostic" about the aspects of the distribution he has not learned, while being entirely consistent with the aspects he has learned. For instance, suppose that the DM only manages to learn the marginal distributions over all individual variables - i.e., $S = \{\{1\}, \dots, \{n\}\}$. Then, the maximal-entropy extension of these marginals is $p(x_1) \cdots p(x_n)$, which is consistent with an empty-DAG representation.
The following is an existing result, reformulated to suit our present purposes.
Proposition 5 (Hajek et al. (1992)) Suppose that $R$ is perfect. Then, $p_R$ is a maximal-entropy extension of the marginals of $p$ over all $x_C$, where $C$ is a maximal clique in $R$.
This result establishes a connection between the iterative imputation procedure and the MaxEnt problem: the former can be viewed as an algorithm for implementing the latter whenever S satisfies RIP*. This is a partial,
quasi-normative justification for the iterative imputation procedure.
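To give a computational feel for the MaxEnt side of this connection, here is an illustrative sketch of iterative proportional fitting started from the uniform distribution; for marginal constraints of this form, that scheme is known to converge to the maximum-entropy extension. The array representation, names and fixed iteration count are assumptions of mine, and the marginals of $p$ are assumed to have full support.

```python
import numpy as np

def maxent_extension(p, S, iters=200):
    """Iterative proportional fitting from the uniform distribution:
    repeatedly rescale q so that its marginal over each A in S matches p's."""
    n = p.ndim
    q = np.full(p.shape, 1.0 / p.size)
    for _ in range(iters):
        for A in S:
            drop = tuple(i for i in range(n) if i not in A)
            q = q * p.sum(axis=drop, keepdims=True) / q.sum(axis=drop, keepdims=True)
    return q

# With S = {{1,2}, {1,3}} (0-based: {0,1}, {0,2}) the output coincides with the
# perfect-DAG factorization p(x1) p(x2|x1) p(x3|x1), in line with Proposition 5.
```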
5 Conclusion
This paper presented an example of a broad research program: establishing
a "limited learning feedback" foundation for representations of "boundedly
rational expectations". The basic idea is that DMs extrapolate their belief
from some feedback they receive about a prevailing probability distribution.
The form of their belief - and the systematic biases it may exhibit in
relation to the true distribution - will reflect the structure of their feedback and the method of extrapolation they employ. In this paper, limited
feedback took the form of an infinitely large sample subjected to an independent missing-values process; the extrapolation method was defined by the
iterative imputation procedure; and under some condition on the feedback
(RIP∗ ), the resulting belief had an essentially unique Bayesian-network representation, from which aspects of the underlying limited feedback could be
inferred. The same condition ensured that another extrapolation method,
namely MaxEnt, would be equivalent. As I demonstrate in Spiegler (2014),
the belief distortions inherent in this representation of subjective beliefs have
interesting implications for individual behavior.
This view of our exercise suggests natural directions for extension. First,
it may be interesting to explore generalizations of the Bayesian-network
representation, such as convex combinations of DAG representations, or
graphs with both directed and non-directed links. Second, the link to the
literature on statistical inference with missing data (see Little and Rubin
(2002) for a textbook treatment) can be strengthened. The class of missing-values processes that I have assumed is known in this literature as Missing-Completely-at-Random (MCAR). Step 2 in each round of the imputation procedure is a "non-parametric cousin" of the so-called "stochastic regression" technique. The missing-data aspect can be developed in various ways: the draws of $x$ and $A$ may be correlated; the DM's sample may be finite; and it may consist of passive observational data and active, deliberate "experiments". Finally, alternative methods of extrapolation can be considered.
The link between learning feedback and non-rational expectations was
studied in two recent papers. Esponda and Pouzo (2014b) proposed a general game-theoretic model, in which each player has a "subjective model",
which is a set of stochastic mappings from his action to a primitive set of payoff-relevant consequences he observes during his learning process. This learning feedback is limited because in the true model, other "latent" variables may affect the action-consequence mapping. Esponda and Pouzo define an equilibrium concept, in which each player best-replies to a subjective distribution (of consequences conditional on his action), which is the closest in his subjective model
to the true equilibrium distribution. Distance is measured by a weighted
version of Kullback-Leibler divergence. Esponda and Pouzo justify this equilibrium concept as the steady-state of a Bayesian learning process.
Schwartzstein (2014) studied a dynamic model in which a DM tries to
predict a variable $y$ as a function of two variables $x$ and $z$. At every period, he observes the realizations of $y$ and $z$. In contrast, he pays attention to the realization of $x$ only if his belief at the beginning of the period is that $x$ is sufficiently predictive of $y$. When the DM chooses not to observe $x$, he
imputes a constant value. Schwartzstein examines the long-run belief that
emerges from this learning process, and in particular the DM’s failure to
perceive correlations among the three variables.
References
[1] Alon, N. (2014), "Problems and Results in Extremal Combinatorics III," Tel Aviv University, mimeo.
[2] Cowell, R., P. Dawid, S. Lauritzen and D. Spiegelhalter (1999), Probabilistic Networks and Expert Systems, Springer, London.
[3] Dawid, P. and S. Lauritzen (1993), “Hyper Markov Laws in the Statistical Analysis of Decomposable Graphical Models,” Annals of Statistics,
21, 1272—1317.
[4] Esponda, I. (2008), “Behavioral Equilibrium in Economies with Adverse
Selection,” The American Economic Review, 98, 1269-1291.
[5] Esponda, I. and D. Pouzo (2014a), “An Equilibrium Framework for
Modeling Bounded Rationality,” mimeo.
[6] Esponda, I. and D. Pouzo (2014b), “Conditional Retrospective Voting
in Large Elections,” mimeo.
[7] Evans, G. and S. Honkapohja (2001), Learning and Expectations in
Macroeconomics, Princeton University Press, Princeton.
[8] Eyster, E. and M. Piccione (2013), “An Approach to Asset Pricing Under
Incomplete and Diverse Perceptions,” Econometrica, 81, 1483-1506.
[9] Eyster, E. and M. Rabin (2005), “Cursed Equilibrium,” Econometrica,
73, 1623-1672.
[10] Gilboa, I. (2014), “Rationality and the Bayesian Paradigm,” mimeo.
[11] Hajek, P., T. Havranek and R. Jirousek (1992), Uncertain Information
Processing in Expert Systems, CRC Press.
[12] Jaynes, E. T. (1957), “Information Theory and Statistical Mechanics,”
Physical Review, 106, 620-630.
[13] Jehiel, P. (2005), “Analogy-Based Expectation Equilibrium,” Journal of
Economic Theory, 123, 81-104.
[14] Jehiel, P. and F. Koessler (2008), “Revisiting Games of Incomplete Information with Analogy-Based Expectations,” Games and Economic Behavior, 62, 533-557.
[15] Little, R. and D. Rubin (2002), Statistical Analysis with Missing Data,
Wiley, New Jersey.
[16] Osborne, M. and A. Rubinstein (1998), “Games with Procedurally Rational Players,” American Economic Review, 88, 834-849.
[17] Pearl, J. (2000), Causality: Models, Reasoning and Inference, Cambridge University Press, Cambridge.
[18] Piccione, M. and A. Rubinstein (2003), “Modeling the Economic Interaction of Agents with Diverse Abilities to Recognize Equilibrium Patterns,” Journal of the European Economic Association, 1, 212-223.
[19] Sargent, T. (2001), The Conquest of American Inflation, Princeton University Press, Princeton.
[20] Schwartzstein, J. (2014), “Selective Attention and Learning,” Journal
of European Economic Association, 12, 1423-1452.
[21] Spiegler, R. (2014), “Bayesian Networks and Boundedly Rational Expectations,” mimeo.
[22] Verma, T. and J. Pearl (1991), “Equivalence and Synthesis of Causal
Models,” Uncertainty in Artificial Intelligence, 6, 255-268.
[23] Woodford, M. (2013), “Macroeconomic Analysis without the Rational
Expectations Hypothesis,” Annual Review of Economics, forthcoming.
Appendix: Proofs
Proposition 2
For convenience, label the variables in $C$ by $1, \dots, m$. Let us write down the explicit expression for $p_R(x_C)$:

$$p_R(x_C) = \sum_{x'_{m+1}, \dots, x'_n} p_R(x_1, \dots, x_m, x'_{m+1}, \dots, x'_n)$$
$$= \sum_{x'_{m+1}, \dots, x'_n} \prod_{i \in C} p(x_i \mid x_{R(i) \cap C}, x'_{R(i) - C}) \prod_{i \notin C} p(x'_i \mid x_{R(i) \cap C}, x'_{R(i) - C}) \quad (7)$$

(i). Assume $C$ is an ancestral clique in $R$. Then,

$$\prod_{i \in C} p(x_i \mid x_{R(i) \cap C}, x'_{R(i) - C}) = p(x_1) \prod_{i=2}^{m} p(x_i \mid x_1, \dots, x_{i-1}) = p(x_C)$$

Expression (7) can thus be written as

$$p(x_C) \sum_{x'_{m+1}, \dots, x'_n} \left( \prod_{i=m+1}^{n} p(x'_i \mid x_{R(i) \cap C}, x'_{R(i) - C}) \right) = p(x_C)$$
(ii). Let us distinguish between two cases.

Case 1: $C$ is not a clique in $R$. Then, $C$ contains two variables, labeled w.l.o.g 1 and 2, such that $\neg(1R2)$ and $\neg(2R1)$. Consider an objective distribution $p$, for which every $x_i$, $i > 2$, is distributed independently, whereas $x_1$ and $x_2$ are mutually correlated. Then, expression (7) is simplified into

$$\prod_{i=1}^{m} p(x_i) \sum_{x'_{m+1}, \dots, x'_n} \prod_{i=m+1}^{n} p(x'_i) = \prod_{i=1}^{m} p(x_i)$$

whereas the objective distribution can be written as

$$p(x_C) = p(x_1) p(x_2 \mid x_1) \prod_{i=3}^{m} p(x_i)$$

The two expressions are different because $x_2$ and $x_1$ are not independent.
Case 2: $C$ is a clique which is not ancestral in any DAG in the equivalence class of $R$. Suppose that for every node $i \in C$, $i$ has no "unmarried parents" - i.e., if there exist nodes $j, j'$ such that $jRi$ and $j'Ri$, then $jRj'$ or $j'Rj$. In addition, if there is a directed path from some $j \notin C$ to $C$, then $j$ has no unmarried parents either. Transform $R$ into another DAG $R'$ by inverting every link along every such path. The DAGs $R$ and $R'$ share the same skeleton and v-structure. By Proposition 1, they are equivalent. By construction, $C$ is an ancestral clique in a DAG that is equivalent to $R$, a contradiction.

It follows that $R$ has the following structure. First, there exist three distinct nodes, denoted w.l.o.g 1, 2, 3, such that $1, 2 \notin C$, $1R3$, $2R3$, $\neg(1R2)$ and $\neg(2R1)$. Second, there is a directed path from 3 to some node $i \in C$, $i \geq 3$. For convenience, denote the path by $(3, 4, \dots, i)$ - i.e., the immediate predecessor of any $j > 3$ along the path is $j - 1$. It is possible that $i = 3$, in which case the path is degenerate. W.l.o.g, we can assume that $j \notin C$ for every $j \neq i$ along this path (otherwise, we can take $i$ to be the lowest-numbered node that belongs to $C$ along the path).

Consider any $p$ which is consistent with a DAG $R^*$ that has the following structure: first, $1R^*2$, $2R^*3$ and $1R^*3$; second, for every $j \in \{4, \dots, i\}$, $R^*(j) = \{j - 1\}$; and $R^*(j) = \emptyset$ for every $j \notin \{2, \dots, i\}$. (Note that the latter property means that every $x_j$, $j \notin \{2, \dots, i\}$, is independently distributed.) Then,

$$p(x) = p_{R^*}(x) = p(x_1) p(x_2 \mid x_1) p(x_3 \mid x_1, x_2) \cdot \prod_{j=4}^{i} p(x_j \mid x_{j-1}) \cdot \prod_{j > i} p(x_j)$$

In contrast,

$$p_R(x) = p(x_1) p(x_2) p(x_3 \mid x_1, x_2) \cdot \prod_{j=4}^{i} p(x_j \mid x_{j-1}) \cdot \prod_{j > i} p(x_j)$$

By definition, every $j = 4, \dots, i - 1$ does not belong to $C$. Denote

$$g(x_i, x'_1, \dots, x'_{i-1}) = p(x'_1) p(x'_3 \mid x'_1, x'_2) \left( \prod_{j=4}^{i-1} p(x'_j \mid x'_{j-1}) \right) p(x_i \mid x'_{i-1})$$

Therefore,

$$p(x_C) = \prod_{j \in C - \{i\}} p(x_j) \sum_{x'_1, \dots, x'_{i-1}} p(x'_2 \mid x'_1)\, g(x_i, x'_1, \dots, x'_{i-1})$$
$$p_R(x_C) = \prod_{j \in C - \{i\}} p(x_j) \sum_{x'_1, \dots, x'_{i-1}} p(x'_2)\, g(x_i, x'_1, \dots, x'_{i-1})$$

It is easy to see from these expressions that we can find a distribution $p$ which is consistent with $R^*$ such that $p_R(x_C) \neq p(x_C)$ for some $x_C$.
Proposition 3
I begin with a lemma due to Noga Alon. A sequence of sets $A^0, A^1, \dots, A^m$ is expansive if for every $t \geq 1$, $|A^t \cap (\cup_{k<t} A^k)| \geq |A^s \cap (\cup_{k<t} A^k)|$ for all $s > t$.

Lemma (Alon (2014)). Suppose that S satisfies RIP*. Then, every expansive ordering of S satisfies RIP.

Suppose that S satisfies RIP*. Consider the sequence of sets $A^t$ that are introduced in Step 1 of each round $t$. By Step 1 of the procedure, the sequence $A^0, A^1, \dots, A^{K-1}$ is expansive. By the lemma, it satisfies RIP. My task is to show that for every $p$ and every round $t = 1, \dots, K - 1$, the belief $q^t \in \Delta(X_{M^t})$ has a perfect-DAG representation, where the DAG $R^t$ is defined over $M^t$, and its set of maximal cliques is $\{A^0, A^1, \dots, A^t\}$. The proof is by induction on $t$.

Let $t = 1$. Recall that $M^1 = A^0 \cup A^1$. By assumption, S does not include sets that contain one another. Therefore, $A^1 - A^0$ and $A^0 - A^1$ are non-empty. The auxiliary beliefs $q_1^1$ and $q_2^1$ defined over $X_{M^1}$ are given by

$$q_1^1(x_{M^1}) = p(x_{A^0}) p(x_{A^1 - A^0} \mid x_{A^1 \cap A^0})$$
$$q_2^1(x_{M^1}) = p(x_{A^1}) p(x_{A^0 - A^1} \mid x_{A^1 \cap A^0})$$

By the basic rules of conditional probability, we have

$$p(x_{A^0}) p(x_{A^1 - A^0} \mid x_{A^1 \cap A^0}) = p(x_{A^1}) p(x_{A^0 - A^1} \mid x_{A^1 \cap A^0})$$

and therefore $q^1$ is consistent with a DAG $R^1$ defined over $M^1$, where $i\tilde{R}^1 j$ if and only if $i, j \in A^0$ or $i, j \in A^1$; and $\neg(jR^1 i)$ for every $i \in A^0$, $j \in A^1 - A^0$. The DAG $R^1$ has two maximal cliques, $A^0$ and $A^1$.

Consider the initial condition $(M^{t-1}, q^{t-1})$ of any round $t > 1$. The auxiliary beliefs $q_1^t$ and $q_2^t$ over $X_{M^t}$, $M^t = M^{t-1} \cup A^t$, are given by

$$q_1^t(x_{M^t}) = q^{t-1}(x_{M^{t-1}}) p(x_{A^t - M^{t-1}} \mid x_{A^t \cap M^{t-1}}) \quad (8)$$
$$q_2^t(x_{M^t}) = p(x_{A^t}) q^{t-1}(x_{M^{t-1} - A^t} \mid x_{A^t \cap M^{t-1}})$$

Consider the expression for $q_1^t$. The inductive hypothesis is that $q^{t-1}$ has a perfect-DAG representation, where the DAG $R^{t-1}$ is defined over $M^{t-1}$, and its set of maximal cliques is $\{A^0, A^1, \dots, A^{t-1}\}$. By RIP, $A^t \cap M^{t-1}$ is weakly contained in one of the sets $A^0, A^1, \dots, A^{t-1}$. Extend $R^{t-1}$ to a DAG $R^t$ over $M^t$, simply by adding a directed link between every $i, j \in A^t$ (without destroying acyclicity). The DAG $R^t$ is perfect, and its set of maximal cliques is $\{A^0, A^1, \dots, A^t\}$. Thus, $q_1^t$ is a perfect-DAG representation, where the DAG is $R^t$.

To complete the proof, I will show that $q_2^t$ coincides with $q_1^t$. Note that $q_1^t$ and $q_2^t$ can be written as

$$q_1^t(x_{M^t}) = p(x_{A^t}) q^{t-1}(x_{M^{t-1}}) \cdot \frac{1}{p(x_{A^t \cap M^{t-1}})} \quad (9)$$
$$q_2^t(x_{M^t}) = p(x_{A^t}) q^{t-1}(x_{M^{t-1}}) \cdot \frac{1}{q^{t-1}(x_{A^t \cap M^{t-1}})}$$

Since $q^{t-1}$ is a perfect-DAG representation, where the DAG is $R^{t-1}$, and $A^t \cap M^{t-1}$ is a clique in $R^{t-1}$, Remark 1 implies that w.l.o.g it is an ancestral clique. Proposition 2 then implies that $q^{t-1}(x_{A^t \cap M^{t-1}}) \equiv p(x_{A^t \cap M^{t-1}})$. Therefore, $q_2^t$ coincides with $q_1^t$.
Proposition 4
Suppose that S violates RIP*. Then, any ordering of its sets will violate RIP. Note that this means $K \geq 3$. Recall that the sequence $A^0, A^1$ trivially satisfies RIP. Let $t > 1$ be the earliest round for which $A^t \cap M^{t-1}$ is not weakly contained in any of the sets $A^0, A^1, \dots, A^{t-1}$. By the proof of Proposition 3, $q^{t-1} \in \Delta(X_{M^{t-1}})$ is a perfect-DAG representation, where the perfect DAG $R^{t-1}$, defined over $M^{t-1}$, is characterized by the set of maximal cliques $\{A^0, A^1, \dots, A^{t-1}\}$. It follows that $A^t \cap M^{t-1}$ is not a clique in $R^{t-1}$. By Proposition 2, there exist distributions $p$ for which $q^{t-1}(x_{A^t \cap M^{t-1}}) \neq p(x_{A^t \cap M^{t-1}})$. Therefore, by (9), there exists $p$ for which $q_1^t$ and $q_2^t$ do not coincide. I now construct a family of such distributions. Since $A^t \cap M^{t-1}$ is not a clique in $R^{t-1}$, it must contain two nodes, denoted w.l.o.g 1 and 2, that are not linked by $R^{t-1}$. Let $p$ be an arbitrary objective distribution for which $x_i$ is independently distributed for every $i > 2$, whereas $x_1$ and $x_2$ are mutually correlated. Then,

$$q^{t-1}(x_{M^{t-1}}) = \prod_{i \in M^{t-1}} p(x_i)$$

It follows that $q_1^t$ and $q_2^t$ can be written as

$$q_1^t(x_{M^t}) = p(x_1) p(x_2) \cdot \prod_{i \in M^t - \{1,2\}} p(x_i)$$
$$q_2^t(x_{M^t}) = p(x_1) p(x_2 \mid x_1) \cdot \prod_{i \in M^t - \{1,2\}} p(x_i)$$

such that

$$q^t(x_{M^t}) = \left( \prod_{i \in M^t - \{2\}} p(x_i) \right) \cdot \left( \frac{\sum_{k \leq t-1} \sigma(A^k)}{\sum_{k \leq t} \sigma(A^k)}\, p(x_2) + \frac{\sigma(A^t)}{\sum_{k \leq t} \sigma(A^k)}\, p(x_2 \mid x_1) \right)$$

Since all the variables $x_i$, $i \in N - \{1, 2\}$, are independently distributed under $p$, the continuation of the iterative imputation procedure will eventually produce a final belief of the form

$$q^{K-1}(x) = \left( \prod_{i \neq 2} p(x_i) \right) \cdot \left[ \alpha\, p(x_2) + (1 - \alpha)\, p(x_2 \mid x_1) \right]$$

where $\alpha \in (0, 1)$. Since $p(x_2 \mid x_1) \neq p(x_2)$ for some $x_1, x_2$, $q^{K-1}$ does not have a DAG representation.