Modeling Consumer Decision Making and Discrete Choice Behavior

Part 26: Bayesian vs. Classical [1/45]
Econometric Analysis of Panel Data
William Greene
Department of Economics
Stern School of Business
Part 26: Bayesian vs. Classical [2/45]
26. Modeling Heterogeneity in Classical Discrete Choice: Contrasts with Bayesian Estimation
William Greene
Department of Economics
Stern School of Business
New York University
Part 26: Bayesian vs. Classical [3/45]
Abstract
This study examines some aspects of mixed (random parameters)
logit modeling. We present some familiar results in specification
and classical estimation of the random parameters model. We then
describe several extensions of the mixed logit model developed in
recent papers. The relationship of the mixed logit model to
Bayesian treatments of the simple multinomial logit model is noted,
and comparisons and contrasts of the two methods are described.
The techniques described here are applied to two data sets, a
stated/revealed choice survey of commuters and one simulated
data set on brand choice.
Part 26: Bayesian vs. Classical [4/45]
Random Parameters Models of Discrete Choice

- Econometric Methodology for Discrete Choice Models
  - Classical Estimation and Inference
  - Bayesian Methodology
- Model Building Developments
  - The Mixed Logit Model
  - Extensions of the Standard Model
  - Modeling Individual Heterogeneity
  - 'Estimation' of Individual Taste Parameters
Part 26: Bayesian vs. Classical [5/45]
Useful References

Classical:
- Train, K., Discrete Choice Methods with Simulation, Cambridge, 2003. (Train)
- Hensher, D., Rose, J., Greene, W., Applied Choice Analysis, Cambridge, 2005.
- Hensher, D. and Greene, W., misc. papers, 2003-2005, http://www.stern.nyu.edu/~wgreene

Bayesian:
- Allenby, G., Lenk, P., "Modeling Household Purchase Behavior with Logistic Normal Regression," JASA, 1997.
- Allenby, G., Rossi, P., "Marketing Models of Consumer Heterogeneity," Journal of Econometrics, 1999. (A&R)
- Yang, S., Allenby, G., "A Model for Observation, Structural, and Household Heterogeneity in Panel Data," Marketing Letters, 2000.
Part 26: Bayesian vs. Classical [6/45]
A Random Utility Model
Random Utility Model for Discrete Choice
Among J alternatives at time t by person i.
U_ijt = α_j + β'x_ijt + ε_ijt

α_j   = choice specific constant
x_ijt = attributes of choice j presented to person i
        (Information processing strategy: not all attributes will be
        evaluated, e.g., lexicographic utility functions over certain
        attributes.)
β     = 'taste weights,' 'part worths,' marginal utilities
ε_ijt = unobserved random component of utility;
        Mean: E[ε_ijt] = 0; Variance: Var[ε_ijt] = σ²
Part 26: Bayesian vs. Classical [7/45]
The Multinomial Logit Model
Independent type 1 extreme value (Gumbel):
- F(ε_itj) = exp(−exp(−ε_itj))
- Independence across utility functions
- Identical variances, σ² = π²/6
- Same taste parameters for all individuals

Prob[choice j | i,t] = exp(α_j + β'x_itj) / Σ_{j=1}^{J_t(i)} exp(α_j + β'x_itj)
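As a concrete illustration, a minimal sketch (not from the slides) of this choice probability in Python, assuming NumPy; the names and array shapes are hypothetical:

    import numpy as np

    def mnl_probs(alpha, beta, X):
        """MNL choice probabilities for one choice situation.
        alpha: (J,) choice specific constants
        beta:  (K,) taste weights
        X:     (J, K) attributes of the J alternatives"""
        v = alpha + X @ beta        # systematic utilities
        v = v - v.max()             # stabilize the exponentials
        ev = np.exp(v)
        return ev / ev.sum()        # Prob[choice j | i,t]

Every model that follows changes how the taste weights are generated; this kernel stays the same.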
Part 26: Bayesian vs. Classical [8/45]
What’s Wrong with this MNL Model?

- I.I.D. ⇒ IIA (independence from irrelevant alternatives)
  - Peculiar behavioral assumption
  - Leads to skewed, implausible empirical results
  - Functional forms, e.g., nested logit, avoid IIA
  - IIA will be a nonissue in what follows.
- Insufficiently heterogeneous:
  "... economists are often more interested in aggregate effects and regard heterogeneity as a statistical nuisance parameter problem which must be addressed but not emphasized. Econometricians frequently employ methods which do not allow for the estimation of individual level parameters." (A&R, 1999)
Part 26: Bayesian vs. Classical [9/45]
Accommodating Heterogeneity

- Observed? Enters the model in familiar (and unfamiliar) ways.
- Unobserved? The purpose of this study.
Part 26: Bayesian vs. Classical [10/45]
Observable (Quantifiable) Heterogeneity
in Utility Levels
Uijt = α j +β'xitj + γj zit + ε ijt
Prob[choice j | i,t] =
exp(α j +β'xitj + γj zit )

Jt (i)
j=1
exp(α j +β'xitj + γzit )
Choice, e.g., among brands of cars
xitj = attributes: price, features
Zit = observable characteristics: age, sex, income
Part 26: Bayesian vs. Classical [11/45]
Observable Heterogeneity in
Preference Weights
Uijt = α j +βi xitj + γjzit + ε ijt
βi = β + Φhi
βi,k = βk + φkhi
Prob[choice j | i,t] =
exp(α j +βixitj + γjzit )

Jt (i)
j=1
exp(α j +βixitj + γzit )
Part 26: Bayesian vs. Classical [12/45]
‘Quantifiable’ Heterogeneity in Scaling
Uijt = α j +β'x itj + γ jz it + ε ijt
Var[εijt ] = σ2j exp(δj wit ), σ12 = π2 /6
wit = observable characteristics: age, sex, income, etc.
Part 26: Bayesian vs. Classical [13/45]
Heterogeneity in Choice Strategy

- Consumers avoid 'complexity'
- Lexicographic preferences eliminate certain choices ⇒ the choice set may be endogenously determined
- Simplification strategies may eliminate certain attributes
⇒ The information processing strategy is a source of heterogeneity in the model.
Part 26: Bayesian vs. Classical [14/45]
Modeling Attribute Choice

- Conventional: U_ijt = β'x_ijt. For ignored attributes, set x_k,ijt = 0; this eliminates x_k,ijt from the utility function.
  - Price = 0 is not a reasonable datum; it distorts the choice probabilities.
- Appropriate: formally set β_k = 0.
  - Requires a 'person specific' model
  - Accommodate as part of model estimation
  - (Work in progress) Stochastic determination of attribute choices
Part 26: Bayesian vs. Classical [15/45]
Choice Strategy Heterogeneity

- Methodologically, a rather minor point – construct the appropriate likelihood given the known information:
  logL = Σ_{m=1}^{M} Σ_{i∈class m} log L_i(θ | data, m)
- Not a latent class model. The classes are not latent.
- Not the 'variable selection' issue (the worst form of "stepwise" modeling)
- The familiar strategy gives the wrong answer.
Part 26: Bayesian vs. Classical [16/45]
Application of Information Strategy



- Stated/revealed preference study, Sydney car commuters. 500+ surveyed, about 10 choice situations for each.
- Existing route vs. 3 proposed alternatives.
- Attribute design
  - Original: respondents presented with 3, 4, 5, or 6 attributes
  - Attributes – four level design:
    - Free flow time
    - Slowed down time
    - Stop/start time
    - Trip time variability
    - Toll cost
    - Running cost
  - Final: respondents use only some attributes and indicate when surveyed which ones they ignored
Part 26: Bayesian vs. Classical [17/45]
Estimation Results
Part 26: Bayesian vs. Classical [18/45]
“Structural Heterogeneity”


- Marketing literature
- Latent class structures
  - Yang/Allenby – latent class random parameters models
  - Kamakura et al. – latent class nested logit models with fixed parameters
Part 26: Bayesian vs. Classical [19/45]
Latent Classes and
Random Parameters
Heterogeneity with respect to 'latent' consumer classes:

Pr(Choice_i) = Σ_{q=1}^{Q} Pr(choice_i | class = q) Pr(class = q)
Pr(choice_i | class = q) = exp(x_i,choice'β_q) / Σ_j exp(x_i,j'β_q)
Pr(class = q | i) = F_i,q, e.g., F_i,q = exp(z_i'δ_q) / Σ_{q=1}^{Q} exp(z_i'δ_q)

Simple discrete random parameter variation:

Pr(choice_i | β_i) = exp(x_i,choice'β_i) / Σ_j exp(x_i,j'β_i)
Pr(β_i = β_q) = F_i,q = exp(z_i'δ_q) / Σ_{q=1}^{Q} exp(z_i'δ_q), q = 1,...,Q
Pr(Choice_i) = Σ_{q=1}^{Q} Pr(choice_i | β_i = β_q) Pr(β_i = β_q)
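A minimal sketch (not from the slides) of the latent class mixture probability just defined, assuming NumPy; all array shapes and names are hypothetical:

    import numpy as np

    def class_probs(z_i, delta):
        """F_iq = Pr(class = q | i); delta: (Q, L), z_i: (L,)."""
        v = delta @ z_i
        v = v - v.max()
        ev = np.exp(v)
        return ev / ev.sum()

    def mixture_prob(X_i, choice, betas, z_i, delta):
        """Pr(Choice_i) = sum_q Pr(choice_i | beta_q) Pr(class = q | i).
        X_i: (J, K) attributes; betas: (Q, K), one vector per class."""
        F = class_probs(z_i, delta)
        p = 0.0
        for q in range(len(betas)):
            v = X_i @ betas[q]
            ev = np.exp(v - v.max())
            p += F[q] * ev[choice] / ev.sum()   # weight class q's logit prob
        return p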
Part 26: Bayesian vs. Classical [20/45]
Latent Class Probabilities

- Ambiguous at face value – a classical or a Bayesian model?
- Equivalent to random parameters models with discrete parameter variation
  - Using nested logits, etc. does not change this
  - Precisely analogous to continuous 'random parameter' models
  - Not always equivalent – zero inflation models
Part 26: Bayesian vs. Classical [21/45]
Unobserved Preference Heterogeneity


- What is it?
- How does it enter the model?

U_ijt = α_j + β'x_itj + γ_j'z_it + ε_ijt + w_i

Random parameters? Random 'effects'?
Part 26: Bayesian vs. Classical [22/45]
Random Parameters?
Stochastic Frontier Models with Random
Coefficients. M. Tsionas, Journal of Applied
Econometrics, 17, 2, 2002.
Bayesian analysis of a production model
What do we (does he) mean by “random?”
Part 26: Bayesian vs. Classical [23/45]
What Do We Mean by
Random Parameters?

Classical:
- Distribution across individuals
- Model of heterogeneity across individuals
- Characterization of the population
- "Superpopulation?" (A&R)

Bayesian:
- Parameter uncertainty? (A&R) Whose?
- Distribution defined by a 'prior?' Whose prior? Is it unique? Is one 'right?'
- Definitely NOT heterogeneity. That is handled by individual specific 'random' parameters in a hierarchical model.
Part 26: Bayesian vs. Classical [24/45]
Continuous Random Variation in
Preference Weights
Uijt = α j +βi x itj + γ jz it + ε ijt
βi = β + Φhi + w i
βi,k = βk + φkhi + w i,k
Most treatments set Φ = 0
βi = β + w i
Prob[choice j | i, t] =
exp(α j +βi xitj + γj z it )

Jt (i)
j=1
exp(α j +βi xitj + γz it )
Heterogeneity arises from continuous variation
in βi across individuals. (Classical and Bayesian)
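A minimal sketch (not from the slides) of generating such continuous random parameters as β_i = β + L·u_i, where L is the Cholesky factor of an assumed covariance matrix; all numbers are hypothetical, assuming NumPy:

    import numpy as np

    rng = np.random.default_rng(0)
    beta = np.array([-1.0, 0.5])          # population means (hypothetical)
    cov = np.array([[0.4, 0.1],
                    [0.1, 0.2]])          # heterogeneity covariance
    L = np.linalg.cholesky(cov)           # cov = L @ L.T
    u = rng.standard_normal((1000, 2))    # one draw per simulated person
    beta_i = beta + u @ L.T               # (1000, 2) individual taste weights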
Part 26: Bayesian vs. Classical [25/45]
What Do We ‘Estimate?’
Classical:
- f(β_i | Ω, z_i) = population distribution
- 'Estimate' Ω, then
  - E[β_i | Ω, z_i] = conditional mean
  - V[β_i | Ω, z_i] = conditional variance
- Estimation paradigm: "asymptotic (normal)," "approximate," "imaginary samples"

Bayesian:
- f(β | Ω_0) = prior
- L(data | β) = likelihood
- f(β | data, Ω_0) = posterior
- E[β | data, Ω_0] = posterior mean
- V[β | data, Ω_0] = posterior variance
- Estimation paradigm: "exact," "more accurate" ("not general beyond this prior and this sample...")
Part 26: Bayesian vs. Classical [26/45]
How Do We ‘Estimate It?’

- Objective
  - Bayesian: posterior means
  - Classical: conditional means
- Mechanics: simulation based estimators
  - Bayesian: random sampling from the posterior distribution. Estimate the mean of a distribution. Always easy.
  - Classical: maximum simulated likelihood. Find the maximum of a function. Sometimes very difficult.

These will look suspiciously similar.
Part 26: Bayesian vs. Classical [27/45]
A Practical Model Selection Strategy
What self contained device is available to suggest that
the analyst is fitting the wrong model to the data?


- Classical: The iterations fail to converge. The optimization otherwise breaks down. The model doesn't 'work.'
- Bayesian? E.g., the Yang/Allenby structural/preference heterogeneity model has both discrete and continuous variation in the same model. Is this identified? How would you know? The MCMC approach is too easy. It always works.
Part 26: Bayesian vs. Classical [28/45]
Bayesian Estimation Platform: The Posterior (to the Data) Density

Prior:         f(β | Ω_0)
Likelihood:    L(β | data) ∝ f(data | β)
Joint density: f(β, data | Ω_0) = L(β | data) f(β | Ω_0)
Posterior:     f(β | data, Ω_0) = f(β, data | Ω_0) / f(data)
             = L(β | data) f(β | Ω_0) / ∫_β L(β | data) f(β | Ω_0) dβ

Posterior density of β given the data and the prior Ω_0.
Part 26: Bayesian vs. Classical [29/45]
The Estimator is the Posterior Mean
E[β | data, Ω_0] = ∫_β β f(β | data, Ω_0) dβ
                 = ∫_β β [ L(β | data) f(β | Ω_0) / ∫_β L(β | data) f(β | Ω_0) dβ ] dβ

Simulation based (MCMC) estimation: empirically,

Ê[β] = (1/R) Σ_{r=1}^{R} β_r, with the β_r drawn from the known posterior population.

This is not 'exact.' It is the mean of a random sample.
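For concreteness, a minimal sketch (not from the slides) of one way to produce such draws: a random walk Metropolis sampler for a simple MNL under a flat prior, with the posterior mean taken as the average of the retained draws. Assumes NumPy; data shapes and tuning values are hypothetical.

    import numpy as np

    def log_post(beta, X, y):
        """Log posterior under a flat prior = the MNL log likelihood.
        X: (N, J, K) attributes; y: (N,) indices of chosen alternatives."""
        v = X @ beta
        v = v - v.max(axis=1, keepdims=True)
        return (v[np.arange(len(y)), y] - np.log(np.exp(v).sum(axis=1))).sum()

    def metropolis(X, y, K, R=5000, step=0.05, seed=0):
        rng = np.random.default_rng(seed)
        beta, cur = np.zeros(K), log_post(np.zeros(K), X, y)
        draws = []
        for _ in range(R):
            prop = beta + step * rng.standard_normal(K)   # random walk proposal
            cand = log_post(prop, X, y)
            if np.log(rng.uniform()) < cand - cur:        # accept/reject
                beta, cur = prop, cand
            draws.append(beta.copy())
        draws = np.array(draws[R // 2:])                  # discard burn-in
        return draws.mean(axis=0), draws                  # Ê[β] and the sample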
Part 26: Bayesian vs. Classical [30/45]
Classical Estimation Platform: The Likelihood
Marginal:        f(β_i | data, Ω)
Population mean: E[β_i | data, Ω] = ∫_{β_i} β_i f(β_i | Ω) dβ_i = β = a subvector of Ω

Ω̂ = argmax L(β_i, i = 1,...,N | data, Ω)
Estimator = β̂

Expected value over all possible realizations of β_i (according to the estimated asymptotic distribution), i.e., over all possible samples.
Part 26: Bayesian vs. Classical [31/45]
Maximum Simulated Likelihood
True log likelihood:
L_i(β_i | data_i) = Π_{t=1}^{T_i} f(data_i,t | β_i)
L_i(Ω | data_i) = ∫_{β_i} Π_{t=1}^{T_i} f(data_i,t | β_i) f(β_i | Ω) dβ_i
logL = Σ_{i=1}^{N} log ∫_{β_i} L_i(β_i | data_i) f(β_i | Ω) dβ_i

Simulated log likelihood:
logL_S = Σ_{i=1}^{N} log (1/R) Σ_{r=1}^{R} L_i(β_i,r | data_i, Ω)
Ω̂ = argmax(logL_S)
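A minimal sketch (not from the slides) of this simulated log likelihood for a mixed logit with β_i ~ N(b, diag(s)²), assuming NumPy and SciPy; the shapes, names, and starting value theta0 are hypothetical:

    import numpy as np
    from scipy.optimize import minimize

    def neg_sim_loglike(theta, X, y, R=200, seed=0):
        """X: (N, T, J, K) attributes; y: (N, T) chosen alternatives.
        theta stacks the K means b and the K log standard deviations."""
        N, T, J, K = X.shape
        b, s = theta[:K], np.exp(theta[K:])
        u = np.random.default_rng(seed).standard_normal((R, K))  # fixed draws
        beta_r = b + u * s                     # (R, K) simulated beta_i,r
        ll = 0.0
        for i in range(N):
            pr = np.ones(R)                    # simulated prob of i's sequence
            for t in range(T):
                v = X[i, t] @ beta_r.T         # (J, R) utilities per draw
                v = v - v.max(axis=0)
                p = np.exp(v) / np.exp(v).sum(axis=0)
                pr *= p[y[i, t]]               # product over the T_i periods
            ll += np.log(pr.mean())            # average over the R draws
        return -ll

    # Omega_hat = minimize(neg_sim_loglike, theta0, args=(X, y)).x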
Part 26: Bayesian vs. Classical [32/45]
Individual Parameters
β_i ~ N(β, Ω), i.e., β_i = β + w_i
β ~ N(β_0, Ω_0), i.e., β = β_0 + w_0
Ω ~ Inverse Wishart(G_0, g_0)

β̂_i = posterior mean = E[β_i | data, β_0, Ω_0, G_0, g_0],
computed using a Gibbs sampler (MCMC).
Part 26: Bayesian vs. Classical [33/45]
Estimating i
“… In contrast, classical approaches to modeling
heterogeneity yield only aggregate summaries of
heterogeneity and do not provide actionable
information about specific groups. The classical
approach is therefore of limited value to
marketers.” (A&R p. 72)
Part 26: Bayesian vs. Classical [34/45]
A Bayesian View: Not All Possible Samples, Just This Sample
Based on any 'classical' random parameters model,

E[β_i | this sample] = ∫_{β_i} β_i f(β_i | data_i, Ω) dβ_i

β̂_i = conditional mean in f(β_i | data_i, Ω)
    = ∫_{β_i} β_i [ L(β_i | data_i) f(β_i | Ω) / ∫_{β_i} L(β_i | data_i) f(β_i | Ω) dβ_i ] dβ_i
    = the conditional mean conditioned on the data observed for individual i.

This looks like the posterior mean.
Part 26: Bayesian vs. Classical [35/45]
THE Random Parameters Logit Model
Random utility:
U_ijt = α_i,j + β_i'x_itj + γ_i,j'z_it + ε_ijt

Random parameters:
θ_i,k = θ_k + δ_k'w_i + σ_k u_i,k
Θ_i = Θ + Δw_i + Σu_i, Σ a diagonal matrix

Extensions:
- Correlation: Σ = a lower triangular matrix
- Autocorrelation: u_i,k,t = ρ u_i,k,t-1 + v_i,k,t
- Variance heterogeneity: σ_i,k = σ_k exp(γ_k'f_i)
- Structural parameters: Ω = [θ, Δ, Σ, ρ, Γ]
Part 26: Bayesian vs. Classical [36/45]
Conditional Estimators
Ω̂ = argmax Σ_{i=1}^{N} log (1/R) Σ_{r=1}^{R} Π_{t=1}^{T_i} P_ijt(β_i,r | Ω, data_i,t)

Ê[β_i,k | data_i]
  = [ (1/R) Σ_{r=1}^{R} β_i,k,r Π_{t=1}^{T_i} P_ijt(β_i,r | Ω̂, data_i,t) ] / [ (1/R) Σ_{r=1}^{R} Π_{t=1}^{T_i} P_ijt(β_i,r | Ω̂, data_i,t) ]
  = (1/R) Σ_{r=1}^{R} ŵ_i,r β_i,k,r

Ê[β²_i,k | data_i]
  = [ (1/R) Σ_{r=1}^{R} β²_i,k,r Π_{t=1}^{T_i} P_ijt(β_i,r | Ω̂, data_i,t) ] / [ (1/R) Σ_{r=1}^{R} Π_{t=1}^{T_i} P_ijt(β_i,r | Ω̂, data_i,t) ]
  = (1/R) Σ_{r=1}^{R} ŵ_i,r β²_i,k,r

Var[β_i,k | data_i] = Ê[β²_i,k | data_i] − (Ê[β_i,k | data_i])²

Ê[β_i,k | data_i] ± 2·√(Var[β_i,k | data_i]) will encompass 95% of any reasonable distribution.
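A minimal sketch (not from the slides) of these conditional moments: the weight on each draw is that draw's likelihood for person i's observed choice sequence, normalized over draws. Assumes NumPy; shapes and names are hypothetical.

    import numpy as np

    def conditional_moments(beta_draws, X_i, y_i):
        """beta_draws: (R, K) draws from f(beta | Omega_hat);
        X_i: (T, J, K) attributes; y_i: (T,) choices of person i."""
        w = np.ones(len(beta_draws))
        for t in range(len(y_i)):
            v = X_i[t] @ beta_draws.T              # (J, R) utilities
            v = v - v.max(axis=0)
            p = np.exp(v) / np.exp(v).sum(axis=0)
            w *= p[y_i[t]]                         # likelihood product over t
        w = w / w.sum()                            # normalized weights w_i,r
        mean = w @ beta_draws                      # Ê[beta_i | data_i]
        var = w @ beta_draws**2 - mean**2          # Var[beta_i | data_i]
        return mean, var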
Part 26: Bayesian vs. Classical [37/45]
Simulation Based Estimation

- Bayesian: limited to convenient priors (normal, inverse gamma, and Wishart) that produce mathematically tractable posteriors. Largely simple RPMs without heterogeneity.
- Classical: use any distributions, for any parts of the heterogeneity, that can be simulated. Rich, layered model specifications:
  - Comparable to Bayesian (normal)
  - Constrain parameters to be positive (triangular, lognormal)
  - Limit the ranges of parameters (uniform, triangular)
  - Produce particular shapes of distributions, such as small tails (beta, Weibull, Johnson SB)
  - Heteroscedasticity and scaling heterogeneity
  - Nesting and multilayered correlation structures
Part 26: Bayesian vs. Classical [38/45]
Computational Difficulty?
“Outside of normal linear models with normal random coefficient
distributions, performing the integral can be computationally
challenging.” (A&R, p. 62)
(No longer even remotely true.)
(1) MSL with dozens of parameters is simple.
(2) Multivariate normal (multinomial probit) is no longer the benchmark alternative. (See McFadden and Train.)
(3) Intelligent methods of integration (Halton sequences) speed up integration by factors of as much as 10. (These could be used by Bayesians.)
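For reference, a minimal sketch (not from the slides) of the radical inverse that generates a one-dimensional Halton sequence; distinct prime bases supply the additional dimensions:

    def halton(n, base=2):
        """First n Halton numbers in the given prime base."""
        seq = []
        for i in range(1, n + 1):
            f, x = 1.0, 0.0
            while i > 0:
                f /= base                 # next inverse power of the base
                x += f * (i % base)       # reversed digit contributes here
                i //= base
            seq.append(x)
        return seq

    # halton(5, 2) -> [0.5, 0.25, 0.75, 0.125, 0.625]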
Part 26: Bayesian vs. Classical [39/45]
Individual Estimates

- Bayesian... "exact..." What do we mean by "the exact posterior"?
- Classical... "asymptotic..."
- These will be very similar. The counterpoint is not a crippled LCM or MNP. Same model, similar values.
- A theorem of Bernstein-von Mises: Bayesian → classical as N → ∞. (The likelihood function dominates; the posterior mean → the mode of the likelihood, the more so as we are able to specify flat priors.)
Part 26: Bayesian vs. Classical [40/45]
Extending the RP Model to WTP

- Use the model to estimate conditional distributions for any function of the parameters
- Willingness to pay = β_i,time / β_i,cost
- Use the same method:

Ê[WTP_i | data_i]
  = [ (1/R) Σ_{r=1}^{R} WTP_i,r Π_{t=1}^{T_i} P_ijt(β_i,r | Ω̂, data_i,t) ] / [ (1/R) Σ_{r=1}^{R} Π_{t=1}^{T_i} P_ijt(β_i,r | Ω̂, data_i,t) ]
  = (1/R) Σ_{r=1}^{R} ŵ_i,r WTP_i,r
Part 26: Bayesian vs. Classical [41/45]
What is the ‘Individual Estimate?’

Point estimate of mean, variance and range of random variable i
| datai.

Value is NOT an estimate of i ; it is an estimate of E[i | datai]
What would be the best estimate of the actual realization

An interval estimate would account for the sampling ‘variation’ in

Bayesian counterpart to the preceding?

i|datai?
the estimator of Ω that enters the computation.
Posterior mean and
variance? Same kind of plot could be done.
Part 26: Bayesian vs. Classical [42/45]
Methodological Differences
Focal point of the discussion in the literature is the simplest
possible MNL with random coefficients,
Prob[choice j | i,t] = exp(α_i,j + β_i'x_itj) / Σ_{j=1}^{J_t(i)} exp(α_i,j + β_i'x_itj)

[ α_i,j ]   [ α_j ]   [ w_i,αj ]
[ β_i   ] = [ β   ] + [ w_i,β  ]

This is far from adequate to capture the forms of heterogeneity discussed here. Many of the models discussed here are inconvenient or impossible with received Bayesian methods.
Part 26: Bayesian vs. Classical [43/45]
A Preconclusion
“The advantage of hierarchical Bayes models of heterogeneity is
that they yield disaggregate estimates of model parameters. These
estimates are of particular interest to marketers pursuing product
differentiation strategies in which products are designed and
offered to specific groups of individuals with specific needs. In
contrast, classical approaches to modeling heterogeneity yield only
aggregate summaries of heterogeneity and do not provide
actionable information about specific groups. The classical
approach is therefore of limited value to marketers.” (A&R p. 72)
Part 26: Bayesian vs. Classical [44/45]
Disaggregated Parameters

- The description of classical methods as only producing aggregate results is obviously untrue.
- As regards "targeting specific groups...," both sets of methods produce estimates for the specific data in hand. Unless we want to trot out the specific individuals in this sample to do the analysis and marketing, any extension is problematic. This should be understood in both paradigms.
- NEITHER METHOD PRODUCES ESTIMATES OF INDIVIDUAL PARAMETERS, CLAIMS TO THE CONTRARY NOTWITHSTANDING. BOTH PRODUCE ESTIMATES OF THE MEAN OF THE CONDITIONAL (POSTERIOR) DISTRIBUTION OF POSSIBLE PARAMETER DRAWS CONDITIONED ON THE PRECISE SPECIFIC DATA FOR INDIVIDUAL i.
Part 26: Bayesian vs. Classical [45/45]
Conclusions

- Just two different algorithms. When estimates of the same model are compared, they rarely differ by enough to matter. See Train, Chapter 12, for a nice illustration.
- Classical methods shown here provide rich model specifications and do admit 'individual' estimates; these have yet to be emulated by Bayesian methods.
- The philosophical differences in interpretation are a red herring. It appears that each approach has some advantages and disadvantages.