# Estimating Building Simulation Parameters via Bayesian Structure Learning ```overall
approachwhich
is much
scalable
than the
S&amp;S and
sian distribution,
makesmore
it slightly
less general
than
method presented by Fan Li  guarantees
global maxisume ha represents
a vector of zeros. Ba
methods.
In
the
RB
methods
presented
in
this
CB
methods.
This
clear
computational
us
to
the other methods. In addition, other works illustrates that
mum solution
inthe
all ⌦
instances.
presented vector.
in , itThese
is possible
to relaxasthe
J
is
and
h
is
the
potential
methods
paper
have
a
polynomial
worst
case
runtime,
while
all
other
explore
Regression
based
(RB)
structure
learning
there
are
better
feature
selection
methods
than
stepwise
seThis
model
allows
exact
inference
by
using
the
GGM’s
is
zero.
However,
computing
h research
requires c
sume
h
represents
a
vector
of
zeros.
Based
on
the
tribution, which makes it slightly less general
than
method
presented
by
Fan
Li

guarantees
a
global
maximethods
have
an
exponential
worst
case
runtime.
methods.
In
the
RB
methods
presented
in
this
lection
and
backwards
selection
.
distribution’s
information
form
(Eq.
8).
verting
which
is expensive
eventhat
withht
r methods.
In
other
works
illustrates
that
mum
solution
in
all
instances.
sian distribution,
which makes
it slightly less
general than Parameters
method
presentedvia
Fan
guarantees
a global
maxipresented
inby
,
itLiis
possible
to⌦,relax
the
assumption
Estimating
Building
Simulation
Bayesian
Structure
Learning
paper
have
a
polynomial
worst
case
runtime,
while
all
other
While
the
feature
selection
methods
can
be
improved,
the
sian
distribution,
which
makes
it
slightly
less
general
than
method
presented
by
Fan
Li

guarantees
a
global
maxipresented
by
A.
Dobra
et. al .
otherselection
methods.methods
works
illustrates
that
mum
solution
in
all
instances.
e betterthe
feature
thanother
stepwise
seThis
model
allows
exact
inference
by
using
the
GGM’s
Tcomputing
T
is
zero.
However,
h
requires
computing
⌃ by in3.3
Regression
Based
1
2
1
p(x)
/
exp(x
Jx
+
h
x)
(8)
the
other
methods.
other
works
illustrates
mum
solution
in all
instances.
have
an
worst
case
runtime.
overall
approach
is exponential
much
scalable
than
the
S&amp;S
andthat
Richard
E.
Edwards
R.
New
,
and
Lynne
E.
Parker
there
are
better
feature
selection
methods
than
stepwise
se- , Joshua
This
model
allows
exact
inference
by
using
the
GGM’s
andmethods
backwards
selection
.Inmore
distribution’s
information
form
(Eq.
8).
Overall,
A.
Dobra
et. al ’s
meth
verting
⌦,
which
is
expensive
even
with
the
efficient
method
there
are
better
feature
selection
methods
than
stepwise
seThis
model
allows
exact
inference
by
using
the
GGM’s
1University
RB
structure
learning
methods
determine
all
conditional
CB
methods.
This
clear
computational
us
to
lection
and
backwards
selection
.
distribution’s
information
form
(Eq.
8).
of
Tennessee,
Electrical
Engineering
and
Computer
Science
the feature selection methods can be improved, the
and
computationally
feasible
for
a
large
J
is
the
⌦
and
h
is
the
potential
vector.
These
methods
aspresented
by
A.
Dobra
et.
al
.
T
T
lection
and
backwards
selection
.
distribution’s
information
(Eq. 8).
2Oak
3.3
Based
While
the
feature
selection
methods
can
be improved,
the
/ exp(x
+ hform
x) Integration
(8)
dependencies
by
assuming
that
eachp(x)
variable
is
a Jx
functional
explore
Regression
based
(RB)
structure
learning
Ridge
National
Lab,
Whole
Building
&amp; Community
Group
approach
isRegression
much
more
scalable
than
the
S&amp;S
and
T Based
T
variables.
However,
the
resulting
mode
sume
h
represents
a
vector
of
zeros.
on
the
research
Overall,
A.
Dobra
et.
al
’s
method
is
fairly
robust
p(x)
/
exp(x
J
x
+
h
x)
(8)
While
the
feature
selection
methods
can
be
improved,
the
overall
approach
is much
more
scalable
than
the
andrandom variables. This concept isT
methods.
In
the RB
methods
presented
in S&amp;S
this
T
result
from
of all
hods.
This
clear
computational
us
toa subset
RB
structure
learning
methods
determine
all
conditional
p(x)
/ exp(x
Jxfeasible
+the
h x)
(8)
in
,
it is
possible
toThese
relax
assumption
that
h
J is us
theto⌦presented
and h is the
potential
vector.
methods
asdent
upon
the
variable
ordering.
Each
overall
approach
is
much
than the S&amp;S
and
and
computationally
for
a
large
number
of
random
CB
methods.
This
clear
computational
paper have
a polynomial
case runtime,
while all
other
Regression
based worst
(RB)that
structure
learning
J
is
the
⌦
and
h
is
the
potential
vector.
These
methods
asbest
presented
in
Linear
Gaussian
Bayesian
Networks,
where
dependencies
by
assuming
each
variable
is
a
functional
sume
h represents
avariables.
vector computing
of zeros.
Based
on
the
research
is zero. However,
h requires
computing
⌃
by in-isa heavily
CB methods.
ThisRegression
clear Abstract
computational
us to
explore
based (RB)
structure
learning
dering
may
produce
di↵erent
joint
p
However,
the
resulting
model
depenResults
J sume
is distributed,
the h⌦ represents
and h is
theaeach
potential
ass. result
In
the
RB
methods
presented
in this variable
vector
ofvector.
zeros. These
Basedmethods
on the research
methods
have
an
exponential
worst
case
runtime.
each
random
is
Gaussian
but
defrom
a
subset
of
all
random
variables.
This
concept
is
presented
in , it
isdent
possible
to relax
the
assumption
that hEach
Regression
based
(RB) structure
learning
methods.
the RB
methods
presented
in this verting
⌦,
which
is
expensive
even
with
the
efficient
method
tion
because
each
variable
is
only depend
upon
the
variable
ordering.
unique
variable
orsume
h
represents
a
vector
of
zeros.
Based
on
the
research
presented
in
,
it
is
possible
to
relax
the
assumption
that
h
avebest
aMany
polynomial
worst
case
runtime,
while
all
other
key have
building
design
policies
are
using
sophisticated
computer
pendent
random
Gaussian
distribution
iset.
a linear
presented
Linear
Gaussian
Bayesian
Networks,
where
methods.
In
the
RBcase
methods
presented
inzero.
this
paper
ain
polynomial
worst
runtime,
whilevariable’s
all
other
is
However,
computing
h
requires
computing
⌃
by
inpresented
by
A.
Dobra
al
.
that
proceed
it.that
This
ensure
dering
produce
di↵erent
joint
probability
distribuis zero.
However,
harequires
computing
⌃ hbyassumption
inpresented
in
,may
it iscomputing
possible
to relax
the assumption
3.3
Regression
Based
s have
an
exponential
worst
case
runtime.
simulations
such
as
EnergyPlus
(E+),
the
DOE
flagship
whole-building
energy
have
exponential
case runtime.
paper have
a an
polynomial
worstworst
case
runtime,
while
all
other
verting
which
is
expensive
even
with
the
efficient
method
eachmethods
random
variable
is Gaussian
distributed,
but
each⌦,decombination
of its
parents.
Therefore,
one
can
clearly
learn
Overall,
A.
Dobra
et.
al
’s
method
is
fairly
robust
verting
⌦,
which
is
expensive
even
with
the
efficient
method
is
zero.
However,
computing
h
requires
computing
⌃
by
intion
because
each
variable
is
only
dependent
upon
variables
graph
structure
is
a
valid
Bayesian
Netwo
simulation
engine.
E+
and
other
sophisticated
computer
simulations
have
RB
structure
learning
methods
determine
all
conditional
methods
have
an
exponential
worst
case
runtime.
presented
by verting
A.
Dobra
et.
al
.
pendent
random
variable’s
Gaussian
distribution
is
a
linear
the
structure
for
a
Linear
Gaussian
Bayesian
Network
by
and
computationally
feasible
for
a
large
number
of
random
presented
by
A.
Dobra
et.
al
.
Regression
Based
⌦,
which
is
expensive
even
with
the
efficient
method
several
major
problems.
The
two
main
issues
are
1)
gaps
between
the
3.3
Regression
Based
that
proceed
it.
This
assumption
ensures
that
the
resulting
method
to
search
the
set
of
all
possible
o
dependencies
by
assuming
that
each
variable
is
a
functional
Overall,
A.
Dobra
et.
al
’s
method
is
fairly
robust
combination
of
its
parents.
Therefore,
one
can
clearly
learn
Overall,
A.
Dobra
et.
al
’s
method
is fairly
robust
performing
a
series
of
linear
regressions
with
feature
selecvariables.
However,
the
resulting
model
is
heavily
depenpresented
by
A.
Dobra
et.
al
.
simulation
model
and
the
actual
structure,
and
2)
limitations
of
the
modeling
ructure
learning
methods
determine
all
conditional
3.3
Regression
Based
graph
structure
is
a
valid
Bayesian
Network,
but
requires
the
RB
structure
learning
methods
determine
all
conditional
the
best
structure.
A.
Dobra
et.
al

result
from
a
subset
of
all
random
variables.
This
concept
is
and
computationally
feasible
for
a
large
number
of
random
and
computationally
feasible
for
a
large
number
of
random
the
structure
for
a
Linear
Gaussian
Bayesian
Network
by
Overall,
A.
Dobra
et.
al
’s
method
is
fairly
robust
engine’s
capabilities.
Currently,
these
problems
are
by
having
an
dent
upon
the
variable
ordering.
Each
unique
variable
ortion.
ncies
by
assuming
that
each
variable
is
a
functional
dependencies
by
assuming
that
each
variable
is aconditional
functional
method
to
search
the
set
of
all
possible
orders
to
determine
RB
structure
learning
methods
determine
all
by
scoring
the
individual
variables
and
g
best
presented
in
Linear
Gaussian
Bayesian
Networks,
where
variables.
However,
the
resulting
model
is
heavily
depenvariables.
However,
the
resulting
model
is
heavily
depenmanually
calibrate
simulation
parameters
towith
real
world
data
or
using
and may
computationally
feasible
forjoint
a large
number ofdistriburandom
a
series
of
linear
regressions
feature
selecdering
produce
a
di↵erent
probability
The
RB
approach
is
extremely
scalable.
For
example,
omperforming
aengineer
subset
of
all
random
variables.
This
concept
is
result
from
a
subset
of
all
random
variables.
This
concept
is
dependencies
by methods
assuming
that
each
variable
a functional
the
best
structure.
A.
Dobra
et.
al

this
issue
dent
the
variable
ordering.
Each
unique
variable
oreach
random
variable
is Gaussian
distributed,
butis
each
de- upon
variables
in
ascending
order
based
on
th
dent
upon
the
variable
ordering.
Each
unique
variable
oralgorithmic
optimization
to
the
building
parameters.
However,
variables.
However,
the
resulting
model
is
heavily
depentionRB
because
each to
variable
islarge
only dependent upon variables
tion.in
best
presented
in
Linear
Gaussian
Bayesian
Networks,
where
sented
Linear
Gaussian
Bayesian
Networks,
where
M.
Gustafsson
et.
al

used
methods
learn
result
from
a
subset
of
all
random
variables.
This
concept
is
dering
produce
aindividual
di↵erent
joint
probability
distribudering may
di↵erent
joint
probability
distribubyamay
scoring
the
variables
and
greedily
ordering
the
pendent
random
variable’s
Gaussian
is
a
linear
some simulations
engines, like
E+, aredistribution
computationally
expensive,
whichproduce
Fan
Li

builds
on
this
approach
and
dent
upon
the
variable
ordering.
Each
unique
variable
oreach
random
variable
isextremely
Gaussian
distributed,
but
each
de- that
proceed
it.
This
assumption
ensures
that
the
resulting
ndom The
variable
isapproach
Gaussian
distributed,
but
each de(a)
Bayesian
Estimates
(b)
Random
Estimates
RB
is
scalable.
For
example,
undirected
graphical
model
structures
in
Gene
Regulatory
best
presented
in
Linear
Gaussian
Bayesian
Networks,
where
tion
because
each
variable
is
only
dependent
upon
variables
tion
each
variable
is
only
dependent
upon
variables
makes repeatedly
evaluating
the
simulationone
engine
costly.
This
workbecause
exploresdering
variables
in
ascending
order
based
on
their
assigned
score.
combination
of
its
parents.
Therefore,
can
clearly
learn
may
produce
a
di↵erent
joint
probability
distriburequirement
without
a
pendent
random
variable’s
Gaussian
distribution
is
a
linear
random
variable’s
Gaussian
distribution
is
a
linear
graph
structure
is
a
valid
Bayesian
Network,
but
requires
the
M.
Gustafsson
et.
al

used
RB
methods
to
learn
large
each random
variable
is Gaussian
distributed,
but
each
dethat
proceed
it.
This
assumption
ensures
that
the
resulting
Networks.
Microarray
data
generally
contains
thousands
of
this
issue
by
automatically
discovering
the
simulation’s
internal
that
proceed
it.
This
assumption
ensures
that
the
resulting
the
structure
for
a
Linear
Gaussian
Bayesian
Network
by
tion
because
each
variable
is
only
dependent
upon
variables
Figure
1. Compares
Bayesian on
GGM’s
error
againstaaWishart
[0,
1] uniform
distribution’s
Fan
Li

builds
this
approach
and
removes
the
order
combination
of
its
parents.
Therefore,
one
can
clearly
learn

applies
prior
to
the
precisi
ation
of
its
parents.
Therefore,
one
can
clearly
learn
method
to
search
the
set
of
all
possible
orders
to
determine
pendent
random
variable’s
Gaussian
distribution
is
a
linear
graph
structure
is
a
valid
Bayesian
Network,
but
requires
the
undirected
graphical
model
structures
in
Gene
Regulatory
input and
output
dependencies
from
~20
Gigabytes
of
E+
simulation
data,
graph
structure
is
a
valid
Bayesian
Network,
but
requires
the
random
variables
and
very
few
samples.
In
that
particular
error
on
estimating
FG
building
parameters.
it
shows
how estimates
that
proceed
it.
This
assumption
ensures
that
the
resulting
performing
a
series
of
linear
regressions
with
feature
selecthe
structure
for
a
Linear
Gaussian
Bayesian
Network
by
requirement
without
assumptions.
Fan
Li
Lasso
regression
to
perform
feature
selec
cture
for
a
Linear
Gaussian
Bayesian
Network
by
its ~200
parents.
Therefore,
can clearly
learn
method
to of
search
the
set et.
of all
possible
orders
to issue
determine
the
best
structure.
A.
Dobra
al

this
future combination
extensions
willofuse
Terabytes
of E+ one
simulation
data.
The
model
isgraph
align
with
test
building
parameter
values.
Networks.
Microarray
data
generally
contains
thousands
of
method
to
search
the
set
all
possible
orders
to
determine
structure
is astructure
valid
Bayesian
Network,
but
requires
the matrix, and uses
research,
the
algorithm
for
building
the
graph
is
performing
a
series
of
linear
regressions
with
feature
selecis
formulated
as
the
following
under
tion.

applies
a
Wishart
prior
to
the
precision
ing avalidated
series
of
linear
regressions
with
feature
selec- Network
the
best
structure.
A.
Dobra
et.
al

this
issue
prior
to
the
precision
matrix
allows
the
m
the
structure
for
a
Linear
Gaussian
Bayesian
by
by
inferring
building
parameters
for
E+
simulations
with
ground
truth
by
scoring
the
individual
variables
and
greedily
ordering
the
the
best
structure.
A.
Dobra
et.
al

this
issue
random
variables
and
very
few
samples.
In
that
particular
s are The
methodLasso
tocoefficients
search
the
set
of
all
possible
orders
to
determine
tion.
fairly
straightforward.
If
the
regression
are
nonRB
approach
is
extremely
scalable.
For
example,
regression
to
perform
feature
selection.
Applying
the
by scoring
the
individual
variables
andprior
greedily
ordering
minimizef
+ g(z)regressions
(10) accurately
building
parameters.
results
indicate
that the model
represents
performing
a Our
series
of (x)
linear
with
feature
selecBayesian
Parameter
Estimates
Bayesian
Parameter the
Estimates
the
distribution
to
the
regression
variables
in
ascending
order
based
on
their
assigned
score.
ution
by
scoring
the
individual
variables
and
ordering
the
the
best
structure.
A.
Dobra
et.
al

this Variables
issue27 to 52
The
RB
approach
is
extremely
scalable.
For
example,
research,
the
algorithm
for
building
the
graph
structure
is
Variables
1 to 26 greedily
zero,
then
there
exists
a
dependency
between
the
response
M.
Gustafsson
et.
al

used
RB
methods
to
learn
large
in
ascending
order
based
on
their
assigned
score.
parameter
means
withx some
deviation
from
the
means,
but does not support variables
RB
isconstraint
extremely
scalable.
For
example,
tion.
prior
to
the
precision
matrix
allows
the
method
to
propagate
le to approach
with
the
z
=
0,
where
g
is
an
indicator
funcFan
Li

builds
on
this
approach
and
removes
the
order
enables
the
Lasso
regression
method to
variables
in ascending
order
based
on
their
assigned
score.
M.straightforward.
Gustafsson et. alIfthe
used
RB methods
to learn
largenonby
scoring
the
individual
variables
and
greedily
ordering
the
fairly
regression
coefficients
are
Estimate
Estimate
1.5 Li 
1.5
1.5 removes the order
1.5
inferring
parameter
values
that
exist
ongto
the
distribution’s
tail.
Fan
builds
oncoefthis approach
and
undirected
graphical
model
structures
in
Gene
Regulatory
variable
and
the
predictors
with
a
non-zero
regression
tafsson
et.
al

used
RB
methods
learn
large
The
RB
approach
is
extremely
scalable.
For
example,
tion.
Using
an
indicator
function
for
allows
to
the
prior
distribution
to
the
regression
coefficients,
which
Actual
Actual
undirected
graphical
model
structures
in
Gene
Regulatory
requirement
without
assumptions.
Fan
Li
Fan
Li

builds
on
this
approach
and
removes
the
order
variables
in
ascending
order
based
on
their
assigned
score.
feature
selection.
In
applying
zero,
then
there
exists
a
dependency
between
the
response
requirement
without
assumptions.
Fan
Li
represent
the original
convex
optimization
constraints,
and
Microarray
data
generally
contains
thousands
of
M.
Gustafsson
et.
al

used
RB
methods
tothousands
learn
largeof directly extracts
ed Networks.
graphical
model
structures
in
Gene
Regulatory
ficient.
While
therequirement
method
an
undirected
1
1
1method to perform Bayesian
1
enables
the
Lasso
regression
Networks.
Microarray
data
generally
contains
Fan
Li

builds
on
this
approach
and
removes
the
order
without
assumptions.
Fan
Li

applies
a
Wishart
prior
to
the
precision
matrix,
and
uses
allows
Fan
Li
to
prove
all
possible
resul
Problem
Formulation
the
x and
z = 0 constraint
guarantees
that
the xathat
minimizes

applies
a
Wishart
prior
to
the
precision
matrix,
and
uses
variable
the
predictors
with
non-zero
regression
coefundirected
graphical
model
structures
in
Gene
Regulatory
ks. Microarray
data
generally
contains
thousands
ofdependency
random
variables
and
very
few
samples.
In
that
particular
conditional
graph,
it
may
not
be
possible
to
exrandom
variables
and
very
few
samples.
In
that
particular
feature
selection.
Infeature
applying
the
prior
to
⌦
without
Fan
Li
f (x) obeys the original constraints.
 applies
a requirement
Wishart
prior
to
the
precision
matrix,
and
0.5
0.5 assumptions.
0.5 uses
0.5 lets
Lasso
regression
to
perform
feature
selection.
Applying
the
has
Lasso
regression
to
perform
selection.
Applying
the
works
encode
the
same
GGM
joint
pro
Networks.
Microarray
data
generally
contains
thousands
of
ficient.
While
the
method
directly
extracts
an
undirected
variables
and
very
few
samples.
In
that
particular
the
algorithm
forsolution
building
the
graph
structure
isdistribution
research,
the
algorithm
for building
the
graph
structure
is prior
While
 used
this
general
format
to solve
o re-research,
Learning
tract
an many
overall
joint
from
the
resulting
graph.

applies
a Wishart
prior
toprove
the
precision
matrix,
and
uses
Lasso
regression
to
perform
feature
selection.
Applying
the
allows
Fan
Li
to
all
possible
resulting
Bayesian
Netto
the
precision
matrix
allows
the
method
to
propagate
prior
to
the
precision
matrix
allows
the
method
to
propagate
of
variable
order.
This
means
random
variables
and
very
few
samples.
Inisbe
that
particular
0
5
10
15
20
25
26
31
36
41
46
51 that th
di↵erent
convex
optimization
problems,
we
are
onlynot
focus,can
the
algorithm
for
building
the
graph
structure
conditional
dependency
graph,
it
may
possible
to
exfairly
straightforward.
If
the
regression
coefficients
are
nonfairly
straightforward.
If
the
regression
coefficients
are
nonLasso
regression
to perform
feature
selection.
Applying
the
prior
to
the
precision
matrix
allows
the
method
to
propagate
Variables
Variables
ollowing under
A.
Dobra
et.
al

focuses
on
recovering
the
factorized
the
prior
distribution
to
the
regression
coefficients,
which
works
encode
the
same
GGM
joint
probability,
regardless
E+
Building
Specification
the
prior
distribution
to
the
regression
coefficients,
which
ing
on
the
version
used
to
solve
Lasso
regression.
The
disresearch,
the
algorithm
for
building
the
graph
structure
is
SecMAP
estimate,
which
has
very
appeali
raightforward.
If the
regression
are
nonthen
there
exists
a coefficients
dependency
between
the the
response
Estimate
the
tract
an
overall
joint
distribution
from
the
resulting
graph.
zero,zero,
then
there
exists
a
dependency
between
the
response
Parameters
X
prior
to
the
precision
matrix
allows
the
method
to
propagate
prior
distribution
to
the
regression
coefficients,
which
enables
the
Lasso
regression
method
to
perform
Bayesian
tributed
optimization
steps
for solving
large
scale linear
Lasso
graph’s
joint
distribution.
assumes
that
the
overall
joint
(a)
(b)
of
variable
order.
This
means
that
the
computed
⌦
is
a
enables
the
Lasso
regression
method
to
perform
Bayesian
inimizef
(x)
+
g(z)
(10)
joint
probability
fairly
straightforward.
If
the
regression
coefficients
are
nonen there
exists
a
dependency
between
response
variable
and
the
predictors
with
a
non-zero
regression
coef1 the
as
avoiding
overfitting.
In
th
A.regression
Dobra
et.predictors
alare
focuses
on
recovering
the
factorized
and problems
the
with
a non-zero
regression
coefthe
prior
distribution
to
the
regression
coefficients,
which
presented
below
.
feature
selection.
In
applying
the
prior
to
⌦
lets
enables
the
Lasso
regression
method
to
perform
Bayesian
usingvariable
distribution:
distribution
is
sparse
and
represented
by
a
Gaussian
with
feature
selection.
In
applying
the
prior
to
⌦
lets
MAP
estimate,
which
has
very
appealing
properties
such
zero,
then
there
exists
a
dependency
between
the
response
ficient.
While
the
method
directly
extracts
an
undirected
and
the
predictors
with
a
non-zero
regression
coefz
=
0,
where
g
is
an
indicator
funcObserved
Output
Data
/
RealBayesian
Parameter
Estimates
Bayesian Parameter
Estimates
point
way
to
transform
the
Lasso
regression
f
enables
the
Lasso
regression
method
toresulting
perform
Bayesian
ficient.variable
While
the
method
directly
extracts
an
undirected
allows
Fan
Li
to
prove
all
possible
Bayesian
Netgraph’s
joint
distribution.

assumes
that
the
overall
joint
P(X,Y
)
feature
selection.
In
applying
the
prior
to
⌦
lets
1
⇢
2
k
k
2
k+1
Variables
53 topossible
78
Variables
79
to 104 introduces a
world
Sensor
data,
Weather
and
the
predictors
with
a
non-zero
regression
coefallows
Fan
Li
to
prove
all
resulting
Bayesian
Netator
function
for
g
allows
to
conditional
dependency
graph,
it
may
not
be
possible
to
exN
(0,
⌃).
This
type
of
Markov
Network
is
called
a
Gaussian
as
avoiding
overfitting.
In
the

||A
x
b
||
+
||x
z
+
u
||
)
(11)
x
=
argmin
(
i
i
i
i
While
the
an
undirected
tra2
i 2
i method directly extracts
feature
selection.
In
applying
the
prior
to
⌦
lets
works
encode
the
same
GGM
joint
probability,
regardless
Data,
Operation
Schedule
Y
2
2
SVM
regression
problem,
where
the
solut
conditional
dependency
graph,
it
may
not
be
possible
to
exx
allows
Fan
Li
to
prove
all
possible
resulting
Bayesian
Netdistribution
is
sparse
and
represented
by
a
Gaussian
with
convex
optimization
constraints,
and
Estimate probability,
Estimate 1.5
ficient.
While joint
the
method
directly
extracts
an undirected
Im- dependency
tract
an overall
distribution
from
theexresulting
graph. and
works
encode
the
same
GGM
joint
regardless
1.5
1.5
1.5
way
to
transform
the
Lasso
regression
formulation
into
an
Graphical
Model
(GGM),
it
is
assumed
that
the
joint
nal
graph,
it
may
not
be
possible
to
allows
Fan
Li
to
prove
all
possible
resulting
Bayesian
Netk+1
k+1
k
of
variable
order.
This
means
that
the
computed
⌦
is
a
Actual
Actual
nt
guarantees
thatoverall
the
x that
minimizes
works
encode the same GGM joint probability,gression
regardless
tract
an
joint
distribution
from
thenot
resulting
graph.
problem
is
also
the
solution
to
t
z
=
S
(x̄
+al
ūof)
(12)
abilN
(0,
⌃).
This
type
Markov
Network
is
called
a
Gaussian
A.
Dobra
et.
focuses
on
recovering
the
factorized
conditional
dependency
graph,
it
may
be
possible
to
exof
variable
order.
This
means
that
the
computed
⌦
is
a
joint
distribution
from
the
resulting
graph.
SVM
regression
problem,
where
the
solution
to
the
SVM
reprobability
is
represented
by
the
following:
MAP
estimate,
which
has
very
appealing
properties
such
aloverall
constraints.
works
encode
the
same
GGM
joint
probability,
regardless
1
1
1
1
has A.graph’s
ofgraph.
variable
order. This means that the computed
⌦ is This
a
Dobra
et.
al

focuses
on
recovering
the
factorized
k+1
k overall
k+1
k+1
problem.
transformation
allows
th
joint
distribution.

assumes
that
the
overall
joint
Graphical
Model
(GGM),
and
it
is
assumed
that
the
joint
tract
an
joint
distribution
from
the
resulting
Inference
u
=
u
+
x
z
(13)
general
solution
format
to
solve
many
obra
on recovering the factorized
MAP
estimate,
which
has
very
appealing
properties
such
i
i
as
avoiding
overfitting.
In
the

introduces
a
dif- et. ali  focuses
of
variable
order.
This
means
that
the
computed
⌦
is
a
gression
problem
is
also
the
solution
to
the
Lasso
regression
MAP
estimate,
which
has
very
appealing
properties
such
assumes
that
all
observations
are
independent
and
identinot
feasible
to
specify
the
single
best
graph
structure
in
0.5
0.5
0.5 rando
graph’s
joint
distribution.

assumes
that
the
overall
joint
distribution
is
sparse
and
represented
by
a
Gaussian
with
Estimated
Optimize
over
the
joint
A.
Dobra
et.
al

focuses
on
recovering
the
factorized
mization
problems,
we
are
only
focus1
1
nonlinear
dependencies
among
the
probability
is
represented
by
the
following:
T overfitting.
rob- distribution.
New
Observation
joint

assumes
that
the
overall
joint
way
to
transform
the
Lasso
regression
formulation
into
an
as
avoiding
In
introduces
a
MAP
estimate,
which
has
very
appealing
properties
such
cally
distributed.
Eq.
1
may
be
rewritten
as:
vance.
Therefore,
algorithmic
techniques
must
be used
to the 
The
individual
local
subproblems
are
solved
using
ridge
reproblem.
This
transformation
allows
the
method
to detect
Building
probability
distribution
:
exp(
(x
&micro;)
⌦(x
&micro;))
(6)
as
overfitting.
In
the

introduces
a
ed todistribution
solve
regression.
The
NLasso
(0, ⌃).
This
type
of Markov
Network
is
called
a
n Gaussian
1avoiding
Y
, sparse
Y , …,
Y disis
and
represented
by
a
Gaussian
with
graph’s
joint
distribution.

assumes
that
the
overall
joint
tting
find the2 best
graph
structure
matches
true joint
N Gaussian
ever,
exploring
nonlinear
dependencies
SVM
regression
problem,
where
solution
to
the
Parameters
X
gression,
and
the
global
z
values
are
computed
by
evaluating
52
57 that
62
67 In
73
78 the
78 introduces
83 into
88SVM
93 re- 98
103
tion
is
sparse
and
represented
by
a
with
as
avoiding
overfitting.
the

a
2
2
way
to
transform
the
Lasso
regression
formulation
an
⇧
P
(X,
Y
)
(2)
i
(2⇡)
|⌃|
i=1
steps
for
solving
large
scale
linear
Lasso
1
1
nonlinear
dependencies
among
the
random
variables.
Howway
to
transform
the
Lasso
regression
formulation
into
an
ationN (0,
Graphical
Model
(GGM),
and
it
is
assumed
that
the
joint
probability
distribution
without
overfitting
the
observed
trainT
Variables
Variables
distribution
is
sparse
and
represented
by
a
Gaussian
with
⌃).
This
type
of
Markov
Network
is
called
a
Gaussian
1
a soft
thresholding
function
S. This
function
is
defined
as
gression
problem
is Lasso
also
the
solution
to to
the
Lasso
regression
exp(
(x
&micro;)
⌦(x
&micro;))
(6)
.teraThis
type
of
Markov
Network
is
called
a
Gaussian
future
work,
because
naively
implement
are
presented
below
.
way
to
transform
the
regression
formulation
into
an
ing
samples.
There
are
three
categories
of
algorithms
that
which
is
the
joint
probability
distribution.
Conversely,
the
SVM
regression
problem,
where
the
solution
the
SVM
ren
1
probability
is
represented
by
the
following:
SVM
regression
problem,
where
the
solution
to
the
SVM
reever,
exploring
nonlinear
dependencies
via
this
method
is
follows:
(c)
(d)
N
(0,
⌃).
This
type
of
Markov
Network
is
called
a
Gaussian
2
2
2
Graphical
Model
(GGM),
and
it
is
assumed
that
the
joint
(2⇡)
|⌃|
problem.
This
transformation
allows
the
method
to
detect
this
problem.
The
first
method
is
Score
and
Search
E+
approximation
process requires
maximizing
the followwhere
⌃
is
the
covariance
matrix,
&micro;
is
the
mean,
and
⌦
al Model
(GGM),
and
it
is
assumed
that
the
joint
tion.
SVM
regression
problem,
where
the
solution
to
the
SVM
reour
data
will
produce
non-sparse
GGM
⇢
gression
problem
is
also
the
solution
to
the
Lasso
regression
2
k
k 2
Structure
Learning
gression
problem
is
also
the
solution
to
the
Lasso
regression
(Section
3.1),
and
is
generally
used
to
learn
Bayesian
Neting
posterior
probability:
Model
(GGM),
and
it is assumed
that the joint
future
work,
because
naively
implementing
the
method
on
Arelabi ||2 +Graphical
||xi isz represented
+u
i xi probability
1
i ||2 )1 (11)
nonlinear
dependencies
among
the
random
variables.
Howby
the
following:
T
Bayesian
Parameter
Estimates
Bayesian Parameter Estimates
ity is represented
thev following:
gression
problem
is also
the
solution
to the
Lasso
regression
max(0,
) max(0,
v &micro;)
)⌦(x
(14) &micro;))
is
the
inverse
covariance
matrix
or
precision
matrix.
The
2S (v) = by
work
structures.
The
second
approach
is
Constraint
Based
exp(
(x
(6)
While
the
method
presented
by
Fan
L
N
problem.
This
transformation
allows
the
method
to
detect
Variables
105 to 130
Variables 131
to 151
n
1
problem.
This
transformation
allows
the
method
to
detect
where
⌃
is
the
covariance
matrix,
&micro;
is
the
mean,
and
⌦
⇢N
⇢N
probability
is
represented
by
the
following:
⇧i=1
P
(Y
|X)P
(X)
(3)
utii
our
data
will
produce
non-sparse
models.
ever,
exploring
nonlinear
dependencies
via GGM
this
method
is
2 approach learns a Bayesian
Model
(GGM)
(Section 3.2)
methods,
which
are
used
to learn
Bayesian
and
2 |⌃|
2
k Gaussian Graphical
(2⇡)
problem.
This
transformation
allows
the
method
to
detect
Network
that
is
converted
to
a
+
ū )
(12) T 1
1
, the
nonlinear
dependencies
among
the
random
variables.
Howestimate
the
GGM
that
governs
1
1
nonlinear
dependencies
among
the
random
variables.
HowEstimate
Estimate the join
T
Markov
Networks.
Thework,
final
method
uses
strong
distribution
where
X represents
the
inputs
and
observed
operation schedis the
covariance
matrix
or
precision
matrix.
The
future
because
naively
implementing
the
method
on
Theinverse
soft exp(
thresholding
function
applies
the
Lasso
regression
While
the
method
presented
by
Fan
Li

can
efficiently
exp(
(x
&micro;)
⌦(x
&micro;))
(6)
1
1
nonlinear
dependencies
among
the
random
variables.
How(x
&micro;)
⌦(x
&micro;))
(6)
Actual
Actual
T
data
n
1
n
1
assumptions
combined
withand
regression
based
feature
selecule,
and
Yi z,represents
amatrix,
single
output
sample.
Thisthe
apGGM
by
using
regression
coefficients
the
variance
k+1
where
⌃
is
the
covariance
&micro;
is
the
mean,
and
⌦
ever,
exploring
nonlinear
dependencies
viaE+
this
method
is
sparsity
constraints
over
which
are
incorporated
into
the
1.2
ever,
exploring
nonlinear
dependencies
via
this1.2
method
is
exp(
(x
&micro;)
⌦(x
&micro;))
(6)
the
random
variables,
it
is
not
clear
2
2
our
data
will
produce
non-sparse
GGM
models.
z
2 |⌃|
2Bayesian
2 |⌃| 2 learns
n(13) 1
approach
a
Network
that
is
converted
to
a
(2⇡)
(2⇡)
tion
methods
to
determine
dependencies
among
the
random
estimate
the
GGM
that
governs
the
joint
distribution
over
thm
ever,
exploring
nonlinear
dependencies
via
this
method
is
proximation
method
requires
finding multiple
samples that
1
1
2
2
2
(2⇡)
|⌃|
local
subproblem
solutions
on
the
next
optimization
iterais
the
inverse
covariance
matrix
or
precision
matrix.
The
for
each
regression
to
compute
the
precision
matrix
⌦.
The
0.7
0.7
future
work,
because
implementing
the
method
on
future
work,
because
naively
implementing
the
method
on
While
the
method
presented
by
Fan
Li

can
efficiently
0.8 naively
0.8
variables
(Section
3.3).
maximize
the
posterior
probability,
rather
than
a
single
ason
the
E+
data
set,
which
has
many
com0.6
0.6
GGM
by
using
the
regression
coefficients
and
the
variance
future
work,
because
naively
implementing
the
method
on
subproblems
are solved using ridge retion.
the
E+
random
variables,
not
clear
how
it
will
perform
0.6
0.6 it is
0.5
0.5
approach
learns
a Bayesian
Network
that
is
converted
to
a
⌃ggreis
the
covariance
matrix,
&micro;
is
the
mean,
and
⌦
signment.
However,
forward
approximating
E+
using
this
where
is
the
covariance
matrix,
u
is
the
mean,
and
is
the
inverse
where
⌃
is
the
covariance
matrix,
&micro;
is
the
mean,
and
⌦
0.4
0.4
estimate
the
GGM
that
governs
the
joint
distribution
over
our
data
will
produce
GGM
models.
our
data
will
produce
non-sparse
GGM
precision
matrix
is
recovered
by
the
following
computation:
0.4 non-sparse
0.4 models.
where
⌃
is
the
covariance
matrix,
&micro;
is
the
mean,
and
⌦
al zfor
values
are
computed
by
evaluating
0.3
0.3 to wo
Theregression
behind
this particular
Lasso
regression
variables.
The
method
is
intended
our
data
will
produce
non-sparse
GGM
models.
each
to
compute
the
precision
matrix
⌦.
The
method
is
computationally
slower
than
using
statistical
sur0.2
0.2
on
the
E+
data
set,
which
has
many
more
samples
than
GGM
by
using
the
regression
coefficients
and
the
variance
matrix
or
precision
matrix
ation
nverse
covariance
matrix
or
precision
matrix.
The
3.
STRUCTURE
LEARNING
the
E+
random
variables,
it
is
not
clear
how
it
will
perform
iscovariance
the
inverse
covariance
matrix
or
precision
matrix.
The
While
the
method
presented
by
Fan
Li

can
efficiently
unction
S.
This
function
is
defined
as
formulation
isrogate
that the
main,Therefore,
computationally
demanding,
While
the
method
presented
by
Fan
Li

can
efficiently
models.
we areor
only
focusing onmatrix.
estimat- The
is
the
inverse
covariance
matrix
precision
104
109
114
119
124
129
130
135
139 that
144 contain
149
154a large n
While
the
method
presented
by
Fan
Li

can
efficiently
T
1
ple
size
data
sets
precision
matrix
is
recovered
by
the
following
computation:
aster
for
each
regression
to
compute
the
precision
matrix
⌦.
The the
variables.
The
method
is
intended
to
work
with
small
samh learns
aapproach
Bayesian
that
is
converted
to
a
Variables
Variables
on
the
E+
data
set,
which
has
many
more
samples
than
ingaNetwork
building
parameters
andmethod
only
present
this
capability
for
step
is solved
exactly
once.
The
direct
for
comput⌦
=
(1
)
(1
)
(7)
approach
learns
Bayesian
Network
that
is
converted
to
a
estimate
GGM
that
governs
the
joint
distribution
over
learns a Bayesian Network
that
a estimate
the
GGM
that
governs
the
joint
distribution
over
com- Dobra
estimate
the
GGM
that
governs
the
joint
distribution
over
k+1
T
1 is converted to 3.1
Score
and
Search
completeness.
such
as
microarray
data.
While
priors
ar
precision
matrix
is
recovered
by
the
following
computation:
ing
x
requires
computing
the
matrix
(A
A
+
⇢I)
.
The
et.
al
(2004)
and
Li
et.
al
(2006)
variables.
The
method
is
intended
to
work
with
small
samy
using
the
the
variance
i regression coefficients and
T
1
ple
size
data
sets
that
contain
a
large
number
of
features,
(e)
(f)
the E+ Score
random
variables,
it variables,
is not
clear
how
itclear
will
perform
by using
the
regression
coefficients
and
the
variance
y.
GGM
by
using
the
regression
coefficients
and
the
variance
and
Search
(S&amp;S)
algorithms
try
to
search
through
, v InGGM
) resulting
max(0,
v
)
(14)
the
E+
random
variables,
it
is
not
clear
how
it
will
perform
⌦
=
(1
)
(1
)
(7)
the
E+
random
it
is
not
how
it
will
perform
matrix
never
changes
throughout
the
entire
optiTmatrix
1
where
represents
an
upper
triangular
matrix
with
zero
difor
problems
with
more
features
than
o
⇢N
⇢N PROBABILISTIC
ple
size
data
sets
that
contain
a
large
number
of
features,
to
compute
the
precision
⌦.
The
the
space
of
all
possible
models
and
select
the
best
seen
e regression
Al-for
such
as
microarray
data.
While
priors
are
generally
reserved
2.
GRAPHICAL
MODEL
on
the
E+
data
set,
which
has
many
more
samples
than
=
(1
)allows
(1
) matrix
(7) on on
forregression
each
regression
to
compute
thethe
precision
⌦. The
each
to ⌦compute
the
precision
matrix
⌦. The
mization
process.
Storing
this
result
distributed
the
E+
data
set,
which
has
many
more
samples
than
the
E+
data
set,
which
has
many
more
samples
than
Figure
2.
Bayesian
GGM’s
parameter
estimates
compared
against
the
actual
model.
The
best
model
is
selected
by
using
a
global
critesuch
as
microarray
data.
While
priors
are
generally
reserved
n
matrix
is
recovered
by
the
following
computation:
agonals
the
non-zero
elements
represent
the
regression
(e.g.,
microarray
data
sets),
our
E+
data
Learning
the
true
joint
probability
distribution
directly
in and
gover
function
applies
the Lasso
regression
optimization
method
to
perform
aby
very
computationally
in- diagonals
variables.
The
method
is
intended
to
work
with
small
samwhere
represents
an
upper
triangular
matrix
with and
zero
the
where
represents
an
upper
triangular
matrix
with
zero
difor
problems
with
more
features
than
observation
samples
precision
matrix
is
recovered
by
the
following
computation:
precision
matrix
is
recovered
the
following
computation:
values
on
300
randomly
sampled
MO2
simulations.
ria function,
such parameter
as The
theThe
likelihood
function.
However,
the
variables.
method
is
intended
to
work
with
small
samvariables.
method
is
intended
to
work
with
small
samsver
be-z, which
k+1
the
general
case
is
computationally
intractable,
especially
1
where
represents
an
upper
triangular
matrix
with
zero
difor
problems
with
more
features
than
observation
samples
are
incorporated
into
the
tensive
once
and
reduce
all
future
x
computations
most
common
criteria
function
is the a
Bayesian
Information
i coefficients,
and
represents
a
diagonal
matrix
containas
well.
Our
E+
simulations
generate
non-zeroand
elements
represent
theaelements
regression
coefficients,
and
represents
Tnon-zero
1 with
ple
size
data
sets
that
contain
large
number
of
features,
agonals
the
represent
the
regression
(e.g.,
microarray
data
sets),
our
E+
data
set
requires
priors
when
dealing
large
number
of
continuous
random
T
1
eutions
folple
size
data
sets
that
contain
a
large
number
of
features,
k+1
⌦
=
(1
)
(1
)
(7)
T
1
on agonals
the
next
optimization
iteraple
size
data
sets
that
contain
a
large
number
of
features,
andvariables.
the
non-zero
elements
represent
(e.g.,
steps. Caching
the
values
used
to
compute
xthe
to)probability
disk the regression
Criteria (BIC)
: microarray data sets), our E+ data set requires priors
⌦
=
(1
)
(1
(7)
ieach
1
In
general,
it
is
assumed
joint
fac⌦
=
(1
)
(1
)
(7)
a
diagonal
matrix
containing
the
variances
for
performed
regression.
such
as microarray
data.
WhileOur
priors
are
generally
reserved
a reing
the
variances
for
each
performed
regression.
There
are
vectors
per
input
parameter
set,
which
le
1represents
such
as
microarray
data.
While
priors
are
generally
reserved
coefficients,
and
a
diagonal
matrix
containas
well.
E+
simulations
generate
30,540
output
data
allows
a
2.2GHz
Intel
Core
i7
laptop
to
solve
a
univariate
coefficients,torizes
and into several
represents
diagonalcomputational
matrix contain- such as
well. Our E+
simulations
generate
30,540reserved
output data
as microarray
data.
While priors
are generally
smaller moreamanageable
log
M
,behind
the
this3.9GB
particular
Lasso
regression
represents
an
upper
triangular
matrix
with
zero
difor
problems
with
more
features
than
observation
samples
Lasso
regression
problem
in
approximately
17
min⇤
Dim[G]
(4)
BIC(Data;
G)
=
L(Data;
G,
✓)
+
Pros:
Scalable
for
a
large
number
of
random
variables
where
represents
an
upper
triangular
matrix
with
zero
difor
problems
with
more
features
than
observation
samples
problems.
Factorization
is
based
purely
on
conditional
inmethods
for
learning
⌦
from
the
data
directly,
either
by
imbalance
between
output
and
input
ob
ing
the
variances
for
each
performed
regression.
There
are
vectors
per
input
parameter
set,
which
to
a
significant
ing
the
variances
for
each
performed
regression.
There
are
vectors
per
input
parameter
set,
which
to
a
significant
2
where
represents
an
upper
triangular
matrix
with
zero
difor
problems
with
more
features
than
observation
samples
ight.
he
main,
computationally
demanding,
utes.
dependence
and
is
represented
using
a
graphical
model
G,
1
and
themethods
non-zero
elements
represent
the
regression
(e.g.,
microarray
data
sets),
our
E+
data
set
requires
priors
agonals
and
the
non-zero
elements
represent
the
regression
(e.g.,
microarray
data
sets),
our
E+
data
set
requires
priors
for
learning
⌦
from
the
data
directly,
either
by
imbalance
between
output
and
input
observations.
For ex- produce
estimating
⌃
and
computing
⌃
.
The
inverse
operation
ample,
299
E+
simulations
appr
robmethods
for
learning
⌦
from
the
data
directly,
either
by
imbalance
between
output
and
input
observations.
For
exProblem
for
the
E+
Data
where
Dim[G]
represents
the
number
of
parameters
estionce.
The direct
for
computwhere
G is aSet
graph
with
V
nodes
representing
random
variagonals
and
the
non-zero
elements
represent
the
regression
(e.g.,
microarray
data
sets),
our
E+
data
set
requires
priors
1 method
1
11diagonal
nts,
and
represents
matrix
containas
well.
Our
E+
simulations
generate
30,540
output
data
T
1 aE diagonal
3operation
coefficients,
and
represents
a
matrix
containas
well.
Our
E+
simulations
generate
30,540
output data
slave
mated
within
the
graph,
M
is
the
total
number
of
samples,
ables
and
edges.
The
edges
in
E
represent
conditional
estimating
⌃
and
computing
⌃
.
The
inverse
ample,
299
E+
simulations
produce
approximately
3.9GB
of
1
1
mputing
the
matrix
(A
A
+
⇢I)
.
The
4.2
Bayesian
Regression
GGM
Learning
is
a
user
defined
hyperparameter.
Using
the
above
prior
final
estimate
is
computed
using
the
following
equation:
requires
an
O(V
)
runtime,
where
V
is
the
total
number
data.
In
the
simulation
input
p
is
a
user
defined
hyperparameter.
Using
the
above
prior
final
estimate
is
computed
using
the
following
equation:
estimating
⌃
and
computing
⌃
.
The
inverse
operation
ample,
299
E+
simulations
produce
approximately
3.9GB
of
coefficients,
and
represents
a
diagonal
matrix
containas
well.
Our
E+
simulations
generate
30,540
output
data
i
3optiandper
L is input
the
the log-likelihood
function.
dependencies
between
the
random
variables
in
the total
graph
G.number
and some
derivations,
Fan Li There
derived
the folvariances
for
each
performed
regression.
are
vectors
parameter
set,
which
to
a
significant
er
changes
throughout
the
entire
ing
the
variances
for
each
performed
regression.
There
are
vectors
per
input
parameter
set,
which
to
a
significant
E+
Building
Parameters
X
requires
an
O(V
)
runtime,
where
V
is
the
data.
In
the
simulation
input
parameter
space
Fan
Li

derived
the
fol3
X
Unlikederivations,
the direct
methods,
the
Bayesian
approach
assumes
1
lowing
maximum
a
posteriori
(MAP)
distributions
for
all
M
are
used toprobfind
the
best
ˆS&amp;S
The
graph
structure
is intended
to
represent
thethe
true
factorof
random
variables,
which
is
not
scalable
to
larger
resents
a
complex
high
dimensional
stat
requires
an
O(V
)
runtime,
where
V
is
total
number
data.
In
the
simulation
input
parameter
space
repX
= These
P (✓ methods
) per
(19)generally
ing
the
variances
for
each
performed
regression.
There
are
vectors
input
parameter
set,
which
to
a
significant
1
1
oring
this
result
allows
the
distributed
lowing
maximum
a
posteriori
(MAP)
distributions
for
all
1either
and
allthe
: data
M
s for learning
⌦
from
directly,
either
by
imbalance
between
output
and
input
observations.
For
exSample
ID
Pfor
P ⌦
… from
P matrix.
a
global
Wishart
prior
the
precision
Using
this
methods
for
learning
the
data
directly,
by
imbalance
between
output
and
input
observations.
ex- furˆ
of
random
variables,
which
is
not
scalable
to
larger
probresents
a
complex
high
dimensional
state
space,For
which
structure
for
Bayesian
Networks,
because
the
log-likelihood
=
P
(✓
)
(19)
ization
for
the
joint
probability
distribution
by
splitting
it
j
j
1
i
generate
and
all random
:a veryfor
One Observed
Output
Y
Mby
1
to
perform
computationally
in1
lems.
More
importantly,
directly
learning
⌦
via

only
ther
complicates
the
problem.
The
inpu
of
variables,
which
is
not
scalable
to
larger
probresents
a
complex
high
dimensional
state
space,
which
fur1 learning
Min(V
)
avg
avg
methods
⌦
from
the
data
directly,
either
imbalance
between
output
and
input
observations.
For
exglobal
prior
over
the
precision
matrix
allowed
Fan
Li

function
factorizes
into
a
series
of
summations.
This
factorj=1
into
simpler
components.
These
types
of
models
have
been
P
(
|
,
D)
/
ng
⌃
and
computing
⌃
.
The
inverse
operation
ample,
299
E+
simulations
produce
approximately
3.9GB
of
estimating
⌃
and
computing
⌃
.
The
inverse
operation
ample,
299
E+
simulations
produce
approximately
3.9GB
of
lems. xMore
importantly,
directly
learning
⌦where
via
only
the problem. The input parameter space
k+1
M is 
the total
number of samplesther
and ˆ complicates
is a sample
ation
P
P
P
p
d reduce alltofuture
computations
Row
ID
V
V
…
V
Max(V
) avg
avg x1estimates
ization
makes
very between
easy
to modify
an
existing Bayesian and observational skewness has lead us
i2
prove
that
the
Bayesian
approach
ainverse
e↵ectively
applied
to
many
fields,
such
asglobally
topic
3
3
computed
using
Eq.
18.
Afterward
a few1it
iterations
(x
) +guarantees
| | classification
a
global
maximum
solution
if
there
are
sufficient
estimating
⌃
and
computing
⌃
.
The
operation
ample,
299
E+
simulations
produce
approximately
3.9GB
of
lems.
More
importantly,
directly
learning
⌦
via

only
ther
complicates
the
problem.
The
input
parameter
space
k+1
P
(
|
,
D)
/
i guarantees
a
global
maximum
solution
if
there
are
sufficient
and
observational
skewness
has
us
to
avoid
exploring
ˆfinal
an i O(V
)
runtime,
where
V
is
the
total
number
data.
In
the
simulation
input
parameter
space
reprobrequires
an
O(V
)
runtime,
where
V
is
the
total
number
data.
In
the
simulation
input
parameter
space
repexp(
)
where
M
is
the
total
number
of
samples
and
is
a
sample
1
values
used
to
compute
x
to
disk
estimating
s
and
s,
we
use
the
estimates
to
comi
Network
and
compute
the
modified
score
without
recom…
…
,
classification
, and
disease
diagnosis .
i 3document
precision
matrix
possible
variable
orders.
PDoptimal P
Pall
pover
N
N
2
pute
the
GGM’s
precision
(Eq.
7).In score.
withcomputed
using
Eq.
18.
Afterward
a matrix
few
iterations
between
(x
x
)
+
|
|
(16)
puting
the
entire
For
example,
anthe
edge
to an
requires
an
O(V
)
runtime,
where
V
is
the
total
number
data.
the
simulation
input
parameter
space
repfor
the
problems
However,
the
direct
approaches
such
as
,
which
will
most
likely
only
2probnirandom
ij
njavg
iwhich
ijdimensionality.
l Core
i7 observations
laptop
to
solve
a
univariate
The
graphs
used
to
represent
factorized
joint
probabilguarantees
a
global
maximum
solution
if
there
are
sufficient
and
observational
skewness
has
us
to
avoid
exploring
observations
for
the
problems
dimensionality.
However,
direct
approaches
such
as
,
which
w
of
variables,
is
not
scalable
to
larger
probresents
a
complex
high
dimensional
state
space,
which
furom
variables,
which
is
not
scalable
to
larger
resents
a
complex
high
dimensional
state
space,
which
fur299
avg
Max(V
)
n=1
j=i+1
j=i+1
(a)
(b)
That
is
to
say,
under
the
Bayesian
formulation
presented
1
exp(
)
estimating
s and
s,orwe use existing
the final Bayesian
estimates to
com- requires computing the new and
mizanetwork
ity
distributions
are
either
direct
acyclic
(DAG)
… graphs
on
problem
in
approximately
17
mini
5.precision
RESULTS
bylems.
Fan Livariables,
,
allthe
variable
orderings
should
More
importantly,
directly
learning
⌦
via

only
ther
complicates
the
problem.
The
input
parameter
space
of
random
is
not
scalable
to
larger
probresents
a
complex
high
dimensional
state
space,
which
furHow
to directly
sample
Xwhich
to
cover
the
whole
space
oftheoretically
More
importantly,
learning
⌦
via

only
ther
complicates
the
problem.
The
input
parameter
space
observations
for
problems
dimensionality.
However,
the
direct
approaches
such
as
,
which
will
most likely only
pute
the
GGM’s
matrix
(Eq.
7).
previous
conditional
probability
involving
the
child
of
the
P(
|✓ , graphs.
, D) / P (D|The, first
,(16)
✓ )P (form
|
, ✓ )P (35,040
|✓that
)
undirected
assumes
theIn graph
tting
Figure
order to assess how well the GGM estimates
building3. Figure 3(a) highlights that building parameter three has values that
building
parameters
for
all
possible
real
values?
produce
the
same
precision
matrix.
guarantees
a global
maximum
if ⌦
there
are
observational
skewness
usexploring
to
avoid exploring
new edge,
and
to
the previous
BIC
represents
a Bayesian
Network
and sufficient
the latter
form
assumes
+ 1directly
+if
N there
2i + |D|solution
ees
a global
maximum
solution
are
and
observational
skewness
has
ushas
to
avoid
More
importantly,
learning
viaparameters,
 sufficient
only
ther
complicates
the
problem.
The
input
parameter
space
we experimented
withand
three
data
sets –the
Finedi↵erence
opti-lems.
⇠
Gamma(
,
occasionally
differ
greatly from
the parameter’s mean, and Figure 3(b) compares
The Wishartthat
prior
used
by Fan
Li
defined
as W
( , TThese
). Graingraph
(FG), Model score.
Order 1 (MO1),
and Model
Order
2
2 aisMarkov
Updating
the
penalty
term
is
achieved
by
simply
the
graph
represents
Network.
5.
RESULTS
Pdimensionality.
P 1 dimensionality.
P
mmuobservations
for
the
problems
However,
the
direct
approaches
such
as
,
which
will
most
likely
only
egression
GGM
Learning
1for the
1
1
tions
problems
However,
the
direct
approaches
such
as
,
which
will
most
likely
only
log
M
(MO2).
The
first
data
set
contains
building
simulations
guarantees
a
global
maximum
solution
if
there
are
sufficient
and
observational
skewness
has
us
to
avoid
exploring
✓
+
✓
+
(x
x
)
parameters
estimated
via
a
genetic
algorithm
and
optimization.
P ( i |✓i , i ,represents
D) / P (D| aitypes
, i , ✓both
, ✓ithat
)P ( ia joint
|✓i ) probability
user
defined
and In
T order
is a)distribution
hyadding N 2 to the updated BIC score, where N is the
i )P ( assume
i | ihyperparameter
facto
assess
how
well
the
GGM
estimates
building
built from
small
brute
force
increments
across
14 building
2
ethods,
the
Bayesian
approach
assumes
Bayesian
Regression
GGM
Learning
number
ofdata
newly
estimated
torizes
according
to their
structure.
However,
a parameters.
Bayesian
perparameter
diagonal
matrix who’s
entries
are parameters,
governed
by
The MO1
data
set
contains
299
building
simu-parameters.
observations
for
the
problems
dimensionality.
However,
the
direct
approaches
such
as
,
which
will
most
likely
only
+
1
+
N
2i
+
|D|
we
experimented
with
three
sets
–
Fine
(17)
sub-⇠ Gamma(
,
Random Parameter Estimation
0.8
0.6
0.4
0.2
0.6
0.4
0.2
27 28 29 30 66 67 68 69 157 158 159 160 181
Variables
Parameter Values
Parameter Values
Parameter Estimates
Parameter Estimates
Parameter Estimates
Parameter Values
Parameter Values
2
0.8
9
Parameter Values
Parameter Estimates
1
1
0
27 28 29 30 66 67 68 69 157 158 159 160 181
Variables
Parameter Estimates
⇢N
0.9
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
9
i
Estimate
Actual
Parameter Values
0.8
0.7
0.6
0.5
0.4
0.3
0.2
0.1
0
1
Parameter Estimates
Estimate
Actual
Parameter Values
Parameter Estimates
Bayesian Parameter Estimation
N
Parameter Values
Parameter Estimates
⇢N
MO2 Building Parameter 3
FG Building Parameter 2
1
i
M
1
1
1
i
2
D P1
n=1
ni
ij
nj
1.5
1
0.5
j
1
N
j=i+1
One−Act
Zero−Act
Est
Zero−Est
Aactual
GA−Est
1
0.5
0
j=1
P1
i
j
i
156
1
1.5
Parameter Values
Parameter Value
2
2
N
j=i+1
i
1
ij
2
1
i
90
1
0
0
50
100
150
Simulation
200
250
300
−0.5
0
5
10
15
20
25
30
Simulations
35
40
45
50
i
P156
1
i
i
N
j=i+1
1
i
i
1
2
ij i
1
i
i
D
n=1
i
i
ni
1
i
1
i
N
j=i+1
i
j
nj
i
2
The
most
common
method
for performing S&amp;S within
lations and ⇠150 building
parameters.
The
building paramor for the precision
matrix.
Using assumes
this
Network
a much simpler probabilistic
dependency
the following
Grain
(FG),
Model
Order
1
(MO1),
and
Model
Order
2
2 distribution:
Maximizing P
i | i , D) with respect to i is equivalent to
eters were independently
set
to their minimum
or maximum
s precision
.PN
P
P(variables,
the
literature
is
Greedy
Hill Climbing, which explores the
between
the
in
which
a
variable
is
conditionally
D
N
matrix
Fan
Li

1 allowed
1
2
2
(MO2).
The
first
data
set
contains
building
simulations
solving
a
Lasso
regression
problem
with
the
regularization
value,
while
other
parameters
were
set
to
their
average
value.
Hyper-+ ✓i + independent
jx
pnj )
j=i+1 ij ✓i
n=1 (xni
j=i+1
candidate
graph
space
byThe
fined
✓i i.
of
all
other
variables
givenauthors
its small
parents.
On
hyperparameter
set to
The original
This
results
in two
simulations
per
building
parameter.
)
ayesian approach
estimates
a
globally
built
from
brute
force
increments
across
14
building
P (✓these
) =formulations
exp(
) with microarrays,
(15) MO2Xdata
ihand,
parameter θi the
2derived
to work
which
changing
directions.
The best valid graph modification
set contains
pairwise edge
minimum
and maximum
other
a
Markov
network
assumes
variables
are
2
2
parameters.
The
MO1
data
set
contains
299
building
simuatrix over all possible variable
orders.
(17)
typically contain 10,000 to 12,000
variables, but have very
•  Our
FG
and function,
MO2 experimental
results indicated that the GGM performs
combinations
across the
same building
parameter
set.
is
selected
according
to
a
criteria
generally
BIC,
conditionally
independent
from
variables
Y
provided
that
for
T
1
and ⇠150 building
parameters.
Theabuilding
few
samples. This allowed the authors to uselations
conventional
Using the FG
data set, we built
GGM usingparam250 simular (9)
the Bayesian
formulation
presented
This
version
assumes
we
are
only
splitting
the
optimization
and
a
new
candidate
graph
is generated. A
modification
is
they
are
separated
by
variables
Z.
X
and
Y
are
said
to
be
well
at
estimating
building
parameters.
methods
to solve forto i . However,
the
E+independently
Maximizing P ( i | i , D) with optimization
respect to
tions sampled
without
replacement,
evaluated
the model
1 and or
i is equivalent
eters
were
set
to
their
minimum
maximum
is in
aZthis
user
defined
hyperparameter.
Using
theX
above
prior
final
estimate
is
computed
using the
following
equation:
1
ariable orderings
should
theoretically
problem
across
the
training
samples,
and
not
the
features.
i prior
only
valid
if
the
candidate
graph
is
a
valid
Bayesian
Net- equation:
data
set
used
work
contains
several
million
data
vecseparated
by
if
and
only
if
all
paths
between
and
Y
using
150
test
simulations
simulations.
In
we
built
is
a
user
defined
hyperparameter.
Using
the
above
final
estimate
is
computed
using the following
solving
a
Lasso
regression
problem
with
the
regularization
i
value,
while
other
parameters
were
set
to
their
average
value.
and
some
derivations,
Liap derived
the
fol-the entire
con- matrix.
p
tors,
which
mostly
invalidates
conventional
optimization
It
is
possible
to
split
across
both.

presents
the
aNetwork
GGM
using
MO1the
data
set.approach
The resulting
model
ecision
work.
This
guarantees
a locally optimal solution
M
and
some
derivations,
Fan
Li
 derived
folpass
through
the
variables
Z
.
The
Bayesian
2 in two simulations
X
hyperparameter
setprior
to
.
The
original
authors
Wishart
i
This
results
per
building
parameter.
The
M
1
1
proaches.
Rather
than
using
the
fast
grafting
algorithm
lowing
maximum
a
posteriori
(MAP)
distributions
for
all
1
is evaluated using 300 sampled MO2 simulations.
1
X
blem
formulation
for
supporting
this
functionality.
ˆ
1(19)
1
is
a
user
defined
hyperparameter.
Using
the
above
prior
final
estimate
is
computed
using
the
following
equation:
that
will
maximize
the
criteria
function,
but
could
be farbuilt
=
P
(✓
)
lowing
maximum
a
posteriori
(MAP)
distributions
for
all
makes
stronger
assumptions
independence,
but
these
•
the
Bayesian
models
using
the
FG
and
full
MO1
data
sets
1j
used
by Fan
Li

is
defined
as
W
(
,
T
).
j
1 the Lasso
ii Overall,
ˆ
derived
these
formulations
to
work
with
microarrays,
which
used
by
,
we
solve
regression
problems
using
Analyzing
Figure
1
illustrates
that
the
building
paramMO2
data
set
contains
pairwise
minimum
and
maximum
=
P
(✓
)
(19)
and all
:
M
for Ω
j
j
1
i
j=1
Solved
by all
and
:
and
some
derivations,
Fan
Li

derived
the
folaway
from
the
true
model.
M j=1
(Section
4.1).
assumptions
make
it
easier
to
perform
exact
interference.
eter
estimates
using
the
GGM
are
mostly
better
than
the
efined
hyperparameter
and
T
is
a
hytypically contain 10,000 to 12,000 variables, but have very
combinations
across
the
same
building
parameter
set.
M
are
statistically
better
than
the
uniform
distribution,
which
is
expected.
X
LASSO
with
After
obtaining
’s
MAP
estimate,
it
can
be
used
to
i
estimates
generated
by
randomly
guessing
using
a
[0,
1]
ran1
1
There
are
two
algorithms
that
extend
the
basic
algorithm
lowing
maximum
a posteriori
(MAP)
distributions
for all
contrast,
Markov
Networks
use approximate
inference
1
1
ˆ
samples.
This
allowed
the
authors
conventional
PP(by
/
alfew
matrix
who’s
entries
areIn
governed
Using
the
FG
data
set,
we
built
a
GGM
using
250
simulai | 1use
i|✓, iD)
ˆ
=
(✓j )
(19)
maximize
(to
,
,
D)
with
respect
to
.
However,
jis aP
i
1
dom distribution. The
overall
error rate
where
MGGM’s
is the absolute
total
number
of
samples
and allows
sample
1 P ( i | i , D) /i
i
i
i
by
constraining
the
search
space,
which
for
better
soˆ
and
all
:
1
methods,
such
as
Belief
PDLoopy
PN
P
M
where
M is the total
number
of samples
and i is a sample
optimization methods to solve
. However,
the
E+
N
P ( for
upon
the Propagation.
hyperparameter
2
tions
sampled
without
replacement,
and
evaluated
the
model
is
statistically
better
with
95%
confidence
than
the
uniform
i |✓i , i i , D) is dependent
tion:
j=1
computed
using
Eq.
18.
Afterward
a
few
iterations
between
P
P
P
p
(x
x
)
+
|
|
ni
ij
nj
i
ij
N
N
Maximum a posterior
n=1
j=i+1D
j=i+1 2
lutions.
The first algorithm is the Sparse Candidate method
Conclusions
both
have
weaknesses
certain
con- error
andgraphs
Fan
Li exp(

noted
that
there
is with
not a representing
known
computed
using
Eq.
18.
Afterward
a few iterations
between
distribution’s
rate
– 4.05&plusmn;0.98
However,
1built
data set used in this work contains
several
million
data
vec(xni method
)
ij xnj ) + simulations.
i estimating
ij |vss 5.07&plusmn;1.01.
using
150 test
simulations
we
•
However,
our
current
GGMs
have
difficulty estimating parameters that
and
s,
we
use
the
final
estimates
to
comn=1
j=i+1
j=i+1In
1 variables that have1 a
to
analytically
remove
the
hyperparameter
via
integration.
only
the
error
rates
for
variables
1,
2,
3,
8,
10,
11,
12,
13,
exp(
)
.
This
method
assumes
that
random
i
ditional
independence
properties.
However
within
this
work,
P (optimization
/ apestimating
s and
s,ofwe
use theand
finalˆestimates
to comi | i , D)
distribution
for
✓iinvalidates
tors, which(MAP)
mostly
conventional
a
GGM
using
the
entire
MO1
data
set.
The
resulting
model
where
M
is
the
total
number
samples
is
a
sample
pute
the
GGM’s
precision
matrix
(Eq.
7).
i
This means
numerical methods
are required
to approximate
and
14i are
statistically
better
than their
random
guessing
(16)
deviate
significantly
from
the
mean.
This
implies
we
need
to
explore
(✓proaches.
exp(
)
(15)
2 Networks
P
P
P
p
high
measure
of
mutual
information
should
be
located
closer
i) =
we
show
preference
to
Markov
over
Bayesian
Netpute
the
GGM’s
precision
matrix
(Eq.
7).
-1
D
N
N
2
Rather
than
using
the
fast
grafting
algorithm
many
nu- using 300 isampled
isareevaluated
MO2
simulations.
(16)
The other
arecomputed
not statistically
di↵erusing
Eq. 18. Afterward a few iterations between
2 all β2and all ψ the integral over the hyperparameter.
(xni Therej=i+1
|toijvariables
|each
ij xnj ) +counterpart.
n=1
j=i+1
in the
final network
than variables with
low
because
wish to using
avoid
falsely
representing
variables
1
merical
methodswe
for
computing
approximates
toAnalyzing
definite in- Figure
ent.
used by , we solve the works
Lasso
regression
problems
different
hyper-parameter
settings,
which
may
induce
more
estimation
exp(
) RESULTS
1 illustrates 5.
that
theother
building
paramestimating
s
and
s,
we
use
the
final
estimates
to
comtegrals,
such as1 Trapezoidal
method and 1Simpson’s Rule.i 1
mutual
The
mutual information for two dis1
While the
other variables
areinformation.
not statistically di↵erent,
we are only
splitting
optimization
conditionally
independent.
(Section
4.1). theas
5.
RESULTS
P
(
|✓
,
,
D)
/
P
(D|
,
,
✓
)P
(
|
,
✓
)P
(
|✓
)
eter
estimates
using
the
GGM
are
mostly
better
than
the
i
i
i
i
i
i
i
pute
themodel
GGM’s
precision
matrix
(Eq.
7).
i
However, the integral
over ✓i is unbounded
1 i from above, be- i
1there is inot a clear 1indication
1 the
variance,
or
possibly
a
mixture
model
approach,
which
will
allow
more
that
GGM
fails
to
In
order
to
assess
how
well
the
GGM
estimates
building
(16)
crete
random
variables,
X
and
Y
,
is
defined
as
follows:
raining
samples,
and
not
the
features.
Maximize
While
these
graphical
models
are
able
to
repP
(
|✓
,
,
D)
/
P
(D|
,
,
✓
)P
(
|
,
✓
)P
(
|✓
)
i
i
i
i
i
i
i
After obtaining i ’s MAP cause
estimate,
it can
used
to
i
i represent
i
estimates
generated
by randomly
guessing
using
aexperimented
[0,
ran✓i ’s values
exist be
in the
interval
[0,1),
which
means
these variables.
This iindicates
that
the1]
GGM
over-to
In
order
assess
how
wellsets
the–GGM
estimates building
+
1
+
N
2i
+
|D|
parameters,
we
with
three
data
Fine
1
across
both.
the
✓
◆
resent
joint
graphintegral
structures
generally
⇠
Gamma(
,for
variable
means.
17
mustdistributions,
beto
transformed
tothe
a bounded
theNare2i
maximize
P ( i|✓presents
with
respect
However,
all
isThe
successfully
modeling
the absolute
FG buildingerror
parameters.
In we
i , i , D) with
i.
+
1
+
+
|D|
dom
distribution.
overall
GGM’s
rate
X
parameters,
experimented
with
three
Grain
(FG),
Model
Order
1 (MO1), andPModel
Order
2 data sets – Fine
2
(x,
y)
1
⇠
Gamma(
,
numerical
methods
to
be
applicable.
Given,
Eq
17’s
com5.
RESULTS
orting
this
functionality.
fact
analyzing
Figures
1(a)
indicates
that
the
GGM
isolates
Given
total
variables
PDis statistically
PN2 better
I(X,
Y Grain
)uniform
= set
P contains
(x,Model
y)logbuilding
(5)
P ( i |✓i , respect
upon theP
hyperparameter
✓i1 number
1 2 the
1of random
12
1(MO2). than
i , D) is dependent
with
95%
confidence
the
N
1
(FG),
Order
1
(MO1),
and
Model Order 2
to ψ-1predetermined.
The
first
data
simulations
plex nature and
Li’s

recommendation
to
use
sampling
✓
+
✓
+
(x
x
)
P
(
|✓
,
,
D)
/
P
(D|
,
,
✓
)P
(
|
,
✓
)P
(
|✓
)
the
building
parameter’s
means
very
well.
While
the
uniP
(x)P
(y)
ni
j
nj
i
i
i
i
i
i
i
ij
i1
i it is PN i
P
PD rate
i
i N
n=1
an
E+j=i+1
for
our
experiments)
1 j=i+1
21 |✓ i, 1 , D]simulation,
2
In variances
order
to
assess
how data
well
the14
GGM
estimates
building
x,y
and Fan Li  noted thatwithin
there
is not
a(⇠300
known
method
distribution’s
error
–
4.05&plusmn;0.98
vs
5.07&plusmn;1.01.
However,
(MO2).
The
first
set
contains
building
simulations
)
to approximate
,
we
elected
to
use
E[
as
built
from
small
brute
force
increments
across
building
i
i
form
distribution
matches
the
mean
and
(Figure
✓
+
✓
+
(x
x
)
i
i
ni
j
nj
i
j=i+1 2 ij i
n=1
j=i+1
1
+
1
+
N
2i
+
|D|
our estimates for via
,
which
is
the
MAP
estimate
under
a
to analytically remove the hyperparameter
integration.
parameters,
wethe
experimented
with
three
data
sets14–building
Fine
1(b)), for
it appears
the matching
only
possible
because
built
from
small
brute299
force
increments
across
only the error, rates
variables
1, 2, 3,is )8,
10,
11,
12,
13,
i
parameters.
The
MO1
data
set
contains
building
simu⇠
Gamma(
(17)
2
distribution.
Based on Fan Li’s  MAP
distribuactual
building
parameters
have
arandom
large
distribution
This means numerical methodsGamma
are required
to approximate
Grain
(FG), range,
Model
1building
(MO1),
and Model
Ordersimu2
2
parameters.
TheOrder
MO1
set contains
299 building
and
14
are
statistically
better
than
their
guessing
lations
and
⇠150
building
parameters.
Thedata
param(17)
tion, we derived the
following
maximum
likelihood
estimate
which
exPare
PDto i is equivalent
PNmostlytocenters 2at 0.5, the uniform distribution’s
N 1Pmany
1 with 1counterpart.
2 i ,nuMaximizing
(
|
D)
respect
the integral over the hyperparameter.
There
-1
-1
lations
and
building
parameters.
The building
parami
eters
were
independently
set
to⇠150
their
minimum
or maximum
(MO2).
The
first
data
set contains
building
simulations
The
other
variables
are
not
statistically
di↵er1.  (MLE)
Given aequation
fixed θ ,for
we j=i+1
estimated
the
+ ✓i by+computing
value. j xnj )
ija✓iψ sample
i :
This
work
was
funded
by
field
work
proposal
CEBT105 under the Department
n=1 (xni pected
j=i+1
Maximizing
P
(
|
,
D)
with
respect
to
is
equivalent
to
solving
a to
Lasso
regression
with
the
regularization
i
i
i
merical methods for computing
approximates
definite
in- pproblem
eters
were
independently
set
to
their
minimum
or
maximum
value,
while
other
parameters
were
set
to
their
average
value.
)
maximum
likelihood
estimate
(MLE)
ent.
built
small
Similar to the FG data set results,
the from
Bayesian
GGMbrute force increments across 14 building
2
1 +solving
N
hyperparameter
set 2i
to+a|D|
.
The
original
authors
Lasso
regression
problem
with the
regularization
of
Energy
Building
Technology
Activity
Number BT0305000. This research
1
presents
excellent
performance
on the
MO2
set
(Figure
i While
This
results
in
twodata
simulations
per
building
parameter.
The
value,
while
other
parameters
were set
to their
average
value.
tegrals, such as Trapezoidal method
and
Simpson’s
Rule.
the
other
variables
are
not
statistically
di↵erent,
parameters.
The
MO1
data
set
contains
299
building
simup
=
P
P
P
i
N
D
N
(17)
1
2
2).i The
Bayesian
method
isauthors
better set
thancontains
a uniform
distri- in two
✓i 1 + above,
✓formulations
+hyperparameter
xnjmicroarrays,
)2
derived
these
with
which
set
to
.
The original
ij
MO2
data
pairwise
minimum
and
maximum
i
This
results
simulations
per
building
parameter.
The
j=i+1 ijfrom
n=1 (xni to work
j=i+1
However, the integral over ✓i is unbounded
bethere is not a clear
indication
that
the
GGM
model
fails
to
lations
and
⇠150
building
parameters.
The
building
paramused
resources
of
the
Oak
Ridge
Computing
Facility
at
the
Oak
bution
at
estimating
all
building
parameters.
The
Bayesian
(18)
typically
contain
10,000
to
12,000
variables,
but
have
very
derived
these
formulations
to
work
with
microarrays,
which
combinations
across
the
same
parameter
set.minimum and maximum
data building
set contains
pairwise
cause ✓i ’s values exist in the
interval
[0,1),
which
means
2.  Computed
multiple
samples
and
E[ψ-1respect
|θi,these
βi, D] variables.
1 MLE
1 represent
Maximizing
P
(
with
to
is
equivalent
to
This
indicates
the
GGM
overGGMi has
better
overall
error
—that
13.01&plusmn;0.85
vsMO2
41.6936&plusmn;2.13.
i |a estimated
i , D)
eters
were
independently
set
to
their
minimum
or is
maximum
Given few
a fixed
✓
,
we
can
estimate
sample
by
comi
i
Ridge
National
Laboratory,
which
by the Office of Science of the
samples.
This
allowed
the
authors
to
use
conventional
typically
contain
10,000
towith
12,000
variables,
butproduces
have very
Using
the
FG data
set, predicwe builtacross
a GGM
using
250
simula-parameter supported
using
weighted
sampling
combinations
the
same
building
set.
In
the
model
statistically
better
1
Eq.
17
must
be
transformed
to
a
bounded
integral
for
the
puting
the
maximum
likelihood
estimate
(MLE)
(Eq.
18).
all
is
successfully
modeling
the
FG
building
parameters.
In
solving
a
Lasso
regression
problem
the
regularization
r defined hyperparameter. Using the above prior
final i estimate is computed using the p
following equation:
value,
while
other parameters
were set
to
their average value.
optimization
methods
to
solve
for
.
However,
the
E+
i
3.
Sampled
P(θ
)
using
its
CDF
and
a
uniform
distribution
on
the
tions
sampled
without
replacement,
and
evaluated
the
model
few
samples.
This
allowed
the
authors
to
use
conventional
tions
for
all
variables.
Using
the FG data set,of
we Energy
built a GGM
using 250
simula- No. DE-AC05-00OR22725. Our
i multiple
Byfolcomputing
toanalyzing
the ✓i.
’s
derivations,
Fan Lito
be
derived
the
U.S.
Department
under
Contract
numerical
methods
applicable.
Given,
EqMLE
17’ssamples
com-Maccording
fact
Figures
1(a)
indicates
that
the
GGM
isolates
hyperparameter
set
to
The
original
authors
i
This
results
in
two
simulations
per
building
parameter.
The
interval
[0,1]
toset
compute
used
in this
work
several
million
data
vec1
using
150
test
simulations
simulations.
In
we
built
1 X
in Eq.
arecontains
able
to
estimate
optimization
methods
to
solve
for
.
However,
the
E+
ximum a posteriori (MAP) distributions fordistribution
all
115, we
i
tions
sampled
without
replacement,
and
evaluated
the
model
ˆj Pthe
to
use
sampling
=formulations
(✓j )to
(19) microarrays,
building
parameter’s
means
very
well.
While
the
uni1
1plex nature and Li’s  recommendation
i
derived
these
work
with
which
MO2
data
set
contains
minimum
and
maximum
work
has
been
enabled
and
by NSF funded RDAV Center of the
:
E[ i |✓i ,tors,
using1 mostly
weighted
sampling.
Inconventional
order to use optimization
invalidates
apM set
i , D]which
6. DISCUSSION
a GGM
using
the entire
MO1
data
set. pairwise
The
resulting
modelInsupported
1
data
data
vecj=1 used in this work contains several million
using
150
test
simulations
simulations.
we
built
to approximate i , we elected
to sampled,
use
E[ wei sample
|✓
, D]
as10,000
2the
i , contain
i Eq.
form
distribution
matches
mean
and
variances
(Figure
weight
15
using
its
CDF
andgrafting
a variables,
typically
to
12,000
but
have
very
combinations
across
theentire
sameMO1
building
parameter
set.
proaches.
Rather
than
using
the
fast
algorithm
While
the
parameters
in
FG and
MO2
have
well
defined
is
evaluated
using
300
sampled
MO2
simulations.
tors,
which
mostly
invalidates
conventional
optimization
ap1
1
University
of
Tennessee,
Knoxville
(NSF
grant no. ARRA-NSF-OCI-0906324
a
GGM
using
the
data
set.
The
resulting
model
D)our
/ estimates for
ˆ
unifrom
distribution
on
the
interval
[0,1].
This
means
our
,
which
is
the
MAP
estimate
under
a
where
M
is
the
total
number
of
samples
and
is
a
sample
1(b)),
it
appears
the
matching
is
only
possible
because
the
i
2
means,
they
have
instances
where
they
significantly
devii
samples.
This
thethan
authors
to
use
conventional
usedfew
by ,
we solve
theallowed
Lasso
regression
problems
using
PN
PN
p
Using
FG data
set,300
wesampled
built a MO2
GGM
using 250 simulaAnalyzing Figure
1the
illustrates
that
the
building
paramproaches.
Rather
using
the
fast
grafting
algorithm
D
2
is
evaluated
using
simulations.
computed
using
Eq.
18.
Afterward
a
few
iterations
between
(xni
ij xnj ) +
i
ij |
ate
from their have
estimated
means,distribution
which is not
well
implied
Gamma
distribution.
Based
on| Fan
Li’s

MAP distribun=1
j=i+1
j=i+1
and
NSF-OCI-1136246).
actual
building
parameters
a
large
range,
1methods to
(Section
4.1).
optimization
solve
for
.
However,
the
E+
eter
estimates
using
the
GGM
are
mostly
better than
thethe building
)
i
tions
sampled
without
replacement,
and
evaluated
the model
2
estimating
s and used
s, we
use ,
the final
estimates
to comby
we
solve the
Lasso
regression
problems
using
Analyzing
Figure
1
illustrates
that
paramA
based
constrained
set
optimization
method.
by
results
(Figure
1(a)
and
2).
Figure
3(a)
illustrates
that
i
tion, we derived the following maximum
likelihood
estimate
which
mostly
centers
at
0.5,
the
uniform
distribution’s
exAfter
obtaining
’s
MAP
estimate,
it
can
be
used
to
pute
the
GGM’s
precision
matrix
(Eq.
7).
generated
bytest
randomly
guessing
using
a [0,are
1] In
randata set used
ini this(Section
work contains
several million data estimates
vec(16)
using 150
simulations
simulations.
wethan
builtthe
4.1).
eter
estimates
using
the
GGM
mostly
better
1
1
(MLE) equation for i :
maximize
P ( i mostly
|✓i After
, simulations
with
respect
to
However,
pected
value.
i , D)
i . estimate,
distribution.
The
overall
GGM’sMO1
absolute
error
rate resulting
tors,
which
invalidates
conventional
optimization
apobtaining
’s
MAP
it brute
candom
be force
used
to
1.  Fine Grain (FG): contains
building
built
from
small
a
GGM
using
the
entire
data
set.
The
model
i
estimates
generated
by
randomly
guessing
using a [0,
1] ran1
5.
RESULTS
2 the Bayesian
1 the hyperparameter
P ( proaches.
ismaximize
dependent
upon
Similar
to
the
FGrespect
data✓algorithm
GGM
1
1
1
iset results,
is. statistically
better
with
95% confidence
than
the GGM’s
uniform
i |✓i , i , D) Rather
P
(
|✓
,
,
D)
with
to
However,
than
using
the
fast
grafting
i , D) / P (D| i , i , ✓i )P ( i | i , ✓i )P ( i |✓i )
i
i
i
dom
distribution.
The
overall
absolute error rate
is
evaluated
using
300
sampled
MO2
simulations.
i
In
orderLito 
assess
how well
the GGM
estimates
building
increments
across
14
building
parameters.
250
simulations
were
sampled
1
+
N
2i
+
|D|
and
Fan
noted
that
there
is
not
a
known
method
1
distribution’s
error
– 4.05&plusmn;0.98
vs 5.07&plusmn;1.01.
However,
1+ 1 + N 2i + |D|
presents
excellent
performance
onusing
the MO2 data
setisrate
(Figure
P
(
|✓
,
,
D)
is
dependent
upon
the
hyperparameter
✓
used
by
,
we
solve
the
Lasso
regression
problems
i
i
i
parameters,
we
experimented
with
three
data
sets
–
Fine
statistically
better
with
95%
confidence
than the
uniform
=
Analyzing
Figure
1
illustrates
that
the building
parami
P
P
P
i
mma(
,
N
D to analytically
N
1
1
remove
the
hyperparameter
via
integration.
2
2
onlythan
thethe
error
rates
for
variablesC.
1, 2,
3, 8, 10,
11,
12, 13, J. Nevins, G. Yao, and M. West. Sparse
2).
Bayesian
method
is
better
a uniform
distri(FG),
Model
Order
1 (MO1),
and The
Model
Order
2there
2
without
and
150
test
were
used
to
evaluate
✓i replacement,
+ ✓i + n=1 (xGrain
xnjsimulations
)4.1).
A.
Dobra,
Hans,
B.
Jones,
and
Fan
Li

noted
that
is
not
a
known
method
ij
distribution’s
error
rate
–
4.05&plusmn;0.98
vs 5.07&plusmn;1.01.
However,
ij
(Section
j=i+1
j=i+1
P
P
eter
estimates
using
the
GGM
are
mostly
better
than
the
D
N
1
1
2
2
This
means
numerical
methods
are
required
to
approximate
(MO2).
The
first
data
set
contains
building
simulations
and
14
are
statistically
better
than
their
random
guessing
✓
+
✓
+
(x
x
)
ni
j
nj
bution
at
estimating
all
building
parameters.
The
Bayesian
ij
i
i
1
n=1
j=i+1
to(18)
analytically
remove
the hyperparameter
via integration.
onlygraphical
the error by
rates
for variables
1, 2, using
3, 8, 10,
11,
12,
13,
After
obtaining
it
can
be
used
to
)
GGM
model.
built
from
small
brute
force
increments
acrossestimate,
14
building
i ’s MAP
estimates
generated
randomly
guessing
a
[0,
1]
ranmodels
for
exploring
gene
expression
data. Journal of
the
integral
over
the
hyperparameter.
There
are
many
nucounterpart.
The
other
variables
are
not
statistically
di↵er2
1
1
GGM
has
better
overall
error
—
13.01&plusmn;0.85
vs
41.6936&plusmn;2.13.
1
This
means
numerical
methods
are
required
to
approximate
The MO1
data |✓
set ,contains
299with
building
simuGiven a fixed ✓i , we can estimate
a parameters.
sample
by
comand 14 are statistically
better
than absolute
their random
guessing
(17) merical
maximize
P
(
,
D)
respect
to
.
However,
i
i
i
i
dom
distribution.
The
overall
GGM’s
error
rate
methods
for
computing
approximates
to
definite
ini
ent.
2.
Model
Order
1
(MO1):
contains
299
building
simulations
and
~150
building
Multivariate
Analysis,
90(1):196–212,
2004.
lations
and
⇠150
building
parameters.
The
building
paramIn
the
model
produces
statistically
better
predic1
the
integral
over theupon
hyperparameter.
There are ✓many nucounterpart.
The
other
variables
are
not
statistically
di↵erputing
the
maximum
likelihood
estimate
(MLE)
(Eq.
18).
P
(
|✓
,
,
D)
is
dependent
the
hyperparameter
such
and
Simpson’s Rule.
ng P ( i | i , D) with respect to i is equivalent to tegrals,
ias iTrapezoidal
i
statistically
with
95% confidence
eters were
set to their method
minimum or
maximum
While
theis other
variablesbetter
are not
statistically
di↵erent,than the uniform
i independently
tions
for
all
variables.
merical
methods
for
computing
approximates
to
definite
inparameters.
The
building
parameters
were
independently
to their
min or a clearF.
ent.
By regression
computing
multiple
samples
according
to
the
✓iover
’snoted
Lasso
with the MLE
regularization
Li, Y.
Yang,
E.
Xing.
From
lasso
regression to feature vector machine.
value,
other
were
tounbounded
theirthere
average value.
However,
the
integral
✓isetisthat
beandwhile
Fan
Liparameters

isfrom
notabove,
a set
known
method
pproblem
there is notdistribution’s
indication
that
the–and
GGM
model
fails
to
error
rate
4.05&plusmn;0.98
vs
5.07&plusmn;1.01.
However,
meter
set to
original
authors
tegrals,
such
as Trapezoidal
i . The
This able
results in
simulations
per
building
parameter. The method and Simpson’s Rule.
While
theindicates
other variables
are not
statistically di↵erent,
distribution
defined
in
Eq.
15, other
wecause
are
totwoestimate
✓
’s
values
exist
in
the
interval
[0,1),
which
means
i
represent
these
variables.
This
that
the
GGM
overmax
value,
while
parameters
were
set
to
their
average
value.
This
to
analytically
remove
the
hyperparameter
via
integration.
Neural1,that
Information
Systems, 18:779, 2006.
only the
error
rates
forinvariables
2, the
3, 8,
10, model
11,Processing
12,fails
13,to
ese formulations
to work with microarrays, which
MO2 data set contains
pairwise the
minimum
and maximum
1
However,
integral
over
✓
is
unbounded
from
above,
bei
there
is
not
a
clear
indication
GGM
E[
|✓
,
,
D]
using
weighted
sampling.
In
order
to
use
i toi 12,000 variables, but have very
Eq.
17 must
be transformed
to
a bounded
integral for
the
6.
DISCUSSION
containi 10,000
all is successfully
modeling
the FG building
parameters.
Inrandom guessing
combinations
across
the same building
parameter
set. required
This
means
numerical
methods
are
to
approximate
and
14
are
statistically
better
than
their
results
in
two
simulations
per
building
parameter.
The
entire
MO1
data
set
cause
✓
’s
values
exist
in
the
interval
[0,1),
which
means
i applicable.
represent
these variables.
This
indicates
that the GGM overes.
This allowed
the authors
use conventional
Usingits
themethods
FG dataand
set,
weabe
built
a GGM using Given,
250 simulaweight
sampled,
we tosample
Eq. 15 numerical
using
CDF
to
Eq
17’s
comfact
analyzing
Figures
1(a)
indicates
that
the
GGM
isolates
While
the
parameters
in
FG
and
MO2
have
well
defined
the
integral
over
the
hyperparameter.
There
are many
nu- for the
counterpart.
The othermodeling
variablesthe
areFG
notbuilding
statistically
di↵er-In
on methods to solve for i . However, the E+
Eq.
17
must
be
transformed
to
a
bounded
integral
tions
sampled
without
replacement,
and
evaluated
the
model
all
is
successfully
parameters.
plex
nature
Li’sour
 recommendation
to
use instances
sampling where
unifrom distribution
ontothe
interval
[0,1].
This and
means
used
build
GGM
model.
the
building
parameter’s
means very well. While the unised in this workwas
contains
several million
data vecmeans,
they
have
they
significantly
deviusing
150 test methods
simulations
simulations.
In
we
built
merical
for
computing
approximates
to
definite
inent.
1
1
numerical
methods
to E[
be applicable.
Eq
17’s
comfact analyzing
Figures
1(a)
indicates
that the GGM isolates
,MO1
wedata
elected
toresulting
use
|✓i , i , D] Given,
as
form
distribution
matches
the
mean
and
variances
(Figure
h mostly invalidates conventional optimization ap- to aapproximate
GGM
using
the
entire
set.
The
model
i
i
ate
from
their
estimated
means,
which
is not While
well the
implied
3.  Model
Order
2 (MO2):
contains
pairwise
min
and
max
combinations
across
tegrals,
such
as1 , nature
Trapezoidal
method
and
Simpson’s
Rule.
2
the
otherisparameter’s
variables
are
not very
statistically
di↵erent,
plex
and
Li’s

recommendation
to
use
sampling
building
means
well. While
the uniRather
than
using
the
fast
grafting
algorithm
is
evaluated
using
300
sampled
MO2
simulations.
our
estimates
for
which
is
the
MAP
estimate
under
a
1(b)),
it
appears
the
matching
only
possible
because
the
2
i
1
1
based
constrained
set parameter
optimization
method.
bytheresults
(Figure
1(a)
andabove,
2).were
Figure
3(a)
illustrates
that
16], A
solve the
Lasso
regression
problems using
However,
theto
integral
over
✓building
isMO2
from
Analyzing
Figure
1300
illustrates
that
paramapproximate
, unbounded
we
elected
to
use
E[
|✓beD] as
iiLi’s
there
is
not
a clear
indication
that
GGM
model fails
to
i , used
i , building
form
distribution
matches
the the
mean
and variances
(Figure
the
same
building
set.
sampled
simulations
Gamma
distribution.
Based
on
Fan

MAP
distribui actual
parameters
have
a
large
distribution
range,
1 than the
Section 4.1).
eter
estimates
using
the GGM
are mostly
better
cause
✓
’s
values
exist
in
the
interval
whichestimate
means
our
estimates
for
which [0,1),
is the
MAP
under
a centers
i
represent
these
variables.
indicates
that
the GGM
over-the
1(b)),
it appears
theThis
matching
is only
because
tion,
we
derived
the
following
maximum
likelihood
estimate
i a ,[0,
which
mostly
at 0.5,
the uniform
distribution’s
ex-possible
btaining i ’s MAP
estimate,
it
can
be
used
to
estimates
generated
by
randomly
guessing
using
1]
ranto
evaluate
the
GGM
model
built
from
the
entire
MO1
data
set.
1 distribution.
Gamma
on Fanintegral
Li’s for
MAP
Eq.
17 must
beoverall
transformed
to Based
a bounded
thedistribuP ( i 1 |✓i , i , D) with respect to i . However, (MLE)
actual building
parameters
havebuilding
a large distribution
equation
for
all is successfully
modeling
the FG
parameters.range,
In
dom
distribution.
The
pected
value.
i : GGM’s absolute error rate
i , D) is dependent upon the hyperparameter ✓i
is statistically
better
with 95%
than
uniformGiven,
tion,
we confidence
derived
the the
following
maximum
estimate
which
centers
at
0.5, the that
uniform
numerical
methods
to be applicable.
Eq likelihood
17’s comSimilar
to
theanalyzing
FG
datamostly
set
results,
the
Bayesian
GGM
fact
Figures
1(a)
indicates
the distribution’s
GGM isolatesexLi  noted that there is not a known method
1
distribution’s error rate
–
4.05&plusmn;0.98
vs
5.07&plusmn;1.01.
However,
(MLE) equation for
:
Acknowledgments
Data Sets
References
```