Estimating Building Simulation Parameters via Bayesian Structure Learning

Richard E. Edwards 1, Joshua R. New 2, and Lynne E. Parker 1
1 University of Tennessee, Electrical Engineering and Computer Science
2 Oak Ridge National Lab, Whole Building & Community Integration Group

ABSTRACT

Many key building design policies are made using sophisticated computer simulations such as EnergyPlus (E+), the DOE flagship whole-building energy simulation engine. E+ and other sophisticated computer simulations have several major problems. The two main issues are 1) gaps between the simulation model and the actual structure, and 2) limitations of the modeling engine's capabilities. Currently, these problems are addressed by having an engineer manually calibrate simulation parameters to real world data or by using algorithmic optimization to adjust the building parameters. However, some simulation engines, like E+, are computationally expensive, which makes repeatedly evaluating the simulation engine costly. This work explores addressing this issue by automatically discovering the simulation's internal input and output dependencies from ~20 Gigabytes of E+ simulation data; future extensions will use ~200 Terabytes of E+ simulation data. The model is validated by inferring building parameters for E+ simulations with known ground truth. Our results indicate that the model represents parameter means with some deviation from the means, but does not support inferring parameter values that exist on the distribution's tail.
2. PROBABILISTIC GRAPHICAL MODEL

Learning the true joint probability distribution directly in the general case is computationally intractable, especially when dealing with a large number of continuous random variables. In general, it is assumed that the joint probability distribution factorizes into several smaller, more manageable computational problems. Factorization is based purely on conditional independence and is represented using a graphical model G, where G is a graph with V nodes representing random variables and E edges. The edges in E represent conditional dependencies between the random variables in the graph G. The graph structure is intended to represent the true factorization for the joint probability distribution by splitting it into simpler components. These types of models have been effectively applied to many fields, such as topic classification [2], document classification [1], and disease diagnosis [13].
Our goal is to model the joint distribution P(X, Y) between the E+ building specification parameters X and the observed output data Y (real world sensor data, weather data, and operation schedules). At inference time, we optimize over the joint distribution given new observations Y_1, Y_2, ..., Y_n to estimate the building parameters X. Assuming that all observations are independent and identically distributed, Eq. 1 may be rewritten as:

\prod_{i=1}^{n} P(X, Y_i)    (2)

which is the joint probability distribution. Conversely, the E+ approximation process requires maximizing the following posterior probability:

\prod_{i=1}^{n} P(Y_i \mid X) P(X)    (3)

where X represents the inputs and observed operation schedule, and Y_i represents a single output sample. This approximation method requires finding multiple samples that maximize the posterior probability, rather than a single assignment. However, forward approximating E+ using this method is computationally slower than using statistical surrogate models. Therefore, we focus only on estimating building parameters and present this capability for completeness.
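As a concrete illustration of Eq. 3 under the i.i.d. assumption, the sketch below scores a candidate parameter vector X against a set of output samples. The Gaussian prior and likelihood are hypothetical stand-ins, not E+'s actual distributions.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Minimal sketch of Eq. 3 under i.i.d. outputs: log P(X) + sum_i log P(Y_i | X).
# The prior and likelihood are hypothetical Gaussian stand-ins.
def log_posterior(x, y_samples, prior, likelihood):
    return prior.logpdf(x) + sum(likelihood(x).logpdf(y) for y in y_samples)

# Toy usage: 2-D parameters, 3-D outputs, five observed output samples.
prior = multivariate_normal(mean=np.zeros(2), cov=np.eye(2))
likelihood = lambda x: multivariate_normal(mean=np.full(3, x.sum()), cov=np.eye(3))
y_obs = [np.zeros(3) for _ in range(5)]
print(log_posterior(np.array([0.1, 0.2]), y_obs, prior, likelihood))
```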
The graphs used to represent factorized joint probability distributions are either directed acyclic graphs (DAGs) or undirected graphs. The first form assumes that the graph represents a Bayesian Network, and the latter form assumes the graph represents a Markov Network. Both graph types assume that a joint probability distribution factorizes according to their structure. However, a Bayesian Network assumes a much simpler probabilistic dependency between the variables, in which a variable is conditionally independent of all other variables given its parents. On the other hand, a Markov Network assumes variables X are conditionally independent from variables Y provided that they are separated by variables Z. X and Y are said to be separated by Z if and only if all paths between X and Y pass through the variables Z [14]. The Bayesian Network approach makes stronger assumptions about independence, but these assumptions make it easier to perform exact inference. In contrast, Markov Networks use approximate inference methods, such as Loopy Belief Propagation. Both representations have weaknesses with certain conditional independence properties. Within this work, we show preference to Markov Networks over Bayesian Networks, because we wish to avoid falsely representing variables as conditionally independent.
3. STRUCTURE LEARNING

While these graphical models are able to adequately represent joint distributions, the graph structures are generally not predetermined. Given the total number of random variables within an E+ simulation (~300 for our experiments), it is not feasible to specify the single best graph structure in advance. Therefore, algorithmic techniques must be used to find the best graph structure that matches the true joint probability distribution without overfitting the observed training samples. There are three categories of algorithms that address this problem. The first method is Score and Search (Section 3.1), and is generally used to learn Bayesian Network structures. The second approach is Constraint Based (CB) methods (Section 3.2), which are used to learn Bayesian and Markov Networks. The final method uses strong distribution assumptions combined with regression based feature selection to determine dependencies among the random variables (Section 3.3).
3.1 Score and Search

Score and Search (S&S) algorithms try to search through the space of all possible models and select the best model seen. The best model is selected by using a global criteria function, such as the likelihood function. However, the most common criteria function is the Bayesian Information Criteria (BIC) [22]:

BIC(Data; G) = L(Data; G, \theta) - \frac{\log M}{2} Dim[G]    (4)

where Dim[G] represents the number of parameters estimated within the graph, M is the total number of samples, and L is the log-likelihood function. S&S methods are generally used to find the best structure for Bayesian Networks, because the log-likelihood function factorizes into a series of summations. This factorization makes it very easy to modify an existing Bayesian Network and compute the modified score without recomputing the entire score. For example, adding an edge to an existing network only requires computing the new conditional probability involving the child of the new edge and adding the difference to the previous BIC score. The penalty term is updated by simply adjusting the score by N (log M)/2, where N is the number of newly estimated parameters.
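The sketch below illustrates Eq. 4 and the incremental update just described; the function and variable names are illustrative rather than taken from the paper.

```python
import numpy as np

# Sketch of Eq. 4: BIC as fitted log-likelihood minus a complexity penalty.
def bic_score(loglik: float, dim_g: int, m: int) -> float:
    return loglik - 0.5 * np.log(m) * dim_g

# Incremental update when an edge is added: only the child's local
# log-likelihood term changes, plus the penalty for newly estimated parameters.
def bic_update(old_score: float, delta_loglik: float, n_new_params: int, m: int) -> float:
    return old_score + delta_loglik - 0.5 * np.log(m) * n_new_params
```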
The most common method for performing S&S within the literature is Greedy Hill Climbing, which explores the candidate graph space by deleting edges, adding edges, and changing edge directions. The best valid graph modification is selected according to a criteria function, generally BIC, and a new candidate graph is generated. A modification is only valid if the candidate graph is a valid Bayesian Network. This approach guarantees a locally optimal solution that maximizes the criteria function, but that solution could be far away from the true model.

There are two algorithms that extend the basic algorithm by constraining the search space, which allows for better solutions. The first algorithm is the Sparse Candidate method [8]. This method assumes that random variables that have a high measure of mutual information should be located closer to each other in the final network than variables with low mutual information. The mutual information for two discrete random variables, X and Y, is defined as follows:

I(X, Y) = \sum_{x,y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}    (5)
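A small sketch of Eq. 5 for two discrete variables, assuming their joint probability table is available; the names are illustrative.

```python
import numpy as np

# Sketch of Eq. 5 given the joint probability table p_xy (rows index x, columns index y).
def mutual_information(p_xy: np.ndarray) -> float:
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0                      # avoid log(0) for impossible outcomes
    return float((p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])).sum())

# Example: two perfectly correlated binary variables give I(X, Y) = log 2.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))
```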
3.3 Regression Based

Regression based (RB) structure learning methods determine all conditional dependencies by assuming that each variable is a functional result of a subset of all random variables. This concept is best presented in Linear Gaussian Bayesian Networks, where each random variable is Gaussian distributed, but each dependent random variable's Gaussian distribution is a linear combination of its parents. Therefore, one can clearly learn the structure for a Linear Gaussian Bayesian Network by performing a series of linear regressions with feature selection.

The RB approach is extremely scalable. For example, M. Gustafsson et. al [9] used RB methods to learn large undirected graphical model structures in Gene Regulatory Networks. Microarray data generally contains thousands of random variables and very few samples. In that particular research, the algorithm for building the graph structure is fairly straightforward. If the regression coefficients are non-zero, then there exists a dependency between the response variable and the predictors with a non-zero regression coefficient. While the method directly extracts an undirected conditional dependency graph, it may not be possible to extract an overall joint distribution from the resulting graph.
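The following sketch captures the general regression-based idea just described (it is not [9]'s exact algorithm): each variable is regressed on the remaining variables with a sparse feature selector, and non-zero coefficients become undirected edges. The regularization strength and threshold are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Sketch of regression-based dependency detection: connect each response
# variable to the predictors that receive non-zero sparse-regression weights.
def dependency_graph(data: np.ndarray, alpha: float = 0.1):
    n_vars = data.shape[1]
    edges = set()
    for i in range(n_vars):
        rest = [j for j in range(n_vars) if j != i]
        coefs = Lasso(alpha=alpha).fit(data[:, rest], data[:, i]).coef_
        for j, c in zip(rest, coefs):
            if abs(c) > 1e-8:
                edges.add((min(i, j), max(i, j)))   # undirected edge
    return edges
```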
A. Dobra et. al [5] focuses on recovering the factorized graph's joint distribution. [5] assumes that the overall joint distribution is sparse and represented by a Gaussian, N(0, Σ). This type of Markov Network is called a Gaussian Graphical Model (GGM), and it is assumed that the joint probability is represented by the following:

\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\tfrac{1}{2}(x - \mu)^T \Omega (x - \mu)\right)    (6)

where Σ is the covariance matrix, µ is the mean, and Ω is the inverse covariance matrix or precision matrix. The approach learns a Bayesian Network that is converted to a GGM by using the regression coefficients and the variance from each regression to compute the precision matrix Ω. The precision matrix is recovered by the following computation:

\Omega = (I - \Gamma)^T \Psi^{-1} (I - \Gamma)    (7)

where Γ represents an upper triangular matrix with zero diagonals whose non-zero elements represent the regression coefficients, and Ψ represents a diagonal matrix containing the variances for each performed regression. There are methods for learning Ω from the data directly, either by estimating Σ and computing Σ⁻¹ or via [7]. The inverse operation requires an O(V³) runtime, where V is the total number of random variables, which is not scalable to larger problems. More importantly, directly learning Ω via [7] only guarantees a global maximum solution if there are sufficient observations for the problem's dimensionality.
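A minimal sketch of Eq. 7, assuming the per-variable regression coefficients and residual variances have already been computed; the array names are illustrative.

```python
import numpy as np

# Sketch of Eq. 7: assemble the precision matrix from per-variable regressions.
# gamma[i, j] holds variable i's regression coefficient on variable j (upper
# triangular, zero diagonal) and psi_diag[i] holds that regression's variance.
def precision_from_regressions(gamma: np.ndarray, psi_diag: np.ndarray) -> np.ndarray:
    n = gamma.shape[0]
    b = np.eye(n) - gamma                       # (I - Gamma)
    return b.T @ np.diag(1.0 / psi_diag) @ b    # (I - Gamma)^T Psi^{-1} (I - Gamma)
```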
This model allows exact inference by using the GGM distribution's information form (Eq. 8):

p(x) \propto \exp\left(-\tfrac{1}{2} x^T J x + h^T x\right)    (8)

where J is the Ω and h is the potential vector. These methods assume h represents a vector of zeros. Based on the research presented in [27], it is possible to relax the assumption that h is zero. However, computing h requires computing Σ by inverting Ω, which is expensive even with the efficient method presented by A. Dobra et. al [5].

Overall, A. Dobra et. al [5]'s method is fairly robust and computationally feasible for a large number of random variables. However, the resulting model is heavily dependent upon the variable ordering. Each unique variable ordering may produce a different joint probability distribution, because each variable is only dependent upon variables that precede it. This assumption ensures that the resulting graph structure is a valid Bayesian Network, but it requires the method to search the set of all possible orders to determine the best structure. A. Dobra et. al [5] addresses this issue by scoring the individual variables and greedily ordering the variables in ascending order based on their assigned score.
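To make the exact-inference claim concrete, the sketch below conditions a zero-mean GGM (Eq. 8 with h = 0) on observed outputs and returns the conditional mean of the remaining variables. This is the standard Gaussian conditioning identity; the index lists and names are chosen for illustration.

```python
import numpy as np

# Sketch of exact inference in the information form (Eq. 8) with h = 0:
# partition the precision matrix over (X, Y) and condition on observed
# outputs y_obs to recover the conditional mean of the parameters X.
def conditional_mean_x(omega: np.ndarray, x_idx, y_idx, y_obs: np.ndarray) -> np.ndarray:
    j_xx = omega[np.ix_(x_idx, x_idx)]
    j_xy = omega[np.ix_(x_idx, y_idx)]
    # For a zero-mean GGM, X | Y = y is Gaussian with precision J_xx and
    # mean -J_xx^{-1} J_xy y.
    return -np.linalg.solve(j_xx, j_xy @ y_obs)
```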
The method is intended to work with small samh learns aapproach Bayesian that is converted to a Variables Variables on the E+ data set, which has many more samples than ingaNetwork building parameters andmethod only present this capability for step is solved exactly once. The direct for comput⌦ = (1 ) (1 ) (7) approach learns Bayesian Network that is converted to a estimate GGM that governs the joint distribution over learns a Bayesian Network that a estimate the GGM that governs the joint distribution over com- Dobra estimate the GGM that governs the joint distribution over k+1 T 1 is converted to 3.1 Score and Search completeness. such as microarray data. While priors ar precision matrix is recovered by the following computation: ing x requires computing the matrix (A A + ⇢I) . The et. al (2004) and Li et. al (2006) variables. The method is intended to work with small samy using the the variance i regression coefficients and T 1 ple size data sets that contain a large number of features, (e) (f) the E+ Score random variables, it variables, is not clear how itclear will perform by using the regression coefficients and the variance y. GGM by using the regression coefficients and the variance and Search (S&S) algorithms try to search through , v InGGM ) resulting max(0, v ) (14) the E+ random variables, it is not clear how it will perform ⌦ = (1 ) (1 ) (7) the E+ random it is not how it will perform matrix never changes throughout the entire optiTmatrix 1 where represents an upper triangular matrix with zero difor problems with more features than o ⇢N ⇢N PROBABILISTIC ple size data sets that contain a large number of features, to compute the precision ⌦. The the space of all possible models and select the best seen e regression Al-for such as microarray data. While priors are generally reserved 2. GRAPHICAL MODEL on the E+ data set, which has many more samples than = (1 )allows (1 ) matrix (7) on on forregression each regression to compute thethe precision ⌦. The each to ⌦compute the precision matrix ⌦. The mization process. Storing this result distributed the E+ data set, which has many more samples than the E+ data set, which has many more samples than Figure 2. Bayesian GGM’s parameter estimates compared against the actual model. The best model is selected by using a global critesuch as microarray data. While priors are generally reserved n matrix is recovered by the following computation: agonals the non-zero elements represent the regression (e.g., microarray data sets), our E+ data Learning the true joint probability distribution directly in and gover function applies the Lasso regression optimization method to perform aby very computationally in- diagonals variables. The method is intended to work with small samwhere represents an upper triangular matrix with and zero the where represents an upper triangular matrix with zero difor problems with more features than observation samples precision matrix is recovered by the following computation: precision matrix is recovered the following computation: values on 300 randomly sampled MO2 simulations. ria function, such parameter as The theThe likelihood function. However, the variables. method is intended to work with small samvariables. 
method is intended to work with small samsver be-z, which k+1 the general case is computationally intractable, especially 1 where represents an upper triangular matrix with zero difor problems with more features than observation samples are incorporated into the tensive task once and reduce all future x computations most common criteria function is the a Bayesian Information i coefficients, and represents a diagonal matrix containas well. Our E+ simulations generate non-zeroand elements represent theaelements regression coefficients, and represents Tnon-zero 1 with ple size data sets that contain large number of features, agonals the represent the regression (e.g., microarray data sets), our E+ data set requires priors when dealing large number of continuous random T 1 eutions folple size data sets that contain a large number of features, k+1 ⌦ = (1 ) (1 ) (7) T 1 on agonals the next optimization iteraple size data sets that contain a large number of features, andvariables. the non-zero elements represent (e.g., steps. Caching the values used to compute xthe to)probability disk the regression Criteria (BIC) [22]: microarray data sets), our E+ data set requires priors ⌦ = (1 ) (1 (7) ieach 1 In general, it is assumed joint fac⌦ = (1 ) (1 ) (7) a diagonal matrix containing the variances for performed regression. such as microarray data. WhileOur priors are generally reserved a reing the variances for each performed regression. There are vectors per input parameter set, which le 1represents such as microarray data. While priors are generally reserved coefficients, and a diagonal matrix containas well. E+ simulations generate 30,540 output data allows a 2.2GHz Intel Core i7 laptop to solve a univariate coefficients,torizes and into several represents diagonalcomputational matrix contain- such as well. Our E+ simulations generate 30,540reserved output data as microarray data. While priors are generally smaller moreamanageable log M ,behind the this3.9GB particular Lasso regression represents an upper triangular matrix with zero difor problems with more features than observation samples Lasso regression problem in approximately 17 min⇤ Dim[G] (4) BIC(Data; G) = L(Data; G, ✓) + Pros: Scalable for a large number of random variables where represents an upper triangular matrix with zero difor problems with more features than observation samples problems. Factorization is based purely on conditional inmethods for learning ⌦ from the data directly, either by imbalance between output and input ob ing the variances for each performed regression. There are vectors per input parameter set, which leads to a significant ing the variances for each performed regression. There are vectors per input parameter set, which leads to a significant 2 where represents an upper triangular matrix with zero difor problems with more features than observation samples ight. he main, computationally demanding, utes. dependence and is represented using a graphical model G, 1 and themethods non-zero elements represent the regression (e.g., microarray data sets), our E+ data set requires priors agonals and the non-zero elements represent the regression (e.g., microarray data sets), our E+ data set requires priors for learning ⌦ from the data directly, either by imbalance between output and input observations. For ex- produce estimating ⌃ and computing ⌃ . The inverse operation ample, 299 E+ simulations appr robmethods for learning ⌦ from the data directly, either by imbalance between output and input observations. 
For exProblem for the E+ Data where Dim[G] represents the number of parameters estionce. The direct for computwhere G is aSet graph with V nodes representing random variagonals and the non-zero elements represent the regression (e.g., microarray data sets), our E+ data set requires priors 1 method 1 11diagonal nts, and represents matrix containas well. Our E+ simulations generate 30,540 output data T 1 aE diagonal 3operation coefficients, and represents a matrix containas well. Our E+ simulations generate 30,540 output data slave mated within the graph, M is the total number of samples, ables and edges. The edges in E represent conditional estimating ⌃ and computing ⌃ . The inverse ample, 299 E+ simulations produce approximately 3.9GB of 1 1 mputing the matrix (A A + ⇢I) . The 4.2 Bayesian Regression GGM Learning is a user defined hyperparameter. Using the above prior final estimate is computed using the following equation: requires an O(V ) runtime, where V is the total number data. In addition, the simulation input p is a user defined hyperparameter. Using the above prior final estimate is computed using the following equation: estimating ⌃ and computing ⌃ . The inverse operation ample, 299 E+ simulations produce approximately 3.9GB of coefficients, and represents a diagonal matrix containas well. Our E+ simulations generate 30,540 output data i 3optiandper L is input the the log-likelihood function. dependencies between the random variables in the total graph G.number and some additional derivations, Fan Li [16]There derived the folvariances for each performed regression. are vectors parameter set, which leads to a significant er changes throughout the entire ing the variances for each performed regression. There are vectors per input parameter set, which leads to a significant E+ Building Parameters X requires an O(V ) runtime, where V is the data. In addition, the simulation input parameter space repand some additional Fan Li [16] derived the fol3 X Unlikederivations, the direct methods, the Bayesian approach assumes 1 lowing maximum a posteriori (MAP) distributions for all M are used toprobfind the best ˆS&S The graph structure is intended to represent thethe true factorof random variables, which is not scalable to larger resents a complex high dimensional stat requires an O(V ) runtime, where V is total number data. In addition, the simulation input parameter space repX = These P (✓ methods ) per (19)generally ing the variances for each performed regression. There are vectors input parameter set, which leads to a significant 1 1 oring this result allows the distributed lowing maximum a posteriori (MAP) distributions for all 1either and allthe : data M s for learning ⌦ from directly, either by imbalance between output and input observations. For exSample ID Pfor P ⌦ … from P matrix. a global Wishart prior the precision Using this methods for learning the data directly, by imbalance between output and input observations. ex- furˆ of random variables, which is not scalable to larger probresents a complex high dimensional state space,For which structure for Bayesian Networks, because the log-likelihood = P (✓ ) (19) ization for the joint probability distribution by splitting it j j 1 i generate and all random :a veryfor One Observed Output Y Mby 1 to perform computationally in1 lems. More importantly, directly learning ⌦ via [7] only ther complicates the problem. 
The inpu of variables, which is not scalable to larger probresents a complex high dimensional state space, which fur1 learning Min(V ) avg avg methods ⌦ from the data directly, either imbalance between output and input observations. For exglobal prior over the precision matrix allowed Fan Li [16] function factorizes into a series of summations. This factorj=1 into simpler components. These types of models have been P ( | , D) / ng ⌃ and computing ⌃ . The inverse operation ample, 299 E+ simulations produce approximately 3.9GB of estimating ⌃ and computing ⌃ . The inverse operation ample, 299 E+ simulations produce approximately 3.9GB of lems. xMore importantly, directly learning ⌦where via only the problem. The input parameter space k+1 M is [7] the total number of samplesther and ˆ complicates is a sample ation P P P p d reduce alltofuture computations Row ID V V … V Max(V ) avg avg x1estimates ization makes very between easy to modify an existing Bayesian and observational skewness has lead us i2 prove that the Bayesian approach ainverse e↵ectively applied to many fields, such asglobally topic 3 3 computed using Eq. 18. Afterward a few1it iterations (x ) +guarantees | | classification a global maximum solution if there are sufficient estimating ⌃ and computing ⌃ . The operation ample, 299 E+ simulations produce approximately 3.9GB of lems. More importantly, directly learning ⌦ via [7] only ther complicates the problem. The input parameter space k+1 P ( | , D) / i guarantees a global maximum solution if there are sufficient and observational skewness has lead us to avoid exploring ˆfinal an i O(V ) runtime, where V is the total number data. In addition, the simulation input parameter space reprobrequires an O(V ) runtime, where V is the total number data. In addition, the simulation input parameter space repexp( ) where M is the total number of samples and is a sample 1 values used to compute x to disk estimating s and s, we use the estimates to comi Network and compute the modified score without recom… … [2], classification [1], and disease diagnosis [13]. i 3document precision matrix possible variable orders. PDoptimal P Pall pover N N 2 pute the GGM’s precision (Eq. 7).In score. withcomputed using Eq. 18. Afterward a matrix few iterations between (x x ) + | | (16) puting the entire For example, adding anthe edge to an requires an O(V ) runtime, where V is the total number data. addition, the simulation input parameter space repfor the problems However, the direct approaches such as [7], which will most likely only 2probnirandom ij njavg iwhich ijdimensionality. l Core i7 observations laptop to solve a univariate The graphs used to represent factorized joint probabilguarantees a global maximum solution if there are sufficient and observational skewness has lead us to avoid exploring observations for the problems dimensionality. However, direct approaches such as [7], which w of variables, is not scalable to larger probresents a complex high dimensional state space, which furom variables, which is not scalable to larger resents a complex high dimensional state space, which fur299 avg Max(V ) n=1 j=i+1 j=i+1 (a) (b) That is to say, under the Bayesian formulation presented 1 exp( ) estimating s and s,orwe use existing the final Bayesian estimates to com- requires computing the new and mizanetwork ity distributions are either direct acyclic (DAG) … graphs on problem in approximately 17 mini 5.precision RESULTS bylems. 
Fan Livariables, [16], allthe variable orderings should More importantly, directly learning ⌦ via [7] only ther complicates the problem. The input parameter space of random is not scalable to larger probresents a complex high dimensional state space, which furHow to directly sample Xwhich to cover the whole space oftheoretically More importantly, learning ⌦ via [7] only ther complicates the problem. The input parameter space observations for problems dimensionality. However, the direct approaches such as [7], which will most likely only pute the GGM’s matrix (Eq. 7). previous conditional probability involving the child of the P( |✓ , graphs. , D) / P (D|The, first ,(16) ✓ )P (form | , ✓ )P (35,040 |✓that ) undirected assumes theIn graph tting Figure order to assess how well the GGM estimates building3. Figure 3(a) highlights that building parameter three has values that building parameters for all possible real values? produce the same precision matrix. guarantees a global maximum if ⌦ there are observational skewness lead usexploring to avoid exploring new edge, and adding to the previous BIC represents a Bayesian Network and sufficient the latter form assumes + 1directly +if N there 2i + |D|solution ees a global maximum solution are and observational skewness has lead ushas to avoid More importantly, learning viaparameters, [7] sufficient only ther complicates the problem. The input parameter space we experimented withand three data sets –the Finedi↵erence opti-lems. ⇠ Gamma( , occasionally differ greatly from the parameter’s mean, and Figure 3(b) compares The Wishartthat prior used by Fan Li defined as W ( , TThese ). Graingraph (FG), Model score. Order 1 (MO1), and Model Order 2 2 [16]aisMarkov Updating the penalty term is achieved by simply the graph represents Network. 5. RESULTS Pdimensionality. P 1 dimensionality. P mmuobservations for the problems However, the direct approaches such as [7], which will most likely only egression GGM Learning 1for the 1 1 tions problems However, the direct approaches such as [7], which will most likely only log M (MO2). The first data set contains building simulations guarantees a global maximum solution if there are sufficient and observational skewness has lead us to avoid exploring ✓ + ✓ + (x x ) parameters estimated via a genetic algorithm and gradient optimization. P ( i |✓i , i ,represents D) / P (D| aitypes , i , ✓both , ✓ithat )P ( ia joint |✓i ) probability user defined and In T order is a)distribution hyadding N 2 to the updated BIC score, where N is the i )P ( assume i | ihyperparameter facto assess how well the GGM estimates building built from small brute force increments across 14 building 2 ethods, the Bayesian approach assumes Bayesian Regression GGM Learning number ofdata newly estimated torizes according to their structure. However, a parameters. Bayesian perparameter diagonal matrix who’s entries are parameters, governed by The MO1 data set contains 299 building simu-parameters. observations for the problems dimensionality. 
However, the direct approaches such as [7], which will most likely only + 1 + N 2i + |D| we experimented with three sets – Fine (17) sub-⇠ Gamma( , Random Parameter Estimation 0.8 0.6 0.4 0.2 0.6 0.4 0.2 27 28 29 30 66 67 68 69 157 158 159 160 181 Variables Parameter Values Parameter Values Parameter Estimates Parameter Estimates Parameter Estimates Parameter Values Parameter Values 2 0.8 9 Parameter Values Parameter Estimates 1 1 0 27 28 29 30 66 67 68 69 157 158 159 160 181 Variables Parameter Estimates ⇢N 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 9 i Estimate Actual Parameter Values 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 1 Parameter Estimates Estimate Actual Parameter Values Parameter Estimates Bayesian Parameter Estimation N Parameter Values Parameter Estimates ⇢N MO2 Building Parameter 3 FG Building Parameter 2 1 i M 1 1 1 i 2 D P1 n=1 ni ij nj 1.5 1 0.5 j 1 N j=i+1 One−Act Zero−Act Est Zero−Est Aactual Gradient−Est GA−Est 1 0.5 0 j=1 P1 i j i 156 1 1.5 Parameter Values Parameter Value 2 2 N j=i+1 i 1 ij 2 1 i 90 1 0 0 50 100 150 Simulation 200 250 300 −0.5 0 5 10 15 20 25 30 Simulations 35 40 45 50 i P156 1 i i N j=i+1 1 i i 1 2 ij i 1 i i D n=1 i i ni 1 i 1 i N j=i+1 i j nj i 2 The most common method for performing S&S within lations and ⇠150 building parameters. The building paramor for the precision matrix. Using assumes this Network a much simpler probabilistic dependency the following Grain (FG), Model Order 1 (MO1), and Model Order 2 2 distribution: Maximizing P i | i , D) with respect to i is equivalent to eters were independently set to their minimum or maximum s precision [3].PN P P(variables, the literature is Greedy Hill Climbing, which explores the between the in which a variable is conditionally D N matrix Fan Li [16] 1 allowed 1 2 2 (MO2). The first data set contains building simulations solving a Lasso regression problem with the regularization value, while other parameters were set to their average value. Hyper-+ ✓i + independent jx pnj ) j=i+1 ij ✓i n=1 (xni j=i+1 candidate graph space byThe deleting edges, adding edges, and fined ✓i i[16]. of all other variables givenauthors its small parents. On hyperparameter set to The original This results in two simulations per building parameter. ) ayesian approach estimates a globally built from brute force increments across 14 building P (✓these ) =formulations exp( ) with microarrays, (15) MO2Xdata ihand, parameter θi the 2derived to work which changing directions. The best valid graph modification set contains pairwise edge minimum and maximum other a Markov network assumes variables are 2 2 parameters. The MO1 data set contains 299 building simuatrix over all possible variable orders. (17) typically contain 10,000 to 12,000 variables, but have very • Our FG and function, MO2 experimental results indicated that the GGM performs combinations across the same building parameter set. is selected according to a criteria generally BIC, conditionally independent from variables Y provided that for T 1 and ⇠150 building parameters. Theabuilding few samples. This allowed the authors to uselations conventional Using the FG data set, we built GGM usingparam250 simular (9) the Bayesian formulation presented This version assumes we are only splitting the optimization and a new candidate graph is generated. A modification is they are separated by variables Z. X and Y are said to be well at estimating building parameters. methods to solve forto i . 
The Wishart prior over the precision matrix used by Fan Li [16] is defined as W(\delta, T), where \delta is a user defined hyperparameter and T is a hyperparameter matrix. Fan Li [16] allowed each hyperparameter \theta_i to follow the distribution

P(\theta_i) = \frac{\gamma}{2} \exp\left(-\frac{\gamma \theta_i}{2}\right)    (15)

where \gamma is a user defined hyperparameter. Using the above prior and some additional derivations, Fan Li [16] derived the following maximum a posteriori (MAP) distributions for all \beta_i and all \psi_i:

P(\beta_i \mid \psi_i, D) \propto \exp\left(-\frac{1}{\psi_i}\left[\sum_{n=1}^{|D|}\left(x_{ni} - \sum_{j=i+1}^{N}\beta_{ij} x_{nj}\right)^2 + \sqrt{\theta_i}\sum_{j=i+1}^{N}|\beta_{ij}|\right]\right)    (16)

\psi_i^{-1} \mid \beta_i, \theta_i, D \sim \mathrm{Gamma}\left(\frac{\gamma + 1 + N - 2i + |D|}{2},\; \frac{1}{2}\left[\theta_i + \sqrt{\theta_i}\sum_{j=i+1}^{N}|\beta_{ij}| + \sum_{n=1}^{|D|}\left(x_{ni} - \sum_{j=i+1}^{N}\beta_{ij} x_{nj}\right)^2\right]\right)    (17)

Maximizing P(\beta_i \mid \psi_i, D) with respect to \beta_i is equivalent to solving a Lasso regression problem with the regularization hyperparameter set to \sqrt{\theta_i} [16]. The original authors derived these formulations to work with microarrays, which typically contain 10,000 to 12,000 variables but have very few samples. This allowed the authors to use conventional optimization methods to solve for \beta_i. However, the E+ data set used in this work contains several million data vectors, which mostly invalidates the conventional optimization approaches. Rather than using the fast grafting algorithm used by [16], we solve the Lasso regression problems using ADMM (Section 4.1). This version assumes we are only splitting the optimization problem across the training samples, and not the features. It is possible to split across both; [3] presents the ADMM problem formulation that supports this functionality.
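Since Eq. 16 reduces to a Lasso problem, the ADMM update structure can be sketched as below. This is a minimal single-machine version of the standard ADMM splitting for Lasso, shown only to illustrate the updates; the sample-distributed variant actually used for the E+ data is not reproduced here, and the function names are ours.

```python
import numpy as np

def soft_threshold(v, kappa):
    """Elementwise soft-thresholding, the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - kappa, 0.0)

def lasso_admm(A, b, lam, rho=1.0, n_iter=200):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 with the standard
    ADMM splitting (x carries the quadratic term, z carries the l1 term)."""
    m, n = A.shape
    x, z, u = np.zeros(n), np.zeros(n), np.zeros(n)
    L = np.linalg.cholesky(A.T @ A + rho * np.eye(n))  # cached factorization
    Atb = A.T @ b
    for _ in range(n_iter):
        rhs = Atb + rho * (z - u)
        x = np.linalg.solve(L.T, np.linalg.solve(L, rhs))  # x-update (ridge solve)
        z = soft_threshold(x + u, lam / rho)               # z-update (shrinkage)
        u = u + x - z                                      # scaled dual update
    return z
```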
After obtaining \beta_i's MAP estimate, it can be used to maximize P(\psi_i \mid \theta_i, \beta_i, D) with respect to \psi_i. However, P(\psi_i \mid \theta_i, \beta_i, D) depends upon the hyperparameter \theta_i, and Fan Li [16] noted that there is no known method to analytically remove the hyperparameter via integration. This means numerical methods are required to approximate the integral over the hyperparameter. There are many numerical methods for computing approximations to definite integrals, such as the Trapezoidal method and Simpson's Rule. However, the integral over \theta_i is unbounded from above, because \theta_i's values exist in the interval [0, \infty), which means Eq. 17 must be transformed into a bounded integral for these numerical methods to be applicable. Given Eq. 17's complex nature and Li's [16] recommendation to use sampling (~300 samples in our experiments), we elected to use E[\psi_i^{-1} \mid \theta_i, \beta_i, D] as our estimate for \psi_i^{-1}, which is the MAP estimate under a Gamma distribution. Based on Fan Li's [16] MAP distribution, we derived the following maximum likelihood estimate (MLE) equation for \psi_i:

\psi_i^{-1} = \frac{\gamma + 1 + N - 2i + |D|}{\theta_i + \sqrt{\theta_i}\sum_{j=i+1}^{N}|\beta_{ij}| + \sum_{n=1}^{|D|}\left(x_{ni} - \sum_{j=i+1}^{N}\beta_{ij} x_{nj}\right)^2}    (18)

Given a fixed \theta_i, we can estimate a sample \psi_i^{-1} by computing the maximum likelihood estimate (Eq. 18). By computing multiple MLE samples according to \theta_i's distribution defined in Eq. 15, we are able to estimate E[\psi_i^{-1} \mid \theta_i, \beta_i, D] using weighted sampling. In order to use weighted sampling, we sample Eq. 15 using its CDF and a uniform distribution on the interval [0, 1]. The final estimate is then computed using the following equation:

\hat{\psi}_i^{-1} = \frac{1}{M}\sum_{j=1}^{M}\psi_j^{-1}\, P(\theta_j)    (19)

where M is the total number of samples and \psi_j^{-1} is a sample computed using Eq. 18. After a few iterations of alternating between estimating the \beta's and the \psi's, we use the final estimates to compute the GGM's precision matrix (Eq. 7).
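The sampling procedure behind Eqs. 15, 18, and 19 can be sketched as follows. This is a minimal illustration under our reading of those equations (a data matrix X of shape |D| x N, a previously estimated coefficient vector beta_i for variables j = i+1, ..., N, and zero-based indexing); all helper names are hypothetical.

```python
import numpy as np

def sample_theta(gamma, M, rng):
    """Draw M samples of theta_i from the prior in Eq. 15,
    P(theta) = (gamma / 2) * exp(-gamma * theta / 2), via its inverse CDF."""
    u = rng.uniform(0.0, 1.0, size=M)       # uniform draws on [0, 1]
    return -2.0 / gamma * np.log(1.0 - u)   # inverse CDF of Exp(rate = gamma / 2)

def psi_inv_mle(theta, gamma, X, i, beta_i):
    """Eq. 18: estimate of psi_i^{-1} for a fixed theta_i.
    X is a |D| x N data matrix; beta_i holds coefficients for j = i+1, ..., N-1
    (zero-based indexing, so the 'i' in Eq. 18 corresponds to i + 1 here)."""
    D, N = X.shape
    residual = X[:, i] - X[:, i + 1:] @ beta_i
    denom = theta + np.sqrt(theta) * np.abs(beta_i).sum() + np.sum(residual ** 2)
    return (gamma + 1 + N - 2 * (i + 1) + D) / denom

def estimate_psi_inv(gamma, X, i, beta_i, M=300, seed=0):
    """Weighted-sampling estimate of psi_i^{-1} following Eq. 19."""
    rng = np.random.default_rng(seed)
    thetas = sample_theta(gamma, M, rng)
    weights = 0.5 * gamma * np.exp(-0.5 * gamma * thetas)   # P(theta_j), Eq. 15
    samples = np.array([psi_inv_mle(t, gamma, X, i, beta_i) for t in thetas])
    return float(np.mean(samples * weights))
```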
5. RESULTS

In order to assess how well the GGM estimates building parameters, we experimented with three data sets: Fine Grain (FG), Model Order 1 (MO1), and Model Order 2 (MO2).

1. Fine Grain (FG): contains building simulations built from small brute force increments across 14 building parameters. We built a GGM using 250 simulations sampled without replacement and evaluated the model using 150 test simulations.

2. Model Order 1 (MO1): contains 299 building simulations and ~150 building parameters. The building parameters were independently set to their minimum or maximum value, while all other parameters were set to their average value. This results in two simulations per building parameter. The entire MO1 data set was used to build a GGM.

3. Model Order 2 (MO2): contains pairwise minimum and maximum combinations across the same building parameter set. The resulting MO1-based model is evaluated using 300 sampled MO2 simulations.

A sketch of how these one-at-a-time and pairwise designs can be generated follows this list.
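The paper does not describe how the parameter settings were generated in code, so the helpers below and their list-of-lists representation are our assumptions; in particular, holding the remaining parameters at their means for MO2 is an assumption.

```python
from itertools import combinations

def mo1_settings(mins, maxs, means):
    """One-at-a-time design: each parameter takes its min and its max once,
    with every other parameter held at its mean (two runs per parameter)."""
    runs = []
    for i in range(len(means)):
        for extreme in (mins[i], maxs[i]):
            setting = list(means)
            setting[i] = extreme
            runs.append(setting)
    return runs

def mo2_settings(mins, maxs, means):
    """Pairwise design: min/max combinations for every pair of parameters."""
    runs = []
    for i, j in combinations(range(len(means)), 2):
        for vi in (mins[i], maxs[i]):
            for vj in (mins[j], maxs[j]):
                setting = list(means)
                setting[i], setting[j] = vi, vj
                runs.append(setting)
    return runs
```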
[Figure residue: panels titled FG Building Parameter 2 and MO2 Building Parameter 3 plot Parameter Estimates against Parameter Values per variable with Estimate/Actual legends; panel groups are labeled Random Parameter Estimation and Bayesian Parameter Estimation; a further plot shows Parameter Values across Simulations with legend One-Act, Zero-Act, Est, Zero-Est, Actual, Gradient-Est, GA-Est.]

Analyzing Figure 1 illustrates that the building parameter estimates produced by the GGM are mostly better than the estimates generated by randomly guessing with a uniform [0, 1] distribution. The overall GGM absolute error rate is statistically better, with 95% confidence, than the uniform distribution's error rate (4.05±0.98 vs. 5.07±1.01). However, only the error rates for variables 1, 2, 3, 8, 10, 11, 12, 13, and 14 are statistically better than their random guessing counterparts; the other variables are not statistically different. While those variables are not statistically different, there is no clear indication that the GGM model fails to represent them. This indicates that the GGM overall is successfully modeling the FG building parameters. In fact, analyzing Figure 1(a) indicates that the GGM isolates the building parameter means very well. While the uniform distribution matches the means and variances (Figure 1(b)), it appears this matching is only possible because the actual building parameters have a large distribution range that mostly centers at 0.5, the uniform distribution's expected value.

Similar to the FG data set results, the Bayesian GGM presents excellent performance on the MO2 data set (Figure 2). The Bayesian method is better than a uniform distribution at estimating all building parameters, and it has a better overall error rate (13.01±0.85 vs. 41.6936±2.13). In addition, the model produces statistically better predictions for all variables.
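To make the error-rate comparisons above concrete, the sketch below computes an absolute error rate and compares it against a uniform [0, 1] guessing baseline. The paper does not state which significance test or error normalization it used, so the mean-absolute-error definition and Welch's t-test here are our assumptions.

```python
import numpy as np
from scipy import stats

def absolute_error_rate(estimates, actual):
    """Mean absolute error between estimated and actual (normalized) parameters."""
    return float(np.mean(np.abs(np.asarray(estimates) - np.asarray(actual))))

def compare_to_uniform_baseline(estimates, actual, n_trials=1000, seed=0):
    """Compare per-parameter model errors against a Uniform[0, 1] guessing
    baseline using Welch's t-test (one illustrative choice of test)."""
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual, dtype=float)
    model_err = np.abs(np.asarray(estimates, dtype=float) - actual)
    guesses = rng.uniform(0.0, 1.0, size=(n_trials, actual.size))
    baseline_err = np.abs(guesses - actual).mean(axis=0)   # average over random trials
    t_stat, p_value = stats.ttest_ind(model_err, baseline_err, equal_var=False)
    return model_err.mean(), baseline_err.mean(), p_value
```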
6. DISCUSSION

While the parameters in FG and MO2 have well defined means, they have instances where they significantly deviate from their estimated means, which is not well implied by the results (Figures 1(a) and 2). Figure 3(a), which also includes estimates from a gradient based constrained optimization method, illustrates this behavior.

In summary:

• Our FG and MO2 experimental results indicated that the GGM performs well at estimating building parameters.

• The Bayesian models built using the FG and full MO1 data sets are statistically better than the uniform distribution, which is expected.

• However, our current GGMs have difficulty estimating parameters that deviate significantly from the mean. This implies we need to explore different hyper-parameter settings, which may induce more estimation variance, or possibly a mixture model approach.
Acknowledgments
This work was funded by field work proposal CEBT105 under the Department of Energy Building Technology Activity Number BT0305000. This research used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725. Our work has been enabled and supported by the NSF funded RDAV Center of the University of Tennessee, Knoxville (NSF grant nos. ARRA-NSF-OCI-0906324 and NSF-OCI-1136246).

References
[5] A. Dobra, C. Hans, B. Jones, J. Nevins, G. Yao, and M. West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90(1):196–212, 2004.
[16] F. Li, Y. Yang, and E. Xing. From lasso regression to feature vector machine. Advances in Neural Information Processing Systems, 18:779, 2006.