Individual Household Modeling of Photovoltaic Adoption

Joshua Letchford and Kiran Lakkaraju
Sandia National Laboratories
{jletchf,klakkar}@sandia.gov

Yevgeniy Vorobeychik
Vanderbilt University
yevgeniy.vorobeychik@vanderbilt.edu

Abstract

We consider the question of predicting solar adoption using demographic, economic, peer effect, and predicted system characteristic features. We use data from San Diego county to evaluate both discrete and continuous models. Additionally, we consider three types of sensitivity analysis to identify which features seem to have the greatest effect on prediction accuracy. In this work, we focus on developing a better understanding of the features of residential households that can help in predicting PV adoption. We incorporate economic factors as well as peer effect factors in our model. Apart from decreasing customer acquisition costs, the individual household level model developed here can be integrated into agent-based simulation approaches, allowing us to test the effectiveness of policy changes.

Introduction

The SunShot Initiative (Sunshot 2011) has the goal of reducing the total cost of photovoltaic (PV) solar energy systems until they are "cost-competitive" with other forms of energy. At that cost, PV could be widely adopted, allowing the United States (US) to increase its use of clean energy – a goal of the Department of Energy (U.S. Department of Energy 2011). Residential PV adoption will be a significant portion of this increase in use. Studies indicate that 9.2% of the cost of residential PV comes from customer acquisition (Friedman et al. 2013). Better identification of potential customers would have the direct impact of reducing customer acquisition costs.

Purchasing PV (or, in equivalent terms, "adopting PV") is a complex, long-term decision that incorporates multiple factors. As with other major purchases (such as automobiles), several economic factors play a role, such as national and state incentives and the customer's current energy costs. What makes adoption of PV (and other renewable energy and energy efficiency products) unique is the impact of non-economic aspects. For instance, (Gromet, Kunreuther, and Larrick 2013) demonstrate the impact of political ideology on energy-efficient lightbulb purchases. In (Rai and Robinson 2013; Rai and McAndrews 2012) the authors show that in addition to financial considerations, demographic aspects of customers (such as education level) and peer effects (the influence of friends, family, co-workers, and neighbors) can also affect the probability of adoption. It is this combination of economic and non-economic considerations that makes predicting who will purchase PV a complex problem. An important contribution of our work is to quantitatively assess the impact of peer effects on PV adoption relative to other economic and non-economic variables.

Related Work

It has long been noted that peer effects play a significant role in the adoption of new technology. For instance, (Rogers 2003) highlights the importance of "opinion leaders" and interpersonal communication in the diffusion of innovations. Following these more general case studies, several studies have explored peer effects specifically for the adoption of residential PV. In (Rai and Robinson 2013), the authors looked at "information channels" for PV adopters.
Using a household level dataset of PV adopters in Texas (USA), they found that both passive (merely seeing PV systems in the neighborhood) and active (communication with neighbors) peer effects occur. (Bollinger and Gillingham 2012) develop a simple method to estimate the impact of PV adoption at the zip code level using data from the state of California. They found that additional PV installations can increase the probability of adoption in a zip code area by 0.78 percentage points. Our work involves peer effect variables at a finer granularity than the zip code level (circles of radii 1/4 mile, 1/2 mile, 1 mile, etc.), which are not limited by the arbitrary divisions of zip codes. In addition, we propose a new, machine learning based method for identifying features and estimating their impact.

Data

We developed a list of candidate features drawn from demographic, peer effect, and both micro- and macroeconomic characteristics. This data came from a variety of sources, with the majority of it coming from data compiled by the California Center for Sustainable Energy (CCSE) on both adopters and non-adopters in the San Diego region. After an initial cleaning step in which we removed incomplete entries, we had 9,000 adopters and 390,000 total households (including both adopters and non-adopters). These adoptions occurred over a period of 72 months, between June 2007 and May 2013. Additionally, we used publicly available data from the US census, national unemployment data, and national GDP data. These features can be broadly broken down into the following categories:

• Demographic variables, such as property size and value, number of bedrooms or bathrooms, whether the property has a pool, and whether the property is lived in by the owner.
• Economic variables, including tract-level information such as average household income, and national-level information such as GDP and the unemployment rate.
• Peer effect variables, measuring the number of other solar adopters nearby, for both distance-based and zip-code-based definitions of "near."
• System characteristics, such as predicted household energy consumption, system size, and incentive.

A full listing of the features used appears at the end of this document. To model dynamic features whose values change from month to month (such as peer effects, potential incentive, etc.), we created a data point for each of the 390,000 households in each month, giving us around 28 million total data points.

Modeling Solar Adoption

First, we used a simple linear model to predict, for each month, the number of watts we would expect each household to install if it were to adopt and how much it would expect to spend. We predicted these two quantities (and several derived quantities) using models that took into account the features described above (demographic information, climate zones, and census economic data), plus a few additional features specific to these models, such as which incentive step would have been active in that month and the climate zone in which the household is located. While our predictions here were not extremely accurate (our prediction of system size, in kilowatts, had an $R^2$ of 0.333 and our prediction of system cost had an $R^2$ of 0.302), we will see in the remainder of this work that the features we derived from these predictions are still predictive of solar adoption.

Our main effort focused on predicting the probability that a household that has not yet adopted solar either adopts in a given month or fails to do so. Thus, we can frame this as a binary logistic regression model, where our two classes ($Y$) are "adopted during the month in question" and "did not adopt during this month," and the features ($X$) are drawn from the data described above. Logistic regression is a technique that allows us to predict the probability of a particular data point belonging to either of these classes, as opposed to classification, where we would instead merely predict which class is more likely. As we are interested in making this prediction only for those who have not yet adopted, once a household adopts we do not add data points for that household for any months after adoption.

We will let $L(\Theta|x)$ be the likelihood, $L(\Theta|x) = P(x|\Theta) = \prod_{i=1}^{n} P(Y|X)^{y_i} (1 - P(Y|X))^{1-y_i}$. To measure the accuracy of this model we use the likelihood ratio $-2(\ln(L(\Theta_1|x)) - \ln(L(\Theta_2|x)))$, where $\Theta_1$ is our fitted model and $\Theta_2$ is the null (zero-feature) model. To avoid overfitting the data we used a standard technique known as cross-validation, in which one splits a data set into partitions, training the model on a training dataset and evaluating it against a held-out testing dataset. To compute this likelihood measure, we used five-fold cross-validation of the 28 million entry dataset described above, where we trained an instance of the model on each fold and evaluated it on each of the other four folds. This set of logistic regression models, using all 34 features described at the end of this document, gave an average likelihood ratio of 1339.38.
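As a concrete illustration of this evaluation, the sketch below (in Python with scikit-learn; the paper does not name an implementation) fits the monthly logistic regression on each training split and scores the held-out split with the likelihood ratio against an intercept-only null model. The column name "adopted" and the feature columns are hypothetical placeholders, and the standard k-fold train/test split is used rather than necessarily the authors' exact protocol of training on a single fold.

```python
# Minimal sketch, not the authors' code: logistic adoption model scored by the
# likelihood ratio -2(ln L(null) - ln L(fitted)) under 5-fold cross-validation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.metrics import log_loss

def likelihood_ratio(y_true, p_fitted, p_null):
    """-2 * (total log-likelihood of the null model minus that of the fitted model)."""
    n = len(y_true)
    # log_loss returns the mean *negative* log-likelihood, so scale by -n.
    ll_fitted = -log_loss(y_true, p_fitted, labels=[0, 1]) * n
    ll_null = -log_loss(y_true, p_null, labels=[0, 1]) * n
    return -2.0 * (ll_null - ll_fitted)

def cross_validated_ratio(df, feature_cols, target_col="adopted", folds=5):
    """Average likelihood ratio of a logistic model over k-fold cross-validation."""
    X, y = df[feature_cols].to_numpy(), df[target_col].to_numpy()
    ratios = []
    for train_idx, test_idx in KFold(folds, shuffle=True, random_state=0).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        p_fitted = model.predict_proba(X[test_idx])[:, 1]
        # Null (zero-feature) model: the base adoption rate from the training split.
        p_null = np.full(len(test_idx), y[train_idx].mean())
        ratios.append(likelihood_ratio(y[test_idx], p_fitted, p_null))
    return float(np.mean(ratios))
```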
We also considered a survival model, which models the time it takes for an event (adoption) to occur. Specifically, we chose the Cox proportional-hazards (Cox-PH) regression model (Cox and Oakes 1984), which ties the features multiplicatively to the hazard rate. The hazard rate is defined as the event rate at time $t$, conditional on survival up to time $t$: $h(t) = \lim_{\Delta t \to 0} \frac{\Pr[t \le T \le t + \Delta t \mid T \ge t]}{\Delta t}$. In the Cox-PH model, this hazard rate is made up of two parts that must be learned by the model, a baseline hazard level and effect parameters that determine the rate at which each feature modifies the hazard: $h_i(t) = h_0(t)\exp(\beta X_i)$, where $h_0(t)$ is the baseline hazard. Even though this baseline hazard can take a number of different forms, the covariates enter the model linearly. To learn the $\beta$ parameters we use the method of maximum partial likelihood estimation, developed in the same work in which the model was introduced. At a high level, we look at the number of events at each timestep and use this to calculate the probability of the event occurring for the $i$th individual; maximizing the log-likelihood of this allows us to estimate $\beta$.

Since the values of some of our features change each month, we have what are known as time-dependent covariates. These are handled, as in the logistic regression model, by creating a new data point each time the features change (in our case, every month) and then treating each of these data points as right censored and left truncated. Once we have learned the parameters of the model, we still need to generate a measurement we can compare directly to the logistic regression model. To generate a comparable prediction we predict the expected number of events over a set follow-up time (one month, or 30 days, where each timestep in our model is one day) for each of the entries in the test set. Again using five-fold cross-validation, the set of Cox-PH models, using only 25 features (we found that we had to remove seven of the features or the model failed to converge), gave an average likelihood ratio of 1466.51, outperforming the logistic regression model.

Finally, to compare with a stochastic social influence model of adoption behavior, we compared these two models to both logistic regression and Cox-PH models that used only two of the features that appeared in the full model: the number of adopters in the same zip code as the potential adopter and an estimate of net present value (NPV). Our NPV estimate took into account both the cost of installing solar and the expected reduction in electricity costs, using our simple linear model of system size. This limited regression model outperformed the null model by 267.19, while the limited Cox-PH model outperformed the null model by 617.11. This suggests that while these two features are important, there is significant value in the other features in the model.
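The paper does not say which software was used; the sketch below shows one way the time-varying Cox-PH setup described above could be fit with the lifelines library, treating each household-month row as a left-truncated, right-censored interval. The file name and column names (household_id, start, stop, adopted, and the covariate columns) are illustrative assumptions, not the authors' schema.

```python
# Minimal sketch, assuming the lifelines library and a long-format table with
# one row per household-month; every column other than the id/interval/event
# columns is treated as a (possibly time-varying) covariate.
import pandas as pd
from lifelines import CoxTimeVaryingFitter

records = pd.read_csv("household_months.csv")  # hypothetical monthly dataset

ctv = CoxTimeVaryingFitter()
ctv.fit(
    records,
    id_col="household_id",
    start_col="start",    # day the monthly interval opens (left truncation)
    stop_col="stop",      # day the interval closes (right censoring)
    event_col="adopted",  # 1 only in the month the household adopts
)

# Estimated effect parameters beta (log hazard ratios) for each feature.
ctv.print_summary()

# Relative risk exp(beta * X_i) per row; combined with the learned baseline
# hazard over a 30-day follow-up window, this yields the expected number of
# adoptions used to compare against the logistic regression model.
partial_hazard = ctv.predict_partial_hazard(records)
```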
Feature Analysis

To further explore the importance of each feature, we performed three types of sensitivity analysis on this data. First, we used a standard ANOVA (analysis of variance) technique. Here we started with an empty model, adding the features one at a time while calculating the optimal model parameters for each successive model. The top ten features identified with this technique appear in Figure 1.

Figure 1: Top ten features from ANOVA.

However, this technique has several drawbacks. First, it is not evaluated on a held-out test set, so it risks overfitting. Second, since the order in which variables are added is arbitrary, whenever two features overlap in their predictive power, whichever appears first receives a disproportionate amount of the "credit." To deal with these drawbacks, we ran two further experiments, both evaluated using the same five-fold cross-validation as above, training a model with each set and evaluating each of these models against the other four sets.

First, we looked at each feature individually, training a model that used only that feature and comparing it to the null model. The top ten features identified by this procedure appear in Figure 2.

Figure 2: Top ten features from individual feature selection.

Second, we used a procedure that iteratively removed the least predictive feature in each round, starting with the full model. This was a greedy procedure: we determined the least predictive feature by comparing the models that result from individually removing each of the current model's features (if we were considering a model with 20 features, we would compare 20 different models of 19 features each). Pseudocode for this procedure appears in Algorithm 1.

Algorithm 1: Greedy feature-elimination procedure
    current_features <- full model features
    while current_features is not empty do
        min_feature <- arg max over feature in current_features of
                       LogisticRegression(current_features \ {feature})
        current_features <- current_features \ {min_feature}
    end
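The sketch below is one way Algorithm 1 could be implemented, reusing the hypothetical cross_validated_ratio scorer from the earlier logistic regression sketch; it illustrates the procedure's logic rather than reproducing the authors' implementation.

```python
# Minimal sketch of the greedy backward elimination in Algorithm 1: in each
# round, drop the feature whose removal leaves the best-scoring model.
def greedy_elimination(df, features, score=cross_validated_ratio):
    current = list(features)
    elimination_order = []  # least predictive features are removed first
    while current:
        scores = {}
        for f in current:
            rest = [g for g in current if g != f]
            # An empty feature set corresponds to the null model (ratio 0).
            scores[f] = score(df, rest) if rest else 0.0
        # The feature whose removal hurts the score least is the least predictive.
        min_feature = max(scores, key=scores.get)
        current.remove(min_feature)
        elimination_order.append(min_feature)
    return elimination_order

# The last ten features removed are the ten most predictive ones, i.e. the
# ranking reported in Figure 3.
# top_ten = greedy_elimination(df, all_features)[-10:][::-1]
```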
The top ten features identified by this procedure appear in Figure 3.

Figure 3: Top ten features from greedy feature selection.

Immediately salient in these results is the fact that several features appear in more than one of these lists:

• Demographics: owner occupation, number of bathrooms, pool, total livable square feet, and acreage.
• Economic and consumption: national unemployment rate and average family income.
• Peer effects: number of installations in the zip code.
• Predicted system characteristics: cost per watt.

Conclusion

While our results agree with previous work showing that peer effects are important in predicting the adoption of PV, our experiments highlight that they are only one of a number of factors that appear to be predictive of PV adoption. In addition, we found it interesting that even though our metric measured accuracy on a discrete prediction, the continuous (Cox-PH) model outperformed the discrete logistic regression model.

Acknowledgements

This work was funded by the DOE SunShot Initiative under the Solar Energy Evolution and Diffusion Studies (SEEDS) program (Agreement 26153). Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

References

Bollinger, B., and Gillingham, K. 2012. Peer effects in the diffusion of solar photovoltaic panels. Marketing Science 31(6):900–912.

Cox, D., and Oakes, D. 1984. Analysis of Survival Data. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.

Friedman, B.; Ardani, K.; Feldman, D.; Citron, R.; and Margolis, R. 2013. Benchmarking non-hardware balance-of-system (soft) costs for U.S. photovoltaic systems, using a bottom-up approach and installer survey – second edition. NREL/TP-6A20-60412, National Renewable Energy Laboratory.

Gromet, D. M.; Kunreuther, H.; and Larrick, R. P. 2013. Political ideology affects energy-efficiency attitudes and choices. Proceedings of the National Academy of Sciences 201218453. PMID: 23630266.

Rai, V., and McAndrews, K. 2012. Decision-making and behavior change in residential adopters of solar PV. In Proceedings of the Behavior, Energy and Climate Change Conference.

Rai, V., and Robinson, S. A. 2013. Effective information channels for reducing costs of environmentally-friendly technologies: evidence from residential PV markets. Environmental Research Letters 8(1):014044.

Rogers, E. M. 2003. Diffusion of Innovations. Free Press, fifth edition.

SunShot Initiative. 2011. http://energy.gov/eere/sunshot/sunshot-initiative.

U.S. Department of Energy. 2011. Strategic plan. Technical Report DOE/CF-0067, U.S. Department of Energy.
Full Listing of Features

Below is a listing of all features considered in the full models, together with whether each feature is binary (B) or numeric (N).

Demographic
• Is the property occupied by the owner? (B)
• Total property value, including both land and building values. (N)
• Number of bedrooms in the property. (N)
• Number of bathrooms in the property. (N)
• Does the property include a pool? (B)
• Does the property span at least 0.25 acres of land (acreage)? (B)
• Total livable square feet of the property. (N)
• Is the property designated as having a particularly pleasant (valuable) view? (B)

Peer effects
• The number of completed PV installations, by the current month, in the zip code the house belongs to. (N)
• The fraction of houses that have completed PV installations in the zip code the house belongs to. (N)
• The number of installations within a one-mile circle centered on the house at the start of the month. (N)
• The number of installations within a one-mile circle centered on the house at the start of the previous month. (N)
• The fraction of installations within a one-mile circle centered on the house at the start of the previous month. (N)
• The number of installations within a quarter-mile circle centered on the house at the start of the month. (N)
• Were there at least three completed installations within a quarter mile of the property by the start of this month? (B)
• Are there at least three households within a quarter mile of the property who had at least started an application with CCSE by the start of the month? This also includes those who have advanced beyond this step, provided they have not canceled. (At least 3 applications in 0.25 miles) (B)
• Are there at least three households within a quarter mile of the property who had at least started an application with CCSE by the start of the previous month? This also includes those who have advanced beyond this step, provided they have not canceled. (B)
(A sketch of one way such radius-based counts can be computed appears after this listing.)

Economic and Consumption
• The baseline KWH for that zip code for that month, determined by climate zone by San Diego Power and Gas. (N)
• Average KWH consumption for that zip code for that month. (N)
• If the zip code's average KWH is higher than 1.3 times the baseline charging rate, the amount that would fall in tiers 3 and 4 for that month; otherwise 0. (N)
• Average family income for the tract this household falls in. (N)
• Median household income for the tract this household falls in. (N)
• Average household income for the tract this household falls in. (N)
• Indicator variable on income, ranging from 1 to 5 for each tract. (D)
• Fraction of households in the tract this household falls in that have incomes over the poverty level. (N)
• Change in US GDP for this month. (N)
• Published national unemployment rate for this month. (N)

System Characteristics (Predicted)
• Predicted cost per KW if this household were to install PV. (N)
• Which incentive step this installation would be covered by (assuming the application was made in this month). (N)
• Amount of incentive per KW (determined by incentive step). (N)
• Predicted monthly electricity cost for the household. (N)
• Predicted monthly electricity cost after installing solar for the household. (N)
• Predicted fraction of power costs that would be covered by the solar installation. (N)
• Net present value, taking into account the estimated savings, cost, and incentive of the installed system. (N)
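The radius-based peer effect counts above require household coordinates and installation dates. The sketch below shows one plausible way such counts could be derived, assuming latitude/longitude columns and a great-circle (haversine) distance; the column names, data layout, and one-mile radius are illustrative assumptions, not the authors' pipeline.

```python
# Minimal sketch: count installations completed before a given month within a
# fixed radius of a house, from hypothetical tables of houses and installations.
import numpy as np

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

def installs_within_radius(house, installs, month_start, radius_miles=1.0):
    """Installations completed before `month_start` within `radius_miles` of `house`.

    `house` is a record with 'lat'/'lon'; `installs` is a DataFrame with
    'lat', 'lon', and 'completed_date' columns (hypothetical schema).
    """
    prior = installs[installs["completed_date"] < month_start]
    dists = haversine_miles(house["lat"], house["lon"],
                            prior["lat"].to_numpy(), prior["lon"].to_numpy())
    return int((dists <= radius_miles).sum())
```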