Energy Market Prediction: Papers from the 2014 AAAI Fall Symposium
Individual Household Modeling of Photovoltaic Adoption
Joshua Letchford and Kiran Lakkaraju
Sandia National Laboratories
{jletchf,klakkar}@sandia.gov

Yevgeniy Vorobeychik
Vanderbilt University
yevgeniy.vorobeychik@vanderbilt.edu
Abstract
We consider the question of predicting solar adoption using demographic, economic, peer effect, and predicted system characteristic features. We use data from San Diego County to evaluate both discrete and continuous models. Additionally, we consider three types of sensitivity analysis to identify which features seem to have the greatest effect on prediction accuracy.
Introduction
The SunShot Initiative (Sunshot 2011) has the goal of reducing the total cost of photovoltaic (PV) solar energy systems until they are "cost-competitive" with other forms of energy. At that cost, PV could be widely adopted, allowing the United States (US) to increase its use of clean energy, a goal of the Department of Energy (U.S. Department of Energy 2011). Residential PV adoption will be a significant portion of this increase in use.
Studies indicate that 9.2% of the cost of residential PV
comes from customer acquisition (Friedman et al. 2013).
Better identification of potential customers will have the direct impact of reducing customer acquisition costs.
Purchasing PV (or, in equivalent terms, "adopting PV") is a complex, long-term decision that incorporates multiple factors. Like other major purchases (such as automobiles), several economic factors play a role, such as national and state incentives and the customer's current energy costs.
What makes adoption of PV (and other renewable energy and energy efficiency products) unique is the impact of non-economic aspects. For instance, (Gromet, Kunreuther, and Larrick 2013) demonstrates the impact of political ideology on energy-efficient lightbulb purchases. In (Rai and Robinson 2013; Rai and McAndrews 2012) the authors show that, in addition to financial considerations, demographic aspects of customers (such as education level) and peer effects (the influence of friends, family, co-workers, and neighbors) can also impact the probability of adoption.
It is this combination of economic and non-economic considerations that makes predicting who will purchase PV a
complex problem.
In this work, we focus on developing a better understanding of the features of residential households that can help in predicting PV adoption. We incorporate economic factors as well as peer effect factors in our model. Apart from decreasing customer acquisition costs, the individual household-level model developed here can be integrated into agent-based simulation approaches, allowing us to test the effectiveness of policy changes.
Related Work
An important contribution of our work is to quantitatively assess the impact of peer effects on PV adoption in relation to other economic and non-economic variables. It has long been noted that peer effects play a significant role in the adoption of new technology. For instance, (Rogers 2003) highlights the importance of "opinion leaders" and interpersonal communication in the diffusion of innovations.
Following these more general case studies, several studies have explored peer effects specifically for the adoption of residential PV. In (Rai and Robinson 2013), the authors looked at "information channels" for PV adopters. Using a household-level dataset of PV adopters in Texas (USA), they found that both passive (merely seeing PV systems in the neighborhood) and active (communication with neighbors) peer effects occur. (Bollinger and Gillingham 2012) develop a simple method to estimate the impact of PV adoption at the zip code level using data from the state of California. They found that additional PV installations can increase the probability of adoption in a zip code area by 0.78 percentage points. Our work involves peer effect variables at a finer granularity than the zip code level (circles of radius 1/4 mile, 1/2 mile, 1 mile, etc.), which are not limited by the arbitrary boundaries of zip codes. In addition, we propose a new, machine-learning-based method for identifying features and estimating their impact.
Data
We developed a list of candidate features drawn from domestic, peer effect, and both micro- and macro-economic characteristics. These data came from a variety of sources, with the majority coming from data compiled by the California Center for Sustainable Energy (CCSE) on both adopters and non-adopters in the San Diego region.
After an initial cleaning step in which we removed incomplete entries, we had roughly 9,000 adopters and 390,000 total households (including both adopters and non-adopters). These adoptions occurred over a period of 72 months, between June 2007 and May 2013. Additionally, we used publicly available data from the US census, along with national unemployment and national GDP data. These features can be broadly broken down into the following categories:
• Demographic variables, such as property size, value, number of bedrooms or bathrooms, whether the property has a pool, and whether the property is occupied by the owner.
• Economic variables, including tract-level information such as average household income and national-level information such as GDP and the unemployment rate.
• Peer effect variables, measuring the number of other solar adopters nearby, for both distance-based and zip-code-based definitions of "near".
• System characteristics, such as predicted household energy consumption, system size, and incentive.
A full listing of the features used appears at the end of this document.
To model dynamic features whose values change from month to month (such as peer effects, the potential incentive, etc.), we created a data point for each of the 390,000 households in each month, giving us around 28 million total data points.
Modeling Solar Adoption
First, we used a simple linear model to predict the number of watts we would expect each household to install if it were to adopt (again, for each month) and how much it would expect to spend. We predicted these two quantities (and several derived quantities) using models that took into account the features described above, demographic information, climate zones, and census economic data, plus a few additional features specific to these models, such as the incentive step that would have been active in that month and the climate zone in which the household is located. While our predictions here were not especially accurate (our prediction of system size in kilowatts had an R^2 of 0.333 and our prediction of system cost had an R^2 of 0.302), we will see in the remainder of this work that the features derived from these predictions are still predictive of solar adoption.
Our main effort focused on predicting the probability that a household that has not yet adopted solar either adopts in a given month or fails to do so. We can frame this as a binary logistic regression model, where our two classes (Y) are "adopted during the month in question" and "did not adopt during this month", and where we use a set of features (X) drawn from the data described above. Logistic regression allows us to predict the probability of a particular data point falling into either of these classes, as opposed to classification, where we would merely predict which class is more likely. As we are interested in making this prediction only for households that have not yet adopted, once a household adopts we do not add data points for that household in any month after adoption.
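To make this setup concrete, the following sketch (our own illustration, not the authors' code) shows one way the household-month panel and the binary adoption label could be built; the column names household_id and adoption_month and the pandas-based approach are assumptions.

```python
# Illustrative sketch: expand each household into one row per month up to and
# including its adoption month (or the end of the study for non-adopters), and
# label the row 1 only in the month the household adopted.
import pandas as pd

N_MONTHS = 72  # June 2007 - May 2013

def build_panel(households: pd.DataFrame) -> pd.DataFrame:
    """households has one row per household, with a hypothetical
    'adoption_month' column in 1..72, or NaN for non-adopters."""
    rows = []
    for _, h in households.iterrows():
        adopted = pd.notna(h["adoption_month"])
        last = int(h["adoption_month"]) if adopted else N_MONTHS
        for month in range(1, last + 1):
            rows.append({
                "household_id": h["household_id"],
                "month": month,
                # y = 1 only in the month of adoption; no rows after adoption
                "y": int(adopted and month == last),
            })
    return pd.DataFrame(rows)
```

Monthly feature values (peer effects, the incentive step, and so on) would then be joined onto this panel by household and month.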
We will let L(Θ|x) be the likelihood, $L(\Theta \mid x) = P(x \mid \Theta) = \prod_{i=1}^{n} P(Y \mid X)^{y_i}\,(1 - P(Y \mid X))^{1 - y_i}$. To measure the accuracy of this model we use the likelihood ratio $-2\big(\ln L(\Theta_2 \mid x) - \ln L(\Theta_1 \mid x)\big)$, where $\Theta_1$ is our fitted model and $\Theta_2$ is the null (zero-feature) model. To avoid overfitting the data we used a standard technique known as cross-validation, in which one splits a data set into partitions: a training dataset used to train the model, and a testing dataset against which the model is tested. To compute this likelihood measure, we used five-fold cross-validation of the 28 million entry dataset described above, where we trained an instance of the model on each fold and evaluated it on the other four folds. This set of logistic regression models, using all 34 features described at the end of this document, gave an average likelihood ratio of 1339.38.
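As a rough illustration of this evaluation metric (not the authors' code), the sketch below computes the cross-validated likelihood-ratio statistic with scikit-learn; the arrays X and y, the solver settings, and the clipping constant are assumptions.

```python
# Sketch: cross-validated likelihood-ratio statistic of a fitted logistic
# regression versus the null (intercept-only) model, following the paper's
# setup of training on each single fold and evaluating on the other four.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

def _loglik(y_true, p):
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return np.sum(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def likelihood_ratio_cv(X, y, n_splits=5):
    ratios = []
    # note: roles are swapped so the model trains on one fold and is
    # evaluated on the remaining four, as described in the text
    for eval_idx, train_idx in KFold(n_splits=n_splits, shuffle=True).split(X):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        p_fit = model.predict_proba(X[eval_idx])[:, 1]
        p_null = np.full(len(eval_idx), y[train_idx].mean())  # zero-feature model
        ratios.append(-2 * (_loglik(y[eval_idx], p_null) - _loglik(y[eval_idx], p_fit)))
    return float(np.mean(ratios))
```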
We also considered a survival model, which models the time it takes for an event (adoption) to occur. Specifically, we chose the Cox proportional-hazards (Cox-PH) regression model (Cox and Oakes 1984), which ties the features multiplicatively to the hazard rate. The hazard rate is defined as the event rate at time t, conditional on having survived to time t: $h(t) = \lim_{\Delta t \to 0} \Pr[t \le T \le t + \Delta t \mid T \ge t] / \Delta t$. In the Cox-PH model, this hazard rate is made up of two parts that need to be learned by the model, a baseline hazard level and effect parameters that determine the rate at which each feature modifies the hazard: $h_i(t) = h_0(t)\exp(\beta X_i)$, where $h_0(t)$ is the baseline hazard. Even though this baseline hazard can take a number of different forms, the covariates enter the model linearly (through the exponent). To learn the β parameters we use the method of maximum partial likelihood estimation, developed in the same paper in which the model was introduced. At a high level, we look at the number of events at each timestep and use this to calculate the probability of the event occurring to the $i$-th individual; maximizing the log of this partial likelihood allows us to estimate β.
Since the values of some of our features change each month, we have what are known as time-dependent covariates. As in the regression model, we handle this by creating a new data point each time the features change (in our case, every month), treating each of these data points as right censored and left truncated. Once we have learned the parameters of the model, we still need to generate a measurement that we can compare directly to the logistic regression model. To generate a comparable prediction, we predict the expected number of events for a fixed follow-up time (1 month, or 30 days, where each timestep in our model is one day) for each of the entries in the test set. Again using five-fold cross-validation, the set of Cox-PH models, using only 25 features (we found that we had to remove seven of the features from this model, or it failed to converge), gave an average likelihood ratio of 1466.51, outperforming the logistic regression model.
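A minimal sketch of fitting such a model with the lifelines library is shown below; the DataFrame layout, the column names, and the use of the 30-day cumulative hazard as the expected number of events are our assumptions, not the authors' implementation (which handles the time-dependent covariates via left truncation and right censoring).

```python
# Minimal Cox proportional-hazards sketch using lifelines (assumed data layout:
# one row per household-month with 'duration' in days, 'event' = 1 if the
# household adopted in that interval, plus feature columns).
import pandas as pd
from lifelines import CoxPHFitter

def fit_cox(df: pd.DataFrame) -> CoxPHFitter:
    cph = CoxPHFitter()
    cph.fit(df, duration_col="duration", event_col="event")
    return cph

def expected_events_30d(cph: CoxPHFitter, covariates: pd.DataFrame) -> pd.Series:
    # Approximate the expected number of events over a 30-day follow-up by the
    # cumulative hazard at t = 30 for each row (an illustrative assumption).
    return cph.predict_cumulative_hazard(covariates, times=[30]).iloc[0]
```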
Finally, to compare with the stochastic social influence model of adoption behavior, we compared these two models to both logistic regression and Cox-PH models that used only two of the features that appeared in the full model:
the number of adopters in the same zip code as the potential adopter and an estimate of net present value (NPV). Our NPV estimate took into account both the cost of installing solar and the expected reduction in electricity costs, using our simple linear model of system size. This limited regression model outperformed the null model by 267.19, while the limited Cox-PH model outperformed the null model by 617.11. This suggests to us that, while these two features are important, there is significant value in the other features in the model.
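For illustration only, a simple NPV calculation of the kind described might look like the following; the discount rate, horizon, and function name are hypothetical assumptions rather than values or code used in the paper.

```python
# Hypothetical sketch of a net-present-value estimate for a PV installation:
# upfront cost (net of incentive) versus discounted future electricity savings.
def npv_estimate(system_cost: float, incentive: float, monthly_savings: float,
                 annual_discount_rate: float = 0.05, years: int = 20) -> float:
    """NPV = -(cost - incentive) + sum of discounted monthly savings.
    The 5% rate and 20-year horizon are illustrative assumptions."""
    monthly_rate = annual_discount_rate / 12.0
    npv = -(system_cost - incentive)
    for month in range(1, years * 12 + 1):
        npv += monthly_savings / ((1.0 + monthly_rate) ** month)
    return npv

# Example: a $20,000 system with a $2,500 incentive saving $120/month
# print(npv_estimate(20000, 2500, 120))
```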
Feature Analysis
To further explore the importance of each feature, we performed three types of sensitivity analysis on these data. First, we used a standard ANOVA (analysis of variance) technique: starting with an empty model, we added the features one at a time, calculating the optimal model parameters for each successive model. The top ten features identified with this technique appear in Figure 1. However, this technique has several drawbacks. First, it is not evaluated on a held-out test set, so it risks overfitting. Second, since the order in which variables are added is arbitrary, whenever two features overlap in their predictive power, whichever appears first receives a disproportionate amount of the "credit".
Figure 1: Top ten features from ANOVA.
To deal with these drawbacks, we ran two further experiments, both evaluated using the same five-fold cross-validation as above: training a model with each set and evaluating each of these models against the other four sets. First, we looked at each feature individually, training a model that used only that feature and comparing it to the null model. The top ten features identified by this procedure appear in Figure 2.
Figure 2: Top ten features from individual feature selection.
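A compact sketch of this single-feature screening (again, not the authors' code) could reuse a cross-validated likelihood-ratio scorer like the one sketched earlier; the cv_likelihood_ratio argument and its behavior on a feature subset are assumed.

```python
# Rank features by the cross-validated likelihood ratio of a one-feature model
# versus the null model (assumes a scoring function over feature subsets).
from typing import Callable, Iterable, List, Tuple

def rank_single_features(feature_names: Iterable[str],
                         cv_likelihood_ratio: Callable[[List[str]], float],
                         top_k: int = 10) -> List[Tuple[str, float]]:
    scores = {f: cv_likelihood_ratio([f]) for f in feature_names}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```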
Second, we used a procedure that iteratively removed the least predictive feature in each round, starting with the full model. This was a greedy procedure: we determined the least predictive feature by comparing the set of models that resulted from individually removing each of the current model's features (if we were considering a model with 20 features, we would compare 20 different models of 19 features each). Pseudocode for this procedure appears in Algorithm 1. The top ten features identified by this procedure appear in Figure 3.

current_features = full model features;
while current_features ≠ ∅ do
    min_feature = arg max over feature of LogisticRegression(current_features \ feature);
    current_features = current_features \ min_feature;
end
Algorithm 1: Greedy procedure

Figure 3: Top ten features from greedy feature selection.
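A sketch of this greedy backward elimination in Python is given below (not the authors' code); it assumes a cv_likelihood_ratio(feature_subset) scoring function, for example the cross-validated likelihood-ratio statistic sketched earlier, and records the removal order as a rough feature ranking.

```python
# Greedy backward feature elimination (sketch). Assumes a scoring function
# cv_likelihood_ratio(feature_subset) that returns the cross-validated
# likelihood-ratio statistic of a model restricted to that subset.
from typing import Callable, List

def greedy_elimination(all_features: List[str],
                       cv_likelihood_ratio: Callable[[List[str]], float]) -> List[str]:
    """Return features in the order they were removed (least predictive first)."""
    current = list(all_features)
    removal_order = []
    while len(current) > 1:
        # Removing the least predictive feature leaves the best-scoring model.
        min_feature = max(
            current,
            key=lambda f: cv_likelihood_ratio([g for g in current if g != f]))
        current.remove(min_feature)
        removal_order.append(min_feature)
    removal_order.extend(current)  # the last surviving feature is the most predictive
    return removal_order
```

The last features removed correspond to the most predictive ones, i.e., the kind of top-ten list reported in Figure 3.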
Immediately salient in these results is the fact that several features appear in more than one of these lists:
• Demographics: owner occupation, number of bathrooms, pool, total livable square feet, and acreage.
• Economic and consumption: national unemployment rate and average family income.
• Peer effects: number of installations in the zip code.
• Predicted system characteristics: cost per watt.
Conclusion
While our results agree with previous work showing that peer effects are important in predicting the adoption of PV, our experiments highlight that they are only one of a number of factors that appear to be predictive of PV adoption. In addition, we found it interesting that even though our metric measured accuracy on a discrete prediction, the continuous (Cox-PH) model outperformed the discrete logistic regression model.
Acknowledgements
This work was funded by the DOE SunShot Initiative under the Solar Energy Evolution and Diffusion Studies (SEEDS) program (Agreement 26153). Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.
References
Bollinger, B., and Gillingham, K. 2012. Peer effects in the diffusion of solar photovoltaic panels. Marketing Science 31(6):900-912.
Cox, D., and Oakes, D. 1984. Analysis of Survival Data. Chapman & Hall/CRC Monographs on Statistics & Applied Probability. Taylor & Francis.
Friedman, B.; Ardani, K.; Feldman, D.; Citron, R.; and Margolis, R. 2013. Benchmarking non-hardware balance-of-system (soft) costs for U.S. photovoltaic systems, using a bottom-up approach and installer survey, second edition. NREL/TP-6A20-60412, National Renewable Energy Laboratory.
Gromet, D. M.; Kunreuther, H.; and Larrick, R. P. 2013. Political ideology affects energy-efficiency attitudes and choices. Proceedings of the National Academy of Sciences 201218453. PMID: 23630266.
Rai, V., and McAndrews, K. 2012. Decision-making and behavior change in residential adopters of solar PV. In Proceedings of the Behavior, Energy and Climate Change Conference.
Rai, V., and Robinson, S. A. 2013. Effective information channels for reducing costs of environmentally-friendly technologies: evidence from residential PV markets. Environmental Research Letters 8(1):014044.
Rogers, E. M. 2003. Diffusion of Innovations. Free Press, fifth edition.
SunShot Initiative. http://energy.gov/eere/sunshot/sunshot-initiative.
U.S. Department of Energy. 2011. Strategic plan. Technical Report DOE/CF-0067, U.S. Department of Energy.
Full Listing of Features
Below is a listing of all features considered in the full models, along with whether each feature is binary (B), numeric (N), or a discrete indicator (D).
Demographic
• Is the property occupied by the owner? (B)
• Total property value, including both land and building values. (N)
• Number of bedrooms in the property. (N)
• Number of bathrooms in the property. (N)
• Does the property include a pool? (B)
• Does the property span at least 0.25 acres of land (acreage)? (B)
• Total livable square feet of the property. (N)
• Is the property designated as having a particularly pleasant (valuable) view? (B)
Peer effects
• The number of completed PV installations, by the current month, in the zip code the house belongs to. (N)
• The fraction of houses that have completed PV installations in the zip code the house belongs to. (N)
• The number of installations within a one-mile circle centered on the house at the start of the month. (N)
• The number of installations within a one-mile circle centered on the house at the start of the previous month. (N)
• The fraction of installations within a one-mile circle centered on the house at the start of the previous month. (N)
• The number of installations within a quarter-mile circle centered on the house at the start of the month. (N)
• Were there at least three completed installations within a quarter mile of the property by the start of this month? (B)
• Are there at least three households within a quarter mile of the property that have at least started an application with CCSE by the start of the month? This also includes those who have advanced beyond this step, provided they have not canceled. (At least 3 applications within 0.25 miles.) (B)
• Are there at least three households within a quarter mile of the property that have at least started an application with CCSE by the start of the previous month? This also includes those who have advanced beyond this step, provided they have not canceled. (B)
Economic and Consumption
• The baseline KWH for that zip code for that month, determined by climate zone by San Diego Power and Gas. (N)
• Average KWH consumption for that zip code for that month. (N)
• If the zip code's average KWH is higher than 1.3 times the baseline charging rate, the amount that would fall in tiers 3 and 4 for that month; otherwise 0. (N)
• Average family income for the tract this household falls in. (N)
• Median household income for the tract this household falls in. (N)
• Average household income for the tract this household falls in. (N)
• Indicator variable on income, ranging from 1 to 5 for each tract. (D)
• Fraction of households in the tract this household falls in that have incomes over the poverty level. (N)
• Change in US GDP for this month. (N)
• Published national unemployment rate for this month. (N)
System Characteristics (Predicted)
• Predicted cost per KW if this household were to install PV. (N)
• The incentive step this installation would be covered by (assuming the application was made in this month). (N)
• Amount of incentive per KW (determined by incentive step). (N)
• Predicted monthly electricity cost for the household. (N)
• Predicted monthly electricity cost after installing solar for the household. (N)
• Predicted fraction of power costs that would be covered by the solar installation. (N)
• Net present value, taking into account the estimated savings, cost, and incentive of the installed system. (N)