The RAND Corporation is a nonprofit institution that helps improve policy and decisionmaking through research and analysis. This electronic document was made available from www.rand.org as a public service of the RAND Corporation.

Limited Electronic Distribution Rights

This document and trademark(s) contained herein are protected by law as indicated in a notice appearing later in this work. This electronic representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of RAND electronic documents to a non-RAND website is prohibited. RAND electronic documents are protected under copyright law. Permission is required from RAND to reproduce, or reuse in another form, any of our research documents for commercial use. For information on reprint and linking permissions, please see RAND Permissions.

This product is part of the Pardee RAND Graduate School (PRGS) dissertation series. PRGS dissertations are produced by graduate fellows of the Pardee RAND Graduate School, the world’s leading producer of Ph.D.’s in policy analysis. The dissertation has been supervised, reviewed, and approved by the graduate fellow’s faculty committee.
Policy Impacts on Wind and Solar Innovation
New Results Based on Article Counts

Eileen Hlavka

PARDEE RAND GRADUATE SCHOOL

This document was submitted as a dissertation in April 2013 in partial fulfillment of the requirements of the doctoral degree in public policy analysis at the Pardee RAND Graduate School. The faculty committee that supervised and approved the dissertation consisted of Siddhartha Dalal (Chair), Nicholas Burger, and Robert Lempert.

The Pardee RAND Graduate School dissertation series reproduces dissertations that have been approved by the student’s dissertation committee.

RAND’s publications do not necessarily reflect the opinions of its research clients and sponsors. R® is a registered trademark.

Permission is given to duplicate this document for personal use only, as long as it is unaltered and complete. Copies may not be duplicated for commercial purposes. Unauthorized posting of RAND documents to a non-RAND website is prohibited. RAND documents are protected under copyright law. For information on reprint and linking permissions, please visit the RAND permissions page (http://www.rand.org/publications/permissions.html).

Published 2013 by the RAND Corporation
1776 Main Street, P.O. Box 2138, Santa Monica, CA 90407-2138
1200 South Hayes Street, Arlington, VA 22202-5050
4570 Fifth Avenue, Suite 600, Pittsburgh, PA 15213-2665
RAND URL: http://www.rand.org
To order RAND documents or to obtain additional information, contact Distribution Services: Telephone: (310) 451-7002; Fax: (310) 451-6915; Email: order@rand.org

Abstract

Predicting the effects of climate policies on energy use and the economy requires understanding how they will affect innovation. Yet little empirical research exists in this area.
This study helps fill the gap, using the number of relevant academic journal articles published per month as a proxy for innovation in wind and solar energy. Tens of thousands of articles are counted using Bayesian logistic classification methods. The first of three essays finds that solar and wind innovation increase with U.S. research and renewable energy production subsidies. Production subsidies are represented by the Production Tax Credit and Investment Tax Credit for renewable energy, taken together, whose effect on innovation has not been measured before. Wind and solar patents give similar results with respect to research subsidies but are too coarse a measure to identify tax credit effects, as described in the second essay. The third and final essay identifies articles on monocrystalline silicon and thin film solar panels, the two main types of solar energy research. Together, these essays provide new methods for producing article count time series; new data describing solar and wind innovation; parameters enabling future climate policy models to incorporate effects on innovation; and results suggesting direct and indirect U.S. policies have encouraged solar and wind energy research.

Table of Contents

Abstract
Acknowledgments
Summary
Essay 1. Federal Policy Effects on Solar and Wind Research Output: Using a New Measure to Inform Renewable Energy Policy
Essay 2. Using Articles Rather than Patents to Quantify Research Over Time: An Example Identifying Policy Effects on Wind and Solar Energy Research
Essay 3. A Thousand Thin Film Flowers Perennially Blooming: Article Counts as a Method of Comparing Monocrystalline Silicon and Thin Film Solar Research and Examining the Influence of Public Policy

Acknowledgments

Many thanks are due to my dissertation chair Siddhartha Dalal, dissertation committee members Nicholas Burger and Robert Lempert, and outside reader Daniel Kammen. Siddhartha Dalal and Nicholas Burger were particularly involved even before this research concept became a dissertation proposal, providing guidance from statistical and economic perspectives. Robert Lempert and Daniel Kammen contributed additional advice often relating to the energy and environment policy context. Aviva Litovitz and Evan Bloom helped code the journal article abstracts. Aimee Curtright, Constantine Samaras, Adam Gailey, Christopher Sharon, Amber Jaycocks, Ethan Scherer, Stephanie Chan, Anita Szafran and many others provided helpful suggestions along the way. Any remaining errors in fact or judgment are my own.

This research was conducted with the generous support of the Cazier Sustainability Dissertation Fellowship and developed under STAR Fellowship Assistance Agreement no. FP1711801‐0 awarded by the U.S. Environmental Protection Agency (EPA). It has not been formally reviewed by EPA. The views expressed in this paper are solely those of the author, and EPA does not endorse any products or commercial services mentioned in this paper.

Summary

The extent to which technological change will help mitigate climate change is a subject of some debate among climate policy researchers and stakeholders.
In particular, models predicting the costs and effectiveness of climate policies rarely include the effects of technological change because it is difficult to predict how much change will occur. This study takes an empirical approach to filling this gap in the case of solar and wind energy research. In order to do so, a new set of data is collected to serve as a proxy for technological change, focusing on research production: the number of technical journal articles on solar or wind energy published each month. A combination of hand‐sorting and Bayesian logistic modeling is used to identify these articles, which number in the tens of thousands. The resulting monthly article counts are used to assess how solar and wind research production may have responded to major U.S. renewable energy policies. The first essay focuses on these empirical findings, the second on methodological comparisons, and the third on subcategories of solar energy and their potential relationships with policy. Both direct and indirect subsidies are found to be associated with increases in research production, as described in the first essay. Indirect subsidies are represented by the largest U.S. federal support for renewable energy, the Production Tax Credit, considered in combination with the renewable energy Investment Tax Credit. Most of the value of these credits is used for wind, although they are also available for solar. These credits are widely considered to drive installations of wind turbines, but their effects on research were previously unmeasured. Increasing the tax credits by $20 million per year (1% of their 2008 value) is found to be associated with a 1% increase in solar article counts and positive but statistically insignificant effects on wind article counts. To the author’s knowledge, this is the first study to successfully quantify these tax credits’ relationship with innovation. Also investigated is the impact of federal solar or wind research funding, i.e. 
direct subsidies for research. For every $1 million spent on solar or wind research, solar or wind article counts increase by 1‐2%. Although the impact of research funding on research is intuitive, this result contradicts the hypothesis that public funds might simply displace private research funding with no net increase in research. These findings can be used as inputs to climate policy models in order to calibrate the effects of renewable energy production subsidies and research subsidies on solar and wind energy research.

The second essay focuses on methods of measuring innovation and how they affect the results. Three measures of innovation (article counts from the first essay, article counts based on keyword selection, and patents) are constructed and subjected to the same analysis as in the first essay. Patents, selected either by patent class or by keyword search, are the leading measure of innovation for economic and policy analyses. Similarly, keyword selection is the usual method of finding and counting articles, which has been done in many other contexts including examination of topical trends within wind and solar energy research. Testing against a hand‐sorted sample of articles indicates that Bayesian regression and keyword selection are about equally successful at identifying relevant articles, although the assessment methods used favor keyword selection. Bayesian article selection identifies more articles and may be more reliable, but keyword selection is faster, and in this case both types of article counts give substantively analogous results when regressed on tax credits and subsidies. These findings suggest that it is at least sometimes possible to choose keywords which work approximately as well as the Bayesian regression models used. Thus, reasons for selecting one method over the other in the future may be case‐specific.
For both wind and solar energy, patents’ associations with research funding are very similar to the associations for articles, while their associations with the tax credits are not statistically significantly different from zero or from article count results. Patents’ failure to find an association with tax credits when article counts do find such a result may be caused in part by patents being far fewer in number than articles. These findings can be interpreted to suggest that article counts are as valid a measure of innovation impacts as patent counts, and are more effective at measuring small effects. Further suggestions for future research methods are discussed.

Again using Bayesian regression and keyword approaches, the third essay investigates trends over time in the two main subcategories of solar energy research: monocrystalline silicon and thin film. While demonstrating the methods with more challenging article topics, this study also offers preliminary findings on whether public policies have affected the distribution of research effort between monocrystalline silicon and thin film. Monocrystalline silicon, known also as the “first generation” of solar cells, is the technology used in most commercially available silicon cells to date. Thin films are a competing technology which, since they use thinner materials, are often less efficient but potentially cheaper. They are known as the “second generation” of solar technologies. Predictions that thin films imminently will dominate the market date to at least 1985. This study finds that thin film articles collected by either method have far outnumbered monocrystalline articles consistently throughout 1985‐2010, the entire time period considered. U.S. research subsidies appear to have potentially favored monocrystalline silicon over thin film research in the more applied database, demonstrating that research subsidies can be used to steer net solar research.
The ratio of monocrystalline silicon to thin film research appears to have been otherwise unaffected by the U.S. tax credits and research subsidies, consistent with the philosophy that policies should avoid being technology‐specific.

Taken together, these essays provide both substantive and methodological results. Firstly, they demonstrate the substantial impacts that public policies can have on the volume of solar and wind innovation, even when innovation is not the policy’s direct target. They provide detailed results on the relative sizes of these impacts, which can be used as inputs to subsequent analyses. Simultaneously, they show that hand‐sorting followed by Bayesian regression is an effective way to count large numbers of articles for the purposes of social science research, although carefully chosen keywords may sometimes perform similarly well. The solar and wind article counts thus created are also suitable for many more studies than could be conducted here. This dissertation provides new solar and wind article count data, the methods used to create it, and information on how direct and indirect subsidies can drive solar and wind research.

Essay 1. Federal Policy Effects on Solar and Wind Research Output: Using a New Measure to Inform Renewable Energy Policy

Abstract

In the U.S., the largest federal renewable energy subsidy throughout recent decades has been a pair of tax credits for renewable energy producers known separately as the Production Tax Credit and Investment Tax Credit, or together as the New Technology Credits. The U.S. government, through the Department of Energy and other agencies, also provides subsidies for renewable energy research. This essay compares the effectiveness of these two policy types in encouraging research on wind and solar technologies, using a new measure of renewable energy research quantity: the number of technical articles on solar or wind energy published each month in each of two journal article databases.
A Bayesian logistic model is applied to over 200,000 abstracts from two databases to construct four time series of solar and wind article counts. Using these as a proxy for research output suggests New Technology Credit outlays have significant positive effects on applied solar research. Research subsidies have stronger significant effects on both wind and solar applied research, and possibly on less‐applied wind research. The New Technology Credits’ substantial effect on solar research represents a seldom measured effect of the tax credits and suggests the potential of renewable energy price supports to foster innovation.

Keywords

renewable energy, innovation, Production Tax Credit, Investment Tax Credit

Research Highlights

- The New Technology Credits appear to stimulate applied solar energy research.
- The New Technology Credits’ relationship with wind research is more ambiguous.
- Applied research on wind or solar energy increases in response to research funding.
- A new measure of wind and solar research is built using 200,000+ article abstracts.
- Estimates provided can be used to calibrate models of renewable energy policies.

1. Introduction

In the decade ending in 2009, the U.S. government funded renewable energy research at an average of $122 million annually for solar and $40 million annually for wind, culminating in $254 million and $46 million respectively in 2009 (OECD/IEA, 2010). These dollars support research on wind power systems, solar heating and cooling, solar thermal electricity, and primarily photovoltaic cells. Federal funds also support renewable energy in ways which may incentivize research indirectly, by far the largest of which is the Production Tax Credit.
First passed in 1992 (Karen Palmer et al., 2011), the Production Tax Credit is a per‐kilowatt‐hour tax credit for renewable electricity production, totaling an average expenditure of $332 million in 2000‐2009, reaching $430 million in 2009 and more than doubling in 2010 (Office of Management and Budget, For each of Fiscal Years 1996‐2011). Of the Production Tax Credit dollars, over 90% go to wind energy (2007 data, Energy Information Administration, 2008). The Production Tax Credit has historically been reported together with the Investment Tax Credit, a tax credit for investment in solar, biomass and certain other renewable energy facilities first offered in 1978 and thus among the older renewable energy subsidies available today (Energy Information Administration, 2008). Together, they are referred to as the New Technology Credits. Although they do not fund research directly, these tax credits may provide an incentive for research by signaling that there is a large potential return if the research is successful.

How much do these federal investments affect research output? The environmental economics consensus suggests both that government support for renewable energy research is desirable and that it will be effective (Gregory F. Nemet, 2009). Economic theory suggests that society would benefit from more renewable energy research than the market will provide, because the researchers will not be able to capture the full environmental and economic benefits of their discoveries (Adam Jaffe, Richard Newell and Robert Stavins, 2005, Vicki Norberg‐Bohm, 2000). Thus, government funding may stimulate research above and beyond what would be produced by the private market. On the other hand, private research funding theoretically could be crowded out as its providers allow the government to take their place, in which case federal subsidies would not produce a net increase in research.
Similarly, economic reasoning suggests that tax credits for renewable energy may encourage relevant innovation by increasing the value of successful innovation. Because tax credits increase the overall demand for renewable energy, their effect on innovation is not subject to crowding out the way research funding is. However, insofar as the intellectual property stimulated by tax credits cannot be fully protected by its originators, researchers may still produce less renewable energy research than would be desirable for society. Taken together, these theories imply that the relative size of renewable energy research funding’s and New Technology Credits’ impacts on research cannot be definitively predicted by theory. The most straightforward hypothesis is that renewable energy research will increase in response to research funding and, to a lesser extent, in response to the New Technology Credits, but this is not the only possibility. Empirical work has just begun to assess the relationship between renewable energy subsidies and research, mostly focusing on the effects of research subsidies. For energy broadly, including fossil fuels, U.S. patents have risen and fallen with total public and private energy research spending (Robert Margolis and Daniel Kammen, 1999). Wind and solar energy patents in particular have followed the trends in public wind and solar research expenditures (Gregory F. Nemet and Daniel M. Kammen, 2007), reflecting the relative importance of public funding in this area. Detailed qualitative examination of wind and photovoltaic technologies similarly emphasizes the importance of federal research support, identifying many specific technological developments that relied upon federal research funding (Vicki Norberg‐Bohm, 2000). 
In Europe, the size of this impact has been calculated using patents as the measure of innovation and comparing across 25 countries over time, finding that a government’s research funding has significant effects on both wind and solar patents as well as geothermal (Nick Johnstone et al., 2009). The effect on wind is over four times as strong as the effect on solar (Nick Johnstone, Ivan Haščič and David Popp, 2009). In the U.S. context, effect sizes previously calculated for renewable energy in general are small: David Popp finds $55 or $100 million is required to induce a single renewable energy U.S. patent (2002). However, this may be caused by including technology types that are less affected by public support, as the author implies in later work (David Popp, 2003). The results reported here suggest stronger effects than this for wind and solar technology.[i]

Little or no research has directly examined the New Technology Credits’ effects on innovation. The Production Tax Credit’s direct effect is on renewable electricity installations, and appears to be substantial at least for wind (Lori Bird et al., 2005, Joanna Lewis and Ryan Wiser, 2005). The volatility of this tax credit, due to its repeatedly being allowed to expire and the uncertainty of whether it will then be renewed, also reduces its effectiveness in encouraging renewable energy production (Merrill Jones Barradale, 2010). The effect on research, if any, is indirect and has received less attention. Related work finds that higher energy prices lead to more renewable energy patents (David Popp, 2002), implying that a tax credit would also increase renewable energy innovation. However, patents for wind energy did not appear to increase in response to California wind installation incentives in the 1980s, perhaps because the wind industry simultaneously reduced the types of technologies it was researching for commercialization (Gregory F. Nemet, 2009).
When examining simply the presence or absence of a renewable energy tax credit, the European patent research cited earlier also finds no statistically significant effect on wind research, solar research, or research in other renewable energy types considered, nor on the total across types (Nick Johnstone, Ivan Haščič and David Popp, 2009). In contrast, the present work is able to quantify the pair of tax credits it examines and proceeds to find that the New Technology Credits have significant and sizeable effects on solar innovation.

[i] For example, the present findings would predict that a $100 million increase in solar research funding would result in a whopping 85% increase in solar research articles in the more applied of the two databases considered. While this result is extrapolating beyond the plausible limits of the analysis, it shows that the current research finds relatively strong effects of federal research funding.

In the absence of estimates of the size of subsidy effects on research, models of climate policy have proceeded using various assumptions. In particular, several papers have modeled how innovation may respond to a greenhouse gas cap‐and‐trade system (Carolyn Fischer and Richard Newell, 2008, Lawrence H. Goulder and Stephen H. Schneider, 1999, Richard G. Newell et al., 1999, William D. Nordhaus, 2002, Karen Palmer, Anthony Paul, Matt Woerman and Daniel C. Steinberg, 2011). Technological development is represented as resulting from R&D expenditures, accumulated experience (depending on an “experience curve”), or a combination of both, with models typically concluding that technological change can substantially reduce the costs of climate policies. While the most detailed of these models use simulations to characterize the relative magnitude of their findings, their specific conclusions often depend on strong simplifying assumptions and assumed parameter values.
Recognizing this problem, an alternative, empirical approach compares a variety of market characteristics as potential causes of cost decreases in solar photovoltaic cells without parametrizing innovation itself (Gregory F. Nemet, 2006). Such empiricism could be combined with theoretical models and used to predict how different subsidy combinations would impact multiple renewable energy policy outcomes, including research, if only the necessary empirically‐based parameters were available. This research seeks to provide some of the necessary parameters, namely, the effects of research subsidies and the New Technology Credits on solar and wind research.

Of all renewable electricity sources, solar and wind are the subject of this research for two reasons. First, they are often considered the greenest of energy sources. Sunshine and wind are perhaps the most widely available of renewable resources, and are subject to less environmental criticism than nuclear power or large hydroelectric dams, although the siting of large solar power plants and wind turbines can also be controversial. Moreover, solar and wind provide a significant contrast in terms of their technological development and resulting costs. Wind is a relatively mature technology, having been used for hundreds of years and with little current expectation of a breakthrough that will revolutionize the industry by dramatically increasing efficiency or reducing costs. Solar researchers, in contrast, are searching for such breakthroughs, especially in the area of thin film photovoltaic materials. If these efforts eventually succeed in sufficiently reducing the cost of solar power, the amount of solar power produced may then increase to rival the currently much more widespread, much cheaper wind installations. Examining both wind and solar makes it possible to see whether and how mature and emerging renewable energy research areas differ in their responses to subsidies.

2. Methodological Background

This essay uses the number of research articles published each month on wind and solar technology as a measure of the research output in these areas and thus a proxy for innovation. Prior investigation of renewable innovation and public policies has focused on patent counts, to which article counts provide an informative counterpoint. While patents are readily available from patent offices and well categorized far back in time, articles are available in academic research databases but inconsistently categorized, if at all. The ambiguity of keywords such as “solar,” which could relate to photovoltaics or to astronomical activity, means that classifying articles effectively requires substantial effort. This study overcomes this difficulty by using a combination of hand‐classification and Bayesian logistic modeling to classify hundreds of thousands of abstracts.

Both patent counts and article counts are meaningful signposts along the idea‐to‐product research timeline (for example, Zvi Griliches, 1990). For a classic review of the strengths and weaknesses of patents as measures of research output, see Pavitt (1985). Typically only successful patent applications are included in patent count analysis in order to exclude research which did not meet the standards of the patenting office. Patents are presumed to meet such standards and be at least as valuable as the cost of their acquisition in the eyes of their authors and funders. A similar presumption of value holds for articles, based instead on what authors wish to publicize and journal editors believe will or should interest their audiences. However, neither patents nor articles is a complete record of valuable innovation because researchers, especially in the private sector, sometimes opt to keep proprietary information secret by neither patenting nor publishing it. In addition, variations in quality among patents or articles mean that total counts may not be proportional to total value.
Article counts may be closer to proportional to research value or interest in a topic than patent counts because multiple, similar articles on the same topic are likely while multiple overlapping patents are not. Patents and article counts are also closely connected, with some articles citing patents and vice versa (R. Tussen et al., 2000). Historical data for the private sector show chemistry article counts are highly correlated with chemistry patent counts, and physics article counts with information and electronics patents (Benoît Godin, 1996). In addition to renewable energy and engineering journals, chemistry and physics journals are among the journals which published the most articles counted in the present study. Although the relationship between patents and articles may have varied over time with changes in private and public incentives for each, correlations between the two are likely to remain strong.

Article counts are an increasingly common measure of solar and other energy research, but so far they have rarely been used to assess the policy and economic drivers of broad research areas. Instead, they are typically used to compare among countries to identify which are conducting the most research (Éric Archambault et al., 2009 for energy in general, K.C. Garg and Praveen Sharma, 1991, Bikramjit Sinha, 2011) or to find patterns within a set of articles (Katarina Larsen, 2008, Thomas D. IV Perry et al., 2011, Guo Ying et al., 2009). Keyword searches form the basis of these methods with little or no subsequent processing to add or remove articles; perhaps the most sophisticated of them uses an iterative process of selecting relevant articles and using them to generate potential keywords (Éric Archambault, Julie Caruso, Grégoire Côté and Vincent Larivière, 2009).
The keyword approach is useful but prone to two kinds of error. On the one hand, it includes irrelevant articles which happen to contain the keywords in metaphorical use, as minor content, or in their irrelevant homonymous meanings. On the other hand, it excludes relevant articles whose authors used synonyms in place of the necessary keywords or assumed that the reader would identify the topic from context. The approach used here minimizes this problem by selecting and weighting keywords based on their effectiveness at identifying article relevance, using a semi‐automated method. Keywords chosen by the analyst are used only as a first step, with a few broad terms identifying a very large set of potentially relevant article abstracts. A random subset of these abstracts is then sorted by hand for relevance or irrelevance.

Hand categorization of pieces of text has a long history in anthropology and more recent use in medical and other fields, where it is known as “content analysis” or simply “qualitative analysis” of text. Sample texts, goals of the project and background knowledge are used to decide upon several mutually exclusive categories into which the texts will be sorted. The categories are clearly defined in a written “codebook,” which “readers” then use to sort all the texts into the categories (G.W. Ryan and H.R. Bernard, 2000). Hand‐coding can describe articles in greater detail and with subtler categories than purely automated methods, as in “systematic reviews” of the evidence for a medical question, or in Stephens, Wilson et al.’s recent analysis of wind energy news coverage (2009), which was one of the inspirations for this work.

The words which appear in the hand‐sorted abstracts are then used as the input data to build a Bayesian logistic model. The large number of words would strain the use of a traditional OLS function and likely lead to overfitting and thus inaccurate predictions for abstracts which were not hand‐sorted.
Instead, Bayesian logistic and latent semantic analysis methods are more feasible and can minimize the overfitting problem. An active and growing literature outside of social science discusses the merits of document analysis using these two methods, motivated in large part by web search applications (for example, Scott Deerwester et al., 1990, Alexander Genkin et al., 2007). The basic approach for either conceptualizes each article as a “bag of words” (e.g. David Lewis, 1998). In this model, to write an article of a given type is to randomly select a set of words which will make up that article, using probabilities of selection which depend on the desired article type. A model or classifier is constructed which uses articles of known types to identify what these probabilities are. This model typically employs a functional form that reduces the dimensionality of the problem, such as LASSO in the case of Bayesian approaches or singular value decomposition in the case of latent semantic analysis. The resulting model is then applied to new texts in order to predict their types. Similar methods have been used recently in medical and terrorism‐related research, employing and comparing gradient boosting, support vector machines and LASSO models similar to the model used here as the functional forms (Siddhartha R. Dalal et al., 2012, R. Mason et al., 2012). Debate continues as to which model specification is optimal, but some recent research suggests that a Bayesian logistic model can be as efficient and accurate as other “state‐of‐the‐art” classifiers (Alexander Genkin, David D. Lewis and David Madigan, 2007). While the Bayesian logistic model assumes the independent selection of each word, and the independence of article “type” from article length, its effectiveness typically does not require these assumptions to be true (Alexander Genkin, David D. Lewis and David Madigan, 2007). 
The logistic form with a Laplace prior distribution minimizes the number of predictors used, which is necessary to avoid overfitting the model when there are either more predictors than observations or a large number of predictors compared to observations, as in text data. By minimizing the number of predictors, such methods both predict well and are efficient to create and use, which is particularly important for large datasets such as the word lists used here.
Bayesian classification approaches are still uncommon in the social science literature, despite having accumulated some research history in other fields (the title of David Lewis, 1998 refers to “Naïve (Bayes) at 40” years of age). Thus, they provide a good opportunity for research cross‐pollination. In a rare social science example, a similar text classification idea is used in a recent methodological paper in the American Journal of Political Science (Daniel J. Hopkins and Gary King, 2010). Hopkins and King’s model reduces the problem’s dimensionality by a different method (randomly selecting words and then applying OLS regression) but similarly uses hand‐coding and a model to classify articles of interest. The social sciences may benefit greatly from increased application of such models to document analysis and from research on precisely which models are most effective for which applications.
After the article counts are produced, the analysis here proceeds to regress the counts on subsidies and other covariates using a Poisson regression model with a logarithmic link. This is a natural form for counts of events, such as articles being published. Overdispersion affects the standard errors of model coefficients but not their mean values, so standard errors are calculated using bootstrapping. A lag time is allowed between the time period of the regressors and the article counts, representing the amount of time it takes to produce a research article.
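The regression step just described can be illustrated with a minimal sketch. It assumes a single regressor plus an intercept, a Newton solver, and a simple pairs bootstrap; the function names (`fit_poisson`, `lagged_pairs`, `bootstrap_se`) and the choice of bootstrap scheme are illustrative assumptions, not the dissertation's actual implementation.

```python
import math
import random
import statistics


def fit_poisson(x, y, iters=50):
    """Maximum-likelihood fit of log(E[y]) = b0 + b1 * x by Newton's method."""
    b0, b1 = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iters):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Gradient of the Poisson log-likelihood.
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Fisher information (negative Hessian), a symmetric 2x2 matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        if det == 0:  # regressor has no variation; slope is not identified
            break
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def lagged_pairs(x, y, lag):
    """Pair each count y[t] with the regressor value from `lag` periods earlier."""
    return list(x[:len(x) - lag]), list(y[lag:])


def bootstrap_se(x, y, n_boot=200, seed=0):
    """Pairs bootstrap (an assumed scheme): resample (x, y) rows and refit."""
    rng = random.Random(seed)
    n = len(x)
    fits = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        fits.append(fit_poisson([x[i] for i in idx], [y[i] for i in idx]))
    return (statistics.stdev([f[0] for f in fits]),
            statistics.stdev([f[1] for f in fits]))
```

Because the link is logarithmic, the slope `b1` reads as the approximate proportional change in expected monthly articles per unit change in the regressor. The bootstrap leaves the point estimates untouched and only replaces the model-based standard errors, which is why it suits overdispersed counts.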
While some previous research using patent counts has omitted this lag, both articles and patents clearly take time to create, and thus such a lag is necessary.
3. Methods
How do federal research subsidies and the New Technology Credits affect the rate of research on solar and wind energy technology? I start by defining a measure of research output: the number of relevant technical journal papers published within a month. Article abstracts are collected using keyword searches of two leading article databases, ISI Web of Science and Compendex Engineering Village, and the resulting abstracts’ probabilities of relevance are assessed using human‐guided Bayesian logistic classification. This produces monthly article counts, which are then regressed on solar‐ or wind‐targeted research subsidies, the total value of the New Technology Credits for renewable electricity producers, and other variables reflecting relevant economic conditions. The analysis is completed separately for wind and solar, and for each of the article databases, yielding four sets of regression results.
3.1 Counting Research Articles
The number of research articles on wind and solar technology published each month during 1995 through 2008 is estimated in several steps. First, article abstracts are drawn from a general academic article database and a database focused on engineering and applied sciences. Next, a sample of collected abstracts is sorted by hand into two categories, relevant and irrelevant. The words in these abstracts are used to build a Bayesian logistic model predicting the probability of an abstract’s relevance. Finally, this model is used to find the probability of relevance for each of the remaining abstracts. These probabilities are then summed by each article’s month of publication.
The steps to produce the models are conducted separately for solar and wind, and the models are applied separately to abstracts from each of the two databases for wind and for solar, yielding four time series.
Articles were identified using two databases: ISI Web of Science and Compendex Engineering Village II. Both databases consist primarily of peer‐reviewed article abstracts but each also includes some non‐peer‐reviewed trade journals. Many journals appear in both databases but each includes some journals that are not in the other. Web of Science has broad coverage of the sciences and humanities (over 10,000 journals), emphasizing the hard sciences. Its primary intended audience is academic (Thomson Reuters, 2010). Engineering Village, in contrast, includes roughly half as many journals (over 5,600) and focuses on more applied topics, especially chemical, electrical, mechanical, mining and civil engineering (Elsevier, 2010). These topical differences may help explain the different results found depending on which database is used. Both databases include only English‐language abstracts.
All abstracts matching the keywords “solar” or “wind” and published within 1995‐2008 were extracted from each database. After removing conference proceedings[ii] and abstracts that do not list a month of publication (a more common problem for older abstracts), this yields 81,718 Web of Science and 35,385 Engineering Village abstracts matching the keyword “solar” and 58,935 Web of Science and 36,198 Engineering Village abstracts matching the keyword “wind.” That is, the initial collection is over 200,000 abstracts, far more than could feasibly be sorted by hand. The majority of these abstracts discuss sun spots, plant growth, winding wires, bridges moving in the wind, and a variety of other topics irrelevant to solar and wind electricity production, so the relevance of keyword matches cannot be assumed.
[ii] This is accomplished by removing all abstracts with “Conference,” “Proceedings,” “Proceddings,” “Annual,” or “Symposium” appearing in the title. Conference proceedings are excluded because they create large spikes during certain months and are likely to repeat content found in other articles.
Table I. Numbers of Articles Extracted Using Keywords
Initial numbers of articles extracted from the databases Engineering Village and Web of Science using the keywords “solar” and “wind.” The relevance of each of these articles to solar energy or wind energy is assessed using the process described below. Many of these turn out to be irrelevant, covering topics such as sun spots.
                      Solar     Wind     Total
Engineering Village   35,385    36,198   71,583
Web of Science        81,718    58,935   140,653
Total                 117,103   95,133   212,236
Rather than sorting every abstract, a random sample of abstracts was sorted by hand to create the “training data” that is used to build the model of article relevance. A total of 3,000 abstracts were hand‐coded; the large number was chosen in order to include a sufficient number of relevant abstracts despite the predominance of irrelevant abstracts. For “solar,” a random set of 750 abstracts was sampled from the abstracts collected from each of the two databases. For “wind,” a random sample of 250 from each database was used, plus 500 each oversampled from selected journals likely to be relevant. These journals, listed in Appendix A, were chosen by the author for their apparent potential relevance from the journals represented in a sample of 10,000 abstracts from the given database. Oversampling was necessary because the noun “wind” is a homonym for the verb “wind” and thus produces a particularly low percentage of relevant abstracts among keyword matches.
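The filtering and sampling steps above can be sketched as follows. The record layout (a dict with `title` and `month` keys), the case‐insensitive matching, and the function names are assumptions made for illustration; only the list of title terms comes from the text.

```python
import random

# Title terms used to drop conference proceedings (from footnote ii);
# "proceddings" is kept because the text lists that spelling explicitly.
CONFERENCE_TERMS = ("conference", "proceedings", "proceddings",
                    "annual", "symposium")


def keep_abstract(record):
    """Keep a record only if its title has no conference term and a month is listed."""
    title = record["title"].lower()  # case-insensitive match is an assumption
    if any(term in title for term in CONFERENCE_TERMS):
        return False
    return record.get("month") is not None


def draw_training_sample(records, n, seed=0):
    """Draw up to n records at random (without replacement) for hand-coding."""
    kept = [r for r in records if keep_abstract(r)]
    return random.Random(seed).sample(kept, min(n, len(kept)))
```

Fixing the seed makes the hand‐coded sample reproducible, which matters when a second reader must be given exactly the same abstracts.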
Each of the training data abstracts was read by two trained graduate student readers who identified whether the abstract focused on research relevant to solar electricity production or was irrelevant, and similarly for wind abstracts.[iii] The readers followed a “codebook” which was created to codify what types of content are considered relevant or irrelevant. The definition of relevance focused on research aimed at informing the production of electricity using solar or wind, respectively, as the power source. Thus, for example, articles on the availability of sunlight or wind for energy were defined to be relevant, while passive solar architecture and windmills used to pump water were defined as irrelevant because they are not aimed at generating electricity. The codebook is described in further detail in Appendix B.
After reading the abstracts independently, the two readers agreed on the relevance of 91% of solar abstracts and 98% of wind abstracts. Abstracts which the readers disagreed upon typically had something to do with solar or wind power, but whether they were sufficiently relevant was less clear: for instance, an abstract on solar‐powered air conditioners (finally classified as irrelevant) or on day‐ahead predictions of wind speed, published in a journal on energy conversion (finally classified as relevant). All disagreements between the two readers were subsequently resolved by discussion.
The agreement between readers is typically measured by the $\kappa$ statistic, which ranges from zero if agreement between readers is purely random to one if there are no disagreements. $\kappa$ is designed to measure the difference between actual and random agreement between readers, and thus is preferable to using percentage agreement, which will be high if most of the documents are in one category.
However, $\kappa$ is also plagued by the tendency towards lower values if the texts are not evenly divided among categories (Barbara Di Eugenio and Michael Glass, 2004), and is thus an imperfect measure. I present it here since it is preferable to using percentage agreement, for lack of an obvious better measure and for ease of comparison with previous literature.
[iii] In my experience this hand-classification takes, very roughly, an average of one or two minutes per article.
$\kappa$ is defined as the difference between actual and random agreement between the readers, divided by one minus the expected agreement between readers. Random agreement is the expected percent agreement if each reader randomly decided whether each abstract was relevant, with a probability of relevance reflecting the proportion of documents which that reader labeled relevant. For instance, the first reader’s probability of relevance, $p_1$ in Equation 3, is simply the number of abstracts which the first reader classified as relevant divided by 1,500. Writing $P_A$ for actual agreement and $P_E$ for random (expected) agreement:
$$\kappa = \frac{P_A - P_E}{1 - P_E} \tag{1}$$
$$P_A = \frac{\text{number of abstracts on which the two readers agree}}{1{,}500} \tag{2}$$
$$P_E = p_1 p_2 + (1 - p_1)(1 - p_2) \tag{3}$$
For solar, $\kappa$ is 0.79, and for wind, 0.93, showing a high degree of agreement between readers. This occurs despite the texts being weighted somewhat towards one category and minimal training of the second reader.
Of the relevant solar abstracts, most discuss developments in the chemistry or construction of photovoltaic cells. Others address control systems or arrangements of such cells, aspects of concentrating solar power (where focused sunlight heats water or other fluids), or optimal locations for installing solar power. For wind, relevant abstracts discuss wind turbine engines, connections to the electrical grid, blade shape and aerodynamics, and optimal locations; and less commonly, windmill blade materials, turbine control software, public opinion and other relevant topics. Salt ponds were considered irrelevant to solar because they are more often used for heat than electricity generation.
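The reader‐agreement statistic discussed above can be computed in a few lines. This is a minimal sketch for the two‐reader, two‐category case, assuming each reader's decisions are stored as a list of 0/1 labels; the function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(labels_1, labels_2):
    """Agreement between two readers' 0/1 relevance labels, corrected for chance."""
    n = len(labels_1)
    # Actual agreement: share of abstracts given the same label by both readers.
    actual = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    # Each reader's own probability of labeling an abstract relevant.
    p1 = sum(labels_1) / n
    p2 = sum(labels_2) / n
    # Random agreement: both say relevant, or both say irrelevant, by chance.
    expected = p1 * p2 + (1 - p1) * (1 - p2)
    return (actual - expected) / (1 - expected)
```

For example, readers who agree on three of four abstracts, with the first calling half of them relevant and the second a quarter, have actual agreement 0.75 against random agreement 0.5, giving a statistic of 0.5 even though raw agreement looks high.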
As should be expected, irrelevant topics range widely, including many articles on sun spots and winding engines.
Overall, 31% of solar and 23% of wind training data abstracts were found relevant. For solar, the proportion of relevant abstracts differed substantially between the two databases: 40% of the Engineering Village training data were relevant compared with 22% of the Web of Science training data. For wind training data, the percentages of relevant abstracts were almost identical between the two databases: 25% for Engineering Village and 22% for Web of Science. Because the wind training data consists mostly of abstracts from selected journals, as described above, the percentages of relevant abstracts in the full wind data sets will be much lower.
Figure 1. Hand‐Coded Relevance of Training Abstracts
A sample of 750 abstracts from each of the two databases was classified by hand for relevance to solar electricity, and similarly for wind. These hand‐coded abstracts form the basis of the model that was used to automatically classify the collected abstracts.
This training data was used to build Bayesian logistic classifiers for wind and solar relevance, which in turn were used to assign a probability of being relevant to each of the collected abstracts. These classifiers use the words which appear in the abstracts as regressors to predict each abstract’s probability of relevance. Only words which appear in the training data can be included. The functional form used to predict article relevance is Bayesian maximum a posteriori estimation of logistic regression models with Laplace priors, which is equivalent to a type of LASSO regression. The former interpretation, rather than LASSO, will be described here since it is the interpretation used by the authors of the software used for implementing the algorithm (Alexander Genkin, David D. Lewis and David Madigan, 2007). In statistical terminology, this is a Bayesian logistic classifier.
As of this writing, the software is freely available at <www.bayesianregression.org> (Alexander Genkin et al., 2005). The model begins with a logistic regression, with the probability of abstract $i$ being relevant represented as $p_i$ and the word counts $x_{ij}$ serving as regressors, where $j$ identifies the word:
$$p_i = P(y_i = 1 \mid x_i) \tag{4}$$
$$p_i = \frac{1}{1 + \exp\left(-\sum_j \beta_j x_{ij}\right)} \tag{5}$$
The values of $\beta_j$ are further constrained by imposing a Laplace prior distribution with a mean of zero:
$$p(\beta_j \mid \lambda_j) = \frac{1}{2} \lambda_j \exp\left(-\lambda_j |\beta_j|\right) \tag{6}$$
This distribution favors prediction of betas precisely equaling zero, thus excluding words with lower predictive power from the solved model and thereby reducing overfitting. This approach also reduces the computational power needed to apply the model. Imposing a prior distribution is what makes the model Bayesian. This constraint has the same effect as the hyperparameter constraint used in a LASSO regression.
The so‐called hyperparameter $\lambda_j$ must be chosen by the analyst. There is no theoretical reason that the value of $\lambda_j$ cannot vary with $j$, but in practice models with constant $\lambda$ perform acceptably well (Alexander Genkin, David D. Lewis and David Madigan, 2007). Therefore, it is set constant, i.e. $\lambda_j = \lambda$ for all $j$. Note that as $\lambda$ approaches zero, the model converges to a standard logistic regression solved via maximum likelihood estimation, and as $\lambda$ grows large, an increasing number of coefficients in the model are set to zero. Thus, one would like some moderate $\lambda$ that will cause some but not all of the parameter coefficients to be found to be zero. I choose $\lambda$ as follows, following one of several standard choices in the literature (Alexander Genkin, David D. Lewis and David Madigan, 2007):
$$\lambda = \sqrt{2 / d} \tag{7}$$
where $d$ is the number of unique words which appear in the training data.[iv]
Choices must also be made regarding what words to include. Although it would be possible to use phrases, combine synonyms, etc. rather than simply using discrete words as the units for the regression, discrete words are both the simplest choice and quite effective.
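To make the effect of the Laplace prior concrete, the sketch below finds the maximum a posteriori coefficients of a small logistic model, which amounts to minimizing the negative log‐likelihood plus `lam` times the sum of absolute coefficients. It uses proximal gradient descent (soft‐thresholding), which is a simpler stand‐in for, not a reproduction of, the coordinate descent used by the software cited above. The function names and the absence of an intercept are illustrative assumptions.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Shrink v toward zero by t; values within [-t, t] become exactly zero."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_laplace_logistic(X, y, lam, iters=500):
    """MAP logistic coefficients under a zero-mean Laplace prior with rate lam."""
    d = len(X[0])
    beta = [0.0] * d
    # Safe step size: 4 / sum of squared entries bounds 1 / Lipschitz constant.
    step = 4.0 / sum(v * v for row in X for v in row)
    for _ in range(iters):
        grad = [0.0] * d
        for row, yi in zip(X, y):
            err = sigmoid(sum(b * v for b, v in zip(beta, row))) - yi
            for j, v in enumerate(row):
                grad[j] += err * v
        # Gradient step on the likelihood, then the prior's shrinkage step.
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return beta


def predict_relevance(beta, row):
    """Predicted probability that an abstract with feature vector `row` is relevant."""
    return sigmoid(sum(b * v for b, v in zip(beta, row)))
```

The soft‐thresholding step is what produces coefficients that are exactly zero rather than merely small: a word whose likelihood gradient never exceeds `lam` in magnitude is dropped from the model entirely, and a sufficiently large `lam` drops every word.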
Using only discrete words also means that this method can be more easily replicated in other contexts. Before summing the word appearances, words were reduced to their “stems,” or roots; for example, the stem “calcul” could represent “calculate,” “calculation” or other forms of the word. Words consisting of only a single letter, and words appearing on a list of very commonly used words, i.e. “stopwords,” were excluded because they are unlikely to carry meaningful information about abstract relevance. Finally, words to be used in the model are further culled by discarding all words that appear fewer than a minimum number of times in the entire training sample. This number was set to 50 for solar and 100 for wind, using cross‐validation supplemented by spot‐checking of out‐of‐sample performance (precision) of the selected model.
An additional adjustment is made to the data before the model is run: cosine normalization. Given the dependence of the method on the number of times a word appears, one might worry that longer texts would be more likely to be classified as relevant. This may be a particular concern with data that covers many years, like this data, since older abstracts are often much shorter than more recent abstracts and it would be undesirable for this difference per se to be reflected in the classification results. Cosine normalization is a common method for dealing with this problem (Amit Singhal et al., 1996) and is applied here. That is, for every abstract $i$, each $x_{ij}$ is scaled down until the vector $x_i$ has a length of 1 in $d$‐dimensional Euclidean space:
$$x^*_{ij} = c_i x_{ij}, \qquad \sum_j \left(x^*_{ij}\right)^2 = 1 \tag{8}$$
where $c_i$ is the necessary scalar to make the equation hold. The resulting $x^*_{ij}$ then take the place of the raw $x_{ij}$ in the model. Thus, the regressors in the model are simply counts of how many times a given word $j$ appears in a given abstract $i$, scaled down to account for the length of the abstract.
[iv] The models are solved using a cyclic coordinate descent method adapted to deal with the peak in the Laplace distribution. Memory space is conserved by storing the data as triplets $(i, j, x_{ij})$, including only the non-zero values and omitting the many zero-valued $x_{ij}$. For further details on the software, called “BBRBMR,” see Genkin, Alexander; David. D. Lewis and David Madigan. 2005. "BBR: Bayesian Logistic Regression Software." www.stat.rutgers.edu/~madigan/bbr/ (accessed May 10, 2010).
The full models for solar and for wind relevance each are built in this way using all of the training data from both databases combined. Combining both databases ensures that an article which appears in both will be classified the same way both times it appears and increases the amount of training data used to build each model. Of the 666 stemmed words used as inputs to the solar model, 294 have nonzero coefficients in the final model; for wind, 293 words are reduced to 192 with nonzero coefficients. Lists of the fifty most positive and most negative coefficients in the final solar and wind relevance models are given in Appendix C.
As may be expected, the two models perform roughly as well as or better than the hand‐coded data used to build them, accurately predicting the relevance of 99% of solar and 97% of wind keyword‐collected abstracts. To assess their performance outside of the data used to build them, ten‐fold cross‐validation is used. Model performance drops off somewhat outside of the sample, but all four measures of successfully retrieving relevant abstracts stay above 80%. Since performance of 70‐80% is considered good (Kevin W. Boyack et al., 2008), these models perform unusually well.
Tables IIA and IIB. Solar and Wind Relevance Models’ Performance
Several measures show the performance of the models used to identify relevant wind and solar abstracts. For the purposes of these calculations, an abstract is considered relevant if the model predicts its probability of relevance to be at least 0.5.
When this probability is varied from zero to one, and the resulting percent of relevant abstracts identified is plotted versus percent of irrelevant abstracts identified as relevant, the area under this curve is the AUC. Thus, AUC represents the model’s ability to achieve both high precision and high recall, if that probability is optimized.
Solar Article Relevance Model Performance
                  Reader 1   Reader 2   Model Performance   Model Cross‐Validation
Percent Correct   98%        93%        99%                 89% (5%)
Precision         96%        88%        99%                 84% (19%)
Recall            97%        88%        99%                 81% (21%)
AUC               ‐          ‐          100%                95% (1%)
Wind Article Relevance Model Performance
                  Reader 1   Reader 2   Model Performance   Model Cross‐Validation
Percent Correct   99%        99%        97%                 90% (2%)
Precision         97%        99%        94%                 81% (9%)
Recall            97%        96%        94%                 77% (5%)
AUC               ‐          ‐          99%                 94% (2%)
In the tables above, an abstract is considered relevant if the model predicts its probability of relevance to be greater than 0.5. In theory, this cutoff is another parameter which can be varied, although if model prediction probabilities are accurate, 0.5 will be the optimal value for the cutoff. The tradeoff between precision and recall can be seen by plotting them both versus the choice of cutoff, as shown in Figure 2 below. The probability where the two graphs intersect is often considered the optimal value, which here is approximately 0.5, the desired result. Note that when the model is used to construct article counts, no cutoff value is needed since instead the raw probabilities will be summed.
Figure 2. Models’ Precision and Recall vs. Minimum Relevant Probability
Precision and recall are shown as functions of the minimum probability required for an abstract to be considered relevant. These are calculated using the final relevance models which are used to construct article counts.
Finally, the models are applied to all of the four sets of abstracts collected by keyword. A random sample of five relevant and five irrelevant abstracts from each set is shown in Appendix D as examples.
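The precision and recall figures above depend on the probability cutoff, and the tradeoff can be reproduced with a short sketch. It assumes predicted probabilities and hand‐coded 0/1 labels are held in parallel lists; the function name and the use of "at least" (rather than strictly greater than) for the cutoff are illustrative choices.

```python
def precision_recall(probs, labels, cutoff=0.5):
    """Precision and recall when abstracts with prob >= cutoff count as relevant."""
    predicted = [p >= cutoff for p in probs]
    tp = sum(1 for pr, y in zip(predicted, labels) if pr and y)
    fp = sum(1 for pr, y in zip(predicted, labels) if pr and not y)
    fn = sum(1 for pr, y in zip(predicted, labels) if not pr and y)
    # Undefined ratios (no predicted or no actual positives) are reported as NaN.
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall
```

Calling this over a grid of cutoffs between zero and one gives the two curves of the kind Figure 2 describes: lowering the cutoff can only raise recall, while precision typically falls as more marginal abstracts are admitted.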
As in the training data, most relevant solar abstracts discuss the chemistry of photovoltaics, while the subjects of relevant wind abstracts are more diverse. For wind, a few irrelevant abstracts appear among those classified as relevant (two in the sample of ten), which is consistent with the 81% precision predicted by cross‐validation.
To examine journals which contribute to this data, journal titles were cleaned. When “&” appeared it was converted to “and,” punctuation was removed, the term “Journal” was removed when it was the final word, and journal titles which appeared synonymous were identified by hand and defined to be the same. For the latter process, apparent subsections of journals were left separate. The removal of the final term “Journal” affected a negligible number of journals. All text was capitalized in order to remove the effects of case when identifying journals.
Abstracts are drawn from a wide variety of journals: even the smallest of the four sets of abstracts has over 74 journals which each contribute at least 10 abstracts to it. Highest‐contributing journals for solar include leading physics and chemistry journals as well as journals focusing on renewable energy applications. For wind, a similar combination of general and renewable‐specific technical journals is joined by journals focusing on relevant policy and management. The preponderance of journals not specific to solar or wind supports the hypothesis that fluctuations in article counts reflect variation in the amounts of meaningful solar and wind research conducted, rather than simply variations in the fates of renewable energy publications.
Tables IIIA and IIIB. Journals with the Most Articles
Journals which publish the highest numbers of articles relevant to solar or wind energy technology, as identified using the methods described above.
Solar and wind journals are joined by somewhat more general journals in the fields of renewable energy, chemistry, physics, engineering and social sciences. Totals for some journals differ between the databases because some journals are included incompletely in one or the other database.
Journals with the Most Solar Articles: Engineering Village
Rank   Journal                                                                                        Abstracts
1      Solar Energy Materials and Solar Cells                                                         2,568
2      Thin Solid Films                                                                               1,331
3      Renewable Energy                                                                               587
4      Solar Energy                                                                                   581
5      Progress in Photovoltaics Research and Applications                                            432
6      Solar Cells                                                                                    381
7      Japanese Journal Of Applied Physics Part 1: Regular Papers and Short Notes and Review Papers   335
8      Applied Physics Letters                                                                        329
9      Journal of Non Crystalline Solids                                                              323
10     Journal of Crystal Growth                                                                      305
Journals with the Most Solar Articles: Web of Science
Rank   Journal                                  Abstracts
1      Solar Energy Materials and Solar Cells   2,639
2      Thin Solid Films                         1,668
3      Applied Physics Letters                  1,225
4      Journal of Applied Physics               1,092
5      Renewable Energy                         511
6      Journal of Physical Chemistry C          490
7      Progress in Photovoltaics                453
8      Journal of Physical Chemistry B          442
9      Journal of Non Crystalline Solids        294
10     Journal of Solar Energy Engineering      286
Journals with the Most Wind Articles: Engineering Village
Rank   Journal                                                   Abstracts
1      Journal of Wind Engineering and Industrial Aerodynamics   252
2      IEEE Transactions on Energy Conversion                    249
3      Wind Energy                                               195
4      Journal of Solar Energy Engineering                       119
5      Energy Policy                                             102
6      Wind Engineering                                          93
7      IEEE Transactions on Power Systems                        79
8      Energy Conversion and Management                          75
9      Renewable and Sustainable Energy Reviews                  71
10     IEEE Transactions on Magnetics                            64
Journals with the Most Wind Articles: Web of Science
Rank   Journal                                                   Abstracts
1      IEEE Transactions on Energy Conversion                    261
2      Wind Energy                                               231
3      Journal of Solar Energy Engineering                       163
4      Journal of Wind Engineering and Industrial Aerodynamics   159
5      Energy Policy                                             114
6      IEEE Transactions on Power Systems                        113
7      New Scientist                                             106
8      Power Engineering                                         95
9      Energy Conversion and Management                          86
10     Professional Engineering                                  59
Summing the probabilities of relevance by month yields four time series,
which are the article counts used in the remainder of this essay: an Engineering Village and a Web of Science series for each of solar energy research and wind energy research. Monthly article counts average 73 from Engineering Village and 106 from Web of Science for solar, while for wind the means are much lower: 20 and 23, respectively. No month has zero articles, so there is no need for a zero‐inflated model. Further summary statistics are given in Table IV.
Table IV. Article Count Summary Statistics
Summary statistics for monthly counts of articles on solar and wind energy, by database, as calculated using the methods described above.
Summary Statistics for Article Counts
Tech Type   Database              Mean   Median   Minimum   Maximum   Standard Deviation   Months   Total Abstracts
Solar       Engineering Village   73     62       19        196       41                   168      12,232
Solar       Web of Science        106    73       32        322       69                   168      17,750
Wind        Engineering Village   20     17       4         56        13                   168      3,425
Wind        Web of Science        23     18       5         77        15                   168      3,784
The Web of Science‐ and Engineering Village‐based article counts on solar energy have a correlation of 0.89 and those on wind have a correlation of 0.72. Overlap between the two databases can be further examined by identifying journals they have in common and summing the abstracts which come from those journals, using probabilities of relevance. In all four data sets, most abstracts are from journals which appear in both databases. The proportion of abstracts from common journals is higher for solar, with 90% of Engineering Village and 83% of Web of Science abstracts coming from shared journals. For wind, 75% of Engineering Village and 69% of Web of Science articles come from shared journals. Despite this high overlap, the article counts from different databases yield different results in the regressions, suggesting differences in the types of research they contain.
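The construction of the monthly series and the cross‐database correlation can be sketched as follows. The input layout (tuples of year, month and predicted probability) and the function names are assumptions made for illustration.

```python
import math
from collections import defaultdict


def monthly_counts(abstracts):
    """Sum predicted relevance probabilities by (year, month) of publication."""
    counts = defaultdict(float)
    for year, month, prob in abstracts:
        counts[(year, month)] += prob
    return dict(counts)


def pearson(a, b):
    """Pearson correlation between two equal-length numeric series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def correlate_series(counts_1, counts_2):
    """Correlate two monthly series over the months present in both."""
    months = sorted(set(counts_1) & set(counts_2))
    return pearson([counts_1[m] for m in months],
                   [counts_2[m] for m in months])
```

Summing probabilities rather than counting abstracts above a cutoff means an abstract with probability 0.9 contributes 0.9 of an article, so each monthly count is the expected number of relevant articles and can be fractional.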
Regression results for Engineering Village and Web of Science are more similar for solar than for wind, reflecting the higher overlap in solar journals.
Table V. Article Counts by Database and Journal Overlap
Article counts from journals which appear in both Engineering Village and Web of Science databases, and from journals which appear only in one or the other database. Since some journals are indexed incompletely in at least one of the databases, article counts from shared journals are an overestimate of the number of individual abstracts which appear in both databases. Article counts are calculated by summing abstracts’ probabilities of relevance. Only abstracts from journals which contribute at least 0.5 articles are included here.
Tech Type   Database              From Database‐Specific Journals   From Shared Journals   Total Articles
Solar       Engineering Village   1,798                             15,680                 17,478
Solar       Web of Science        4,216                             20,226                 24,442
Solar       Total Articles        6,014                             35,906                 41,921
Wind        Engineering Village   1,120                             3,325                  4,445
Wind        Web of Science        1,564                             3,425                  4,990
Wind        Total Articles        2,684                             6,750                  9,434
Shared journals are responsible for more relevant abstracts per journal, on average, than journals which only appear in Engineering Village or only in Web of Science. This is consistent with the hypothesis that relatively major journals are simultaneously more likely to be covered in both databases and to publish content relevant to solar or wind energy. Journals with narrower audiences, in contrast, are more likely to be database‐specific. The relatively high proportion of database‐specific journals, compared to database‐specific abstracts, may suggest that content related to solar and wind energy is appearing in a wide array of journals.
However, it must be noted that a journal is included here if it contributes at least one half article’s worth of probability to the article counts, and this means that some journals counted in Table VI may not have produced a single article predicted to cover solar or wind energy with a probability greater than 0.5.
Table VI. Journals by Database
Numbers of journals which appear in both Engineering Village and Web of Science databases, only in Engineering Village, or only in Web of Science, identified separately for solar and for wind. Shared journals are by definition constant across databases. Only those journals which contributed at least 0.5 articles are included here.
Tech Type   Database              Database‐Specific Journals   Shared Journals   Total Journals
Solar       Engineering Village   448                          618               1,066
Solar       Web of Science        863                          618               1,481
Solar       Total Journals        1,311                        618               1,929
Wind        Engineering Village   467                          329               796
Wind        Web of Science        712                          329               1,041
Wind        Total Journals        1,179                        329               1,508
The analysis of journal overlap includes only those journals which contributed at least 0.5 articles, for methodological ease. This includes over 99% of the solar total article count for each database and over 96% of the wind article count for each database.
3.2 Regressing with Research and Production Subsidies
Article counts are then regressed on data representing technological and market effects on innovation, and on the subsidies which influence those effects. Technological effects are termed “technology push”: the enabling effect that existing knowledge has on subsequent research (Gregory F. Nemet, 2009). Subsidies for research strengthen this effect. On the market side, “demand pull” describes how demand for the innovation’s intended product encourages research (Gregory F. Nemet, 2009).
Demand in this case is considered to be the researchers’ or their funders’ expectations of future demand for solar or wind energy at the time their product may be released, based on observing demand conditions at the time research is begun. Product subsidies stimulate this demand and thus strengthen the demand pull effect. These influences are shown in Figure 3. By including variables representing each of these categories, the regression model used here compares the effects of subsidizing technology push and demand pull. This comparison is designed to inform policy decisions about how to allot funds between them (see Georgeta Vidican et al., March 15, 2009 regarding the need for such analysis).

Figure 3. Drivers of Renewable Energy Innovation
Researchers produce knowledge via a process influenced by demand, existing technology and learning from the production process. Governments influence this system by providing subsidies such as the New Technology Credits, which contribute to the demand pull effect, and research funding which goes directly to research.

The model uses measures of technology push, demand pull and subsidies affecting each to predict article counts (see Table VIII). Existing technological knowledge is represented by a lagged, smoothed version of article counts (conceptually following David Popp, 2002). Technology‐focused subsidies are represented by total federal subsidies for solar or wind energy research. On the demand pull side, demand conditions are represented by renewable energy consumption, electricity prices, fossil fuel prices paid by power plants, and gross domestic product. The demand‐side subsidy considered is the federal New Technology Credits, which include the Production Tax Credit, the largest federal expenditure for renewable energy (Energy Information Administration, 2008) and thus the largest stimulus to renewable energy demand.
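As a minimal sketch of the "lagged, smoothed version of article counts" described above, the following Python function averages a trailing window of monthly counts and shifts it back by a chosen lag. The function name, the toy series and the choice to return None where history is incomplete are assumptions made for illustration; they are not taken from the dissertation's own code.

```python
from statistics import mean


def lagged_rolling_average(counts, lag, window=12):
    """For each month t, average the `window` months ending at month t - lag.

    Months without a full window of history are returned as None.
    """
    smoothed = []
    for t in range(len(counts)):
        end = t - lag              # last month included in the average
        start = end - window + 1   # first month included in the average
        if start < 0:
            smoothed.append(None)  # not enough history yet
        else:
            smoothed.append(mean(counts[start:end + 1]))
    return smoothed


# Hypothetical monthly article counts for 24 months (toy values only).
toy_counts = [float(m) for m in range(1, 25)]
smoothed = lagged_rolling_average(toy_counts, lag=7)
print(smoothed[18], smoothed[23])
```

With a seven‐month lag and a twelve‐month window, the first usable month in this sketch is the nineteenth, since the lag and the window together consume the eighteen months before it.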
A time lag between the dependent and independent variables is used to reflect the amount of time needed to produce the research after deciding to do so. The regressions are run separately for solar and wind for each database, producing four sets of results.

The regressor data are from several federal sources. The U.S. government reports research expenditures on wind, solar, and a variety of other energy technologies to an international database maintained by the Organisation for Economic Co‐operation and Development and International Energy Agency; this is the source for wind and solar energy research subsidies (OECD/IEA, 2010). These data are reported annually and are aligned with calendar years. New Technology Credit costs are represented by the amount of tax receipts not collected due to the credits, as reported in annual budget documents (Office of Management and Budget, For each of Fiscal Years 1996‐2011). Unfortunately, while the total expenditures on the New Technology Credits are available, the OMB did not report tax expenditures for the Production Tax Credit and Investment Tax Credit separately for years before 2008 (Office of Management and Budget, For each of Fiscal Years 1996‐2011). Since the division of tax credit expenditures among wind, solar and other sources also is not readily available, the total is used and may therefore be considered a signal of the total value which wind or solar innovators could hope to access. New Technology Credit expenditures are reported for each fiscal year, i.e. October of the preceding year through September of the fiscal year, so they are aligned by fiscal year. Data on other aspects of the market for renewable energy (i.e. renewable energy consumption, average fossil fuel prices paid by power plants and average electricity prices) are from the U.S. Energy Information Administration (Energy Information Administration, 2010).
Finally, GDP data are drawn from Bureau of Economic Analysis tables (Bureau of Economic Analysis, August 2010).

All dollar amounts are inflation‐adjusted to 2009 dollars. For research subsidies, the inflation adjustment has already been conducted by the data provider. Annual New Technology Credits expenditures were adjusted using annual Consumer Price Index factors (Bureau of Labor Statistics, 2011). For consistency, fossil fuel prices and electricity prices were adjusted using the same annual factors. The New Technology Credit expenditures for each fiscal year were adjusted using an inflation factor for that fiscal year. This fiscal year factor is constructed by taking the geometric mean of the annual inflation factors for each of the months within that fiscal year, i.e. three months of the preceding year’s annual inflation factor and nine months of the calendar year’s annual inflation factor.

All of the variables used vary substantially across the time period considered, as shown in Table VII. Both research subsidies and the tax credit expenditures have risen and fallen several times during the years considered. The two subsidy types are available only as annual data, so they are repeated across each month for the purposes of the regressions. All other data were collected monthly, except for GDP for which quarterly data are interpolated monthly using geometric means. The time period considered is limited by the availability of monthly electricity price data. By definition, the only variables which differ between solar and wind are article counts and research subsidies, which both are much larger for solar than wind.

Table VII. Summary Statistics for Article Counts, Subsidies and Control Variables
Summary data for article counts and other data, all for January 1995 through December 2008.
| Summary Statistics | Solar | Wind | Units | Frequency | Source |
|---|---|---|---|---|---|
| Engineering Village Article Counts | 73 (41) | 19 (12) | articles | monthly | new |
| Web of Science Article Counts | 106 (69) | 23 (15) | articles | monthly | new |
| Research Subsidies | 114 (33) | 41 (11) | million $ (a) | annual | OECD/IEA |
| New Technology Credits | 250 (244) | 250 (244) | million $ expended (b) | annual | OMB data (c) |
| Renewable Consumption | 546 (58) | 546 (58) | million Btu | monthly | EIA |
| Electricity Prices | 9.0 (0.6) | 9.0 (0.6) | $/million Btu (b) | monthly | EIA |
| Fossil Fuel Prices | 2.6 (0.8) | 2.6 (0.8) | $/million Btu (b) | monthly | EIA |
| GDP | 12,600,000 (1,300,000) | 12,600,000 (1,300,000) | seasonally adjusted million $ (b) | quarterly | BEA |

a. inflation‐adjusted by source using 2009 CPI
b. inflation‐adjusted by author using average 2009 CPI
c. compiled from annual Analytical Perspectives budget documents

A preliminary visual examination of article counts and tax credit levels over time suggests they could be related. In Figure 4 below, each set of article counts is plotted over time with the number of articles shown on the left axis and the New Technology Credits expenditures overlaid and labeled on the right axis. Federal research subsidies are overlaid as well, magnified to five times their actual levels for ease of observation. Overall, article counts and tax credit levels rise concurrently in all cases, with drops in the tax credit in 2005 and 2007 followed by drops in solar articles in Engineering Village that are not mirrored in the other series of article counts. Research subsidies for both solar and wind are level or falling during most of this time period. To further investigate the relationship between the tax credit and research, we turn now to the results of the regression models.

Next page: Article counts have grown concurrently with the New Technology Credits, while federal research spending on solar and wind has stayed level during most of this time period.
Article counts shown are twelve‐month‐smoothed to remove seasonal variation and all dollar amounts are shown in 2009 dollars, with research subsidies increased by a factor of five in order to make them easily observable in the same graphs.

Figure 4. Article Counts, Research Subsidies and New Technology Credits Over Time with Solar and Wind Articles from Engineering Village and Web of Science

The regression models place a lag, measured in months, between regressors and article counts, representing the length of time to produce an article. But what is the appropriate value for this lag? If the primary decision is made by journal editors who decide whether an article will be published, then the lag represents the time between submission and publication. Taking the mean of journal publication times reported in Luwel and Moed (1998), using only journals also included in my data and weighting them by appearance in my data set, this length of time is estimated to be seven months. For details of this calculation, see Appendix E.

However, a more realistic model may define the lag to include the time authors take to complete their research, considering authors to be the producers of research. Since most research articles are not specifically solicited by journals, and since authors may submit their work to multiple journals, focusing on authors seems more appropriate. However, there is little data on how long it takes authors to produce research articles. For these reasons, several values of the lag are considered, selected to span possible values: seven months, one year, two years, three years and five years. Significant regression results may reveal how long before publication subsidies have their impacts.

Article counts are count data, so they are modeled using a Poisson regression with a log link.
Thus, the expected number of article counts in a given month $t$ is

$$E[A_t] = \mu_t = \exp\left(\beta_0 + \beta_1 RA_{t-L} + \beta_2 RS_{t-L} + \beta_3 NTC_{t-L} + \beta_4 REC_{t-L} + \beta_5 EP_{t-L} + \beta_6 FP_{t-L} + \beta_7 GDP_{t-L}\right) \qquad (9)$$

where

$$A_t \sim \mathrm{Poisson}(\mu_t) \qquad (10)$$

and the independent variables include a lagged rolling average of article counts, where the average is across the twelve months up to and including $t-L$, with $L$ the lag in months:

$$RA_{t-L} = \frac{1}{12}\sum_{i=0}^{11} A_{t-L-i} \qquad (11)$$

In short, the expected value of article counts at time $t$ is modeled as a Poisson function of a constant; the average ($RA$ for “rolling average”) of article counts from the twelve months ending at month $t-L$; solar or wind technology subsidies during that year; the New Technology Credit expenditures during that year; and other market conditions in month $t-L$.

Like many count data sets, article counts are overdispersed relative to the Poisson assumption that their variance equals their mean. As a result, the analytically calculated standard errors of Poisson model coefficients may be inaccurate. An alternative method of finding standard errors must be used, such as bootstrapping or quasi‐maximum likelihood estimation. Standard errors here are calculated by bootstrapping article counts, with bootstrap samples selected from the full sets of unclassified abstracts. For full details on the bootstrapping approach, see Appendix F.

Table VIII. Regression Nomenclature
Nomenclature for the regression equation. Article counts are modeled as a function of subsidies and other market conditions.

| Variable Name | Means | Represents |
|---|---|---|
| $A_t$ | count of research articles published in month $t$ | technology push |
| $RS$ | research subsidies | subsidy to technology push |
| $NTC$ | New Technology Credit expenditures | subsidy to demand pull |
| $REC$ | renewable electricity consumption | demand pull |
| $EP$ | electricity prices | demand pull |
| $FP$ | fossil fuel prices paid by power plants | demand pull |
| $GDP$ | gross domestic product | demand pull |
| $L$ | months before articles are published | N/A |

4. Results

Both research subsidies and the New Technology Credits are associated with significant increases in some solar or wind research output, as measured by article counts.
The New Technology Credits have their most consistent effects on solar research in the more applied database, while also showing significant effects after seven months on solar research and wind research in the more general database. Thus, they appear effective at driving applied solar research and potentially other types of research. Research subsidies have clear effects on both solar and wind applied research, with effects in the less applied database appearing significant for wind only after five years. Dollar for dollar, research subsidies have much larger effects than New Technology Credits expenditures. Given that they impact research indirectly, the New Technology Credits’ effects may also be considered substantial. These results are consistent with federal policies playing a significant role in driving renewable energy innovation.

Results of a hypothetical increase of $10 million in the New Technology Credit expenditures, solar research funding or wind research funding are shown in the tables below. This format, rather than raw coefficients from the models, is chosen for increased interpretability. The amount of $10 million is equal to 1.1% of the actual New Technology Credits expenditures, 6.1% of solar research funding or 37% of wind research funding levels in December 2008, the last month included in the analysis. Thus, it represents a plausible policy in terms of size, a reasonable extrapolation from the model except perhaps for wind, and a convenient level for comparing the effects of potential uses of public funding for either production or research. Analogous tables showing the effects of all coefficients are given in Appendix G. It should be recalled that research subsidy impacts occur only for wind or for solar research, depending on which type is being subsidized, while New Technology Credits affect wind and solar research simultaneously.
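Because the models use a log link, a coefficient translates into a multiplicative change in expected article counts, which is why percent changes for a fixed $10 million increase are a convenient way to read the results. The sketch below assumes, hypothetically, that each subsidy enters the linear predictor in levels (millions of dollars); the function names are illustrative, and the coefficient is backed out from a reported percent change rather than taken from the fitted models. The 2008 research funding levels of $163 million (solar) and $27 million (wind) are those cited in the discussion of research subsidy effects, and 0.60% is the seven‐month New Technology Credits effect on solar Engineering Village articles reported in Table IX.

```python
import math


def percent_change(beta, delta_millions):
    """Percent change in expected counts when a regressor rises by delta (log link)."""
    return 100.0 * (math.exp(beta * delta_millions) - 1.0)


def implied_coefficient(pct_change, delta_millions):
    """Invert percent_change: the coefficient that would yield pct_change for delta."""
    return math.log(1.0 + pct_change / 100.0) / delta_millions


def share_of_baseline(increase_millions, baseline_millions):
    """Size of an increase as a percentage of the baseline funding level."""
    return 100.0 * increase_millions / baseline_millions


INCREASE = 10.0  # the hypothetical $10 million increase, in millions of dollars

solar_share = share_of_baseline(INCREASE, 163.0)  # solar research funding, 2008
wind_share = share_of_baseline(INCREASE, 27.0)    # wind research funding, 2008

# Illustrative only: coefficient implied by a reported 0.60% effect.
beta_example = implied_coefficient(0.60, INCREASE)
round_trip = percent_change(beta_example, INCREASE)

print(f"solar share: {solar_share:.1f}%, wind share: {wind_share:.0f}%")
print(f"implied coefficient per $1 million: {beta_example:.6f}")
```

Under these assumptions the two shares reproduce the 6.1% and 37% figures quoted above.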
Each column gives results for the models using the specified lag time, representing the time it takes to produce the research, with lower and upper bounds calculated using that model’s bootstrapped standard errors. As noted earlier, bootstrapping methods are further detailed in Appendix F. Each pair of rows shows results for one of the four sets of article counts. Statistical significance of each predicted percentage increase is calculated across both policy variables, both databases and all lag times for either wind or solar (2 × 2 × 5 = 20 comparisons), i.e. significant results are significant individually at the 0.05/20 = 0.0025 level. This conservative approach to statistical significance guarantees that chance appearances of significance are no more likely than if only one lag time and one variable of interest were considered, while conducting the analysis for solar or for wind. White cells and double asterisks identify these results in the table below.

Effect sizes given are for a single month, since the model uses monthly units. Since article counts rise dramatically with the passage of time, an effect size of a given percent will describe different numbers of articles depending upon when it occurs. For this reason, percentages rather than absolute article counts are emphasized in interpreting model results. Starting from December 2008, the last month in the data, a 1% increase in research consists of approximately two solar articles in Engineering Village; three solar articles in Web of Science; 0.4 wind articles in Engineering Village; or 0.6 wind articles in Web of Science.

The percent of deviance explained by these models ranges from 44% for solar Engineering Village articles using a five‐year lag to 85% for solar Web of Science articles using a seven‐month lag. The New Technology Credits and research funding, when significant, are each responsible for 0.5% up to 3.6% of the variance; that is, they typically account for a few percent of the variance in article counts.
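The multiple‐comparison threshold and the percent‐to‐articles conversion described above can be checked with a short sketch. The constant and function names are illustrative. The two‐sided normal critical value is an assumption of this sketch only, since the text does not state which reference distribution was used with the bootstrapped standard errors.

```python
from statistics import NormalDist

# Comparisons counted within one technology (solar or wind), as described above.
N_POLICY_VARIABLES = 2   # research subsidies, New Technology Credits
N_DATABASES = 2          # Engineering Village, Web of Science
N_LAGS = 5               # seven months, one, two, three and five years
FAMILY_ALPHA = 0.05

n_tests = N_POLICY_VARIABLES * N_DATABASES * N_LAGS
per_test_alpha = FAMILY_ALPHA / n_tests  # Bonferroni-style threshold

# Assumption: two-sided test against a standard normal reference distribution.
z_critical = NormalDist().inv_cdf(1.0 - per_test_alpha / 2.0)


def articles_for_percent(pct_increase, articles_per_percent):
    """Convert a percent increase into articles, given articles per 1% (Dec. 2008)."""
    return pct_increase * articles_per_percent


# "Approximately two" solar Engineering Village articles per 1%, so this is rough.
solar_ev_articles = articles_for_percent(8.5, 2.0)

print(n_tests, per_test_alpha, round(z_critical, 2), solar_ev_articles)
```

Because the per‐percent article figures are rounded, conversions like the last line are only approximate and can differ by an article or so from the counts quoted in the text.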
Table IX. Predicted Research Subsidy and New Technology Credit Impacts on Solar and Wind Energy Research
Results from Poisson regressions of article counts with subsidy and other variables. Percent changes resulting from a hypothetical increase in solar research subsidies, wind research subsidies or New Technology Credits are shown, rather than raw coefficients, for increased interpretability. Highlighting and double asterisks (**) mark significance at the five‐percent level across models and subsidies for wind or for solar, using bootstrapped standard errors. Five percent significance for a single variable is identified by a single asterisk (*). Although most of the variation is due to other sources, both subsidies have statistically significant effects on wind and solar research after certain lag times, as shown by the highlighted cells.

Predicted Article Count Changes After a $10 Million Increase in Research Subsidies or the New Technology Credits, by Production Time

| Dependent Variable | Subsidy | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|---|
| Solar Articles from Engineering Village | Solar Research Subsidies | -0.35% (-0.61% ~ -0.093%) | 1.4% (1.2% ~ 1.7%)** | 2.0% (0.79% ~ 3.3%) | 8.5% (6.9% ~ 10%)** | -3.9% (-5.6% ~ -2.2%) |
| Solar Articles from Engineering Village | New Technology Credits | 0.60% (0.54% ~ 0.67%)** | 0.43% (0.36% ~ 0.51%)** | 0.55% (0.43% ~ 0.68%)** | -0.12% (-0.27% ~ 0.033%) | -0.62% (-0.90% ~ -0.34%) |
| Solar Articles from Web of Science | Solar Research Subsidies | -1.1% (-1.4% ~ -0.74%) | -0.85% (-1.3% ~ -0.43%) | 2.4% (1.1% ~ 3.8%) | -0.18% (-1.8% ~ 1.5%) | 3.6% (1.5% ~ 5.8%) |
| Solar Articles from Web of Science | New Technology Credits | 0.35% (0.28% ~ 0.42%)** | 0.077% (0.0055% ~ 0.15%) | 0.24% (0.12% ~ 0.35%) | 0.28% (0.099% ~ 0.46%) | -0.54% (-0.90% ~ -0.18%) |
| Wind Articles from Engineering Village | Wind Research Subsidies | 4.6% (1.4% ~ 8.0%) | 14% (10% ~ 18%)** | 19% (16% ~ 22%)** | 26% (20% ~ 33%)** | 5.1% (-1.1% ~ 12%) |
| Wind Articles from Engineering Village | New Technology Credits | 0.14% (-0.012% ~ 0.29%) | -0.19% (-0.34% ~ -0.033%) | 0.66% (0.42% ~ 0.89%)* | -0.085% (-0.34% ~ 0.17%) | 0.026% (-0.56% ~ 0.62%) |
| Wind Articles from Web of Science | Wind Research Subsidies | -13% (-15% ~ -11%) | -6.0% (-8.1% ~ -3.9%) | -3.3% (-6.0% ~ -0.44%) | 16% (9.2% ~ 23%) | 33% (27% ~ 39%)** |
| Wind Articles from Web of Science | New Technology Credits | 0.46% (0.33% ~ 0.59%)** | 0.13% (-0.0024% ~ 0.26%) | 0.21% (0.020% ~ 0.41%) | 0.075% (-0.17% ~ 0.32%) | 0.35% (-0.070% ~ 0.76%) |

4.2 Research Subsidy Effects

Research subsidies for wind or solar energy have strong significant effects on article counts from Engineering Village, the more applied of the two article databases. For solar, an increase of $10 million in funding would net an increase of 1.2‐10% in solar research after one or three years, if the effect is within one standard error of model predictions. The expected effect is an increase of 1.4% or 8.5% after one or three years, respectively. If the increases occurred in December 2008, the last time period in the data set, these changes would represent 3 or 16 articles in the first month of the effect, which is modeled as lasting for twelve months; and the $10 million would be in addition to the actual $163 million in solar research funding.

Research subsidy effects are less apparent in Web of Science, the less applied of the two databases. For solar, there is no statistically significant effect at all. As noted earlier, Web of Science includes more basic conceptual research and social science research, while Engineering Village focuses more on technical research that is close to application. Thus, it is likely that research subsidies have no significant effects on Web of Science’s solar energy research as a result of subsidies being deliberately targeted towards more applied research.

If the same $10 million were applied to wind research instead, wind research articles in Engineering Village would increase by 10‐33%, or an expected 14%, 19% or 26% after one, two or three years (5, 7 or 10 articles in the first of twelve months).
Given that wind research subsidies were $27 million in 2008, a more reasonable calculation may be the 1% or 2% increase expected after a $1 million increase in subsidies. Wind research percentage increases are roughly three times those for solar, but they represent a similar number of articles and explain a similar percentage of deviance, which is consistent with wind having a lower baseline level of research. If the effects are causal, these results suggest research funding can be used to encourage significant increases in applied wind or solar research, with a larger effect on wind relative to the current level of research.

For Web of Science, research subsidies appear to take a long time to affect wind research. Their effect is significant only after five years, predicting an increase of 27‐39% in wind articles after a $10 million subsidy increase (a mean of 33%, or 20 articles). As noted above, because wind subsidies are initially only $27 million, it is more reasonable to consider the effect of a smaller subsidy increase: $1 million would be followed by a 3% increase in research. Given the wide margin of error for the insignificant impact on wind after three years, together these results suggest that wind research subsidies may impact wind research in Web of Science, but only after many years.

4.3 New Technology Credit Effects

The New Technology Credits have consistent impacts on solar research in Engineering Village and potentially in Web of Science. Their effect sizes are much smaller than the effects seen for research subsidies, as is expected because their dollars do not go directly to research. In Engineering Village, statistically significant effects occur after seven months to two years.
If $10 million is used to increase the tax credit, solar research in Engineering Village is predicted to increase by 0.36‐0.68% in this time period, with expected increases of 0.60%, 0.43% or 0.55% in seven months, one year or two years, respectively (about one article in all cases). The fact that the tax credit has such a strong effect on solar research is particularly indicative of the signaling effect of the tax credit, since most of its funds currently go to wind and other non‐solar types of renewable energy production. Since $10 million represents just 1.1% of the New Technology Credits’ value in December 2008, these results can be interpreted to suggest that every one percent increase in the tax credits beyond their 2008 value would lead to a roughly half percent increase in solar research, which would occur within seven months to two years.

In Web of Science, New Technology Credit impacts on solar appear very early. That is, solar article counts increase by a statistically significant 0.28%‐0.42% after seven months (with a mean of 0.35%, one article in first month) and show positive but not statistically significant responses after one, two or three years. The short time frame suggests that this response to tax credits may not reflect an increase in total research but instead in decisions to publish. If so, this impact of the New Technology Credits on solar research timing is analogous to Production Tax Credit impacts that have been found on wind investment timing (Merrill Jones Barradale, 2010).

New Technology Credit impacts on wind research are particularly striking in comparison with solar. For Engineering Village, no tax credit effects are statistically significant for wind, nor is there even a pattern of positive effect sizes. This stands in contrast to strong effects for solar within the first seven months to two years.
These two findings are consistent with wind being a more mature technology than solar, such that wind power is already cost‐competitive with natural gas under some conditions (Lori Bird, Mark Bolinger, Troy Gagliano, Ryan Wiser, Matthew Brown and Brian Parsons, 2005). If tax credit users are ready to buy wind turbines currently available, they may therefore be relatively uninterested in further applied research. Given that the margins of error include effect sizes the same order of magnitude as the significant effects for solar, it is also possible that this lack of significance for wind is an artifact of noise in the data.

In Web of Science, New Technology Credits effects on wind research are very similar to their effects on solar research. A statistically significant 0.33%‐0.59% increase after seven months (with a mean of 0.46%, 0.3 articles in first month) is followed by positive but non‐significant increases across later time periods. Recall that for solar, a similar pattern occurs, with 0.28‐0.42% as the significant effect after seven months. As for solar, the early increase in article counts may reflect decisions to publish wind research in response to the New Technology Credits in order to capitalize on the credits while they last, as opposed to a genuine increase in total research production. Insofar as there may be a real increase in net Web of Science wind research, perhaps it consists of basic research which may eventually lead to large technological changes and geographic or social science research that could prove helpful in planning wind turbine installations, since these kinds of research are more prevalent in Web of Science than Engineering Village.

4.4 Contextualizing the Results

Throughout this discussion, the relationships between article counts and subsidy types have been interpreted as causal.
It is also possible that article counts and renewable energy incentives are both functions of a third variable, such as political interest in renewable energy, or even that articles cause subsidies and follow a cyclical pattern which causes the results observed. Alternatively, the pattern of subsidy timing affecting article timing but not net article counts may be more widespread than hypothesized above. Further exploration of these possibilities is beyond the scope of this dissertation. Ceteris paribus, the importance of funding in making research possible and worthwhile makes a causal effect of tax credits and research subsidies on article counts the most plausible explanation of the relationships identified in this essay.

It is clear from these results that federal research subsidies and the New Technology Credits are not the primary drivers of wind and solar research. They account for at most four percent of the variation in published research levels during the fourteen years covered by this study. This percentage is consistent with the fact that myriad other factors, including subsidy policies in the rest of the world, affect this research area. It is equally clear that U.S. research funding and the New Technology Credits are associated with noteworthy increases in published solar and wind research. Research subsidies have strong effects on applied solar and wind research, and even a one percent increase in the tax credit could spur a roughly half percent increase in published solar research.

5. Conclusion

Wind and solar research subsidies and the New Technology Credits are associated with substantial increases in research. These associations are clearest between research subsidies and applied solar or wind research and between New Technology Credits and applied solar research, with additional impacts appearing in narrower time frames.
The particular sizes and timing of these effects are identified and therefore serve as potential inputs to later models and policy discussions. This analysis was made possible by employing Bayesian text analysis to construct the article counts used as the dependent variable, which now provide a new dataset for future study.

This investigation suggests that both research subsidies and the New Technology Credits lead to increases in published solar or wind energy research, and indicates to what degree in what time period for what types of research. Research funding yields substantial increases across several years in applied research on both wind and solar electricity. Wind research funding is also followed by an increase in less‐applied wind research after five years. Thus, research funding is effectively driving at least applied research output in both wind and solar research. In addition, New Technology Credits spur applied solar research across several years, demonstrating a seldom measured benefit of the tax credit. The New Technology Credits are also followed by significant increases in less‐applied solar and wind research after seven months, with no statistically significant results after longer lag times. In short, the quantity of applied solar research is clearly affected by the New Technology Credits, while the timing and quantity of other types of research may also be affected. Although none of these subsidies are the largest drivers of wind or solar research, their effect sizes are large enough to suggest that such policies can be used deliberately to increase renewable energy research.

These results confirm and extend previous research on the relationships between incentives and renewable energy research.
Research subsidy effects on article counts broadly confirm previous work demonstrating connections between research subsidies and renewable energy patent counts, including the finding that effects on wind may be over four times the size of effects on solar, as concluded using European subsidies and patent counts (Nick Johnstone, Ivan Haščič and David Popp, 2009). The same previous work found no connection between patent counts and the presence or absence of renewable energy tax credits. Perhaps due to use of a continuous measure of tax credit value, the present research identifies strong connections between New Technology Credits and research quantities, which is, to the author’s knowledge, an entirely new empirical finding. The identification of strong effects for research subsidies and lesser but significant effects for tax credits confirms the predictions of theoretical work on direct and indirect incentives for innovation. The present work adds layers of detail to these predictions by identifying absolute effect sizes and their variation across time periods, databases and technology types, as could not be predicted by broad theoretical models.

The effects reported here are designed to serve as inputs for policy decisions and models in which renewable energy research output is a consideration. For example, it may be reported that this research concludes that increasing the New Technology Credit outlays by two percent is likely to result in roughly a one percent increase in applied solar research. If the models predict correctly, this increase will occur within seven months to two years. As a new and strong finding, this particular example is one of the more interesting conclusions of this study and should be considered when evaluating the value of the New Technology Credits as a public policy. Similar calculations can be made using other statistically significant results.
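The "similar calculations" mentioned above can be sketched as follows. The code rescales the reported effect of a $10 million increase (1.1% of the December 2008 New Technology Credits, as stated in the Results) to another percentage increase in the credits. It assumes the log‐link form, so that effects compound multiplicatively. The function name and this way of extrapolating are assumptions for illustration; the three effect sizes are the significant solar Engineering Village results from Table IX.

```python
# $10 million as a percentage of December 2008 New Technology Credits (from the text).
SHARE_OF_CREDITS_PCT = 1.1

# Significant New Technology Credit effects on solar Engineering Village articles.
solar_ev_effects_pct = {"7 months": 0.60, "1 year": 0.43, "2 years": 0.55}


def rescaled_effect(effect_pct, credit_increase_pct, share_pct=SHARE_OF_CREDITS_PCT):
    """Percent change in articles for a given percent increase in the credits.

    Assumes a log-link model, so k multiples of the $10 million increase
    compound as (1 + effect)^k rather than adding linearly.
    """
    multiples = credit_increase_pct / share_pct
    return 100.0 * ((1.0 + effect_pct / 100.0) ** multiples - 1.0)


# Effect of a two percent increase in the credits at each significant lag.
two_percent = {
    lag: rescaled_effect(effect, 2.0)
    for lag, effect in solar_ev_effects_pct.items()
}

for lag, value in two_percent.items():
    print(f"{lag}: {value:.2f}% more applied solar articles")
```

Under these assumptions a two percent increase in the credits corresponds to somewhat under or just over one percent more applied solar research, depending on the lag, in line with the "roughly one percent" figure quoted above.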
Similarly, these findings may be used to calibrate more detailed models of policy outcomes. Economic models are often used to predict the impacts of climate‐related or innovation‐related policies on renewable energy use, industry profits, jobs or other outcomes of interest. The effects calculated here may be included in such models as parameters describing the effects of solar research subsidies, wind research subsidies or production‐side subsidies on wind and solar research. For climate policy models in particular, innovation is an outcome that has often been left out of modeling exercises despite its obvious importance because of the difficulty of parametrizing it. The present results can help fill this gap in policy analysis. At the same time, it must be noted that the quantities calculated here should be refined by subsequent related research. Here, a fairly simple form was used to model effect sizes, in order to focus on the overall research approach and the construction and use of article counts. Later work can and should use varying models to unpack causality, including autoregressive models, investigation of the role of stability of tax credits, and consideration of the impact of renewable energy incentives in other countries. Similarly, U.S. state‐level renewable energy policies are becoming widespread enough to enable assessment of their effects on innovative research. Separating the two tax credits considered here also may become feasible in the upcoming years, since they have been recorded separately since 2008. More complex functional forms may require stronger assumptions but reveal more about the interplay among renewable energy policies, markets and innovation. The current article counts can be used for the above research directions. 
To further test the robustness of the current results, similar article counts can be produced by similar methods, with differing particular decisions such as use of articles in other languages, and the current article counts can be used as an input into more sophisticated models of research production. In addition, comparing these article counts more directly with patent counts could help elucidate the research process and the timing of incentive effects. However, patent counts need not be expected to produce similar significant results to those for article counts, since articles are more numerous and more quickly published, and therefore subsidy effects on article counts may be easier to identify than subsidy effects on patent counts.

Earlier research on articles has typically used hand‐classification or keyword selection alone to identify relevant articles. Here, a combination of both with Bayesian logistic modeling enabled the condensation of over 200,000 raw, potentially relevant abstracts into four monthly time series of article counts describing wind and solar energy research. These article counts now constitute data available for future study.v While this investigation has focused on the impact of U.S. renewable energy policies, the article counts could be used to investigate many other aspects of innovation. Given the process used to construct them, the existence of many articles not covered in either database used, and other factors, these article counts may best be considered not an absolute measure of article quantities but a proxy for published research output, or more broadly for innovation. In this way, the present essay adds new effect size results, and a new response variable, to the study of incentives for renewable energy innovation.

The study of renewable energy innovation is still an emerging field.
Theory suggests that implementing a combination of research and product pricing policies may be needed to overcome the joint market failures of the social benefits of innovation and environmental improvement. Models can be used to find policy combinations most consistent with a given set of innovation and implementation climate policy goals, but only if those models are calibrated with the results of empirical work. This essay provides a piece of that puzzle. v Please contact the author if you are interested in using these article counts. They may also be made available in later work. 40 Appendix A. Wind Oversampling Journals Abstracts matching the keyword “wind” were found to be mostly irrelevant. Therefore, the hand‐coded abstracts for wind included 250 randomly selected abstracts and 500 abstracts from selected journals likely to have more relevant articles, for each of the two databases. In order to choose the journals to use for this, a random sample of 10,000 abstracts was drawn from each of the two databases. Of the journals contributing to these, all journals which plausibly may publish wind energy research were identified. This selection erred on the inclusive side in order to minimize biasing the results. The selected journals are listed below, formatted as they are in their respective databases. 
Wind Oversampled Journals

Engineering Village:
American Society of Mechanical Engineers, Fluids Engineering Division (Publication) FED
Cambridge Univ Press
IECON Proceedings (Industrial Electronics Conference)
IEE Proceedings: Electric Power Applications
IEEE Transactions on Energy Conversion
IEEE Transactions on Industrial Electronics
IEEE Transactions on Industry Applications
IEEE transactions on power apparatus and systems
IEEE Transactions on Power Delivery
Mechanics of Composite Materials
Pergamon Press
PESC Record - IEEE Annual Power Electronics Specialists Conference
Proceedings of the ASME Turbo Expo
Proceedings of the IEEE Power Engineering Society Transmission and Distribution Conference
Proceedings of the Institution of Mechanical Engineers, Part A: Journal of Power and Energy
Proceedings of the Universities Power Engineering Conference
Renewable Energy
SAE Special Publications
Wind Engineering

Web of Science:
ADVANCED MATERIALS & PROCESSES
ADVANCES IN ENGINEERING SOFTWARE
APPLICATION AND THEORY OF PERIODIC STRUCTURES
APPLIED MATHEMATICS AND MECHANICS-ENGLISH EDITION
CANADIAN CONFERENCE ON ELECTRICAL AND COMPUTER ENGINEERING
CHALLENGES OF POWER ENGINEERING AND ENVIRONMENT
CHINESE SCIENCE BULLETIN
DESIGN AND MANUFACTURE FOR SUSTAINABLE DEVELOPMENT
ENERGY
ENVIRONMENT AND PLANNING C-GOVERNMENT AND POLICY
ENVIRONMENTAL CONSERVATION
ENVIRONMENTAL TECHNOLOGY
HIGH PERFORMANCE COMPUTING IN SCIENCE AND ENGINEERING
IEE PROCEEDINGS-CONTROL THEORY AND APPLICATIONS
IEEE BUCHAREST POWERTECH
IEEE CONTROL SYSTEMS MAGAZINE
IEEE INSTRUMENTATION AND MEASUREMENT TECHNOLOGY CONFERENCE
IEEE LATIN AMERICA TRANSACTIONS
IEEE LAUSANNE POWERTECH
IEEE POWER ENGINEERING SOCIETY GENERAL MEETING
IEEE SPECTRUM
IEEE TRANSACTIONS ON POWER DELIVERY
IEEE TRANSACTIONS ON POWER SYSTEMS
INTERNATIONAL CONFERENCE ON ELECTRICAL MACHINES AND SYSTEMS
INTERNATIONAL CONFERENCE ON APPLIED SUPERCONDUCTIVITY AND ELECTROMAGNETIC DEVICES
INTERNATIONAL CONFERENCE ON CONTROL AUTOMATION ROBOTICS & VISION: ICARV
INTERNATIONAL CONFERENCE ON ELECTRICAL POWER QUALITY AND UTILISATION
INTERNATIONAL CONFERENCE ON POWER ELECTRONICS AND DRIVE SYSTEMS
INTERNATIONAL CONFERENCE ON POWER ELECTRONICS SYSTEMS AND APPLICATIONS: ELECTRIC VEHICLE AND GREEN ENERGY
INTERNATIONAL JOURNAL OF MODERN PHYSICS B
INTERNATIONAL JOURNAL OF NON-LINEAR MECHANICS
JOURNAL OF APPLIED PHYSICS
JOURNAL OF ENERGY
JOURNAL OF FLUIDS ENGINEERING-TRANSACTIONS OF THE ASME
JOURNAL OF GUIDANCE CONTROL AND DYNAMICS
JOURNAL OF PROPULSION AND POWER
JOURNAL OF SCIENTIFIC & INDUSTRIAL RESEARCH
JOURNAL OF SOLAR ENERGY ENGINEERING-TRANSACTIONS OF THE ASME
JOURNAL OF THE STRUCTURAL DIVISION-ASCE
JOURNAL OF TURBOMACHINERY-TRANSACTIONS OF THE ASME
JOURNAL OF TURBULENCE
JOURNAL OF WIND ENGINEERING AND INDUSTRIAL AERODYNAMICS
MANAGEMENT OF NATURAL RESOURCES, SUSTAINABLE DEVELOPMENT AND ECOLOGICAL HAZARDS II
MATHEMATICS AND COMPUTERS IN SIMULATION
MECHANICS OF COMPOSITE MATERIALS
NATURE
NEW JOURNAL OF PHYSICS
NEW RESULTS IN NUMERICAL AND EXPERIMENTAL FLUID MECHANICS VI
OPTICAL ENGINEERING
POWER
PROCEEDINGS OF THE ASME PRESSURE VESSELS AND PIPING CONFERENCE VOL
PROCEEDINGS OF THE ASME TURBO EXPO
PROCEEDINGS OF THE POWER CONVERSION CONFERENCE
RENEWABLE & SUSTAINABLE ENERGY REVIEWS
RENEWABLE ENERGY
Science Of Making Torque From Wind
SIMULATION
SOLAR ENERGY
SOLAR WORLD CONGRESS
WIND AND STRUCTURES
Wind Energy
WIND ENERGY
WIND ENGINEERING INTO THE 21ST CENTURY
WINDPOWER

Appendix B. Codebook

A codebook was created which defines what makes an article relevant or irrelevant to solar energy for the purposes of this analysis, and similarly for wind energy. The solar categories were used to hand-classify the solar training data, and the wind categories were used to hand-classify the wind training data. The full codebook also describes a variety of subcategories, which may be used for subsequent research. A simplified version of the codebook, describing the categories used here, is shown below.
All categories besides “Relevant” were ultimately defined as irrelevant for the purposes of this analysis. While abstracts describing solar review articles or solar in space were hand-classified as irrelevant, their content overlap with relevant articles is so high that it is expected that most such abstracts were classified as relevant when the Bayesian regression models were applied.

Solar Codebook:

Category: Relevant
Code: Y
Description: Describes a new development in solar technology. Any effort towards improving solar electricity technology, including site assessment, is to be included (unless that electricity is only to be used in space).
Inclusion Criteria: Main purpose of article is to describe development of technology which will aid in producing solar electricity. Relevant aspects such as site selection are to be included, as are descriptions of solar collector installations. Descriptions of the amount of sunlight reaching a given location on the earth's surface, presuming they appear relevant to solar site selection, are included.
Exclusion Criteria: Does not focus on solar technology or focuses on describing existing technology (review article) rather than on new research. Descriptions of the amount of light emitted by the sun, without reference to where it hits the earth, e.g. discussions of the sun's corona or flares, are excluded.
Expected Frequency: Very common (potentially half the articles)

Category: Marginally Relevant
Code: ?
Description: Topical focus is relevant to solar technological development, but is not focused on electricity or does not describe new technology. Includes all use of solar for non-electricity purposes (e.g. for driving pumps or water heating, even if it is electricity from PV which drives those processes). Also includes use of solar in space.
Inclusion Criteria: Main purpose of article is not to describe development of solar electricity technology, but does relate to either solar-driven processes (but not electricity), or to solar electricity (but no new research presented).
Exclusion Criteria: Describes solar electricity technology or is entirely irrelevant to the use of the sun.
Expected Frequency: Marginally common

Category: Irrelevant
Code: N
Description: Does not describe a new development in solar technology. Mutually exclusive with relevant and marginally relevant categories.
Inclusion Criteria: Main purpose of article is not to describe development of technology which will aid in producing solar electricity. Articles which mention solar technology but focus on other topics fall into this category.
Exclusion Criteria: Main focus of article is relevant to solar technology.
Expected Frequency: Very common (potentially half the articles)

Wind Codebook:

Category: Relevant
Code: Y
Description: Describes a new development in or is explicitly relevant to wind electricity technology.
Inclusion Criteria: Main purpose of article is to describe technology for producing electricity using wind. The article may focus on any aspect of the wind energy production process, from the availability of wind clearly for this purpose through the connection of a windmill to an electric grid and the resulting effects on the grid. Articles on the potential for wind energy to be productive, cost-effective, reliable, etc. are also included here.
Exclusion Criteria: Focus is not on electricity production or not on wind as the source of the energy for that electricity. For further details, see other categories’ descriptions below.
Expected Frequency: Common

Category: Hybrid
Code: H
Description: Article discusses wind energy but also includes at least as great an emphasis on solar, geothermal, diesel or other energy sources.
Inclusion Criteria: Describes the use of wind and at least one other energy source. The other energy source(s) may be renewable, such as solar or geothermal, or non-renewable such as diesel. Both wind and the other source must receive substantial attention in the article, not just be mentioned in passing. The article may or may not explicitly discuss connection of the energy sources into one power system. For instance, the article may compare the potential of different energy sources. As for the ‘relevant’ category, articles in this category may focus on any stage in the electricity production process or aspects of the desirability of such power production. Note that hydrogen is a way of storing output, not an energy source.
Exclusion Criteria: Does not include substantial treatment of both wind and one or more other energy sources. If wind is mentioned only for the purposes of comparison to or description of non-wind topics, the article is not included in this category. For instance, articles focusing on ocean thermal energy using turbine designs based on wind turbines are not included here.
Expected Frequency: Marginally common

Category: Non-Electric
Code: NE
Description: Article discusses the use of wind energy for specific end purposes other than electricity production. Example purposes include desalination, heat and pumping water.
Inclusion Criteria: Similar to ‘relevant,’ except that end product is specifically identified and is not electricity itself. Electricity may be an intermediary product. If multiple energy sources are described, as in the ‘hybrid’ category, and they are used for a non-electricity end product, the article is classified as ‘non-electric.’ Note that for the purposes of this codebook, batteries and hydrogen are considered storage, not non-electric outputs.
Exclusion Criteria: Fails to focus on wind as an energy source and on an output other than electricity.
Expected Frequency: Marginally common

Category: Irrelevant
Code: N
Description: Does not describe a new development in wind technology. Common subjects of irrelevant articles include meteorology, bridge design, engines and aircraft.
Inclusion Criteria: Major purpose of article is not to describe development of technology which will aid in producing energy using the wind.
Exclusion Criteria: Main focus of article is relevant to wind technology.
Expected Frequency: Most common

Appendix C. Coefficients from Bayesian Logistic Models of Relevance

Bayesian logistic regression was used to construct a model of an abstract’s probability of describing solar energy research, based on the words in the abstract. A similar model was constructed for relevance to wind energy research. The words used were reduced to their “stems,” e.g. “polym” for “polymer” and related words, before the models were constructed. Below, the words with the highest and lowest coefficients in each model are listed, with the horizontal axis showing their coefficients in the regression models.

[Figures: words with the highest and lowest coefficients in the solar and wind models; horizontal axis shows the regression coefficient.]

Appendix D. Sample Abstracts

This appendix presents a random sample of abstracts predicted to be relevant and irrelevant, selected from each of the two databases using the keyword “solar” or “wind.” Misclassified abstracts are shown in grey. For solar, the few abstracts among the sample of ten predicted to be relevant which are not obviously relevant from the titles do turn out to be relevant upon investigation; that is, all ten in the sample are correctly classified. Similarly, all ten predicted irrelevant to solar are clearly irrelevant. For wind, the picture is more complicated. While all sampled predicted-irrelevant wind abstracts appear irrelevant, the predicted-relevant sample includes two irrelevant abstracts: one on helicopter rotors, whose terms are probably reminiscent of the rotors of a wind turbine, and one on magnet fabrication, which similarly may share terminology with wind turbine engines. The abstract titled “Electric Energy Generator” describes an unconventional wind generator.
Random Sample of Relevant and Irrelevant Abstracts for Solar Predicted Prob‐ Title Relevance ability Engineering 1 0.856 Novel two dimensional Village transmission‐line collection systems for photovoltaic power Engineering 1 1.000 Photocurrent of 1 eV Village GaInNAs lattice‐matched to GaAs Database Engineering Village Engineering Village Engineering Village Web of Science 1 1.000 Cd1‐xHgxS thin film electrodes: An electrochemical solar cell approach 1 0.870 Computer simulation of the optical properties of high‐ temperature cermet solar selective coatings 1 1.000 Autodiffusion: A novel method for emitter formation in crystalline silicon thin‐film solar cells 1 0.958 Fundamentals of exergy analysis, entropy generation minimization, and the generation of flow architecture 54 Journal Renewable Energy Authors Month Kuo, M.Y.; Kuo, Ch.Ch.1; Kuo, M.Sh. Oct 1995 Journal of Geisz, J.F.1; Crystal Growth Friedman, D.J.1; Olson, J.M.; Kurtz, S.R.1; Keyes, B.M. International Garadkar, K.M.; Journal of Hankare, P.P. Electronics Dec 1998 Nov 1999 Solar Energy Nejati, M. Reza; Fathollahi, V.; Asadi, M. Khalaji Feb 2005 Progress in Photovoltaics: Research and Applications Wolf, A.; Terheiden, B.; Brendel, R. 
May 2007 International Journal of Energy Research Bejan, A Jun 2002 Web of Science Web of Science 1 0.993 Effect of Ga content on defect states in CuIn1‐ xGaxSe2 photovoltaic devices 1 0.970 Growth and transport properties of CuInSe2/ZnO heterostructure solar cell Applied Physics Heath, JT Cohen, JD Jun Letters Shafarman, WN 2002 Liao, DX Rockett, AA Materials Science and Engineering B‐ Solid State Materials for Advanced Technology 1 0.992 Three‐channel transmission Journal of line impedance model for Physical mesoscopic oxide Chemistry B electrodes functionalized with a conductive coating Dhananjay Nagaraju, J Krupanidhi, SB Feb 2006 Bisquert, J Gratzel, M Wang, Q Fabregat‐Santiago, F Jun 2006 1 0.998 In situ deposition of Thin Solid Films cadmium chloride films using MOCVD for CdTe solar cells 0 0.000 Evidence for nonmigrating Geophysical thermal tides in the Mars Research upper atmosphere from the Letters Mars Global Surveyor Accelerometer Experiment Barrioz, V Irvine, SJC May Jones, EW 2007 Rowlands, RL Lamb, DA Wilson, R. John Apr 2002 Engineering Village 0 0.000 Anomalous F region Radio Science response to moderate solar flares Engineering Village 0 0.000 Scaling of electric field fluctuations associated with the aurora during northward IMF 0 0.000 Near‐infrared mapping and physical properties of the dwarf‐planet Ceres Smithtro, C.G.; Sojka, Oct J.J.; Berkey, T.; 2006 Thompson, D.; Schunk, R.W. Kozelov, B.V.; Oct Golovchanskaya, I.V. 2006 Web of Science Web of Science Engineering Village Engineering Village Engineering Village Web of Science Geophysical Research Letters Astronomy and Carry, B.; Dumas, C.; Jan Astrophysics Fulchignoni, M.; 2008 Merline, W.J.; Berthier, J.; Hestroffer, D.; Fusco, T.; Tamblyn, P. 0 0.000 Some aspects of the Journal of Breus, T.K.; Feb biological effects of space Atmospheric Ozheredov, V.A.; 2008 weather and Solar‐ Syutkina, E.V.; Terrestrial Rogoza, A.N. 
Physics 0 0.002 New measurement of B‐8 Nuclear Physics Motobayashi, T Apr Coulomb dissociation and E2 A 1997 component 55 Web of Science 0 0.007 Selective dolomitization of Cambrian microbial carbonate deposits: A key to mechanisms and environments of origin Web of Science 0 0.000 Study of the magnetic turbulence in a corotating interaction region in the interplanetary medium Web of Science Web of Science Palaios Annales Geophysicae‐ Atmospheres Hydrospheres and Space Sciences 0 0.000 UV index climatology over Journal of the United States and Geophysical Canada from ground‐based Research‐ and satellite estimates Atmospheres 0 0.002 Salegentibacter salinarum sp International nov., isolated from a marine Journal of solar saltern Systematic and Evolutionary Microbiology 56 Glumac, B Walker, KR Apr 1997 Valdes‐Galicia, JF Caballero, RA Nov 1999 Fioletov, VE Kimlin, MG Krotkov, N McArthur, LJB Kerr, JB Wardle, DI Herman, JR Meltzer, R Mathews, TW Kaurola, J Yoon, JH Lee, MH Kang, SJ Oh, TK Nov 2004 Feb 2008 Random Sample of Relevant and Irrelevant Abstracts for Wind Predicted Prob‐ Title Relevance ability Engineering 1 0.968 Prediction of tail rotor Village thrust and yaw control effectiveness Database Journal Journal of the American Helicopter Society Oct 1995 IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control Chen, Chih‐Ta; Islam, Rashed Adnan; Priya, Shashank Mar 2006 Energy Policy Gamboa, Gonzalo; Munda, Giuseppe Mar 2007 International Journal of Energy Research Power Engineering Journal TCE Ulgen, K Hepbasli, A May 2002 Kirby, NM Xu, L Luckett, M Siepmann, W Cleugh, A Jun 2002 1 0.664 Fabrication of pulsed Physica B: magnets with a linear‐type Condensed coil‐winding machine Matter Engineering Village 1 0.999 Gain scheduling control of variable‐speed wind energy conversion systems using quasi‐LPV models 1 0.537 Electric energy generator Engineering Village Web of Science Web of Science 1 0.979 The problem of windfarm location: A social multi‐ 
criteria evaluation framework 1 0.752 Determination of Weibull parameters for wind energy analysis of Izmir, Turkey 1 0.921 HVDC transmission for large offshore wind farms Month Srinivas, Venkataraman; Chopra, Inderjit; Haas, David; McCool, Kelly Suzuki, O.; Sakamoto, K.; Imanaka, Y.; Kido, G. Bianchi, F.D.; Mantz, R.J.; Christiansen, C.F. Engineering Village Engineering Village Authors Control Engineering Practice Jan 2001 Feb 2005 Web of Science Web of Science Web of Science 1 0.627 Wind worries 1 0.984 Turbines for wind power Mechanical Schuerger, R Engineering American [Anon] Ceramic Society Bulletin Engineering Village 0 0.012 Destabilization and localization of traveling waves by an advected field 0 0.000 PE‐modified A‐10C Thunderbolt makes successful maiden flight Physica D: Roxin, A.; Riecke, H. Aug Nonlinear 2001 Phenomena Jane's [Anon] Feb International 2005 Defense Review Engineering Village 1 0.889 Fission vs. wind 57 Apr 2006 Aug 2006 May 2007 Engineering Village 0 0.000 Snow in Lebanon: A Hydrological preliminary study of snow Sciences cover over Mount Lebanon Journal and a simple snowmelt model Engineering Village 0 0.000 Optical fibre sensors for remote monitoring of tunnel Displacements ‐ Prototype tests in the laboratory Engineering Village 0 0.004 Weather driven influences on phytoplankton succession in a shallow lake during contrasting years: Application of PROTBAS 0 0.000 The relationship between great lakes water levels, wave energies, and shoreline damage Web of Science Web of Science Web of Science Web of Science Web of Science Aouad‐Rizk, Ang; Jun Job, Jean‐Olivier; 2005 Khalil, Selim; Touma, Tarek; Bitar, Chadi; Bocquillon, Claude; Najem, Wajdi Tunnelling and Metje, Nicole; Jul Underground Chapman, David N.; 2006 Space Rogers, Christopher Technology D. F.; Henderson, Philip; Beth, Martin Ecological Modelling Markensten, Hampus; Pierson, Donald C. 
Bulletin of the American Meteorological Society Meadows, GA Apr Meadows, LA 1997 Wood, WL Hubertz, JM Perlin, M Dyer, JM Baird, PR Apr 1997 Oct 2007 0 0.004 Wind disturbance in remnant forest stands along the prairie‐forest ecotone, Minnesota, USA 0 0.010 Wind‐induced dynamic response and resultant load estimation of a circular flat roof Plant Ecology Journal of Wind Engineering and Industrial Aerodynamics Uematsu, Y Watanabe, K Sasaki, A Yamada, M Hongo, T Nov 1999 0 0.000 Solar radar astronomy with the low‐frequency array 0 0.003 Diversity and biogeography of testate amoebae Planetary and Space Science Rodriguez, P Dec 2004 Biodiversity and Conservation Smith, HG Bobrov, Feb A Lara, E 2008 58 Appendix E. Identifying Time from Submission to Publication Identifying average time from submission to publication for relevant articles is somewhat problematic. Many commentators have discussed the importance of publication time, and articles published in recent years typically state when they were submitted and accepted to the journal as well as their publication date. However, few have collected publication times for more than a handful of years or journals. Some of the most useful publication rates were published in Scientometrics. In particular, Luwel and Moed collected average publication times for a variety of technical journals using more than 250 articles from each journal, collected from volumes as recent as possible to Luwel and Moed’s publication in 1998. Because they cover many journals which are well represented in my data set and were collected at a time in the middle of my date range, these appear to be the most useful already‐ collected publication times. Luwel and Moed selected journals which researchers described to them as important and having long publication times, so their average publication time may be biased upwards. To find a publication time for the present purposes, journals used by Luwel and Moed were reviewed. 
Two journals were discarded for not containing more than one solar energy article published in the time period considered and two for having titles in other languages. Luwel and Moed’s reported publication times for the remaining journals were then averaged, each weighted by the sum of its numbers of solar articles in Web of Science and Engineering Village in the present study. This gives a publication time of 6.7 months. For use with my data, most of which occurs in monthly time steps, this is rounded to seven months. The variation across journals in average publication times, from two to nearly seventeen months, suggests that this number should be considered very approximate.

Table 1. Journal Publication Times

I use Luwel and Moed’s (1998) publication time results for selected journals to calculate an “average” publication time to use as p. Publication time is weighted by the number of solar articles from these journals appearing in the Web of Science and Engineering Village samples and then averaged. Rounding up to the nearest integer, I get seven months as the value for p.

| Journal | Articles on Solar in Web of Science | Articles on Solar in Engineering Village | Publication Time (months) |
| Electronics Letters | 21 | 31 | 2 |
| Applied Physics Letters | 1317 | 889 | 4.9 |
| Journal of the American Chemical Society | 238 | 126 | 5.1 |
| Journal of Chemical Physics | 67 | 30 | 6 |
| Applied Surface Science | 269 | 192 | 6.2 |
| Journal of Applied Physics | 1254 | 665 | 7 |
| Optical Engineering | 8 | 12 | 8.5 |
| Astrophysical Journal | 7 | 0 | 9.5 |
| Applied Optics | 138 | 88 | 10 |
| IEEE Transactions on Electron Devices | 270 | 228 | 10.8 |
| Chemical Engineering Science | 8 | 6 | 11 |
| Solid-State Electronics | 198 | 150 | 12 |
| IEEE Transactions on Automatic Control | 1 | 0 | 16.6 |
| Weighted Mean | | | 6.7 |

Besides the work by Luwel and Moed, the most relevant reported publication times found were for articles published in early 1990 in the Journal of the American Chemical Society and Physical Review B (Abt 1992).
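The weighted mean reported in Table 1 can be recomputed directly from the tabulated counts. The following is a minimal sketch for checking that arithmetic, not the author's code: the names `JOURNALS` and `weighted_mean_publication_time` are introduced here for illustration, and the numbers are transcribed from Table 1.

```python
# Rows: (journal, solar articles in Web of Science,
#        solar articles in Engineering Village, publication time in months).
# Values transcribed from Table 1; names are illustrative assumptions.
JOURNALS = [
    ("Electronics Letters", 21, 31, 2.0),
    ("Applied Physics Letters", 1317, 889, 4.9),
    ("Journal of the American Chemical Society", 238, 126, 5.1),
    ("Journal of Chemical Physics", 67, 30, 6.0),
    ("Applied Surface Science", 269, 192, 6.2),
    ("Journal of Applied Physics", 1254, 665, 7.0),
    ("Optical Engineering", 8, 12, 8.5),
    ("Astrophysical Journal", 7, 0, 9.5),
    ("Applied Optics", 138, 88, 10.0),
    ("IEEE Transactions on Electron Devices", 270, 228, 10.8),
    ("Chemical Engineering Science", 8, 6, 11.0),
    ("Solid-State Electronics", 198, 150, 12.0),
    ("IEEE Transactions on Automatic Control", 1, 0, 16.6),
]


def weighted_mean_publication_time(rows):
    """Mean publication time, weighting each journal by its combined
    solar article count across the two databases."""
    total_weight = sum(wos + ev for _, wos, ev, _ in rows)
    weighted_sum = sum((wos + ev) * months for _, wos, ev, months in rows)
    return weighted_sum / total_weight


p_exact = weighted_mean_publication_time(JOURNALS)
p_months = round(p_exact)  # whole months, for use with monthly time steps
print(round(p_exact, 1), p_months)
```

Because Applied Physics Letters and the Journal of Applied Physics together contribute roughly two thirds of the weight, the result is driven mainly by those two journals' publication times.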
Abt selected these two journals for their generality and high impact factors, using them to represent the fields of chemistry and physics, respectively. As reported by Abt, the Journal of the American Chemical Society takes 7.7 months to publish an article, compared to Luwel and Moed’s 5.1 months. Physical Review B was found to publish articles in 6.4 months and was not included in Luwel and Moed’s analysis. Abt’s publication times are not statistically different from Luwel and Moed’s 6.7 months, which has a standard deviation across journals of 3.8 months. Thus, Abt’s results are consistent with the estimate of seven months using Luwel and Moed’s work.

Unlike Luwel and Moed, Abt reports standard deviations within journals, which are 3.1 months for the Journal of the American Chemical Society and 2.3 months for Physical Review B. These deviations suggest that publication time variation within journals, not just between journals, is substantial. Average publication time may also vary across years. My model simplifies reality by assuming away these sources of variation. To avoid these assumptions would require an even more complex model form, introducing more alternative assumptions.

Appendix F. Bootstrapping Methods

Article counts were bootstrapped and submitted to the regression analysis to produce standard errors. In part because this analysis involves multiple steps and time series, there are several plausible ways in which the results could be bootstrapped. The selected method focuses on randomizing the initial selection of articles. The analysis was performed separately for solar and for wind abstracts from Engineering Village and from Web of Science. Bootstrapping proceeded as follows:

1. Let $N$ be the total number of unclassified abstracts.
2. Let the probability of each abstract being relevant be the same value which has already been predicted for it.
3. Repeat 1,000 times:
   a. Select $N$ abstracts, with replacement, from the set of unclassified abstracts.
   b. Sum the predicted probabilities of these abstracts by month, thereby producing a set of monthly article counts.
   c. Run the regression models using this set of monthly article counts as the dependent variable. As part of this process, use the new set of monthly article counts to calculate the lag of article counts used as a regressor variable.
4. For each variable in each regression model, calculate the standard error of the 1,000 bootstrapped values for it.
5. Use these bootstrapped standard errors to calculate the ranges and statistical significance reported in the results tables.

Appendix G. Regression Results

The results of Poisson regressions of article counts with subsidy and other variables are shown in tables below. Highlighted cells and double asterisks identify statistically significant results at the 0.25% level, and single asterisks at the 5% level, for each variable. The 0.25% level is chosen in order to guarantee 5% significance across all ten combinations of policy variables and time lags for solar, or for wind. The regressions were run separately for four sets of article counts (solar articles from the Engineering Village database, solar from Web of Science, wind from Engineering Village and wind from Web of Science) following the procedures described above.

Values shown in the tables are not raw coefficients but percentage increases resulting from a ten-unit change in the predictor variable. For research subsidies, the New Technology Credits and GDP, ten units is $10 million. For the lag of article counts, ten units is ten articles. For renewable energy consumption, ten units is ten million Btu and for both electricity prices and fossil fuel prices it is ten dollars per million Btu. While some of these units are difficult to compare with each other, they produce readily interpretable results for individual variables, including the two policy variables of interest, research subsidies and New Technology Credits.
Rows for these variables are identical to the results reported in Table VI. Additional results here may be of interest, such as the prediction that ten new research articles on solar energy in Engineering Village will result in an 18% increase in similar research after five years.

Percent changes are simply the results of increasing the relevant variable by ten units. That is, the baseline prediction for some set of regressor values $x_{1,t}, \dots, x_{K,t}$ at time $t$ is

$$\hat{y}_t = \exp\Big(\beta_0 + \sum_{k=1}^{K} \beta_k x_{k,t}\Big) \tag{12}$$

After a ten-unit increase in some regressor, say $x_{j,t}$, the prediction is

$$\hat{y}_t' = \exp\Big(\beta_0 + \sum_{k=1}^{K} \beta_k x_{k,t} + 10\,\beta_j\Big) \tag{13}$$

The percentage increase, then, is

$$\frac{\hat{y}_t' - \hat{y}_t}{\hat{y}_t} = \exp(10\,\beta_j) - 1 \tag{14}$$

Thus, the percent increase is simply the raw coefficient multiplied by the contemplated ten units and then exponentiated, minus one. Ranges based on standard errors are calculated similarly, using the coefficient plus or minus the standard error. The fact that these percentages do not depend on the starting number of articles makes percentage increase a logical statistic to report for the Poisson and related functional forms.
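The percent-change calculation described above, together with the Appendix F resampling that supplies the standard errors, can be sketched as follows. This is a minimal illustration under stated assumptions, not the dissertation's code: the function names, the `(month, probability)` data layout, and the `fit` callable (a stand-in for the Poisson regression, including recomputation of the lagged article counts) are all hypothetical, and the bootstrap standard error is assumed to be the sample standard deviation of the replicate coefficients.

```python
import math
import random
from collections import defaultdict
from statistics import stdev


def percent_change(beta, units=10):
    """Fractional change in the predicted count after a `units` increase
    in a regressor whose raw Poisson coefficient is `beta`."""
    return math.exp(units * beta) - 1.0


def percent_range(beta, se, units=10):
    """Range from the coefficient minus and plus one standard error."""
    return percent_change(beta - se, units), percent_change(beta + se, units)


def bootstrap_monthly_counts(abstracts, rng):
    """One replicate: draw N abstracts with replacement and sum their
    predicted relevance probabilities by month.

    `abstracts` is a list of (month, predicted_probability) pairs
    (an assumed layout, chosen for this sketch)."""
    counts = defaultdict(float)
    for month, prob in rng.choices(abstracts, k=len(abstracts)):
        counts[month] += prob
    return dict(counts)


def bootstrap_se(abstracts, fit, n_boot=1000, seed=0):
    """Standard deviation of one coefficient across bootstrap replicates.

    `fit` is a hypothetical callable mapping monthly counts to a single
    coefficient; in the study this role is played by the regression."""
    rng = random.Random(seed)
    values = [fit(bootstrap_monthly_counts(abstracts, rng)) for _ in range(n_boot)]
    return stdev(values)
```

For instance, a raw coefficient of `math.log(1.18) / 10` on the lag of article counts corresponds to a reported increase of about 18% for ten additional articles, whatever the baseline count.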
Predicted Percent Changes in Solar Article Counts from Engineering Village After a 10-Unit Increase in an Independent Variable

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
| Constant | < 10^-5 ** | < 10^-5 ** | < 10^-5 | < 10^-5 | < 10^-5 ** |
| Lag of Article Counts | 1.8% (-0.30% ~ 3.9%) | 0.44% (-1.1% ~ 2.0%) | 2.2% (-0.094% ~ 4.6%) | 7.7% (4.8% ~ 11%)* | 18% (13% ~ 23%)** |
| Solar Research Subsidies | -0.35% (-0.61% ~ -0.093%) | 1.4% (1.2% ~ 1.7%)** | 2.0% (0.79% ~ 3.3%) | 8.5% (6.9% ~ 10%)** | -3.9% (-5.6% ~ -2.2%) |
| New Technology Credits | 0.60% (0.54% ~ 0.67%)** | 0.43% (0.36% ~ 0.51%)** | 0.55% (0.43% ~ 0.68%)** | -0.12% (-0.27% ~ 0.033%) | -0.62% (-0.90% ~ -0.34%) |
| Renewable Energy Consumption | -1.0% (-1.2% ~ -0.79%) | 0.45% (0.23% ~ 0.66%) | 0.47% (0.23% ~ 0.71%) | 0.33% (0.11% ~ 0.55%) | 1.6% (1.3% ~ 1.8%)** |
| Electricity Prices | 79% (46% ~ 120%)* | -89% (-91% ~ -86%) | -62% (-69% ~ -52%) | -82% (-86% ~ -76%) | -75% (-82% ~ -66%) |
| Fossil Fuel Prices | 60% (11% ~ 130%) | 850% (580% ~ 1200%)** | -35% (-54% ~ -8.4%) | 150% (64% ~ 270%) | -65% (-80% ~ -39%) |
| GDP | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** |
| Percent of Deviance Explained | 62% | 61% | 53% | 52% | 44% |
| Deviance Explained by Research Subsidies | 0.1% | 0.9% | 0.1% | 1.0% | 0.3% |
| Deviance Explained by New Technology Credits | 2.4% | 1.0% | 0.8% | 0.0% | 0.6% |

Predicted Percent Changes in Solar Article Counts from Web of Science After a 10-Unit Increase in an Independent Variable

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
| Constant | < 10^-5 | < 10^-5 ** | < 10^-5 * | < 10^-5 ** | < 10^-5 ** |
| Lag of Article Counts | 4.0% (3.3% ~ 4.7%)** | 6.2% (5.3% ~ 7.1%)** | 8.8% (7.2% ~ 10%)** | 15% (11% ~ 18%)** | 48% (40% ~ 56%)** |
| Solar Research Subsidies | -1.1% (-1.4% ~ -0.74%) | -0.85% (-1.3% ~ -0.43%) | 2.4% (1.1% ~ 3.8%) | -0.18% (-1.8% ~ 1.5%) | 3.6% (1.5% ~ 5.8%) |
| New Technology Credits | 0.35% (0.28% ~ 0.42%)** | 0.077% (0.0055% ~ 0.15%) | 0.24% (0.12% ~ 0.35%) | 0.28% (0.099% ~ 0.46%) | -0.54% (-0.90% ~ -0.18%) |
| Renewable Energy Consumption | -0.40% (-0.58% ~ -0.22%) | 1.1% (0.92% ~ 1.3%)** | 1.4% (1.2% ~ 1.6%)** | 0.97% (0.74% ~ 1.2%)** | 0.28% (-0.069% ~ 0.64%) |
| Electricity Prices | 260% (200% ~ 330%)** | -80% (-83% ~ -75%) | -74% (-79% ~ -68%) | -74% (-80% ~ -66%) | -72% (-80% ~ -61%) |
| Fossil Fuel Prices | 16% (-8.7% ~ 48%) | 880% (670% ~ 1100%)** | 140% (83% ~ 210%)** | 43% (1.0% ~ 100%) | 200% (47% ~ 490%) |
| GDP | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** | < 10^-5 | < 10^-5 |
| Percent of Deviance Explained | 85% | 84% | 81% | 80% | 75% |
| Deviance Explained by Research Subsidies | 0.2% | 0.1% | 0.1% | 0.0% | 0.2% |
| Deviance Explained by New Technology Credits | 0.5% | 0.0% | 0.1% | 0.2% | 0.3% |

Predicted Percent Changes in Wind Article Counts from Engineering Village After a 10-Unit Increase in an Independent Variable

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
| Constant | < 10^-5 ** | < 10^-5 | < 10^-5 | < 10^-5 | < 10^-5 |
| Lag of Article Counts | 49% (36% ~ 62%)** | 42% (30% ~ 55%)** | 14% (6.9% ~ 22%) | -4.9% (-12% ~ 3.4%) | -25% (-38% ~ -9.4%) |
| Wind Research Subsidies | 4.6% (1.4% ~ 8%) | 14% (10% ~ 18%)** | 19% (16% ~ 22%)** | 26% (20% ~ 33%)** | 5.1% (-1.1% ~ 12%) |
| New Technology Credits | 0.14% (-0.012% ~ 0.29%) | -0.19% (-0.34% ~ -0.033%) | 0.66% (0.42% ~ 0.89%)* | -0.085% (-0.34% ~ 0.17%) | 0.026% (-0.56% ~ 0.62%) |
| Renewable Energy Consumption | -1.7% (-2.1% ~ -1.3%) | -0.42% (-0.82% ~ -0.018%) | -0.19% (-0.60% ~ 0.22%) | -0.25% (-0.64% ~ 0.15%) | -0.45% (-0.88% ~ -0.019%) |
| Electricity Prices | -65% (-76% ~ -49%) | -78% (-85% ~ -67%) | -14% (-42% ~ 30%) | -74% (-83% ~ -61%) | -30% (-55% ~ 8.2%) |
| Fossil Fuel Prices | 120% (18% ~ 320%) | 73% (-18% ~ 270%) | -88% (-93% ~ -79%) | -0.75% (-48% ~ 88%) | 120% (-8.4% ~ 420%) |
| GDP | < 10^-5 | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** |
| Percent of Deviance Explained | 67% | 68% | 68% | 64% | 60% |
| Deviance Explained by Research Subsidies | 0.2% | 2.1% | 3.6% | 1.5% | 0.1% |
| Deviance Explained by Tax Credit | 0.1% | 0.1% | 0.6% | 0.0098% | 0.0004% |

Predicted Percent Changes in Wind Article Counts from Web of Science After a 10-Unit Increase in an Independent Variable

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
| Constant | < 10^-5 ** | < 10^-5 | < 10^-5 | < 10^-5 | < 10^-5 |
| Lag of Article Counts | 17% (14% ~ 20%)** | 17% (13% ~ 22%)** | 21% (13% ~ 30%)* | 76% (58% ~ 95%)** | 27% (3.0% ~ 55%) |
| Wind Research Subsidies | -13% (-15% ~ -11%) | -6.0% (-8.1% ~ -3.9%) | -3.3% (-6.0% ~ -0.44%) | 16% (9.2% ~ 23%) | 33% (27% ~ 39%)** |
| New Technology Credits | 0.46% (0.33% ~ 0.59%)** | 0.13% (-0.0024% ~ 0.26%) | 0.21% (0.020% ~ 0.41%) | 0.075% (-0.17% ~ 0.32%) | 0.35% (-0.070% ~ 0.76%) |
| Renewable Energy Consumption | -1.4% (-1.7% ~ -1.0%) | 1.2% (0.84% ~ 1.6%)** | 0.93% (0.53% ~ 1.3%) | 0.52% (0.054% ~ 0.98%) | -0.31% (-0.71% ~ 0.10%) |
| Electricity Prices | 150% (79% ~ 260%)* | -60% (-73% ~ -40%) | -77% (-84% ~ -66%) | -84% (-90% ~ -75%) | -60% (-75% ~ -37%) |
| Fossil Fuel Prices | 250% (120% ~ 450%)* | 360% (170% ~ 670%)* | 180% (68% ~ 370%) | 170% (33% ~ 450%) | 140% (7.0% ~ 430%) |
| GDP | < 10^-5 | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** |
| Percent of Deviance Explained | 72% | 72% | 72% | 72% | 66% |
| Deviance Explained by Research Subsidies | 2.3% | 0.5% | 0.1% | 0.4% | 2.2% |
| Deviance Explained by Tax Credit | 0.8% | 0.1% | 0.1% | 0.01% | 0.1% |

References

Archambault, Éric; Julie Caruso; Grégoire Côté and Vincent Larivière. 2009. "Bibliometric Analysis of Leading Countries in Energy Research," B. Larsen and J. Leta, Proceedings of the 12th International Conference of the International Society for Scientometrics and Informetrics (ISSI) – Peer-Reviewed Conference. Rio de Janeiro: BIREME/PAHO/WHO and Federal University of Rio de Janeiro, 80-91.
Barradale, Merrill Jones. 2010. "Impact of Public Policy Uncertainty on Renewable Energy Investment: Wind Power and the Production Tax Credit." Energy Policy, 38(12), 7698-709.
Bird, Lori; Mark Bolinger; Troy Gagliano; Ryan Wiser; Matthew Brown and Brian Parsons. 2005. "Policies and Market Factors Driving Wind Power Development in the United States." Energy Policy, 33(11), 1397-407.
Boyack, Kevin W.; Jeffrey Y. Tsao; Ann Miksovic and Mark Huey. 2008. "International Trends in Solid-State Lighting: Analyses of the Article and Patent Literature." Sandia Laboratory, SAND2008-4564.
Bureau of Economic Analysis. August 2010. "Section 1 ‐ Domestic Product and Income, Table 1.1.5. Gross Domestic Product." National Economic Accounts All NIPA Tables, http://www.bea.gov/national/nipaweb/SelectTable.asp (accessed February 1, 2011).
Bureau of Labor Statistics. 2011. "Consumer Price Index ‐ All Urban Consumers." http://data.bls.gov/pdq/SurveyOutputServlet (accessed February 1, 2011).
Dalal, Siddhartha R.; Paul G. Shekelle; Susanne Hempel; Sydne J. Newberry; Aneesa Motala and Kanaka D. Shetty. 2012. "A Pilot Study Using Machine Learning and Domain Knowledge to Facilitate Comparative Effectiveness Review Updating." Medical Decision Making, 33(3), 343‐55.
Deerwester, Scott; Susan T. Dumais; George W. Furnas; Thomas K. Landauer and Richard Harshman. 1990. "Indexing by Latent Semantic Analysis." Journal of the American Society for Information Science, 41(6), 391‐407.
Elsevier. 2010. "About Engineering Village." http://www.engineeringvillage2.org/ (accessed May 15, 2010).
Energy Information Administration. 2008. "Federal Financial Interventions and Subsidies in Energy Markets 2007." SR/CNEAF/2008‐01.
Energy Information Administration. 2010. "Table 9.9 Average Retail Prices of Electricity and Table 10.1 Renewable Energy Production and Consumption by Source." Monthly Energy Review, http://www.eia.doe.gov (released March 31, 2010).
Eugenio, Barbara Di and Michael Glass. 2004. "The Kappa Statistic: A Second Look." Computational Linguistics, 30(1), 95‐101.
Fischer, Carolyn and Richard Newell. 2008. "Environmental and Technology Policies for Climate Mitigation." Journal of Environmental Economics and Management, 55, 142‐62.
Garg, K.C. and Praveen Sharma. 1991. "Solar Power Research: A Scientometric Study of World Literature." Scientometrics, 21(2), 147‐57.
Genkin, Alexander; David D. Lewis and David Madigan. 2007. "Large‐Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291‐304.
Genkin, Alexander; David D. Lewis and David Madigan. 2005. "BBR: Bayesian Logistic Regression Software." www.stat.rutgers.edu/~madigan/bbr/ (accessed May 10, 2010).
Godin, Benoît. 1996. "Research and the Practice of Publication in Industries." Research Policy, 25(4), 587‐606.
Goulder, Lawrence H. and Stephen H. Schneider. 1999. "Induced Technological Change and the Attractiveness of CO2 Abatement Policy." Resource and Energy Economics, 21(3‐4), 211‐53.
Griliches, Zvi. 1990. "Patent Statistics as Economic Indicators: A Survey." Journal of Economic Literature, 28(4), 1661‐707.
Hopkins, Daniel J. and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science, 54(1), 229‐47.
Jaffe, Adam; Richard Newell and Robert Stavins. 2005. "A Tale of Two Market Failures: Technology and Environmental Policy." Ecological Economics, 54, 165‐74.
Johnstone, Nick; Ivan Haščič and David Popp. 2009. "Renewable Energy Policies and Technological Innovation: Evidence Based on Patent Counts." Environmental and Resource Economics, 45(1), 133‐55.
Larsen, Katarina. 2008. "Knowledge Network Hubs and Measures of Research Impact, Science Structure, and Publication Output in Nanostructured Solar Cell Research." Scientometrics, 74(1), 123‐42.
Lewis, David. 1998. "Naïve (Bayes) at Forty: The Independence Assumption in Information Retrieval." Machine Learning: ECML‐98, 4‐15.
Lewis, Joanna and Ryan Wiser. 2005. "Fostering a Renewable Energy Technology Industry: An International Comparison of Wind Industry Policy Support Mechanisms." LBNL‐59116, http://www.escholarship.org/uc/item/6cf1r3z5 (accessed July 6, 2013).
Luwel, M. and H. Moed. 1998. "Publication Delays in the Science Field and Their Relationship to the Ageing of Scientific Literature." Scientometrics, 41(1), 29‐40.
Margolis, Robert and Daniel Kammen. 1999. "Evidence of Under‐Investment in Energy R&D in the United States and the Impact of Federal Policy." Energy Policy, 27, 575‐84.
Mason, R.; B. McInnis and S. Dalal. 2012. "Machine Learning for the Automatic Identification of Terrorist Incidents in Worldwide News Media," 2012 IEEE International Conference on Intelligence and Security Informatics (ISI). 84‐89.
Nemet, Gregory F. 2006. "Beyond the Learning Curve: Factors Influencing Cost Reductions in Photovoltaics." Energy Policy, 34(17), 3218‐32.
Nemet, Gregory F. 2009. "Demand‐Pull, Technology‐Push, and Government‐Led Incentives for Non‐Incremental Technical Change." Research Policy, 38(5), 700‐09.
Nemet, Gregory F. and Daniel M. Kammen. 2007. "U.S. Energy Research and Development: Declining Investment, Increasing Need, and the Feasibility of Expansion." Energy Policy, 35(1), 746‐55.
Newell, Richard G.; Adam B. Jaffe and Robert N. Stavins. 1999. "The Induced Innovation Hypothesis and Energy‐Saving Technological Change." The Quarterly Journal of Economics, 114(3), 941‐75.
Norberg‐Bohm, Vicki. 2000. "Creating Incentives for Environmentally Enhancing Technological Change: Lessons from 30 Years of U.S. Energy Technology Policy." Technological Forecasting and Social Change, 65(2), 125‐48.
Nordhaus, William D. 2002. "Modeling Induced Innovation in Climate Change Policy," N. N. A. Grubler and W. D. Nordhaus, Modeling Induced Innovation in Climate Change Policy. Resources for the Future Press, 259‐90.
OECD/IEA. 2010. "RD&D Budgets, Group III: Renewable Energy Sources, III.1 Total Solar Energy." Energy Technology RD&D 2010 Edition.
Office of Management and Budget. For each of Fiscal Years 1996‐2011. "Analytical Perspectives: Budget of the U.S. Government."
Palmer, Karen; Anthony Paul; Matt Woerman and Daniel C. Steinberg. 2011. "Federal Policies for Renewable Electricity: Impacts and Interactions." Energy Policy, 39(7), 3975‐91.
Pavitt, K. 1985. "Patent Statistics as Indicators of Innovative Activities: Possibilities and Problems." Scientometrics, 7(1), 77‐99.
Perry, Thomas D. IV; Mackay Miller; Lee Fleming; Kenneth Younge and James Newcomb. 2011. "Clean Energy Innovation: Sources of Technical and Commercial Breakthroughs." NREL/TP‐6A20‐50624.
Popp, David. 2002. "Induced Innovation and Energy Prices." American Economic Review, 92(1), 160‐80.
Popp, David. 2003. "Pollution Control Innovations and the Clean Air Act of 1990." Journal of Policy Analysis and Management, 22(4), 641‐60.
Ryan, G.W. and H.R. Bernard. 2000. "Data Management and Analysis Methods," N. Denzin and Y. Lincoln, Handbook of Qualitative Research, 2nd Ed. Thousand Oaks, CA: Sage Publications, 769‐802.
Singhal, Amit; Gerard Salton; Mandar Mitra and Chris Buckley. 1996. "Document Length Normalization." Information Processing & Management, 32(5), 619‐33.
Sinha, Bikramjit. 2011. "Trends in Global Solar Photovoltaic Research: Silicon Versus Non‐Silicon Materials." Current Science, 100(5).
Stephens, Jennie C.; Gabriel M. Rand and Leah L. Melnick. 2009. "Wind Energy in US Media: A Comparative State‐Level Analysis of a Critical Climate Change Mitigation Technology." Environmental Communication: A Journal of Nature and Culture, 3(2), 168‐90.
Thomson Reuters. 2010. "Web of Science (R) ‐ with Conference Proceedings." ISI Web of Knowledge [v.4.10], http://apps.isiknowledge.com/ (accessed November 10, 2010).
Tussen, R.; R. Buter and Th. van Leeuwen. 2000. "Technological Relevance of Science: An Assessment of Citation Linkages between Patents and Research Papers." Scientometrics, 47(2), 389‐412.
Vidican, Georgeta; Wei Lee Woon and Stuart Madnick. March 15, 2009. "Measuring Innovation Using Bibliometric Techniques: The Case of Solar Photovoltaic Industry." Working Paper CISL# 2009‐05, http://hdl.handle.net/1721.1/65941 (accessed July 6, 2013).
Ying, Guo; Huang Lu and A. L. Porter. 2009. "Profiling Research Patterns for a New and Emerging Science and Technology: Dye‐Sensitized Solar Cells," 2009 Atlanta Conference on Science and Innovation Policy. 1‐7.

Essay 2.
Using Articles Rather than Patents to Quantify Research Over Time: An Example Identifying Policy Effects on Wind and Solar Energy Research

Abstract

In any research field, the volume of research produced varies over time. Increased research funding, demand for the target of the research, or successful studies are often predicted to encourage subsequent innovation. Patents are sometimes used as a proxy for innovation in order to investigate such hypotheses. This essay proposes counting technical journal articles published over time as a potentially fruitful alternative to counting patents. An example examining both patents and articles in solar and wind energy research shows that article counts produced using either of two methods have significant associations with both direct and indirect subsidies, while patents are significantly associated only with direct subsidies. Since articles are far more numerous than patents, it appears that only articles are a fine enough measure to identify the small impacts of indirect subsidies. I conclude that article counts are a valuable measure of innovation that could be used more often to assess trends in field‐specific rates of research and the drivers of those trends. Further comments are given on designing effective methods of identifying articles.

Keywords

renewable energy, innovation, articles, patents

I. Introduction

Several measures have been used to quantify “innovation” over time. Among economists, one of the most popular is the number of patents produced in a given place or period of time. Patents have been used to study the effectiveness of firms’ R&D spending (Zvi Griliches, 1990, Jerry Hausman et al., 1984, Adam B. Jaffe, 1986), relative innovation speeds in different countries (Dora Marinova, 2008), technological areas (World Intellectual Property Organization, 2009) and economic sectors (Thomas D.
IV Perry et al., 2011) and more recently to investigate renewable energy policies’ impacts on solar, wind and other renewable energy innovation (F. Braun et al., 2010, Nick Johnstone et al., 2009 discussing public research funding). Patents on the topic of interest are either simply counted or first culled to include only those most often cited by other patents and thereby presumed to be most valuable (Gregory F. Nemet and Daniel M. Kammen, 2007). Patents are assumed to represent innovation because by definition they describe content that is not common knowledge and has not been patented before (K. Pavitt, 1985). Further, they are costly to acquire and thus should only be acquired if they are worth the cost. Empirical findings corroborate that the average patent is commercially valuable: survey results suggest that perhaps half of all patents have seen commercial use and their average economic value is substantially positive, although varying widely (Rossman and Sanders 1957 as cited and described as one of the most detailed such surveys by Zvi Griliches, 1990). Patents also have a strong relationship with R&D expenditures when both are aggregated across firms or industries, as many studies have shown (Schmookler 1966, Bound et al 1984 and others as described by Zvi Griliches, 1990). For specific technologies, cumulative patents over time can sometimes be seen to follow an S‐shaped curve of accelerating patenting as interest and knowledge in an area grow and then decelerating patenting as further new developments become more difficult and costly (Birgitte Andersen, 1999, Peter Mock and Stephan A. Schmid, 2009). Although aggregating into subject areas as large as wind or solar technology may obscure any existing S curves, these results suggest that patent counts are a meaningful measure of innovation. Journal article counts have typically been used to answer different kinds of questions about innovation. 
Article citation patterns have been studied since at least the 1920s (P. L. K. Gross and E. M. Gross, 1927), while more recently, high‐powered statistical methods have been used to assess collections of articles to discover what topics receive the most attention (Georgeta Vidican et al., March 15, 2009) or what types of organizations collaborate most often (Phech Colatat et al., 2009). Modern article counts are often used to assess and compare the performance of individual researchers (Rebecca Long et al., 2009), universities (Oguz Baskurt, 2011), and countries (Ryan Zelnio, 2012). When not concerned with performance assessment, researchers have counted articles in order to assess spatial or temporal patterns of research (Éric Archambault et al., 2009 for energy in general, K.C. Garg and Praveen Sharma, 1991, Bikramjit Sinha, 2011, Ali Uzun, 2002) and to try to predict future research directions or success (Kevin W. Boyack et al., 2008, G. Dawelbait et al., 2010, Yuya Kajikawa et al., 2008).

Article counts are expected to represent innovation for reasons similar to those for patents. Although the requirements for originality are weaker, many and perhaps most peer‐reviewed journals require that any article they publish be previously unpublished material. Because they take months if not years to publish, articles, like patents, represent an investment of effort. There is also evidence of connections between journal articles and industrial production. Mansfield’s industry survey indicates that research conducted at universities underlies many industrial products and processes, although the frequency of this varies dramatically by industry, from 27% of products and 29% of processes in pharmaceuticals depending on academic research for their timely development to only 4% of products and 2% of processes in the chemicals industry (Edwin Mansfield, 1995).
Like patent counts, article counts in specific subject areas sometimes show S‐curves reflecting the growth and stagnation of that subject area (Murat Bengisu and Ramzi Nekhili, 2006). Further, articles are authored not only by academics, as might be assumed, but also by government researchers and by employees of industry. For large firms, the highest rates of publication are found in chemistry, electronics and computing, the categories into which solar energy research is most likely to fall. In contrast, for firms in the sectors most likely to include wind energy‐related research, mechanics (the most likely wind research category) had among the lowest rates of publication (Benoît Godin, 1996). For firms in the airline industry, which also may be related to wind energy, results were mixed. Thus, publishing might play a larger role in the solar industry than in the wind industry.

Besides patent or article counts, several other measures are sometimes used to describe innovation which, while informative in other ways, do not reflect innovation per se. Research funding is sometimes considered a proxy for research output, although it is clearly an input measure (James J. Dooley, 2000). Cost of the product being innovated upon, e.g. dollars per kilowatt hour of solar energy (Gregory F. Nemet, 2006), is a much more informative measure (indeed, it captures a key aspect of what one would want innovation to accomplish) but leaves the innovation process itself as a black box. Patent counts and article counts are more direct measures of the quantity of innovation occurring.

Patent and article count studies often use the terms “innovation” or even “invention” loosely to describe what the counts they use represent (K. Pavitt, 1985). I will follow this trend for the purpose of consistency, although the more careful studies limit themselves to narrowly‐defined “invention” (Zvi Griliches, 1990) and the term “research output” may be most precise.
Ideally, for economic and policy questions such as the ones considered here, one would hope patent and article counts reflect the quantity of knowledge being generated that may prove commercially useful, as is loosely meant by “innovation.” For the reasons cited above, both patents and articles appear to bear close relationships with “innovation.” At the same time, neither purely captures the quantity of new useful ideas as one might wish it to do. Patent counts are influenced by external trends such as patent costs and patent enforcement (for a more thorough review of patent pros and cons, see K. Pavitt, 1985), and article counts by variation in the proclivity to publish and in journal audience interests. Relatively small data sizes, at least for patents, may have limited the research questions which analysts have attempted to use them to answer. Just as patents are particular to the patenting country or consortium, article analysis is limited by the languages used by the indexing databases and the journals they choose to include. Thus, patent and article quantities are influenced by many factors beyond the quantity of innovation occurring. Despite these vagaries, both patents and articles reflect innovation that their authors believed was worth the cost of patenting or publishing, and thus serve as meaningful proxies for the rate of innovation.

The relationship between patents and articles is also complex. Patents often cite articles (R. Tussen et al., 2000), while articles rarely cite patents (S. Bhattacharya and M. Meyer, 2003), but this imbalance may result as much from patent office citation requirements (R. Tussen, R. Buter and Th. van Leeuwen, 2000) as from the direction of knowledge flow.
Even when the same or closely related research findings are published in both patent and article form, which comes first may vary from case to case: for example, publicly funded research may be published in a journal article which others then build upon and patent, or researchers may avoid publishing an article on their work until it is protected by a patent. In theory, patents are likely to lag behind articles, because articles can contain theoretical content and incremental applied content not suitable for patents, reflecting earlier steps in the development of a concept. In practice, patents are counted by date of submission, since it is available [vi], while articles are counted by date of publication [vii], which biases the comparison in favor of patents appearing earlier. Whether patent counts will lag behind article counts or vice versa therefore remains an open question, and the two may yield complementary but distinct information about innovation because they describe differing aspects of the idea‐to‐product process.

Unlike patent counts, article counts are still seldom used to answer econometric questions about how innovation relates to economic conditions or public policies. Patent counts are not only better known but easier to collect by topic using the patent classification system. However, modern electronic citation databases make collecting articles a relatively easy process, and it is even cheap for the many researchers who already have access to such databases through their institutions. Articles are also more plentiful than patents, enabling more detailed analysis.

This study compares patent and article counts as measures of innovation, focusing on the example of U.S. policy effects on innovation in wind and solar energy. Article counts here are collected by an old and a new method and compared with each other and with patent counts.
When articles have been counted in the past, this has nearly always been done simply by finding articles whose abstract or bibliographic information includes keywords selected by the analyst (as examples on solar and wind energy, see Elias Sanz‐Casado et al., 2012, Bikramjit Sinha, 2011). Occasional studies have used more sophisticated iterative methods of selecting keywords (G. Dawelbait, T. Mezher, Woon Wei Lee and A. Henschel, 2010). Still, this approach unquestionably results in the inclusion of undesired articles which happen to include those keywords, as well as the exclusion of others which use synonymous language without including the necessary keywords. Whether this problem is major or minor has rarely been assessed by the studies which employ this approach. Here, using a random sample of potentially relevant articles, keyword‐based article selection is compared against the results of sorting articles by hand in order to answer this question.

[vi] Successful patents take 26 months or longer from submission to issuance, according to average times for mechanical and chemical categories: Popp, David; Ted Juhl and Daniel K. N. Johnson. 2004. "Time in Purgatory: Examining the Grant Lag for U.S. Patent Applications." Topics in Economic Analysis & Policy, 4(1).
[vii] Articles take 7 months from submission to publication, on average, for journals assessed by Luwel and Moed (1998) and that include solar energy articles, as calculated in Hlavka (2011): Hlavka, Eileen. 2011. "U.S. Subsidy Effects on Production of Solar Technology Research: Using a New Measure to Assess Policy Impacts," 2011 IEEE First Conference on Clean Energy and Technology (CET). 259‐64; Luwel, M. and H. Moed. 1998. "Publication Delays in the Science Field and Their Relationship to the Ageing of Scientific Literature." Scientometrics, 41(1), 29‐40.

It is also possible to use modeling to identify relevant articles, just as one can use it to analyze results.
Rather than selecting keywords a priori, such models use the words in the abstracts as the independent variables to predict an abstract’s relevance to the topic at hand, essentially selecting and weighting keywords automatically. In order to create the model, a subset of abstracts is sorted by hand and used to define relevant and irrelevant abstracts; the model is then applied to the remaining potentially relevant abstracts. Because they are built using hand‐sorted data, unlike simple keyword matching without hand‐sorting, these models’ effectiveness is subject to testing using that same data. What functional form is most effective at classifying texts or other data has received extensive coverage in the machine learning literature, owing to many applications in web searching and other non‐academic settings (Thorsten Joachims, 1998, Fabrizio Sebastiani, 2002), as well as genetic analysis (Xiaojing Yu et al., 2006). Logistic Bayesian models are a leading approach to such text classification (Alexander Genkin et al., 2007) and are the approach used and assessed here. The logistic Bayesian approach and resulting article counts are the same as in Essay 1 (also see Eileen Hlavka, 2011). Although such an approach could also be used to identify patents, it is not attempted here because in most cases patents are already well classified, and because using more traditional patent selection methods makes the current results more useful in interpreting past patent research. Wind energy is among the topics for which patents are not well classified, so future application of the logistic Bayesian approach to wind patents may be useful.

In this essay I compare the use of patents, keyword‐selected articles and logistic Bayesian model‐selected articles as measures of innovation in wind and solar energy.
While patent and article counts are highly correlated in the years considered, they follow somewhat different trajectories, suggesting they have different stories to tell about innovation. Both article count methods perform well when compared with hand‐sorted data and provide strikingly similar results, suggesting that at least for this research design, selecting keywords a priori or by modeling are roughly equally effective. Within the hand‐sorted data, keyword selection appears to favor recall and modeling appears to favor precision, while modeling yields somewhat larger article counts than keywords when applied to the full data set. These methodological comparisons, however, may themselves be biased in favor of keyword results because they only consider articles initially selected with the single keywords “solar” and “wind.” Keyword‐based results may have little generalizability, suggesting that modeling is a more reliable method for identifying articles.

Finally, as an example application of policy interest, I examine how two federal policy types have affected patent and article counts. Both article and patent counts are significantly associated with research subsidies, while the impact of production subsidies is only statistically significant for article counts. The connection between articles and production subsidies constitutes a new finding suggesting that price supports for renewable energy incentivize wind and solar innovation. This example demonstrates the usefulness of article counts to address economic questions of policy interest.

II. Comparing Three Approaches

This section discusses how solar and wind energy patents and articles were collected. Different methods were used for wind patents and solar patents; both were designed primarily to form a reasonable basis for comparison with articles.
Articles were collected using a carefully selected set of keywords and then separately using a logistic Bayesian model to select from a larger set of potentially relevant articles. These two methods yield similar but distinct results, particularly in terms of precision and recall measures, with modeling yielding more articles and appearing potentially more accurate overall. In all cases, article counts are more numerous than patents. Patents and articles are somewhat correlated but neither is sufficient to explain the other, suggesting that overlapping but different factors drive the creation of patents and articles.

a. Collecting Patents

Patents addressing solar and wind electricity production were collected from the U.S. Patent and Trademark Office database and sorted by month of application to form monthly patent counts. While patents filed from 1980 through 2009 were collected, only those from 1986‐2007 are used in the quantitative analyses. This time period was chosen because it is similar to that used for articles, excludes almost all months with no patents, and ends before the number of patents per month drops because some patents are still in processing.

The U.S. Patent Office categorizes all patents using an extensive classification system. For solar, I counted all patents within one of 96 patent classifications describing photovoltaic and concentrating solar energy. These classifications are described in Table I. Classes covering solar heat collectors were not included, because they are typically designed to produce heat as the end product rather than electricity. Patent classifications for light‐responsive sensors and for devices auxiliary to solar panels, such as batteries, also were not used; although advances in these areas may contribute to the success of solar electricity, it is not their primary purpose and therefore it was judged that including them would overly dilute the sample of what are here defined to be solar energy patents.
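The counting step described above can be illustrated with a minimal sketch. It assumes each patent record carries a class number, a whole-number subclass and an application date; the field names, the example records and the function names are hypothetical, and only a subset of the Table I ranges (those with whole-number subclasses) is listed. It is not the procedure actually used to build the data set.

```python
from collections import Counter
from datetime import date

# Subset of the Table I class/subclass ranges, restricted to rows with
# whole-number subclasses; the full set used in the essay is larger.
SOLAR_CLASS_RANGES = [
    (136, 243, 265),  # PV: solar cells
    (136, 291, 293),  # PV: apparatus using photoelectric cells
    (438, 57, 98),    # PV: manufacturing processes
    (126, 683, 700),  # CSP: concentrating reflectors and lenses
]


def is_solar(patent_class, subclass):
    """True if the (class, subclass) pair falls in one of the listed ranges."""
    return any(cls == patent_class and low <= subclass <= high
               for cls, low, high in SOLAR_CLASS_RANGES)


def monthly_patent_counts(patents, first_year=1986, last_year=2007):
    """Count solar patents per (year, month) of application within the window."""
    counts = Counter()
    for patent in patents:
        applied = patent["applied"]
        in_window = first_year <= applied.year <= last_year
        if in_window and is_solar(patent["cls"], patent["subclass"]):
            counts[(applied.year, applied.month)] += 1
    return counts


# Made-up records for illustration only; these are not real patents.
EXAMPLE_PATENTS = [
    {"cls": 136, "subclass": 244, "applied": date(1990, 3, 14)},
    {"cls": 136, "subclass": 250, "applied": date(1990, 3, 30)},
    {"cls": 438, "subclass": 60, "applied": date(1985, 6, 1)},   # before the window
    {"cls": 126, "subclass": 569, "applied": date(1995, 1, 9)},  # outside the listed ranges
]

counts = monthly_patent_counts(EXAMPLE_PATENTS)
```

Counting by application date rather than issue date is what makes the cutoff at 2007 necessary: later months would be undercounted while applications are still in processing.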
Solar patents were collected from a DVD containing all patents granted by the U.S. between 1975 and 2011 inclusive (U.S. Patent and Trademark Office Patent Technology Monitoring Team, 2012). The patent office plans to soon change its patent classification system to align with the European patent classification system, so later research may require different patent classes.

Table I. Solar Patent Classes
Patent classifications used to identify patents relevant to solar energy. “PV” refers to photovoltaic, “CSP” to concentrating solar power, and “G” to general categories where the class does not specify whether PV or CSP is used.

Technology Subtype | Description | Classification
PV | Solar cells | 136 243‐265
PV | Apparatus incorporating or using photoelectric cells, including in space or with electrical circuits specified | 136 291‐293
PV | Process for manufacturing a radiation responsive photovoltaic device of the semiconductor barrier layer type | 438 57‐98
CSP | Solar power plants including mechanical step (e.g. concentrating) | 60 641.8‐641.9, 641.11‐641.15
CSP | Solar concentrating reflectors and lenses | 126 683‐700
CSP | Solar thermoelectric generator | 136 206
G | Metal working for making solar energy device | 29 890.033
G | Solar systems | 323 906

Wind energy patents were collected by searching for specific keywords rather than by patent classification. This is because wind energy classifications present two problems: windmill blade patents, unlike solar and many other patent types, have not been reclassified for consistency over the years (United States Patent and Trademark Office, 2007); and several patent classifications (290 43 and 290 54) which contain many wind energy patents also contain many irrelevant patents (Margaret Taylor et al., 2006), namely blades which are designed for other purposes.
The two patent classes which appear to house most wind energy patents, 290 44 and 290 55, have been found to include only 54% of all relevant patents which Margaret Taylor et al. (2006) identified by searching for keywords and removing irrelevant patents by hand. Therefore, I identify wind patents not by classification but by searching patent abstracts for keywords, using the same search string which the relevant patent examiner has recommended to previous researchers: “("wind power" OR (wind AND turbine) OR windmill) OR (wind AND (rotor OR blade* OR generat*) AND (electric*))”, where “*” is a wildcard representing any number of characters (N. Ponomarenko as cited in Gregory F. Nemet, 2009, Margaret Taylor, Dorothy Thornton, Gregory Nemet and Michael Colvin, 2006). Although this search string appears likely to yield some patents which discuss engine winding or other irrelevancies, hopefully these are few, and it is used for ease of comparison with previous research. This search string was used to collect patents from the patent office website, www.uspto.gov (United States Patent and Trademark Office, 2012).

Each patent lists the date it was submitted to the patent office and the date it was issued. The solar and wind patents thus collected were counted by month of submission to form two time series, one for solar energy and one for wind energy. Submission date was used rather than issue date because the length of time for patents to be issued varies substantially and is not of interest here. Because patents appear in these sources only once issued, counts for recent years omit submissions that were still in processing. For this reason, only patents submitted during or before 2007 are used.

b. Collecting Articles by Keyword

The first method used to collect articles is keyword searching.
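Keyword searching of this kind amounts to evaluating a boolean predicate on each abstract. As a minimal sketch, the wind patent search string quoted above can be written out as follows. The function names are hypothetical, and the matching rules (case-insensitive, whole-word, with a trailing “*” matching any further word characters) are assumptions of this sketch; the matching rules of the patent office's own search tool may differ, for example through stemming.

```python
import re


def has_term(text, term):
    """True if `term` occurs in `text` as a whole word or phrase.

    Matching is case-insensitive. A trailing '*' is treated as a wildcard
    that matches any run of further word characters.
    """
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def matches_wind_patent_query(abstract):
    """Evaluate the quoted wind search string against one abstract."""
    def t(term):
        return has_term(abstract, term)

    # ("wind power" OR (wind AND turbine) OR windmill)
    first = t("wind power") or (t("wind") and t("turbine")) or t("windmill")
    # (wind AND (rotor OR blade* OR generat*) AND (electric*))
    second = (t("wind")
              and (t("rotor") or t("blade*") or t("generat*"))
              and t("electric*"))
    return first or second
```

Under these whole-word assumptions, an abstract about coil winding does not match, because “winding” is not the word “wind”. A search tool that stems words would behave differently, which is one way irrelevant results could enter a keyword-based collection.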
Articles were collected from two journal article databases by searching their article abstracts and bibliographic information for at least one appearance of a specified set of key terms, following the same process as a typical user of an article database or search engine. A set of solar articles and a set of wind articles were thereby constructed from each of the two databases, and each set was counted by month to yield four time series of monthly article counts.

The databases used were ISI Web of Science and Compendex Engineering Village II, which both consist primarily of peer‐reviewed academic journals. Web of Science is a larger database (over 10,000 journals) with broad coverage of the sciences and humanities, emphasizing the hard sciences. Its primary intended audience is academic (Thomson Reuters, 2010). Engineering Village, in contrast, includes roughly half as many journals (over 5,600) and focuses on more applied topics, especially chemical, electrical, mechanical, mining and civil engineering (Elsevier, 2010). Both databases include only English‐language abstracts. While many journals appear in both databases, each also includes many journals which are not in the other. In light of the differences between the databases, results from each database were kept separate in order to identify and demonstrate the effect that database selection (between these two) has on results.

For solar articles, the search string was “solar AND ("solar power*" OR electric* OR photovolt* OR PV* OR "solar cell*")”. These terms were chosen as likely to yield many relevant and few irrelevant articles, after reviewing the search strings used in several previous studies, shown in Tables IIA and IIB. For wind, previous work was similarly examined, and the search string selected was “wind AND ("wind energy*" OR "wind power*" OR turbine* OR windmill* OR blade*)”. Both search strings were also refined to their present form by spot‐checking results.

Tables IIA and IIB.
Keywords Previously Used for Identifying Articles and Patents
Previous researchers have used the following keyword searches to identify patents and articles discussing wind and solar energy, sometimes as parts of broader searches. Rows with source marked “Current” describe the keywords used in the present study.

Keywords Previously Used for Solar Energy
Source | Used for? | Keywords (combined with 'OR' unless otherwise noted) | Found within | Database | Date Range | Comments
Current | articles | "solar" AND at least one of ("solar power*", "electric*", "photovolt*", "PV*", "solar cell*") | abstract, citation info | Web of Science, Engineering Village | 1986‐2008 |
Sanz‐Casado et al 2012 | articles | ("solar energy*" OR "solar radiation" OR "solar cell*" OR "solar photovoltaic*" OR "solar power" OR "solar heat*" OR "solar plant*" OR "solar concentrate*" OR "solar thermal" OR "solar collect*" OR "solar technolog*") | Topic | Web of Science | 2005‐2009 |
Vidican et al 2009 | articles | "photovolt*", "solar cell", "solar PV", "solar energy", "solar generation" and "solar power" | keyword (i.e. abstract, citation info) | Web of Science | 1975‐2008 |
Murugesapandian 2011 | articles | "solar energy" | Topic | Web of Science | 1999‐2011 |
Sinha 2011 | articles | 'solar power' or 'solar energy' or 'solar cell' or 'solar photovolt' or 'solar PV' or 'solar cell material' | title, abstract, keyword | Scopus | 1981‐1988, 2001‐2008 |
Margolis and Kammen 1999 | patents | (oil or natural gas or coal or photovoltaic or hydroelectric or hydropower or nuclear or geothermal or solar or wind) and (electric* or energy or power or generat* or turbine) | Title | USPTO | 1976‐1997 | also tried by Taylor et al
WIPO | patents | (solarcell solar‐cell photovoltaic* ((solar photo* PV sun) adj (light cell batter* panel module*))) [for solar materials, cells and modules], (solarcell solar‐cell photovoltaic* solar photo* PV sun) and (control* invert* convert* conversion system mount* instal*) [for solar power systems], CSP concentrat* collect* trough dish tower sterling stirling [for solar thermal collectors], ((solar* sun*) and (heat* thermal accumulate* power generat* warm* boiler* building system house hot boiling)) [for solar thermal heating] | not specified | patent offices of U.S., Europe, China, Japan and Korea and World Intellectual Property Office | | searched within matches to international patent classes for each of the 4 types given in [brackets] in the search string description

Keywords Previously Used for Wind Energy
Source | Used for? | Keywords (combined with 'OR' unless otherwise noted) | Found within | Database | Date Range | Comments
Current | Articles | "wind" AND at least one of ("wind energy", "wind power", "turbine", "windmill", "blade") | abstract incl bibliographic info | Web of Science, Engineering Village | 1986‐2008 |
Current | Patents | ("wind power" OR (wind AND turbine) OR windmill) OR (wind AND (rotor OR blade* OR generat*) AND (electric*)) | Abstract | USPTO | 1986‐2007 | copied from Nemet
Sanz‐Casado et al 2012 | Articles | ("wind power" OR "wind turbine*" OR "wind energy*" OR "wind farm*" OR "wind generation" OR "wind systems") | Topic | Web of Science | 2005‐2009 | found 830 articles
Stephens 2009 | news articles | (wind energy, wind power, wind turbine, wind and renewables, windfarm, OR windmill) NOT (rain OR storm) | heading, lead paragraph | Lexis‐Nexis | January 1, 1990 to December 31, 2007 | irrelevant articles removed manually
Nemet 2009 | Patents | ("wind power" OR (wind AND turbine) OR windmill) OR (wind AND (rotor OR blade* OR generat*) AND (electric*)) | Abstract | USPTO | 1975‐2006 | terms recommended by patent examiner
Taylor et al 2006 | Patents | ("wind power" OR (wind AND turbine) OR "windmill") OR (wind AND (rotor OR blade* OR generat*) AND (electric*)) | Abstract | USPTO | | irrelevant articles (14.6%) removed manually; terms recommended by patent examiner
Margolis and Kammen 1999 | Patents | (oil or natural gas or coal or photovoltaic or hydroelectric or hydropower or nuclear or geothermal or solar or wind) and (electric* or energy or power or generat* or turbine) | Title | USPTO | 1976‐1997 | also tried by Taylor et al
WIPO | Patents | wind* turbin* | not specified | patent offices of U.S., Europe, China, Japan and Korea and World Intellectual Property Office | unlimited | searched within matches to international patent classes

The accuracy of keyword selection methods was assessed by comparing against hand‐sorting results.
First, a random sample of 750 articles from each database matching each of “solar” and “wind” as keywords was sorted by hand into relevant and irrelevant categories. These hand‐sorted samples were also used for the model‐based classification described below, where they are described in more detail. These samples were searched for keywords to see if keyword searching would successfully reproduce the results of hand sorting. Their degree of success was quantified by calculating the percent correct, precision and recall, reported in Tables IIIA and IIIB below. Precision measures what percent of articles found using keywords are actually relevant, whereas recall measures what percent of actually relevant articles were identified using keywords. As seen in Tables IIIA and IIIB, percent correct and recall are high for both wind and solar, demonstrating that keyword selection was successful at finding relevant articles. Levels at or above 70‐80% may be considered good (Boyack, Tsao et al. 2008), suggesting that percent correct and recall are good, but precision is unfortunately low. This implies that articles collected using keyword matching will be somewhat diluted by the presence of irrelevant articles. These performance measures will be compared with results for articles selected using modeling. All keyword‐collected abstracts which listed at least one month in the publication date field were counted by month, by database and by technology type (solar or wind) to form four time series of article counts. Abstracts which list two months, e.g. “February‐March,” were assigned to the earlier month. These time series will be discussed below. Some abstracts, particularly older ones, do not list a month of publication, and so were not included in monthly article counts. c. Collecting Articles by Modeling A second method of identifying articles was used to construct additional sets of article counts. 
In this process, an inclusive set of abstracts was collected from the same databases. After sorting random samples from this set, a Bayesian logistic model was built and used to predict each abstract’s probability of relevance to solar or wind energy. Just as for the keyword searches, the process was performed separately for solar and wind energy for each of the two article databases.

First, abstracts were collected from Web of Science and Engineering Village by searching for the terms “solar” and “wind,” yielding over 200,000 abstracts. From the matches to “solar”, 750 abstracts were randomly chosen from each database. For wind, for each database, 250 abstracts were randomly chosen and 500 were chosen from a set of journals likely to be relevant. This was necessary because only a small percentage of the matches to “wind” are relevant, and thus a purely random sample would be unlikely to include enough relevant abstracts to produce an effective model.

Two readers read each of these 3,000 abstracts and decided whether it was relevant, i.e. described research related to producing solar or wind electricity, respectively, or irrelevant. Most abstracts were irrelevant, covering topics such as sun spots and engine windings. Any disagreements between the two readers were resolved by discussion (for more on systematic hand‐classification, see G.W. Ryan and H.R. Bernard, 2000). On average, sorting took readers about one minute per abstract, not including training or discussions.

These hand‐sorted data were then used to build models which predict whether an abstract is likely relevant to solar or wind electricity. Referred to as Bayesian logistic classifiers in the statistics literature, these models are logistic regressions with Laplace priors, which act like a LASSO penalty to minimize overfitting (for example applications of these and similar methods, see Siddhartha R. Dalal et al., 2012; for more on Bayesian logistic modeling of text, see Alexander Genkin, David D.
Lewis and David Madigan, 2007, and R. Mason et al., 2012). Counts of words which appear in the abstracts are used as the regressors to predict the dependent variable: the abstract’s probability of relevance. Before entering words into the model, common words (e.g. “this”) and single‐letter words were removed, words were reduced to roots called “stems,” and word counts were scaled down using cosine normalization to counteract the effect of variation in abstract length.viii Further methodological details are given in Essay 1. The hand‐sorted data from both databases were combined, thus producing two models: one for solar energy and one for wind energy.

viii The models were constructed and applied using software from Madigan, David; David Lewis; Alex Genkin; Shenzhi Li; Bing Bai; Dmitriy Fradkin; Michael Hollander and Vladimir Menkov. 2009. "Bayesian Logistic Regression (BBR, BMR, BXR)." www.bayesianregression.org (accessed August 2009).

The models perform well when tested with cross‐validation, which assesses how well a model will predict the relevance of abstracts which were not used to build it. N‐1 cross‐validation shows that on average, the solar model successfully identifies 79% of the abstracts which the readers identified as relevant (recall), and 79% of the abstracts it classifies as relevant are truly relevant according to the readers (precision). Cross‐validation for the wind model suggests it will find 75% of the relevant abstracts, and that 76% of the abstracts it predicts to be relevant will actually be relevant. Since performance of 70‐80% is considered good (Boyack, Tsao et al. 2008), these models perform reasonably well. Ten‐fold cross‐validation provides another set of somewhat less precise results as well as a sense of how much the models’ performance may vary. The human readers’ results have no measure which corresponds to cross‐validation; compared directly, the full models perform roughly as well as each reader.
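The performance measures discussed above can be made concrete with a short sketch. It assumes a list of predicted probabilities of relevance and a matching list of hand‐sorted labels, and it treats an abstract as predicted relevant when its probability is at least 0.5, the cutoff stated for Tables IIIA and IIIB. The function name and input layout are illustrative assumptions, not part of the BBR software.

```python
def classification_metrics(probabilities, hand_labels, threshold=0.5):
    """Return (percent_correct, precision, recall) as fractions between 0 and 1.

    probabilities: predicted probability of relevance for each abstract.
    hand_labels:   True where the readers judged the abstract relevant.
    """
    predicted = [p >= threshold for p in probabilities]
    pairs = list(zip(predicted, hand_labels))

    true_pos = sum(1 for pred, truth in pairs if pred and truth)
    false_pos = sum(1 for pred, truth in pairs if pred and not truth)
    false_neg = sum(1 for pred, truth in pairs if not pred and truth)
    true_neg = sum(1 for pred, truth in pairs if not pred and not truth)

    percent_correct = (true_pos + true_neg) / len(pairs)
    # Precision: share of abstracts predicted relevant that truly are relevant.
    flagged = true_pos + false_pos
    precision = true_pos / flagged if flagged else float("nan")
    # Recall: share of truly relevant abstracts that were predicted relevant.
    relevant = true_pos + false_neg
    recall = true_pos / relevant if relevant else float("nan")
    return percent_correct, precision, recall
```

In a cross‐validation setting, the probabilities passed in would be those predicted for each abstract by a model built without that abstract (or without its fold).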
Hand‐sorting results for each reader, model results and both forms of cross‐validation results are reported in Tables IIIA and IIIB.

The solar and wind models are then applied to each of the “solar” or “wind” abstracts collected from each database, predicting its probability of relevance. Abstracts which have no month listed and conference abstracts are removed.ix The probabilities of relevance for the remaining abstracts are summed by month to produce four time series of monthly article counts.

ix This is accomplished by removing all abstracts with "Conference,” “Proceedings,” “Proceddings,” “Annual,” or “Symposium” appearing in the title. Conference proceedings are excluded because they create large spikes during certain months and are likely to repeat content found in other articles.

d. Comparing Keyword and Modeled Article Results
The two methods of identifying articles yield very similar results. Analysis suggests that keywords favor recall and modeling favors precision, although the methods of analysis may be biased in favor of keywords. Modeling consistently yields more articles than keyword selection. Thus, modeling may be preferable when time allows, while keyword searches may be good enough in many situations if keywords are chosen effectively.

The two article count methods can be compared based on their effectiveness at correctly classifying articles. To do this, as before, consider the hand‐sorted articles, with their relevance agreed upon by both readers, as the definition of correct. For solar, the model appears to have substantially better precision and somewhat better percent correct than the keyword method, and somewhat worse recall. For wind, the model has slightly better precision, but percent correct is similar for both methods and the model has substantially worse recall. Thus, it appears likely that if precision is the top priority, modeling is the preferred method, while if recall is the top priority, keyword selection may be preferable.
However, these conclusions may be biased in favor of keywords for reasons explained below, and the performance differences between modeling and keyword selection may be highly sensitive to the data and keywords used.

Tables IIIA and IIIB. Performance of Article Identification Methods
Several measures show the performance of the models used to identify relevant wind and solar abstracts. For the purposes of these calculations, an abstract is considered relevant by the model if the model predicts its probability of relevance to be at least 0.5. When this probability is varied from zero to one, and the resulting percent of relevant abstracts identified is plotted versus percent of irrelevant abstracts identified as relevant, this curve is called the ROC (receiver operating characteristic) curve. The area under the curve, reported below, represents the model’s ability to achieve both high precision and high recall if that probability is optimized. In the case of solar, the model appears to have better precision and percent correct than the keyword selection, and similar recall. In the case of wind, the model also appears to have better precision, but has a similar percent correct and a poorer recall than the keyword selection.

Performance of Solar Article Identification Methods
Method | Percent Correct | Precision | Recall | ROC
Reader 1 | 98% | 96% | 97% | ‐
Reader 2 | 93% | 88% | 88% | ‐
Keywords | 81% | 65% | 85% | ‐
Model Performance | 99% | 99% | 99% | 100%
Model N‐1 CV | 87% | 79% | 79% | ‐
Model 10‐Fold CV | 89% (5%) | 84% (19%) | 81% (21%) | 95% (1%)

Performance of Wind Article Identification Methods
Method | Percent Correct | Precision | Recall | ROC
Reader 1 | 99% | 97% | 97% | ‐
Reader 2 | 99% | 99% | 96% | ‐
Keywords | 90% | 73% | 94% | ‐
Model Performance | 97% | 94% | 94% | 99%
Model N‐1 CV | 89% | 76% | 75% | ‐
Model 10‐Fold CV | 90% (2%) | 81% (9%) | 77% (5%) | 94% (2%)

A more direct comparison of the two methods more dramatically reveals their relative emphases on recall and precision, respectively.
In Tables IVA and IVB, articles are counted based on whether they were identified as relevant by the keyword approach, the model, or both; and within each of these cells, whether that classification was correct, based on hand‐sorting. Since a similar table assessing a single model is sometimes called a confusion table, this may be called a “two‐model confusion table.” Note that for the purposes of comparing the models, the upper left and lower right quadrants are not useful, since they describe articles for which the keywords and model both gave the same result. The cells of greatest interest are highlighted.

To quantify the comparison between modeling and keyword searching, let the “Relative Recall Ratio” of the two methods be the number of relevant articles identified by the model but not by the keywords, divided by the number of relevant articles correctly identified by the keywords but not the model. For solar, this is about 3:4 as shown by the table, or more precisely, 0.62. For wind, the relative recall ratio is about 1:5, or more precisely, 0.13. This measure shows that the keyword approach results in better recall. Defining the “Relative Precision Ratio” analogously for irrelevant articles gives a ratio of 11:3, or more precisely 3.4, for solar and 6:3, precisely 2.0, for wind. Thus, the model is much better at precision than the keyword approach.

Tables IVA and IVB. Keyword and Model Article Selection Two‐Model Confusion Tables
Using the hand‐sorted data, cross‐tabs of keyword and model results are coded within each cell to show relevant article counts first (in bold) and irrelevant article counts second (in plain typeface), all as percentages of the entire hand‐coded sample. Cells in the highlighted upper right and lower left quadrants reflect the differences between keyword and model results, suggesting that for these data, keyword selection is much better at recall and modeling is much better at precision.
Solar Two‐Model Confusion Table (cells show relevant / irrelevant)
 | BBR Relevant | BBR Irrelevant | Total | Grand Total
Keyword Relevant | 22 / 3 | 4 / 11 | 26 / 14 | 40
Keyword Irrelevant | 3 / 3 | 2 / 52 | 5 / 55 | 60
Total | 24 / 6 | 6 / 63 | 31 / 69 | 100
Grand Total | 31 | 69 | 100 | 100

Wind Two‐Model Confusion Table (cells show relevant / irrelevant)
 | BBR Relevant | BBR Irrelevant | Total | Grand Total
Keyword Relevant | 17 / 3 | 5 / 6 | 22 / 8 | 30
Keyword Irrelevant | 1 / 3 | 1 / 65 | 1 / 68 | 70
Total | 18 / 5 | 6 / 71 | 23 / 77 | 100
Grand Total | 23 | 77 | 100 | 100

The analysis presented so far suggests that keyword selection is better for recall and model selection is better for precision. However, since even the irrelevant articles considered here match the keywords “solar” and “wind” because that is how they were collected, the percent correct, precision and recall measures may be biased compared to the values that would be obtained if all existing articles were used. In particular, if appearances of “solar” and “wind” are more correlated with the keywords chosen than with relevance, all the analysis above would be biased in favor of keyword selection over modeling. The tables above suggest that this is true: 40% of the solar sample abstracts match the keywords used, while only 31% of the solar sample abstracts are relevant. Similarly, for wind 30% of the sample abstracts match keywords while 23% are relevant. Thus, all of the analysis above likely overstates the effectiveness of keyword selection.

When applied beyond the training data, both keyword and model selection methods also yield similar results. This is most clear in Figure I, where both types of article counts can be seen to track closely. The similarity between the two article count types can be quantified by finding the cross‐correlation between them, shown in Table V. Using a lag of zero, which is optimal, the cross‐correlation ranges from 0.94 to 1.00 depending on article database and technology type. Keyword article counts are able to explain 73% to 89% of the deviance in modeled article counts, using quasiPoisson regression.
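A lagged cross‐correlation of the kind used here can be sketched as follows. This is a simplified illustration rather than the procedure actually used: it computes an ordinary Pearson correlation on the overlapping part of the two series at each lag, whereas standard cross‐correlation routines normalize by full‐series means and variances. The function names are hypothetical; a positive lag means the second series trails the first.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)


def cross_correlation(x, y, lag):
    """Correlation between x[t] and y[t + lag] over the overlapping months."""
    if lag >= 0:
        a, b = x[: len(x) - lag], y[lag:]
    else:
        a, b = x[-lag:], y[: len(y) + lag]
    return pearson(a, b)


def optimal_lag(x, y, max_lag):
    """Lag in [-max_lag, max_lag] at which the cross-correlation is highest."""
    return max(range(-max_lag, max_lag + 1),
               key=lambda k: cross_correlation(x, y, k))
```

With monthly keyword counts as `x` and modeled counts as `y`, an optimal lag of zero would indicate that the two series move together in the same month.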
These results suggest that at least in this case, keyword‐based and modeled article counts can be used interchangeably.

Table V. Similarity Between Keyword and Modeled Article Count Time Series
Time series of monthly article counts produced using keywords and modeling are compared. For every combination of technology type and database, the two article count methods produce closely related results.

Similarity Between Keyword Article Counts and Modeled Article Counts
Technology Type | Database | Optimal Lag (months) | Cross‐Correlation at Optimal Lag | Percent Deviance Explained
Solar | Engineering Village | 0 | 1.00 | 89%
Solar | Web of Science | 0 | 0.99 | 82%
Wind | Engineering Village | 0 | 0.97 | 81%
Wind | Web of Science | 0 | 0.94 | 73%

Model selection identifies more articles on average, as noted earlier. For solar, article counts from Engineering Village average 16% higher when based on the model than when based on keywords, and article counts from Web of Science average 45% higher. For wind, the difference is even greater: Engineering Village modeled counts are 65% higher than keyword counts, and Web of Science modeled counts are 114% higher than keyword counts. All else equal, this difference makes modeling a preferable method of identifying articles. Because keyword and model‐based article counts are so highly correlated and model‐based counts are higher, for simplicity the rest of this section will focus on model‐based article counts.

e. Comparing Article and Patent Results
Both patent and article counts are extremely noisy, as is typical for counts of random events. When plotted over time or regressed with each other, they appear correlated, but the source of this relationship is unclear and may be based on their independently increasing over time.
What is clear is that articles are more numerous and that patents and articles each vary at least partially independently of one another, suggesting that articles will have greater statistical power and each could contain information about innovation that is not contained in the other.

Articles are more numerous than patents in every case considered here. An average month has about two to three times as many articles published that month as patents applied for in that month (including only patent applications which were ultimately successful). Because the counts are larger, article series are somewhat ‘smoother’ than patent series. This alone makes them a potentially interesting candidate for further analysis.

Table VI. Article Count Summary Statistics
Considering each set of article counts as a monthly time series from January 1986‐December 2007, a variety of summary statistics are shown.

Summary Statistics for All Patent and Article Count Time Series
Tech Type | Database | Data Type | Mean | Standard Deviation | Median | Min. | Max. | Count Equaling Zero | Obs.
Solar | USPTO | Patents | 15 | 8 | 14 | 3 | 44 | 0 | 264
Solar | Engineering Village | Keyword Articles | 44 | 33 | 31 | 4 | 164 | 0 | 264
Solar | Engineering Village | Model Articles | 51 | 36 | 38 | 7 | 180 | 0 | 264
Solar | Web of Science | Keyword Articles | 45 | 43 | 33 | 0 | 261 | 0 | 264
Solar | Web of Science | Model Articles | 66 | 54 | 50 | 2 | 322 | 0 | 264
Wind | USPTO | Patents | 5 | 4 | 3 | 0 | 21 | 25 | 264
Wind | Engineering Village | Keyword Articles | 8 | 9 | 4 | 0 | 42 | 0 | 264
Wind | Engineering Village | Model Articles | 14 | 12 | 10 | 1 | 56 | 0 | 264
Wind | Web of Science | Keyword Articles | 7 | 8 | 4 | 0 | 52 | 6 | 264
Wind | Web of Science | Model Articles | 14 | 12 | 11 | 1 | 77 | 0 | 264

In Figure I, all patent and article counts can be seen to rise substantially over time. In some cases, individual peaks or valleys in patent counts appear to align with or be followed by peaks or valleys in article counts, such as the 1997 valley in wind patent counts and the 1999 valley in wind Web of Science article counts, both of which are low points before their respective counts rise almost monotonically for many years.
However, in most cases it is unclear whether such peaks and valleys in patents and articles are related or entirely independent. Such ambiguity is common in graphs describing related but complex phenomena, whose relationships can sometimes be revealed by quantitative analysis.

While longer time series are shown below, only patent data from 1986‐2007 and article data from 1986‐2008 will be discussed further. Earlier years reflect the beginning of meaningful data availability, and later article and patent counts are likely to be low because of the time required to review patents and enter articles into the databases.

Figure I. Patent and Article Counts for Solar and Wind Energy
The number of patents filed each month and two measures of the number of articles published each month are smoothed by calculating twelve‐month rolling averages and plotted over time. Article counts produced using Bayesian classification are slightly higher than those produced by searching for key words, but follow a very similar path. Patent counts follow their own, somewhat related path, appearing to lead article counts by up to several years as seen in the wind data below.

Articles and patents are likely to take different amounts of time to produce, thus potentially resulting in a lag between their counts. Particularly for wind, the plots in Figure I suggest that article counts may lag after patent counts. Using the cross‐correlation to find the optimal lag time between patent counts and article counts suggests that in most cases, articles follow after patent counts. This is consistent with the fact that the dates used for patents are when they were submitted, whereas article dates are dates of publication and therefore after the time spent in the article acceptance process. Identifying the length of this lag would be useful for future research, since it could be used to adjust article or patent count results before comparing between them.
However, lag times tend to vary widely, potentially because even among patents or among articles, there is great variation in the time spent producing them.

Optimal lag times and related information are shown in Table VII. While in most cases articles lag after patents, for solar articles from Engineering Village, the optimal lag time has articles ahead of patents by six months. The next‐best lag is nine months with articles lagging after patent counts, similar to the results for other database and technology type combinations. In short, exact lag times are not very stable, although they appear to center around articles lagging 9‐16 months behind patents.

Table VII. Similarity Between Patent Count and Modeled Article Count Time Series
Time series of patent and article counts are compared. For solar using Engineering Village, the next‐best lag time is nine months, which is somewhat consistent with the optimal lags for other technology type and database combinations. While cross‐correlation and percent deviance explained are high in all cases, they are far less than 100%, suggesting that patents and articles do not suffice to explain each other.

Similarity Between Patent Counts and Modeled Article Counts
Technology Type | Database | Optimal Lag (months) | Cross‐Correlation at Optimal Lag | Percent Deviance Explained
Solar | Engineering Village | ‐6 | 0.61 | 37%
Solar | Web of Science | 11 | 0.64 | 41%
Wind | Engineering Village | 9 | 0.70 | 44%
Wind | Web of Science | 16 | 0.61 | 33%

Since patents and articles are both being used as proxies for innovation, it may be of interest how closely related they are to each other. Using the optimal lags, the cross‐correlation between patent and article counts is 0.61‐0.70. Using Poisson regression, article counts explain 33%‐44% of the deviance in patent counts.
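To make "percent deviance explained" concrete, the sketch below fits a Poisson regression with a log link and a single regressor by Newton's method, then compares its deviance with that of an intercept‐only model. It is a minimal illustration under stated assumptions, not the software used for the reported results: the function names are hypothetical, only one regressor is supported, and the regressor is assumed to be scaled so that the exponentials stay well behaved. Point estimates, and hence this ratio, are the same for Poisson and quasiPoisson fits, which differ only in their dispersion and standard errors.

```python
from math import exp, log


def fit_poisson(x, y, iterations=50):
    """Fit log(mu) = a + b*x by Newton's method; returns (a, b)."""
    a, b = log(sum(y) / len(y)), 0.0
    for _ in range(iterations):
        mu = [exp(a + b * xi) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix.
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


def poisson_deviance(y, mu):
    """Poisson deviance: 2 * sum(y*log(y/mu) - (y - mu)), with 0*log(0) = 0."""
    total = 0.0
    for yi, mi in zip(y, mu):
        term = yi * log(yi / mi) if yi > 0 else 0.0
        total += term - (yi - mi)
    return 2.0 * total


def deviance_explained(x, y):
    """Share of the null (intercept-only) deviance removed by the regressor."""
    a, b = fit_poisson(x, y)
    fitted = [exp(a + b * xi) for xi in x]
    null = [sum(y) / len(y)] * len(y)
    return 1.0 - poisson_deviance(y, fitted) / poisson_deviance(y, null)
```

Here `y` would hold monthly patent counts and `x` the (suitably scaled and lagged) article counts.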
Thus, article counts and patent counts are related, although the analysis below suggests that this association may be due to the upward trends exhibited by both patent and article counts.

Article counts and patent counts both are noisy and rise dramatically with the passage of time, so detrending (using a quasiPoisson model with a log link and time as the only regressor) and twelve‐month smoothing before calculating cross‐correlations and optimal lags may be a more appropriate approach. This approach gives optimal lags between ‐2 and 38 months, an unsatisfyingly large range, and more surprisingly, half of the cross‐correlations used to find these lags are negative. Thus, it is not clear that patents and articles are meaningfully correlated after their upward trends are removed. The differences between results for keywords and modeled articles are as large as the differences between wind and solar results and between databases used, suggesting that perhaps the results of this particular exercise reflect only arbitrary noise.

Article and patent counts appear logically related to innovation, and the regression example below will suggest they react very similarly to research subsidies. However, a preliminary analysis attempting to quantify their relationships with each other is inconclusive. Given this, whether the apparent associations between patents and articles reflect meaningful relationships or independent time trends is in the eye of the beholder. What is clear is that patents and articles may each tell us something which cannot be found by examining the other.

III. Example Application: Identifying Subsidy Impacts
To see what patents and article counts can tell us about the drivers of innovation, this section considers an example focusing on renewable energy and public policy.
Renewable energy is often considered subject to a double market failure: innovators, as in all fields, are unlikely to capture the full benefits of their work because others will copy it; and further, they will be unable to capture the environmental and energy independence benefits because of their public goods character. Therefore, many governments including the U.S. provide direct supports for renewable energy innovation as well as supports for renewable electricity production which may encourage innovation indirectly. It would be useful to know how much these two types of price supports affect innovation, in order to more wisely allocate resources between them. Here, I compare the impacts of direct and indirect U.S. subsidies on patents and the two types of article counts, interpreted as proxies for innovation.

In this example, direct support for research is represented by federal research subsidies allotted to wind or solar research, while indirect support is represented by tax credits given for wind, solar and other renewable electricity production and investment. The former is likely to have stronger effects on research than the latter, since research dollars go only to research while the tax credit dollars go to electricity producers and thus only indirectly may some of them be used to pay for research. Compounding this effect, research subsidies as used here are specific to solar or to wind energy, while the tax credits are for many types of renewable energy. On the other hand, economic theory also suggests that research subsidies may not result in any increase in research production, since the public subsidies may simply crowd out private research dollars. Because the tax credits increase the potential value of successful research, they are not subject to crowding out.

a.
Regression Methods
The expected value of the number of patents or articles ($Y_t$) is modeled as a function of a lag of patents or articles ($\bar{Y}_{t-L}$), subsidies for research on that technology type ($R$), tax credits for renewable energy production ($T$), and other conditions related to energy demand: renewable electricity consumption ($U$, for “use”), electricity prices ($P^{e}$), fossil fuel prices paid by power plants ($P^{f}$), and GDP ($G$). Since patents and articles are count data, a Poisson regression is appropriate, and the canonical log link is used since counts appear to increase exponentially over time. Bootstrapping the count data is used to produce appropriate standard errors given that the data are overdispersed relative to a Poisson distribution. Regressor variable names are defined in the data section below. The model is run separately for each time series of patent counts or article counts.

$$\ln \mathrm{E}[Y_t] = \beta_0 + \beta_1 \bar{Y}_{t-L} + \beta_2 R_{t-L} + \beta_3 T_{t-L} + \beta_4 U_{t-L} + \beta_5 P^{e}_{t-L} + \beta_6 P^{f}_{t-L} + \beta_7 G_{t-L}, \quad \text{where } Y_t \sim \mathrm{Poisson}(\mathrm{E}[Y_t])$$

A lag of $L$ months is allowed between the regressor conditions and the patent or article counts. In the case of patents, this represents the time between the decision to produce the patent and when it is filed; in the case of articles, it represents the time between producing an article and submitting it to a journal, plus the time between submission and publication. Five different values for $L$ are considered in order to examine when subsidies have their effects, if any. These range from seven months, which is the average publication time for some relevant journals based on previous results,x to five years. Note that significant results should not necessarily be expected for all values of $L$; instead, the value of $L$ for which a given regressor is significant identifies the length of time that that regressor takes to affect patents or articles. The lag of article counts is defined as the twelve‐month average ending in month $t-L$. For article counts, 1995‐2008 data are used.
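The lagged count regressor described above, a twelve‐month average ending a fixed number of months before the month being modeled, could be constructed along the following lines. The function name, the zero‐based month indexing, and the choice to return `None` when the window is not fully observed are all illustrative assumptions.

```python
def lagged_twelve_month_average(counts, t, lag, window=12):
    """Mean of the `window` monthly counts ending in month t - lag (inclusive).

    counts is a list of monthly counts indexed from 0. Returns None when the
    averaging window would begin before the first month or end after the last.
    """
    end = t - lag
    start = end - window + 1
    if start < 0 or end >= len(counts):
        return None
    return sum(counts[start:end + 1]) / window
```

For example, with a lag of seven months, the regressor for month 30 averages the counts for months 12 through 23.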
For patents, only 1995-2007 data are used because recent patent counts may be low due to the time taken to review patents. Restricting the patent data further to 1995-2005, since 2005 is the last year before wind patent counts begin to fall, was also considered and yields similar results.

Footnote x: This average publication time was calculated by weighting the journal publication times in Luwel and Moed's 1998 research by those journals' frequency of appearance in the classified article count data. See Luwel, M. and H. Moed. 1998. "Publication Delays in the Science Field and Their Relationship to the Ageing of Scientific Literature." Scientometrics, 41(1), 29-40.

b. Additional Data Sources

The two subsidies considered are total research subsidies for solar energy or for wind energy, and the taxes not collected due to the New Technology Credits for renewable energy production. Research subsidies are the most direct federal funding source for renewable energy research, and are reported annually to the OECD and IEA by the federal government (OECD/IEA, 2010). They are the only regression variable which differs between solar and wind energy. The New Technology Credit expenditures are the sum of "tax expenditures," i.e. dollars not collected in taxes, due to the Production Tax Credit (PTC) and Investment Tax Credit (ITC) for renewable energy producers. The PTC is given on a per-kilowatt-hour basis for renewable electricity production and is by far the largest source of federal support for renewable energy (Energy Information Administration, 2008). Most of the PTC dollars go to supporting wind energy, although other sources including solar are eligible and receive the same support per kilowatt-hour.
For every fiscal year, the tax expenditure value of the PTC is reported in federal budget documents summed with the tax expenditure value of the Investment Tax Credit (Office of Management and Budget, For each of Fiscal Years 1996-2011), which is given for investments in renewable energy equipment and is one of the oldest federal supports for renewable energy (Energy Information Administration, 2008). Research subsidies are a direct price support for renewable energy research while the New Technology Credits are an indirect price support in this context, since their direct target is renewable energy production rather than research.

Table VIII. Summary Statistics of Counts and Regressors

Summary statistics are shown for the count data and all regressors during 1995-2008.

Summary Statistics for 1995-2008
Variable | Solar | Wind | Units | Frequency | Source
Patent Counts | 19 (7) | 6 (5) | Articles | Monthly | new
Keyword Article Counts from Engineering Village | 64 (39) | 13 (11) | Articles | Monthly | new
Modeled Article Counts from Engineering Village | 73 (41) | 20 (13) | Articles | Monthly | new
Keyword Article Counts from Web of Science | 76 (53) | 12 (11) | Articles | Monthly | new
Modeled Article Counts from Web of Science | 106 (69) | 23 (15) | Articles | Monthly | new
Research Subsidies | 114 (33) | 41 (11) | million $ (a) | Annual | OECD/IEA
New Technology Credits | 250 (244) | 250 (244) | million $ expended (b) | Annual | OMB data (c)
Renewable Consumption | 546 (58) | 546 (58) | million Btu | Monthly | EIA
Electricity Prices | 9.0 (0.6) | 9.0 (0.6) | $/million Btu (b) | Monthly | EIA
Fossil Fuel Prices | 2.6 (0.8) | 2.6 (0.8) | $/million Btu (b) | Monthly | EIA
GDP | 12,600,000 (1,300,000) | 12,600,000 (1,300,000) | seasonally adjusted million $ (b) | Quarterly | BEA

a. inflation-adjusted by source using 2009 CPI
b. inflation-adjusted by author using average 2009 CPI
c. compiled from annual Analytical Perspectives budget documents

Data on other aspects of the market for renewable energy (i.e. renewable energy consumption, average fossil fuel prices paid by power plants and average electricity prices) are from the U.S. Energy Information Administration (Energy Information Administration, 2010). Finally, GDP data are drawn from Bureau of Economic Analysis tables (Bureau of Economic Analysis, August 2010).

All dollar amounts are inflation-adjusted to 2009 dollars. For research subsidies, the inflation adjustment has already been conducted by the data provider. Annual New Technology Credits expenditures were adjusted using annual Consumer Price Index factors (Bureau of Labor Statistics, 2011). For consistency, fossil fuel prices and electricity prices were adjusted using the same annual factors. The New Technology Credit expenditures for each fiscal year were adjusted using an inflation factor for that fiscal year. This fiscal year factor is constructed by taking the geometric mean of the annual inflation factors for each of the months within that fiscal year, i.e. three months of the preceding year's annual inflation factor and nine months of the calendar year's annual inflation factor.

c. Regression Results Using Patent Counts and Article Counts

Regression results using article counts and patent counts show similarities, which suggest they capture related phenomena, and differences, which suggest that article counts are better able to identify small effect sizes. That is, patents and articles both exhibit statistically significant responses to research subsidies, while only article counts show significant responses to the tax credits. Specifically, solar and wind patent counts increase significantly in response to research subsidies within several years; solar and wind Engineering Village article counts show effects of similar timing and orders of magnitude. Patent counts show no significant responses to the tax credits, while all sets of article counts show significant effects.
Keyword-based and model-based article counts produce almost identical results. Thus, using article counts of either type reveals the impacts of tax credits which are not apparent when using patent counts.

Modeled and keyword-based article counts in most cases produce the same results, to the significant digits described in the text below, and in all cases the one-standard-deviation ranges of their results overlap substantially. Given this degree of similarity, for clarity of discussion only modeled article counts will be discussed further.

Results for research subsidies and tax credits are summarized in Table IX. To facilitate interpretation, results are shown as the percentage increase in the outcome variable which would result from a hypothetical $10 million increase in the policy variable. Ranges indicate +/- one standard deviation. Only results which have a 95% probability of being significant given that five time periods were tested (i.e. independently significant at the one percent level) are shown. If a variable has no statistically significant effect with any of the time lags considered, the table lists "none" as the results for that variable. Similarly, deviance explained is only shown for those variables which the regressions find to be significant. Except where otherwise noted, only statistically significant results will be discussed. Full results tables for patent counts are given in the appendix and for article counts are given in Appendix G of Essay 1 (footnote xi). Note that increasing research subsidies would only impact either solar or wind research, while increasing the New Technology Credit would impact both solar and wind research simultaneously.

Footnote xi: Essay 1 reports results using the same statistical significance threshold used here, but uses different statistical significance thresholds for the purposes of interpreting results because it focuses on different research questions.

Table IX.
Predicted Impacts of a $10 Million Increase in Solar Research Subsidies, Wind Research Subsidies or the New Technology Credits

Percentage increases in patent or article counts predicted by a hypothetical $10 million increase in research subsidies or the New Technology Credits, based on the regression models. Only statistically significant effects are reported, with ranges given showing +/- one standard deviation. Percent deviance explained by these regressors is also listed for those coefficients which are statistically significant.

Patent Predicted Increases
Policy variable | Solar | Wind
Research Subsidies | 4-9% in 1 year; 6-12% in 2 years | 12-30% in 7 months; 19-38% in 1 year
New Technology Credits | none | none

Patent Percent Deviance Explained
Policy variable | Solar | Wind
Research Subsidies | 2.5% in 1 year; 3.6% in 2 years | 2.3% in 7 months; 3.5% in 1 year
New Technology Credits | n/a | n/a

Engineering Village Article Predicted Increases
Policy variable | Solar | Wind
Research Subsidies | 1-2% after 1 year; 7-10% after 3 years | 10-18% after 1 year; 16-22% after 2 years; 20-33% after 3 years
New Technology Credits | 0.5-0.7% after 7 months; 0.4-0.5% after 1 year; 0.4-0.7% after 2 years | 0.4-0.9% after 2 years

Engineering Village Article Percent Deviance Explained
Policy variable | Solar | Wind
Research Subsidies | 0.9% in 1 year; 1% in 3 years | 2.1% after 1 year; 3.6% after 2 years; 1.5% after 3 years
New Technology Credits | 2.4% in 7 months; 1% in 1 year; 0.8% in 2 years | 0.5% in 7 months

Web of Science Article Predicted Increases
Policy variable | Solar | Wind
Research Subsidies | none | 27-39% after 5 years
New Technology Credits | 0.3-0.4% in 7 months | 0.3-0.6% after 7 months

Web of Science Article Percent Deviance Explained
Policy variable | Solar | Wind
Research Subsidies | n/a | 2.2% in 5 years
New Technology Credits | 0.5% in 7 months | 0.8% in 7 months

Research subsidies show similar impacts on patent counts and article counts. Increasing solar research subsidies by $10 million for one year is followed by a 4-12% increase in solar patents after one to two years.
After the same $10 million, solar article counts increase by 1-10% after one or three years, if articles are drawn from the Engineering Village database (with a statistically insignificant 0.8-3% increase after two years). Solar article counts from Web of Science, with no statistically significant responses, are the exception to these results and will be discussed further.

If wind research subsidies are increased by $10 million, patent counts and article counts from both databases all increase significantly and to a greater degree than for solar. Wind patents show a 12-38% increase appearing after seven months to one year. Wind Engineering Village article counts increase by 10-33% in one to three years, and wind Web of Science article counts by 27-39% after five years. Since the absolute quantities of wind research are smaller than those of solar research, it is perhaps unsurprising that their percentage increases in response to research subsidies are higher.

Overall, patent and article counts both suggest that research subsidies have substantial positive impacts on the rates of solar and wind energy research. For context, note that previous research has found that $55 or $100 million in U.S. research funding is required to induce a single renewable energy patent (David Popp, 2002). However, this number may be high because the study included technology types that are less affected by public support, as the author implies in later work (David Popp, 2003). In any case, the effects identified here are much larger than previously measured.

Web of Science article counts show weaker or at least slower responses to research subsidies than do article counts from Engineering Village. For wind, the response appears only after five years; for solar, no response is significant, suggesting that effects, if they exist, may take more than five years to appear.
Web of Science includes more basic science and more social science research than Engineering Village, which may explain its lower responsiveness to research subsidies if those subsidies are targeted towards applied research.

While patent and article count results yield strikingly similar conclusions about research subsidies, they give noticeably different results for tax credits. That is, article counts increase significantly with tax credits while patents do not. A $10 million tax credit increase is followed by a 0.4-0.7% increase in solar article counts from Engineering Village after seven months to two years. For solar counts from Web of Science, the response is 0.3-0.4% after seven months. Wind article counts increase by a similar amount in time periods differing by database: 0.4-0.9% after two years for Engineering Village and 0.3-0.6% after seven months for Web of Science.

Considering only article counts, this analysis strongly suggests that increases in tax credits will be followed by increases in both wind and solar energy research. Considering only patent counts suggests no such effect. To examine why, it is useful to consider the size of the effects and whether the results using article counts fall within the error bounds of results using patent counts. The tax credit effects on articles are one or more orders of magnitude smaller than the research subsidy effects. Thus, they may only be identifiable using more precise data. The article-based results have narrow ranges (for example, 0.4-0.7% for solar articles from Engineering Village, when significant) which fall well within the ranges of the patent count results (for example, -1.2% to 0.7% for solar patents during the same set of time periods).
This overlap shows that although the patent counts cannot reject the null hypothesis that tax credits have a 0% effect on patent counts, neither can they reject the hypothesis that tax credits have an effect on patents that is exactly the same size as their effect on articles. Thus, it is possible that patents are as responsive to the New Technology Credits as articles are, but are simply too noisy for their responses to appear statistically significant. This interpretation is consistent with the fact that articles are far more numerous than patents, and together they suggest that articles can be used to measure effects that patents cannot pick up.

Web of Science article counts for solar and wind respond significantly to tax credits only after seven months, a particularly short time lag. This suggests the possibility that for these article counts, the timing of article publication could be affected more than the total quantity of articles produced. Like the long time lags before Web of Science wind articles respond to research subsidies, these results may indicate that Web of Science article counts are generally a weaker and less reliable measure of applied research than those from Engineering Village, consistent with the differing types of articles covered by the two databases.

The research funding and tax credit effect sizes reported in the tables above can serve as inputs for policy decisions and models in which renewable energy research output is a consideration. For example, it may be reported that this research concludes that increasing solar research funding by $10 million is likely to be followed by a 6-12% increase in patents and articles after two years, or that increasing tax credit outlays by the same amount would be followed by a roughly half percent increase in solar research articles and in wind research articles.
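To illustrate how such effect sizes might be passed to a policy model, the sketch below rescales a reported percentage increase per $10 million to another dollar amount. It assumes the log-link form of the regressions, under which effects compound rather than add; the function name `scale_effect` and the example values are hypothetical, and extrapolating far beyond the observed range of the policy variables is not supported by the data.

```python
def scale_effect(pct_per_10m, dollars_millions):
    """Percent increase implied by a log-linear model for an arbitrary policy change.

    pct_per_10m: reported percent increase per $10 million (e.g. 0.5 for 0.5%).
    dollars_millions: contemplated increase in the policy variable, in $ millions.
    """
    growth = 1.0 + pct_per_10m / 100.0
    return 100.0 * (growth ** (dollars_millions / 10.0) - 1.0)


# A roughly half-percent effect per $10 million, rescaled to a $20 million increase.
print(f"{scale_effect(0.5, 20.0):.4f}%")
```

For small effects the compounded value is close to simple proportional scaling, so the distinction matters mainly for the larger research subsidy effects.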
For wind, a hypothetical increase of $10 million in research funding is fairly large since it represents 37% of the actual amount in 2008, the last year in the data; therefore smaller increases may be more reasonable extrapolations from the data. Dollar amounts can be reported as percentages of the actual policy dollar amounts at a given point in time; for example, since $10 million is 1.1% of actual tax credit levels in 2008, one can say that increasing tax credits by two percent of their 2008 levels is predicted to stimulate an approximately one percent increase in solar or wind energy research. These levels, always as percent increases per policy dollar, can also be used as inputs to models which predict the effects of research funding and tax credits or similar production price supports on renewable energy innovation or research output. Using such models, innovation can be included as one among various outcomes of potential suites of energy policies.

IV. Reflections on Article Selection Methods

The experience of article selection here, together with existing bibliometric knowledge, suggests a variety of methodological conclusions which may be of use to social scientists constructing article counts. Which article database is used appears particularly important, while keywords may be as effective as modeling depending on particular circumstances. Initial article selection, choice of modeling functional form, and software accessibility are areas where improvements in methods would be particularly helpful. While the discussion below focuses on articles, it should be noted that these methods can be applied to patents or other types of documents as well.

Firstly, the choice of database appears to influence results more than the choice of article identification method, at least for the two methods considered here. If applied research and innovation are the focus, Engineering Village appears to be a preferable database.
More generally, the database should be chosen to match the type of research articles desired, and should be considered when results are interpreted.

If articles from a given database are to be counted, the examples here shed light on which methods are best for which purposes. Choosing keywords a priori is typically faster and appears good enough for situations when good keywords can be found and lower absolute article counts are acceptable. These conditions may often hold, especially for econometric analysis where article topics are broad and the focus is on relative rather than absolute article counts. If these conditions do not hold, modeling articles is preferable. More specifically, if a priori keyword selection requires several iterations to be effective, it may be more efficient to use a model, which essentially identifies and uses keywords automatically. By letting the model choose and weight the keywords, using a model relieves the analyst of the need for technical knowledge sufficient to guess which words will best identify relevant articles. Modeling methods are likely to be the more effective means of collecting articles (maximizing both precision and recall), and are therefore preferable if resources allow. For these reasons and particularly because of their high recall, model-selected articles are suitable for a wider range of applications than keyword-selected articles, from econometrics (as here) to theme analysis within the articles.

To assess the effectiveness of keyword-based or modeling methods, it is always desirable to sort a sample of documents by hand and calculate the precision, recall and success rate of the approach used. This sample can be selected randomly from a large set of potentially relevant articles identified using search terms that are as inclusive as feasible.
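The assessment just described can be sketched as follows. This is a minimal illustration under stated assumptions: "success rate" is taken here to mean the share of abstracts the method classifies the same way as the hand sort, and the function name `selection_metrics` and the toy labels are hypothetical.

```python
def selection_metrics(hand_labels, method_labels):
    """Compare a selection method against hand-sorted labels (True = relevant)."""
    pairs = list(zip(hand_labels, method_labels))
    tp = sum(1 for h, m in pairs if h and m)          # relevant and selected
    fp = sum(1 for h, m in pairs if not h and m)      # irrelevant but selected
    fn = sum(1 for h, m in pairs if h and not m)      # relevant but missed
    tn = sum(1 for h, m in pairs if not h and not m)  # irrelevant and not selected
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    success_rate = (tp + tn) / len(pairs)
    return precision, recall, success_rate


# Toy hand-sorted sample of eight abstracts (illustrative only).
hand = [True, True, True, False, False, False, False, True]
method = [True, True, False, True, True, False, False, True]

precision, recall, success_rate = selection_metrics(hand, method)
print(precision, recall, success_rate)
```

The same function can be applied to either keyword-based or model-based selections, which makes the two approaches directly comparable on one hand-sorted sample.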
Using a set of known-to-be-relevant articles to generate an inclusive search string may be more effective than a priori selection of fewer search terms, although the latter was done here for simplicity. If only a small percent of the resulting articles are relevant, an oversampling approach can be employed, as done for wind articles here. The oversample may be selected by drawing from relevant journals, but should include both relevant and irrelevant articles and avoid using keywords in order to avoid biasing the resulting model as much as possible. Each abstract in the resulting sample and oversample, if used, should be hand-sorted by at least two readers in order to maintain consistency. The hand-sorting results can then be compared with the results of the chosen keyword-based or model-based article selection method. For modeling methods, this hand-sorted data can be used to build the model, which will then be applied to the large set from which the hand-sorted data was drawn.

Among model-based article selection methods, other models merit further comparison with logistic Bayesian article selection. Latent semantic analysis has received substantial attention elsewhere and could be considered for the purposes of producing time series of article counts. Multiple methods, including others in use for social sciences such as classical regression with randomly selected subsets of keywords (Daniel J. Hopkins and Gary King, 2010), support vector machines, and gradient boosting (Siddhartha R. Dalal, Paul G. Shekelle, Susanne Hempel, Sydne J. Newberry, Aneesa Motala and Kanaka D. Shetty, 2012; R. Mason, B. McInnis and S. Dalal, 2012), could be applied to the same dataset of potentially relevant articles for the purposes of comparison. Examination of their results should focus on the desired end product, e.g.
article counts, and on the abstracts for which the different methods give different results, using measures such as the Relative Recall Ratio and Relative Precision Ratio defined above.

Finally, if article counts based on modeling are to be more widely used in the social sciences, the software for creating them will need to be easier to access and use. This research uses a combination of R, with text-related packages "tm" (for "text mining") and "lsa" (for "latent semantic analysis"), and software from <www.bayesianregression.org> (David Madigan, David Lewis, Alex Genkin, Shenzhi Li, Bing Bai, Dmitriy Fradkin, Michael Hollander and Vladimir Menkov, 2009). While there are a variety of packages available for analyzing text by hand, for doing fully automatic text analysis and for running Bayesian and classical statistical analysis, to this author's knowledge they are not streamlined into one platform where uploading data, hand-sorting, modeling and analysis of the resulting time series are all convenient. Such a tool may foster greater popularity of modeling article relevance and using the results to answer questions throughout social science.

V. Conclusions

Article counts collected with either of two methods can identify an effect on innovation that is undetectable with patent counts. The regressions relating patent and article counts with policy variables provide substantial evidence of the impact of research subsidies on solar and wind innovation. These effects are comparable whether articles are identified using keywords or modeling. Further, article counts show an effect of tax credits which is too small to identify using patent counts. These comparisons among results using patents, keyword-based article counts and modeled article counts yield both substantive and methodological conclusions relevant to future work.

a.
Solar and Wind Policy Findings and their Applications

Research subsidies and tax credits both show strong relationships with solar and wind research. Research subsidies appear to be followed by substantial increases in the production of solar and wind patents and articles, particularly patents and articles from the more applied article database. Tax credits have no discernible impact on patents, perhaps due to the smaller size of patent counts, but are followed by increases in all four types of article counts. Thus, research subsidies and tax credits appear to be effective policies for increasing wind and solar innovation. This result suggests that policy impacts are not simply displacing ("crowding out") private sector funds. For tax credits, this represents a previously unmeasured effect. The percent increases predicted per tax credit or research subsidy dollar can be interpreted directly and used as inputs to climate policy models.

b. Future Research Directions in Renewable Energy Innovation and Policy

A combination of varied article and patent count measures and methods of modeling their relationships with economic drivers should ultimately yield a stronger understanding of how economic conditions affect innovation in renewable energy. An important goal for such models would be to examine causality between policy decisions and innovation in further detail than has been possible here. Renewable energy policies in other countries may have substantial impacts on innovation, as may public opinion about environmental needs, energy security or the economy. The instability of policy variables and their interactions with each other are also often hypothesized to affect innovation; such hypotheses can be tested by explicitly considering interaction terms and the slopes as well as absolute values of policy variables.
Numerous other functional forms are possible, with some of the more promising including autoregression, more comprehensive constructs representing past innovation or past independent conditions, modeling policies as influencing prices which influence innovation, and consideration of bidirectional causality. As such research proceeds, we will accumulate a better understanding of what drives renewable energy innovation.

c. Choosing and Using Proxies for Innovation

These results suggest that patent and article counts can both be used to investigate the drivers of innovation, with article counts sometimes revealing more than do patent counts. The paths of patents and articles over time are sufficiently different to merit separate investigation and interpretation. Articles are substantially more numerous than patents for both wind and solar energy, which may account for the narrower error bounds of their responses to policy variables. Both patents and articles respond very similarly to research subsidies. This implies not only the clear impact of research subsidies themselves, but also that patents and articles both reflect some common underlying phenomenon (presumably innovation, or at least successful research), so it is probably appropriate to use them both as proxies for it. At the same time, only articles show significant responses to tax credits, demonstrating that article counts can reveal econometric relationships either nonexistent for patents or too subtle to be identified using patent counts.

d. Methods of Producing Article Counts

This study can serve as an example and source of guidance for constructing future article counts. In collecting articles, which database to draw them from is a particularly crucial choice. The choice between keyword-based article selection and modeling may have less influence on the resulting article counts if the chosen method is executed effectively, but still involves balancing effort and accuracy.
Modeling methods may be more dependable than keyword-based methods, although they are also likely to be more time-consuming. Further suggestions for such methods are listed above and could be improved upon with continued studies focused on article counting for social science applications.

Appendix. Patent Count Responses to a 10-Unit Increase in Subsidy or Other Regressor Variables

The results of the Poisson regressions of patent counts with subsidy and other variables are shown in tables below. These follow the same format as the results for article counts in Appendix G of Essay 1, with the exception of showing significance at the 0.25% and 1% levels here instead of the 0.25% and 5% levels as in the previous essay. Highlighted cells and double asterisks identify statistically significant results at the 0.25% level, and single asterisks at the 1% level, for each variable. The 0.25% level is chosen in order to guarantee 5% significance across all ten combinations of policy variables and time lags for solar, or for wind. 1% guarantees significance across the five lag times considered. The regressions were run separately for solar and wind patent counts as described above.

Values shown in the tables are not raw coefficients but instead percentage increases resulting from a ten-unit change in the predictor variable. For research subsidies, the New Technology Credits and GDP, ten units is $10 million. For the smoothed lag of article counts, ten units is ten articles. For renewable energy consumption, ten units is ten million Btu and for both electricity prices and fossil fuel prices it is ten dollars per million Btu. While some of these units are difficult to compare with each other, they produce readily interpretable results for individual variables, including the two policy variables of interest, research subsidies and New Technology Credits. Rows for these variables are the source of the patent count results reported in Table IX.
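The conversion from a raw coefficient to the percentage increases reported in these tables can be sketched as below, assuming the log-link Poisson form used in this essay, so that a ten-unit change multiplies the predicted count by the exponential of ten times the coefficient. The function names and the coefficient and standard error values are hypothetical and are not taken from the regression output.

```python
import math


def percent_change(coef, units=10.0):
    """Percent change in the predicted count after a `units` increase in a regressor."""
    return 100.0 * (math.exp(coef * units) - 1.0)


def percent_change_range(coef, se, units=10.0):
    """Range from the coefficient minus and plus one standard error."""
    return percent_change(coef - se, units), percent_change(coef + se, units)


# Hypothetical coefficient and standard error per $1 million of research subsidy.
coef, se = 0.0062, 0.0024
low, high = percent_change_range(coef, se)
print(f"{percent_change(coef):.1f}% ({low:.1f}% ~ {high:.1f}%)")
```

Note that the resulting range is not symmetric around the central estimate, since the exponential is applied after adding or subtracting the standard error.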
The results shown in the tables are calculated as follows. The baseline prediction for some set of regressor values at time $t$ is

$$\hat{y}_t = \exp\left(\beta_0 + \beta_1 \bar{y}_{t-l} + \beta_S S_{t-l} + \dots\right) \qquad (15)$$

where $\bar{y}_{t-l}$ is the lagged count, $l$ is the lag in months and the dots stand for the remaining regressors. After a ten-unit increase in some regressor, say research subsidies $S$, the prediction is

$$\hat{y}_t^{+10} = \exp\left(\beta_0 + \beta_1 \bar{y}_{t-l} + \beta_S (S_{t-l} + 10) + \dots\right) = \hat{y}_t \, e^{10 \beta_S} \qquad (16)$$

The percentage increase, then, is

$$\frac{\hat{y}_t^{+10} - \hat{y}_t}{\hat{y}_t} = e^{10 \beta_S} - 1 \qquad (17)$$

That is, the percent increase is simply the raw coefficient multiplied by the contemplated ten units and then exponentiated, minus one. Ranges based on standard errors are calculated similarly, using the coefficient plus or minus the standard error. The fact that these results do not depend on the starting values of either the patent counts or the regressors makes percentage increase a logical statistic to report for the Poisson and related functional forms.

Predicted Percent Changes in Solar Patent Counts After a 10-Unit Increase in an Independent Variable

Production Time | 7 months | 1 year | 2 years | 3 years | 5 years
Constant | < 10^-5 * | < 10^-5 | < 10^-5 | < 10^-5 | < 10^-5
Lag of Patent Counts | 26% (16% ~ 36%)* | 8.9% (0.035% ~ 18.5%) | -19% (-25% ~ -11%) | -0.43% (-10% ~ 10%) | -8.8% (-20% ~ 4.3%)
Solar Research Subsidies | 0.55% (-0.34% ~ 1.4%) | 6.4% (3.9% ~ 8.9%)* | 9.3% (6.4% ~ 12%)** | 0.26% (-2.9% ~ 3.5%) | -6.5% (-9.9% ~ -3.1%)
New Technology Credits | -0.58% (-0.84% ~ -0.33%) | -0.96% (-1.2% ~ -0.70%) | 0.037% (-0.21% ~ 0.29%) | 0.38% (0.061% ~ 0.69%) | 0.69% (0.17% ~ 1.2%)
Renewable Energy Consumption | 0.33% (-0.11% ~ 0.77%) | 0.44% (-0.080% ~ 0.95%) | 0.26% (-0.23% ~ 0.76%) | 1.5% (0.9% ~ 2.0%) | -1.1% (-1.7% ~ -0.60%)
Electricity Prices | -80% (-87% ~ -70%) | 1.7% (-35% ~ 59%) | 34% (-22% ~ 130%) | 220% (91% ~ 440%) | 500% (230% ~ 990%)*
Fossil Fuel Prices | 76% (-9.5% ~ 240%) | 23% (-36% ~ 140%) | -50% (-78% ~ 10.8%) | -99% (-100% ~ -95%) | -7.7% (-78% ~ 300%)
GDP | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** | < 10^-5 ** | < 10^-5
Percent of Deviance Explained | 35% | 43% | 28% | 20% | 14%
Deviance Explained by Research Subsidies | 0.1% | 2.5% | 3.6% | 0.0% | 2.6%
Deviance Explained by New Technology Credits | 1.8% | 5.0% | 0.01% | 0.7% | 1.4%

Predicted Percent Changes in Wind Patent Counts After a 10-Unit Increase in an Independent Variable

Production Time | 7 months | 1 year | 2 years | 3 years | 5 years
Constant | < 10^-5 | < 10^-5 | < 10^-5 | < 10^-5 | < 10^-5
Lag of Patent Counts | 170% (100% ~ 260%)** | 130% (82% ~ 200%)** | 92% (49% ~ 150%) | 2.9% (-39% ~ 73%) | -86% (-92% ~ -76%)
Wind Research Subsidies | 21% (12% ~ 30%)* | 28% (19% ~ 38%)** | 27% (15% ~ 41%) | 21% (9.4% ~ 34%) | -17% (-27% ~ -5.4%)
New Technology Credits | -0.96% (-1.6% ~ -0.29%) | -1.5% (-2.1% ~ -0.96%) | -1.9% (-2.5% ~ -1.2%) | -1.5% (-2.3% ~ -0.71%) | 0.39% (-1.1% ~ 1.9%)
Renewable Energy Consumption | -3.1% (-3.8% ~ -2.3%) | -1.2% (-2.0% ~ -0.43%) | -1.5% (-2.3% ~ -0.72%) | -0.66% (-1.5% ~ 0.16%) | 0.40% (-0.64% ~ 1.5%)
Electricity Prices | 160% (4.8% ~ 540%) | -74% (-89% ~ -39%) | -93% (-97% ~ -83%) | -95% (-98% ~ -86%) | -49% (-82% ~ 44%)
Fossil Fuel Prices | -20% (-78% ~ 170%) | -0.0036% (-71% ~ 240%) | -88% (-97% ~ -57%) | -93% (-99% ~ -52%) | -18% (-93% ~ 830%)
GDP | < 10^-5 | < 10^-5 * | < 10^-5 | < 10^-5 | < 10^-5 **
Percent of Deviance Explained | 58% | 56% | 39% | 29% | 32%
Deviance Explained by Research Subsidies | 2.3% | 3.5% | 1.3% | 1.0% | 1.5%
Deviance Explained by Tax Credit | 1.0% | 2.6% | 3.4% | 1.8% | 0.05%

Bibliography

Andersen, Birgitte. 1999. "The Hunt for S-Shaped Growth Paths in Technological Innovation: A Patent Study." Journal of Evolutionary Economics, 9(4), 487-526.
Archambault, Éric; Julie Caruso; Grégoire Côté and Vincent Larivière. 2009. "Bibliometric Analysis of Leading Countries in Energy Research," B. Larsen and J. Leta, Proceedings of the 12th International Conference of the International Society for Scientometrics and Informetrics (ISSI) – Peer-Reviewed Conference. Rio de Janeiro: BIREME/PAHO/WHO and Federal University of Rio de Janeiro, 80-91.
Baskurt, Oguz. 2011. "Time Series Analysis of Publication Counts of a University: What Are the Implications?" Scientometrics, 86(3), 645-56.
Bengisu, Murat and Ramzi Nekhili. 2006. "Forecasting Emerging Technologies with the Aid of Science and Technology Databases."
Technological Forecasting and Social Change, 73(7), 835‐44.
Bhattacharya, S. and M. Meyer. 2003. "Large Firms and the Science‐Technology Interface: Patents, Patent Citations, and Scientific Output of Multinational Corporations in Thin Films." Scientometrics, 58(2), 265‐79.
Boyack, Kevin W.; Jeffrey Y. Tsao; Ann Miksovic and Mark Huey. 2008. "International Trends in Solid‐State Lighting: Analyses of the Article and Patent Literature." Sandia Laboratory, SAND2008‐4564.
Braun, F.; J. Schmidt‐Ehmcke and P. Zloczysti. 2010. "Innovative Activity in Wind and Solar Technology: Empirical Evidence on Knowledge Spillovers Using Patent Data."
Bureau of Economic Analysis. August 2010. "Section 1 ‐ Domestic Product and Income, Table 1.1.5. Gross Domestic Product." National Economic Accounts All NIPA Tables, http://www.bea.gov/national/nipaweb/SelectTable.asp (accessed February 1, 2011).
Bureau of Labor Statistics. 2011. "Consumer Price Index ‐ All Urban Consumers." http://data.bls.gov/pdq/SurveyOutputServlet (accessed February 1, 2011).
Colatat, Phech; Georgeta Vidican and Richard K. Lester. 2009. "Innovation Systems in the Solar Photovoltaic Industry: The Role of Public Research Institutions." Industrial Performance Center Working Paper Series, Massachusetts Institute of Technology, MIT‐IPC‐09‐007.
Dalal, Siddhartha R.; Paul G. Shekelle; Susanne Hempel; Sydne J. Newberry; Aneesa Motala and Kanaka D. Shetty. 2012. "A Pilot Study Using Machine Learning and Domain Knowledge to Facilitate Comparative Effectiveness Review Updating." Medical Decision Making, 33(3), 343‐55.
Dawelbait, G.; T. Mezher; Woon Wei Lee and A. Henschel. 2010. "Taxonomy Based Trend Discovery of Renewable Energy Technologies in Desalination and Power Generation," Technology Management for Global Economic Growth (PICMET), 2010 Proceedings of PICMET '10. 1‐8.
Dooley, James J. 2000. "A Short Primer on Collecting and Analyzing Energy R&D Statistics."
PNNL‐13158, http://www.osti.gov/energycitations/servlets/purl/896784‐BBDxQs/ (accessed July 6, 2013).
Elsevier. 2010. "About Engineering Village." http://www.engineeringvillage2.org/ (accessed May 15, 2010).
Energy Information Administration. 2008. "Federal Financial Interventions and Subsidies in Energy Markets 2007." SR/CNEAF/2008‐01.
Energy Information Administration. 2010. "Table 9.9 Average Retail Prices of Electricity and Table 10.1 Renewable Energy Production and Consumption by Source." Monthly Energy Review, http://www.eia.doe.gov (released March 31, 2010).
Garg, K.C. and Praveen Sharma. 1991. "Solar Power Research: A Scientometric Study of World Literature." Scientometrics, 21(2), 147‐57.
Genkin, Alexander; David D. Lewis and David Madigan. 2007. "Large‐Scale Bayesian Logistic Regression for Text Categorization." Technometrics, 49(3), 291‐304.
Godin, Benoît. 1996. "Research and the Practice of Publication in Industries." Research Policy, 25(4), 587‐606.
Griliches, Zvi. 1990. "Patent Statistics as Economic Indicators: A Survey." Journal of Economic Literature, 28(4), 1661‐707.
Gross, P. L. K. and E. M. Gross. 1927. "College Libraries and Chemical Education." Science, 66(1713), 385‐89.
Hausman, Jerry; Bronwyn H. Hall and Zvi Griliches. 1984. "Econometric Models for Count Data with an Application to the Patents‐R & D Relationship." Econometrica, 52(4), 909‐38.
Hlavka, Eileen. 2011. "U.S. Subsidy Effects on Production of Solar Technology Research: Using a New Measure to Assess Policy Impacts," 2011 IEEE First Conference on Clean Energy and Technology (CET). 259‐64.
Hopkins, Daniel J. and Gary King. 2010. "A Method of Automated Nonparametric Content Analysis for Social Science." American Journal of Political Science, 54(1), 229‐47.
Jaffe, Adam B. 1986. "Technological Opportunity and Spillovers of R & D: Evidence from Firms' Patents, Profits, and Market Value." The American Economic Review, 76(5), 984‐1001.
Joachims, Thorsten. 1998. "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," C. Nédellec and C. Rouveirol, Machine Learning: ECML‐98. Berlin/Heidelberg: Springer, 137‐42.
Johnstone, Nick; Ivan Haščič and David Popp. 2009. "Renewable Energy Policies and Technological Innovation: Evidence Based on Patent Counts." Environmental and Resource Economics, 45(1), 133‐55.
Kajikawa, Yuya; Junta Yoshikawa; Yoshiyuki Takeda and Katsumori Matsushima. 2008. "Tracking Emerging Technologies in Energy Research: Toward a Roadmap for Sustainable Energy." Technological Forecasting and Social Change, 75(6), 771‐82.
Long, Rebecca; Aleta Crawford; Michael White and Kimberly Davis. 2009. "Determinants of Faculty Research Productivity in Information Systems: An Empirical Analysis of the Impact of Academic Origin and Academic Affiliation." Scientometrics, 78(2), 231‐60.
Luwel, M. and H. Moed. 1998. "Publication Delays in the Science Field and Their Relationship to the Ageing of Scientific Literature." Scientometrics, 41(1), 29‐40.
Madigan, David; David Lewis; Alex Genkin; Shenzhi Li; Bing Bai; Dmitriy Fradkin; Michael Hollander and Vladimir Menkov. 2009. "Bayesian Logistic Regression (BBR, BMR, BXR)." www.bayesianregression.org (accessed August 2009).
Mansfield, Edwin. 1995. "Academic Research Underlying Industrial Innovations: Sources, Characteristics, and Financing." The Review of Economics and Statistics, 77(1), 55‐65.
Marinova, Dora. 2008. "Renewable Energy Technologies in Asia: Analysis of US Patent Data," H. Cabalu and D. Marinova, Second International Association for Energy Economics IAEE Asian Conference. Perth, WA: Curtin University of Technology: 193‐204.
Mason, R.; B. McInnis and S. Dalal. 2012. "Machine Learning for the Automatic Identification of Terrorist Incidents in Worldwide News Media," 2012 IEEE International Conference on Intelligence and Security Informatics (ISI). 84‐89.
Mock, Peter and Stephan A. Schmid. 2009. "Fuel Cells for Automotive Powertrains—a Techno‐Economic Assessment." Journal of Power Sources, 190(1), 133‐40.
Nemet, Gregory F. 2006. "Beyond the Learning Curve: Factors Influencing Cost Reductions in Photovoltaics." Energy Policy, 34(17), 3218‐32.
Nemet, Gregory F. 2009. "Demand‐Pull, Technology‐Push, and Government‐Led Incentives for Non‐Incremental Technical Change." Research Policy, 38(5), 700‐09.
Nemet, Gregory F. and Daniel M. Kammen. 2007. "U.S. Energy Research and Development: Declining Investment, Increasing Need, and the Feasibility of Expansion." Energy Policy, 35(1), 746‐55.
OECD/IEA. 2010. "RD&D Budgets, Group III: Renewable Energy Sources, III.1 Total Solar Energy." Energy Technology RD&D 2010 Edition.
Office of Management and Budget. For each of Fiscal Years 1996‐2011. "Analytical Perspectives: Budget of the U.S. Government."
Pavitt, K. 1985. "Patent Statistics as Indicators of Innovative Activities: Possibilities and Problems." Scientometrics, 7(1), 77‐99.
Perry, Thomas D. IV; Mackay Miller; Lee Fleming; Kenneth Younge and James Newcomb. 2011. "Clean Energy Innovation: Sources of Technical and Commercial Breakthroughs." NREL/TP‐6A20‐50624.
Popp, David. 2002. "Induced Innovation and Energy Prices." American Economic Review, 92(1), 160‐80.
Popp, David. 2003. "Pollution Control Innovations and the Clean Air Act of 1990." Journal of Policy Analysis and Management, 22(4), 641‐60.
Popp, David; Ted Juhl and Daniel K. N. Johnson. 2004. "Time in Purgatory: Examining the Grant Lag for U.S. Patent Applications." Topics in Economic Analysis & Policy, 4(1).
Ryan, G.W. and H.R. Bernard. 2000. "Data Management and Analysis Methods," N. Denzin and Y. Lincoln, Handbook of Qualitative Research, 2nd Ed. Thousand Oaks, CA: Sage Publications, 769‐802.
Sanz‐Casado, Elias; J. Garcia‐Zorita; Antonio Serrano‐López; Birger Larsen and Peter Ingwersen. 2012. "Renewable Energy Research 1995–2009: A Case Study of Wind Power Research in EU, Spain, Germany and Denmark."
Scientometrics, 1‐28.
Sebastiani, Fabrizio. 2002. "Machine Learning in Automated Text Categorization." ACM Computing Surveys, 34(1), 1‐47.
Sinha, Bikramjit. 2011. "Trends in Global Solar Photovoltaic Research: Silicon Versus Non‐Silicon Materials." Current Science, 100(5).
Taylor, Margaret; Dorothy Thornton; Gregory Nemet and Michael Colvin. 2006. "Government Actions and Innovation in Environmental Technology for Power Production: The Cases of Selective Catalytic Reduction and Wind Power in California." California Climate Change Center Report Series, CEC‐500‐2006‐053.
Thomson Reuters. 2010. "Web of Science (R) ‐with Conference Proceedings." ISI Web of Knowledge [v.4.10], http://apps.isiknowledge.com/ (accessed November 10, 2010).
Tijssen, R.; R. Buter and Th. van Leeuwen. 2000. "Technological Relevance of Science: An Assessment of Citation Linkages between Patents and Research Papers." Scientometrics, 47(2), 389‐412.
U.S. Patent and Trademark Office Patent Technology Monitoring Team. 2012. "U.S. Patent Custom Data Extracts DVD." Data covers January 1, 1975 to December 31, 2011.
United States Patent and Trademark Office. 2007. "Patent Class 416. Fluid Reaction Surfaces (i.e., Impellers)."
United States Patent and Trademark Office. 2012. "USPTO Patent Full‐Text and Image Database." http://patft.uspto.gov/netahtml/PTO/search‐adv.htm (accessed August 27, 2012).
Uzun, Ali. 2002. "National Patterns of Research Output and Priorities in Renewable Energy." Energy Policy, 30(2), 131‐36.
Vidican, Georgeta; Wei Lee Woon and Stuart Madnick. March 15, 2009. "Measuring Innovation Using Bibliometric Techniques: The Case of Solar Photovoltaic Industry." Working Paper CISL# 2009‐05, http://hdl.handle.net/1721.1/65941 (accessed July 6, 2013).
World Intellectual Property Organization. 2009. "Patent‐Based Technology Analysis Report: Alternative Energy Technology."
http://www.wipo.int/export/sites/www/patentscope/en/technology_focus/pdf/landscape_alternative_energy.pdf (accessed July 6, 2013).
Yu, Xiaojing; Jianping Cao; Yudong Cai; Tieliu Shi and Yixue Li. 2006. "Predicting rRNA‐, RNA‐, and DNA‐Binding Proteins from Primary Structure with Support Vector Machines." Journal of Theoretical Biology, 240(2), 175‐84.
Zelnio, Ryan. 2012. "Identifying the Global Core‐Periphery Structure of Science." Scientometrics, 91(2), 601‐15.

Essay 3. A Thousand Thin Film Flowers Perennially Blooming: Article Counts as a Method of Comparing Monocrystalline Silicon and Thin Film Solar Research and Examining the Influence of Public Policy

Abstract

For several decades, researchers have sought to make solar cells competitive with the average electricity source, and to use thin film solar to do so. This study quantifies the extent to which solar research has focused on thin film rather than monocrystalline silicon, the current dominant solar technology. Several related methods of quantifying published research in these areas are compared. All methods considered show that thin film has received the bulk of research throughout recent decades. In addition, I contextualize these results and offer a preliminary examination of whether public policies have affected the relative dominance of thin film. While U.S. research subsidies may have favored monocrystalline silicon, for the most part it appears that policies to encourage solar energy affect both research types equally. Thus, they are broadly consistent with the philosophy that public policies should “let a thousand flowers bloom.” These results may be of interest to solar researchers and policymakers alike.

For at least the past 15 years, it has seemed that the photovoltaic industry has been on the verge of making a transition to a ‘second generation’ of thin‐film solar cell technology.
‐Martin Green, 2001

Keywords

solar, generation, thin film, innovation, policy

1. Introduction

The impending dominance of thin film has been predicted for some time. Martin Green, a leading photovoltaics researcher since the early days of solar cells, has made such predictions while implicitly acknowledging that the transition has not been as fast as predicted (2001). Similar predictions have been made by others in the field (Björn A. Andersson and Staffan Jacobsson, 2000, A. Shah et al., 1999). While most of the solar market is currently held by the first generation technology, i.e. monocrystalline silicon, researchers hope and predict that thin film will not only become more cost‐effective than monocrystalline silicon but will become cost‐competitive with non‐solar energy sources. For example, Bagnall and Boreland predict it will be a “truly cost‐competitive energy supply” by 2020 (2008).

This belief, that thin film is “on the verge” of success, may help explain why the volume of thin film research has been high for so many years. Perhaps for this reason, thin film research articles have far outnumbered articles on monocrystalline silicon throughout recent decades. This study identifies the extent to which this has occurred by categorizing and counting tens of thousands of journal articles. The result is a rich data source on trends over time in the quantity of thin film and monocrystalline silicon research. These data may be useful for a variety of additional types of research. The multiple methods used to create them, including their strengths and weaknesses, are compared in some detail.

These article counts over time are then used to examine to what extent public policies may have affected how much research has focused on monocrystalline silicon vs. on thin film. In other contexts, public policies to encourage the use of a technology sometimes have been found to influence not the quantity of research on that technology, but the direction of that research (David Popp, 2003, regarding Clean Air Act effects on patents for air‐cleaning scrubbers).
Here, it appears that, of the two U.S. policies considered quantitatively, research subsidies for solar may have favored monocrystalline research. This relationship is observed only within the more applied of the two article databases considered, and suggests that research funding can be used to steer solar research. Otherwise, the remaining regressions and qualitative examination of article count trends do not suggest that the balance between monocrystalline silicon and thin film research was affected by the policies of the major solar‐supporting countries. Previous research using similar article counts shows that U.S. solar research and renewable energy production subsidies have encouraged solar research as a whole (see preceding essays). Thus, this preliminary analysis concludes that, with the exception of research funding, solar energy policies have tended not to favor monocrystalline silicon or thin film, but rather to encourage all solar research and allow non‐policy factors to determine the direction of research within solar energy.

This essay is organized as follows. The remainder of the background section gives more detailed definitions of the two technology types (1.1), and describes previous efforts to quantify research on them (1.2). This background is necessary to help the reader understand why the comparison between thin film and monocrystalline silicon is of interest, what the article counts do and do not mean, and how they compare to existing data of a similar nature.

The second section provides greater historical context. It covers how the performance of different solar cell types has improved over time (2.1) and the introduction of the major solar energy policies worldwide (2.2). These events, occurring during the same time period as the research focused on here, may have encouraged monocrystalline silicon or thin film research or both.

The third section shifts the focus to article counts.
Four approaches are used to identify articles: three based on Bayesian logistic classification of hand‐sorted article abstracts, and a fourth using hand‐selected keywords. Subsections describe the Bayesian classifier method (3.1), the sources of all article abstracts considered (3.2), the three Bayesian classifiers (3.3‐3.5), the keyword‐based method (3.6), how the steps of the methods come together to create article counts (3.7), and finally, an assessment of how successful those methods are at correctly identifying monocrystalline silicon and thin film articles (3.8). Such assessment has rarely been performed in past research and provides a measure of confidence in the resulting article counts.

The fourth section discusses the resulting article counts. How the totals compare with each other is noted (4.1), followed by more detailed examination of how that comparison appears relatively constant over time, and how it may or may not relate to public policies (4.2). A brief regression analysis with two U.S. policies is provided (4.3), suggesting that tax credits for renewable energy have not affected the balance between monocrystalline silicon and thin film research, although research subsidies for solar energy may have done so.

Section five concludes, reiterating the main findings of this essay with regard to thin film and monocrystalline silicon research trajectories (5.1) and their relationship with policy (5.2), including suggestions for future research.

1.1 Definitions and Histories of Monocrystalline Silicon and Thin Film

“Monocrystalline silicon” and “thin film” are the terms used here for the two broad categories of solar cell technology, distinguished by their materials. Monocrystalline silicon solar, also known as “flat panel,” “wafer” or “first generation” solar, is defined as using rigid panels of monocrystalline or multicrystalline silicon material on the order of 0.5 mm (500 microns) thick (M. A. Green, 2000).
This is the oldest solar cell material (Geoffrey A. Landis et al., 1996) and by far the most widely sold to date. It is worth noting that this material, typically made by growing large silicon crystals and then slicing them into “wafers,” is similar to silicon materials used for microelectronics and can be built using some of the same raw materials and equipment. This overlap has helped to reduce the costs of manufacture (Darren M. Bagnall and Matt Boreland, 2008, M. A. Green, 2000).

“Thin film” photovoltaic cells, typically thinner than one micron (M. A. Green, 2000), are substantially thinner than monocrystalline silicon and can be made with a variety of materials. Widely considered materials include amorphous or polycrystalline silicon and chalcogenides like cadmium telluride and copper indium gallium selenide (CIGS). Because they are so thin and thus use very little material, thin film cells are hoped to eventually be cheaper than monocrystalline silicon, even if made from more expensive substances. Thin films may have additional advantages if they are flexible (floppy), which may be possible with many candidate materials. The efficiency of thin film materials is generally lower than that of materials in the first category, but they are hoped to make up for that with their lower cost (G. F. Brown and J. Wu, 2009, Martin A Green, 2001). Thin films are often known as the “second generation” of solar materials.

For the present purposes, “thin film” will include the materials described above as well as another set often described as the “third generation” of solar cell technologies. While they include diverse methods, third generation concepts typically have in common the goal of achieving low cost and high efficiency.
They include dye‐sensitized solar cells (DSSC), other organic materials, nano‐scale and “quantum” materials, and “multijunction” solar cells made of multiple layers which may be any combination of the first, second or third‐generation materials already mentioned. These definitions of the two or three “generations” of solar photovoltaics are based on common definitions from the literature (Darren M. Bagnall and Matt Boreland, 2008, G. F. Brown and J. Wu, 2009, Gavin Conibeer, 2007, Aimee E. Curtright et al., 2008, Martin A Green, 2001).

A final category of solar technology uses gallium arsenide (GaAs) and related chemicals to achieve high efficiency. Termed III‐V materials for their place on the periodic table, they were developed and used on satellites after the first satellite use of monocrystalline silicon (Geoffrey A. Landis, Sheila G. Bailey and Michael F. Piszczor, 1996). III‐V materials are neither monocrystalline silicon nor thin film, except when used as part of multijunction third generation materials. They are highly efficient but expensive and constitute a small part of solar energy research, so they will be disregarded here.

The use of the term “generation” deliberately implies that each is newer than the previous, which is certainly true in terms of the commercially available technology. Nevertheless, the concepts behind thin‐film solar and even some third‐generation solar cells are roughly as old as the monocrystalline concept: early monocrystalline silicon cells were built in 1953, cadmium sulfide thin film cells were built in 1954, and tandem solar cells, a two‐layer type of multijunction cell, were proposed in 1955 (Joseph J. Loferski, 1993). The first thin film amorphous silicon of “acceptable quality” was made in 1969 by Chittick et al. (J. Carabe and J.J. Gandia, 2004, citing R. C. Chittick et al., 1969).
Even the possibility of producing more than one excited electron per incoming photon, which is the basis of certain third‐generation concepts, has been known since the 1960s (M. A. Green, 2000). Thus, one may expect traceable streams of research on both monocrystalline silicon and thin film solar to date from as early as the 1950s and potentially be well under way by the 1980s, which are the first years examined here.

1.2 Previous Solar Article Counts

Article counts are an increasingly common measure of research quantity for solar and other energy research (for example, see K.C. Garg and Praveen Sharma, 1991, Katarina Larsen, 2008, Thomas D. IV Perry et al., 2011, Bikramjit Sinha, 2011, Guo Ying et al., 2009). The studies most similar to the work described here are a few investigations which described changes over time in solar subcategories, although none of these include first‐generation or monocrystalline silicon as a category.

Bikramjit Sinha, comparing articles published in 1981‐1988 with those published in 2001‐2008, finds that the percentage of photovoltaic research which mentions silicon has remained high and roughly constant (36% in the first period, 34% in the second period), while the percentage which mentions non‐silicon materials is much smaller but has nearly doubled (9% to 17%). Silicon was not divided into subcategories, so it is unclear how much monocrystalline and thin film silicon each contributed to his “silicon” category. The remaining percentage mentioned neither silicon nor other materials, suggesting either that keyword searches were insufficiently inclusive or simply that the remaining research was not focused on materials (Bikramjit Sinha, 2011).

Another recent study uses keyword clustering to identify active areas of solar research and their trends over time (Yuya Kajikawa et al., 2008).
Within solar, they cluster articles by the words that appear in them and focus on the five largest clusters, naming them “silicon,” “compounds,” “dye‐sensitized,” “organics” and “GaAs.” “Compounds” seems to focus on chalcogenides and include heterojunctions. They too do not subdivide within silicon, although the top keywords they find within it all appear to suggest thin film: “a‐Si,” “poly‐Si,” “microcrystalline,” and “efficiency.” They find that the silicon and GaAs categories have had relatively constant research quantities throughout 1980‐2005, while compounds, dye‐sensitized and organics have increased dramatically since 1980, with most of the increase occurring after 1990. Thus, their work is informative about trends within thin film rather than comparisons between monocrystalline silicon and thin film.

Additional studies have focused on other aspects of solar research articles. For example, recent analyses have targeted specific third‐generation technologies such as dye‐sensitized solar cells (Guo Ying, Huang Lu and A. L. Porter, 2009) and nanotechnology solar cells (Ying Guo et al., 2010). While drawing articles using narrower keywords than I use, these studies use detailed within‐text analysis to characterize the literature. More difficult‐to‐interpret reports include a cross‐national study that included “solar cells”, “solar energy” and “solar power plants” as distinct topics (K.C. Garg and Praveen Sharma, 1991) and an investigation of the role of U.S. federal research laboratories and nearby private research groups which was unfortunately limited by having only a small number of articles (Phech Colatat et al., 2009).

The present research contributes by contrasting the two major categories of solar energy research, monocrystalline silicon and thin film, and their emphasis over time.
While more or less consistent with existing findings regarding silicon and thin‐film research quantities over time, results are difficult to compare due to differing definitions of categories. That is, previous research has tended to include silicon‐based thin films in the same category as monocrystalline silicon, whereas the present study includes them in the thin‐film category. The present definitions are more consistent with the first‐generation‐versus‐thin‐film dichotomy discussed in the technical literature. Also unlike earlier renewable energy article counts, the article counts presented here combine keyword searching, hand sorting and Bayesian logistic classification to identify tens of thousands of articles, compare these results with results using simple keyword searches, and assess the effectiveness of the article selection methods.

2. Solar Historical Context

This essay examines monocrystalline silicon and thin film research during 1985‐2010. During that time, the maximum efficiencies of many types of monocrystalline silicon and thin film solar cells increased. At the same time, many countries instituted policies to support solar energy. Both of these trends may have contributed to enthusiasm for pursuing solar research, and could affect the relationship between monocrystalline silicon and thin film. This section provides historical context on the development of solar cell efficiencies and solar public policies.

2.1 Increasing Solar Cell Efficiencies

The history of solar cell efficiencies may suggest thin film has a high potential for improvement relative to monocrystalline silicon. Cell efficiency is defined as the percent of incoming light from the sun which the cell transforms into usable electricity. On the one hand, monocrystalline silicon still has the highest efficiencies without using GaAs.
On the other hand, many thin film technologies have achieved gradually higher efficiencies in recent years, implying that further improvements may be possible. When interpreting these results, it must be remembered that lower efficiencies are commercially acceptable if they come at lower costs, as may be the case for thin films. Taken together, solar efficiency results may be seen as implying that thin film could soon achieve efficiencies substantially higher than it has so far, and that could ultimately make it cheaper per unit of energy produced than monocrystalline silicon.

Several reports have tracked increases in solar cell efficiency. For example, Progress in Photovoltaics has published such a summary twice annually since 1993, listing the highest recorded solar cell efficiencies that conform to standards including being tested at designated testing centers and using a minimum cell area (Martin Green and Keith Emery, 1993a, 1993b, ... 2011b). These efficiencies are shown in Figure I (results for GaAs and concentrating solar are not shown). The National Renewable Energy Laboratory (NREL) has published a similar plot of such results over time with somewhat different standards for inclusion (National Renewable Energy Laboratory, October 2012). Since these are laboratory results, unfortunately one cannot easily identify the costs which would be associated with their production for sale.

Figure I. Highest Reported Photovoltaic Efficiencies by Type

Highest reported efficiencies for solar cells using various technologies, as compiled in Progress in Photovoltaics by Martin Green and collaborators. Standard errors are reported for a few of these results, in which cases they are within 0.3‐0.8%.
Monocrystalline silicon results (including multicrystalline, because it is also a rigid flat‐panel technology) are shown in squares in shades of orange, while thin films in the second‐generation category are shown in green downward‐pointing triangles and third‐generation technologies are shown in purple upward‐pointing triangles. Each shade of each color represents a subcategory of technologies. Insofar as efficiencies within each labeled category do not rise monotonically, this can be explained by further subcategories not shown. These further subcategories differ by technology and solar cell area, with larger areas typically having lower reported efficiencies.

As seen in Figure I, traditional silicon has achieved the highest efficiencies to date, with the exception of GaAs multijunction cells to be discussed below. What is considered the “monocrystalline silicon” category includes three subcategories (shown in light, medium and dark orange squares in Figure I): standard monocrystalline, multicrystalline, and thin crystalline (20 or 47 microns thick). All three are rigid flat panels made of silicon which is much thicker than thin‐film materials, thus having greater material costs. Monocrystalline is somewhat more efficient than multicrystalline, although both have improved slightly over time, with thin crystalline improving from being comparable with multicrystalline to being comparable with monocrystalline. For monocrystalline cells with small areas, efficiency improved from 23.1% in 1990 to 25% in 1999. These traditional silicon cells set the bar which thin film seeks to meet.

Thin film includes a wide variety of subcategories, each following its own upward trajectory. The first of these, like monocrystalline silicon, is based on silicon. Silicon thin film has gone from a high of 11.5% efficiency in 1981 to 16.7% in 2001.
Silicon thin films with larger areas, shown as the same light green in the figure, have produced lower but increasing efficiencies during 2001 through 2007. Chalcogenide efficiencies have similarly improved over time: cadmium telluride (shown in medium green) from 15.8% in 1990 to 16.7% in 2001, with again lower efficiencies for larger cell areas tested; and CIGS (shown in dark green) surpassing it by going from 13.7% in 1992 to 19.4% in 2008, with results at or slightly below the trend for larger cell areas and rarer subcategories (CIGSES and CIGSS). Thin film researchers and supporters may be inspired by these consistently upward trends, suggesting that further improvement is feasible.

Third‐generation technologies’ efficiencies follow similar upward trajectories, with overall levels ranging from the lowest to the highest achieved by any technology type. The wide variety of third‐generation materials may mask the improvements occurring. For the purposes of labeling Figure I, third‐generation materials are grouped into four broad categories: multijunction, dye‐sensitized, nanocrystalline silicon, and organic. The earliest third‐generation results shown are for multijunction cells, which functionally are multiple cells stacked on each other (results shown in lightest purple triangles). Within multijunction cells, achievements in the 25‐32% range reflect cells using GaAs, and those in the 6‐15% range reflect cells using silicon and other materials. Note that non‐multijunction results using GaAs are not shown, since GaAs materials are expensive and therefore not usually considered for wide applications, while multijunction GaAs cells are included here for comparison with other multijunction results.

Further third‐generation categories show similarly varying efficiencies and rates of improvement. Dye‐sensitized cells show consistent improvement: 6.5% in 1997 has risen to 10.4% in 2005, with some variations due to materials and size.
Nanocrystalline silicon alone has hovered around 10.1% during at least 1997 to 2009. Although organic cells have the lowest efficiencies reported here, they show dramatic improvements in recent years, from 3% to 5.15% from March to December of 2006 and for larger areas, 1.1% to 3.5% from 2008 to 2009. While it is impossible to deduce which technology holds the most promise from these trajectories alone, together they clearly show efficiencies increasing with time. Because of the variations among technology types, it is probably impossible to compare these efficiencies with article counts in a meaningful quantitative manner. What efficiency data do suggest is the promise of thin film research. Thin film solar takes many forms, and research on many of them has succeeded in increasing their efficiencies. Monocrystalline silicon, in contrast, has shown slower improvement than almost all thin films, suggesting research on it has been less plentiful and/or less successful. With its higher absolute efficiencies already achieved, monocrystalline may be a less promising area of research while simultaneously providing a high efficiency target for thin film to try to match. If thin film materials are cheaper, they may not need to match it in efficiency in order to compete for market share. These considerations may help explain why thin film research output, as measured in published articles, is substantially higher than monocrystalline silicon research output. 2.2 Solar Public Policies Many countries and local jurisdictions have implemented public policies to support renewable energy in general and solar energy in particular. These include solar roof programs and subsidies for investing in and producing renewable electricity, with the U.S., Germany and Japan having some of the largest programs. If monocrystalline silicon, thin film, or other solar energy research increase dramatically at particular times, it may be attributable to one of these policies. 
So-called solar roof programs provide support for residential customers to install solar cells on their roofs. Japan started a pilot program of 200 solar roofs in 1986, then a nationwide subsidy for solar roofs in 1993 (M. A. Green, 2000), adding in 1997 the ability of participants to sell power back to the grid (known as “net metering”; Joanna Lewis et al., 2009). The European Union and the United States each set a million-solar-roofs target in 1997 for 2010 (M. A. Green, 2000), with California setting a similar goal for itself in 2004 for 2015 (Joanna Lewis, Amber Sharick and Tian Tian, 2009). Such programs are lauded by photovoltaic researchers as an effective way to increase solar energy deployment (M. A. Green, 2000).

Price supports for solar or renewable energy are another common way to pursue the same goal. Germany has some of the largest, in the form of a feed-in tariff which guarantees that renewable electricity providers will receive a certain price for their electricity for a given length of time. The tariff was created in 2000, increased in 2004, and decreased in 2008 (Joanna Lewis, Amber Sharick and Tian Tian, 2009). The United States has provided tax credits for renewable energy investment and production on and off since at least 1978, with reductions in 1986 and annual extensions of the business tax credit for renewables from 1988 to 1992, when it was made permanent (Energy Information Administration, September 1999).

In addition to incentivizing solar installations, federal governments have played a direct role in solar research. Japan has funded photovoltaic research since 1973, with a reorganization in 1997 at the same time as net metering was implemented. Its Cool Earth Initiative in 2007 identified “high-efficiency, low-cost solar cells” as one of its goals (Joanna Lewis, Amber Sharick and Tian Tian, 2009).
In the United States, a variety of programs have supported traditional, thin-film and concentrating solar, such as the amorphous silicon project which ran during 1982-1992, the Thin Film Partnership with industry formed in 1994, and the Concentrator Alliance formed in 1997 (Vicki Norberg-Bohm, 2000). Total research funding for solar by country and year is reported jointly by the Organization for Economic Co-operation and Development and the International Energy Agency (OECD/IEA, 2011). The United States and the European Union have spent roughly the same amount on renewable energy research in the last 20 years (K. Blok, 2006).

These are some of the most significant of the numerous policy supports for which solar energy is eligible. Previous research on other environmental innovation has found that subsidy policies can affect not only the net quantity but also the direction of research. For example, the U.S. market for sulfur dioxide appears to have shifted the emphasis of patents for pollution reduction technologies (scrubbers) towards increasing effectiveness and away from lowering prices (David Popp, 2003). However, most solar-supporting policies, with the exception of research funding, are likely to be agnostic of the type of solar used. For monocrystalline silicon and thin film research, this appears to be the case.

3. Identifying Articles

In this essay I focus on published journal articles as a measure of the quantities of monocrystalline silicon and thin film research over time. Articles on monocrystalline silicon and thin film are identified in four ways, which share some steps. All of the methods start with abstracts drawn from two academic journal abstract databases by searching for the term “solar.” The first three approaches are variations on the use of Bayesian classifiers, built using hand-sorted article abstracts. These methods differ only in how the results of the classifiers are cumulated to form monthly article counts.
The fourth approach uses keywords chosen a priori rather than by a model. It is the approach most often taken in previous work, and is conducted here primarily for the purposes of comparison.

3.1 Constructing Bayesian Classifiers

Bayesian logistic classifiers are a type of Bayesian regression model used to sort data into categories. A small sample of data for which the categories are known is used to create the classifier, which is then applied to the remaining data. In this case, the data are journal article abstracts, the regressors are words which appear in them, and the categories are types of solar or other research.

The data with known categories are a random sample of articles which are sorted by hand by two readers. Hand-classification of texts, using a standardized description of categories, is a research process often used by itself in other contexts such as anthropology or medicine (for more on the hand classification method, see G.W. Ryan and H.R. Bernard, 2000). In this case, two readers categorized each abstract independently in order both to catch errors and to create a measure of consistency, namely, the difference between their results. The readers then consulted and agreed upon a final classification for each abstract on which they did not agree initially.

These hand-sorted data, referred to as training data, are used to train (i.e., build) a Bayesian logistic classifier. The Bayesian functional form with a Laplace prior is chosen in order to minimize overfitting the model (for more on Bayesian logistic classification, see Alexander Genkin et al., 2007). The regressors are the number of times each word appears in each abstract, after various adjustments: summing together words with the same root or “stem”; removing generic common words, one-letter words, and words which appear fewer than a minimum number of times; and applying cosine normalization (Amit Singhal et al., 1996) to counteract the effect of longer abstracts having more words.
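The word-count adjustments described above can be illustrated with a minimal sketch. It is not the dissertation's actual pipeline: the stop-word list, the crude suffix stripper standing in for a real stemmer, and the function names are all assumptions made for illustration. Only the overall sequence (stem, drop common and one-letter words, apply a minimum total frequency, cosine-normalize) follows the text.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; the source does not say which list was used.
STOP_WORDS = {"the", "of", "and", "a", "in", "to", "is", "for", "are", "on"}


def crude_stem(word):
    """Very rough suffix stripping, a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(abstract):
    """Lowercase, split into words, drop stop words and one-letter words, stem."""
    words = re.findall(r"[a-z0-9]+", abstract.lower())
    return [crude_stem(w) for w in words if len(w) > 1 and w not in STOP_WORDS]


def build_features(abstracts, min_total_freq=2):
    """Return (vocabulary, cosine-normalized count vectors)."""
    counts = [Counter(tokenize(a)) for a in abstracts]
    totals = Counter()
    for c in counts:
        totals.update(c)
    # Keep only stems whose total count across all abstracts meets the minimum.
    vocab = sorted(w for w, n in totals.items() if n >= min_total_freq)
    vectors = []
    for c in counts:
        raw = [c[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw))
        # Cosine normalization: scale each vector to unit length so that
        # long abstracts do not dominate simply by containing more words.
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vocab, vectors
```

The resulting vectors would then be passed to whatever logistic model is fitted. A Laplace prior on the coefficients corresponds to an L1 penalty at the posterior mode, which is what pushes many word weights to exactly zero.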
Cross-validation is used to inform the choice of these parameters. For more on these classifier specifications, see earlier articles in this dissertation.

Three such Bayesian logistic classifiers are constructed and used here, each addressing only two mutually exclusive categories. The first classifier identifies solar energy research (vs. other research), the second narrows down to monocrystalline silicon and thin film research (vs. other solar energy research), and the third and final model differentiates between monocrystalline silicon and thin film research. This multi-step approach maximizes the amount of data available to create the earlier classifiers and minimizes the potential problem of categories which are not analogous enough to each other for a classifier to effectively differentiate among them. The creation, assessment and application of each of these classifiers are described in further detail below.

3.2 Data Sources

The initial set of potential articles is identified by searching for the term “solar” in the abstracts of articles published in 1980-2010 and indexed in two journal article databases: ISI Web of Science and Compendex Engineering Village. Web of Science is a large database which targets a broad academic audience and includes social science and ‘early’ lab research topics as well as more applied research (Thomson Reuters, 2010). Engineering Village is a smaller database which focuses on applied research (Elsevier, 2010), including the chemistry and physics research involved in solar panels. When constructing the article counts, articles from each database are considered separately in order to enable comparison between databases. The topical difference between them may help explain the different results found depending on which database is used.

The initial set of abstracts consists of 108,456 abstracts from Engineering Village and 171,447 abstracts from Web of Science.
From these, conference abstracts are removed because of likely redundancy with other abstracts, and because their content is likely to be of lower value, on average. Undated abstracts are removed because they cannot be used for the article counts. Each abstract includes the bibliographic information (title, author, journal, etc.) as well as the abstract text, which in some cases is short and in some abstracts from early years is missing.

Of these abstracts, a random sample of 750 from Web of Science and 750 from Engineering Village are selected to serve as the training data for all three classifiers. This quantity was chosen in the hopes of including at least 100 abstracts in each of the categories of interest.

3.3 First Classifier: Solar Electricity v. Other Research

Of the abstracts which match the keyword “solar,” many are irrelevant to the present purposes, discussing topics such as botany, literature and sunspots. The first classifier differentiates these “irrelevant” abstracts from those whose topics relate to producing electricity from sunlight. In the training data, 472 of the abstracts are relevant and 1028 are irrelevant. These two categories are used to build a classifier which identifies relevance. A minimum word frequency of 50 is used in this classifier.

The relevance classifier appears successful: it performs roughly as well as the human readers. Ten-fold cross-validation suggests that the relevance classifier finds about 81% of the relevant articles (recall) and that, of the articles it predicts to be relevant, 84% are actually relevant (precision). Typically, performance of 70-80% is considered good (Kevin W. Boyack et al., 2008).
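For readers less familiar with these measures, the following minimal sketch shows how precision, recall and percent correct can be computed from predicted probabilities and hand-sorted labels, assuming the 50% cutoff that the text uses for relevance. The function names and toy numbers are illustrative assumptions, not values from the study.

```python
def confusion_counts(probabilities, truths, threshold=0.5):
    """Count true/false positives and negatives at a probability cutoff."""
    tp = fp = fn = tn = 0
    for p, truth in zip(probabilities, truths):
        predicted = p > threshold  # "relevant" if probability is above 50%
        if predicted and truth:
            tp += 1
        elif predicted and not truth:
            fp += 1
        elif truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall(probabilities, truths, threshold=0.5):
    """Return (precision, recall, percent_correct) as fractions."""
    tp, fp, fn, tn = confusion_counts(probabilities, truths, threshold)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    percent_correct = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, percent_correct


# Toy example: six abstracts with model probabilities and reader labels.
probs = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
labels = [True, True, False, True, False, False]
precision, recall, pct_correct = precision_recall(probs, labels)
```

In a ten-fold cross-validation, these quantities would be computed on each held-out fold and then averaged.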
Thus, although there is quite a bit of variation in precision and recall across the ten cross-validation runs, overall the classifier appears to perform quite well. [xii] While model performance is a bit lower than the readers’ measures of successfully identifying documents, readers’ performance should be high because it forms the definition of ‘correct’ except when the two readers disagree.

Table I. Relevance Article Classifier Performance Measures
Several measures show the performance of the classifier used to identify solar research articles. For the purposes of these calculations, an abstract is considered relevant if the model predicts its probability of relevance to be above 50%. Readers 1 and 2 are the analysts who read the subsample of articles on which the classifiers are based. Precision is the percentage of predicted relevant abstracts which are actually relevant; recall is the percentage of actually relevant abstracts which are predicted relevant by that person or classifier. AUC represents the area under the receiver operating characteristic curve, a measure of the model’s ability to achieve high precision and recall simultaneously. Variance, shown in parentheses, is across the ten cross-validation runs.

| Measure | Reader 1 | Reader 2 | Model Performance | Model Cross-Validation |
|---|---|---|---|---|
| Percent Correct | 98% | 93% | 99% | 89% (5%) |
| Precision | 96% | 88% | 99% | 84% (19%) |
| Recall | 97% | 88% | 99% | 81% (21%) |
| AUC | - | - | 100% | 95% (1%) |

The relevance classifier was applied to all abstracts. Those which it predicted to be relevant with a greater than 50% probability were retained for the next steps of analysis, while the remaining abstracts were not subjected to further classifiers. This left about 44,000 abstracts for subsequent analysis.

[xii] One may notice that Reader 2’s results are considerably worse than Reader 1’s. This is attributed mostly to poor training of this reader, including a lack of clarity in the relevance definition for gray areas such as solar salt ponds, here defined to be irrelevant.

3.4 Second Classifier: Monocrystalline Silicon and Thin Film v. Other Solar Research

The second classifier separates the two topics of interest from the remaining solar energy research. As discussed in the definitions section above, while monocrystalline silicon and thin film compose the bulk of photovoltaic research, there are other solar cell technologies (primarily III-V materials) which use neither monocrystalline silicon nor thin films. Solar energy research also includes technologies which are neither photovoltaic cells nor necessary for photovoltaics (that is, that use other ways of converting sunlight to electricity), such as reflectors for concentrating solar technology. The second classifier’s purpose is to identify all of these non-monocrystalline, non-thin film abstracts. While this may appear to be a less natural definition of categories than the previous one, it enables the first classification (relevance) to cleanly define solar vs. all other topics, and the last classification to focus on differentiating between the two topics of interest. Most abstracts deemed relevant are in the monocrystalline silicon or thin film categories, so it would be possible to skip this classifier, but it is preferable to include it in order to remove other types of solar research.

The 472 training data abstracts which describe solar technology are used to build this classifier. Of these, 371 discuss either monocrystalline silicon or thin film, and the remainder cover other topics. The minimum total word frequency across all abstracts used to build the classifier is set to sixty because this value performs relatively well in terms of precision and recall. Other model specifications are the same as for the relevance classifier.

This classifier also performs quite well. For all four measures listed, its average cross-validated performance is 90% or higher, well above the 70-80% threshold.
These high performance levels suggest that it is reasonable to expect the classifier to accurately identify which articles belong to which category.

Table II. Monocrystalline and Thin Film v. Not Article Classifier Performance
Performance of the model used to separate monocrystalline silicon and thin film photovoltaic research articles from other solar-related research. Definitions are as for Table I.

| Measure | Reader 1 | Reader 2 | Model Performance | Model Cross-Validation |
|---|---|---|---|---|
| Percent Correct | 96% | 96% | 93% | 90% (17%) |
| Precision | 99% | 97% | 94% | 92% (25%) |
| Recall | 98% | 99% | 97% | 95% (5%) |
| AUC | - | - | 95% | 91% (16%) |

This classifier was applied to all abstracts identified as relevant. The results are combined with the third classifier results in three ways, described in section 3.7.

3.5 Third Classifier: Monocrystalline Silicon v. Thin Film

Finally, monocrystalline silicon and thin film photovoltaic articles are separated using a similar process. Since our focus is on thin film as advancing a new technology distinct from the current widely used monocrystalline technology, the monocrystalline category here includes abstracts that discuss technology which might be useful to any type of solar panel, including monocrystalline and thin film panels. For example, these include automated control and assessment of solar panels or assessing the amount of solar energy available. This broad definition of “monocrystalline silicon” research thereby encompasses research which could be applied to building and using monocrystalline silicon photovoltaic systems without requiring thin film to be successful.

This definition offers several advantages. Conceptually, it focuses on the difference between research to enhance current technologies (monocrystalline) and research to replace them. Second, it favors monocrystalline and thus enables the resulting monocrystalline silicon to thin film article count ratio to be interpreted as an upper bound, strengthening the conclusion that thin film research is more plentiful.
Finally and most pragmatically, it yields a large pool of “monocrystalline” abstracts with which to build the classifier, thus improving the quality of the classifier. Without this inclusiveness, the number of abstracts in the monocrystalline silicon category would be too small to produce a relatively reliable classifier.

Of the hand-sorted training data, 127 abstracts address monocrystalline silicon and 244 address thin film, as the categories are defined here. These 371 abstracts were used to build the monocrystalline silicon/thin film classifier. Further model specifications such as the minimum word frequency were the same as for the second classifier for the same reasons.

Like the classifiers for solar and photovoltaic relevance, the monocrystalline silicon/thin film classifier has precision and recall levels within or above the desired 70-80% range. However, the variation in performance across cross-validation runs is much higher. This is likely due to the small number of monocrystalline silicon abstracts and means that the model’s performance may not be as stable as desired. While the high average performance suggests the model probably performs adequately enough to use, it is likely that future models could perform better.

Table III. Monocrystalline Silicon v. Thin Film Article Classifier Performance
Precision, recall and AUC for the classifier used to differentiate between monocrystalline silicon and thin film articles.

| | Percent Correct | Precision | Recall | AUC |
|---|---|---|---|---|
| Reader 1 | 95% | 93% | 95% | - |
| Reader 2 | 96% | 97% | 86% | - |
| Model Performance | 91% | 84% | 87% | 96% |
| Model Cross-Validation | 85% (48%) | 77% (162%) | 83% (151%) | 93% (21%) |

The monocrystalline silicon/thin film classifier is applied to each of the relevant abstracts, the same set of abstracts to which the second classifier was applied.

3.6 Identifying Monocrystalline Silicon and Thin Film Articles: Using Keywords

For comparison, a fourth set of time series is created using hand-selected keywords.
While the classifiers in effect choose and weight keywords based on their sorting effectiveness within the sample data, here keywords are chosen by the researcher based on their knowledge of the keywords and categories of interest. Abstracts are deemed to be in the monocrystalline silicon or thin film categories if they include the keywords associated with those categories in the combinations described in the queries in Table IV.

The chosen keywords refer to the defining chemical attributes of monocrystalline silicon and all the major types of thin film technology. The element silicon can be a part of either category, depending on whether it is monocrystalline, so the queries are designed to reflect that. Because thin film includes a larger number of variations, it has a larger number of keywords.

Note that unlike the definition of the “monocrystalline silicon” category used in the Bayesian logistic classifiers, the definition used here does not include more general solar research. To do so would require a large number of keywords, and it likely would be difficult to make them appropriately inclusive without resulting in an excessive number of extraneous results. Thus, one should expect the keyword-based method to yield substantially fewer “monocrystalline silicon” articles than the classifier-based methods because it is more narrowly defined.

Table IV. Queries for Keyword-Selected Articles
Queries used to select articles for keyword-based time series describing monocrystalline silicon and thin film research.
Monocrystalline Silicon:
(ANY OF("silicon", "si ", "c-Si") AND ANY OF("monocrystalline", "multicrystalline")) OR ANY OF("wafer", "Czochralski")

Thin Film:
ANY OF("thin film", "polycrystalline", "poly-Si", "a-Si", "polymorphous silicon", "pm-si", "microcrystalline", "a-C", "CIGS", "copper indium", "chalcogenide", "CuInS", "copper sulfide", "CuGaSe", "selenide", "CZTS", "ZnO", "zinc oxide", "CdTe", "CdS", "cadmium", "DSSC", "dye", "liquid junction", "TiO2", "SnO2", "ruthenium", "MEH-PPV", "fluorescent", "organic", "bulk heterojunction", "bulk-heterojunction", "polymer", "dendrimer", "oligomer", "P3HT", "PEDOT", "multijunction", "stack", "tandem", "triple-junction", "four-junction", "quantum", "nano", "tunnel", "hot carrier")

Next, the methods for constructing article counts using these classifiers and keywords will be described. Assessment of the keyword method will be included there.

3.7 Producing Article Counts

Monocrystalline silicon and thin film article counts for each month are produced in four ways. The first three of these are variations on the use of the Bayesian logistic classifiers, and the fourth is the keyword-based approach. Each of the approaches is applied to the training data for the purposes of assessment. Then, it is applied separately to all of the abstracts collected. These results are added up by month and database to produce article counts which reflect how the volumes of monocrystalline silicon and thin film research have evolved over time.

All three classifier-based approaches begin by applying the relevance classifier to all abstracts. Those abstracts it predicts to have a less than 50% chance of being relevant to solar energy are removed from further consideration. The remaining abstracts are each subjected to both the second and third classifiers.
Thus, every relevant abstract now has a predicted probability of being in one of the monocrystalline silicon or thin film categories, as opposed to other solar research, and a predicted probability of being in the monocrystalline silicon category rather than the thin film category. Writing M for the monocrystalline silicon category and TF for the thin film category, these probabilities can be written as:

$$P(M \cup TF) \quad \text{and} \quad P(M \mid M \cup TF).$$

The first or “probabilistic” approach uses these probabilities directly. The two probabilities predicted by the last two classifiers are multiplied to find the probability of the abstract being in the monocrystalline silicon category. That is, the abstract’s probability of being in the monocrystalline silicon category is calculated as

$$P(M) = P(M \cup TF) \cdot P(M \mid M \cup TF).$$

Similarly, the probability of the abstract being in the thin film category is

$$P(TF) = P(M \cup TF) \cdot P(TF \mid M \cup TF) = \bigl(1 - P(M \mid M \cup TF)\bigr) \cdot P(M \cup TF).$$

The resulting probabilities are summed by month of article publication. This approach may be the most obvious and should be accurate if the probabilities are accurate. It takes the classifier’s output literally as the probability that the abstract is monocrystalline silicon or is thin film, and weights the article’s contribution to both time series accordingly. This method most directly acknowledges the uncertainty in the classifier’s output, but has the possibility of producing time series that are very similar for the two technology types because each article contributes to both series. Since each article truly is in only one category, the results may misrepresent the distribution of research between the two categories.

The “threshold” approach avoids this problem by converting the probabilities into 0-1 variables before multiplying them. That is, each classifier is used to assign each abstract to its more likely category of the two considered by that classifier. Taken together, these classifiers thereby assign each abstract to a single category.
The abstracts assigned to the monocrystalline silicon category are counted by month to produce monocrystalline silicon article counts, and the abstracts assigned to the thin film category are counted by month to form thin film article counts. This approach, while coarser, avoids allowing each article to contribute to article counts for the category it is less likely to be in. It is called the “threshold” approach in reference to an abstract’s probability being above the 50% “threshold” in order for it to be assigned to a given category.

The “hybrid” approach is a combination of the first two approaches. It converts the predicted probabilities from the monocrystalline silicon and thin film vs. other classifier to a zero-one variable before multiplying them with the other classifier’s results. That is, it discards abstracts to which that classifier assigns a less than 50% chance of being monocrystalline silicon or thin film, and keeps the remaining abstracts without weighting them. Next, it apportions each remaining abstract between the monocrystalline silicon and thin film categories according to the probability predicted by the monocrystalline silicon vs. thin film classifier. These portions are summed to form the article counts. This method is primarily offered as a compromise between the probabilistic and threshold approaches.

The fourth method, the keyword approach, is simpler. Using the entire initial sets of abstracts from the two databases, abstracts are selected for the monocrystalline silicon keyword series using the monocrystalline silicon query described in Table IV in the keyword section above. Similarly, abstracts are selected for the thin film series if they include any of the thin film terms listed in Table IV. These abstracts are counted by month to form the “keyword” time series which will be compared with the probabilistic, hybrid and threshold time series.
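The four counting rules described in this section can be sketched compactly as follows. This is an illustration under stated assumptions rather than the code used in the study: the input layout (one tuple per relevant abstract holding its month and the two predicted probabilities), the function names, the handling of a probability of exactly 50%, the case-insensitive substring matching, and the ASCII hyphen in "c-Si" are all assumptions.

```python
from collections import defaultdict


def article_counts(abstracts, method):
    """Sum monthly [monocrystalline, thin film] counts.

    abstracts: iterable of (month, p_mtf, p_mono_given_mtf), where p_mtf is
    the second classifier's probability (mono or thin film vs. other) and
    p_mono_given_mtf is the third classifier's probability (mono vs. thin film).
    """
    counts = defaultdict(lambda: [0.0, 0.0])
    for month, p_mtf, p_mono in abstracts:
        if method == "probabilistic":
            # Multiply the two probabilities directly.
            mono = p_mtf * p_mono
            thin = p_mtf * (1.0 - p_mono)
        elif method == "threshold":
            # Convert both probabilities to 0/1 before multiplying.
            keep = p_mtf >= 0.5
            mono = 1.0 if keep and p_mono >= 0.5 else 0.0
            thin = 1.0 if keep and p_mono < 0.5 else 0.0
        elif method == "hybrid":
            # 0/1 on the second classifier, probabilities on the third.
            keep = 1.0 if p_mtf >= 0.5 else 0.0
            mono = keep * p_mono
            thin = keep * (1.0 - p_mono)
        else:
            raise ValueError(f"unknown method: {method}")
        counts[month][0] += mono
        counts[month][1] += thin
    return dict(counts)


def any_of(text, terms):
    """True if any term occurs in the text (case-insensitive substring match)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in terms)


def matches_mono_query(text):
    """Monocrystalline silicon query from Table IV."""
    return (
        any_of(text, ["silicon", "si ", "c-Si"])
        and any_of(text, ["monocrystalline", "multicrystalline"])
    ) or any_of(text, ["wafer", "Czochralski"])


# Toy input: three relevant abstracts in two months.
toy = [("2001-01", 0.9, 0.8), ("2001-01", 0.4, 0.3), ("2001-02", 0.7, 0.2)]
threshold_counts = article_counts(toy, "threshold")
```

Note how the second toy abstract, with only a 40% chance of being monocrystalline silicon or thin film, still adds fractional weight to both series under the probabilistic rule but contributes nothing under the threshold and hybrid rules.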
3.8 Assessing Article Count Methods

Comparison of the four article count approaches suggests that the three classifier-based approaches perform similarly, while the keyword approach gives less trustworthy results. Of the classifier approaches, the threshold approach most closely replicates the hand-sorted data. The keyword results for both monocrystalline silicon and thin film are far from the hand-sorted or “correct” results. For monocrystalline silicon this may be due to keywords using a much narrower definition of the category, but the definition of thin film used to choose keywords and to hand-sort the data was the same, suggesting that the keyword approach is performing poorly for identifying thin film abstracts.

A number of performance measures are calculated for each of the approaches. As when assessing the classifiers, all approaches are judged by how they compare to the hand-sorted data, since the latter is defined to be correct, and for this assessment only the training data is used. The three classifier approaches give results more similar to each other than to results for keyword counts, and will be discussed first. Their performance measure results are listed in Table V. Keyword results are described next. Finally, training data article counts from each of the four methods are bootstrapped to provide a more intuitive demonstration of their relative performance.

Table V. Classifier-Based Approaches’ Performance Measures
Performance measures for each of the classifier-based approaches used to categorize abstracts. Column titles refer to total abstracts in the category according to hand-classification; total abstracts in the category according to the approach used; sum of the absolute values of the residuals relative to the hand-classified data; sum of the squares of the residuals; precision; and recall.
While the values are similar for all approaches, the threshold approach performs best by all measures (including the predictions being most similar to the hand-coded results) except for the sum of the squares of the residuals.

Probabilistic Approach Performance Measures

| | In Cat | Pred in Cat | Resid Sum Abs | Resid Sum Sq | Precision | Recall |
|---|---|---|---|---|---|---|
| Monocrystalline | 127 | 109 | 83 | 51 | 70% | 60% |
| Thin Film | 244 | 261 | 112 | 72 | 75% | 81% |
| Total | 371 | 370 | 195 | 123 | NA | NA |

Hybrid Approach Performance Measures

| | In Cat | Pred in Cat | Resid Sum Abs | Resid Sum Sq | Precision | Recall |
|---|---|---|---|---|---|---|
| Monocrystalline | 127 | 112 | 80 | 55 | 71% | 63% |
| Thin Film | 244 | 262 | 103 | 78 | 77% | 83% |
| Total | 371 | 374 | 183 | 132 | NA | NA |

Threshold Approach Performance Measures

| | In Cat | Pred in Cat | Resid Sum Abs | Resid Sum Sq | Precision | Recall |
|---|---|---|---|---|---|---|
| Monocrystalline | 127 | 115 | 70 | 70 | 75% | 68% |
| Thin Film | 244 | 259 | 93 | 93 | 79% | 84% |
| Total | 371 | 374 | 163 | 163 | NA | NA |

The goal of all four approaches is to produce article counts accurately. Thus, one natural assessment method is to compare the number of training-data abstracts an approach assigns or “predicts” for one of the two categories (second column in Table V) with the total hand-sorted abstracts in that category (first column). By this measure, the threshold approach is most accurate, followed by the hybrid approach and finally the probabilistic approach. Below, these results will be bootstrapped, with similar results.

More traditionally, one can consider the residuals of each abstract’s predicted probability of being in a category. The sum of squares of the residuals, as well as the sum of absolute values of the residuals, are reported in Table V. Note that for the threshold method, these are the same because 0 and 1 square to themselves. By focusing on prediction of individual abstracts rather than their total, this may be less appropriate than the first assessment measure. The sums of absolute values yield similar conclusions to the previous approach, suggesting the threshold method performs best.
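The two residual measures can be made concrete with a small sketch. The function name and toy numbers are illustrative assumptions. Each residual is taken as the predicted membership (a probability, or 0/1 for the threshold approach) minus the hand-sorted membership (0 or 1).

```python
def residual_sums(predictions, truths):
    """Return (sum of absolute residuals, sum of squared residuals)."""
    residuals = [p - t for p, t in zip(predictions, truths)]
    sum_abs = sum(abs(r) for r in residuals)
    sum_sq = sum(r * r for r in residuals)
    return sum_abs, sum_sq


# Toy hand-sorted memberships for five abstracts in one category.
truth = [1, 0, 1, 1, 0]

# Fractional predictions, as in the probabilistic approach.
soft = residual_sums([0.9, 0.2, 0.6, 0.4, 0.1], truth)

# 0/1 predictions, as in the threshold approach: every residual is
# -1, 0 or 1, so the absolute and squared sums coincide.
hard = residual_sums([1, 0, 1, 0, 0], truth)
```

As a rough cross-check of the threshold rows in Table V: for monocrystalline silicon, 75% precision on 115 predicted abstracts implies about 86 correct assignments, hence about 29 false inclusions and 41 misses out of 127, which sum to the reported 70. The same arithmetic for thin film (about 205 correct, 54 false inclusions, 39 misses) gives the reported 93.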
Interestingly, the sums of squares give the opposite ordering of the three approaches: probabilistic first, hybrid next, and threshold last. Note that this is almost required by the way that classifiers are created: minimizing the sum of squares of residuals is part of the process of building each classifier. One implication of these findings is that if accurate article counts are the goal, rather than accurate assignment of individual abstracts, there may be preferable methods of building classifiers. Finally, precision and recall show the ability of each approach to avoid including abstracts not in the appropriate category and to find those abstracts which are in the category. They are calculated as the total article count correctly assigned by the method divided by the article count produced by the method or the hand‐sorted article count, respectively. As for the comparison of article count totals, precision and recall suggest that the threshold method is best, hybrid is next most effective and probabilistic is last. To further understand why the threshold approach might produce better article counts, it is useful to recall that every abstract truly belongs to only one category. Nevertheless, probabilistic and hybrid article count methods assign part of each abstract to the other, incorrect categories. If the amount they assign to the wrong categories systematically differs by category, which is likely if the categories are of different sizes, then the total article counts and ratios among them will be biased. Depending on the data and model structures, this effect theoretically could bias results in either direction. In this case, it appears to cause probabilistic article counts to bias in favor of thin film and against monocrystalline silicon. 
This effect must be interpreted with caution; later, it will become clear that this bias does not translate to probabilistic article counts giving a lower proportion of monocrystalline silicon when the methods are applied outside the training data.

Plotting the individual residuals of hybrid and probabilistic article counts shows this effect in more detail in Figure II. (Threshold residuals are not shown since they all equal 0 or 1.) In many cases, the two residuals are equal or close to equal. This is due to the classifier step which differentiates them producing probabilities at or near 1. In other cases, where the last classifier step produced probabilities close to 0 or 1, using the hybrid method gives residuals near -1, 0 or 1. Since this happens more often for residuals near 0 than residuals near -1 or 1, as seen in the figure, the hybrid method has lower average residuals and in that sense tends to perform better.

Figure II. Hybrid vs. Probabilistic Residuals
Residuals of the hybrid method vs. residuals from the probabilistic method. By definition, the two residuals will be equal when the monocrystalline silicon and thin film vs. other classifier produces a probability of 0 for being in the “other” category. The predominance of results falling along the diagonal shows that this is often the case. More broadly, this shows that the use of that classifier has little effect on the final article counts.

Despite the slight differences among them, all three classifier-based methods perform similarly and reasonably well. In many cases their performance measure values are not statistically different from each other. Their precision and recall are in the 60-84% range, while 70-80% is desirable. Their total article counts are within 1-14% of actual article counts. The threshold method appears to perform best, but only by a small amount.

Finally, the keyword method is assessed using the same measures, as shown in Table VI.
In addition, percent correct is used for comparison with each of the classifiers used in the classifier-based approaches. It is rare in the literature for keyword-based article count methods to be assessed, perhaps because it requires hand-sorting data to compare the keyword results with, so it is particularly valuable to conduct such assessment. By all measures except thin film recall, keywords perform poorly compared to the classifier-based methods.

Table VI. Keyword Approach Performance Measures
Performance measures for the article selection method which relies upon keywords chosen by the analyst. In addition to the measures used for the classifier approaches, percent correct is given in order to facilitate comparison with the individual classifier models. The keyword approach performs much worse than the classifier approaches, although for the monocrystalline silicon category this is attributable to keywords using a narrower definition. Because it requires reading many documents, and perhaps because the keyword method has been assumed to be effective, this type of assessment has rarely been carried out.

| | In Cat | Pred in Cat | Resid Sum Abs | Precision | Recall | Pct Correct |
|---|---|---|---|---|---|---|
| Monocrystalline | 127 | 32 | 123 | 56% | 14% | 1% |
| Thin Film | 244 | 429 | 223 | 52% | 92% | 15% |

In the case of monocrystalline silicon, comparing the hand-sorted and keyword results is comparing apples to oranges. The hand-sorted data were created defining the monocrystalline silicon category to include research which is broadly applicable to either monocrystalline silicon or thin film solar production, while for keywords only terms relating to monocrystalline silicon technology were used. Therefore, while monocrystalline silicon keyword performance measures are reported for consistency, they should not be interpreted as reflecting the failure of the keyword approach. In contrast, the definition of thin film was consistent for both hand-sorting and keyword selection.
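As a worked check on Table VI, the sketch below recomputes precision and recall from the reported counts. It rests on one assumption that is not stated in the text: that the summed absolute residuals equal the number of missed abstracts plus the number of wrongly included abstracts, which allows the correctly assigned count to be recovered. The function name `precision_recall` is illustrative only.

```python
def precision_recall(in_cat, pred_in_cat, resid_sum_abs):
    """Recover the correctly assigned count, then precision and recall.

    Assumption: with 0/1 assignments, the summed absolute residuals equal
    missed abstracts plus wrongly included abstracts, i.e.
    resid_sum_abs = (in_cat - correct) + (pred_in_cat - correct).
    """
    correct = (in_cat + pred_in_cat - resid_sum_abs) / 2
    precision = correct / pred_in_cat  # share of selected abstracts that belong
    recall = correct / in_cat          # share of true abstracts that were found
    return correct, precision, recall


# (In Cat, Pred in Cat, Resid Sum Abs) as reported in Table VI.
table_vi = {"Monocrystalline": (127, 32, 123), "Thin Film": (244, 429, 223)}
results = {name: precision_recall(*row) for name, row in table_vi.items()}
for name, (correct, precision, recall) in results.items():
    print(f"{name}: correct={correct:.0f}, precision={precision:.0%}, recall={recall:.0%}")
```

Under that assumption the recovered values (18 and 225 correctly assigned abstracts) reproduce the precision and recall percentages reported in the table.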
The results in Table VI therefore show that keyword-based thin film article counts are less likely to be accurate than classifier-based thin film article counts. In particular, the keyword approach overestimates the number of thin film articles. While successfully recalling most of the thin film abstracts, it does so at the expense of precision, with nearly half (48%) of the abstracts it identifies as thin film not actually being in the thin film category.

The final methodological assessment again focuses on article counts. Monocrystalline silicon and thin film article counts for each method are listed in the tables above, but the results for several methods are very similar. By bootstrapping these article counts, one can produce probability density curves showing how likely the approaches are to produce similar or different results. The bootstrapping results are the most reliable of the assessments conducted, and they reinforce many of the conclusions reached above.

Because of the nature of the classifier-based methods, there are several potential methods for bootstrapping. Using the classifier-based results directly may overstate their effectiveness, since the classifiers are more effective on the data used to build them than on other data, as seen in Tables I-III. Therefore, for the abstracts used to build each classifier, N-1 cross-validation results are used instead. For the remaining training data, results of applying the final classifiers are used. In this way, predictions for each abstract are created using each of the three classifier methods. Keyword-based predictions are constructed in the usual manner. Bootstrap samples of 1,500 abstracts each are drawn from this data 1,000 times, and the monocrystalline silicon and thin film results are summed by approach to create the bootstrapped sets of article counts.
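The resampling step can be sketched as follows. This is an illustrative outline rather than the study's code: the function name `bootstrap_counts`, the record layout, the toy per-abstract values, and the choice of sampling with replacement are assumptions, and the demonstration uses smaller sample sizes than the 1,500 abstracts and 1,000 draws described above.

```python
import random


def bootstrap_counts(records, sample_size=1500, draws=1000, seed=0):
    """Bootstrap total article counts for each approach.

    records: one dict per abstract, mapping an approach name to that
    abstract's contribution to a category count (0 or 1, or a probability
    for the probabilistic approach). Sampling is with replacement.
    """
    rng = random.Random(seed)
    approaches = list(records[0])
    totals = {name: [] for name in approaches}
    for _ in range(draws):
        sample = rng.choices(records, k=sample_size)  # draw with replacement
        for name in approaches:
            totals[name].append(sum(rec[name] for rec in sample))
    return totals


# Hypothetical per-abstract thin film contributions, for illustration only.
toy_records = (
    [{"hand": 1, "threshold": 1, "probabilistic": 0.9, "keyword": 1}] * 60
    + [{"hand": 0, "threshold": 0, "probabilistic": 0.2, "keyword": 1}] * 25
    + [{"hand": 0, "threshold": 0, "probabilistic": 0.1, "keyword": 0}] * 15
)
boot = bootstrap_counts(toy_records, sample_size=200, draws=100)
print(min(boot["hand"]), max(boot["hand"]))        # spread of hand-coded counts
print(min(boot["keyword"]), max(boot["keyword"]))  # spread of keyword counts
```

Because every approach is summed over the same resampled abstracts, the resulting distributions can be compared directly, which is the property the comparison with hand-coded counts relies on.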
Article counts based on hand-coding are also constructed from the same bootstrapped samples, forming the “actual” distributions of article counts. These produce an ideal curve to which each article count approach’s curve can be compared. If an approach worked perfectly, its probability distributions would overlap exactly with the distributions based on hand-coding. The resulting bootstrapped curves are shown in Figure III.

Figure III. Bootstrapping Article Counts
Using the training abstracts, I bootstrap the monocrystalline silicon article count, the thin film article count, and the ratio between the two, as predicted by each of the classifier approaches, the keyword approach, and the real (i.e., hand-coded) results. The three classifier approaches give similar results, with the threshold approach peaking closest to the peak of the real results. The keyword approach produces results far from the hand-classified data.

The bootstrapped article counts reinforce the preceding conclusions regarding the four approaches. The strongest conclusion is that all classifier approaches are preferable to the keyword approach, at least for thin film where the definitions are consistent. Of the classifier approaches, the threshold approach performs marginally best, followed by the hybrid and probabilistic approaches. Still, all classifier-based approaches perform similarly.

Despite the classifier-based methods’ similarity within the training data, applying them to the larger data set exaggerates the differences among them. Since each approach has its pros and cons, including the keyword-based approach, results from all four approaches will be analyzed below.

4. Results

Article counts thus collected demonstrate the dominance of thin film throughout this time period, mostly irrespective of public policies.
All four article count approaches yield far more thin film than monocrystalline silicon articles, and this relationship appears to be more or less constant throughout the time considered. Absolute rates and changes over time vary by database. Regressing monocrystalline silicon and thin film article counts with major U.S. renewable energy policies gives some evidence that U.S. research subsidies have favored monocrystalline silicon. Other than this, this preliminary investigation provides no evidence that policies have favored either monocrystalline silicon or thin film over the other. Instead, research of both types follows similar temporal patterns, suggesting it is driven not primarily by internal developments but by broader social, economic and policy forces.

4.1 Article Count Totals

Thin film contributes the vast majority of the research articles identified, with monocrystalline silicon contributing a much smaller proportion of total solar research articles. This is true for all four of the approaches used here, as well as for the small sample of data sorted by hand, for both databases used. Differences in methods are likely to account for differences in the exact percentages found.

Table VII. Monocrystalline Silicon and Thin Film Article Totals
Total monocrystalline silicon and thin film articles identified by the four methods considered. Monocrystalline silicon accounts for only 5% to 39% of total articles. The sample of articles which was hand-sorted includes general solar research within the “monocrystalline” category and thus it and all three methods based on it are likely to underestimate the dominance of thin film.
| Database | Article ID Method | Monocrystalline | Thin Film | % Monocrystalline | Total |
|---|---|---|---|---|---|
| Engineering Village | Sample | 92 | 142 | 39% | 234 |
| Engineering Village | Probability | 3,979 | 7,783 | 34% | 11,762 |
| Engineering Village | Hybrid | 5,243 | 10,312 | 34% | 15,555 |
| Engineering Village | Threshold | 1,969 | 13,495 | 13% | 15,464 |
| Engineering Village | Keyword | 1,247 | 21,978 | 5% | 23,225 |
| Web of Science | Sample | 35 | 102 | 26% | 137 |
| Web of Science | Probability | 5,647 | 13,109 | 30% | 18,756 |
| Web of Science | Hybrid | 7,264 | 16,799 | 30% | 24,062 |
| Web of Science | Threshold | 1,522 | 22,160 | 6% | 23,682 |
| Web of Science | Keyword | 1,506 | 26,947 | 5% | 28,453 |

The hand-sorted sample of data is perhaps the most trustworthy, since every abstract it includes was classified by hand. While this sample suggests the largest percentage of monocrystalline silicon as opposed to thin film (39% for Engineering Village and 26% for Web of Science), it is important to recall that the “monocrystalline” category is defined to include not only actual monocrystalline silicon research but also research which is broadly useful to photovoltaics. Thus, a more thorough description of this finding is that 61% or 74% of the sampled research focuses on thin film while the remainder focuses on monocrystalline silicon and technologies which are applicable to either.

Probabilistic and hybrid article counts also show the monocrystalline silicon category contributing only about one third of research articles, similar to the proportions given by the hand-sorted samples. Note that the similarity between probabilistic and hybrid results means simply that the second classifier’s results are not correlated with the third classifier’s results. Their results’ similarity to hand-coding is natural because the classifiers are calibrated to match the hand-coded sample data.

Since probabilistic counts allow each article to be considered part monocrystalline silicon and part thin film in the proportions assigned by the classifiers, they could be considered to describe the amount of article space occupied by each of the two topics, acknowledging that a given abstract may discuss both.
However, if an abstract gives some mention to one topic but emphasizes the other, it seems likely that the deemphasized topic is mentioned only for comparison rather than as the subject of new research. This would suggest that the probabilistic percentages of monocrystalline silicon may overstate the amount of new research on it. On the other hand, if one thinks of the probabilities as representing degrees of certainty about each article rather than representing parts of the article, then probabilistic article counts may be the most accurate descriptions of total monocrystalline silicon and thin film proportions.

In general, it may seem odd to allow an article to contribute to both monocrystalline silicon and thin film article counts. Threshold article counts avoid this by assigning each article only to its most probable category. Since this is thin film in most cases, this approach yields dramatically lower percentages of monocrystalline silicon articles: 13% for Engineering Village and 6% for Web of Science. Note that it is theoretically possible for a threshold approach to yield either a more extreme or less extreme distribution of article counts between two categories. The fact that it yields a more extreme distribution here suggests that the real distribution is indeed relatively extreme, although it could be less so than the threshold results suggest.

Keyword article counts corroborate the threshold article counts. Only 5% of keyword-collected articles from each database are in the monocrystalline silicon category. With keyword articles and threshold articles agreeing so closely, one may conclude that monocrystalline silicon contributed little more than 5-6% of research during this period. This is reinforced by the narrower definition of “monocrystalline silicon” for the purpose of keyword counts.
However, the use of only eight keyword combinations to define silicon compared with 46 defining thin film may substantially bias the results in favor of thin film. In any case, one would expect the keyword “monocrystalline silicon” results to be smaller than the classifier-based “monocrystalline silicon” results because of their narrower definition.

In interpreting these results, it is important to remember that not all research is published in journals. Since monocrystalline silicon solar has achieved significant commercial production, it is likely that compared to thin film, a larger proportion of the research on it is conducted outside of academia and published in patents and other forms, or never published, and therefore a lower proportion is recorded in journal articles. The high proportions of thin film articles found here simply demonstrate that the community of publishing solar researchers is focusing on thin film rather than monocrystalline silicon, and as seen below, probably has been doing so for a long time.

4.2 Article Counts Over Time

All four sets of time series suggest two simple and significant conclusions: thin film has consistently been the focus of more published research than monocrystalline silicon throughout the years considered; and both types have followed similar patterns of increase and decrease, suggesting they respond to the same causal factors. The second conclusion may in part arise from how the articles were selected. Additional conclusions may be drawn from the particular trajectories of the various article count time series.

Figure IV. Monthly Article Counts
Monthly counts of monocrystalline silicon and thin film articles. Articles identified by the threshold, probabilistic and keyword methods described above are summed by month of publication. Since the raw counts are fairly noisy, twelve-month rolling averages are shown. Low numbers at the beginning and end of each series may reflect data limitations.
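The smoothing used in the figure can be sketched with a short Python function. The trailing (rather than centered) window and the function name `rolling_mean` are assumptions; the text specifies only that twelve-month rolling averages are shown.

```python
def rolling_mean(monthly_counts, window=12):
    """Trailing rolling average of monthly article counts.

    Returns one value per month once a full window is available, so the
    output is shorter than the input by window - 1 months.
    """
    averages = []
    running = 0.0
    for i, count in enumerate(monthly_counts):
        running += count
        if i >= window:
            running -= monthly_counts[i - window]  # drop the month leaving the window
        if i >= window - 1:
            averages.append(running / window)
    return averages


# Hypothetical monthly counts: 13 months rising from 1 to 13 articles.
smoothed = rolling_mean(list(range(1, 14)))
print(smoothed)  # two values, one per complete twelve-month window
```

A trailing window leaves the first eleven months without a smoothed value, which is one reason the ends of a smoothed series should be read with care.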
Thin film accounts for more articles than monocrystalline silicon throughout the years considered. This is true for all methods and databases, with the only near exception being Web of Science articles before 1992, possibly due to data limitations which will be discussed further. Threshold and keyword monthly counts show monocrystalline silicon research having a mere fraction of the volume of thin film research. Probabilistic and hybrid article counts, since they assign part of each article to each topic, more conservatively allow monocrystalline articles to appear roughly half as numerous as thin film articles at most points in time, and fewer than thin film at all points in time.

Thin film’s dominance throughout this period is consistent with monocrystalline silicon and thin film technical history. By 1980, monocrystalline silicon was well past its initial development and first uses on satellites and was commercially available. Thus, the goal of creating a saleable product had been achieved, and efforts to improve that product may have shifted away from published journal research somewhat. Thin film, in contrast, remained relatively nascent. For example, in 1985 (the earliest year for which the Energy Information Administration reports such data), thin film sales were 5% of the U.S. solar market compared to monocrystalline silicon’s 95% (Energy Information Administration (EIA), June 2009). At the same time, predictions of thin film’s eventual dominance date from at least the mid-1980’s (Martin A Green, 2001). These factors may have spurred ongoing enthusiasm for conducting and publishing thin film research more so than monocrystalline silicon research.

Even more strikingly, monocrystalline and thin film research appear to follow similar patterns over time.
This is true for all methods used, including keywords, for which the only methodological causes of such similarity would be the database drawn from and the occurrence of “solar” in abstracts of both types. To make these patterns more visible when monocrystalline silicon article counts are low, the threshold and keyword article counts shown in Figure IV are repeated in Figure V, this time with monocrystalline silicon article counts scaled up by a factor of ten. Of course, monocrystalline silicon articles still appear noisier because they are fewer. With a few noteworthy exceptions, peaks and valleys in monocrystalline silicon articles appear to align with peaks and valleys in thin film articles.

Figure V. Monthly Article Counts Rescaled
Threshold and keyword smoothed article counts are shown again, this time with monocrystalline silicon articles scaled up by a factor of ten in order to make their variations more visible. Probabilistic counts are not shown because their monocrystalline silicon articles are already easily observed in Figure IV.

These patterns suggest that both monocrystalline silicon and thin film research are driven by similar underlying forces. While perhaps this is unsurprising, they are sufficiently distinct technologies that the opposite hypothesis, that they would rise and fall separately, would be accurate if they were primarily driven by preceding technological developments. The underlying forces at work must be substantially independent of the technical differences between the two technologies. Thus, they may be broader social and economic influences. Given the large role governments play in solar energy research and markets, these forces are likely to include public policies.

To some extent, the similarity between monocrystalline silicon and thin film article count trends may be attributable to the methods used to collect them.
While many of the article count methods considered, particularly the threshold and keyword approaches, are designed to counteract this effect, all counts draw from the same initial collection of potentially relevant article abstracts. More generally, the conclusion that monocrystalline silicon and thin film follow similar trajectories merits further examination by other procedures. Here, subsequent analysis including the regression analysis will minimize this problem by focusing on the monocrystalline silicon or thin film articles as a percent of the total of the two, rather than as absolute totals.

There are more patterns of interest than can be explored here. For example, in the threshold data from Engineering Village a series of dramatic peaks and valleys occurs around 1984-1996, suggesting that some particular events may explain them. In the keyword data from both databases, monocrystalline silicon research drops dramatically around 2003 and does not appear to recover. Most of these patterns I will leave for further research, focusing here on one which is visible in Figure IV: for probabilistic article counts from Web of Science, monocrystalline silicon and thin film contribute very similar quantities until 1991 or 1992, after which thin film pulls well above monocrystalline silicon.

There are two plausible explanations for the change in 1992, particularly given that it occurs only for one database. Most likely is improvement in the database itself. The level of detail in Web of Science was expanded substantially around 1992. Specifically, in 1992 and later, Thomson Reuters placed a much greater emphasis on including abstracts and formal keywords (not the same as the ‘keywords’ used in the keyword approach) with the citations in Web of Science (Bonnie Snow, 1991; Marylou Warwick, October 17, 2012). These changes may have affected slightly earlier articles as well, since there can be a time delay in entering articles into the database, and may have taken many months to fully implement.
The database improvements may enable the classifiers to more accurately classify articles published in or after about 1992, with pre‐1992 articles poorly classified and thus appearing to be about as likely to be in either the monocrystalline silicon or thin film categories. The pre‐1992 articles were mostly classified as probably being about thin film, as shown in the threshold data, but since these probabilities were imprecise, the probabilistic article counts for the same time period assigned roughly equal counts to monocrystalline silicon and thin film. The post‐1992 data, with a substantial difference between monocrystalline silicon and thin film, better reflect the usual monocrystalline‐thin film relationship. Thus, the differences between the probabilistic and threshold methods, together with information about Web of Science, indicate that the change in 1992 is probably due to database changes rather than research responses to policy. Still, there is a possibility that the 1992 increase in Web of Science probabilistic article counts for thin film may be due to policy changes. Three U.S. policy changes enacted within the Energy Policy Act passed in 1992 may have increased the long‐term demand for solar energy. Firstly, a tax credit of 1.5 cents/kWh was created for producing renewable electricity. Secondly, the ten‐percent tax credit for solar business investment was made permanent, whereas previously it had been extended each year on an annual basis (Energy Information Administration, September 1999). Finally, the federal government legalized the creation of wholesale energy producers which would not be regulated as utilities. This law was followed by related legislation in many states, much of which included additional incentives for renewable energy (Vicki Norberg‐Bohm, 2000). These changes substantially increased the financial resources which solar energy could expect to access for years to come. 
Thus, these policies could have not only incentivized solar research overall, but had particularly strong effects on thin film research, which would take years to reach fruition.

4.3 Regressing Article Counts with U.S. Policies

This section briefly considers whether certain U.S. economic and policy factors may have influenced the balance of research between monocrystalline silicon and thin film. Article counts for each of the two categories, as a percentage of total monocrystalline and thin film article counts, are regressed with federal solar research subsidies, federal renewable energy tax credit expenditures, and other relevant variables. The research subsidies appear to favor monocrystalline silicon for one of the sets of article counts, suggesting that targeted research policies may be able to affect the direction of research emphasis. No further significant results are found, consistent with the hypothesis that U.S. policies are not the main causes of the relationship between monocrystalline silicon and thin film.

For the regression analysis, the hybrid and keyword article counts are used. While it is unclear which of the classifier methods is best, using only one is preferable in order to minimize the possibility of randomly appearing “significant” results, and for simplicity of presentation. The hybrid method is chosen because it errs on the inclusive side in terms of total monocrystalline silicon and thin film articles, while acknowledging the uncertainty in assigning them to one category or the other. The keyword article counts are regressed for comparison because they provide a methodological counterpoint and are the more traditional approach.

Monthly article counts for one of the two technology types, as a percent of both monocrystalline silicon and thin film articles, are regressed with earlier conditions using a logistic regression.
The equation is

$$\ln\left(\frac{p_t}{1 - p_t}\right) = \beta_0 + \boldsymbol{\beta}^{\top}\mathbf{x}_{t-\ell}$$

where $p_t$ is the percent of articles published in month $t$, $\mathbf{x}_{t-\ell}$ is the vector of regressors described below, each lagged by $\ell$, and each month is weighted by the number of monocrystalline plus thin film articles in that month, i.e. the number of opportunities that an article could have been about one topic rather than the other. This functional form considers publication of a monocrystalline article as an event which only has a chance to happen when an article of either type is published; and, similarly for thin film when thin film article count percentages are the dependent variable. An overdispersion parameter is allowed (that is, the model constitutes a quasibinomial model) in order to account for overdispersion in the data, and is used in calculating the standard errors.

Two public policy variables are considered: the sum of the Production Tax Credit and Investment Tax Credit expenditures mentioned earlier, which are only available summed together, and which are effectively a subsidy for producing renewable energy; and federal solar research spending. Economic conditions included are renewable energy consumption, electricity prices, fossil fuel prices paid by power plants, and GDP. A lagged value of twelve-month-averaged article counts is also included. The regressions are limited to 1995-2008 based on limited availability of the regressor data.

A lag time of several months or years is placed between all regressors and the article counts, in order to allow time for research to be completed and published. This lag ($\ell$) is seven months,[xiii] one year, two years, three years or five years. Significance is calculated across all six independent variables for the given time considered, i.e. at the 5%/6 = 0.833% level, a lax definition considering the number of regressions under consideration. For further details on the covariate data, see earlier essays in this dissertation.

Table VIII. Summary Statistics for 1995-2008
Summary statistics are shown for the count data and all regressors during the time period covered by the regressions, 1995-2008. The time period is limited by the availability of covariate data.

| Variable | Mean (SD) | Units | Frequency | Source |
|---|---|---|---|---|
| Solar Research Subsidies | 114 (33) | million $ (a) | Annual | OECD/IEA |
| Production and Investment Tax Credits | 250 (244) | million $ expended (b) | Annual | OMB data (c) |
| Renewable Consumption | 546 (58) | million Btu | Monthly | EIA |
| Electricity Prices | 9.0 (0.6) | $/million Btu (b) | Monthly | EIA |
| Fossil Fuel Prices | 2.6 (0.8) | $/million Btu (b) | Monthly | EIA |
| GDP | 12600000 (1300000) | seasonally adjusted million $ (b) | Quarterly | BEA |

a. inflation-adjusted by source using 2009 CPI
b. inflation-adjusted by author using average 2009 CPI
c. compiled from annual Analytical Perspectives budget documents

[xiii] Seven months is the average publication time, calculated using reported publication times for the few journals mentioned in Luwel and Moed (1998), weighted by appearances in the solar-relevant articles. Luwel, M. and H. F. Moed. 1998. "Publication Delays in the Science Field and Their Relationship to the Ageing of Scientific Literature." Scientometrics, 41(1-2), 29-40.

Few results show any statistical significance, suggesting that the policies considered are not the primary forces responsible for the distribution of research between monocrystalline silicon and thin film. For thin film, no independent variables are significant with any of the time lags considered. Nor are monocrystalline silicon results significant for any keyword counts, or for any results using Web of Science. The only nominally significant results are for research subsidies and renewable energy consumption acting on Engineering Village monocrystalline silicon hybrid article counts, as a proportion of monocrystalline silicon and thin film counts, which are shown in Table IX.
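The kind of weighted logistic fit described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the estimation code used in the study: it uses a single lagged regressor, fits by iteratively reweighted least squares, and estimates the dispersion from Pearson residuals. The function name `fit_quasibinomial` and the toy data are hypothetical.

```python
import math


def _logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def fit_quasibinomial(x, share, n, iterations=50):
    """Weighted logistic fit of share on one lagged regressor.

    x: lagged regressor values; share: monthly proportion of articles in
    the focal category; n: monthly total of monocrystalline plus thin film
    articles, used as weights.
    Returns (intercept, slope, dispersion, slope_standard_error).
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        sw = swx = swxx = swz = swxz = 0.0
        for xi, yi, ni in zip(x, share, n):
            eta = b0 + b1 * xi
            mu = _logistic(eta)
            var = mu * (1.0 - mu)
            w = ni * var                  # working weight
            z = eta + (yi - mu) / var     # working response
            sw += w
            swx += w * xi
            swxx += w * xi * xi
            swz += w * z
            swxz += w * xi * z
        det = sw * swxx - swx ** 2
        # Solve the 2x2 weighted least squares normal equations.
        b0, b1 = (swxx * swz - swx * swxz) / det, (sw * swxz - swx * swz) / det
    # Final pass: Pearson-based dispersion and the slope's standard error.
    sw = swx = swxx = pearson = 0.0
    for xi, yi, ni in zip(x, share, n):
        mu = _logistic(b0 + b1 * xi)
        var = mu * (1.0 - mu)
        w = ni * var
        sw += w
        swx += w * xi
        swxx += w * xi * xi
        pearson += ni * (yi - mu) ** 2 / var
    dispersion = pearson / (len(x) - 2)
    slope_se = math.sqrt(dispersion * sw / (sw * swxx - swx ** 2))
    return b0, b1, dispersion, slope_se


# Hypothetical data: six months of shares, article totals and a lagged regressor.
lagged_x = [0, 1, 2, 3, 4, 5]
mono_share = [0.30, 0.28, 0.35, 0.41, 0.39, 0.47]
articles = [40, 55, 60, 70, 65, 80]
b0, b1, dispersion, slope_se = fit_quasibinomial(lagged_x, mono_share, articles)
odds_ratio = math.exp(b1)  # multiplicative change in odds per unit of the regressor
print(b1, slope_se, dispersion, odds_ratio)
```

The dispersion estimate scales the standard errors, which is the role the overdispersion parameter plays in a quasibinomial model; a full analysis would include all regressors and would normally rely on an established statistics package.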
These few significant results suggest that increasing solar research subsidies or increasing renewable energy consumption favors monocrystalline silicon over thin film research. The remaining results tables are given in the appendix.

Table IX. Odds Ratios for Monocrystalline Silicon Hybrid Article Counts from Engineering Village
Regression results for monocrystalline silicon hybrid article counts for Engineering Village, the only regression with statistically significant results. Odds ratios are for monocrystalline silicon article counts vs. thin film article counts.

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 0.027 (0.013 ~ 0.052)** | 0.056 (0.028 ~ 0.113)** | 0.082 (0.021 ~ 0.32) | 0.322 (0.079 ~ 1.313) | 0.019 (0.003 ~ 0.114)* |
| Lag of Article Counts | 4.672 (0.102 ~ 214) | 0.064 (0.002 ~ 2.667) | 0.015 (0 ~ 1.078) | 358 (4.813 ~ 26646) | 0.829 (0 ~ 1638) |
| Solar Research Subsidies | 1.001 (1 ~ 1.002) | 1.002 (1.001 ~ 1.002)** | 0.997 (0.994 ~ 1) | 0.994 (0.99 ~ 0.998) | 1.002 (0.997 ~ 1.007) |
| Tax Credit Expenditures | 1 (1 ~ 1) | 1 (1 ~ 1) | 1.001 (1 ~ 1.001) | 1 (1 ~ 1.001) | 1 (1 ~ 1.001) |
| Renewable Energy Consumption | 1 (0.999 ~ 1) | 1.002 (1.001 ~ 1.002)** | 1.001 (1 ~ 1.002) | 1 (0.999 ~ 1.001) | 1.001 (1 ~ 1.001) |
| Electricity Prices | 1.111 (1.055 ~ 1.2)* | 0.877 (0.832 ~ 0.925)* | 0.972 (0.919 ~ 1.028) | 0.969 (0.902 ~ 1.04) | 0.872 (0.809 ~ 0.94) |
| Fossil Fuel Prices | 1.071 (1.001 ~ 1.1) | 1.1 (1 ~ 1) | 0.944 (0.875 ~ 1.019) | 1.1 (1.031 ~ 1.2) | 1.171 (1.012 ~ 1.355) |
| GDP | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) |
| Percent of Deviance Explained | 23% | 27% | 23% | 22% | 25% |
| Deviance Explained by Research Subsidies | 1.2% | 4.2% | 0.6% | 1.5% | 0.2% |
| Deviance Explained by Tax Credit | 2.0% | 0.9% | 1.2% | 1.3% | 0.3% |

For research subsidies, an increase of $10 million predicts an increase of 2% in the monocrystalline silicon to thin film odds ratio.
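The arithmetic behind this figure can be checked with a short sketch. It assumes that the odds ratio in Table IX is expressed per unit of the regressor as listed in Table VIII (per $1 million of research subsidies) and that the one-year estimate of 1.002, with its reported interval of 1.001 to 1.002, is the relevant entry; the function name is illustrative.

```python
def scaled_odds_ratio(per_unit_odds_ratio, units):
    """Odds ratio implied by a change of several units in a regressor.

    Odds ratios multiply, so a change of k units corresponds to the
    per-unit odds ratio raised to the power k.
    """
    return per_unit_odds_ratio ** units


# One-year lag, solar research subsidies: estimate and reported interval bounds.
estimate = scaled_odds_ratio(1.002, 10)
lower = scaled_odds_ratio(1.001, 10)
print(f"{estimate - 1:.1%}")  # about 2% per $10 million
print(f"{lower - 1:.1%}")     # about 1% at the lower bound
```

Because the reported odds ratios are rounded to three decimals, these percentages are approximate.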
With an average odds ratio of about 50% and average research subsidies of $114 million (with a standard deviation of $33 million), this effect may be considered noteworthy but not close to explaining most of the imbalance between monocrystalline silicon and thin film article counts. It is likely that research subsidies have been deliberately targeted towards research on monocrystalline silicon and general solar topics, which would explain this result. Whether intentional or not, the result suggests that research targeting has occurred. The technical emphasis of funded research may help explain why this result appears in Engineering Village, the more technical of the two databases, and not in Web of Science.

For renewable energy consumption, the effect size is numerically the same: an increase of 10 million Btu predicts an increase of 1-2% in the odds ratio. Again, since renewable energy consumption has averaged 546 million Btu (standard deviation 58 million Btu) over the time period in question, the size of this effect is substantial but not sufficient to explain the odds ratio. Still, it suggests that increasing renewable energy consumption favors monocrystalline silicon research. Insofar as demand for solar panels goes up when renewable energy consumption goes up, it appears to be encouraging researchers to focus on topics that can soon be used for making solar panels, consistent with the fact that monocrystalline silicon is currently the dominant technology. It may also be that renewable energy consumption-driven increases in the odds ratio are attributable primarily to research which is useful to solar generally. Again, this explanation emphasizes research which is close to commercialization and therefore may be more likely to appear in Engineering Village than in Web of Science.

5. Conclusion

Thin film research articles have been far more numerous than monocrystalline silicon research articles, both in total and throughout 1985-2010, regardless of which approach is used to sort articles. Thin film is likely to continue to dominate solar cell research for the foreseeable future, although when or if that research will translate to market share remains to be seen. Its dominance does not seem to be a product of policy, in that most relevant public policies appear to have had similar effects on both monocrystalline silicon and thin film research. Thus, these policies are letting a “thousand flowers bloom.” The exception is solar research subsidies, which were examined in the case of the U.S. and appear to have moderated the dominance of thin film. Since research subsidies go directly to research, it is easier for them to affect research direction and this finding shows that they are capable of doing so. The data in this study contain more detail than can be examined here and may constitute useful material for future investigations of monocrystalline silicon and thin film solar research.

5.1 Implications for the Future of Monocrystalline Silicon and Thin Film

Thin film research has been far more copious than monocrystalline silicon or general solar research throughout recent decades. All the article counts constructed here reflect this, and their plots over time show that thin film has dominated not just on average but throughout the years considered. This volume of research presumably reflects the optimism and success of thin film researchers, who have been successful at improving the efficiencies of many types of thin films, if not at commercializing them as fast as hoped. These successes, combined with environmental goals and understanding of how thin films work, may have been the inspirations for the high volume of thin film research and continued belief that thin film will one day dominate the solar energy industry.
If such successes continue, this dominance seems likely to occur eventually, but when it will arrive is far less clear.

On the other hand, it must be remembered that monocrystalline silicon may be the subject of less published research simply because it has already achieved high efficiencies and commercial success. This environment may make it difficult for thin film to compete commercially, thus potentially encouraging thin film researchers to focus on journal publication rather than other forms of output, furthering thin film’s dominance in the published research sphere. With monocrystalline silicon technology improving as well, albeit perhaps at a slower pace, thin film’s research dominance does not guarantee eventual market dominance.

5.2 Policy Implications and Future Research Directions

For the most part, public policies appear to have been agnostic of the difference between monocrystalline silicon and thin film technologies. The regression results show no connection between U.S. tax credits for renewable energy and the relative quantities of monocrystalline silicon and thin film, while similar research finds that these tax credits, as well as solar energy subsidies, are related to increases in total solar research (see preceding essays). Qualitative analysis, comparing the dates when major subsidy policies were instituted with monocrystalline silicon and thin film article count trajectories, so far does not reveal examples where policies appear to have influenced one research category but not the other. An increase in both around 1992 may be attributable to the influence of U.S. policy changes. Thus, the analyses conducted here suggest that public policies have largely had similar effects on both monocrystalline silicon and thin film research.

However, regression results suggest that U.S. research subsidies have favored monocrystalline silicon and thereby moderated the research dominance of thin film.
This result appears only in the more applied of the two databases, and only with the hybrid, not the keyword-based, article counts. Still, it suggests that research funding can be used to influence the direction of research emphasis. If this is done deliberately, it may provide a significant tool for policymakers, even if it remains a small influence compared to other impacts on solar research directions. Given the apparent influence of U.S. subsidies, the relationships between solar research direction and public policies, be they American, Asian or European, may be worth exploring further. The differences between the temporal patterns of Engineering Village and Web of Science article counts also merit further investigation. Finally, the article selection methods could be refined, especially by using a larger set of training data, in order to provide article counts that can be interpreted in greater detail.

Appendix. Regression Results

The results of logistic regressions of monocrystalline silicon or thin film article count percentages on subsidy and other variables are shown in the tables below. Highlighted cells and double asterisks identify results that are statistically significant at the 0.833% level, and single asterisks those significant at the 5% level, for each variable. The 0.833% level (5% divided by six) is chosen so that significance at the 5% level holds across all six independent variables considered. Note that for such a large number of results, one would still expect a few of them to exceed this significance threshold simply as a result of random variation. Results are reported as odds ratios, i.e. the odds of being in the category that is the focus of that regression, divided by the odds of being in the other category. Odds ratios are shown to the nearest thousandth except where their value is at least 100, in which case they are shown only to the nearest whole number.
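The arithmetic behind these reporting conventions can be sketched in a few lines of Python. This is an illustration only, not the code used in the study: the function names are hypothetical, the coefficient and standard error in the usage section are made up, the count of 240 reported coefficients is a placeholder, and the 95% interval level (z = 1.96) is an assumption, since the text does not state the level behind the parenthesized ranges in the tables.

```python
import math


def bonferroni_threshold(family_alpha: float, n_tests: int) -> float:
    """Per-test significance level that keeps the familywise level at family_alpha."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return family_alpha / n_tests


def odds(p: float) -> float:
    """Odds corresponding to a probability (or share) p, with 0 <= p < 1."""
    return p / (1.0 - p)


def odds_ratio_with_ci(beta: float, std_err: float, z: float = 1.96):
    """Convert a logistic-regression coefficient into an odds ratio with a Wald interval.

    z = 1.96 gives an approximate 95% interval; that level is an assumption here.
    """
    return (
        math.exp(beta),
        math.exp(beta - z * std_err),
        math.exp(beta + z * std_err),
    )


def expected_false_positives(n_results: int, alpha: float) -> float:
    """Expected number of results below alpha if every null hypothesis were true."""
    return n_results * alpha


if __name__ == "__main__":
    # 5% spread over six independent variables gives the 0.833% level.
    alpha = bonferroni_threshold(0.05, 6)
    print(f"per-variable threshold: {alpha * 100:.3f}%")

    # A share of 75% corresponds to odds of 3 to 1.
    print(f"odds for a 75% share: {odds(0.75):.1f}")

    # Hypothetical coefficient and standard error, for illustration only.
    ratio, lower, upper = odds_ratio_with_ci(beta=-0.183, std_err=0.028)
    print(f"odds ratio {ratio:.3f} ({lower:.3f} ~ {upper:.3f})")

    # 240 is a placeholder count of reported coefficients, not a figure from the study.
    print(f"expected chance findings: {expected_false_positives(240, alpha):.1f}")
```

Because a Wald interval is symmetric on the log scale, the reported odds ratio is the geometric mean of its two bounds, which gives a quick consistency check when reading the tables.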
Odds Ratios for Thin Film Hybrid Article Counts from Engineering Village

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 0.058 (0.027 ~ 0.125)* | 0.187 (0.089 ~ 0.395)* | 0.351 (0.082 ~ 1.499) | 2.515 (0.543 ~ 11.655) | 0.092 (0.014 ~ 0.604) |
| Lag of Article Counts | 2.368 (0.371 ~ 15.135) | 0.103 (0.018 ~ 0.588) | 0.052 (0.007 ~ 0.406) | 1.261 (0.15 ~ 10.61) | 0.317 (0.008 ~ 12.273) |
| Wind Research Subsidies | 1.001 (1 ~ 1.001) | 1.001 (1 ~ 1.002) | 0.999 (0.996 ~ 1.002) | 0.99 (0.986 ~ 0.995)* | 1.003 (0.998 ~ 1.009) |
| Tax Credit Expenditures | 1 (1 ~ 1) | 1 (1 ~ 1) | 1.001 (1 ~ 1.001) | 1 (1 ~ 1.001) | 1 (0.999 ~ 1) |
| Renewable Energy Consumption | 1 (0.999 ~ 1) | 1.002 (1.002 ~ 1.003)* | 1.001 (1.001 ~ 1.002) | 1 (0.999 ~ 1.001) | 1.001 (1 ~ 1.002) |
| Electricity Prices | 1.115 (1.053 ~ 1.18) | 0.833 (0.788 ~ 0.88)* | 0.935 (0.881 ~ 0.994) | 0.982 (0.909 ~ 1.061) | 0.82 (0.756 ~ 0.889)* |
| Fossil Fuel Prices | 0.974 (0.897 ~ 1.057) | 1.077 (0.971 ~ 1.196) | 1.2 (1.056 ~ 1.4) | 1.1 (0.996 ~ 1.2) | 1.119 (1.042 ~ 1.2) |
| GDP | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) |
| Percent of Deviance Explained | 10% | 22% | 16% | 11% | 15% |
| Deviance Explained by Research Subsidies | 0.50% | 1.0% | 0.1% | 3.8% | 0.4% |
| Deviance Explained by Tax Credit | 0.2% | 0.0% | 1.2% | 0.2% | 0.1% |

Odds Ratios for Monocrystalline Silicon Hybrid Article Counts from Web of Science

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 0.001 (0.001 ~ 0.003)* | 0.105 (0.044 ~ 0.252)* | 0.025 (0.008 ~ 0.075)* | 0.022 (0.007 ~ 0.069)* | 0.039 (0.011 ~ 0.137)* |
| Lag of Article Counts | 382 (0.859 ~ 169525) | 564077624118 (516487028 ~ 616053354636521)* | 34251062433 (4438687 ~ 264297820948949)* | 32349084227 (1163338 ~ 899535078664089)* | 17754854621471300000 (2946197469550 ~ 1.069971 × 10^26)* |
| Solar Research Subsidies | 1 (0.999 ~ 1.001) | 0.999 (0.998 ~ 0.999)* | 1 (0.997 ~ 1.002) | 0.999 (0.995 ~ 1.003) | 0.999 (0.995 ~ 1.003) |
| Tax Credit Expenditures | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1.001) | 1 (1 ~ 1.001) | 1 (1 ~ 1.001) |
| Renewable Energy Consumption | 1 (0.999 ~ 1) | 1.001 (1 ~ 1.001) | 1 (1 ~ 1.001) | 1 (0.999 ~ 1) | 1 (1 ~ 1) |
| Electricity Prices | 1.189 (1.1 ~ 1.2)* | 0.854 (0.818 ~ 0.89)* | 0.864 (0.827 ~ 0.903)* | 0.859 (0.812 ~ 0.909)* | 0.839 (0.79 ~ 0.891)* |
| Fossil Fuel Prices | 0.991 (0.944 ~ 1.04) | 1.1 (1.1 ~ 1)* | 1.1 (1 ~ 1.1) | 1 (0.93 ~ 1.1) | 1.1 (0.993 ~ 1.3) |
| GDP | 1 (1 ~ 1)* | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) |
| Percent of Deviance Explained | 74% | 72% | 70% | 69% | 61% |
| Deviance Explained by Research Subsidies | 0.0003% | 0.8% | 0.001% | 0.02% | 0.03% |
| Deviance Explained by Tax Credit | 0.3% | 0.4% | 0.2% | 0.5% | 0.2% |

Odds Ratios for Thin Film Hybrid Article Counts from Web of Science

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 0.004 (0.002 ~ 0.011)* | 0.15 (0.051 ~ 0.439) | 0.036 (0.01 ~ 0.131)* | 0.07 (0.018 ~ 0.281) | 0.049 (0.01 ~ 0.232) |
| Lag of Article Counts | 4.491 (0.283 ~ 71.367) | 1634 (75.33 ~ 35424)* | 42694 (420 ~ 4345193)* | 2199 (3.054 ~ 1583334) | 82220198 (19843 ~ 340679800114)* |
| Wind Research Subsidies | 0.999 (0.999 ~ 1) | 0.999 (0.999 ~ 1) | 1.002 (0.999 ~ 1.006) | 0.998 (0.993 ~ 1.003) | 1.003 (0.998 ~ 1.008) |
| Tax Credit Expenditures | 1 (1 ~ 1)* | 1 (1 ~ 1) | 1 (1 ~ 1.001) | 1 (1 ~ 1.001) | 1 (1 ~ 1.001) |
| Renewable Energy Consumption | 1 (0.999 ~ 1) | 1.001 (1 ~ 1.001) | 1.001 (1.001 ~ 1.002)* | 1.001 (1 ~ 1.001) | 1 (0.999 ~ 1) |
| Electricity Prices | 1.2 (1.151 ~ 1.3)* | 0.837 (0.796 ~ 0.879)* | 0.848 (0.804 ~ 0.894)* | 0.877 (0.819 ~ 0.939) | 0.896 (0.833 ~ 0.963) |
| Fossil Fuel Prices | 1 (0.9 ~ 1.1) | 1.3 (1.2 ~ 1.4)* | 1.1 (1.001 ~ 1.1) | 1.023 (0.937 ~ 1.1) | 1.1 (0.942 ~ 1.3) |
| GDP | 1 (1 ~ 1)* | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) |
| Percent of Deviance Explained | 68% | 65% | 61% | 57% | 46% |
| Deviance Explained by Research Subsidies | 0.1% | 0.2% | 0.2% | 0.1% | 0.2% |
| Deviance Explained by Tax Credit | 1.1% | 0.1% | 0.5% | 1.3% | 0.8% |

Odds Ratios for Monocrystalline Silicon Keyword Article Counts from Engineering Village

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 0.18 (0.036 ~ 0.906) | 3.37 (0.663 ~ 17.134) | 0.026 (0.002 ~ 0.391) | 0.147 (0.009 ~ 2.49) | 0.297 (0.014 ~ 6.238) |
| Lag of Article Counts | 32.891 (0.193 ~ 5602) | 0.303 (0.002 ~ 60.396) | 0.005 (0 ~ 0.675) | 0 (0 ~ 0.03) | 242 (2.005 ~ 29324) |
| Solar Research Subsidies | 1 (0.999 ~ 1.002) | 1.001 (1 ~ 1.003) | 1.002 (0.997 ~ 1.008) | 1.013 (1.005 ~ 1.021) | 1.008 (0.999 ~ 1.016) |
| Tax Credit Expenditures | 0.999 (0.999 ~ 1) | 0.999 (0.999 ~ 1) | 0.999 (0.998 ~ 1) | 1 (0.999 ~ 1.001) | 1 (0.999 ~ 1.001) |
| Renewable Energy Consumption | 0.999 (0.998 ~ 1) | 0.999 (0.998 ~ 1) | 1.003 (1.002 ~ 1.004)* | 1.001 (1 ~ 1.002) | 1.001 (0.999 ~ 1.002) |
| Electricity Prices | 1.022 (0.914 ~ 1.1) | 0.816 (0.728 ~ 0.915) | 0.84 (0.748 ~ 0.943) | 0.911 (0.791 ~ 1.05) | 0.926 (0.805 ~ 1.066) |
| Fossil Fuel Prices | 1.125 (0.975 ~ 1.3) | 1.3 (1.1 ~ 1) | 0.789 (0.669 ~ 0.93) | 1 (0.823 ~ 1.2) | 1.483 (1.115 ~ 1.974) |
| GDP | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) |
| Percent of Deviance Explained | 12% | 14% | 19% | 18% | 21% |
| Deviance Explained by Research Subsidies | 0.04% | 0.5% | 0.1% | 2.0% | 0.6% |
| Deviance Explained by Tax Credit | 1.4% | 1.1% | 1.3% | 0.005% | 0.1% |

Odds Ratios for Thin Film Keyword Article Counts from Engineering Village

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 0.169 (0.001 ~ 19.989) | 0.981 (0.007 ~ 143) | 7278 (40.933 ~ 1294137) | 32032 (168 ~ 6111515) | 0.014 (0 ~ 1.935) |
| Lag of Article Counts | 32.891 (0.193 ~ 5602.105) | 0.303 (0.002 ~ 60.396) | 0.005 (0 ~ 0.675) | 0 (0 ~ 0.03) | 242 (2.005 ~ 29324) |
| Wind Research Subsidies | 1 (0.998 ~ 1.001) | 0.999 (0.997 ~ 1) | 0.998 (0.992 ~ 1.003) | 0.987 (0.98 ~ 0.995) | 0.992 (0.984 ~ 1.001) |
| Tax Credit Expenditures | 1.001 (1 ~ 1.001) | 1.001 (1 ~ 1.001) | 1.001 (1 ~ 1.002) | 1 (0.999 ~ 1.001) | 1 (0.999 ~ 1.001) |
| Renewable Energy Consumption | 1.001 (1 ~ 1.002) | 1.001 (1 ~ 1.002) | 0.997 (0.996 ~ 0.998)* | 0.999 (0.998 ~ 1) | 0.999 (0.998 ~ 1.001) |
| Electricity Prices | 0.979 (0.875 ~ 1.094) | 1.225 (1.093 ~ 1.374) | 1.19 (1.06 ~ 1.337) | 1.097 (0.952 ~ 1.264) | 1.08 (0.938 ~ 1.242) |
| Fossil Fuel Prices | 0.9 (0.77 ~ 1) | 0.797 (0.687 ~ 0.9) | 1.268 (1.075 ~ 1.496) | 1.006 (0.832 ~ 1.215) | 0.7 (0.507 ~ 0.9) |
| GDP | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) |
| Percent of Deviance Explained | 12% | 14% | 19% | 18% | 21% |
| Deviance Explained by Research Subsidies | 0.04% | 0.5% | 0.1% | 2.0% | 0.6% |
| Deviance Explained by Tax Credit | 1.4% | 1.1% | 1.3% | 0.0% | 0.1% |

Odds Ratios for Monocrystalline Silicon Keyword Article Counts from Web of Science

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 0.343 (0.081 ~ 1.453) | 1.577 (0.355 ~ 7.006) | 0.132 (0.01 ~ 1.779) | 1.402 (0.089 ~ 22.095) | 0.487 (0.035 ~ 6.729) |
| Lag of Article Counts | 1.884 (0.018 ~ 202) | 0.597 (0.007 ~ 48.523) | 0.496 (0.006 ~ 40.977) | 0.009 (0 ~ 0.824) | 7570 (90.527 ~ 633095)* |
| Solar Research Subsidies | 1 (0.999 ~ 1.001) | 1.001 (1 ~ 1.003) | 0.997 (0.992 ~ 1.002) | 1.008 (1.001 ~ 1.016) | 1 (0.993 ~ 1.008) |
| Tax Credit Expenditures | 1 (0.999 ~ 1) | 1 (0.999 ~ 1) | 0.999 (0.999 ~ 1) | 1 (0.999 ~ 1) | 1 (0.999 ~ 1.001) |
| Renewable Energy Consumption | 0.999 (0.998 ~ 1) | 0.998 (0.997 ~ 0.999)* | 1.001 (1 ~ 1.002) | 1 (0.999 ~ 1.001) | 1.001 (1 ~ 1.002) |
| Electricity Prices | 1.025 (0.9 ~ 1.1) | 0.868 (0.79 ~ 0.955) | 0.864 (0.78 ~ 0.957) | 0.823 (0.722 ~ 0.938) | 0.975 (0.866 ~ 1.098) |
| Fossil Fuel Prices | 1.076 (0.948 ~ 1.221) | 1 (0.9 ~ 1) | 0.8 (0.697 ~ 0.9) | 1.189 (1.004 ~ 1.4) | 1 (0.754 ~ 1.2) |
| GDP | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1)* |
| Percent of Deviance Explained | 11% | 17% | 16% | 15% | 24% |
| Deviance Explained by Research Subsidies | 0.007% | 0.8% | 0.2% | 1.1% | 0.001% |
| Deviance Explained by Tax Credit | 0.5% | 0.4% | 1.5% | 0.05% | 0.01% |

Odds Ratios for Thin Film Keyword Article Counts from Web of Science

| Production Time | 7 months | 1 year | 2 years | 3 years | 5 years |
|---|---|---|---|---|---|
| Constant | 1.548 (0.023 ~ 103) | 1.062 (0.022 ~ 51.646) | 15.284 (0.194 ~ 1202) | 76.6 (0.957 ~ 6128) | 0 (0 ~ 0.024) |
| Lag of Article Counts | 1.884 (0.018 ~ 202) | 0.597 (0.007 ~ 48.523) | 0.496 (0.006 ~ 40.977) | 0.009 (0 ~ 0.824) | 7570 (90.527 ~ 633095)* |
| Wind Research Subsidies | 1 (0.999 ~ 1.001) | 0.999 (0.997 ~ 1) | 1.003 (0.998 ~ 1.008) | 0.992 (0.984 ~ 0.999) | 1 (0.992 ~ 1.007) |
| Tax Credit Expenditures | 1 (1 ~ 1.001) | 1 (1 ~ 1.001) | 1.001 (1 ~ 1.001) | 1 (1 ~ 1.001) | 1 (0.999 ~ 1.001) |
| Renewable Energy Consumption | 1.001 (1 ~ 1.002) | 1.002 (1.001 ~ 1.003)* | 0.999 (0.998 ~ 1) | 1 (0.999 ~ 1.001) | 0.999 (0.998 ~ 1) |
| Electricity Prices | 1 (0.884 ~ 1.1) | 1.151 (1.048 ~ 1.266) | 1.157 (1.045 ~ 1.282) | 1.215 (1.066 ~ 1.384) | 1.026 (0.911 ~ 1.155) |
| Fossil Fuel Prices | 0.9 (0.8 ~ 1.1) | 1 (0.8 ~ 1.1) | 1.2 (1.081 ~ 1.4) | 0.841 (0.71 ~ 1) | 1 (0.811 ~ 1.3) |
| GDP | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1) | 1 (1 ~ 1)* |
| Percent of Deviance Explained | 11% | 17% | 16% | 15% | 24% |
| Deviance Explained by Research Subsidies | 0.0% | 0.8% | 0.2% | 1.1% | 0.0% |
| Deviance Explained by Tax Credit | 0.5% | 0.4% | 1.5% | 0.0% | 0.0% |