Naming Bad Performance: Can Performance Disclosure Drive Improvements?

Asmus Leth Olsen*
Department of Political Science, University of Copenhagen
ajlo@ifs.ku.dk

Invited to revise and resubmit at Journal of Public Administration Research and Theory (JPART). Presented at the Midwest Political Science Association (MPSA), session "The Psychology and Perception of Performance Measures", Chicago, 14th of April, 2012.

* Many people have provided valuable comments and suggestions to earlier drafts of this manuscript, including Hanne Foss Hansen, Jacob Gerner Hariri, Kasper M. Hansen, Peter Thisted Dinesen, Kristian Anker-Moeller, and Simon Calmar Andersen.

Abstract

If poor performance is disclosed, poorly performing organizations are likely to face reputational damage and therefore improve their performance. This simple hypothesis is tested with performance information on gender equality among Danish public organizations. The initiative aimed at measuring intra-organizational gender policies and actual gender representation at different organizational levels. A regression discontinuity design is proposed for estimating the causal effect of the disclosed performance on subsequent improvements. The analysis finds no or small effects of performance disclosure on subsequent improvements. Counter to expectations, some results even indicate that feedback about poor performance leads to further deterioration in performance. The results imply that performance information in low-salience policy areas can be ignored by the organizations under scrutiny. The effect of disclosing performance information is thus likely to be reliant on media coverage and public dissemination. Disclosing performance information is by itself no quick fix to improving organizational performance.

Keywords: performance information · naming and shaming · reputation

Sunlight is said to be the best disinfectant: electric light the most efficient policeman.
–Brandeis (1914, 62)[1]

[1] The quote by Justice Louis D. Brandeis has been highlighted by both Fisse and Braithwaite (1983) and Pawson (2002).

Does the disclosure of performance information affect organizational performance? The worldwide dissemination of public sector performance information is well documented and indicates that policy makers believe in the wonders of performance disclosure (Pawson 2002; Hood 2007; Dixon et al. 2008). Some argue that disclosure, by itself, will lead to improvements in performance: that is, even in the absence of formal incentives to improve, performance measures are believed to affect organizational performance solely by public disclosure (Pawson 2002; Bird et al. 2005; Mason and Street 2006). For instance, disclosure can affect organizational performance due to concerns over organizational reputation (Pawson 2002; Bevan and Hood 2006), political pressure (Pawson et al. 2005; Van de Walle and Roberts 2008), public image (Hibbard et al. 2003), embarrassment effects (Mayston 1985; Johnsen 2005), public humiliation (Le Grand 2010, 61), or risk avoidance and scapegoating (James and John 2007; Van de Walle and Roberts 2008). In the organizational classics, salient performance gaps and organizational failures have long been believed to stimulate performance improvements (Cyert and March 1963; Downs 1967). However, there are few empirical studies of the actual causal impact of performance information disclosure in the public sector, and the empirical findings are mixed (Heinrich 2002; Bird et al. 2005; Besley et al. 2009). The question remains: can performance be improved by naming the under-performing organizations?
In this article, I test the causal effect of disclosing performance information on subsequent organizational performance. I use the publication of performance information on gender equality within Danish state and local institutions. From 2003 onwards, the Danish government initiated a biennial publication on gender equality within state institutions and local governments. Empirically, this article adds to the existing research in two important respects: 1) by separating disclosure from other performance-changing mechanisms of performance information, and 2) by illustrating a design which allows one to separate the effect of reputational damage from other confounding factors.

First, the case allows for separating reputational effects from other mechanisms through which performance information can affect organizational performance. These alternative mechanisms include formal sanctions or rewards and threats of exit or voice (Hirschman 1970). Traditionally, the publication of performance information has been restricted to high-salience policy fields. These include health and education, which are characterized by strong mechanisms of exit or voice, and where performance is often linked to formal sanctions and rewards (Hood 2007). The question is how effective naming and shaming is when introduced to less salient policy areas where the audience for holding organizations accountable is more diffuse, the options and incentives to use exit or voice are unclear, and no formal sanctions or rewards are linked to the performance information. This is exactly the case for the gender equality performance measures in Denmark.

Second, a key obstacle to studying the effect of performance measures is their inherently endogenous nature. Multiple organizational factors affect both the measured organizational performance and the subsequent potential performance change. For instance, management incentives and organizational resources are potential confounding, unobservable factors which affect both the initially disclosed performance and subsequent responses. To meet this methodological challenge, I exploit a natural experiment in how the performance measures were calculated and presented. Specifically, a regression discontinuity design (RDD) is applied (Thistlethwaite and Campbell 1960). With the RDD, I exploit sharp cut-offs in how the performance on gender equality was publicized. By doing so, I am able to isolate the effect of poor performance feedback on subsequent performance from potential confounding factors. Current studies of performance measures rely mostly on observational data, and few address the fundamental problems of causal inference. Some have used various experimental designs in order to estimate the causal effect of performance measures on organizations and citizens (James 2011; Hibbard et al. 2005, 2003). With the natural experimental RDD, a pragmatic middle ground is found between rigorous causal inference and the ethical and political concerns surrounding experimental studies of performance measures.

The findings indicate that the isolated effect of performance information disclosure may be very limited to non-existent. Organizations outed as performing poorly in terms of gender equality saw no larger improvements in their performance than those that received more favorable ratings.
Theoretically, this challenges the idea of a performance-enhancing effect from merely naming poorly performing organizations. This is an important finding, which indicates that performance information does not only have performance-improving effects or induce unintended consequences. In fact, performance information can also be ignored and discarded by the organizations under scrutiny. From a societal point of view, the article questions whether the diffusion of performance disclosure to less salient policy areas is an effective way of enhancing performance and achieving policy goals.

The paper is structured in the following manner: first, I formulate a general hypothesis on how performance disclosure via reputational damage can affect organizational performance. Second, the empirical setting of the Danish gender equality reports is outlined. Third, a regression discontinuity design is proposed for testing the effect of potential reputational damage. Finally, the results are presented and implications for further studies of the effects of performance measures are discussed.

Performance Disclosure, Reputational Damage, and Improvements

The performance literature has multiple accounts of how performance measures can affect organizations both indirectly via exit or voice and directly via formal rewards or punishments. Citizens have been found to hold organizations accountable via performance information at the ballot box or through exit options (James and John 2007; Boyne et al. 2009; James 2011; Lavy 2010). Managers and street-level bureaucrats have been affected by performance pay schemes (Heckman et al. 1997; Lavy 2009) or general career concerns (Propper et al. 2010). However, it is less clear whether disclosing performance measures by itself can improve organizational performance. There are few empirical studies of the causal impact of disclosing performance information, and their empirical findings are mixed (Heinrich 2002; Bird et al. 2005; Besley et al. 2009). Bird et al. (2005, 18) question the effectiveness of disclosing performance measures and call for further research into how different dissemination strategies affect organizational behavior. Besley et al. (2009) look at "naming and shaming effects" inherent in the English hospital star rating system and find some evidence of improvements. In the case of school performance, Hanushek and Raymond (2005) found that there is only a positive subsequent achievement effect if formal consequences are put in place. On the other hand, Figlio and Rouse (2006) (also in a school context) find that the stigma of a low ranking among local schools had a greater impact on future improvements than the more formal threat of exit via vouchers. The strongest case for the effect of naming poor performance has been made by Hibbard et al. (2003, 2005) in an experimental study of US hospitals. One of the findings is that managers of poorly performing hospitals care about hospital reputation even in the absence of threats of exit and voice options (Hibbard et al. 2005).

The hypothesis put forward here draws on the notion that negative feedback from performance information can be a source of reputational damage to organizations. Importantly, this potential reputational damage gives managers and policy makers an incentive to improve organizational performance. The often cited concept of "naming and shaming" points to the idea that the simple disclosure of performance information can affect organizations (Pawson 2002; Bird et al. 2005).
'Naming' denotes the disclosure of performance measures which makes it possible to identify poorly performing organizations (Pawson 2002, 215). By publicizing organizations' performance, it becomes easier to identify and point fingers at presumably low-performing organizations (Pawson 2002). It has been argued that disclosing poor performance can generate political pressure (Van de Walle and Roberts 2008), raise concerns about public image (Hibbard et al. 2003), evoke an embarrassment effect (Mayston 1985; Johnsen 2005), humiliate employees (Le Grand 2010, 61), or lead to scapegoating (Johnsen 2012). Here I use reputational damage as a general description of the non-formal effects of disclosing performance measures (Hibbard et al. 2003, 2005; Bevan and Hamblin 2009).

Organizational theorists have no unified working definition of organizational reputation (Lange et al. 2011, 155). However, many accept that organizational reputation denotes the beliefs held by outsiders about what distinguishes an organization (Dutton et al. 1994). Following Carpenter and Krause (2012), organizational reputation can be defined as "a set of beliefs about an organization's capacities, intentions, history, and mission that are embedded in a network of multiple audiences." Reputation is a matter of being known for something: organizations have a varied set of reputations, corresponding to the different issues that are important to different audiences. Reputational effects are the initial effect of disclosing performance information (Bevan and Hood 2006, 519). In general, organizational reputation is shaped by the historical behavior of the organization and will be updated by new information (Lange et al. 2011, 154). Performance disclosure can affect organizational performance through realized reputational costs as well as through potential reputational damage (Pawson 2002, 216).

The organizational reaction to disclosed performance information can be seen as part of the blame avoidance game (Weaver 1986; Hood and Dixon 2010). The negativity bias held by most actors, not least citizens (Lau 1982) and the media (Soroka 2006), shifts the focus of policy makers and managers toward avoiding reputational damage rather than seeking credit. Hood and Dixon (2010) argue that public managers and policy makers care about blame if it damages their reputation in ways that will harm their careers and possibilities of promotion. Organizational studies find that damage to organizational reputation can affect employees and that this effect is even stronger for reputational damage caused by public dissemination (Dutton and Dukerich 1991; Sutton and Callahan 1987). For instance, Mannion et al. (2005) and Bevan and Hamblin (2009) show how the English star rating system for hospitals left the low-scoring hospitals with a feeling of shame among hospital employees. From this perspective, reputational damage does not only affect incentives to improve: improvements can also stem from how reputational damage activates professional norms about improving on bad performance (Hibbard et al. 2005, 1151). In both cases, organizations will go to great lengths to repair reputational damage. Given that the media, the public, and citizens are affected by a negativity bias, being named as performing poorly will cause more profound reputational damage (Johnsen 2012).
The experience of failure has been suggested to be a highly salient feedback signal which attracts organizational attention and increases the search for solutions (March and Simon 1958; Cyert and March 1963), reduces slack (Cyert and March 1963; Levinthal and March 1981), induces risk-taking (Kahneman and Tversky 1979), affects the motivation to adapt (Sitkin 1992), and increases the likelihood of organizational change in general (Greve 1998). Downs (1967, 191) has noted how performance gaps provide a motivation to look for alternative actions which are believed to satisfy aspirational levels (Johnsen 2012).

We have now established how disclosing poor performance measures can cause reputational damage and thereby turn on incentives (or professional norms) for improving organizational performance. This leads to the following hypothesis: Organizations which are publicly named for poor performance will improve their subsequent performance in order to avoid reputational damage. This simple hypothesis captures the central reasoning underlying government efforts to name and shame via performance disclosure. Thus, while it seems to represent a simplistic view of public organizations, it is at the same time highly relevant to policy making.

Empirical Setting: The Case of Danish Gender Reports

As noted in the introduction, it is difficult to isolate the effect of reputational damage from other mechanisms through which performance information can affect organizational performance. In addition, there are confounding factors which affect the disclosed performance, the assignment of reputational damage, and subsequent performance improvements. Given these difficulties, I test the proposed hypothesis with data from the case of the Danish gender reports. In the following, some context for this case is provided, and it is explained how it may be seen as a case for testing the isolated effect of disclosing poor performance.

Danish gender equality policy is located under the Ministry for Gender Equality.[2] In 2011 the Department of Gender Equality had a budget of approximately 3 million USD (DKK 14.1 million) and administered funds for different gender equality purposes of more than 20 million USD. In addition to developing the government's gender policy, the department is also responsible for collecting gender equality reports from state and local institutions. In 2001 the Danish parliament passed a law mandating local governments, counties, and larger state agencies to report a number of gender-policy-related figures every two years. This paper uses data for the period between 2003 and 2005. In 2003 it became mandatory for municipalities to report a number of indicators to the Ministry for Gender Equality every second year. The results for local and regional governments were published online in December 2003 and the results for central government institutions in February 2004. This was repeated in 2005 with new sets of indicators being published online.

[2] The minister is typically also co-responsible for another, larger ministerial portfolio in which the designated Department of Gender Equality is organizationally integrated.

The different indicators of the gender equality reports center on four aspects. First, gender policy indicators capture the extent to which an organization has defined gender-specific policies. Second, content and initiatives are directed towards the concrete actions and decisions an organization has taken to strengthen gender matters.
Third, leadership equality covers indicators for the degree of representation of women at different executive levels of the organization. Finally, for organizations with an independent political body (such as counties and municipalities), the degree of representation at the political level is measured.

Hibbard et al. (2003) have proposed four requirements for a performance scheme to be effective via reputational pathways.[3] First, the performance scheme should take the form of a ranking. In this case a simple color scheme was used to rank organizations with respect to their gender policy focus and achievements. The color red was given to the lowest scoring, yellow to the middle group, and green to the highest scoring.[4] Second, it should be publicized and disseminated to the public. The gender reports were published on a website with the sole purpose of reporting on the results of the reports.[5] Third, it should be easily digestible by the public. The gender reports applied traffic-light colored maps which make the results visually digestible for a wider audience. These traffic light colors are obviously not a coincidence. Colors have both physiological and psychological effects.[6] For municipalities and regional councils, political maps are presented with the colors. At the state level, organizational names are colored.[7] The mapping of results fosters comparison. The maps allowed visitors to quickly identify specific municipalities and instantly provided a means of comparison with neighboring municipalities. Finally, the performance information initiative should entail a follow-up with subsequent periods of measurement. The gender policy reports were from the outset planned as an ongoing process with reporting every second year. To date, performance data have been published for 2003, 2005, 2007, 2009, and 2011. Participating organizations were therefore well aware that they would be measured again on their gender policy efforts. At face value, one may conclude that the case of the Danish gender equality reports lives up to the standards set out by Hibbard et al. (2003).

[3] These requirements are also discussed extensively in Bevan and Hamblin (2009, 165).
[4] Green colored organizations are described as working extensively with equality and as having achieved results on gender equality. Yellow colored organizations work to some extent with gender equality and/or have achieved some results on the matter. Red colored organizations focus to a lesser extent on gender equality and/or have achieved fewer results.
[5] The main page today is www.ligestillingdanmark.dk. The reports for 2003 and 2005 can be found at www.2003.ligestillingidanmark.dk and www.2005.ligestillingidanmark.dk.
[6] For the psychological effects, the color red is among other things associated with negativity, anger, and danger. Green, on the other hand, engenders feelings of comfort, relaxation, and calmness. Color highlighting was also applied in the American Quality Counts reports of hospitals studied by Hibbard et al. (2003).
[7] By clicking on the map or an organizational name, the viewer could see the exact aggregate score underlying the coloring.

The case of the Danish gender policy reports represents a topic of lower policy salience than typically found in the research on performance disclosure. Most studies of naming and shaming efforts have been conducted for public services with a clearly defined user group to hold those responsible accountable.
These include settings such as hospital services, school performance, or general local government services (Hibbard et al. 2003; Lindenauer et al. 2007; Propper et al. 2010; Hemelt 2011). These policy areas have high salience, take up many resources, and see frequent interaction with a well-defined set of citizens (i.e. patients, kids, parents, etc.). Accordingly, citizens have incentives to vote with their feet or use their voice option in response to the reported performance (Hirschman 1970). In addition, the central government can attach formal rewards or punishments to how organizations perform. This is not the case for the Danish gender policy reports. Thus, the case allows us to separate the reputational effects of disclosing performance information from the alternative mechanisms of exit or voice as well as formal rewards or sanctions. In addition, the performance scheme resembles the ones applied in more high-salience policy areas. The case thus presents a clear trade-off. On the one hand, the low salience implies that findings for this particular case are difficult to generalize to more traditional high-salience areas of performance information such as health and education. On the other hand, the low-salience characteristic is what allows us to isolate potential reputational effects from the alternative mechanisms by which performance measures work.

Research Design: Regression Discontinuity Design

It is inherently difficult to estimate the effect of performance measures on some outcome. One can think of a number of confounding factors which affect to what extent an organization is measured as performing poorly, and how the organization changes behavior in response to such information. However, many of these factors cannot be observed or controlled for. Therefore, some researchers have used experimental designs to study the effects of performance information. For instance, James (2011) uses both field and laboratory experiments to estimate the causal impact of performance information on citizens' service satisfaction and vote intention. Hibbard et al. (2003) and Hibbard et al. (2005) use field experiments to test the causal impact of performance reports among US hospitals on quality improvements, market share, and reputation. The case of the Danish gender reports allows for a natural experimental approach to dealing with confounding factors. Natural experiments (sometimes) offer a straightforward and rigorous approach to causal inference and at the same time avoid some of the ethical and political concerns surrounding experimental studies of performance measures. Here I employ the regression discontinuity design (RDD) to separate the reputational effect on subsequent performance from potential confounding factors (Thistlethwaite and Campbell 1960; Olsen 2012). The design idea draws on Hemelt (2011), who used an RDD to measure the effects of a school accountability system introduced in the US. Comparisons with experimental benchmarks indicate that the RDD may provide very similar results to actual experimental estimates (Green et al. 2009; Berk et al. 2010; Shadish 2011). Importantly, the way each organization's gender performance was calculated and summarized provides a good setting for using the RDD. As already noted, in the gender policy reports the color labels red, yellow, and green are given to organizations on the basis of their score on a complex index. The index captures different gender policy-related issues.
Red denoted the lowest scoring, yellow the middle group, and green the highest scoring. The assignment was determined by sharp thresholds in the index score. This is important, because the RDD is useful for settings where a continuous assignment variable (i.e. the index score) deterministically assigns the treatment of interest (i.e. the color labels) to the subjects under study. The case thereby allows me to estimate the causal effect of negative performance disclosure as given by the assignment of color labels to organizations. Specifically, the colors are assigned to organizations by the following index score thresholds:

Gender Equality Color Scheme:
  Red label:    score ≤ 29.99
  Yellow label: 30 ≤ score ≤ 54.99
  Green label:  score ≥ 55

That is, all organizations with an index score below 30 are red, those with a score between 30 and 54.99 are assigned to yellow, and the few scoring 55 points or more receive a green mark. Such a step function for performance is common in accountability systems (Burgess and Ratto 2003, 287). In the first reporting year, 2003, the participating organizations scored an average of 38.3 with a standard deviation of 12.8 on the composite score. 93 organizations received a red label for their performance, 224 a yellow label, and 36 a green label. In the subsequent gender policy report of 2005, the average score was 36.4 with a standard deviation of 15.4. In terms of performance labels, 118 organizations were given a red label, 191 a yellow, and 44 a green label in 2005.[8]

[8] All the data used in the analysis were obtained directly from the Department of Gender Equality. In addition, the obtained data files were cross-checked with the actual results published online.

Given the theoretical outline of naming and shaming, I define those obtaining the red label as suffering the most reputational damage. That is, for organizations in the vicinity of the index scores assigning them the red or yellow label, the red label will constitute a larger degree of reputational damage compared to those ending up with a yellow label. In addition, I also test for differences between getting the mediocre color labeling of yellow and the top color of green. Formally, I test this proposition in the RDD framework by modeling the subsequent organizational performance as a function of the color labeling and the index assignment score in 2003:

Gender Equality Score_i,2005 = α + β1 · Red Label_i,2003 + β2 · Gender Equality Score_i,2003 + ε_i

The dependent variable, Gender Equality Score_i,2005, captures the gender policy score for each organization in 2005. The selection variable, Gender Equality Score_i,2003, is the aggregated gender policy score from 2003 which deterministically allocates a color label to an organization. Red Label_i,2003 constitutes the treatment dummy of receiving the gender policy label denoted by the color red (vs. yellow). A similar dummy is made to evaluate the difference between receiving a yellow color label and the green color label (Yellow Label_i,2003). Accordingly, β1 is the RDD estimate of the causal effect of falling below the threshold between a red and yellow label.[9]

[9] Importantly, as a local average treatment effect (LATE), the RDD estimate is restricted in its external validity to observations around the threshold.

To give β1 a causal interpretation, a number of different estimations must be made. First, the causal interpretation of β1 rests on a correctly specified functional form for the relationship between the assignment variable and the dependent variable. For instance, if the assignment score is modeled as a linear term but is in fact non-linear, β1 could reflect this and be biased.
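To fix ideas, the assignment rule and the baseline specification can be written out as a short sketch. This is a hypothetical illustration only: the data frame, the column names score_2003 and score_2005, and the use of Python/statsmodels are my assumptions and not part of the original estimation.

```python
import pandas as pd
import statsmodels.formula.api as smf

def color_label(score: float) -> str:
    """Sharp label assignment used in the gender equality reports (cut-offs at 30 and 55)."""
    if score <= 29.99:
        return "red"       # lowest-scoring group
    if score <= 54.99:
        return "yellow"    # middle group
    return "green"         # highest-scoring group

def baseline_rdd(df: pd.DataFrame):
    """Linear RDD: regress the 2005 score on a red-label dummy and the 2003 running variable."""
    # Keep only red- and yellow-labeled organizations, since yellow is the reference group.
    df = df[df["score_2003"] <= 54.99].copy()
    df["red_label"] = (df["score_2003"] <= 29.99).astype(int)
    model = smf.ols("score_2005 ~ red_label + score_2003", data=df)
    return model.fit(cov_type="HC1")  # heteroskedasticity-robust standard errors

# Usage: baseline_rdd(reports).params["red_label"] would be the estimated jump at the cut-off.
```

The coefficient on the red-label dummy in such a fit corresponds to β1, the discontinuity at the red/yellow threshold.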
To guard against such functional-form misspecification, the model is estimated with different functional assumptions about Gender Equality Score_2003, including a simple linear fit (as shown above), a quadratic, a cubic, and a fourth-order polynomial (Green et al. 2009). In addition, the models are also fitted with interactions, allowing not only different functional forms for the assignment variable but also different slopes on either side of the cut-off.[10] Second, a number of specifications are made with different subsets of the data around the threshold in order to make the causal estimate less dependent on a correct specification of the functional form (Green et al. 2009). Here the validity of the causal effect becomes a matter of choosing a sample of the data in the neighborhood of the discontinuity. The narrower the bandwidth, the smaller the chance that omitted variables will bias the RDD estimate of the causal effect.[11]

[10] In order to interpret the treatment dummy as the effect at the cut-off in the interactive models, the assignment variable has been centered around the threshold of 30 (Green et al. 2009).
[11] In the neighborhood of the threshold, all organizations should on average be equal on all relevant characteristics. On the other hand, the narrower the bandwidth, the smaller the amount of available data, and accordingly a less efficient estimate of the causal effect is derived. Here I apply three bandwidths of +/- 2.5, 5, and 7.5 composite points around the cut-off. For these estimations the dependent variable measures the change in the gender equality score, to make it comparable with the models where a lagged dependent variable is included.
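These specification checks can be sketched as follows. The snippet is again hypothetical: it builds on the data frame and column names assumed above and simply loops over the polynomial orders and bandwidths reported in the tables below.

```python
import statsmodels.formula.api as smf

FORMS = {
    "I: linear":       "score_2005 ~ red_label + centered",
    "II: interaction": "score_2005 ~ red_label * centered",
    "III: quadratic":  "score_2005 ~ red_label + centered + I(centered**2)",
    "IV: cubic":       "score_2005 ~ red_label + centered + I(centered**2) + I(centered**3)",
    "V: quartic":      "score_2005 ~ red_label + centered + I(centered**2) + I(centered**3) + I(centered**4)",
}

def functional_form_checks(df, cutoff=30.0):
    # Center the running variable at the cut-off so the red_label coefficient
    # can be read as the jump at the threshold, also in the interactive model.
    df = df[df["score_2003"] <= 54.99].assign(
        centered=lambda d: d["score_2003"] - cutoff,
        red_label=lambda d: (d["score_2003"] < cutoff).astype(int),
    )
    return {name: smf.ols(f, data=df).fit(cov_type="HC1") for name, f in FORMS.items()}

def bandwidth_checks(df, cutoff=30.0, bandwidths=(2.5, 5.0, 7.5)):
    # Local samples around the cut-off; the outcome is the 2003-2005 change in the score.
    df = df.assign(change=df["score_2005"] - df["score_2003"],
                   red_label=(df["score_2003"] < cutoff).astype(int))
    results = {}
    for h in bandwidths:
        local = df[df["score_2003"].between(cutoff - h, cutoff + h)]
        results[h] = smf.ols("change ~ red_label", data=local).fit(cov_type="HC1")
    return results
```

The same two functions can be rerun with a yellow-label dummy and a cut-off of 55 for the yellow/green comparison.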
In summary, figure 1 outlines the expectation that the color labeling affects reputation. The y-axis shows a hypothetical measure of the damage to an organization's reputation, and the x-axis shows the gender equality index score for 2003. The line represents the way in which the color labels translate the gender policy score into reputational damage. I expect the reputational damage to be more or less constant within color labels and only to shift as organizations move between the color labels.

[Figure 1: Theoretical Treatment of Reputational Damage Caused by the Color Labeling. Potential exposure to reputational damage (y-axis) plotted against the Gender Equality Score 2003 (x-axis), with steps at the red, yellow, and green label thresholds.]

The central assumption of the RDD is that around the threshold, the observations are assigned to treatment 'as if random'. That is, the organizations should not be able to affect whether they are given a value on the assignment variable just below or above the threshold of treatment. Accordingly, observations at one side of the threshold serve as the counterfactual for the observations at the opposite side of the threshold. In figure 1 this equals comparing organizations just in the vicinity of the color labeling thresholds. Here I see how organizations with almost identical gender policy scores are treated with very different levels of organizational blame given their assigned color label.

In order to validate the assumption of random assignment of organizations around the treatment labels, it is useful to look at the distribution of organizations around these thresholds (McCrary 2008). In figure 2 the distribution of observations around the two thresholds (red/yellow and yellow/green) is shown. The organizations are ordered in 1-point bins on the gender index score. If organizations were able to affect their labeling, one would expect them to sort themselves onto the right side of the threshold. Judging from the plots, this is not the case. This fits well with the fact that the participating organizations did not know the exact threshold values when providing the data for the reports. This supports the assumption of random assignment to color labels and provides a more credible causal interpretation of the effects of reputational damage from color labels.

[Figure 2: Distribution of Organizations in 1 Point Bins in the Vicinity of the Two Thresholds (density in 1-point bins). (a) The Red-Yellow Threshold (n=177); (b) The Yellow-Green Threshold (n=92).]
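A bare-bones version of this sorting check can be sketched by counting organizations in 1-point bins on either side of the cut-off. This is again a hypothetical illustration; a formal density test in the spirit of McCrary (2008) would require a dedicated implementation.

```python
import numpy as np
import pandas as pd

def sorting_check(df, cutoff=30.0, window=10.0, width=1.0):
    """Count organizations in 1-point bins around a cut-off; bunching just above
    the threshold would suggest that scores were manipulated."""
    local = df[df["score_2003"].between(cutoff - window, cutoff + window)]
    edges = np.arange(cutoff - window, cutoff + window + width, width)
    counts = pd.cut(local["score_2003"], bins=edges, right=False).value_counts().sort_index()
    below = int((local["score_2003"] < cutoff).sum())
    above = int((local["score_2003"] >= cutoff).sum())
    return counts, below, above
```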
Empirical Findings

First, I plot the lagged gender policy score against the change in gender policy score between 2003 and 2005. In figure 3a one can see a negative slope on the fitted line, which indicates that previously low-scoring organizations see larger increases in their gender policy score than high-scoring ones. The coefficient for the fitted line is −0.23 and highly significant (p < 0.01). This effect can also be evaluated in terms of differences in mean changes in the gender policy score between organizations with different color labels. Organizations with a red label in 2003 improve on average by 5.80 (p < 0.05) gender policy points in 2005 compared with those achieving a green label. For those with a yellow label, the improvement is 2.17 (p = 0.32) points compared with those with a green label. At face value this could indicate that the shameful red labels were effective at improving performance. However, both the negative slope and the differences in the average improvement between the different color labels can also be attributed to a simple regression-to-the-mean effect (Linden et al. 2006). In addition, there is a potential problem with confounding variables, as organizations with a red label are likely to be different from the high performers with a green label. In other words, a simple comparison of color labels and subsequent change in performance tells us very little about the effect of disclosing poor performers.

[Figure 3: Change in Gender Policy Score 2003-2005 plotted against the Gender Equality Policy Score 2003. (a) Simple linear regression plot; (b) Regression discontinuity plot with separate fits for the red, yellow, and green labels.]

As already noted, these issues can be avoided in the RDD setup. With the RDD I avoid problems of regression to the mean because I only seek to identify jumps around specific thresholds and not along the whole spectrum of the data (Thistlethwaite and Campbell 1960, 315).[12] In addition, discontinuities around color labels are less likely to be influenced by confounding factors, as organizations could not sort themselves into color labels in the close vicinity of the color label thresholds.

[12] See Linden et al. (2006) for a discussion of regression-to-the-mean effects in the RDD.

Next, I conduct a visual inspection of the possible discontinuities around the treatment thresholds of the various color labels. Any shifts in the line around the color label thresholds indicate that an effect exists in the subsequent change in gender equality score for organizations which initially received different color labels. In figure 3b the plot is made with separate smooth lines for all three color labels. Around the thresholds for the different colors there are no particular shifts in the lines. If anything, the organizations which narrowly obtained a better color label performed slightly better in 2005 compared to those who fell short of a more attractive color label.

In table 1 the effects are tested more formally. The different specifications vary the functional form of the assignment variable and the analyzed range of data around the color label thresholds.

Table 1: RDD estimates: effect of red labeling

                                 Discontinuity samples
                                 +/- 2.5    +/- 5      +/- 7.5
Intercept                         3.40       3.07       1.50
                                 (2.86)     (1.84)     (1.50)
Red Label (ref: Yellow Label)    -2.83      -3.69      -2.38
                                 (3.31)     (2.39)     (2.05)
N                                 45         79         112
adj. R2                          -0.01       0.02       0.00
Resid. sd                         10.99      10.58      10.95

                                 Varying the functional form for assignment
                                 I          II         III        IV         V
Intercept                         32.39**    32.76**    32.42**    31.77**    31.79**
                                 (1.36)     (1.62)     (1.39)     (1.74)     (1.83)
Red Label (ref: Yellow Label)    -4.73      -4.34      -4.48      -3.42      -3.38
                                 (2.41)     (2.30)     (2.30)     (2.92)     (2.84)
Gender Policy Score               0.55**     0.52**     0.57**     0.68**     0.68**
                                 (0.12)     (0.15)     (0.11)     (0.23)     (0.22)
Red Label*Gender Policy Score                0.14
                                            (0.22)
Gender Policy Score^2                                   0.002     -0.006     -0.00
                                                       (0.005)    (0.005)    (0.01)
Gender Policy Score^3                                             -0.0002    -0.00
                                                                  (0.0004)   (0.00)
Gender Policy Score^4                                                         0.00
                                                                             (0.00)
N                                 317        317        317        317        317
adj. R2                           0.29       0.29       0.29       0.29       0.29
Resid. sd                         11.86      11.87      11.87      11.89      11.91

OLS regressions - I: Linear, II: Linear with interaction, III: Quadratic, IV: Cubic, V: Fourth order polynomial. Robust standard errors in parentheses. ** p < 0.01 and * p < 0.05 (both two-tailed tests).

For the red label there are consistently insignificant, negative effects on the subsequent change in gender policy score when compared with organizations that narrowly received a yellow label. The results thereby provide no support for the hypothesized effect. If anything, the naming and shaming of the red labeling made the affected organizations perform worse compared with the counterfactual of those given the yellow label. In table 2 the effects are tested for the yellow-green threshold. Here there is a similar pattern. The difference is negative for the less fortunate label but for the most part insignificant across the various specifications. The coefficient varies in magnitude from −4.81 to −13.95.
That is, the deterioration in performance is slightly larger for organizations which fell short of the green label compared to organizations that narrowly received it. Overall, the analysis does not lend support to the hypothesis that disclosing the poor performance of public organizations stimulated subsequent improvements in performance.

Table 2: RDD estimates: effect of yellow labeling

                                 Discontinuity samples
                                 +/- 2.5    +/- 5      +/- 7.5
Intercept                         0.35      -2.76      -1.58
                                 (2.81)     (2.12)     (1.77)
Yellow Label (ref: Green Label)  -10.39     -8.28      -4.81
                                 (6.19)     (4.43)     (3.06)
N                                 20         39         70
adj. R2                           0.04       0.04       0.01
Resid. sd                         16.63      15.77      15.04

                                 Varying the functional form for assignment
                                 I          II         III        IV         V
Intercept                         53.30**    53.52**    53.02**    53.11*     57.09**
                                 (2.00)     (2.42)     (2.42)     (2.37)     (2.84)
Yellow Label (ref: Green Label)  -7.64*     -7.83*     -7.26      -6.81      -13.95*
                                 (3.59)     (3.43)     (4.17)     (4.84)     (6.01)
Gender Policy Score               0.51**     0.48       0.55*      0.60      -0.43
                                 (0.14)     (0.33)     (0.23)     (0.32)     (0.60)
Red Label*Gender Policy Score                0.04
                                            (0.37)
Gender Policy Score^2                                   0.002     -0.00      -0.04*
                                                       (0.000)    (0.00)     (0.02)
Gender Policy Score^3                                             -0.00       0.00
                                                                  (0.00)     (0.00)
Gender Policy Score^4                                                         0.00*
                                                                             (0.00)
N                                 260        260        260        260        260
adj. R2                           0.25       0.25       0.25       0.24       0.26
Resid. sd                         12.07      12.09      12.09      12.11      12.01

OLS regressions - I: Linear, II: Linear with interaction, III: Quadratic, IV: Cubic, V: Fourth order polynomial. Robust standard errors in parentheses. ** p < 0.01 and * p < 0.05 (both two-tailed tests).

The next step in the analysis is to test for heterogeneity in the effects. Specifically, the gender reports included a very varied set of institutions, ranging from state-level ministries to local and county governments. For this reason, I set out to test whether the effects varied between these two very different groups of institutions. In figures 4a and 4b the effects are plotted for local and state organizations separately. For local organizations the aggregate pattern seems to hold, with no sign of shifts in the lines around the thresholds. However, for state organizations a shift is found for organizations around the yellow-green threshold. Here organizations marginally failing to obtain a green label see a larger deterioration in the subsequent 2005 score compared to those just receiving a green label (as indicated by the upward shift in the line above the threshold).

[Figure 4: Change in Gender Policy Score 2003-2005 plotted against the Gender Policy Score 2003. (a) Local and county organizations; (b) State organizations.]

Formally, I therefore specify separate models for the subsets of local/county organizations on the one hand and state institutions on the other. In table 3 only the main effects of interest are reported (i.e. the coefficients of the dummy variable for each color labeling). First, there are no significant effects for the local/county organizations.
In addition, most of the coefficients are negative, which indicates that local and county jurisdictions that got a red label performed slightly worse than those that obtained a yellow label. A similar pattern is found for the threshold between obtaining the yellow and green label. There is no significant improvement effect of obtaining the less favorable rating of the yellow label, and most coefficients are even negative. However, for the state institutions one can see some significant differences. As indicated by the plot, organizations failing to obtain a green label seem to deteriorate much more in their performance compared to those that obtained a green label. The coefficients are relatively stable and vary across the different specifications from −9.39 to −27.83. This effect runs contrary to our expectation and seems to suggest that getting very favorable performance feedback can help sustain high performance. This being said, the initial plot indicated that the effect is mostly driven by a small number of organizations which fell below the green label threshold and saw sharp decreases in their 2005 gender policy score. I will therefore be cautious with drawing any strong conclusions from this finding, but rather emphasize that the sub-level analysis also rejects the expected performance-enhancing effect of disclosing poor performance results.

Table 3: RDD estimates: subgroups of local/county and state organizations

                                 Discontinuity sample           Parametric specifications
                                 +/- 2.5  +/- 5   +/- 7.5   I       II      III     IV      V
Local and County Level:
Red Label (ref: Yellow Label)    -4.40    -4.51   -1.93     -2.74   -2.85   -3.88   -6.47   -4.80
                                 (4.18)   (2.92)  (2.40)    (2.72)  (2.70)  (2.80)  (3.46)  (3.53)
N                                 32       59      84        227     227     227     227     227
adj. R2                           0.01     0.02   -0.00      0.35    0.35    0.35    0.35    0.36
Resid. sd                         11.46    11.11   11.03     11.09   11.11   11.09   11.08   11.07
Yellow Label (ref: Green Label)   3.68    -2.16   -3.01     -4.34   -3.91   -0.51    1.77   -1.35
                                 (2.83)   (4.58)  (3.13)    (4.11)  (3.73)  (4.31)  (4.60)  (4.97)
N                                 9        22      43        181     181     181     181     181
adj. R2                           0.02    -0.04   -0.01      0.32    0.31    0.32    0.32    0.32
Resid. sd                         4.89     12.97   12.17     11.38   11.41   11.37   11.34   11.35
State Level:
Red Label (ref: Yellow Label)     1.65    -0.49   -3.36     -8.11   -6.51   -6.08    0.28   -0.42
                                 (5.20)   (3.71)  (3.87)    (4.61)  (4.19)  (4.24)  (4.95)  (4.85)
N                                 13       20      28        90      90      90      90      90
adj. R2                          -0.08    -0.05   -0.01      0.18    0.19    0.20    0.22    0.21
Resid. sd                         9.94     8.71    10.66     13.44   13.35   13.28   13.16   13.21
Yellow Label (ref: Green Label)  -22.19   -17.85* -9.39     -14.47* -16.79* -21.38* -22.00* -27.83*
                                 (9.84)   (8.27)  (6.38)    (6.43)  (6.40)  (7.74)  (8.43)  (10.28)
N                                 11       17      27        79      79      79      79      79
adj. R2                           0.20     0.17    0.03      0.16    0.15    0.17    0.16    0.17
Resid. sd                         19.04    17.70   18.50     13.14   13.19   13.03   13.10   13.07

The same set of control variables is included as in tables 1 and 2. OLS regressions - I: Linear, II: Linear with interaction, III: Quadratic, IV: Cubic, V: Fourth order polynomial. Robust standard errors in parentheses. ** p < 0.01 and * p < 0.05 (both two-tailed tests).

Finally, a number of robustness checks were made. First, all models were also estimated as Tobit regressions. With Tobit models one can define whether the data are left censored, right censored, or both. For the case of the gender reports, the gender score was bounded in the interval between 0 and 100. That is, it is impossible for organizations to get negative gender policy scores as well as scores of more than 100. Tobit regressions for this interval reveal substantially similar results to the ones already reported. In addition, a sensitivity check was made to test for effects due to changes of organization names. From 2003 to 2005, around 11 organizations changed their names, in part because of organizational amalgamations. To make sure that dramatic changes in gender policy performance were not correlated with these changes, I estimated all the models without these 11 organizations. Again, the models were substantially similar to the full models. Both the robustness checks and the sensitivity analysis can be obtained from the author.
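As an illustration of the Tobit robustness check, a censored-normal likelihood can be maximized directly. The sketch below is hypothetical and hand-rolled (statsmodels ships no standard Tobit estimator, and the article does not report how its Tobit models were estimated); it assumes the score is censored at 0 and 100.

```python
import numpy as np
from scipy import optimize, stats

def tobit_fit(X, y, lower=0.0, upper=100.0):
    """Maximum likelihood estimates for a two-sided Tobit model.
    X: (n, k) design matrix including a constant column; y: (n,) outcome."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    at_lower = y <= lower
    at_upper = y >= upper
    interior = ~(at_lower | at_upper)

    def neg_loglik(params):
        beta, log_sigma = params[:-1], params[-1]
        sigma = np.exp(log_sigma)            # keeps sigma positive
        xb = X @ beta
        ll = np.zeros_like(y)
        ll[interior] = stats.norm.logpdf(y[interior], xb[interior], sigma)
        ll[at_lower] = stats.norm.logcdf((lower - xb[at_lower]) / sigma)
        ll[at_upper] = stats.norm.logsf((upper - xb[at_upper]) / sigma)
        return -ll.sum()

    # OLS starting values, then numerical maximization of the censored likelihood
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    start = np.append(beta0, np.log(y.std()))
    res = optimize.minimize(neg_loglik, start, method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])     # coefficient estimates, sigma
```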
Conclusion and Limitations of the Study

The world-wide dissemination of public sector performance disclosure is well documented. However, its ability to improve organizational performance via reputational damage is mostly undocumented. Still, many policy makers believe in the wonders of performance disclosure and in the idea that pointing to areas of improvement by itself has beneficial effects. This article tested this powerful line of thought with data from the Danish gender equality reports. The data allowed me to separate reputational effects from other mechanisms through which performance information can affect organizational performance (i.e. formal sanctions or rewards, and threats of exit or voice). In addition, a novel regression discontinuity design made it possible to make causal claims about the (non-)effect of reputational damage by reducing the possibility of confounding factors and regression-to-the-mean effects. The findings showed that the effect of potential reputational damage on subsequent performance improvements was limited to non-existent. Organizations which narrowly received poor performance results did not improve more in their subsequent performance compared to organizations which narrowly received much more favorable performance feedback. This leads to a rejection of the powerfully simple and highly policy-relevant hypothesis that reputational damage induced by naming the poorly performing organizations should affect future organizational efforts to improve.

However, one has to consider possible alternative explanations for the non-effects of disclosing poor performance. The question relates to how the empirical setting limits the scope of our findings. As noted earlier, the case of the Danish gender reports is to be considered a hard test of the potential performance-enhancing effects of reputational damage. Intra-organizational gender equality and gender policy are relatively low-salience issues. If I had encountered strong effects, it would indicate that pure reputational damage could be effective at encouraging poorly performing organizations to perform better even in low-salience settings. The low-salience character of the case allowed us to isolate the reputational concerns from indirect and formal concerns of exit, voice, or punishments. However, the low-salience status may also have affected the media attention and public dissemination of the performance results.
Performance information is in constant competition for attention with other forms of feedback which organizations receive. Under bounded rationality, attention is limited, and organizations are therefore not equally responsive to all types of feedback they get from the environment (Cyert and March 1963). Low salience can also lower the reputational costs of poor performance. If the reputational costs of being named and shamed are particularly low, managers will tend to ignore it and focus their efforts on other matters (Hood and Dixon 2010). Both the lack of organizational attention and the low costs point to the importance of disseminating the results of performance information to the public. The politics of organizational attention is shaped by the media attention given to policy issues (Baumgartner and Jones 2005). In fact, if one searches print and online news coverage of the gender policy reports, the public dissemination of the results turns out to be fairly limited. In the Danish media database 'Infomedia', which contains all major print and online news outlets, one may search for specific key words within a given time period. The results for local and county governments were released on the 10th of December 2003, and the state results were released about a month later. From the 10th of December 2003 to the 1st of March 2004, around 694 articles mentioned gender equality in some form. However, only 11 of them referred explicitly to the gender equality reports, and only six articles referred to the online address of the gender equality reports (i.e. www.ligestillingidanmark.dk). This very limited media attention to the issue of the gender equality reports shows that low reputational costs may account for the non-effects. While the poorly performing organizations were disclosed to the public and thereby "named" for their poor performance, they were not to any larger extent "shamed" or "blamed" by the media or the wider public. This is a weak point for our conclusions, as the current studies which find strong support for reputational effects all emphasize the importance of wide public dissemination of the results (Hibbard et al. 2003, 2005).

In addition, one can also think of a few other reasons for the non-effect of disclosing poor performance. For instance, organizations may have relied on the blame-avoiding strategy of circling the wagons (Weaver 1986). Here organizations protect themselves and provide political cover by ignoring potential blame attribution. One could say that putting effort into changing a poor outcome is indirectly a way of acknowledging that there is a problem, which by itself can attract blame. Staying under the radar and doing nothing, in case of potential blame attribution, may be a better choice than engaging in improvement efforts. Also, Pawson (2002) has put forward the idea that naming can fail if organizations either passively accept the label they have been given, ignore it, or reject it. In summary, while most studies focus on either the potential performance improvements or the unintended consequences caused by performance information, one should perhaps increasingly look at the extent to which performance information is acted upon or ignored by organizations. The main conclusion of this study is that we should not automatically expect the disclosure of performance information to affect organizational change efforts. Disclosing performance information is by itself no quick fix to improving organizational performance.

References

Baumgartner, F. R. and B. D.
Jones (2005). The politics of attention: How government prioritizes problems. Chicago: The University of Chicago Press.

Berk, R., G. Barnes, L. Ahlman, and E. Kurtz (2010). When second best is good enough: a comparison between a true experiment and a regression discontinuity quasi-experiment. Journal of Experimental Criminology 6 (2), 191–208.

Besley, T. J., G. Bevan, and K. B. Burchardi (2009). Naming & Shaming: The impacts of different regimes on hospital waiting times in England and Wales. Centre for Economic Policy Research.

Bevan, G. and R. Hamblin (2009). Hitting and missing targets by ambulance services for emergency calls: effects of different systems of performance measurement within the UK. Journal of the Royal Statistical Society: Series A (Statistics in Society) 172 (1), 161–190.

Bevan, G. and C. Hood (2006). What's measured is what matters: targets and gaming in the English public health care system. Public Administration 84 (3), 517–538.

Bird, S. M., D. Cox, V. T. Farewell, H. Goldstein, T. Holt, and P. C. Smith (2005). Performance indicators: good, bad, and ugly. Journal of the Royal Statistical Society: Series A (Statistics in Society) 168 (1), 1–27.

Boyne, G. A., O. James, P. John, and N. Petrovsky (2009). Democracy and Government Performance: Holding Incumbents Accountable in English Local Governments. The Journal of Politics 71 (4), 1273–1284.

Brandeis, L. D. (2009/1914). Other People's Money - and How Bankers Use It. Mansfield Centre, CT: Martino Publishing.

Burgess, S. and M. Ratto (2003). The role of incentives in the public sector: Issues and evidence. Oxford Review of Economic Policy 19 (2), 285–300.

Carpenter, D. P. and G. A. Krause (2012). Reputation and Public Administration. Public Administration Review 72 (1), 26–32.

Cyert, R. M. and J. G. March (1963). A behavioral theory of the firm. Englewood Cliffs, NJ: Prentice-Hall.

Dixon, R., C. Hood, and L. R. Jones (2008). Ratings and rankings of public service performance: Special issue introduction. International Public Management Journal 11 (3), 253–255.

Downs, A. (1967). Inside bureaucracy. Boston, MA: Little, Brown and Co.

Dutton, J. E. and J. M. Dukerich (1991). Keeping an Eye on the Mirror: Image and Identity in Organizational Adaptation. The Academy of Management Journal 34 (3), 517–554.

Dutton, J. E., J. M. Dukerich, and C. V. Harquail (1994). Organizational Images and Member Identification. Administrative Science Quarterly 39 (2), 239–263.

Figlio, D. and C. Rouse (2006). Do accountability and voucher threats improve low-performing schools? Journal of Public Economics 90 (1-2), 239–255.

Fisse, B. and J. Braithwaite (1983). The impact of publicity on corporate offenders. Albany: State University of New York Press.

Green, D. P., T. Y. Leong, H. L. Kern, A. S. Gerber, and C. W. Larimer (2009). Testing the Accuracy of Regression Discontinuity Analysis Using Experimental Benchmarks. Political Analysis 17 (4), 400–417.

Greve, H. R. (1998). Performance, Aspirations, and Risky Organizational Change. Administrative Science Quarterly 43 (1), 58–86.

Hanushek, E. A. and M. E. Raymond (2005). Does school accountability lead to improved student performance? Journal of Policy Analysis and Management 24 (2), 297–327.

Heckman, J., C. Heinrich, and J. Smith (1997). Assessing the Performance of Performance Standards in Public Bureaucracies. The American Economic Review 87 (2), 389–395.

Heinrich, C. J. (2002).
Outcomes-Based Performance Management in the Public Sector: Implications for Government Accountability and Effectiveness. Public Administration Review 62 (6), 712–725.

Hemelt, S. W. (2011). Performance effects of failure to make Adequate Yearly Progress (AYP): Evidence from a regression discontinuity framework. Economics of Education Review 30 (4), 702–723.

Hibbard, J. H., J. Stockard, and M. Tusler (2003). Does Publicizing Hospital Performance Stimulate Quality Improvement Efforts? Health Affairs 22 (2), 84–94.

Hibbard, J. H., J. Stockard, and M. Tusler (2005). Hospital Performance Reports: Impact On Quality, Market Share, And Reputation. Health Affairs 24 (4), 1150–1160.

Hirschman, A. O. (1970). Exit, voice, and loyalty: Responses to decline in firms, organizations, and states. Cambridge, MA: Harvard University Press.

Hood, C. (2007). What happens when transparency meets blame-avoidance? Public Management Review 9 (2), 191–210.

Hood, C. and R. Dixon (2010). The Political Payoff from Performance Target Systems: No-Brainer or No-Gainer? Journal of Public Administration Research and Theory 20 (suppl 2), i281–i298.

James, O. (2011). Performance Measures and Democracy: Information Effects on Citizens in Field and Laboratory Experiments. Journal of Public Administration Research and Theory 21 (3), 399–418.

James, O. and P. John (2007). Public Management at the Ballot Box: Performance Information and Electoral Support for Incumbent English Local Governments. Journal of Public Administration Research and Theory 17 (4), 567–580.

Johnsen, A. (2005). What Does 25 Years of Experience Tell Us About the State of Performance Measurement in Public Policy and Management? Public Money & Management 25 (1), 9–17.

Johnsen, A. (2012). Why Does Poor Performance Get So Much Attention in Public Policy? Financial Accountability & Management 28 (2), 121–142.

Kahneman, D. and A. Tversky (1979). Prospect theory: An analysis of decision under risk. Econometrica 47 (2), 263–291.

Lange, D., P. M. Lee, and Y. Dai (2011). Organizational reputation: A review. Journal of Management 37 (1), 153–184.

Lau, R. R. (1982). Negativity in Political Perception. Political Behavior 4 (4), 353–377.

Lavy, V. (2009). Performance Pay and Teachers' Effort, Productivity, and Grading Ethics. American Economic Review 99 (5), 1979–2011.

Lavy, V. (2010). Effects of free choice among public schools. Review of Economic Studies 77 (3), 1164–1191.

Le Grand, J. (2010). Knights and Knaves Return: Public Service Motivation and the Delivery of Public Services. International Public Management Journal 13 (1), 56–71.

Levinthal, D. and J. G. March (1981). A model of adaptive organizational search. Journal of Economic Behavior & Organization 2 (4), 307–333.

Linden, A., J. L. Adams, and N. Roberts (2006). Evaluating disease management programme effectiveness: an introduction to the regression discontinuity design. Journal of Evaluation in Clinical Practice 12 (2), 124–131.

Lindenauer, P. K., D. Remus, S. Roman, M. B. Rothberg, E. M. Benjamin, A. Ma, and D. W. Bratzler (2007). Public reporting and pay for performance in hospital quality improvement. New England Journal of Medicine 356 (5), 486–496.

Mannion, R., H. Davies, and M. Marshall (2005). Impact of star performance ratings in English acute hospital trusts. Journal of Health Services Research & Policy 10 (1), 18–24.

March, J. G. and H. A. Simon (1958). Organizations. New York: Wiley.

Mason, A. and A. Street (2006).
Publishing outcome data: is it an effective approach? Journal of Evaluation in Clinical Practice 12 (1), 37–48.

Mayston, D. J. (1985). Non-profit performance indicators in the public sector. Financial Accountability & Management 1 (1), 51–74.

McCrary, J. (2008). Manipulation of the running variable in the regression discontinuity design: A density test. Journal of Econometrics 142 (2), 698–714.

Olsen, A. L. (2012). Regression Discontinuity Designs in Public Administration: The Case of Performance Measurement Research. Paper presented at the ECPR Joint Sessions, Workshop 5: Citizens and Public Service Performance: Demands, Responses and Changing Service Delivery Mechanisms, Antwerp, 11-15 April 2012.

Pawson, R. (2002). Evidence and policy and naming and shaming. Policy Studies 23 (3), 211–230.

Pawson, R., T. Greenhalgh, G. Harvey, and K. Walshe (2005). Realist review: a new method of systematic review designed for complex policy interventions. Journal of Health Services Research & Policy 10, 21–35.

Propper, C., M. Sutton, C. Whitnall, and F. Windmeijer (2010). Incentives and targets in hospital care: evidence from a natural experiment. Journal of Public Economics 94 (3-4), 318–335.

Shadish, W. R. (2011). Randomized Controlled Studies and Alternative Designs in Outcome Studies. Research on Social Work Practice 21 (6), 636–643.

Sitkin, S. B. (1992). Learning through failure: The strategy of small losses. In Research in Organizational Behavior, Volume 14, pp. 231–266. Greenwich, CT: JAI Press.

Soroka, S. N. (2006). Good News and Bad News: Asymmetric Responses to Economic Information. Journal of Politics 68 (2), 372–385.

Sutton, R. I. and A. L. Callahan (1987). The stigma of bankruptcy: Spoiled organizational image and its management. Academy of Management Journal 30 (3), 405–436.

Thistlethwaite, D. L. and D. T. Campbell (1960). Regression-Discontinuity Analysis: An Alternative to the Ex Post Facto Experiment. The Journal of Educational Psychology 51 (6), 309–317.

Van de Walle, S. and A. S. Roberts (2008). Publishing Performance Information: An Illusion of Control? Working paper.

Weaver, R. K. (1986). The Politics of Blame Avoidance. Journal of Public Policy 6 (4), 371–398.

Appendix: Map of the results from the Gender reports of 2003.