Supplemental Digital Content 1: Where do the probabilities come from? A brief nontechnical introduction to Markov Chain Monte Carlo parameter estimation

In our context, a parameter is a measurable quantity that determines the specific form of a model. To illustrate, assume we believe sample data come from a normal probability distribution with an unknown mean m and standard deviation s. The general model is that the data are normally distributed. However, to use the model to describe the data or to make inferences about the population from which the data came, we have to estimate values for the two model parameters m and s. Letting X be the data, we could write this model as X ~ N(m, s), where N signifies that the data are from a normal probability distribution and m and s are the unknown parameters of this distribution. We could also represent the model in terms of a likelihood function, L(m, s; X). The likelihood function is the probability of the data given the model parameters, which can also be written as P(X | m, s). In our case, L(m, s; X) is a normal probability density with mean m and standard deviation s.

A common method for estimating the unknown model parameters is to find the values of m and s that maximize the likelihood function, that is, the values of m and s that make the likelihood of seeing what we saw (the data) as large as possible. This is called the maximum likelihood approach. One can often solve analytically for the parameter values that maximize the likelihood; when the likelihood function is complicated, there are algorithms that converge to the maximum likelihood estimates of the unknown parameters.

As an alternative to the maximum likelihood approach, one can use a Bayesian approach. Underlying this approach is Bayes’ theorem, which in its simplest form can be stated as follows:

P(parameters | data) = P(data | parameters) × P(parameters) / P(data)

The probability of the data given the parameters, P(data | parameters), is the likelihood function. The probability of the parameters, P(parameters), is called the prior probability density function. The probability of the parameters given the data is called the posterior probability density function. The probability of the data, P(data), is a constant, called a “normalizing” constant, which makes the posterior distribution a legitimate probability distribution (technically, a distribution that integrates to 1).

The logic of Bayes’ theorem is that you start with some guesses about the probability distributions of the unknown parameters and then modify those initial guesses based on the data. If your prior guesses are very vague, the posterior distributions are largely determined by the data; if you hold strong prior beliefs, it takes a lot of data to move the posterior distribution far from your prior estimates. For example, one might guess that the mean is equally likely to be anywhere between -1,000 and +1,000 (i.e., it follows a uniform distribution over this range) and that the standard deviation could be anywhere between 0 and 10,000. With these types of vague or “flat” prior distributions, the posterior distributions are largely determined by the data.

In non-Bayesian statistics, one makes inferences based on a sampling distribution (i.e., a probability distribution that describes how a sample statistic varies), where the values of the key parameters of this sampling distribution are those specified in the null hypothesis.
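To make these ideas concrete, the following short Python sketch computes the maximum likelihood estimates for a normal model and then applies Bayes’ theorem numerically on a grid with a flat prior for the mean. The simulated data and the “true” values m = 5 and s = 2 are assumptions of the illustration, not part of the analysis described in this article.

import numpy as np

rng = np.random.default_rng(seed=1)
data = rng.normal(loc=5.0, scale=2.0, size=100)  # hypothetical sample; true m = 5, s = 2

# Maximum likelihood: for a normal model, L(m, s; X) is maximized by the
# sample mean and the sample standard deviation (ddof = 0 form).
m_mle = data.mean()
s_mle = data.std()

# Bayes' theorem on a grid: posterior is proportional to likelihood * prior.
# The prior for m is flat (uniform) between -1,000 and +1,000; s is fixed at
# its maximum likelihood estimate to keep the sketch one-dimensional.
m_grid = np.linspace(-1000.0, 1000.0, 40001)
log_lik = (-0.5 * ((data[:, None] - m_grid[None, :]) / s_mle) ** 2).sum(axis=0)
posterior = np.exp(log_lik - log_lik.max())  # flat prior drops out of the product
posterior /= posterior.sum()                 # the P(data) step: normalize to 1

m_post_mean = (m_grid * posterior).sum()
print(m_mle, s_mle, m_post_mean)

With such a vague prior, the posterior mean essentially reproduces the maximum likelihood estimate, which is exactly the point made above about flat priors.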
The Bayesian approach estimates the relevant parameters directly from the data rather than from a null hypothesis. As with maximum likelihood estimation, in some cases one can solve analytically for the posterior distributions. In many cases this is not possible, however, and a simulation approach called Markov Chain Monte Carlo (MCMC) is used. Essentially, you simulate draws from the posterior distribution many times, compute the average values of m and s, and use these averages as estimates of the unknown parameters. The Monte Carlo part of the method refers to the random draws from the posterior distribution. The Markov Chain part refers to the fact that the selected values form a Markov chain (i.e., a random process in which future values depend only on the current state of the system and not on the path taken to arrive at that state) whose distribution converges to the posterior distribution of the parameters.

To illustrate, suppose someone tells you they rolled 3 dice and the sum turned out to be 10. What is the probability that the values of the dice are 3, 5, and 2? To answer this question, you could simulate the values of 3 dice conditional on their sum being 10. The connection to the statistical problem is that the requirement that the dice sum to 10 plays the role of the data we can see; the three rolls are the parameters we cannot see. Because of the condition, the values of the simulated draws are dependent on each other: the roll of the first die constrains the permissible values of the second and third dice. One way to do the simulation would be to roll 3 dice repeatedly, keep only the rolls whose sum equals 10, and record the 3 values each time this happens. Among all the rolls with a sum of 10, you could then calculate the proportion where the values are 3, 5, and 2. But this wastes a lot of rolls, since only 27 of the 216 equally likely outcomes (about one roll in eight) sum to exactly 10.

MCMC methods can make this process far more efficient. We start by arbitrarily selecting numbers for the 3 dice that add up to 10, say 4, 5, and 1. At each step in the process, we choose two dice at random (say the last 2) and roll one of them (say the 2nd). We then force the other chosen die (the 3rd) to take whatever value makes the sum equal 10. For example, if the second die turns up 3, then, because the first die shows 4, we force the 3rd die to be 3 (10 - 4 - 3 = 3). If no value between 1 and 6 for the 3rd die makes the sum equal 10, we return all the dice to their previous values. We then repeat this process over and over. After a large number of repetitions, in every one of which the sum of the dice equals 10, it turns out that the values on the dice will have a distribution very close to their posterior distribution given the condition that the sum equals 10. Notice that at the beginning of this process the values of the dice depend heavily on their initial values; after many repetitions, the impact of those initial values fades away. This MCMC approach for simulating a conditional distribution is called the “Gibbs sampler.” (Both the brute-force approach and the Gibbs sampler are sketched in code below.)

As noted, the analogy to the statistical model above is to treat the unknown parameters as the dice and the observed data as the condition that the sum equals 10. At each step, one parameter at a time is chosen and randomly resampled according to its conditional distribution given all the other parameters and the data values (this is the equivalent of rolling one of the 2 randomly selected dice).
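Here is a minimal Python sketch of the brute-force (rejection) approach to the dice illustration. One assumption of the sketch: the event of interest counts the values 3, 5, and 2 in any order.

import random

random.seed(1)
kept = hits = 0
while kept < 100_000:
    roll = [random.randint(1, 6) for _ in range(3)]
    if sum(roll) != 10:  # most rolls are wasted: only 27 of 216 outcomes sum to 10
        continue
    kept += 1
    if sorted(roll) == [2, 3, 5]:
        hits += 1
print(hits / kept)  # approaches 6/27, about 0.222 (6 orderings of {2, 3, 5} among 27 triples)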
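And here is a sketch of the Gibbs-sampler version of the same simulation, following the recipe above: start from the arbitrary values (4, 5, 1), repeatedly reroll one randomly chosen die, force a second randomly chosen die to keep the sum at 10, and revert whenever no legal value exists.

import random

random.seed(2)
dice = [4, 5, 1]  # arbitrary starting values that sum to 10
hits = 0
steps = 200_000
for _ in range(steps):
    i, j = random.sample(range(3), 2)  # choose two dice at random
    new_i = random.randint(1, 6)       # reroll the first of the chosen pair
    k = 3 - i - j                      # index of the untouched die
    new_j = 10 - dice[k] - new_i       # force the second chosen die to preserve the sum
    if 1 <= new_j <= 6:                # if impossible, keep the previous values
        dice[i], dice[j] = new_i, new_j
    if sorted(dice) == [2, 3, 5]:
        hits += 1
print(hits / steps)  # also approaches 6/27, about 0.222, after many repetitions

Every state the chain visits satisfies the condition (the “data”), so no draws are wasted; in practice one also discards an initial “burn-in” portion of the chain, for the reason noted above about initial values.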
If this process is repeated many times, it turns out that the distribution of the simulated values will converge to the posterior distribution of the unknown parameters given the data. When a model has many parameters, it is very difficult to resample all of them simultaneously, because that requires computing the joint distribution of many parameters. The beauty of the MCMC Gibbs sampler approach is that only one unknown parameter at a time needs to be resampled.

In terms of our problem, the parameters of interest are the measures of quality at each facility. Once we have gone through one complete iteration of the Gibbs sampler (that is, one random draw from the conditional distribution of the quality parameter at each of the facilities), we can rank the facilities from highest quality to lowest and record, for each facility, whether its quality falls in some top quantile. When the entire process is complete, we can calculate for each facility the proportion of iterations in which its quality was in the top quantile. This is the probability metric for profiling facilities that we illustrate.
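Once the Gibbs sampler has produced draws of the quality parameters, computing the probability metric is simple bookkeeping over the iterations. A minimal Python sketch, assuming a hypothetical matrix of posterior draws (one row per Gibbs iteration, one column per facility; simulated here purely for illustration) and taking the top quantile to be the top decile:

import numpy as np

rng = np.random.default_rng(seed=3)
n_iter, n_fac = 5000, 50
# Hypothetical draws of facility quality; in a real analysis each row would be
# one complete iteration of the Gibbs sampler.
true_quality = rng.normal(size=n_fac)
draws = true_quality + rng.normal(scale=0.5, size=(n_iter, n_fac))

top_k = n_fac // 10  # top decile: the 5 highest-quality facilities of 50
# Within each iteration, rank the facilities (0 = highest quality) and record
# whether each facility fell in the top decile at that iteration.
ranks = (-draws).argsort(axis=1).argsort(axis=1)
in_top_decile = ranks < top_k
prob_top = in_top_decile.mean(axis=0)  # proportion of iterations in the top decile
print(prob_top.round(2))  # the probability metric, one value per facility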