Data and estimation, or "Where do probabilities come from anyway?"
Business Statistics 41000, Fall 2015

Topics

1. Empirical distributions: creating random variables from data
2. Exploratory data analysis
3. EstimaND, estimaTOR, estimaTE
4. Gauging sampling variability

Determining probabilities from data

We've seen that probability concepts allow us to describe statistical patterns and to exploit them to our benefit using the idea of expected utility maximization. But where do those probabilities come from in the first place?

The short answer: data! If we have enough of it, probabilities estimated from data are accurate. When we have only limited data, more care is needed because of sampling variability: the observed data do not always reflect the underlying probabilities accurately.

Creating a RV via random sampling

By data we mean the recorded facts of the world, which can take many forms: presence or absence data (dummy variables), categorical data (this, that, or the other), and measurements of various kinds (continuous, discrete). We've seen these types before in defining the sample space of random variables.

The empirical distribution is the probability distribution defined by randomly sampling our data (with replacement), with each observation getting equal probability.

Example: Bernoulli probability

Consider iid random variables Xi ~ Ber(p = 1/2) for i = 1, ..., 10, and consider a realization of these random variables. That is, consider the outcome of tossing a fair coin 10 times. Let's say the results look like this:

[0, 0, 1, 0, 1, 0, 0, 0, 0, 1]

The empirical distribution refers to the distribution defined by randomly sampling our data (with replacement). In this case each draw is an iid random variable Di ~ Ber(p̂ = 3/10).

Example: die rolls

Consider a three-sided fair die. In ten rolls we might get [3, 1, 1, 3, 1, 2, 3, 3, 3, 1].

True distribution X:
  x:        1     2     3
  P(X=x):   1/3   1/3   1/3

Empirical distribution D:
  d:        1     2     3
  P(D=d):   4/10  1/10  5/10

Example: milk demand

If the true probabilities are as in our earlier cafe milk ordering scenario,

  x:        1    2    3    4   5   6   7   8    9   10
  P(X=x):   4%   15%  35%  5%  5%  5%  5%  20%  3%  3%

we might actually observe, over the past 100 days, numbers like

  d:            1   2    3    4   5   6   7   8    9   10
  100·P(D=d):   5   15   36   2   7   6   6   18   2   3

Example: data from a normal distribution

The same idea holds for continuous random variables. Assume the underlying distribution of heights for NBA players is X ~ N(79, 13). We may randomly observe 10 heights, each getting probability 1/10 under the empirical distribution D:

  d: 74.66610, 74.81724, 74.82560, 75.33874, 78.54953, 78.59774, 79.06059, 79.23552, 80.24751, 83.02688
  P(D = d) = 1/10 for each observed value

Summary features of empirical distributions

Summary features of empirical distributions, like summary features of any distribution, are fixed numbers. With the data d1, d2, ..., dn in hand, we define a random variable D, and then we can compute various properties of D as we would with any other distribution.

Summary features of empirical distributions

Given a sample that defines an empirical distribution, we can compute the usual summary features of a distribution:

- mean
- median
- mode
- variance and standard deviation
- etc.

Mean of empirical distribution

(Population) Mean. The mean of a random variable X is defined as

  E(X) = Σ_{j=1}^{J} x_j P(X = x_j).

Empirical (sample) mean. The random variable D defined by a list of n observed numbers d1, d2, ..., dn has expected value

  E(D) = Σ_{j=1}^{n} d_j · (1/n) = (1/n) Σ_{j=1}^{n} d_j.
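As a quick check (an R sketch, not part of the original slides), the empirical distribution and its mean for the die-roll data can be computed directly:

```r
# Empirical distribution and empirical mean of the ten observed die rolls
d <- c(3, 1, 1, 3, 1, 2, 3, 3, 3, 1)
table(d) / length(d)   # P(D=1) = 4/10, P(D=2) = 1/10, P(D=3) = 5/10
mean(d)                # E(D) = (1/n) * sum(d) = 2.1
```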
Median of empirical distribution

(Population) Median. A random variable X has median m if

  P(X ≤ m) ≥ 1/2 and P(X ≥ m) ≥ 1/2.

Empirical (sample) median. Let D be the random variable defined by a list of n observed numbers d_(1) ≤ d_(2) ≤ ... ≤ d_(n). Any number m such that d_(a) ≤ m ≤ d_(b) is a median of D, where a = ⌊(n+1)/2⌋ and b = ⌈(n+1)/2⌉.

Median of empirical distribution

Let's unpack this a bit. The sorted list of n observed numbers, d_(1) ≤ d_(2) ≤ ... ≤ d_(n), are called order statistics; d_(j) is the jth smallest number.

So, if m is between d_(a) and d_(b), then at least a = ⌊(n+1)/2⌋ of the observations are ≤ m (namely d_(1), ..., d_(a)) and at least n − b + 1 = ⌊(n+1)/2⌋ of them are ≥ m (namely d_(b), ..., d_(n)). Therefore

  P(D ≤ m) ≥ ⌊(n+1)/2⌋ / n ≥ 1/2  and  P(D ≥ m) ≥ ⌊(n+1)/2⌋ / n ≥ 1/2.

Mode of empirical distribution

The mode is easy.

Empirical (sample) mode. Let D be the random variable defined by a list of n observed numbers d1, d2, ..., dn. The mode of D is the number (or numbers) occurring most often.

In the three-sided-die example, we observed [3, 1, 1, 3, 1, 2, 3, 3, 3, 1]. The mode of the empirical distribution is 3. The underlying distribution in this example was uniform, so each value was a mode.

Variance of empirical distribution

(Population) Variance. The variance of a random variable X with distribution p(x) is defined as

  V(X) = Σ_{j=1}^{J} (x_j − E(X))² p(x_j).

Empirical (sample) variance. Let D be the random variable defined by a list of n observed numbers d1, d2, ..., dn. The variance of D is defined as

  V(D) = (1/n) Σ_{i=1}^{n} (d_i − E(D))².

Variance of empirical distribution

Plug-in formula for empirical variance. Let D be the random variable defined by a list of n observed numbers d1, d2, ..., dn. The variance of D can be written as

  V(D) = (1/n) Σ_{i=1}^{n} d_i² − ((1/n) Σ_{i=1}^{n} d_i)² = E(D²) − E(D)².

Multivariate empirical distributions

Consider the following discrete, bivariate random variable X = (X1, X2):

  (x1, x2):               (1,1)  (2,1)  (2,2)  (17,1)
  P(X1 = x1, X2 = x2):    9/20   1/20   9/20   1/20

Multivariate empirical distributions

Here is the same information in a different form.

            X1 = 1   X1 = 2   X1 = 17
  X2 = 1    9/20     1/20     1/20
  X2 = 2    0        9/20     0

Multivariate empirical distributions

We can visualize with a bubble plot.

[Figure: bubble plot of x2 against x1, with bubble size reflecting probability.]

Multivariate empirical distributions

Consider a sample of size n = 10 drawn from this distribution:

[(17,1), (2,2), (17,1), (2,2), (2,2), (2,2), (1,1), (17,1), (2,2), (2,2)].

This defines the empirical distribution D = (D1, D2):

  (d1, d2):           (1,1)  (2,1)  (2,2)  (17,1)
  P(D = (d1, d2)):    1/10   0      6/10   3/10

Formula: correlation

Recall the formula for correlation.

Plug-in formula for correlation. The correlation between two random variables X and Y can be expressed as

  corr(X, Y) = [E(XY) − E(X)E(Y)] / (σ_X σ_Y).

Formula: correlation of empirical distribution

Let d_i refer to the ith observed pair, e.g. d_1 = (17, 1), and let d_{i,j} refer to the jth coordinate of d_i.

Plug-in formula for correlation. Let D be the random variable defined by randomly sampling n observed points (d_{1,1}, d_{1,2}), (d_{2,1}, d_{2,2}), ..., (d_{n,1}, d_{n,2}). The correlation between D1 and D2 can be expressed as

  corr(D1, D2) = [E(D1 D2) − E(D1)E(D2)] / (σ_{D1} σ_{D2})
               = [(1/n) Σ_i d_{i,1} d_{i,2} − ((1/n) Σ_i d_{i,1}) ((1/n) Σ_i d_{i,2})] / (σ_{D1} σ_{D2}).

Tool: scatter plots

[Figure: scatter plot of D2 against D1 for the ten sampled points.]

We have added some "jitter" in order to reflect the number of points at each location. We could as well have directly bubble-plotted the table for D from a few slides back.
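For this ten-point sample, the plug-in correlation can be computed directly (an R sketch, not from the original slides). It agrees with R's built-in cor(), since the 1/n versus 1/(n−1) factors cancel in the ratio:

```r
# Plug-in correlation for the n = 10 bivariate sample above
d1 <- c(17, 2, 17, 2, 2, 2, 1, 17, 2, 2)
d2 <- c( 1, 2,  1, 2, 2, 2, 1,  1, 2, 2)
num <- mean(d1 * d2) - mean(d1) * mean(d2)                          # E(D1 D2) - E(D1) E(D2)
den <- sqrt(mean(d1^2) - mean(d1)^2) * sqrt(mean(d2^2) - mean(d2)^2) # sigma_D1 * sigma_D2
num / den        # plug-in correlation
cor(d1, d2)      # same value: the normalizing constants cancel in the ratio
```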
Example: NBA height and weight

Here is the NBA (2008) height and weight data.

[Figure: NBA 2008 scatter plot of height in inches against weight in pounds.]

For continuous bivariate data, no jitter is necessary.

Example: NBA height and weight

Recall that the best linear predictor is given in terms of E(X), E(Y), σ_X = √V(X), σ_Y = √V(Y), and ρ = corr(X, Y).

[Figure: NBA 2008 scatter plot of height in inches against weight in pounds.]

Applied to the empirical distribution (see above), we can find the best linear predictor for a particular data set. How can we interpret this in terms of random sampling?

Empirical risk minimization

Notice that the best linear predictor for an empirical distribution directly minimizes the sum of squared errors:

  (1/n) Σ_{i=1}^{n} (y_i − [a + b x_i])².

This general strategy is called empirical risk minimization, which can be thought of as empirical utility maximization. It is similar to the idea of back-testing: make future decisions (actions) on the basis of which past decisions would have worked well.

Tool: histograms

With univariate empirical distributions for continuous data, each value is unique, so they all get the same 1/n weight. If we create evenly spaced "buckets", something like a density function emerges.

[Figure: NBA 2008 histogram of weight in pounds, with density on the vertical axis.]

Such a plot is called a histogram.

Empirical CDF plots

We can also visualize the empirical CDF.

[Figure: NBA 2008 empirical CDF, F(x) plotted against weight in pounds.]

This plot shows the order statistics d_(j) plotted against j/n.

Probability ideas meet data

To reiterate: the thought experiment of randomly sampling observations from our data permits us to connect the probability concepts and terminology from the first three weeks of class to whichever real-world problem we happen to be studying.

In particular, we can determine the probability of certain events or visualize certain conditional distributions; we just have to know how to ask the computer to pull them up.

Data interrogation

Now we will begin to explore some real data sets in R and Excel. We will see the following exploratory data analysis (or EDA) tools:

- pivot tables and subsetting
- boxplots
- histograms and binning
- scatterplots and time series plots
- simple trend lines

[Begin computation lab session.]

Guiding principle of statistical estimation

The guiding principle of statistical estimation is that:

Empirical distributions tend to look like the underlying distributions of the data which define them. Moreover, this likeness improves as we get more and more data. (We rarely have as much data as we would like.)

Note: the estimated quantities in our data are not the "true" values of the underlying RV!

Guiding principle

Consider some event A. In terms of A, the claim is that as the sample size gets bigger and bigger (n → ∞),

  P(Dn ∈ A) → P(X ∈ A).

In more detail,

  (1/n) Σ_{i=1}^{n} 1(x_i ∈ A) → P(X ∈ A).

We can perform simulations to support this claim.

Guiding principle in action

For example, let A = {h | 2 ≤ h ≤ 5} and Xi ~ N(2, 3²), iid for i = 1, ..., n.

We can compute P(X ∈ A) using the R command pnorm(5,2,3) - pnorm(2,2,3) and find it to be 0.341.

We can then draw x1, x2, ..., xn using the R command x <- rnorm(n,2,3).

Finally, we can compute P(Dn ∈ A) with the command mean((2 < x) & (x < 5)).

Guiding principle

- Because A was arbitrary, we can conclude that essentially any feature of the distribution of Dn will have to look like the corresponding feature of the X distribution.
- Every time you perform the demo above, you get a slightly different answer for P(Dn ∈ A).
- By the Law of Large Numbers (from last lecture), as n gets bigger, P(Dn ∈ A) gets closer and closer to P(X ∈ A).

To see this last point, compute the mean and variance of

  (1/n) Σ_{i=1}^{n} 1(Xi ∈ A).

Seriously, like right now. Do it.
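One way to assemble the commands above into a demo (an R sketch, not from the original slides):

```r
# How close is P(Dn in A) to P(X in A) as n grows?
p_true <- pnorm(5, 2, 3) - pnorm(2, 2, 3)        # P(X in A), about 0.341
for (n in c(10, 100, 10000)) {
  x <- rnorm(n, 2, 3)
  cat("n =", n, "  P(Dn in A) =", mean((2 < x) & (x < 5)), "\n")
}
# The indicator 1(Xi in A) is Ber(p_true), so the average has mean p_true
# and variance p_true * (1 - p_true) / n, which shrinks to 0 as n grows.
```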
Terminology: estimand, estimator, estimate

Suppose we are interested in a particular feature of an unknown distribution. It could be a common summary statistic, such as the mean or the variance of the distribution. Perhaps we are interested in the probability of a particular event. We may be interested in the optimal action under some problem-specific utility function (such as our "milk demand" example).

In each case, any quantity of interest of an unknown probability distribution is referred to as an estimand.

Estimand, estimator, estimate

Next, we proceed to use data to figure out what the estimand is. To do this, we come up with a recipe for taking observed data and spitting out our best guess as to what the estimand is. This recipe itself is called an estimator.

Note that this recipe defines a random variable, induced by the data generating process: estimators are random variables!

Estimand, estimator, estimate

Finally, after we observe actual data, we can apply the estimator recipe to it, to obtain an actual number. This number, our post-data guess as to the value of the unknown estimand, is called our estimate. Estimates are fixed numbers based on the observed data.

Estimand, estimator, estimate

An estimand is a fixed but unknown property of a probability distribution. It is the thing we want to estimate.

An estimator is a recipe for taking observed data and formulating a guess about the unknown value of the estimand.

An estimate is a specific value obtained when the estimator is applied to specific observed data.

Example: population mean E(X); sample mean X̄ (estimator); sample mean x̄ (estimate)

Suppose we are interested in E(X), the mean of a random variable X. This is sometimes referred to as the population mean, in reference to polling problems.

The standard estimator of the population mean is the mean of the observed data, or the sample mean. Before any data is observed, this defines our estimator, commonly denoted X̄ ≡ (1/n) Σ_{i=1}^{n} Xi.

Once the data has been observed, we have an estimate in hand, denoted x̄ ≡ (1/n) Σ_{i=1}^{n} xi. (Note the lower-case x's here.) This is the observed sample mean.

Gauging sampling variability

Now we will look at some examples of sampling variability:

1. fair coin or biased coin?
2. milk demand (aggregate daily demand)
3. best linear predictor of NBA height, given weight

Example: fair or biased coin?

Suppose a partner at your firm always tosses a coin to see if you or he pays for your weekly Thursday lunch meeting. After ten lunches, you've had to pay eight times. We can denote this by saying that p̂ = 8/10.

Should you accuse him of using a loaded coin?

(This "hat" notation, p̂, indicates an estimate of the corresponding estimand p.)

Example: fair or biased coin? (cont'd)

First, let's approach this question mathematically: how much do we trust our estimate p̂ = 8/10 based on a sample size of 10?

First, we compute the sampling distribution of our estimator p̂ ≡ (1/n) Σ_{i=1}^{n} Xi, where each Xi ~ Ber(p), for some unknown p (our estimand).

Specifically, we learned last week that a sum of n independent Bernoulli RVs has a binomial distribution, so np̂ ~ Bin(n, p), for whatever the actual value of p is.

Notice that the sampling distribution depends on the unknown estimand E(X) = p. That's annoying, but typical.
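Since np̂ ~ Bin(n, p), the sampling distribution of p̂ is fully determined once a value of p is plugged in. A brief R sketch (the two candidate values of p are illustrative, not from the slides):

```r
# Sampling distribution of n * p-hat for n = 10 tosses, at two candidate values of p
n <- 10
for (p in c(0.5, 0.8)) {
  cat("p =", p, ": P(n * p-hat = k) for k = 0, ..., 10\n")
  print(round(dbinom(0:n, n, p), 3))   # exact Bin(n, p) probabilities
}
```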
Example: fair or biased coin? (cont'd)

However, because our null hypothesis was that the coin is fair, we can use p = 1/2 as a benchmark. Under the null hypothesis, the observed number of "successes" (call it Y) is a draw from a Bin(10, 1/2) distribution.

Accordingly, we calculate P(Y ≤ 7) = 0.945, so P(Y ≥ 8) = 5.5%: if you accuse your senior colleague whenever you see data like the data you observed, you'd wrongly accuse him of being a scoundrel 5.5% of the time.

Thus 5.5% is the p-value of your data: the probability, under the null hypothesis, of seeing data as or more extreme than what you actually witnessed.

Example: fair or biased coin? (cont'd)

An important caveat: for any fixed data set, either the guy is cheating or he isn't. But the rule itself brings with it long-run guarantees. The 5.5% probability of false accusations is a statement about the rule itself, not about the coin on any particular occasion or even on average.

IMO, unless you plan to accuse a series of bosses of shenanigans, this sort of guarantee isn't so very useful. Nonetheless, this approach of developing decision rules that work well averaged over many different data sets (the "frequentist" approach) is standard. We will discuss it in detail next week.

Predicting milk demand

Recall our milk demand random variable:

  x:        1    2    3    4   5   6   7   8    9   10
  P(X=x):   4%   15%  35%  5%  5%  5%  5%  20%  3%  3%

And our utility function is

  u(a, x) = −$5 · (a − x)   if a > x,
  u(a, x) = −$35 · (x − a)  if x > a,

where the action a is the number of gallons we order and the "state" x is the amount of milk required.

If we based our calculation on the past 3 weeks of data, how often would we make the wrong decision?

Milk order sampling variation

We can investigate this question via simulation.

[Figure: bar plot of simulation results over the order quantities 1 to 10 (vertical axis: proportion).]

Most of the time we get it right. Sometimes we over-order and sometimes we under-order (less often).

Example: NBA heights and weights

An easy way to approximate the sampling distribution is simply to re-sample your data (with replacement), compute your estimate, and repeat.

[Figure: histogram (frequency) of the resulting estimates of the slope β, ranging from roughly 0.100 to 0.115.]

This technique is called bootstrapping.

Example: NBA heights and weights

[Figure: NBA 2008 scatter plot of height in inches against weight in pounds.]

Example: NBA heights and weights

Here is the code to produce the bootstrap samples.
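A minimal sketch of such bootstrap code in R (the data frame name nba and the column names weight and height are assumptions for illustration, not taken from the slides):

```r
# Bootstrap the slope of the best linear predictor of height given weight
B <- 1000                                    # number of bootstrap samples
beta_boot <- numeric(B)
for (b in 1:B) {
  idx <- sample(nrow(nba), replace = TRUE)   # resample rows with replacement
  fit <- lm(height ~ weight, data = nba[idx, ])
  beta_boot[b] <- coef(fit)["weight"]        # slope estimate for this resample
}
hist(beta_boot, xlab = expression(beta))     # approximate sampling distribution of the slope
```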