ECON 309 Lecture 1: The Basics

I. Descriptive versus Inferential Statistics

Descriptive: statistics that summarize the characteristics of given data, without trying to extrapolate or make predictions.

Inferential: statistics used to make claims or predictions about the larger population based on a subset (sample) of that population.

The distinction between descriptive and inferential statistics is closely related to the distinction between populations and samples:

Population: the entire set of all possible outcomes or measurements of interest.

Sample: a subset of the population for which we have data, and that we hope is representative of the population.

Descriptive statistics usually deal with a sample, but sometimes they deal with a whole population. For instance, we could calculate descriptive statistics using Census data, which purports to document the entire population of the U.S. (In reality, even the Census doesn’t document everyone, so it’s really a sample that is very close to being the whole population.) Inferential statistics apply when we have a sample and we’re using it to make claims or predictions (inferences) about the whole population. In a future lecture, we’ll discuss some problems that arise from trying to get a sample that is sufficiently representative of the population.

A note on the meaning of “population”: it’s easy to be misled by the name into thinking a population must be a group of people, or perhaps other organisms. The population can be all kinds of things: all baseballs manufactured in the U.S. (or by a particular company), all restaurants in a metropolitan area, all volcanoes in the world, etc. Moreover, the population may not even in principle be something that we could count or collect. For instance, suppose we want to know something about the likelihood of a baseball-making machine producing defective baseballs. The population is not just all baseballs the machine has ever produced, but every baseball the machine will or ever could produce. Even if we take all baseballs the machine has ever produced, they are just a sample of all the baseballs the machine could produce. Similarly, we might want to know something about the properties of a pair of dice. The population is all the throws of the dice that could possibly be made, which is infinite (you can keep on throwing the dice). Any number of actual dice throws is just a sample. So in the real world, we are usually dealing with samples.

II. Types of Data

Primary versus secondary. Data is primary if it has been collected by the same person or entity that is using it. Data is secondary if the user is not the person or entity who collected it. The book says information from the Census, Bureau of Labor Statistics, Dept. of Commerce, etc., is secondary. Well, that’s true if you use it. If they (that is, employees of the Census Bureau) use it, it’s primary.

Why is this important? Because you have control over how your primary data is collected. You decide the methodology (how the data will be collected). You decide how many observations to collect, for example. With secondary data, you are stuck with the methods used by whoever collected it.

What are the ways in which primary data is collected?

1. Direct observation – that is, just watching and measuring. A naturalist watches squirrels to see how many nuts they collect. A business collects information on who is buying their product and how much.
(The book mentions focus groups as an example of direct observation, but it really depends on how the focus group is conducted. If the group is just being watched and recorded, that’s direct observation. If the group is being asked specific questions, that’s more like a survey – see below.) In economics, we collect measurements of many economic variables of interest, like prices and quantities sold, by direct observation.

2. Experiments – subjects are divided into treatment groups and control groups to measure the difference between them after some kind of treatment is given to the former group. This is very common in medical testing. It’s much harder to do in economics and business, but it does happen! Experimental economists get students in lab settings and expose them to different incentives (such as playing games for money) to see what they’ll do. Economists also try to find “natural experiments,” which are on the borderline between direct observation and experiment. These occur when some kind of policy change or other event automatically affects one group of subjects but not another (e.g., when New Jersey raises its minimum wage but Pennsylvania does not). The difficulty is making sure the treatment group and control group are sufficiently similar in all other respects.

3. Surveys – asking people questions. More on these in a future lecture. The main thing to remember about surveys: the information you’re getting is the truth about how people respond to questions, but not necessarily the truth about the actual content of the question. If you ask smokers if they want to quit, and 67% say yes, that doesn’t mean 67% of smokers really want to quit – it means 67% say they want to quit when asked by a surveyor.

We can also divide data based on the sort of information it contains, and how easily numbers can be attached to it.

1. Nominal data. This is when the data fits a qualitative category. We can say whether someone’s a man or a woman, a registered Republican or Democrat, or something else. Any numbers assigned are arbitrary (e.g., we could code “1” for man and “0” for woman, but the reverse would have made just as much sense).

2. Ordinal data. This is when numbers are used to represent an ordering or ranking of items, but the differences between the numbers don’t mean anything. E.g., if asked to rank McDonald’s, BK, and Wendy’s, a consumer might say: 1. BK, 2. Wendy’s, 3. McDonald’s. That means this consumer thinks BK is better than both Wendy’s and McDonald’s. But it doesn’t mean, say, that BK is three times better than McDonald’s, or that the difference in quality between BK and Wendy’s is the same as the difference in quality between Wendy’s and McDonald’s.

3. Interval data. This is numerical data in which differences have meaning, but ratios don’t. Consider temperature in Fahrenheit. The difference between 40 degrees and 30 degrees is 10, and that’s the same as the difference between 50 degrees and 40 degrees. But is 40 degrees twice as hot as 20? No, that doesn’t follow. For that to make sense, there must be a true “zero” point. We do designate one point as zero degrees, but we could just as easily pick another point (as indicated by the difference between Fahrenheit and Celsius); see the short sketch after this list.

4. Ratio data. This is numerical data in which both differences and ratios have meaning. Prices are a nice example. $10 minus $6 is $4, and $20 minus $16 is $4; those $4’s are the same. And $20 is twice as much as $10; there is a true zero point, a price of zero.
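To make the interval-data point concrete, here is a minimal sketch (not from the lecture; it just reuses the 40-degree and 20-degree example above) showing that the “twice as hot” ratio is not preserved when the same temperatures are re-expressed in Celsius, because the zero point is arbitrary:

```python
# Ratios of interval data are not meaningful: re-expressing the same two
# temperatures on another scale changes the ratio entirely.
def fahrenheit_to_celsius(f):
    return (f - 32) * 5 / 9

print(40 / 20)                                                # 2.0 -- "twice as hot"?
print(fahrenheit_to_celsius(40) / fahrenheit_to_celsius(20))  # about -0.67 -- the ratio is not preserved
```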
In economics, we will also often categorize data by how it relates to time.

1. Cross-sectional data. In cross-sectional data, all observations come from the same point in time. The observations typically correspond to individuals or groups like states or countries. For instance, a survey of Americans on who they support in the upcoming presidential election is cross-sectional data. So is a data set with the homicide rate for each state in a single year.

2. Longitudinal or time-series data. In longitudinal or time-series data, each data point corresponds to a particular point in time – usually for a single individual or group. For instance, if you recorded your income every day for a year, that would give you a longitudinal data set. The GDP of the U.S. from 1945 to the present is also a longitudinal data set.

3. Panel data. Panel data is both cross-sectional and longitudinal. It involves getting cross-sectional data for many time periods (or, alternatively, time-series data for many different individuals or groups). For instance, if you recorded the income of each of your classmates every year for the next 20 years, that would be a panel data set. One way to think of this is in terms of dimensions: both cross-sectional and time-series data are one-dimensional; panel data is two-dimensional.

III. Measures of Central Tendency

A measure of central tendency is meant to tell us the “center” of a data set or population. What do we mean by “center”? That’s an inherently vague question. We might mean a typical value, or the most common value, or a value that’s in the middle… We need to be more specific.

Mean

The mean is what we usually think of as the average (although “average” can be used to refer to other measures of central tendency as well). For a sample, the mean is the sum of all observations divided by the number of observations. Here is the formula for the sample mean:

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]

For the mean of a population, in principle we do the same thing: add up the values of all possible observations, and divide by the number of possible observations. But what if there is no maximum number of observations? Consider rolling a standard 6-sided die. We could roll it an infinite number of times, so any finite number of rolls is only a sample. What is the population mean? You have to take each possible value for an outcome (in this case, one through six), multiply by its frequency as a fraction of all outcomes, and add up the results. Here is the formula for the population mean:

\[ \mu = \sum_{x \in X} x f(x) \]

(The expression f(x) means the frequency of the value x in the population; it is a number between 0 and 1.) We use the Greek letter mu (μ) to stand for the population mean, which we sometimes call the “true” mean. For the throw of a 6-sided die, the population mean is 1*(1/6) + 2*(1/6) + 3*(1/6) + 4*(1/6) + 5*(1/6) + 6*(1/6) = 3.5. In most cases, we don’t actually know the population mean, so we try to estimate it with the sample mean.

The technique we just used to find the population mean is also useful when you’re given sample data in the form of a frequency table, with each value that occurred alongside the frequency with which it occurred. E.g.,

Answers to “How Many Times Did You Use the Restroom Today?”

Answer        0   1   2   3   4   5
# Subjects    1   3   4   8   5   3

We take each possible answer and multiply by its frequency in the sample. The sample mean here is [0(1) + 1(3) + 2(4) + 3(8) + 4(5) + 5(3)]/24 = 2.92. (Notice that this is the same as finding the fractional frequency of each value as # Subjects/24, multiplying by the answer value, and then adding them up.)
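As a quick check on the two calculations above, here is a minimal sketch (not part of the lecture notes) that computes the population mean of a fair six-sided die and the sample mean from the restroom frequency table:

```python
# Population mean of a fair six-sided die: each outcome 1..6 has frequency 1/6.
mu = sum(x * (1 / 6) for x in range(1, 7))
print(mu)  # 3.5

# Sample mean from the frequency table: multiply each answer by the number of
# subjects who gave it, then divide by the total number of subjects.
answers  = [0, 1, 2, 3, 4, 5]
subjects = [1, 3, 4, 8, 5, 3]
n = sum(subjects)                                         # 24 respondents
mean = sum(a * s for a, s in zip(answers, subjects)) / n
print(round(mean, 2))                                     # 2.92
```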
Median

The median is the value of the observation exactly in the middle of the sample or population, such that half the observations have a higher value and half have a lower value. (If there is an even number of observations, then there is no true middle value, so the median is defined as the mean of the two middle values.) In the restroom example, the median is 3 (because there are eight observations higher than 3 and eight observations lower, and the middle eight observations are all the same). What would happen if we added three more people who went to the restroom 6 times each? The total number of observations would be 27, so we’d be looking for the 14th highest (and 14th lowest) observation. It’s still 3!

Mode

The mode is the most common outcome in the sample or population. For example, the modal sex in the United States is female (because there are more women than men). The modal race in the United States is white (because there are more whites than there are of any one other race). Note that this latter example can remain correct even when whites are no longer the majority race, so long as whites are still the single largest group; it would just mean that people of all the other races together outnumber whites. Mode means the single most common outcome, not necessarily the majority outcome. (We sometimes use the word plurality to mean the most common outcome, which may or may not be the majority outcome. For instance, Clinton won the presidency in 1992 with 43% of the popular vote; this was the plurality, greater than Bush’s 37.5% and Perot’s 18.9%, but it was not the majority. The mode and the plurality are essentially the same, although mode is the preferred term in statistics.)

These examples involve nominal data, and that’s where the mode is most often useful. But it can be used with numbers, too. In the restroom frequency table above, the mode is 3. In this example, the possible outcomes are numerical, but they are also discrete, meaning they take on a countable number of different values. You can’t use the bathroom one-half a time! The mode makes less sense with characteristics that are not discrete, but are instead continuous, meaning the variable can take on an uncountable number of different values. Consider height. If you measure height precisely enough, it’s difficult to find anyone who is exactly any height you specify in advance – e.g., 6’0”. Everyone you find will be just slightly above or slightly below it. The frequency of any precisely defined height is approximately zero! So to define the mode in cases like this, you need to establish intervals (such as inches of height, which actually includes an interval of heights rounded off to the nearest whole inch). If you have to create intervals to find the mode, the mode obviously depends on how you’ve defined your intervals. For instance, we could choose half-inch intervals instead of whole-inch intervals for measuring height; our answer for the mode will be a half-inch interval instead of a whole-inch interval. If your intervals are not of equal size, then the mode can be difficult to find, and in some cases is actually meaningless. We’ll see an example of this in a future lecture.

Much confusion results from people not truly understanding these three different measures of central tendency and how they can differ. We will return to this in a future lecture.
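Here is another minimal sketch (again, not from the lecture itself) verifying the median and mode claims for the restroom data, including the point that adding three answers of 6 leaves the median unchanged:

```python
from statistics import median, mode

answers  = [0, 1, 2, 3, 4, 5]
subjects = [1, 3, 4, 8, 5, 3]

# Expand the frequency table into the raw list of 24 observations.
data = [a for a, s in zip(answers, subjects) for _ in range(s)]

print(median(data))  # 3.0 -- even number of observations, so the mean of the two middle values (both 3)
print(mode(data))    # 3   -- the single most common answer

# Adding three people who answered 6 raises the count to 27 but leaves the median at 3.
print(median(data + [6, 6, 6]))  # 3
```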
IV. Variance and Standard Deviation

Variance: a measure of dispersion in which observations are weighted by the square of their distance from the mean, as given by the following formula:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]

Note that this gives greater weight to an observation the further it is from the mean. For example, suppose the mean is zero. An observation of 4 (or -4) would be weighted four times as much as an observation of 2 (or -2), despite being only twice as far from the mean.

There is an equivalent formula for the sample variance that is often easier to use:

\[ s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1} \]

Recall that the mean could be calculated for both the sample and the population. The same is true of the variance. The formula for the population variance is:

\[ \sigma^2 = \sum_{x \in X} (x - \mu)^2 f(x) \]

All you’re doing here is taking each possible value of x, finding its squared difference from the mean (just as with the sample variance), and weighting it by its frequency in the population. The population variance for throws of a six-sided die is:

(1/6)[(1 – 3.5)² + (2 – 3.5)² + (3 – 3.5)² + (4 – 3.5)² + (5 – 3.5)² + (6 – 3.5)²] = (1/6)[6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25] = 2.92

You might wonder why, for the sample variance, we divide by (n – 1) instead of just n. If we wanted to just weight each observation’s squared difference from the mean by the frequency of that observation, we would divide by n. So why use (n – 1) instead? The answer is complicated, but here’s the simple version: in calculating the population variance, we take into account every possible value and then weight that value by its true frequency. But when you take a sample, you probably won’t get a sample that represents every possible value, including the most unlikely ones. For instance, if you take a sample of heights, you probably won’t get a sample that includes men who are 8’4” or 3’6”, even though such men exist in the population. As a result, your sample understates the amount of variation in the population. Dividing by (n – 1) instead of n corrects for this underestimation.

Standard deviation: the square root of the variance (for both the sample and the population).

\[ s = \sqrt{s^2} \qquad \sigma = \sqrt{\sigma^2} \]

V. Quartiles, Quintiles, Deciles, etc.

We use these measures when we want to divide a group or population into a number of equal-sized subgroups. Quartiles are four equal-sized subgroups; quintiles are five equal-sized subgroups; deciles are ten equal-sized subgroups. (There are others, but these are the most common.) Note that “equal-sized” is with respect to the number of observations or members in each subgroup, not the size of the group’s interval. For instance, households are often divided into income quintiles. The top quintile includes the 20% of households that have the highest annual income. The bottom quintile includes the 20% of households that have the lowest annual income. These quintiles include equal numbers of households, but they will not correspond to the same size intervals of incomes.

Percentiles and other “X-iles” are used for various purposes, but most often in economics for dividing the population into income groups. This can be useful for getting a sense of the dispersion of incomes in the economy. But to see how they can be misleading, notice that the dividing lines are much like the median: they can be invariant to changes on either side of them. E.g., people at the top of the top quintile, or the bottom of the bottom quintile, could get richer or poorer without affecting the quintile dividing lines.
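To wrap up, here is a minimal sketch (not part of the original notes; every data value in it is made up purely for illustration) that checks the two sample-variance formulas against each other, reproduces the population variance of the die from section IV, and prints quintile cut points for a small list of hypothetical household incomes:

```python
from statistics import quantiles

def sample_variance(xs):
    """Definitional formula: squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Equivalent shortcut formula: (sum of x^2 minus (sum of x)^2 / n), divided by n - 1."""
    n = len(xs)
    return (sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1)

data = [2, 4, 4, 7, 9]                 # hypothetical sample
print(sample_variance(data))           # 7.7
print(sample_variance_shortcut(data))  # 7.7 -- the two formulas agree
print(sample_variance(data) ** 0.5)    # about 2.77 -- the sample standard deviation

# Population variance of a fair six-sided die: weight each squared deviation by 1/6.
mu = sum(x * (1 / 6) for x in range(1, 7))                  # 3.5
sigma2 = sum((x - mu) ** 2 * (1 / 6) for x in range(1, 7))  # about 2.92
print(mu, sigma2)

# Quintile cut points for hypothetical household incomes (in $1000s): four
# dividing lines separate the five equal-sized groups of households.
incomes = [18, 22, 25, 31, 34, 40, 47, 52, 60, 75, 90, 120, 150, 200, 310]
print(quantiles(incomes, n=5))
```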