ECON 3318 – Data Analysis and Visualization
David Quigley
Statistics and Probability Review – Aug. 28

Population vs. Sample
• Population – the entire group of interest for a particular research question
• Example: Do tariffs help domestic steel companies?
• Population = all domestic steel companies
• Example: Do tax cuts change what consumers buy?
• Population = all consumers
• While it may be possible to survey all domestic steel companies, it would be prohibitively expensive to survey all consumers about their purchases
• The U.S. Census surveys the entire U.S. population, but because of the cost and effort involved, most surveys reach only a sample of the desired population
• Sample – a subset of the population that can be used to make inferences about the entire population
• Example: What do Americans watch on TV? It would be too expensive to survey all TV-watching Americans, but a representative sample can give us a good estimate of which shows are popular
• The larger the sample is as a fraction of the population, the more likely it is that the inferences we draw from the sample accurately reflect the entire population
• Example: The larger the sample of voters in a poll, the more likely the poll will be close to the actual results of the election
• Statistics is about using the sample to make guesses about the population and quantifying how likely those guesses are to be accurate

Statistical Inference
• Statistical Inference – the process of using statistics from the sample as guesses about the population of interest and assessing how likely those statistics are to be accurate
• Parameter – a relationship that exists in the population
• Example: For every $1 of taxes cut, a consumer increases spending by $0.70 on average.
• Statistic – a relationship that exists in the sample, often used as a guess for a parameter
• Example: In the sample of consumers used, a $1 cut in taxes resulted in those consumers increasing spending by $0.71 on average.
• The statistic is close to the population parameter, but isn’t 100% accurate because we didn’t observe the spending behavior of the entire population of consumers
• However, the statistic is likely close enough, given the additional cost that surveying the entire population of consumers would have entailed
• Furthermore, a different sample is likely to have a slightly different statistic
• Example: In a different sample of consumers, a $1 cut in taxes resulted in those consumers increasing spending by $0.68 on average.
• Therefore, a statistic has an element of randomness to it, since the sample has an element of randomness in it
• A parameter is NOT random, since it covers the entire population, but how close is our statistic?
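The idea that different samples give slightly different statistics while the parameter stays fixed can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population below is synthetic (10,000 hypothetical consumers whose spending response is drawn around $0.70 with an arbitrary spread of $0.20), and the function name `sample_means` and the sample sizes are illustrative choices, not part of the course material.

```python
import random
import statistics


def sample_means(population, sample_size, num_samples, seed=0):
    """Draw num_samples random samples (without replacement) from the
    population and return the mean of each sample."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


# Hypothetical population (an assumption for illustration only):
# spending increase per $1 of tax cut for 10,000 consumers.
pop_rng = random.Random(42)
population = [pop_rng.gauss(0.70, 0.20) for _ in range(10_000)]

# The parameter is computed from the whole population, so it is not random.
parameter = statistics.fmean(population)

# The statistic varies from sample to sample.
small = sample_means(population, sample_size=50, num_samples=200, seed=1)
large = sample_means(population, sample_size=500, num_samples=200, seed=1)

print(f"Population mean (parameter): {parameter:.3f}")
print(f"First three sample means, n=50:  {[round(m, 3) for m in small[:3]]}")
print(f"Spread of sample means, n=50:  {statistics.stdev(small):.4f}")
print(f"Spread of sample means, n=500: {statistics.stdev(large):.4f}")
```

Running it should show sample means scattered around the population mean, with less scatter for the larger sample size. The exact numbers depend on the seeds chosen here.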
Sampling Distribution
• Sampling Distribution – a frequency distribution for a sample statistic over all possible samples of a given size
• Example: The Normal Distribution
• The Normal Distribution is so named because many commonly used statistics are thought to follow it, at least approximately
• A higher value of the Sampling Distribution indicates that statistics near those values are more likely, i.e. they would occur more frequently if you had all possible samples of the given size
• A lower value of the Sampling Distribution indicates that while those values of the statistic are possible for some samples of the given size, those values are unlikely, i.e. those samples are rare
• Consequently, the Sampling Distribution gives you some idea of how far off your statistic is likely to be

Summary Statistics
• Summary Statistics include statistics as simple as the minimum value and the maximum value in the data
• Before conducting any complicated analysis of the data, it’s always a good idea to better understand the data first through Summary Statistics and Data Visualization
• Let’s go through some math and notation necessary for this course
• Suppose you have data on credit card users
• The size of the sample is denoted by \(n\), so \(n\) = number of observations
• The value of a particular observation is denoted as follows: for the 1st person in the sample, the person’s credit card balance is \(x_1\); for the 2nd person, \(x_2\); for the 3rd person, \(x_3\); and so on, up to the last observation, \(x_n\)
• The Sample Mean or Average is denoted by \(\bar{x}\)
• Therefore, \(\bar{x} = \frac{1}{n}\left(x_1 + x_2 + x_3 + \dots + x_n\right)\)
• To save space on the summation, the following notation is used: \(x_1 + x_2 + x_3 + \dots + x_n = \sum_{i=1}^{n} x_i\)
• Therefore, \(\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i\)
• Remember that since the Sample Mean depends on
the sample, it will be different for different samples
• However, the Population Mean or Average does NOT depend on the sample, since by definition the population covers everything
• The Population Mean or Average is denoted by µ
• It follows that the Sample Mean has a Sampling Distribution and will be closer to or farther from the Population Mean depending on the sample
• Sample Median – the value in the data such that half the data has a larger value and half the data has a smaller value
• Depending on the data, it is typically the case that the Sample Mean ≠ the Sample Median
• A sample where the Sample Mean > the Sample Median is called Right-skewed
• A sample where the Sample Mean < the Sample Median is called Left-skewed
[Figure: Right-Skewed Distribution]
[Figure: Left-Skewed Distribution]
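The sample mean formula and the mean-versus-median comparison above can be tried on a few numbers. The sketch below is illustrative only: the credit card balances are made-up values, and `describe_skew` is a hypothetical helper name. It applies the rule of thumb from the slides (mean above the median suggests right skew, mean below suggests left skew) and nothing more formal than that.

```python
import statistics


def describe_skew(values):
    """Return (mean, median, label) using the mean-vs-median rule of thumb."""
    n = len(values)
    mean = sum(values) / n              # x-bar = (1/n) * (x_1 + ... + x_n)
    median = statistics.median(values)  # middle value of the sorted data
    if mean > median:
        label = "right-skewed"
    elif mean < median:
        label = "left-skewed"
    else:
        label = "no skew indicated by this rule"
    return mean, median, label


# Made-up balances (assumption): one very large balance pulls the mean up.
balances = [120, 250, 300, 410, 4920]
print(describe_skew(balances))      # (1200.0, 300, 'right-skewed')

# Made-up balances (assumption): one very small balance pulls the mean down.
low_outlier = [10, 900, 950, 1000, 1140]
print(describe_skew(low_outlier))   # (800.0, 950, 'left-skewed')
```

In both lists a single extreme value moves the mean a long way while the median barely changes, which is why comparing the two gives a quick first look at the shape of the data.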