Daniel C. Tracht ECON 15A Notes 2

What Do We Study?

We want to know about a parameter in a population, so we look at a sample of the population and compute a statistic that will answer our question. The statistic can be used to describe the sample or to make inferences about the population.

In our sample, we have observations, which carry information in variables. The observations can be grouped by the categories within a variable. Often we want to know the relationship between variables. The affected variable is the dependent variable, while the affecting variable is the independent variable.

Every variable can be treated as a set of categories, which gives nominal groupings. Sometimes there is a clear order to the categories (ordinal data). When distance is meaningful, we can find the interval between two observations (interval data). If there is also a meaningful zero, then we can take a ratio of the values (ratio data).

Averages

For nominal data, we can only calculate the mode, the most common category. With ordinal data, we can also calculate the median, the middle observation. With interval/ratio data, we can also calculate a mean. The most common mean is the arithmetic mean, but there are also the harmonic, geometric, and quadratic means. For formulae and relative sizes, see Table 1.

  Harmonic (smallest)    $\left(\frac{1}{n}\sum_i x_i^{-1}\right)^{-1}$
  Geometric              $\left(\prod_i x_i\right)^{1/n}$
  Arithmetic             $\frac{1}{n}\sum_i x_i$
  Quadratic (biggest)    $\left(\frac{1}{n}\sum_i x_i^2\right)^{1/2}$
Table 1. The Means

Moments and Standardization

There are raw and central moments. The b-th raw moment is defined as $m'_b = \frac{1}{n}\sum_i x_i^b$. The b-th central moment is defined as $m_b = \frac{1}{n}\sum_i (x_i - \mu)^b$. The second, third, and fourth central moments correspond to the variance, skewness, and kurtosis, respectively (skewness and kurtosis are usually reported in standardized form). As the order of the central moment increases, more weight is placed on observations far from the mean.

To give comparable units, the moments are often standardized. The most common standardization is the Z-score. The Z-score of an observation is $z_i = \frac{1}{\sigma}(x_i - \mu)$. This is a unitless number that specifies how many standard deviations from the mean the particular observation is.
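The four means, the central moments, and the Z-score described above can be sketched in a few lines of Python. This is only an illustration: the sample values and the function names are made up for this example, the data are assumed to be strictly positive (as the harmonic and geometric means require), and `math.prod` needs Python 3.8 or later.

```python
import math

def harmonic_mean(xs):
    # (1/n * sum of reciprocals)^(-1), written as n / sum(1/x)
    return len(xs) / sum(1 / x for x in xs)

def geometric_mean(xs):
    # n-th root of the product of the observations
    return math.prod(xs) ** (1 / len(xs))

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def quadratic_mean(xs):
    # square root of the average squared observation
    return math.sqrt(sum(x * x for x in xs) / len(xs))

def central_moment(xs, b):
    # b-th central moment, treating xs as the whole population
    mu = arithmetic_mean(xs)
    return sum((x - mu) ** b for x in xs) / len(xs)

def z_scores(xs):
    # (x - mu) / sigma for every observation
    mu = arithmetic_mean(xs)
    sigma = math.sqrt(central_moment(xs, 2))
    return [(x - mu) / sigma for x in xs]

# Hypothetical data, chosen only so the arithmetic is easy to follow.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print("harmonic  ", harmonic_mean(data))
print("geometric ", geometric_mean(data))
print("arithmetic", arithmetic_mean(data))
print("quadratic ", quadratic_mean(data))
print("z-scores  ", z_scores(data))
```

For any sample of positive values, the four printed means should appear in non-decreasing order from harmonic to quadratic, matching the relative sizes the notes describe.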
To standardize the b-th moment, the moment is divided by $\sigma^b$ to rid it of its units. For example, if the standardized skew is positive, then the distribution is skewed to the right, and the larger the standardized skew, the more skewed the distribution is.

Graphs

To visualize data, we use graphs. Depending on the type of data, we have different options. The types are summarized in Table 2. The key point is that area is meaningful in all of these. In box plots, histograms, and density plots, distance is also meaningful. Histograms have buckets of equal width, while density plots can have buckets of varying widths; in both, area is still meaningful.

Dispersion

In addition to the mean, we also need to know how spread out the data is. For nominal data, there are many ways of calculating an index of qualitative variation. If there is order, we can calculate quartiles, the medians of the data above and below the median. If we have interval data, we can calculate an interquartile range using $\mathrm{IQR} = Q_3 - Q_1$.

With cardinal data, the first thing to calculate is the deviation, which is simply $x_i - \mu$. Since deviations can be positive or negative, the standard approaches are to either take the absolute value or square them. After taking the absolute value, we take the arithmetic mean to find the Average Absolute Deviation: $\mathrm{AAD} = \frac{1}{n}\sum_i |x_i - \mu|$. If we square them instead, the sum is known as the sum of squared deviations. Taking the average of the squared deviations gives the variance, and taking the square root of the variance gives the standard deviation. See Table 3 for information about these.

To estimate these parameters in the population, we need to use the statistics from the sample. However, since we have already used our data to estimate the mean, we must use Bessel's correction (dividing by $n - 1$ instead of $n$) to account for the decrease in degrees of freedom and get better estimates of the population parameters.
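The dispersion measures described above can be sketched in Python as follows. This is a rough illustration under stated assumptions: the sample values and function names are invented for the example, the sample mean stands in for $\mu$, and when the number of observations is odd the median itself is left out of both halves when finding the quartiles (one of several conventions in use).

```python
import math
import statistics

def quartiles(xs):
    # Q1 and Q3 as the medians of the lower and upper halves of the data.
    # Assumption: for odd n, the middle observation is excluded from both halves.
    s = sorted(xs)
    half = len(s) // 2
    lower = s[:half]
    upper = s[half + 1:] if len(s) % 2 else s[half:]
    return statistics.median(lower), statistics.median(upper)

def average_absolute_deviation(xs):
    # Mean of |x - mean|, with the sample mean standing in for mu.
    mean = sum(xs) / len(xs)
    return sum(abs(x - mean) for x in xs) / len(xs)

def variance(xs, corrected=False):
    # Sum of squared deviations divided by n, or by n - 1 with Bessel's correction.
    n = len(xs)
    mean = sum(xs) / n
    ssd = sum((x - mean) ** 2 for x in xs)
    return ssd / (n - 1) if corrected else ssd / n

# Hypothetical sample, chosen only so the arithmetic is easy to follow.
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

q1, q3 = quartiles(sample)
print("IQR            ", q3 - q1)
print("AAD            ", average_absolute_deviation(sample))
print("uncorrected var", variance(sample))
print("corrected var  ", variance(sample, corrected=True))
print("corrected SD   ", math.sqrt(variance(sample, corrected=True)))
```

The corrected variance is always a little larger than the uncorrected one, and the gap shrinks as the sample grows, since dividing by $n - 1$ and by $n$ become nearly the same.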
               Variance                                                        SD
  Population   $\sigma^2 = \frac{1}{n}\sum_i (x_i - \mu)^2$                    $\sigma = \sqrt{\sigma^2}$
  Uncorrected  $s_n^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$                   $s_n = \sqrt{s_n^2}$
  Corrected    $\hat{\sigma}^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$        $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$
Table 3. Measures of Dispersion

  [Graph panels: Pie, Bar, Box, Histogram, Density, arranged by data type: Nominal, Ordinal, Interval/Ratio]
Table 2. Graphs

Distribution Shapes

If the distribution has more than one peak, it is multimodal. Otherwise, it is unimodal. If there is no skew, the distribution is symmetric. Otherwise, it is asymmetric. For bell-shaped (approximately normal) distributions, about 68% of the observations are within one standard deviation of the mean, about 95% are within two, and about 99.7% are within three. Chebyshev's Inequality shows that for any distribution with a finite mean and variance, at least a $1 - \frac{1}{k^2}$ share of the observations are within $k$ standard deviations of the mean, for any $k > 1$.
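As a rough check of Chebyshev's Inequality, the sketch below compares the observed share of observations within $k$ standard deviations of the mean against the bound $1 - \frac{1}{k^2}$. The data set is invented and deliberately skewed, the function names are hypothetical, and the mean and standard deviation are computed from the data itself with the population formula (dividing by $n$).

```python
import math

def share_within(xs, k):
    # Fraction of observations within k population standard deviations of the mean.
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / n)
    return sum(1 for x in xs if abs(x - mu) <= k * sigma) / n

def chebyshev_bound(k):
    # Minimum share guaranteed by Chebyshev's Inequality (informative for k > 1).
    return 1 - 1 / k ** 2

# Hypothetical right-skewed data: most values are small, one is very large.
skewed = [1.0, 1.0, 1.0, 1.0, 2.0, 2.0, 3.0, 5.0, 8.0, 40.0]

for k in (1.5, 2, 3):
    print(k, share_within(skewed, k), chebyshev_bound(k))
```

Even for this far-from-normal sample, the observed share should never fall below the bound, although it can sit well above it. The 68%, 95%, and 99.7% figures are much tighter, but they only apply when the distribution is roughly bell-shaped.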