Essential Ideas, Terminology, Skills/Procedures, and Concepts for Each Part of the Course

Part I

Two Types of Statistics: Descriptive and Inferential
    Descriptive statistics--purpose: to communicate the characteristics of a set of data
        Characteristics: mean, median, mode, variance, standard deviation, skewness, etc.
        Charts, graphs
    Inferential statistics--purpose: to make statements about population parameters based on sample statistics
        Population--the group of interest being studied; often too large to sample every member
        Sample--a subset of the population; must be representative of the population
            Random sampling is a popular way of obtaining a representative sample.
        Parameter--a characteristic of a population; usually unknown, but often can be estimated
            Population mean, population variance, population proportion, etc.
        Statistic--a characteristic of a sample
            Sample mean, sample variance, sample proportion, etc.

Two ways of conducting inferential statistics
    Estimation
        Point estimate--a single-number estimate of a population parameter, with no recognition of uncertainty, such as "40" to estimate the average age of the voting population
        Interval estimation--a point estimate with an error factor, as in "40 ± 5" (a short code sketch follows this outline)
            The error factor provides formal and quantitative recognition of uncertainty.
            Confidence level (confidence coefficient)--the probability that the parameter being estimated actually is in the stated range
    Hypothesis testing
        Null hypothesis--an idea about an unknown population parameter, such as "In the population, the correlation between smoking and lung cancer is zero."
        Alternate hypothesis--the opposite idea about the unknown population parameter, such as "In the population, the correlation between smoking and lung cancer is not zero."
        Data are gathered to see which hypothesis is supported. The result is either rejection or non-rejection (acceptance) of the null hypothesis.

Four types of data
    Nominal--names, labels, categories (e.g. cat, dog, bird, rabbit, ferret, gerbil)
    Ordinal--suggests order, but computations on the data are impossible or meaningless (e.g. pets can be listed in order of popularity--1-cat, 2-dog, 3-bird, etc.--but the difference between cat and dog is not related to the difference between dog and bird)
    Interval--differences are meaningful, but they are not ratios; there is no natural zero point (e.g. clock time--the difference between noon and 1 p.m. is the same amount of time as the difference between 1 p.m. and 2 p.m., but 2 p.m. is not twice as late as 1 p.m. unless you define the starting point of time as noon, thereby creating a ratio scale)
    Ratio--differences and ratios are both meaningful, and there is a natural zero point (e.g. length--8 feet is twice as long as 4 feet, and 0 feet actually does mean no length at all)

Two types of statistical studies
    Observational study (naturalistic observation)--the researcher cannot control the variables under study; they must be taken as they are found (e.g. most research in astronomy)
    Experiment--the researcher can manipulate the variables under study (e.g. drug dosage)
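The following is a minimal sketch, in Python, of the point-versus-interval idea described under Estimation above. The ages, the 95% confidence level, and the choice of a z-based (normal) interval are illustrative assumptions, not part of the course notes; it uses only the standard library (Python 3.8 or later for NormalDist).

from statistics import NormalDist, mean, stdev
from math import sqrt

# Hypothetical sample of voter ages (made-up values, for illustration only)
ages = [34, 42, 29, 51, 38, 45, 40, 36, 48, 37]

n = len(ages)
point_estimate = mean(ages)     # single-number estimate of the population mean
s = stdev(ages)                 # estimated population standard deviation (divide by n - 1)

# Error factor for a 95% confidence level, using a z-based (normal) interval.
# This assumes a roughly normal population; a t-based interval would be wider for small n.
z = NormalDist().inv_cdf(0.975)          # about 1.96
error_factor = z * s / sqrt(n)

print(f"Point estimate:    {point_estimate:.1f}")
print(f"Interval estimate: {point_estimate:.1f} +/- {error_factor:.1f}")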
Characteristics of Data
    Central tendency--the attempt to find a "representative" or "typical" value
        Mean--the sum of the data items divided by the number of items, or Σx / n
            More sensitive to outliers than the median
            Outlier--a data item far from the typical data item
        Median--the middle item when the items are ordered high-to-low or low-to-high
            Also called the 50th percentile
            Less sensitive to outliers than the mean
        Mode--the most-frequently-occurring item in a data set
    Dispersion (variation or variability)--the opposite of consistency
        Variance--the Mean of the Squared Deviations (MSD), or Σ(x - xbar)² / n
            Deviation--the difference between a data item and the mean
            The sum of the deviations in any data set is always equal to zero.
        Standard Deviation--the square root of the variance
        Range--the difference between the highest and lowest values in a data set
        Coefficient of Variation--measures relative dispersion--CV = ssd / xbar (or, using the estimated population parameters, σ^ / μ^)
    Skewness--the opposite of symmetry
        Positive skewness--mean exceeds median; high outliers
        Negative skewness--mean less than median; low outliers
        Symmetry--mean, median, mode, and midrange about the same
    Kurtosis--the degree of relative concentration or peakedness
        Leptokurtic--distribution strongly peaked
        Mesokurtic--distribution moderately peaked
        Platykurtic--distribution weakly peaked
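A short sketch of these descriptive measures using Python's standard library; the data set is made up for illustration. Note that pvariance and pstdev divide by n, so they compute what these notes call svar and ssd.

from statistics import mean, median, mode, pvariance, pstdev

data = [4, 8, 6, 5, 3, 7, 6, 9, 6, 2]    # made-up data set, for illustration only

xbar = mean(data)              # Σx / n
med  = median(data)            # middle item (50th percentile)
mo   = mode(data)              # most-frequently-occurring item
svar = pvariance(data)         # MSD: Σ(x - xbar)² / n
ssd  = pstdev(data)            # square root of svar
rng  = max(data) - min(data)   # range
cv   = ssd / xbar              # coefficient of variation (relative dispersion)

# Direction of skewness, judged by comparing mean and median as described above
if xbar > med:
    skew = "positive skewness (mean exceeds median)"
elif xbar < med:
    skew = "negative skewness (mean less than median)"
else:
    skew = "roughly symmetric (mean about equal to median)"

print(xbar, med, mo, svar, ssd, rng, round(cv, 3))
print(skew)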
In this context "sample standard deviation" does not mean standard deviation of the sample; it is, rather, a shortening of the cumbersome phrase "estimate of population standard deviation computed from a sample." Calculator note--some calculators, notably TI's, compute two standard deviations The smaller of the two is the one we call "ssd" TI calculator manuals call this the "population standard deviation." This refers to the special case in which the entire population is included in the sample; then the sample standard deviation (ssd) and the population standard deviation are the same. (This also applies to means and variances.) There is no need for inferential statistics in such cases. The larger of the two is the one we call σ^ (sigma-hat) (estimated population standard deviation). TI calculator manuals call this the "sample standard deviation." This refers to the more common case in which "sample standard deviation" really means estimated population standard deviation, computed from a sample. Significance of the Standard Deviation Normal distribution (empirical rule)--empirical: derived from experience Two major characteristics: symmetry and center concentration Two parameters: mean and standard deviation "Parameter," in this context, means a defining characteristic of a distribution. Mean and median are identical (due to symmetry) and are at the high point. Standard deviation--distance from mean to inflection point Inflection point--the point where the second derivative of the normal curve is equal to zero, or, the point where the curvature changes from "right" to "left" (or vice-versa), as when you momentarily travel straight on an S-curve on the highway z-value--distance from mean, measured in standard deviations Areas under the normal curve can be computed using integral calculus. Total area under the curve is taken to be 1.000 or 100% Tables enable easy determination of these areas. about 68-1/4%, 95-1/2%, and 99-3/4% of the area under a normal curve lie within one, two, and three standard deviations from the mean, respectively Many natural and economic phenomena are normally distributed. Tchebyshev's Theorem (or Chebysheff P. F., 1821-1894) What if a distribution is not normal? Can any statements be made as to what percentage of the area lies within various distances (z-values) of the mean? Tchebysheff proved that certain minimum percentages of the area must lie within various z-values of the mean. The minimum percentage for a given z-value, stated as a fraction, is [ (z2-1) / z2 ] Tchebysheff's Theorem is valid for all distributions. Other measures of relative standing Percentiles--A percentile is the percentage of a data set that is below a specified value. Percentile values divide a data set into 100 parts, each with the same number of items. The median is the 50th percentile value. Z-values can be converted into percentiles and vice-versa. A z-value of +1.00, for example, corresponds to the 84.13 percentile. The 95th percentile, for example, corresponds to a z-value of +1.645. A z-value of 0.00 is the 50th percentile, the median. Deciles Decile values divide a data set into 10 parts, each with the same number of items. The median is the 5th decile value. The 9th decile value, for example, separates the upper 10% of the data set from the lower 90%. (Some would call this the 1st decile value.) Quartiles Quartile values divide a data set into 4 parts, each with the same number of items. The median is the 2nd quartile value. 
Other measures of relative standing
    Percentiles
        A percentile is the percentage of a data set that is below a specified value.
        Percentile values divide a data set into 100 parts, each with the same number of items.
        The median is the 50th percentile value.
        z-values can be converted into percentiles and vice versa.
            A z-value of +1.00, for example, corresponds to the 84.13th percentile.
            The 95th percentile, for example, corresponds to a z-value of +1.645.
            A z-value of 0.00 is the 50th percentile, the median.
    Deciles
        Decile values divide a data set into 10 parts, each with the same number of items.
        The median is the 5th decile value.
        The 9th decile value, for example, separates the upper 10% of the data set from the lower 90%. (Some would call this the 1st decile value.)
    Quartiles
        Quartile values divide a data set into 4 parts, each with the same number of items.
        The median is the 2nd quartile value.
        The 3rd quartile value (Q3), for example, separates the upper 25% of the data set from the lower 75%.
        Q3 is the median of the upper half; Q1 (the lower quartile) is the median of the lower half.
    Other possibilities: quintiles (5 parts), stanines (9 parts)
    Some ambiguity in usage exists, especially regarding quartiles. The phrase "first quartile," for example, could mean one of two things:
        (1) It could refer to the value that separates the lower 25% of the data set from the upper 75%, or
        (2) It could refer to the members, as a group, of the lower 25% of the data.
        Example (1): "The first quartile score on this test was 60."
        Example (2): "Your score was 55, putting you in the first quartile."
        Also, the phrase "first quartile" is used by some to mean the 25th percentile value and by others to mean the 75th percentile value.
        To avoid this ambiguity, the phrases "lower quartile," "middle quartile," and "upper quartile" may be used. (The sketch after the terminology list below shows how two common computational conventions can also disagree.)

Terminology
    Statistics, population, sample, parameter, statistic, qualitative data, quantitative data, discrete data, continuous data, nominal measurements, ordinal measurements, interval measurements, ratio measurements, observational study (naturalistic observation), experiment, precision, accuracy, sampling, random sampling, stratified sampling, systematic sampling, cluster sampling, convenience sampling, representativeness, inferential statistics, descriptive statistics, estimation, point estimation, interval estimation, hypothesis testing, dependency, central tendency, dispersion, skewness, kurtosis, leptokurtic, mesokurtic, platykurtic, frequency table, mutually exclusive, collectively exhaustive, relative frequencies, cumulative frequency, histogram, Pareto chart, bell-shaped distribution, uniform distribution, skewed distribution, pie chart, pictogram, mean, median, mode, bimodal, midrange, reliability, symmetry, positive skewness, negative skewness, range, MSD, variance, deviation, standard deviation, z-value, Chebyshev's theorem, empirical rule, normal distribution, quartiles, quintiles, deciles, percentiles, interquartile range, stem-and-leaf plot, boxplot, biased, unbiased.
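The following sketch computes quartile, decile, and percentile cut points with Python's statistics.quantiles (Python 3.8 or later); the data values are made up for illustration. The comparison of the module's "exclusive" and "inclusive" methods is one concrete example of the ambiguity in quartile conventions noted above.

from statistics import quantiles, median

data = [12, 15, 17, 19, 22, 24, 25, 28, 31, 35, 40, 44]   # made-up data, for illustration only

q1, q2, q3 = quantiles(data, n=4)           # lower, middle, and upper quartile values
deciles = quantiles(data, n=10)             # the nine decile values
percentile_cuts = quantiles(data, n=100)    # the ninety-nine percentile values

print("Q1, Q2, Q3:", q1, q2, q3)
print("Q2 equals the median:", q2 == median(data))
print("9th decile value (cuts off the upper 10%):", deciles[8])
print("75th percentile value equals Q3:", percentile_cuts[74] == q3)

# Two common conventions give different lower quartiles for the same data
print("exclusive vs. inclusive Q1:",
      quantiles(data, n=4)[0],
      quantiles(data, n=4, method="inclusive")[0])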
Skills/Procedures--given appropriate data, compute or identify the
    Sample mean, median, mode, variance, standard deviation, and range
    Estimated population mean, variance, and standard deviation
    Kind of skewness, if any, present in the data set
    z-value of any data item
    Upper, middle, and lower quartiles
    Percentile of any data item
    Percentile of any integer z-value from -3 to +3

Concepts
    Identify circumstances under which the median is a more suitable measure of central tendency than the mean.
    Explain when the normal distribution (empirical rule) may be used.
    Explain when Chebyshev's Theorem may be used, and when it should be used.
    Give an example (create a data set) in which the mode fails as a measure of central tendency.
    Give an example (create a data set) in which the mean fails as a measure of central tendency.
    Explain why the sum of the deviations fails as a measure of dispersion, and describe how this failure is overcome.
    Distinguish between unbiased and biased estimators of population parameters.
    Describe how percentile scores are determined on standardized tests like the SAT or the ACT.
    Explain why the variance and standard deviation of a sample are likely to be lower than the variance and standard deviation of the population from which the sample was taken.
    Identify when the sample mean, variance, and standard deviation are identical to the population mean, variance, and standard deviation.
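As a practical illustration of the skills listed above, here is a short Python sketch that works through most of them on a made-up sample. The divide-by-n functions (pvariance, pstdev) give what these notes call svar and ssd; the divide-by-(n - 1) functions (variance, stdev) give the estimated population variance and standard deviation.

from statistics import mean, median, mode, pvariance, pstdev, variance, stdev, quantiles

sample = [61, 67, 70, 72, 72, 74, 78, 81, 85, 90]   # made-up sample, for illustration only

xbar = mean(sample)
svar = pvariance(sample)       # MSD of the sample (divide by n)
ssd  = pstdev(sample)          # sample standard deviation
est_var = variance(sample)     # estimated population variance: svar * n/(n - 1)
est_sd  = stdev(sample)        # estimated population standard deviation

print("mean, median, mode:", xbar, median(sample), mode(sample))
print("svar vs. estimated population variance:", svar, round(est_var, 3))
print("ssd  vs. estimated population std dev:", round(ssd, 3), round(est_sd, 3))
print("range:", max(sample) - min(sample))

# z-value of a data item: its distance from the mean, in standard deviations
x = 85
print(f"z-value of {x}: {(x - xbar) / ssd:+.2f}")

# Quartiles, and the percentile of a data item (the percentage of the data below it)
q1, q2, q3 = quantiles(sample, n=4)
below = sum(value < x for value in sample)
print("Q1, Q2, Q3:", q1, q2, q3)
print(f"percentile of {x}: {100 * below / len(sample):.0f}th")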