Statistics - the methodology for collecting, analyzing, interpreting, and drawing conclusions from data.

Homer Simpson: "Aw, you can come up with statistics to prove anything, Kent. 14 percent of all people know that."

1. Design: planning and carrying out research studies;
2. Description: summarizing and exploring data;
3. Inference: making predictions and generalizing about phenomena represented by the data.

Statistical Data Analysis
Anastasia Kadina, GM presentation, 6/15/2015

Population - the collection of all individuals or items under consideration in a statistical study.
Sample - the part of the population from which information is collected.
Parameter - a statistical description of the population.

Variable - a characteristic that varies from one item to another:
- Quantitative (numerical): discrete or continuous;
- Qualitative (categorical).
Observing the values of the variables yields data.
Observation - an individual piece of data.
Data set / data matrix - the collection of observations for the variables; a data matrix holds k variables measured in a sample of size n.

Presenting data
Relative frequency = frequency / total number of observations.
Sample and population distributions.

Measures of center (averages)
1. The mode: the value that occurs with the highest frequency.
   Example: 4, 2, 5, 2, 6, 1, 2 - here 2 occurs with the greatest frequency.
   If the greatest frequency is 1, there is no mode; there can also be more than one mode.
2. The median: arrange the observed values of the variable in increasing order.
   a. If the number of observations is odd: the value in the middle.
   b. If the number of observations is even: the number halfway between the two middle values.
   Example: 2, 5, 7, 8, 9, 11 (n = 6): median = (7 + 8) / 2 = 7.5.
3. The sample mean: the sum of the observed values divided by the number of observations.

Measures of variability
1. Range: range = max - min.
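The measures of center and the range above can be sketched with Python's standard library (variable names here are illustrative, not from the presentation):

```python
import statistics

data = [2, 5, 7, 8, 9, 11]

# Mode: multimode returns every value tied for the highest frequency,
# which also covers the "more than one mode" case.
modes = statistics.multimode([4, 2, 5, 2, 6, 1, 2])  # [2]

# Median: n = 6 is even, so it is halfway between the two middle values.
med = statistics.median(data)  # (7 + 8) / 2 = 7.5

# Sample mean: sum of the observed values divided by their count.
avg = statistics.mean(data)  # 42 / 6 = 7

# Range: max - min.
rng = max(data) - min(data)  # 11 - 2 = 9
```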
2. Standard deviation: for a variable x, the sample standard deviation is denoted by sx or σx, and the population standard deviation by σ:
   Sample: sx = √( Σ(xi - x̄)² / (n - 1) )
   Population: σ = √( Σ(xi - μ)² / N )

Z-score (standard score)
   Sample: z = (x - x̄) / sx
   Population: z = (x - μ) / σ
How many standard deviations a value lies above or below the mean of the data; for a normal distribution, the probability of an event (the area under the curve) can be looked up in tables by z.

Empirical rule for a symmetrical normal distribution:
- 68% of the values lie within x̄ ± sx;
- 95% of the values lie within x̄ ± 2sx;
- 99.7% of the values lie within x̄ ± 3sx.

Zα: the value of Z for which the area under the standard normal curve to its right equals α. If we want to take both ends of the distribution into account, we consider Zα/2.

Sampling of the population
Random sample - a sample from a finite population is random if it is chosen in such a way that each of the possible samples has the same probability of being selected.
For a random sample of size n from a population of size N:
- Mean of the sampling distribution = population mean: μx̄ = μ.
- Standard deviation of the sampling distribution (the standard error of the mean):
  Infinite population: σx̄ = σ / √n
  Finite population: σx̄ = (σ / √n) · √((N - n) / (N - 1)), where √((N - n) / (N - 1)) is the finite-population correction factor.

Central Limit Theorem
For large samples the sampling distribution of the mean can be approximated closely by a normal distribution; "large" means a sample size n ≥ 30.

Probability and confidence of statements
Zα denotes the value of z for which the area under the standard normal curve to its right equals α; Zα/2 is the value such that the area under the curve between -Zα/2 and +Zα/2 equals 1 - α.
When we use the sample mean x̄ as an estimate of μ, the probability is 1 - α that this estimate will be "off" either way by at most E = Zα/2 · (σ / √n), the margin of error (σ / √n itself is the standard error).
In general, we make probability statements about future values of random variables (e.g. the potential error of an estimate) and confidence statements once the data have been obtained.
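A minimal sketch of the standard-deviation, z-score, and standard-error formulas above; the function names are mine, not from the presentation:

```python
import math

def sample_sd(xs):
    """Sample standard deviation: sqrt(sum((x - x_bar)^2) / (n - 1))."""
    n = len(xs)
    x_bar = sum(xs) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))

def z_score(x, center, sd):
    """How many standard deviations x lies above or below the mean."""
    return (x - center) / sd

def standard_error(sd, n, N=None):
    """Standard error of the mean: sd / sqrt(n); for a finite population
    of size N, multiply by the correction factor sqrt((N - n) / (N - 1))."""
    se = sd / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

xs = [2, 4, 4, 4, 5, 5, 7, 9]
sd = sample_sd(xs)                       # sqrt(32 / 7), about 2.14
z = z_score(9, sum(xs) / len(xs), sd)    # about 1.87 sd above the mean
```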
Confidence intervals
For large samples (n ≥ 30) when σ is known, the probability is (1 - α) that a random variable having the standard normal distribution will take on a value between -Zα/2 and +Zα/2:
   -Zα/2 < Z < Zα/2, i.e. -Zα/2 < (x̄ - μ) / (σ / √n) < Zα/2
Confidence interval:
   x̄ - Zα/2 · σ / √n < μ < x̄ + Zα/2 · σ / √n
As we increase the degree of certainty, namely the degree of confidence (1 - α), the confidence interval becomes wider and thus tells us less about the quantity we are trying to estimate.

Student's t-test
Also suitable for small samples (n < 30) and/or when the standard deviation is unknown; the t distribution is roughly the shape of the normal distribution.
   t-score: t = (x̄ - μ) / (s / √n)
   Degrees of freedom: df = n - 1
Small-sample confidence interval:
   x̄ - tα/2 · s / √n < μ < x̄ + tα/2 · s / √n
tα/2 can be found in the corresponding tables by df and α.

Error bars - a graphical representation of the variability of data, used on graphs to indicate the error, or uncertainty, in a reported measurement.
Common error bars.

Test of hypotheses
A statistical hypothesis is an assertion about the parameter(s) of a population.
- Null hypothesis (H0): any hypothesis set up primarily to see whether it can be rejected (it is the hypothesis directly tested);
- Alternative hypothesis (HA): the hypothesis that we accept when the null hypothesis can be rejected.
A significance test is a way of statistically testing a hypothesis by comparing the data to values predicted by the hypothesis. Data that fall far from the predicted values provide evidence against the hypothesis. If the difference between what we expect and what we observe is so small that it may well be attributed to chance, the results are not statistically significant.
The test statistic is a statistic calculated from the sample data to test the null hypothesis. This statistic typically involves a point estimate of the parameter to which the hypotheses refer.
p-value - the probability, when H0 is true, of a test statistic value at least as contradictory to H0 as the value actually observed.
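The large-sample interval can be sketched with only the standard library (`statistics.NormalDist`, Python ≥ 3.8); the small-sample t variant needs a critical value from a t table or `scipy.stats.t`, which I only hint at in a comment:

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, sigma, n, conf=0.95):
    """Large-sample CI for mu when sigma is known:
    x_bar +/- z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}; about 1.96 for conf=0.95
    e = z * sigma / math.sqrt(n)             # margin of error
    return x_bar - e, x_bar + e

# For n < 30 with unknown sigma, replace z with t_{alpha/2} at df = n - 1
# (from a t table, or scipy.stats.t.ppf(1 - alpha / 2, df=n - 1))
# and replace sigma with the sample standard deviation s.

lo, hi = z_confidence_interval(x_bar=10.0, sigma=3.0, n=36)
# lo is about 9.02, hi about 10.98; raising conf to 0.99 widens the interval
```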
The smaller the p-value, the more strongly the data contradict H0. The p-value is the primary reported result of a significance test: it summarizes the evidence in the data about the null hypothesis. A moderate to large p-value means that the data are consistent with H0. Most studies require a very small p-value, such as p ≤ 0.05, before concluding that the data sufficiently contradict H0 to reject it. In such cases, the results are said to be significant at the 0.05 level: if the null hypothesis were true, the chance of getting results as extreme as those in the sample data would be no greater than 5%.
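The whole recipe (test statistic, p-value, decision at the 0.05 level) can be sketched as a one-sample two-sided z-test; this particular test is an illustrative choice, as the presentation does not fix one:

```python
import math
from statistics import NormalDist

def one_sample_z_test(x_bar, mu0, sigma, n):
    """Test H0: mu = mu0 against HA: mu != mu0 (sigma known, large n).
    Returns the z statistic and the two-sided p-value."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))  # test statistic
    p = 2 * (1 - NormalDist().cdf(abs(z)))      # P(|Z| >= |z|) under H0
    return z, p

z, p = one_sample_z_test(x_bar=52.0, mu0=50.0, sigma=8.0, n=64)
# z = 2.0, p is about 0.0455: p <= 0.05, so H0 is rejected at the 0.05 level
```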