Presentation on statistics

Statistics - methodology for collecting, analyzing, interpreting and
drawing conclusions from collected data
Homer Simpson: Aw, you can come up with statistics to prove
anything, Kent. 14 percent of all people know that.
1. Design: Planning and carrying out research studies;
2. Description: Summarizing and exploring data;
3. Inference: Making predictions and generalizing about
phenomena represented by the data.
Anastasia Kadina
GM presentation 6/15/2015
Population - the collection of all individuals or items under
consideration in a statistical study
Sample - that part of the population from which information is
collected
Parameter – statistical description of the population
[Diagram: Population → Sample → Statistical Data Analysis]
Variable - characteristic that varies from one item to another
- Quantitative (numerical): discrete or continuous
- Qualitative (categorical)
Observing the values of the variables yields data.
Observation - an individual piece of data
Data set/Data matrix - the collection of observations for the variables
Data matrix: k variables measured in a sample of size n
Presenting data
Relative frequency = Frequency / total # of observations
Sample and population distributions
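The relative-frequency formula above can be sketched in Python; this is a minimal illustration, and the sample data is made up for the example:

```python
from collections import Counter

def relative_frequencies(observations):
    """Relative frequency = frequency / total number of observations."""
    total = len(observations)
    return {value: count / total for value, count in Counter(observations).items()}

data = [4, 2, 5, 2, 6, 1, 2]
print(relative_frequencies(data))  # e.g. the value 2 appears 3 times out of 7 -> 3/7
```

The relative frequencies of all observed values always sum to 1, which makes them comparable across samples of different sizes.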
Measures of center (averages)
1. The mode: the value that occurs with the highest frequency
Example: 4, 2, 5, 2, 6, 1, 2: 2 occurs with the greatest frequency
If the greatest frequency is 1, there is no mode;
there can be more than one mode.
2. The median: arrange the observed values of the variable in
increasing order.
a. If the number of observations is odd: the value in the middle.
b. If the number of observations is even: the number halfway between the two
middle values.
Example: 2, 5, 7, 8, 9, 11 (n = 6): median = (7 + 8) / 2 = 7.5
3. Sample mean: the sum of the observed values in the data divided by the
number of observations
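All three averages are available in Python's standard statistics module; a quick sketch using the example data from above:

```python
import statistics

data = [4, 2, 5, 2, 6, 1, 2]

print(statistics.multimode(data))  # value(s) with the highest frequency: [2]
print(statistics.median(data))     # middle value of the sorted data
print(statistics.mean(data))       # sum of observations / number of observations
```

`statistics.multimode` (Python 3.8+) returns all modes, which covers the "more than one mode" case noted above.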
Measures of variability
1. Range:
Range = max – min
2. Standard deviation: for a variable x, the standard deviation is denoted
by sx (for a sample) or σ (for a population):
Sample: sx = √( Σ(xi - x̄)² / (n - 1) )
Population: σ = √( Σ(xi - μ)² / N )
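The sample and population standard deviations can be written out directly; a sketch with made-up data (the stdlib functions `statistics.stdev` and `statistics.pstdev` compute the same quantities):

```python
import math

def sample_std(xs):
    """sx = sqrt( sum((xi - mean)^2) / (n - 1) ) -- divides by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def population_std(xs):
    """sigma = sqrt( sum((xi - mean)^2) / N ) -- divides by N."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / n)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(data))  # 2.0 for this data set
print(sample_std(data))      # slightly larger, since it divides by n - 1
```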
Z-Score (Standard score)
Sample: z = (x - x̄) / sx
Population: z = (x - μ) / σ
How many standard deviations a value lies above or below the mean of
the set of data;
For a normal distribution the probability of the event (area under the curve)
can be found in the tables by z.
Empirical rule for the symmetrical normal
distribution:
68% of the values lie within x̄ ± sx,
95% of the values lie within x̄ ± 2sx,
99.7% of the values lie within x̄ ± 3sx.
Z-Score (Standard score)
Zα: value of Z for which the area under the standard normal curve to its
right is equal to α.
If we want to take both ends of the distribution into account, we consider
Zα/2
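Instead of a table lookup, Zα can be computed with the inverse CDF; a sketch using the stdlib `NormalDist`:

```python
from statistics import NormalDist

def z_alpha(alpha):
    """Value of z for which the area under the standard normal
    curve to its right equals alpha."""
    return NormalDist().inv_cdf(1 - alpha)

alpha = 0.05
print(z_alpha(alpha))      # one tail:  z_0.05  ≈ 1.645
print(z_alpha(alpha / 2))  # two tails: z_0.025 ≈ 1.960
```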
Sampling of the population
Random sample - a sample from a finite population is random if it is chosen in such a
way that each of the possible samples has the same probability of being selected.
For a random sample of size n from a population of size N:
Mean of the sampling distribution = population mean: μx̄ = μ
Standard deviation of the sample mean (standard error of the mean):
Infinite population: σx̄ = σ / √n
Finite population: σx̄ = (σ / √n) * √((N - n) / (N - 1))
(the second factor is the finite population correction factor)
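The standard error in both cases fits in one small function; a sketch with made-up numbers:

```python
import math

def standard_error(sigma, n, N=None):
    """Standard error of the mean: sigma / sqrt(n); for a finite
    population (N given), multiply by the correction factor
    sqrt((N - n) / (N - 1))."""
    se = sigma / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

print(standard_error(10, 25))         # infinite population: 10 / 5 = 2.0
print(standard_error(10, 25, N=500))  # finite population: slightly smaller
```

The correction factor is below 1 whenever n > 1, so sampling from a finite population without replacement always shrinks the standard error.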
Central Limit Theorem
For large samples the sampling distribution of the mean can be approximated closely
by a normal distribution with mean μx̄ = μ.
Large: sample size n >= 30
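The theorem is easy to see by simulation; a sketch that draws many sample means from a clearly non-normal (uniform) population:

```python
import random
import statistics

random.seed(42)

# Means of many samples of size n from Uniform(0, 1); by the CLT they
# cluster approximately normally around mu = 0.5 with spread sigma / sqrt(n).
n = 30  # counts as "large"
sample_means = [
    statistics.mean(random.random() for _ in range(n))
    for _ in range(2000)
]

print(statistics.mean(sample_means))   # close to mu = 0.5
print(statistics.stdev(sample_means))  # close to (1/sqrt(12)) / sqrt(30) ≈ 0.0527
```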
Probability and Confidence of Statements
Zα denotes the value of z for which the area under the standard normal curve to its
right is equal to α
Zα/2 is such value that area under the standard normal curve between -Zα/2 and +Zα/2 is
equal to 1 - α
When we use the sample mean x̄ as an estimate of μ, the probability is 1 - α
that this estimate will be "off" either way by at most
E = Zα/2 * (σ / √n), where σ / √n is the standard error of the mean
In general, we make probability statements about future values of random variables
(e.g. potential error of an estimate) and confidence statements once the data has
been obtained.
Confidence intervals
For large samples (n >= 30) when σ is known
The probability is (1 - α) that a random variable having the standard normal distribution
will take on a value between -Zα/2 and +Zα/2:
-Zα/2 < Z < Zα/2
-Zα/2 < (x̄ - μ) / (σ / √n) < Zα/2
Confidence interval:
x̄ - Zα/2 * σ / √n < μ < x̄ + Zα/2 * σ / √n
As we increase the degree of certainty, namely the degree of confidence (1 – α), the
confidence interval becomes wider and thus tells us less about the quantity we are
trying to estimate.
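The large-sample interval above can be sketched with the stdlib `NormalDist` (the numbers here are made up for illustration):

```python
import math
from statistics import NormalDist

def confidence_interval(xbar, sigma, n, confidence=0.95):
    """Large-sample interval: xbar ± z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

print(confidence_interval(50, 10, 100, confidence=0.95))  # ≈ (48.04, 51.96)
print(confidence_interval(50, 10, 100, confidence=0.99))  # wider interval
```

Raising the confidence level from 0.95 to 0.99 widens the interval, illustrating the trade-off described above.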
Student’s t-test
Used for small samples (n < 30) and/or when the population standard deviation is
unknown; the t distribution is roughly the shape of the normal distribution.
t-score: t = (x̄ - μ) / (s / √n)
Degrees of freedom: df = n - 1
Small sample confidence interval:
x̄ - tα/2 * s / √n < μ < x̄ + tα/2 * s / √n
tα/2 can be found in the corresponding tables by df and α
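A sketch of the small-sample interval; since the t distribution is not in the stdlib, the critical value here is hardcoded from a t table (t_{0.025, df=6} ≈ 2.447), and the sample data is invented:

```python
import math
import statistics

def t_confidence_interval(sample, t_crit):
    """Small-sample interval: xbar ± t_{alpha/2} * s / sqrt(n).
    t_crit must be looked up in a t table by df = n - 1 and alpha."""
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation
    margin = t_crit * s / math.sqrt(n)
    return xbar - margin, xbar + margin

sample = [9.8, 10.2, 10.4, 9.8, 10.0, 10.2, 9.6]  # n = 7, so df = 6
# From a t table: t_{0.025, df=6} ≈ 2.447 for a 95% interval.
print(t_confidence_interval(sample, t_crit=2.447))  # ≈ (9.74, 10.26)
```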
Error Bars
- graphical representations of the variability of data, used on graphs to indicate
the error, or uncertainty, in a reported measurement
Common Error Bars
Test of Hypotheses
A statistical hypothesis is an assertion about the parameter(s) of a population.
Null hypothesis (H0) – any hypothesis set up primarily to see whether it can be
rejected (is directly tested);
Alternative hypothesis (HA) – the hypothesis that we accept when the null hypothesis
can be rejected.
A significance test is a way of statistically testing a hypothesis by comparing the data
to values predicted by the hypothesis. Data that fall far from the predicted values
provide evidence against the hypothesis.
If the difference between what we expect and what we observe is so small that it may
well be attributed to chance, the results are not statistically significant.
The test statistic is a statistic calculated from the sample data to test the null
hypothesis. This statistic typically involves a point estimate of the parameter to which
the hypotheses refer.
p-value
- the probability, when H0 is true, of a test statistic value at least as contradictory to H0
as the value actually observed. The smaller the p-value, the more strongly the data
contradict H0. The primary reported result of a significance test.
The p-value summarizes the evidence in the data about the null hypothesis. A moderate
to large p-value means that the data are consistent with H0.
Most studies require a very small p-value, such as p ≤ 0.05, before concluding that the data
sufficiently contradict H0 to reject it. In such cases, results are said to be significant at
the 0.05 level. This means that if the null hypothesis were true, the chance of getting
such extreme results as in the sample data would be no greater than 5%.
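A two-sided significance test for a population mean can be sketched end to end with the z statistic (assuming a large sample and known σ; the numbers are made up):

```python
import math
from statistics import NormalDist

def z_test_p_value(xbar, mu0, sigma, n):
    """Two-sided p-value for H0: mu = mu0, with sigma known and n large.
    Test statistic: z = (xbar - mu0) / (sigma / sqrt(n))."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    # Probability, under H0, of a value at least as extreme in either tail:
    return 2 * (1 - NormalDist().cdf(abs(z)))

p = z_test_p_value(xbar=52, mu0=50, sigma=10, n=100)
print(p)  # z = 2.0 -> p ≈ 0.0455, significant at the 0.05 level
```

Here p < 0.05, so under H0 a result this extreme would occur less than 5% of the time, and H0 is rejected at the 0.05 level.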