Statistics Review


Review of Top 10 Concepts in Statistics

NOTE: This PowerPoint file is not an introduction, but rather a checklist of topics to review

Top Ten #1

 Descriptive Statistics

Measures of Central Location

 Mean

 Median

 Mode

Mean

Population mean = µ = Σx/N = (5+1+6)/3 = 12/3 = 4

Algebra: Σx = N*µ = 3*4 =12

Sample mean = x-bar = Σx/n

Example: the number of hours spent on the Internet: 4, 8, and 9

x-bar = (4+8+9)/3 = 7 hours

Do NOT rely on the mean if the number of observations is small or the data include extreme values

Ex: Do NOT use the mean if 3 houses were sold this week, and one was a mansion

Median

 Median = middle value

 Example: 5,1,6

Step 1: Sort data: 1,5,6

Step 2: Middle value = 5

 When there is an even number of observations, the median is computed by averaging the two observations in the middle.

 OK even if there are extreme values

 Home sales: 100K,200K,900K, so mean =400K, but median = 200K

Mode

 Mode: most frequent value

 Ex: female, male, female

 Mode = female

 Ex: 1,1,2,3,5,8

 Mode = 1

 It may not be a very good measure, see the following example

Measures of Central Location - Example

Sample: 0, 0, 5, 7, 8, 9, 12, 14, 22, 23

 Sample Mean = x-bar = Σx/n = 100/10 = 10

 Median = (8+9)/2 = 8.5

 Mode = 0
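The three measures for this sample can be checked with a short sketch. It assumes Python's standard `statistics` module; the variable names are illustrative and not part of the slides.

```python
import statistics

# Sample from the slide above
sample = [0, 0, 5, 7, 8, 9, 12, 14, 22, 23]

mean = statistics.mean(sample)      # sum of x divided by n
median = statistics.median(sample)  # averages the two middle values when n is even
mode = statistics.mode(sample)      # most frequent value

print(mean, median, mode)
```

Here the mode sits at the edge of the data, which is why it may not be a good measure of central location.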

Relationship

 Case 1: if probability distribution is symmetric (ex. bell-shaped, normal distribution),

 Mean = Median = Mode

 Case 2: if distribution is positively skewed (skewed to the right) (ex. incomes of employees in a large firm: a large number of relatively low-paid workers and a small number of high-paid executives),

 Mode < Median < Mean

Relationship – cont’d

 Case 3: if distribution is negatively skewed (skewed to the left) (ex. the time taken by students to write exams: a few students hand in their exams early and the majority turn them in at the end of the exam),

 Mean < Median < Mode

Dispersion – Measures of Variability

 How much spread of data

 How much uncertainty

 Measures

 Range

Variance

Standard deviation

Range

 Range = Max - Min ≥ 0

 But range is affected by unusual values

 Ex: Santa Monica has a high of 105 degrees and a low of 30 once a century, but range would be 105-30 = 75

Standard Deviation (SD)

 Better than range because all data used

 Population SD = Square root of variance = sigma = σ

 SD ≥ 0 (SD = 0 only if all data are exactly the same)

Empirical Rule

 Applies to mound or bell-shaped curves

Ex: normal distribution

 68% of data within ± one SD of mean

 95% of data within ± two SD of mean

 99.7% of data within ± three SD of mean
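The percentages in the empirical rule can be reproduced from the standard normal curve. This is a minimal sketch assuming Python's `statistics.NormalDist`; the helper name `share_within` is hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

def share_within(k: float) -> float:
    """Probability that a normal value lies within k SDs of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.1%}")
```

The results are roughly 68%, 95%, and 99.7%. They apply only to bell-shaped data, as stated above.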

Standard Deviation = Square Root of Variance

s = √[ Σ(x − x-bar)² / (n − 1) ]

Sample Standard Deviation

x         x − x-bar      (x − x-bar)²
6         6-8 = -2       (-2)(-2) = 4
6         6-8 = -2       4
7         7-8 = -1       (-1)(-1) = 1
8         8-8 = 0        0
13        13-8 = 5       (5)(5) = 25
Sum=40    Sum=0          Sum = 34

Mean=40/5=8

Standard Deviation

Total variation = 34

 Sample variance = 34/4 = 8.5

 Sample standard deviation = square root of 8.5 = 2.9

Measures of Variability - Example

The hourly wages earned by a sample of five students are:

$7, $5, $11, $8, and $6

Range: 11 – 5 = 6

Variance: s² = Σ(X − X-bar)² / (n − 1) = [(7 − 7.4)² + (5 − 7.4)² + ... + (6 − 7.4)²] / (5 − 1) = 21.2 / (5 − 1) = 5.30

Standard deviation: s = √s² = √5.30 = 2.30
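The wage example can be verified with a short sketch. It assumes Python's standard `statistics` module, whose `variance` and `stdev` functions divide by n − 1 as the sample formulas above do; the variable names are illustrative.

```python
import statistics

wages = [7, 5, 11, 8, 6]  # hourly wages in dollars, from the example above

wage_range = max(wages) - min(wages)
mean_wage = statistics.mean(wages)
sample_variance = statistics.variance(wages)  # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(wages)           # square root of the sample variance

print(wage_range, mean_wage, sample_variance, round(sample_sd, 2))
```

For population data, `statistics.pvariance` and `statistics.pstdev` divide by N instead.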

Graphical Tools

 Line chart: trend over time

 Scatter diagram: relationship between two variables

 Bar chart: frequency for each category

 Histogram: frequency for each class of measured data (graph of frequency distr.)

 Box plot: graphical display based on quartiles, which divide data into 4 parts

Top Ten #2

 Hypothesis Testing

H0: Null Hypothesis

Population mean= µ

Population proportion= π

 A statement about the value of a population parameter

 Never include a sample statistic (such as x-bar) in a hypothesis

HA or H1: Alternative Hypothesis

 ONE TAIL ALTERNATIVE

– Right tail: µ > number (smog check)

π > fraction (% defectives)

– Left tail: µ < number (weight in box of crackers)

π < fraction (unpopular President’s % approval low)

One-Tailed Tests

A test is one-tailed when the alternate hypothesis, H1 or HA, states a direction, such as:

• H1: The mean yearly salary earned by full-time employees is more than $45,000. (µ > $45,000)

• H1: The average speed of cars traveling on the freeway is less than 75 miles per hour. (µ < 75)

• H1: Less than 20 percent of the customers pay cash for their gasoline purchase. (π < 0.2)

Two-Tail Alternative

 Population mean not equal to number (too hot or too cold)

 Population proportion not equal to fraction (% alcohol too weak or too strong)

Two-Tailed Tests

A test is two-tailed when no direction is specified in the alternate hypothesis

• H1: The mean amount of time spent on the Internet is not equal to 5 hours. (µ ≠ 5).

• H1: The mean price for a gallon of gasoline is not equal to $2.54. (µ ≠ $2.54).

Reject Null Hypothesis (H0) If

Absolute value of test statistic* > critical value*

Reject H0 if |Z Value| > critical Z

Reject H0 if |t Value| > critical t

Reject H0 if p-value < significance level (alpha)

Note that direction of inequality is reversed!

 Reject H0 if very large difference between sample statistic and population parameter in H0

* Test statistic: A value, determined from sample information, used to determine whether or not to reject the null hypothesis.

* Critical value: The dividing point between the region where the null hypothesis is rejected and the region where it is not rejected.

Example: Smog Check

H0: µ = 80

HA: µ > 80

If test statistic = 2.2 and critical value = 1.96, reject H0, and conclude that the population mean is likely > 80

 If test statistic = 1.6 and critical value = 1.96, do not reject H0, and reserve judgment about H0

Type I vs Type II Error

 Alpha= α = P(type I error) = Significance level = probability that you reject a true null hypothesis

 Beta= β = P(type II error) = probability that you do not reject the null hypothesis, given H0 is false

Ex: H0: Defendant innocent

α = P(jury convicts innocent person)

β =P(jury acquits guilty person)

Type I vs Type II Error

                    H0 true                        H0 false
Reject H0           Alpha = α = P(type I error)    1 – β (Correct Decision)
Do not reject H0    1 – α (Correct Decision)       Beta = β = P(type II error)

Example: Smog Check

H0: µ = 80

HA: µ > 80

If p-value = 0.01 and alpha = 0.05, reject H0, and conclude that the population mean is likely > 80

 If p-value = 0.07 and alpha = 0.05, do not reject H0, and reserve judgment about H0

Test Statistic

 When testing for the population mean from a large sample and the population standard deviation is known, the test statistic is given by:

z = (X-bar − µ) / (σ / √n)

Example

The processors of Best Mayo indicate on the label that the bottle contains 16 ounces of mayo. The standard deviation of the process is 0.5 ounces. A sample of 36 bottles from last hour’s production showed a mean weight of 16.12 ounces per bottle. At the .05 significance level, can we conclude that the mean amount per bottle is greater than 16 ounces?

Example – cont’d

1. State the null and the alternative hypotheses:

H0: μ = 16, H1: μ > 16

2. Select the level of significance. In this case, we selected the .05 significance level.

3. Identify the test statistic. Because we know the population standard deviation, the test statistic is z.

4. State the decision rule.

Reject H0 if z > 1.645 (= z0.05), because this is a right-tail test

Example – cont’d

5. Compute the value of the test statistic:

z = (X-bar − µ) / (σ / √n) = (16.12 − 16.00) / (0.5 / √36) = 1.44

6. Conclusion: Do not reject the null hypothesis.

We cannot conclude the mean is greater than 16 ounces.
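The six steps of the Best Mayo example can be sketched in a few lines. This assumes Python's `statistics.NormalDist` in place of a printed Z table; the variable names are illustrative, and the p-value line is an addition that the example itself does not compute.

```python
from math import sqrt
from statistics import NormalDist

mu_0 = 16.0      # hypothesized mean under H0 (ounces)
sigma = 0.5      # known population standard deviation
n = 36           # sample size
x_bar = 16.12    # sample mean
alpha = 0.05     # significance level

z = (x_bar - mu_0) / (sigma / sqrt(n))
z_critical = NormalDist().inv_cdf(1 - alpha)   # right-tail critical value, about 1.645
p_value = 1 - NormalDist().cdf(z)              # right-tail p-value (not computed on the slide)

reject_h0 = z > z_critical
print(f"z = {z:.2f}, critical z = {z_critical:.3f}, reject H0: {reject_h0}")
```

Both rules agree here: z is below the critical value and the p-value is above alpha, so H0 is not rejected.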

Top Ten #3

 Confidence Intervals: Mean and Proportion

Confidence Interval

A confidence interval is a range of values within which the population parameter is expected to occur.

Factors for Confidence Interval

The factors that determine the width of a confidence interval are:

1. The sample size, n

2. The variability in the population, usually estimated by the standard deviation.

3. The desired level of confidence.

Confidence Interval: Mean

 Use normal distribution (Z table) if population standard deviation (sigma) is known and either (1) or (2):

(1) Normal population

(2) Sample size > 30

Confidence Interval: Mean

 If normal table, then

µ = x-bar ± z σ/√n

Normal Table

 Tail = .5(1 – confidence level)

 NOTE! Different statistics texts have different normal tables

 This review uses the tail of the bell curve

 Ex: 95% confidence: tail = .5(1-.95)= .025

 Z.025 = 1.96

Example

 n=49, Σx=490, σ=2, 95% confidence

µ = 490/49 ± 1.96(2/√49) = 10 ± 0.56

9.44 < µ < 10.56

Another Example

One of the SOM professors wants to estimate the mean number of hours worked per week by students. A sample of 49 students showed a mean of 24 hours. It is assumed that the population standard deviation is 4 hours. What is the population mean?

Another Example – cont’d

95 percent confidence interval for the population mean:

X-bar ± 1.96 σ/√n = 24.00 ± 1.96(4/√49) = 24.00 ± 1.12

The confidence limits range from 22.88 to 25.12. We estimate with 95 percent confidence that the average number of hours worked per week by students lies between these two values.
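The interval for the hours-worked example can be reproduced with a small sketch. It assumes Python's `statistics.NormalDist` instead of a printed normal table; the function name `z_interval` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(x_bar: float, sigma: float, n: int, confidence: float) -> tuple[float, float]:
    """Confidence interval for a mean when the population SD (sigma) is known."""
    tail = 0.5 * (1 - confidence)        # area in one tail of the bell curve
    z = NormalDist().inv_cdf(1 - tail)   # about 1.96 for 95% confidence
    margin = z * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

low, high = z_interval(x_bar=24.0, sigma=4.0, n=49, confidence=0.95)
print(f"{low:.2f} < mu < {high:.2f}")
```

Calling `z_interval(10.0, 2.0, 49, 0.95)` gives the earlier example's interval, roughly 9.44 to 10.56.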

Confidence Interval: Mean t distribution

 Use if normal population but population standard deviation ( σ) not known

 If you are given the sample standard deviation ( s ), use t table, assuming normal population

 If one population, n-1 degrees of freedom

Confidence Interval: Mean t distribution

µ = x-bar ± t(n−1) s/√n, where t(n−1) is the t value with n-1 degrees of freedom

Confidence Interval: Proportion

 Use if success or failure (ex: defective or not-defective, satisfactory or unsatisfactory)

Normal approximation to binomial ok if (n)(π) > 5 and (n)(1-π) > 5, where

n = sample size

π = population proportion

NOTE: NEVER use the t table if proportion!!

Confidence Interval: Proportion

π = p ± z √[ p(1 − p) / n ]

Ex: 8 defectives out of 100, so p = .08 and n = 100, 95% confidence

.08 ± 1.96 √[ (0.08)(.92) / 100 ] = .08 ± .05

Confidence Interval: Proportion

A sample of 500 people who own their house revealed that 175 planned to sell their homes within five years. Develop a 98% confidence interval for the proportion of people who plan to sell their house within five years.

p = 175/500 = 0.35

.35 ± 2.33 √[ (.35)(.65) / 500 ] = .35 ± .0497
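The proportion interval can be sketched as follows. The function name `proportion_interval` is hypothetical. The z value of 2.33 is the table value used in the example. The n·p check applies the rule of thumb to the sample proportion, which is an assumption because the population proportion is unknown.

```python
from math import sqrt

def proportion_interval(successes: int, n: int, z: float) -> tuple[float, float]:
    """Confidence interval for a population proportion (normal approximation)."""
    p = successes / n
    # Normal approximation needs both expected counts above 5;
    # checked with the sample proportion as a stand-in for pi.
    if n * p <= 5 or n * (1 - p) <= 5:
        raise ValueError("normal approximation to the binomial is not appropriate")
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = proportion_interval(successes=175, n=500, z=2.33)  # 98% confidence
print(f"{low:.4f} < pi < {high:.4f}")
```

The same function with `successes=8, n=100, z=1.96` reproduces the defectives example, about .08 ± .05.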

Interpretation

If 95% confidence, then 95% of all confidence intervals will include the true population parameter

NOTE! Never use the term “probability” when estimating a parameter!! (ex: Do NOT say “Probability that population mean is between 23 and 32 is .95” because a parameter is not a random variable. In fact, the population mean is a fixed but unknown quantity.)

Point vs Interval Estimate

 Point estimate: statistic (single number)

 Ex: sample mean, sample proportion

 Each sample gives different point estimate

 Interval estimate: range of values

 Ex: Population mean = sample mean ± error

 Parameter = statistic ± error

Width of Interval

 Ex: sample mean =23, error = 3

 Point estimate = 23

 Interval estimate = 23 ± 3, or (20,26)

 Width of interval = 26-20 = 6

 Wide interval: Point estimate unreliable

Wide Confidence Interval If

(1) small sample size(n)

(2) large standard deviation

(3) high confidence level (ex: 99% confidence interval wider than 95% confidence interval)

If you want narrow interval, you need a large sample size or small standard deviation or low confidence level.

Top Ten #4

 Linear Regression

Linear Regression

 Regression equation: ŷ = b0 + b1x

ŷ = dependent variable = predicted value

x = independent variable

 b0 = y-intercept = predicted value of y if x=0

b1 = slope = regression coefficient = change in y per unit change in x

Slope vs Correlation

 Positive slope (b1 > 0): positive correlation between x and y (y increases if x increases)

 Negative slope (b1 < 0): negative correlation (y decreases if x increases)

 Zero slope (b1 = 0): no correlation (predicted value for y is mean of y), no linear relationship between x and y

Simple Linear Regression

 Simple: one independent variable, one dependent variable

 Linear: graph of regression equation is straight line

Example

 y = salary (female manager, in thousands of dollars)

 x = number of children

 n = number of observations

Given Data

x         y
2         48
1         52
4         33

Totals: Sum=7 (x), Sum=133 (y), n=3

Slope (b1) = -6.5

 Method of Least Squares formulas not on BUS 302 exam

 b1 = -6.5 given

Interpretation: If one female manager has 1 more child than another, her expected salary is $6,500 lower; that is, the salary of female managers is expected to decrease by 6.5 (in thousands of dollars) per child

Intercept (b0)

b0 = y-bar − b1(x-bar)

x-bar = Σx/n = 7/3 = 2.33

y-bar = Σy/n = 133/3 = 44.33

 b0 = 44.33 – (-6.5)(2.33) = 59.5

 If number of children is zero, expected salary is $59,500

Regression Equation

ŷ = 59.5 − 6.5x

Forecast Salary If 3 Children

59.5 –6.5(3) = 40

$40,000 = expected salary

Standard Error of Estimate

ŷ = forecast = b0 + b1x

error = y − ŷ

S = √[ SSE / (n − 2) ] = √[ Σ(y − ŷ)² / (n − 2) ]

Standard Error of Estimate

(1)=x    (2)=y    (3)=ŷ=59.5-6.5x    (4)=(2)-(3)    (y − ŷ)²
2        48       46.5               1.5            2.25
1        52       53                 -1             1
4        33       33.5               -.5            .25

SSE=3.5

Standard Error of Estimate

S = √[ 3.5 / (3 − 2) ] = √3.5 = 1.9

Actual salary typically $1,900 away from expected salary
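The slope, intercept, forecast, and standard error of estimate for the salary example can be checked with a short sketch. The slides give b1 = -6.5 without the least-squares formulas, so the textbook formula b1 = Σ(x − x-bar)(y − y-bar) / Σ(x − x-bar)² used below is an assumption, not the slides' own method. The variable names are illustrative.

```python
from math import sqrt

children = [2, 1, 4]      # x: number of children
salary = [48, 52, 33]     # y: salary in thousands of dollars
n = len(children)

x_bar = sum(children) / n
y_bar = sum(salary) / n

# Least-squares slope and intercept
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(children, salary))
s_xx = sum((x - x_bar) ** 2 for x in children)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Standard error of estimate: sqrt(SSE / (n - 2))
predicted = [b0 + b1 * x for x in children]
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(salary, predicted))
std_error = sqrt(sse / (n - 2))

forecast_3_children = b0 + b1 * 3
print(round(b1, 2), round(b0, 2), round(sse, 2), round(std_error, 2), round(forecast_3_children, 2))
```

With only three observations there is a single degree of freedom (n − 2 = 1), so the standard error equals the square root of SSE.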

Coefficient of Determination

 R² = % of total variation in y that can be explained by variation in x

 Measure of how close the linear regression line fits the points in a scatter diagram

 R² = 1: max. possible value: perfect linear relationship between y and x (straight line)

 R² = 0: min. value: no linear relationship

Sources of Variation (V)

 Total V = Explained V + Unexplained V

 SS = Sum of Squares = V

 Total SS = Regression SS + Error SS

 SST = SSR + SSE

 SSR = Explained V, SSE = Unexplained V

Coefficient of Determination

 R² = SSR/SST

 R² = 197/200.5 = .98

 Interpretation: 98% of total variation in salary can be explained by variation in number of children

0 ≤ R² ≤ 1

 0: No linear relationship since SSR=0 (explained variation = 0)

 1: Perfect relationship since SSR = SST (unexplained variation = SSE = 0), but does not prove cause and effect

R=Correlation Coefficient

Case 1: slope (b1) < 0

R < 0

 R is negative square root of coefficient of determination

R = −√R²

Our Example

Slope = b1 = -6.5

R² = .98

 R = -.99
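The coefficient of determination and the sign rule for R can be sketched for the salary example. The data and the fitted values b0 = 59.5 and b1 = -6.5 come from the slides; the variable names are illustrative. The unrounded sums come out near 197.2 and 200.7, slightly different from the rounded 197 and 200.5 shown above.

```python
from math import sqrt

children = [2, 1, 4]
salary = [48, 52, 33]
b0, b1 = 59.5, -6.5       # regression coefficients from the example

y_bar = sum(salary) / len(salary)
predicted = [b0 + b1 * x for x in children]

sst = sum((y - y_bar) ** 2 for y in salary)                          # total variation
sse = sum((y - y_hat) ** 2 for y, y_hat in zip(salary, predicted))   # unexplained variation
ssr = sst - sse                                                      # explained variation

r_squared = ssr / sst
# R takes the sign of the slope
r = -sqrt(r_squared) if b1 < 0 else sqrt(r_squared)
print(round(r_squared, 2), round(r, 4))
```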

Case 2: Slope > 0

 R is positive square root of coefficient of determination

 Ex: R² = .49

 R = .70

 R has no interpretation

 R overstates relationship

Caution

 Nonlinear relationship (parabola, hyperbola, etc) can NOT be measured by R²

 In fact, you could get R² = 0 with a nonlinear graph on a scatter diagram

Summary: Correlation Coefficient

 Case 1: If b1 > 0, R is the positive square root of the coefficient of determination

 Ex#1: y = 4+3x, R² = .36: R = +.60

 Case 2: If b1 < 0, R is the negative square root of the coefficient of determination

 Ex#2: y = 80-10x, R² = .49: R = -.70

 NOTE! Ex#2 has stronger relationship, as measured by coefficient of determination

Extreme Values

 R=+1: perfect positive correlation

 R= -1: perfect negative correlation

 R=0: zero correlation

MS Excel Output

Correlation Coefficient (-0.9912): Note that you need to change the sign because the sign of slope (b1) is negative (-6.5)

Coefficient of Determination

Standard Error of Estimate

Regression Coefficient

Top Ten #5

 Expected Value

Expected Value

 Expected Value = E(x) = ΣxP(x) = x1P(x1) + x2P(x2) + …

Expected value is a weighted average, also a long-run average

Example

 Find the expected age at high school graduation if 11 were 17 years old, 80 were 18 years old, and 5 were 19 years old

 Step 1: 11+80+5=96

Step 2:

x     P(x)           xP(x)
17    11/96=.115     17(.115)=1.955
18    80/96=.833     18(.833)=14.994
19    5/96=.052      19(.052)=.988

E(x)= 17.937
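The two steps can be sketched in Python. The sketch assumes exact fractions rather than the three-decimal probabilities in the table, so the result is 17.9375 instead of the rounded 17.937. The variable names are illustrative.

```python
from fractions import Fraction

counts = {17: 11, 18: 80, 19: 5}   # age -> number of graduates
total = sum(counts.values())       # Step 1

# Step 2: P(x) for each age, then the weighted average of x
probabilities = {age: Fraction(count, total) for age, count in counts.items()}
expected_age = sum(age * p for age, p in probabilities.items())

print(total, float(expected_age))
```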

Top Ten #6

 What Distribution to Use?

Use Binomial Distribution If:

 Random variable (x) is number of successes in n trials

 Each trial is success or failure

Independent trials

Constant probability of success ( π) on each trial

 Sampling with replacement (in practice, people may use binomial w/o replacement, but theory is with replacement)

Success vs. Failure

The binomial experiment can result in only one of two possible outcomes:

Male vs. Female

Defective vs. Non-defective

Yes or No

Pass (8 or more right answers) vs. Fail (fewer than 8)

Buy drink (21 or over) vs. Cannot buy drink

Binomial Is Discrete

Integer values

0,1,2,…n

 Binomial is often skewed, but may be symmetric

Normal Distribution

Continuous, bell-shaped, symmetric

Mean=median=mode

Measurement (dollars, inches, years)

Cumulative probability under normal curve: use Z table if you know population mean and population standard deviation

Sample mean: use Z table if you know population standard deviation and either normal population or n > 30

t Distribution

Continuous, mound-shaped, symmetric

Applications similar to normal

More spread out than normal

Use t if normal population but population standard deviation not known

Degrees of freedom = df = n-1 if estimating the mean of one population

t approaches z as df increases

Normal or t Distribution?

 Use t table if normal population but population standard deviation ( σ) is not known

 If you are given the sample standard deviation (s), use t table, assuming normal population

Top Ten #7

 P-value

P-value

 P-value = probability of getting a sample statistic as extreme as (or more extreme than) the sample statistic you got from your sample, given that the null hypothesis is true

P-value Example: one tail test

H0: µ = 40

HA: µ > 40

Sample mean = 43

P-value = P(sample mean > 43, given H0 true)

Meaning: probability of observing a sample mean as large as 43 when the population mean is 40

How to use it: Reject H0 if p-value < α (significance level)

Two Cases

Suppose α = .05

Case 1: suppose p-value = .02, then reject H0 (unlikely H0 is true; you believe population mean > 40)

Case 2: suppose p-value = .08, then do not reject H0 (H0 may be true; you have reason to believe that the population mean may be 40)

P-value Example: two tail test

H0: µ = 70

HA: µ ≠ 70

Sample mean = 72

 If two-tails, then P-value = 2 P(sample mean > 72) = 2(.04) = .08

If α = .05, p-value > α, so do not reject H0

Top Ten #8

 Variation Creates Uncertainty

No Variation

 Certainty, exact prediction

 Standard deviation = 0

 Variance = 0

 All data exactly same

 Example: all workers in minimum wage job

High Variation

 Uncertainty, unpredictable

 High standard deviation

 Ex #1: Workers in downtown L.A. have variation between CEOs and garment workers

 Ex #2: New York temperatures in spring range from below freezing to very hot

Comparing Standard Deviations

 Temperature Example

 Beach city: small standard deviation (single temperature reading close to mean)

 High Desert city: High standard deviation (hot days, cool nights in spring)

Standard Error of the Mean

Standard deviation of sample mean = standard deviation/square root of n

Ex: standard deviation = 10, n =4, so standard error of the mean = 10/2= 5

Note that 5<10, so standard error < standard deviation.

As n increases, standard error decreases.
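The effect of sample size on the standard error can be sketched with a one-line function. The name `standard_error` is hypothetical; the first case matches the example above, and the larger sample sizes are illustrative values.

```python
from math import sqrt

def standard_error(sd: float, n: int) -> float:
    """Standard deviation of the sample mean: sd / sqrt(n)."""
    return sd / sqrt(n)

for n in (4, 25, 100):
    print(n, standard_error(10, n))
```

Quadrupling the sample size only halves the standard error, because n enters through its square root.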

Sampling Distribution

Expected value of sample mean = population mean, but an individual sample mean could be smaller or larger than the population mean

Population mean is a constant parameter, but sample mean is a random variable

Sampling distribution is distribution of sample means

Example

 Mean age of all students in the building is population mean

 Each classroom has a sample mean

 Distribution of sample means from all classrooms is sampling distribution

Central Limit Theorem (CLT)

 If population standard deviation is known, sampling distribution of sample means is approximately normal if n > 30

 CLT applies even if original population is skewed

Top Ten #9

 Population vs. Sample

Population

 Collection of all items (all light bulbs made at factory)

 Parameter: measure of population

(1) population mean (average number of hours in life of all bulbs)

(2) population proportion (% of all bulbs that are defective)

Sample

 Part of population (bulbs tested by inspector)

 Statistic: measure of sample = estimate of parameter

(1) sample mean (average number of hours in life of bulbs tested by inspector)

(2) sample proportion (% of bulbs in sample that are defective)

Top Ten #10

 Qualitative vs. Quantitative

Qualitative

 Categorical data: success vs. failure, ethnicity, marital status, color, zip code, 4 star hotel in tour guide

Qualitative

If you need an “average”, do not calculate the mean

 However, you can compute the mode (“average” person is married, buys a blue car made in America)

Quantitative

 Two cases

Case 1: discrete

Case 2: continuous

Discrete

(1) integer values (0,1,2,…)

(2) example: binomial

(3) finite number of possible values

(4) counting

(5) number of brothers

(6) number of cars arriving at gas station

Continuous

 Real numbers, such as decimal values ($22.22)

 Examples: Z, t

 Infinite number of possible values

 Measurement

 Miles per gallon, distance, duration of time

Graphical Tools

 Pie chart or bar chart: qualitative

 Joint frequency table: qualitative (relate marital status vs zip code)

 Scatter diagram: quantitative (distance from CSUN vs duration of time to reach CSUN)

Hypothesis Testing

Confidence Intervals

 Quantitative: Mean

 Qualitative: Proportion
