Statistical Formulas

BUS 211 Notes

Chapter 1 Introduction and Data Collection

Categorical Variables – responses are a selection from categories, e.g. Gender (male or female), Class (freshman, sophomore, junior, senior), Smoke (yes or no), etc.

Numerical Variables – responses are numbers, e.g. Income ($30,000), Age (25), etc.

Can be Discrete (integer values) or Continuous (values with fractional parts).

Chapter 2 Presenting Data in Tables and Charts

Sort Data – Data | Sort

Stem-and-Leaf Graph – PHStat | Descriptive Statistics | Stem-and-Leaf Display

Frequency Distribution - PHStat | Descriptive Statistics | Frequency Distribution

Set up the classes, then enter the upper limit of each class as the array (bin) for the desired frequency distribution

Be sure to include a label for the array (use Upper Limit)

Relative Frequency distribution – Divide the frequency distribution by the total

Percentage Distribution - Divide the frequency distribution by the total and multiply by 100

Or use Format | Cells… | Percentage

Cumulative Distribution – Sum the frequencies from top to bottom listing each total as you go.
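As a cross-check on the spreadsheet steps above, the relative, percentage, and cumulative distributions can be sketched in a few lines of Python. The function name `distributions` and the class frequencies are illustrative assumptions, not part of the course materials.

```python
from itertools import accumulate


def distributions(frequencies):
    """Return (relative, percentage, cumulative) lists for class frequencies."""
    total = sum(frequencies)
    relative = [f / total for f in frequencies]      # frequency / total
    percentage = [100 * r for r in relative]         # relative * 100
    cumulative = list(accumulate(frequencies))       # running total, top to bottom
    return relative, percentage, cumulative


# Hypothetical frequencies, listed from the lowest class to the highest.
rel, pct, cum = distributions([2, 5, 8, 5])
print(rel)
print(pct)
print(cum)
```

The last cumulative value should always equal the total number of observations, which is a quick way to catch a mis-typed frequency.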

Graphs - PHStat does not work well for most graphs; use the Chart Wizard in Excel instead

Histogram also known as a Vertical Bar Chart or Column Chart -

Set up the frequency distribution then use the midpoints for labels

Double click the chart icon and select a column graph type

Select the frequency without labels as the data

Select the Series tab, mouse into the X-axis label box then select the midpoints

Select Next to insert the title and axis labels and make any other changes

Select Next to pick a location for the chart then Finish

Double click a bar and select Options, set gap width to 0

Polygon also known as a line graph -

Set up the frequency distribution then use the midpoints for labels.

Insert a class with 0 frequency and an appropriate label at the top and the bottom.

Double click the chart icon and select a line graph type

Select the frequency without labels as the data

Select the Series tab, mouse into the X-axis label box then select the midpoints

Select Next to insert the title and axis labels and make any other changes

Select Next to pick a location for the chart then Finish

Ogive also known as a cumulative line graph or cumulative polygon

Set up the cumulative frequency distribution and use the upper class limits for labels.

Insert a class with 0 frequency and an appropriate label at the top but not the bottom.

Double click the chart icon and select a line graph type and complete the steps

XY Scatter - Set up the data in columns with the X values in the first column and the Y values in the second column

Double click the chart icon and select XY Scatter graph

Select both columns as the data, do not select the labels, and complete the steps

Bar Chart - Same as Histogram but for categorical data.

Use the category labels: if they are not numerical values they can be selected with the data.

Pie Chart - Same as above. Be sure to remove the legend, select Data Labels, and check Category name

Pareto Chart - Raw Data: use a line chart on 2 axes, or

Select Descriptive Statistics | OneWay Tables & Charts…

Be sure to select labels as the model will not work otherwise

Check table of frequencies and Pareto Diagram

Bivariate Categorical Tables and Charts - Use PHStat (also available in Excel - Data | Pivot Wizard)

In PHStat select Descriptive Statistics | Two-Way Tables & Charts

Chapter 3 Numerical Descriptive Measures

Use Tools | Data Analysis | Descriptive Statistics and check the Summary statistics box to get the following: for a sample, the mean, median, mode, standard deviation, variance, and range; for a population, the mean, median, mode, and range

Use fx (the individual functions) for the following measures: geometric mean (GEOMEAN), population variance (VARP) and standard deviation (STDEVP), approximate quartiles (QUARTILE), and approximate percentiles (PERCENTILE)

Coefficient of variation: Divide the standard deviation by the mean and multiply by 100%
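A minimal Python sketch of the coefficient of variation, assuming the sample standard deviation is wanted (the same quantity Excel's Descriptive Statistics tool reports). The function name and the data are hypothetical.

```python
import statistics


def coefficient_of_variation(data):
    """Sample standard deviation divided by the mean, expressed as a percent."""
    return statistics.stdev(data) / statistics.mean(data) * 100


# Hypothetical sample.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"CV = {coefficient_of_variation(sample):.2f}%")
```

Because the result is unit-free, it can be used to compare the variability of data sets measured on different scales.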

Box-and-Whisker Plot and Five-Number Summary

PHStat | Descriptive Statistics | Box-and-Whisker Plot then check Five-Number Summary

Gives the exact quartiles not approximations

Coefficient of Correlation: fx (CORREL), or Tools | Data Analysis | Correlation

Chapter 4 Basic Probability

Probability of A or B:

$P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B)$

If A and B are Mutually Exclusive:

$P(A \text{ or } B) = P(A) + P(B)$

Conditional probability of A given B:

$P(A \mid B) = \dfrac{P(A \text{ and } B)}{P(B)}$

Joint Probability of A and B:

$P(A \text{ and } B) = P(A \mid B)\,P(B)$

If A and B are Independent:

$P(A \mid B) = P(A)$

If A and B are Independent:

$P(A \text{ and } B) = P(A)\,P(B)$

Bayes' Theorem

$P(B_i \mid A) = \dfrac{P(A \mid B_i)\,P(B_i)}{P(A \mid B_1)\,P(B_1) + P(A \mid B_2)\,P(B_2) + \cdots + P(A \mid B_k)\,P(B_k)}$
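Bayes' Theorem is easiest to follow with numbers. The sketch below assumes the events B_1..B_k are mutually exclusive and exhaustive; the function name, priors, and likelihoods are made up for illustration.

```python
def bayes_posteriors(priors, likelihoods):
    """Return P(B_i | A) for each i.

    priors[i]      = P(B_i)
    likelihoods[i] = P(A | B_i)
    """
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A | B_i) P(B_i)
    p_a = sum(joint)                                      # denominator: P(A)
    return [j / p_a for j in joint]


# Hypothetical two-event example: P(B1) = 0.4, P(B2) = 0.6,
# P(A | B1) = 0.8, P(A | B2) = 0.3.
print(bayes_posteriors([0.4, 0.6], [0.8, 0.3]))
```

The posteriors always sum to 1, since the denominator is the total of the numerators.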

Chapter 5 Some Important Discrete Probability Distributions

Expected value: $\mu = E(X) = \sum_{i=1}^{N} X_i\,P(X_i)$

Variance: $\sigma^2 = \sum_{i=1}^{N} [X_i - E(X)]^2\,P(X_i)$

Combinations: $\dfrac{n!}{X!\,(n - X)!}$
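The expected value, variance, and combinations formulas can be checked with a short Python sketch. The helper names and the example distribution are assumptions for illustration; `math.comb` is the standard-library function for n choose X.

```python
import math


def expected_value(values, probabilities):
    """E(X) = sum of X_i * P(X_i)."""
    return sum(x * p for x, p in zip(values, probabilities))


def variance(values, probabilities):
    """sigma^2 = sum of [X_i - E(X)]^2 * P(X_i)."""
    mu = expected_value(values, probabilities)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probabilities))


# Hypothetical distribution: number of heads in two fair coin tosses.
xs = [0, 1, 2]
ps = [0.25, 0.50, 0.25]
print(expected_value(xs, ps), variance(xs, ps))

# Combinations: number of ways to choose X = 2 items out of n = 5.
print(math.comb(5, 2))
```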

Binomial distribution: (for an infinite population)

PHStat | Probability & Prob. Distributions | Binomial then check Cumulative Probabilities

Hypergeometric distribution: (for a finite population)

PHStat | Probability & Prob. Distributions | Hypergeometric (no cumulative probabilities available)

Poisson distribution:

PHStat | Probability & Prob. Distributions | Poisson then check Cumulative Probabilities

Chapter 6 The Normal Distribution and Other Continuous Distributions

Normal Distribution

PHStat | Probability & Prob. Distributions | Normal then check the desired calculation

To check the normality assumption construct a stem-and-leaf, box-and-whisker, histogram or a Normal probability plot: PHStat | Probability & Prob. Distributions | Normal Probability Plot

Uniform Distribution

$\mu = \dfrac{a + b}{2}$

$\sigma^2 = \dfrac{(b - a)^2}{12}$

where a and b are the endpoints of the uniform distribution.

Exponential distribution

PHStat | Probability & Prob. Distributions | Exponential

Only returns results for ≤ X. For > X use 1 - probability. For results between two values, find the probability for each and subtract the smaller from the larger.

Sampling distribution of the mean

Calculate the standard deviation of the sampling distribution, also called the Standard Error of the Mean, then use the Normal Distribution calculator if the population is normally distributed, or the sample size is > 30, or the population distribution is symmetrical and the sample size is > 15

Infinite population

$\sigma_{\bar{x}} = \dfrac{\sigma_x}{\sqrt{n}}$

Finite population

$\sigma_{\bar{x}} = \dfrac{\sigma_x}{\sqrt{n}} \sqrt{\dfrac{N - n}{N - 1}}$
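A small Python sketch of the two standard error formulas, with the finite population correction applied only when a population size is supplied. The function name and the numbers are illustrative assumptions.

```python
import math


def standard_error_mean(sigma, n, population_size=None):
    """Standard error of the mean; pass population_size for a finite population."""
    se = sigma / math.sqrt(n)
    if population_size is not None:
        # Finite population correction: sqrt((N - n) / (N - 1)).
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


# Hypothetical values: sigma = 10, n = 25, and N = 101 for the finite case.
print(standard_error_mean(10, 25))
print(standard_error_mean(10, 25, population_size=101))
```

The correction factor is always below 1, so the finite-population standard error is never larger than the infinite-population one.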

Sampling distribution of the proportion:

Calculate the standard deviation of the sampling distribution (Standard Error of the Proportion), then, if $np > 5$ and $n(1 - p) > 5$, use the Normal Distribution calculator PHStat | Probability & Prob. Distributions | Normal

$p_s = \dfrac{X}{n} = \dfrac{\text{number of successes}}{\text{sample size}}$

$p_s$ = sample proportion, $p$ = population proportion

Infinite population

$\sigma_{p_s} = \sqrt{\dfrac{p(1 - p)}{n}}$

Finite population

$\sigma_{p_s} = \sqrt{\dfrac{p(1 - p)}{n}} \sqrt{\dfrac{N - n}{N - 1}}$
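The same idea for proportions, including the np > 5 and n(1 - p) > 5 check that decides whether the normal calculator applies. The function names and sample values are hypothetical.

```python
import math


def standard_error_proportion(p, n, population_size=None):
    """Standard error of the sample proportion, with optional finite correction."""
    se = math.sqrt(p * (1 - p) / n)
    if population_size is not None:
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se


def normal_approximation_ok(p, n):
    """True when both np and n(1 - p) exceed 5."""
    return n * p > 5 and n * (1 - p) > 5


# Hypothetical values: p = 0.5, n = 100.
print(standard_error_proportion(0.5, 100))
print(normal_approximation_ok(0.5, 100))
print(normal_approximation_ok(0.01, 100))
```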

Chapter 7 Confidence Interval Estimation

Interval estimate of the population mean ($\mu_x$) with $\sigma_x$ unknown:

PHStat | Confidence Intervals | Estimate for the Mean, sigma unknown (be sure to check the finite box for finite populations)

Interval estimate of the population proportion:

PHStat | Confidence Intervals | Estimate for the Proportion (be sure to check the finite box for finite populations)

Interval estimate of the population total:

PHStat | Confidence Intervals | Estimate for the Population Total

Sample size (n) for estimating a mean:

PHStat | Sample Size | Determination for the Mean (be sure to check the finite box for finite populations)

Estimate of parameters would be from a preliminary sample

Sample size for estimating a proportion:

PHStat | Sample Size | Determination for the Proportion (be sure to check the finite box for finite populations)

Estimate of True Proportion would be the proportion from a preliminary sample

If a preliminary sample is not available use .5

Chapter 8 Fundamentals of Hypothesis Testing: One-Sample Tests

One Sample numerical data ($\sigma_x$ unknown)

Hypothesis

Two tail test: $H_o: \mu_x = \text{value}$, $H_a: \mu_x \ne \text{value}$

Upper tail test: $H_o: \mu_x \le \text{value}$, $H_a: \mu_x > \text{value}$

Lower tail test: $H_o: \mu_x \ge \text{value}$, $H_a: \mu_x < \text{value}$

Test Statistic: t

Procedure: Summary Data: PHStat | One-Sample Tests | t Test for the Mean, sigma unknown

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked).

Parentheses indicate information to be taken from the problem
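The p-value decision rule is the same for every test in these notes, so it can be captured once. This is only a sketch: the function name `decide` and the default alpha of 0.05 are assumptions, and the p-value itself still comes from PHStat or Excel.

```python
def decide(p_value, alpha=0.05):
    """Apply the p-value decision rule used throughout these notes."""
    if p_value < alpha:
        return "Reject the Hypothesis"
    return "Fail to Reject the Hypothesis"


# Hypothetical p-values read off a PHStat output sheet.
print(decide(0.03))   # below alpha
print(decide(0.05))   # equal to alpha counts as "fail to reject"
```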

One Sample Categorical Data

Hypothesis

Two tail test: $H_o: p = \text{value}$, $H_a: p \ne \text{value}$

Upper tail test: $H_o: p \le \text{value}$, $H_a: p > \text{value}$

Lower tail test: $H_o: p \ge \text{value}$, $H_a: p < \text{value}$

Test Statistic: Z

Procedure

Summary Data: PHStat | One-Sample Tests | Z Test for the Proportion

Raw Data: No Tests available, calculate p and use PHStat

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Parentheses indicate information to be taken from the problem

Chapter 9 Two-Sample Tests

Procedure to determine the proper two sample mean test for numerical data:

Are the data paired?

Yes: Use Paired Data Model

No: Use the F Test to ask whether the $\sigma^2$'s are equal

If Yes: Use $\sigma^2$ Equal Model

If No: Use $\sigma^2$ Unequal Model

Two Sample test of Means with Paired numerical data

Hypothesis

Two tail test: $H_o: \mu_1 = \mu_2$, $H_a: \mu_1 \ne \mu_2$

Upper tail test: $H_o: \mu_1 \le \mu_2$, $H_a: \mu_1 > \mu_2$

Lower tail test: $H_o: \mu_1 \ge \mu_2$, $H_a: \mu_1 < \mu_2$

Procedure

Summary Data: no PHStat calculation available

Raw Data: Data Analysis | t Test: Paired Two Sample for Means

Test Statistic: t

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Interval estimate of the difference

$\bar{D} \pm t_{n-1} \dfrac{S_D}{\sqrt{n}}$

To get t use function TINV(1-Confidence, df)

Use Descriptive Statistics to get $\bar{D}$ and $S_D$

Or PHStat | Confidence Intervals | Estimate for the Mean, sigma unknown - Select the differences as the data
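The paired interval can be reproduced outside Excel with the sketch below. The critical t value is passed in rather than computed, since it would normally come from TINV; the function name, the data, and the value 3.182 (two-tail 95% with 3 degrees of freedom) are illustrative assumptions.

```python
import math
import statistics


def paired_interval(first, second, t_critical):
    """Interval estimate D-bar +/- t * S_D / sqrt(n) for paired data."""
    diffs = [a - b for a, b in zip(first, second)]
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)          # sample standard deviation of the differences
    margin = t_critical * s_d / math.sqrt(len(diffs))
    return d_bar - margin, d_bar + margin


# Hypothetical paired measurements, n = 4, so df = n - 1 = 3.
low, high = paired_interval([12, 15, 11, 18], [10, 12, 10, 14], t_critical=3.182)
print(round(low, 3), round(high, 3))
```

If the interval contains 0, the data are consistent with no difference between the paired means at that confidence level.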

Two Sample test of Variances with numerical data

Hypothesis

Two tail test: $H_o: \sigma_1^2 = \sigma_2^2$, $H_a: \sigma_1^2 \ne \sigma_2^2$

Upper tail test: $H_o: \sigma_1^2 \le \sigma_2^2$, $H_a: \sigma_1^2 > \sigma_2^2$

Lower tail test: $H_o: \sigma_1^2 \ge \sigma_2^2$, $H_a: \sigma_1^2 < \sigma_2^2$

Procedure

Summary Data: PHStat | Two-Sample Tests | F Test for the Difference in Two Variances

Raw data: Data Analysis | F Test Two Sample for Variances (do not use; it only gives the lower tail value)

Test Statistic: F

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Two Sample test of Means with numerical data: $\sigma^2$'s not proven unequal with the F test

Hypothesis

Two tail test: $H_o: \mu_1 = \mu_2$, $H_a: \mu_1 \ne \mu_2$

Upper tail test: $H_o: \mu_1 \le \mu_2$, $H_a: \mu_1 > \mu_2$

Lower tail test: $H_o: \mu_1 \ge \mu_2$, $H_a: \mu_1 < \mu_2$

Procedure

Summary Data: PHStat | Two-Sample Tests | t Test for Differences in Two Means

Raw Data: Data Analysis | t Test: Two Sample Assuming Equal Variances

Test Statistic: t

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Interval estimate of the difference

$(\bar{X}_1 - \bar{X}_2) \pm t_{n_1 + n_2 - 2} \sqrt{S_p^2 \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}$

where $S_p^2$ is the pooled variance.

To get t use function TINV(1-Confidence, df)

Two Sample test of Means with numerical data: $\sigma^2$'s proven unequal with the F test

Hypothesis

Two tail test: $H_o: \mu_1 = \mu_2$, $H_a: \mu_1 \ne \mu_2$

Upper tail test: $H_o: \mu_1 \le \mu_2$, $H_a: \mu_1 > \mu_2$

Lower tail test: $H_o: \mu_1 \ge \mu_2$, $H_a: \mu_1 < \mu_2$

Procedure

Summary Data: Use spreadsheet downloaded from the Homework web page

Raw Data: Data Analysis | t Test: Two Sample Assuming Unequal Variances

Test Statistic: t

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Interval estimate of the difference

$CI = (\bar{X}_1 - \bar{X}_2) \pm t \sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$

To get t use function TINV(1-Confidence, df)

Two Sample test of a Proportion with categorical data

$p_i = \dfrac{X_i}{n_i} = \dfrac{\text{Number in the sample with the desired characteristic}}{\text{Total number of items in the sample}}$

Hypothesis

Two tail test: $H_o: p_1 = p_2$, $H_a: p_1 \ne p_2$

Upper tail test: $H_o: p_1 \le p_2$, $H_a: p_1 > p_2$

Lower tail test: $H_o: p_1 \ge p_2$, $H_a: p_1 < p_2$

Procedure: PHStat | Two-Sample Tests | Z Test for the Differences in Two Proportions

Test Statistic: Z

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Interval estimate of the difference

$(p_{s_1} - p_{s_2}) \pm Z \sqrt{\dfrac{p_{s_1}(1 - p_{s_1})}{n_1} + \dfrac{p_{s_2}(1 - p_{s_2})}{n_2}}$

To get Z use function NORMSINV(two tail) where two tail = Confidence + (1-Confidence)/2
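The interval for the difference in two proportions can be sketched in Python, using `statistics.NormalDist().inv_cdf` in place of NORMSINV with the same "two tail" argument described above. The function name and the sample counts are hypothetical.

```python
import math
from statistics import NormalDist


def proportion_difference_interval(x1, n1, x2, n2, confidence=0.95):
    """Interval estimate of p1 - p2 from two independent samples."""
    p1, p2 = x1 / n1, x2 / n2
    # Same argument as NORMSINV(Confidence + (1 - Confidence) / 2).
    z = NormalDist().inv_cdf(confidence + (1 - confidence) / 2)
    margin = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff - margin, diff + margin


# Hypothetical samples: 60 of 100 versus 50 of 100.
low, high = proportion_difference_interval(60, 100, 50, 100)
print(round(low, 4), round(high, 4))
```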

Chapter 10 Analysis of Variance (Multi (c) Sample tests with numerical data)

Equality of Variances

Hypothesis

$H_o: \sigma_1^2 = \sigma_2^2 = \sigma_3^2$

$H_a$: not all $\sigma^2$'s are equal (a two tail test)

Procedure: Raw data: PHStat | Multiple-Sample Tests | Levene's Test

Test Statistic: F

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

One Factor ANOVA

Hypothesis

$H_o: \mu_1 = \mu_2 = \mu_3 = \ldots = \mu_c$

$H_a$: not all $\mu$'s are equal

c = the number of populations

Procedure: Tools | Data Analysis | Anova: Single Factor

Test Statistic: F from the computer printout. P-value = The Probability of F

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Tukey's multiple comparison method: (determines which of the c means are different from each other).

Procedure: PHStat | Multiple-Sample Tests | Tukey-Kramer Procedure

Input: Q found in the Studentized Range Table where column = c and row = n-c

c = number of groups, n = total number of data points in all groups

Test Statistic: Critical Range

Decision Rule: If the absolute difference between the means of any pair is greater than the critical range, the pair is different.

Two Factor With Replication

Hypothesis

$H_{o1}: \mu_{A1} = \mu_{A2} = \mu_{A3} = \ldots = \mu_{Ar}$, $H_{a1}$: not all $\mu_A$'s are equal

$H_{o2}: \mu_{B1} = \mu_{B2} = \mu_{B3} = \ldots = \mu_{Bc}$, $H_{a2}$: not all $\mu_B$'s are equal

$H_{o3}$: No Interaction, $H_{a3}$: Interaction

r = the number of levels in Factor A, c = the number of levels in Factor B

Procedure: Tools | Data Analysis | Anova: Two Factor With Replication

Test Statistic: F from the computer printout. p-value = The Probability of F

For differences in rows see p-value for the Sample row of the ANOVA

For differences in columns see p-value for the Columns row of the ANOVA

For interaction between factors see p-value for the Interaction row of the ANOVA

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

$H_1$: If rejected – There is sufficient evidence of a difference in (factor A)

$H_2$: If rejected – There is sufficient evidence of a difference in (factor B)

$H_3$: If rejected – There is sufficient evidence of an interaction term

If not rejected – There is not sufficient evidence to make a conclusion about …

Tukey's multiple comparison method for Two Factor ANOVA with replication:

No spreadsheet; hand calculate with the following formulas:

critical range A $= Q \sqrt{\dfrac{MSW}{c\,n'}}$

MSW from ANOVA MS Within

Q table column is r, the number of levels in Factor A

Q table row is rc(n'-1) where c is the number of levels in Factor B, and n' is the number of replications

critical range B $= Q \sqrt{\dfrac{MSW}{r\,n'}}$

MSW from ANOVA MS Within

Q table column is c, the number of levels in Factor B

Q table row is rc(n'-1) where r is the number of levels in Factor A, and n' is the number of replications
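Since there is no spreadsheet for this step, a short Python sketch may help with the hand calculation. Q still has to be looked up in the Studentized Range table; the function name and all numbers below are made up for illustration.

```python
import math


def critical_range(q, msw, other_factor_levels, replications):
    """Q * sqrt(MSW / (levels of the other factor * n')).

    For critical range A pass c (levels in Factor B);
    for critical range B pass r (levels in Factor A).
    """
    return q * math.sqrt(msw / (other_factor_levels * replications))


# Hypothetical inputs: Q = 3.0 from the table, MSW = 8.0, c = 2, n' = 4.
cr_a = critical_range(3.0, 8.0, other_factor_levels=2, replications=4)
print(cr_a)

# A pair of Factor A means differs if their absolute difference exceeds cr_a.
print(abs(12.5 - 8.0) > cr_a)
```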

Chapter 11 Chi-Square Tests and Nonparametric Tests

Two Sample test of a Proportion with categorical data (Alternate Procedure)

Hypothesis

$H_o: p_1 = p_2$, $H_a: p_1 \ne p_2$ (No < or > Hypothesis)

Procedure: PHStat | Two-Sample Tests | Chi-Square Test for the Differences in Two Proportions

Test Statistic: $\chi^2$

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Multi (c) Sample test of Proportions with categorical data

Hypothesis

$H_o: p_1 = p_2 = p_3 = \ldots = p_c$

$H_a$: not all p's are equal

c = the number of samples

Procedure: PHStat | Multiple-Sample Tests | Chi-Square Test

Test Statistic: $\chi^2$

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Be sure to check the box for the Marascuilo Procedure to determine which proportions are different.

$\chi^2$ Test of Independence

Hypothesis

$H_o$: Two categorical variables are independent

$H_a$: Two categorical variables are related

Procedure: PHStat | Multiple-Sample Tests | Chi-Square Test

Test Statistic: $\chi^2$

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that the variables are related

If not rejected – There is not sufficient evidence that the variables are related.

Two Sample test of Medians with numerical data

Hypothesis

Two tail test: $H_o: M_1 = M_2$, $H_a: M_1 \ne M_2$

Upper tail test: $H_o: M_1 \le M_2$, $H_a: M_1 > M_2$

Lower tail test: $H_o: M_1 \ge M_2$, $H_a: M_1 < M_2$

Procedure

Raw Data: PHStat | Two-Sample Tests | Wilcoxon Rank Sum Test

Summary Data: No Tests available.

Test Statistic: Z

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Kruskal-Wallis Rank Test for Differences Between c Medians

Hypothesis

$H_o: M_1 = M_2 = M_3 = \ldots = M_c$

$H_a$: Not all $M_j$ are equal (j = 1, 2, …, c)

Procedure

Raw Data: PHStat | Multiple-Sample Tests | Kruskal-Wallis Rank Test

Summary Data: No PHStat or Excel calculation available

Test Statistic: H

Decision Rule

If the p-value is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion

If rejected – There is sufficient evidence that (Question asked)

If not rejected – There is not sufficient evidence that (Question asked)

Chapter 12 Simple Linear Regression

Linear Regression Model: relationship represented as $\hat{Y}_i = b_0 + b_1 X_i$

Determining if the linear model is significant

Hypothesis: $H_o: \beta_1 = 0$, $H_a: \beta_1 \ne 0$

Procedure: PHStat | Regression | Simple Linear Regression or Tools | Data Analysis | Regression

Test Statistic: F

Decision Rule: If the Significance F (a p-value) is less than alpha, Reject the Hypothesis

If the Significance F is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion: If rejected – There is sufficient evidence to accept the linear regression model

If not rejected – There is not sufficient evidence of a linear model; end the analysis

Confidence Interval estimate of $\beta_1$: found on the ANOVA output. See the independent variable line under Lower 95% and Upper 95%.

Confidence interval estimates for the dependent variable: be sure to check the input box and insert a value.

Durbin Watson statistic for autocorrelation: be sure to check the input box.

Additional measures from the regression

Standard error of the estimate: a measure of variability of the data around the regression line

Coefficient of determination ($r^2$): measures the percent of the variation in the dependent variable Y that is explained by the independent variable X in the regression model. Shows the strength of the relationship.

Adjusted $r^2$: modifies the $r^2$ for the number of explanatory variables in the model and the sample size

Sample coefficient of correlation (r): estimator of $\rho$

Checking the Assumptions of regression:

1. Normality - to check normality analyze the normal probability plot of the sample values.

2. Homoscedasticity - variation around the regression line must be constant for all values of X. To check, analyze the residual plot for a horn shape.

3. Independent residuals - to check analyze residual plot for randomness.
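For readers who want to see what the regression tool computes for the simple linear model in this chapter, here is a minimal least-squares sketch in plain Python. It only reproduces b0, b1, and r squared (not the F test or intervals), and the function name and data are illustrative assumptions.

```python
def least_squares(x, y):
    """Return (b0, b1, r_squared) for the fitted line Y-hat = b0 + b1 * X."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    b1 = sxy / sxx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    r_squared = sxy ** 2 / (sxx * syy)  # coefficient of determination
    return b0, b1, r_squared


# Hypothetical data lying exactly on the line Y = 1 + 2X.
b0, b1, r2 = least_squares([1, 2, 3, 4], [3, 5, 7, 9])
print(b0, b1, r2)

# Residuals (Y - Y-hat) are what the assumption checks above examine.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip([1, 2, 3, 4], [3, 5, 7, 9])]
print(residuals)
```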

Chapter 13 Introduction to Multiple Regression

Multiple Regression Model: represented as

$\hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{3i} + \cdots + b_k X_{ki}$

Determining if the multiple linear model is significant

Hypothesis: $H_o: \beta_1 = \beta_2 = \ldots = \beta_k = 0$ where k equals the number of variables

$H_a$: Not all $\beta$'s = 0

Procedure: PHStat | Regression | Multiple Regression or Tools | Data Analysis | Regression

Test Statistic: F

Decision Rule: If the Significance F (a p-value) is less than alpha, Reject the Hypothesis

If the Significance F is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion: If rejected – There is sufficient evidence that all or part of the model is significant (proceed with the analysis)

If not rejected – There is not sufficient evidence of a linear model (end the analysis)

Determining which variables are significant

Hypothesis:

$H_o: \beta_1 = 0$, $H_a: \beta_1 \ne 0$

$H_o: \beta_2 = 0$, $H_a: \beta_2 \ne 0$

…

$H_o: \beta_k = 0$, $H_a: \beta_k \ne 0$

Procedure: PHStat | Regression | Multiple Regression or Tools | Data Analysis | Regression

Test Statistic: t

Decision Rule: If the p-value of the t statistic is less than alpha, Reject the Hypothesis

If the p-value is greater than or equal to alpha, Fail to Reject the Hypothesis

Conclusion: If rejected – There is sufficient evidence that (variable) is significant

If not rejected – There is not sufficient evidence to prove (variable) is significant

Confidence Interval estimates of $\beta$'s: found on the ANOVA output. See the variable lines under Lower 95% and Upper 95%.

Confidence interval estimates for the dependent variable: be sure to check the input box and insert a value.

Durbin Watson statistic for autocorrelation: be sure to check the input box.

Additional measures from the regression

Standard error of the estimate: a measure of variability of the data around the regression line

Coefficient of multiple determination ($r^2$): measures the percent of the variation in the dependent variable Y that is explained by the independent variables in the regression model. Shows the strength of the relationship.

Adjusted $r^2$: modifies the $r^2$ for the number of explanatory variables in the model and the sample size

Sample coefficient of correlation (r): estimator of $\rho$

Coefficient of partial determination ($r^2_Y$): contribution of each variable holding the others constant

Dummy Variables Model - used to include categorical variables.

Prepare a data matrix with Y, X1..Xn and dummy variables, with 1 representing the characteristic and 0 its absence

Y      X1    … Xn    XD1    … XDn
3.8    3     .       1      0
4.2    2     .       0      1
.      .     .       .      .

Follow the usual multiple regression procedures. Then test for an interaction term between numerical and categorical variables. If the interaction term is significant you cannot use the dummy variable.

Dummy Variables Interactions Model

Used to check on the interaction between the numerical and categorical variables.

To test for an interaction term prepare a data matrix with Y, X1..n, the dummy variables, and the product of the dummy variable and the numerical variables

Y      X1..n    XD1..n    X1 * XD
3.8    3        1         3
4.2    2        0         0
.      .        0         .
.      .        1         .

Include all possible combinations

Follow the usual multiple regression procedures.

Independent Variables Interactions Model - used to check on interaction between numerical variables.

To test for an interaction term prepare a data matrix with Y, X1..Xn and the product of all pairs of numerical variables (Xa * Xb)

Y      X1    X2    Xa * Xb
3.8    3     11    33
4.2    2     15    30
.      .     12    .
.      .     18    .

Include all possible combinations

Follow the usual multiple regression procedures.

Chapter 14 Multiple Regression Model Building

The Quadratic model

$\hat{Y}_i = b_0 + b_1 X_{1i} + b_{11} X_{1i}^2$

$b_0$ = estimated Y intercept

$b_1$ = estimated linear effect on Y

$b_{11}$ = estimated curvilinear effect on Y

Prepare a data matrix with the dependent variable Y and the independent variables X and X^2

Y      X    X^2
3.8    3    9
4.2    2    4

Do a multiple regression with X and X^2 as the independent variables.

The Square-Root Transformation Model

$\hat{Y}_i = b_0 + b_1 \sqrt{X_{1i}}$

Prepare a data matrix with the dependent variable Y and the independent variable square root of X

Y      X    sqrt(X)
3.8    9    3
4.2    4    2

Do a simple linear regression with the square root of X as the independent variable.

The Log Multiplicative Model

$\hat{Y}_i = b_0 X_{1i}^{b_1} X_{2i}^{b_2}$

$\log \hat{Y}_i = \log b_0 + b_1 \log X_{1i} + b_2 \log X_{2i}$

Prepare a data matrix with Y, X1, X2 and their logs: use =Log(..)

Y      X1    X2    Log Y      Log X1     Log X2
3.8    3     9     .579784    .477121    .954243
4.2    5     8     .623249    .69897     .90309

Do a multiple regression with Log Y as the dependent variable and Log X1 and Log X2 as the independent variables.

To convert your predictions to the original data range take 10 to the power Log Y
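The log columns of the data matrix above, and the conversion back to the original scale, can be verified with a brief Python sketch. `math.log10` plays the role of Excel's =Log(..); the helper name `to_original_scale` is an assumption.

```python
import math

# (Y, X1, X2) rows taken from the example data matrix.
rows = [(3.8, 3, 9), (4.2, 5, 8)]

# Common (base-10) logs of every column, rounded to six places like the table.
log_rows = [tuple(round(math.log10(v), 6) for v in row) for row in rows]
for row in log_rows:
    print(row)


def to_original_scale(log_y):
    """Undo the base-10 log: take 10 to the power Log Y."""
    return 10 ** log_y


print(round(to_original_scale(0.579784), 4))
```

The same back-conversion applies to any prediction produced by the log model, since the regression output is in log units.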

The Natural Log Exponential Model

$\hat{Y}_i = e^{\,b_0 + b_1 X_{1i} + b_2 X_{2i}}$

$\ln \hat{Y}_i = b_0 + b_1 X_{1i} + b_2 X_{2i}$

Prepare a data matrix with Y, X1, X2 and their logs: use =ln(..)

Y      X1    X2    Ln Y        Ln X1       Ln X2
3.8    3     9     1.335001    1.098612    2.197225
4.2    5     8     1.435085    1.609438    2.079442

Do a multiple regression with Ln Y as the dependent variable and X1 and X2 as the independent variables.

To convert your predictions to the original data range take e to the power Ln Y or use =Exp(Ln Y)

Model Building

Stepwise Regression – limited evaluation of alternative models

Procedure: PHStat | Regression | Stepwise Regression

Best-Subsets – evaluates all possible subsets of the independent variables.

Procedure: PHStat | Regression | Best Subsets

1. Fit a model with all the independent variables and check the VIF box.

2. If all VIF's are ≤ 10 proceed to the next step; otherwise eliminate the variable with the highest VIF and go back to step 1.

3. Sort the results by the adjusted $r^2$ and select the model with the fewest variables if the $r^2$'s are close. Or sort the results by $C_p$, pick the models with $C_p$ close to or less than k+1 (k = total number of variables), and pick the best.

Chapter 15 Time-Series Forecasting and Index Numbers

Time-Series models use the same least squares technique as regression models. Only the data is different.

The Linear model

$\hat{Y}_i = b_0 + b_1 X_i$

$b_0$ = estimated Y intercept

$b_1$ = estimated linear effect on Y

Prepare a data matrix with the dependent variable Y and the independent variable X

Y      X
3.8    1
4.2    2

Do a simple linear regression with X as the independent variable.

Forecast by plugging the next X value into the linear equation

The Quadratic model

$\hat{Y}_i = b_0 + b_1 X_i + b_{11} X_i^2$

$b_0$ = estimated Y intercept

$b_1$ = estimated linear effect on Y

$b_{11}$ = estimated curvilinear effect on Y

Prepare a data matrix with the dependent variable Y and the independent variables X and X^2

Y      X    X^2
3.8    1    1
4.2    2    4

Do a multiple regression with X and X^2 as the independent variables.

Forecast by plugging the next X value into the quadratic equation

The Exponential model

$\hat{Y}_i = b_0 b_1^{X_i}$

$b_0$ = estimated Y intercept

$b_1$ = the compound growth factor, where ($b_1$ - 1)*100% is the compound growth rate

Prepare a data matrix with the independent variable X and the common log of the dependent variable Y

X    Y      log Y
1    3.8    .5798
2    4.2    .6232
…

The data for the independent variable is often time series data where X is the year or month.

Do a linear regression with X as the independent variable, and the log of Y as the dependent variable.

This provides the following transformation: $\log \hat{Y}_i = \log b_0 + X_i \log b_1$

Forecast by plugging the X value into this linear equation, yielding the log of Y.

Take 10 to the power (log of Y) to get the antilog, which is the actual Y forecast: $\hat{Y} = 10^{\log \hat{Y}}$

or

Take 10 to the power (log of $b_0$) to get the antilog, which is the actual $b_0$, then

Take 10 to the power (log of $b_1$) to get the antilog, which is the actual $b_1$, and use the exponential equation

$\hat{Y}_i = b_0 b_1^{X_i}$
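The exponential trend procedure above (regress log Y on X, then take antilogs) can be sketched end to end in Python. The function names and the doubling series are hypothetical; a spreadsheet regression on the log column should give matching coefficients.

```python
import math


def fit_exponential_trend(x, y):
    """Fit log10(Y) = log b0 + X * log b1 by least squares; return (b0, b1)."""
    log_y = [math.log10(v) for v in y]
    n = len(x)
    x_bar = sum(x) / n
    ly_bar = sum(log_y) / n
    slope = (sum((xi - x_bar) * (li - ly_bar) for xi, li in zip(x, log_y))
             / sum((xi - x_bar) ** 2 for xi in x))
    intercept = ly_bar - slope * x_bar
    # Antilogs convert the fitted log coefficients back to b0 and b1.
    return 10 ** intercept, 10 ** slope


def forecast(b0, b1, x):
    """Exponential equation Y-hat = b0 * b1 ** X."""
    return b0 * b1 ** x


# Hypothetical series that doubles every period.
b0, b1 = fit_exponential_trend([1, 2, 3, 4], [2, 4, 8, 16])
print(round(b0, 6), round(b1, 6))
print(round((b1 - 1) * 100, 2), "% compound growth rate")
print(round(forecast(b0, b1, 5), 6))
```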

Autoregressive Models

First-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + \delta_i$

Second-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + \delta_i$

Third-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + A_3 Y_{i-3} + \delta_i$

p th-Order autoregressive model: $Y_i = A_0 + A_1 Y_{i-1} + A_2 Y_{i-2} + A_3 Y_{i-3} + \cdots + A_p Y_{i-p} + \delta_i$

Autoregressive models lag the dependent variable data by one or more periods to provide a weighted moving average of the previous values of the variable Y.

Third-Order autoregressive model

Prepare a data matrix with the dependent variable Y and lagged versions of Y as the independent variables

X    Y      Y lag 1    Y lag 2    Y lag 3
1    3.8
2    4.2    3.8
3    3.0    4.2        3.8
4    4.6    3.0        4.2        3.8
.    5.0    4.6        3.0        4.2
.    .      .
.    .      .

For this type of autoregressive analysis the X variable is not needed. The dependent variable is Y and the independent variables are the lagged versions of Y

For first-order autoregressive, do a simple linear regression with Y lag 1 as the independent variable and Y as the dependent. Forecast by plugging the last Y value into the equation. Forecast additional periods into the future by using the most recently forecast value as the independent variable.

For second-order autoregressive, do a multiple regression with Y lag 1 and Y lag 2 as the independent variables. Forecast by plugging the last two Y values into the equation. Forecast additional periods into the future by using the most recently forecast values and previous values of Y as needed for the independent variables.

For third-order autoregressive, do a multiple regression with Y lag 1, Y lag 2, and Y lag 3 as the independent variables. Forecast by plugging the last three Y values into the equation. Forecast additional periods into the future by using the most recently forecast values and previous values of Y as needed for the independent variables.
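Building the lagged columns by hand is error-prone, so here is a small Python sketch that produces the usable rows of the data matrix for any order. The function name `lag_matrix` is an assumption; the series is the one from the example table.

```python
def lag_matrix(y, order):
    """Return (Y_i, [Y_{i-1}, ..., Y_{i-order}]) for every row with all lags available."""
    rows = []
    for i in range(order, len(y)):
        lags = [y[i - k] for k in range(1, order + 1)]
        rows.append((y[i], lags))
    return rows


series = [3.8, 4.2, 3.0, 4.6, 5.0]

# Third-order model: the first three periods are dropped because their lags are incomplete.
for dependent, lags in lag_matrix(series, order=3):
    print(dependent, lags)
```

Each printed row lines up with a complete row of the table above: the value of Y followed by Y lag 1, Y lag 2, and Y lag 3.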

Choosing the Best Model

Choose the model with the best adjusted $r^2$; where the $r^2$'s are close, choose the simplest model.
