BA6000 Class Notes - Kennesaw State University

INPO Statistics Workshop:
Notes and Exercises
using EXCEL and SAS
Developed by Jennifer Lewis Priestley, Ph.D.
Kennesaw State University
Statistics Workshop Topics
MODULE ONE: Concept Review
• Review of Statistical Concepts
• Data Analysis using EXCEL
MODULE TWO: Inferential testing using EXCEL and SAS
• Confidence Intervals
• Ttests
• ANOVA
• Chi-Square
MODULE THREE: Predictive Modeling using EXCEL and SAS
• Regression Analysis
• Logistic Analysis
• Discriminant Analysis
© 2006 Jennifer Priestley, Ph.D. Kennesaw State University Department of Mathematics and Statistics
MODULE 1: Review of Statistical Concepts
Prior to analysis, a determination must be made of the type
of variables in question. Variable type will, in many cases,
dictate the analysis options.
Variable types for review:
• Qualitative
– nominal (e.g., gender, race)
– ordinal (e.g., rankings, Likert data*)
• Quantitative
– interval (e.g., temperature)
– ratio (e.g., weight, height)
* Mathematicians and Statisticians consider Likert data to be qualitative and therefore restrict its
use to qualitative techniques such as Chi-square analysis. However, in practice, most people treat
Likert data as quantitative and utilize quantitative techniques.
MODULE 1: Review of Statistical Concepts
Descriptions of data typically include:
• Measurements of Central Tendency
– Mean, Median, Mode (do you know when to use each?)
• Measurements of Dispersion
– Standard Deviation and Variance
Outlier Detection:
• For near bell-shaped data
– Use the 3-Sigma (Empirical) Rule: any value that is more
than 3 standard deviations above or below the mean is an outlier
• For Skewed Data
– Use Tukey’s Rule: any value that is more than one step
below Q1 or more than one step above Q3 is an outlier, where a step =
1.5*IQR and IQR = Q3 - Q1
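The two outlier rules can be sketched in a few lines of Python. This is only an illustration outside the workshop's EXCEL/SAS toolset: the function names and the delivery-time values are made up, and the quartiles use the "inclusive" interpolation method, which may differ slightly from other packages.

```python
import statistics

def three_sigma_outliers(values):
    """Values more than 3 sample standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > 3 * sd]

def tukey_outliers(values):
    """Values more than one step (1.5 * IQR) below Q1 or above Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    step = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - step or v > q3 + step]

# Hypothetical, skewed delivery times in days (not the workshop data)
delivery_days = [2, 3, 3, 4, 4, 4, 5, 5, 6, 30]

print("Tukey:", tukey_outliers(delivery_days))
print("3-sigma:", three_sigma_outliers(delivery_days))
```

On this skewed sample Tukey's Rule flags the value 30 while the 3-Sigma Rule flags nothing, because the extreme value inflates the standard deviation. That is one reason the rule based on quartiles is preferred for skewed data.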
MODULE 1: Review of Statistical Concepts
The Central Limit Theorem forms the basis for why inferential
statistics (versus descriptive statistics) is possible.
Prior to reviewing the Theorem, pull up this site:
http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/index.html
MODULE 1: Review of Statistical Concepts
Important concepts to remember about the Central Limit Theorem:
• The distribution of sample means will, as the sample size
increases, approach a normal distribution;
• The mean of all sample means equals the
population mean;
• The std of all sample means (the standard error) is the std of the
population divided by the SQRT of the sample size;
• If the population is NOT normally distributed, sample sizes
should be greater than 30 (a common rule of thumb) to assume normality of the sample mean;
• If the population IS normally distributed, samples can be
of any size to assume normality (although greater than 30
is always preferred).
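A small simulation can make these points concrete. The sketch below is in Python rather than EXCEL or SAS, and the population (a fair six-sided die), the sample size of 36 and the 5,000 repetitions are all arbitrary assumptions chosen for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical, clearly non-normal population: the faces of a fair die
population = [1, 2, 3, 4, 5, 6]
pop_mean = statistics.mean(population)    # 3.5
pop_sd = statistics.pstdev(population)    # population standard deviation

n = 36              # size of each sample
num_samples = 5000  # how many samples we draw

# Draw many samples (with replacement) and record each sample mean
sample_means = [
    statistics.mean(random.choices(population, k=n))
    for _ in range(num_samples)
]

print("population mean:      ", pop_mean)
print("mean of sample means: ", statistics.mean(sample_means))
print("std of sample means:  ", statistics.stdev(sample_means))
print("pop std / SQRT(n):    ", pop_sd / n ** 0.5)
```

The mean of the sample means should land very close to the population mean, and their standard deviation should land close to the population std divided by SQRT(n), even though the population itself is flat rather than bell shaped.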
MODULE 1: Review of Statistical Concepts
A few points to remember about the Normal Distribution:
• Bell shaped and symmetric about the mean μ.
• Mean = μ, Median = μ, Mode = μ.
• The area under the normal curve below μ is .5.
• Probability that a Normal Random Variable Outcome:
– Lies within +/- 1 std dev of the mean is .6826
– Lies within +/- 2 std dev of the mean is .9544
– Lies within +/- 3 std dev of the mean is .9974
• For all other probabilities, convert the relevant
observation to a Z-score: Z=(x- μ)/ σ
• Any observation that has a Z-score greater than 2 in absolute value is
often treated as a potential statistical outlier…since values that far from
the mean occur less than 5% of the time.
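As a quick check of the probabilities above, here is a Python sketch using the standard library's normal distribution. The example observation (x = 130 with mean 100 and std dev 15) is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_score(x, mu, sigma):
    """Z = (x - mu) / sigma"""
    return (x - mu) / sigma

def prob_within(k):
    """Probability that a normal outcome lies within +/- k std dev of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within +/- {k} std dev: {prob_within(k):.4f}")

# Hypothetical observation: x = 130 from a population with mean 100, std dev 15
z = z_score(130, 100, 15)
upper_tail = 1 - std_normal.cdf(z)
print("Z =", z, " P(Z > z) =", round(upper_tail, 4))
```

The printed probabilities agree with the figures listed above to about three decimal places; the small differences come from table rounding.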
MODULE 1: Review of Statistical Concepts
Hypothesis Testing:
Decision | Ho is true | Ho is false
Reject Ho | TYPE I Error | Valid Decision
Do not reject Ho | Valid Decision | TYPE II Error
MODULE 1: Review of Statistical Concepts
1. Statement of Hypotheses
– H0: Null Hypothesis – opposite of Alternative Hypothesis
– H1: Alternative Hypothesis – what we are trying to prove
– Evaluate Type I & Type II Errors
2. Set Significance level, α
– Standard of Proof, or Level of Risk
– Represents Probability of Type I Error
3. Calculate the test statistic from the sample.
4. Calculate the p-value (strength of the evidence)
– If the p-value < α, reject H0 in favor of H1.
– If the p-value > α, do not reject H0.
MODULE 1: Data Analysis Using EXCEL
In this section, we will use EXCEL to execute some of the
most common types of univariate/bivariate and
multivariate data analysis:
• Descriptive Statistics
• Histograms
• Scatterplots and Charts
• Pivot Tables
• Using Formulas (fx)
• Look Up Tables
• Lagniappe
MODULE 1: COMP1 Dataset
The dataset used throughout this workshop is the COMP1
dataset, with 14 variables and 100 observations.
• Speed = Amount of time it takes to deliver a product (in days) once the order has been received
• PriceLv = Perceived level of price charged by suppliers
• PriceFlx = Perceived willingness of COMP1 managers to negotiate price
• Man_Imag = Overall image of manufacturer
• Service = Overall level of service necessary to maintain a satisfactory relationship with customer
• Sal_Imag = Overall perceived image of salesforce
• Quality = Perceived level of product quality
• Size = Large (1) or Small (0)
• Usage = Percentage of total product purchased from COMP1
• Satisf = How satisfied purchaser is with COMP1
• SpecBuy = Extent to which a purchaser evaluates each purchase separately (1 = each purchase evaluated separately, 0 = lot buying)
• Procure = Centralization of purchase decisions (1 = Centralized, 0 = decentralized)
• Ind_Type = Classification of Industry affiliation (1 = Industry A, 0 = other)
• Buy_Sit = Type of buying situation (NEW = New, MOD = Modified purchase, REP = Repeat Purchase)
MODULE 1: Data Analysis Using EXCEL
Using the COMP1 dataset, determine the following
descriptive statistics for each quantitative variable:
• The most appropriate measurement of central tendency;
• Two measurements of dispersion;
• For one variable, identify if any outliers exist.
Determine these statistics using the f(x) options:
=AVERAGE, =MEDIAN, =MODE, =STDEV, =VAR
MODULE 1: Data Analysis Using EXCEL
Using the COMP1 dataset, determine the following
descriptive statistics for each quantitative variable:
• The most appropriate measurement of central tendency;
• Two measurements of dispersion;
• For one variable, identify if any outliers exist.
This time, use the “Descriptive Statistics” option:
TOOLS → DATA ANALYSIS → DESCRIPTIVE STATISTICS
MODULE 1: Data Analysis Using EXCEL
Using the COMP1 dataset, develop a histogram for the
quantitative variable of your choice (use three categories).
EXAMPLE: Take the Usage variable and subtract the min
value from the max value (65-25 = 40). Divide that number
(the range) by the number of desired categories (40/3 =
13.33). Now, “massage” that figure as necessary to create
reasonable, logical categories of approximately equal size:
• 25-39
• 40-54
• 55-65
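The arithmetic in this example can be sketched in Python as follows. The helper name and the choice to round the category width up to the next multiple of 5 are assumptions made for illustration; the notes only say to "massage" the figure into logical categories. The sketch also assumes whole-number data.

```python
import math

def bin_tops(minimum, maximum, categories, round_to=5):
    """Return the TOP value of each category for whole-number data."""
    raw_width = (maximum - minimum) / categories        # 40 / 3 = 13.33
    width = math.ceil(raw_width / round_to) * round_to  # "massaged" up to 15
    tops = [minimum + width * (i + 1) - 1 for i in range(categories)]
    tops[-1] = maximum  # the last category stops at the largest observed value
    return tops

# Usage variable: min = 25, max = 65, three categories
print(bin_tops(25, 65, 3))  # [39, 54, 65]
```

The returned values are the tops of the categories 25-39, 40-54 and 55-65.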
MODULE 1: Data Analysis Using EXCEL
A few points about histograms in EXCEL:
• Create a column over to the right that contains only the
TOP of each category that you have assigned
• Label this column “BIN RANGE”
• TOOLS → DATA ANALYSIS → HISTOGRAM
• Ensure that the “Labels” box is ticked
• Ensure that the “Cumulative Percentage” and “Chart
Output” boxes are ticked
MODULE 1: Data Analysis Using EXCEL
Using the COMP1 dataset, develop a scatterplot of two
quantitative variables of your choice.
EXAMPLE: Chart the two variables Usage and Price Level.
NOTE: it is helpful to move the two variables into columns
next to each other
MODULE 1: Data Analysis Using EXCEL
Using the COMP1 dataset, develop a pivot table of the
entire dataset.
DATA → PIVOT TABLES
NOTE: ensure that the labels are included in the table
MODULE 1: Data Analysis Using EXCEL
A few more widely used examples of functions in EXCEL:
=IF
=AND
=ABS
=SUMPRODUCT
=RAND
=RANDBETWEEN
MODULE 1: Data Analysis Using EXCEL
Look up tables in EXCEL can be a very helpful way of
converting quantitative data into categorized qualitative
data…which sometimes can be easier to work with.
Use the VLOOKUP function to categorize the Speed variable
into “EXCELLENT” “AVERAGE” and “POOR” events.
Clarification of EXCEL notation:
=VLOOKUP(enter the speed value here, enter the range of the
two column table here, enter the column number of the
desired category label here)
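For readers who want to see the lookup logic spelled out, here is a Python sketch of what an approximate-match lookup does with a two column table. The cut-off values (0, 2 and 4 days) are hypothetical; the notes do not specify them.

```python
from bisect import bisect_right

# Hypothetical two column table: (lowest speed in the band, category label).
# The first column must be sorted in ascending order.
lookup_table = [
    (0.0, "EXCELLENT"),
    (2.0, "AVERAGE"),
    (4.0, "POOR"),
]

def categorize(speed, table=lookup_table):
    """Return the label of the last row whose first column is <= speed."""
    thresholds = [low for low, _ in table]
    row = bisect_right(thresholds, speed) - 1
    if row < 0:
        raise ValueError("speed is below the first value in the table")
    return table[row][1]

for speed in (1.5, 2.0, 5.2):
    print(speed, categorize(speed))
```

Under these assumed cut-offs, a speed of 1.5 days is labeled EXCELLENT, 2.0 is AVERAGE and 5.2 is POOR.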
Lagniappe: This word derives from New World Spanish la
ñapa, “the gift.” The word came into the Creole dialect of
New Orleans and there acquired a French spelling. It is still
used in the Gulf States, especially southern Louisiana, to
denote a little bonus that a friendly shopkeeper might add
to a purchase.
One Lagniappe presented here is the process of selecting
“the best” option from among several alternatives.
Lagniappe for everyone!
MODULE 2: Inferential Testing – Confidence
Intervals
Ideally, we would measure the entire population and simply describe it. However, often we
are forced to take a sample and use inferential statistics because
of issues related to cost, time or access.
When taking a sample, we can estimate a population parameter
such as a mean or a proportion using the sample statistic alone (a point
estimate, which says nothing about its own precision) or we can calculate a confidence interval.
MODULE 2: Inferential Testing – Confidence
Intervals
Confidence Intervals are calculated using one of two formulas.
The CI for the population mean is represented by:
x ± Z*(s)/SQRT(n)
where, x = the sample mean
Z = Z-score (90%CI = 1.645, 95%CI = 1.96, 99%CI = 2.575)
s = sample standard deviation
n = sample size
Note: the part of the expression after the ± is called the
Margin of Error: Z*(s)/SQRT(n).
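The formula for the mean can be sketched in Python as follows. The function name is an assumption, and the inputs (mean 3.52, std dev 1.32, n = 100) are the rounded Speed figures that appear in these notes, so the result will differ slightly from anything computed on unrounded data.

```python
from math import sqrt
from statistics import NormalDist

def mean_confidence_interval(xbar, s, n, confidence=0.95):
    """Return (lower, upper, margin) for x +/- Z*(s)/SQRT(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for a 95% CI
    margin = z * s / sqrt(n)
    return xbar - margin, xbar + margin, margin

lower, upper, margin = mean_confidence_interval(xbar=3.52, s=1.32, n=100)
print(f"margin of error: {margin:.4f}")
print(f"95% CI: {lower:.2f} to {upper:.2f}")
```

Changing the confidence argument to 0.90 or 0.99 swaps in the corresponding Z-score automatically.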
MODULE 2: Inferential Testing – Confidence
Intervals
The CI for the population proportion is represented by:
p ± Z*SQRT((pq)/n)
where, p = the sample proportion
Z = Z-score (90%CI = 1.645, 95%CI = 1.96, 99%CI = 2.575)
q = 1-p
n = sample size
Note: the part of the expression after the ± is called the
Margin of Error: Z*SQRT((pq)/n).
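Here is the same idea for a proportion, again as a Python sketch. The poll figures (520 of 1,000 respondents) are invented for illustration and are not taken from any real survey.

```python
from math import sqrt
from statistics import NormalDist

def proportion_confidence_interval(successes, n, confidence=0.95):
    """Return (lower, upper, margin) for p +/- Z*SQRT((pq)/n)."""
    p = successes / n
    q = 1 - p
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(p * q / n)
    return p - margin, p + margin, margin

# Hypothetical poll: 520 of 1,000 respondents answer "yes"
lower, upper, margin = proportion_confidence_interval(520, 1000)
print(f"margin of error: {margin:.3f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

With a sample of about 1,000 the margin of error comes out near plus or minus 3 percentage points, which is the figure polling organizations typically report.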
MODULE 2: Inferential Testing – Confidence
Intervals
Fun Manual Calculation! using the Gallup Website:
http://poll.gallup.com/
Replicate the Gallup prediction using the second CI
formula.
MODULE 2: Inferential Testing – Confidence
Intervals
More fun calculations! Using the Comp1 dataset, calculate
the 95% CI for a quantitative variable using formula 1.
=CONFIDENCE
Note: Alpha is the accepted probability of making a Type I error. It is also 1 minus the confidence level.
MODULE 2: Inferential Testing – Confidence
Intervals
The result generated from =CONFIDENCE(.05, 1.32, 100) is
0.258857. What does this number mean?
This is the Margin of Error. In other words, if you were to
report the 95% confidence interval for this company’s
speed, you would report:
3.52 ± .25
Or
3.25 to 3.77
MODULE 2: Inferential Testing – Confidence
Intervals
Let's execute CIs in SAS. First, calculate the 95% CI for the
quantitative variable Speed. Use the following code:
Proc Import datafile = "c:\COMP1.xls" OUT = COMP1 DBMS = "EXCEL97" Replace;
run;
Proc Print data=Comp1;
Run;
Proc Means data=Comp1 CLM alpha=.05;
Var Speed;
Run;
Should you require a higher or lower alpha (.01 is more conservative and .10 is more
risk tolerant), change the .05 as appropriate.
MODULE 2: Inferential Testing – Confidence
Intervals
The output is pretty simple:
Notice that this is the same interval from the EXCEL
output. Isn't it nice when numbers match?
MODULE 2: Inferential Testing – Confidence
Intervals
Now, calculate the 95% CI for the qualitative variable
Procure. Use the following code:
Proc Freq data=Comp1;
Tables Procure/Binomial alpha=.05;
Run;
Note that EXCEL does not readily support this calculation.
MODULE 2: Inferential Testing – Confidence
Intervals
The output is pretty simple:
MODULE 2: Inferential Testing
Dependent Variable | Independent (predictor) Variable | Hypothesis Test | Comments
Categorical (Qualitative) | Categorical (Qualitative) | Chi-Square | Tests if variables are statistically independent (i.e. are they related or not?)
Quantitative | Categorical (Qualitative) | T-TEST, ANOVA | Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.
Quantitative | Quantitative | Regression Analysis | Test establishes a regression model; used to explain, predict or control dependent variable
MODULE 2: Inferential Testing
The Chi-Square Test is used to determine if two QUALITATIVE
variables are related.
The Chi-Square statistic is computed using the following
formula:
$\chi^2 = \sum (f_o - f_e)^2 / f_e$
Where:
$f_o$ is the frequency of the observed value
$f_e$ is the frequency of the expected value
This calculated test statistic is converted into a p-value to
evaluate the presence of a relationship (or not).
MODULE 2: Inferential Testing
Using the COMP1 dataset, determine if a company’s
buying situation (BuySit) is related to the centralization of
their purchasing decisions (Procure).
The hypothesis statements for this test are:
Ho: BuySit and Procure are NOT related
H1: BuySit and Procure ARE related
Develop the appropriate testing matrix and identify the
Type1 and Type2 errors.
MODULE 2: Inferential Testing
Now, using EXCEL, develop the 2x3 matrix of these two
variables using a pivot table (place “count of procure” in
the center).
You should see this:
Count of Procure by BuySit
Procure | MOD | NEW | REP | Grand Total
0 | 10 | 8 | 32 | 50
1 | 22 | 26 | 2 | 50
Grand Total | 32 | 34 | 34 | 100
This table includes the “frequencies of the observed” values
from our Chi-Square formula.
MODULE 2: Inferential Testing
Now, we need to determine the expected values. If there is
NO relationship, then we would expect to see exactly 32%
of the decentralized procurement companies with a
modified buying situation, 34% with a new buying situation,
etc.
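The matrix of expected values looks like this:
Procure | MOD | NEW | REP | TOTAL
0 | 16 | 17 | 17 | 50
1 | 16 | 17 | 17 | 50
TOTAL | 32 | 34 | 34 | 100
For example, the expected count for Procure = 0 and MOD is =(32/100)*50, which is 16.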
MODULE 2: Inferential Testing
Now use the =CHITEST formula in EXCEL. The actual range requested is the
INTERIOR of the observed matrix and the expected range is the INTERIOR of the
expected matrix (note that the marginal values are the same for the two matrices).
The resulting value is: 1.61E-09 or .00000000161.
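To see where this value comes from, the calculation can be traced by hand. The Python sketch below uses the observed counts from the pivot table, builds the expected counts, and applies the Chi-Square formula. One shortcut to be aware of: the p-value line uses a closed form that is valid only when there are exactly 2 degrees of freedom, as in this 2x3 table; other table sizes would need a statistics library.

```python
from math import exp

# Observed counts: rows are Procure (0, 1), columns are BuySit (MOD, NEW, REP)
observed = [
    [10, 8, 32],
    [22, 26, 2],
]

row_totals = [sum(row) for row in observed]        # 50, 50
col_totals = [sum(col) for col in zip(*observed)]  # 32, 34, 34
grand_total = sum(row_totals)                      # 100

# Expected count = row total * column total / grand total
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

# Chi-Square = sum of (fo - fe)^2 / fe over the interior cells
chi_square = sum(
    (fo - fe) ** 2 / fe
    for obs_row, exp_row in zip(observed, expected)
    for fo, fe in zip(obs_row, exp_row)
)

df = (len(observed) - 1) * (len(observed[0]) - 1)  # (2-1)*(3-1) = 2

# Upper-tail probability; exp(-x/2) holds only for df = 2
assert df == 2
p_value = exp(-chi_square / 2)

print("expected:", expected)
print("chi-square:", chi_square)
print("p-value:", p_value)
```

The test statistic works out to about 40.5 and the p-value to about 1.6E-09, in line with the CHITEST result.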
MODULE 2: Inferential Testing
This is a p-value. Given the rule: if p<α reject Ho and if p>α
do not reject Ho…
Can we conclude that there is a relationship between the
centralization of the purchasing decision and the buying
situation?
MODULE 2: Inferential Testing
Let's execute the same test in SAS, using the following code:
Proc Freq data=Comp1;
Tables Procure*BuySit/CHISQ;
Run;
MODULE 2: Inferential Testing
This code creates the following output:
This legend provides the key to interpreting these numbers
MODULE 2: Inferential Testing
Here is the second part of the output:
This is the same p-value that we obtained in EXCEL
MODULE 2: Inferential Testing
Any additional questions on Chi-Square?
MODULE 2: Inferential Testing
Ttests represent the most common form of statistical testing.
They involve either one sample or two independent samples.
1. One Sample Ttest - compares the mean of the sample to
a given number.
• e.g. Is average monthly revenue per customer who
switches > $50?
Formal Hypothesis Statement examples:
H0: μ ≤ $50 and H1: μ > $50
H0: μ = $50 and H1: μ ≠ $50
MODULE 2: Inferential Testing
2. Two Sample Ttest - compares the mean of the first
sample minus the mean of the second sample to a
given number.
• e.g. Is there a difference in the production output
of the two facilities?
Formal Hypothesis Statement examples:
H0: μa ≤ μb and H1: μa > μb
H0: μa = μb and H1: μa ≠ μb
MODULE 2: Inferential Testing
3. Paired Sample Ttest - compares the mean of the
differences in the observations to a given number.
• e.g. Is there a difference in the production output
of a facility after the implementation of new
procedures?
Formal Hypothesis Statement example:
H0: μpost - μpre ≤ 0
H1: μpost - μpre > 0
MODULE 2: Inferential Testing using EXCEL
Using the COMP1 dataset and EXCEL, determine if there is a
difference in the overall satisfaction of large versus small
companies (the large companies will represent a sample
and the small companies will represent a second sample).
Determine the null and alternative hypothesis statements for
this question…then develop the 2x2 hypothesis
matrix…including the Type1 and Type2 errors.
Now, sort the data by size. The satisfaction values
associated with the 0s (the small firms) will be our first array
of numbers and the satisfaction values associated with the
1s (the large firms) will be our second array of numbers.
MODULE 2: Inferential Testing using EXCEL
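Enter the following in the =TTEST function:
Array1: Satisfaction of small firms
Array2: Satisfaction of large firms
Tails: 2 (two tailed test)
Type: 2 (homoscedastic, i.e., equal variances assumed)
Your computed value should be 1.80363E-06…what do
you conclude?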
MODULE 2: Inferential Testing using SAS
Using the COMP1 dataset and SAS, determine if there is a
difference in the overall satisfaction of large versus small
companies.
Here is the necessary code:
Proc Ttest data=Comp1;
Var Satisf;   /* the quantitative variable of interest */
Class Size;   /* the qualitative variable which identifies the two samples */
Run;
MODULE 2: Inferential Testing using SAS
Using the COMP1 dataset and SAS, determine if the overall
satisfaction is more than 5.0.
Develop the hypothesis statements and develop the 2x2
matrix.
Here is the necessary code:
Proc Ttest data=Comp1 H0=5.0;
Var Satisf;
Run;
MODULE 2: Inferential Testing
Any additional questions on Ttests?
MODULE 2: Inferential Testing - ANOVA
As with Ttests, ANOVA is a common test to determine if
differences exist among samples. Where Ttests evaluate
either one or two samples (groups), ANOVA accommodates
3 or more.
The hypothesis statements in ANOVA look like this:
H0: μa = μb = μc
H1: μa ≠ μb ≠ μc
Note: the alternative hypothesis is interpreted as “at least one mean is
different”…not that all means are different.
MODULE 2: Inferential Testing - ANOVA
SST = SSW + SSB
$\sum_i \sum_j (X_{ij} - \bar{X})^2 = \sum_i \sum_j (X_{ij} - \bar{X}_j)^2 + \sum_j n_j (\bar{X}_j - \bar{X})^2$
SST = Total Sum of Squares
SSW = Sum of Squares Within Groups
SSB = Sum of Squares Between Groups
$\bar{X}$ = mean of data for all the sample groups combined
$\bar{X}_j$ = mean of the jth sample group
$X_{ij}$ = the ith element from the jth group
$n_j$ = number of observations in the jth group
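The decomposition can be verified on a tiny example. The Python sketch below uses three made-up groups of three observations each (not the workshop data) and computes each sum of squares directly from its definition, plus the F statistic that ANOVA reports.

```python
# Hypothetical data: three groups, three observations each
groups = {
    "A": [4, 5, 6],
    "B": [6, 7, 8],
    "C": [8, 9, 10],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = sum(all_values) / len(all_values)

# SST: every observation against the grand mean
sst = sum((x - grand_mean) ** 2 for x in all_values)

# SSW: every observation against its own group mean
ssw = 0.0
# SSB: every group mean against the grand mean, weighted by group size
ssb = 0.0
for values in groups.values():
    group_mean = sum(values) / len(values)
    ssw += sum((x - group_mean) ** 2 for x in values)
    ssb += len(values) * (group_mean - grand_mean) ** 2

k = len(groups)          # number of groups
total_n = len(all_values)  # total number of observations
f_stat = (ssb / (k - 1)) / (ssw / (total_n - k))

print("SST:", sst, " SSW:", ssw, " SSB:", ssb)
print("SSW + SSB:", ssw + ssb)
print("F:", f_stat)
```

Here SST is 30, which splits into SSW = 6 and SSB = 24, giving F = 12. Most of the variation sits between the groups rather than within them.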
FUN MANUAL CALCULATION OF ANOVA!
Question: If we have to test for a
difference among a group of three or
more samples, why can't we simply
execute a series of Ttests?
Hint: the issue has to do with the
accepted risk of making a TYPE 1 error.
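The hint can be worked through numerically. The Python sketch below counts the pairwise Ttests needed for a given number of groups and estimates the chance of at least one TYPE 1 error across all of them. It assumes the tests are independent, which is only an approximation since the pairs share data, so treat the figures as rough guides.

```python
from math import comb

def familywise_error(num_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one Type I error)."""
    num_tests = comb(num_groups, 2)
    # Chance that every test avoids a Type I error is (1 - alpha)^tests
    overall_risk = 1 - (1 - alpha) ** num_tests
    return num_tests, overall_risk

for g in (3, 4, 5):
    tests, risk = familywise_error(g)
    print(f"{g} groups: {tests} Ttests, overall Type I risk about {risk:.3f}")
```

With three groups the overall risk is already about 14% rather than the 5% we set, and with five groups it is about 40%. ANOVA keeps the risk at the chosen alpha by running a single test.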
MODULE 2: Inferential Testing - ANOVA
Using the Comp1 dataset in EXCEL, determine if there is a
difference in satisfaction among the different types of buying
situations.
Hint: we will first have to convert the values for the variable
“buying situations” from alpha characters to numeric
characters…the data is still categorical…EXCEL cannot read
alpha characters for ANOVA.
=IF(M2="NEW",1,IF(M2="MOD",2,IF(M2="REP",3)))
Then: TOOLS → DATA ANALYSIS → ANOVA: Single Factor
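Note: the input range will be the quant variable and the categorical
(numeric) variable ONLY. And…it helps if they are next to each other.
Caution: this tool treats each column in the input range as its own group, so
this layout compares the Satisf column against the coded BuySit column. To
compare the three buying situations directly, arrange the Satisf values in
three columns, one per situation.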
MODULE 2: Inferential Testing - ANOVA
Your output should look like this:
Anova: Single Factor
SUMMARY
Groups | Count | Sum | Average | Variance
Satisf | 100 | 477.1 | 4.771 | 0.731979
BuySit | 100 | 200 | 2 | 0.686869
ANOVA
Source of Variation | SS | df | MS | F | P-value | F crit
Between Groups (SSB) | 383.9221 | 1 | 383.9221 | 541.1745 | 1.53E-58 | 3.888857
Within Groups (SSW) | 140.4659 | 198 | 0.709424 | | |
Total | 524.3879 | 199 | | | |
The P-value column holds your p-value. What do you conclude?
Question: Where are the differences?
MODULE 2: Inferential Testing - ANOVA
Using the Comp1 dataset in SAS, determine if there is a
difference in satisfaction among the different types of buying
situations.
Here is the necessary code:
Proc ANOVA data=Comp1;
Class Buysit;          /* Qualitative Independent/Classification Variable */
Model Satisf=Buysit;   /* Quantitative Dependent=Independent */
Means Buysit/TUKEY;    /* Statement to determine WHICH groups are different */
Run;
MODULE 2: Inferential Testing - ANOVA
[SAS ANOVA output figure; callouts label the SSB and SSW rows]
Note that this output looks almost the same as the EXCEL output
MODULE 2: Inferential Testing
Any additional questions on ANOVA?
MODULE 3: Predictive Modeling
Dependent Variable | Independent (predictor) Variable | Hypothesis Test | Comments
Categorical (Qualitative) | Categorical (Qualitative) | Chi-Square | Tests if variables are statistically independent (i.e. are they related or not?)
Quantitative | Categorical (Qualitative) | T-TEST, ANOVA | Determines if categorical variable (factor) affects dependent variable; Ttests for 1 or 2 groups and ANOVA for 3 or more.
Quantitative | Quantitative or Dummy | Regression Analysis | Test establishes a regression model; used to explain, predict or control dependent variable
Categorical (>2 Categories) | Quantitative or Dummy | Discriminant Analysis | Test establishes a discriminant model; used to explain, predict or control dependent variable
Binary | Quantitative or Dummy | Logistic Analysis | Test establishes a logistic model; used to explain, predict or control dependent variable
MODULE 3: Predictive Modeling
All models are wrong…but some are useful.
MODULE 3: Predictive Modeling –
Correlation Analysis
Prior to any model development, an initial correlation
analysis is typically generated to understand which variables
are related.
As with any inferential method, Correlation Analysis is
conducted within the context of Hypothesis Statements.
The general form of these statements in Correlation Analysis
is:
H0: Variable A and Variable B are NOT related
H1: Variable A and Variable B ARE related
So, if p<α, then the conclusion is that Variable A and
Variable B ARE related.
MODULE 3: Predictive Modeling –
Correlation Analysis
Determine which variables in the Comp1 dataset are
correlated using EXCEL.
TOOLS → DATA ANALYSIS → CORRELATION
MODULE 3: Predictive Modeling –
Correlation Analysis
The output is below. What do the numbers mean?
MODULE 3: Predictive Modeling –
Correlation Analysis
These numbers are correlation coefficients. Each one measures the strength
and direction of the linear relationship between two variables.
Correlation coefficients can vary from a
low of negative 1 to a high of positive 1, with 0 meaning no linear relationship.
For example, Price Level and Speed have a correlation of
-.34923. This is a fairly weak relationship: squaring the coefficient gives
about .12, so roughly 12% of the variation in Price Level is shared with
Speed…and vice versa. In addition,
since this number is negative, the two variables move in the
OPPOSITE direction (i.e. as one goes up, the other tends to go down).
Note: Correlation does NOT equate to causation. At this point in
the analysis, we cannot state that Price Level decreases Speed
or vice versa. We can only claim that we know that they have
an inverse relationship.
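For reference, here is how a correlation coefficient is computed from raw data, sketched in Python. The two short lists are invented values chosen to show a strong inverse relationship; they are not the Speed and Price Level columns from the dataset.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Correlation coefficient between two equal-length lists of numbers."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations (as x goes up, y tends to go down)
x = [1, 2, 3, 4, 5]
y = [5, 4, 4, 2, 1]

r = pearson_r(x, y)
print("r =", round(r, 4))
print("r squared =", round(r ** 2, 4))
```

The coefficient here is about -.96, negative because the variables move in opposite directions, and its square (about .93) is the share of variation the two variables have in common.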
MODULE 3: Predictive Modeling –
Correlation Analysis
The scatterplot below is a visual representation of the correlation
coefficient -.34923. Is this relationship significant?
[Scatterplot: Price Level (vertical axis) versus Speed (horizontal axis)]
MODULE 3: Predictive Modeling –
Correlation Analysis
SAS provides more information regarding the statistical
significance of any correlations. The statistical
significance will be highly sensitive to the sample size…if
the sample is very large…even very weak correlations will appear to be
significant.
Here is the necessary code:
Proc Corr data=Comp1;
Run;
MODULE 3: Predictive Modeling –
Correlation Analysis
Note: the number on the top is the correlation coefficient and the
number on the bottom is the p-value
Rather than execute a full matrix, if you are only
interested in assessing the correlations of the variables
with one or two variables of interest (e.g., dependent
variables), include the “With” statement:
Proc Corr data=Comp1;
With Usage;
Run;
So…what will you conclude?
MODULE 3: Predictive Modeling –
Regression
In Regression, we generate a line of the form:
$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$
that represents the “best fit” of the data…meaning that the
sum of the squared differences between the actual and the
predicted values is minimized. In this equation,
$Y$ = the predicted value (MUST be a quantitative variable, i.e. interval or ratio scale)
$\beta_0$ = the Y intercept
$\beta_1$ = the coefficient or weight of $X_1$
$X_1$ = an independent or predictor variable (MUST be a quantitative
variable or a dummy variable)
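For a single predictor, the best-fit intercept and slope have a closed form. The Python sketch below is an illustration only, with hypothetical data and a hypothetical function name fit_line; with several predictors the workshop relies on EXCEL or SAS to do this work.

```python
def fit_line(x, y):
    """Return (b0, b1) minimizing the sum of squared residuals of y = b0 + b1*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx               # slope (weight of the predictor)
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1

# Hypothetical data, NOT the workshop data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1 = fit_line(x, y)
print(f"Y = {b0:.2f} + {b1:.2f} * X")
```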
To evaluate the “best-fit line”, we calculate the following statistics using
ANOVA (again, usually done by the computer program):
Total sum of squares (SST), $\sum_i (Y_i - \bar{Y})^2$, is a measure of the
variability of the dependent variable Y around its mean $\bar{Y}$.
Error sum of squares (SSE), $\sum_i (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i$ is the
predicted value, is a measure of the variability of the dependent variable Y
that is left over after using the regression model and the predictor
variable X to explain Y. If this value is 0, then our regression model is
perfect (usually not the case!).
Model (regression) sum of squares (SSR), SST - SSE, is a measure of the
amount of the total sum of squares accounted for by the regression
model. The closer SSR is to SST, the better the model explains the variation
of Y.
Coefficient of determination, or $R^2$, is calculated as SSR/SST. This is the
proportion of the total sum of squares (i.e. total variation in Y) that is
explained by the regression model. $R^2$ is close to 1 for good fits and close to
0 for no relationship.
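The quantities above can be checked by hand on a small example. The Python sketch below assumes a hypothetical fitted line Y = 2.2 + 0.6X and hypothetical data; the function name sums_of_squares is also an assumption made for illustration.

```python
def sums_of_squares(y, y_pred):
    """Return (SST, SSE, SSR, R squared) for actual and predicted values."""
    mean_y = sum(y) / len(y)
    sst = sum((yi - mean_y) ** 2 for yi in y)               # total variation
    sse = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))  # left-over variation
    ssr = sst - sse                                         # explained variation
    return sst, sse, ssr, ssr / sst

# Hypothetical data and a hypothetical fitted line Y = 2.2 + 0.6 * X.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_pred = [2.2 + 0.6 * xi for xi in x]

sst, sse, ssr, r_squared = sums_of_squares(y, y_pred)
print(f"SST={sst:.2f} SSE={sse:.2f} SSR={ssr:.2f} R2={r_squared:.2f}")
```

Here the model accounts for 60% of the variation in Y, so it is better than no model but far from a perfect fit.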
Let's execute a Regression analysis in EXCEL. Use Usage as
the dependent (y) variable.
TOOLS → DATA ANALYSIS → REGRESSION
• The dependent variable is the Y Range
• All other quantitative or dummy variables are in the X Range
• The Residuals can be helpful
From this output, what do you know?
Using this output, answer the following questions:
1. What is the linear equation?
2. Which predictor variables are significant?
3. How accurate is the model?
4. How could you improve the efficiency (parsimony) of the model?
5. Can you use this model to make a prediction?
6. Are there any problems with the model, such as multicollinearity
among the predictors or any influential observations?
Although a Regression model can be developed using
EXCEL, the process is fairly manual and there are limited
diagnostics.
Limitations of using EXCEL for Regression modeling:
1. Only supports an “all in” selection method; retention or
deletion of variables must be done manually.
2. No diagnostics for collinearity among the predictors.
3. No intervals for the predictions.
4. Cumbersome identification of outliers.
Here is the necessary SAS code:
Proc Reg data=Comp1;
Model Usage = Speed PriceLv PriceFlx Man_Imag Service
Sal_Imag Quality Size Satisf SpecBuy Procure IndType/
VIF P R CLI Selection=Forward Partial;
Output out=reg p=pred;
Plot Residual.*Pred.;
Run;
Quit;
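The VIF option requests a variance inflation factor for each predictor, which is the collinearity diagnostic that EXCEL lacks. As a minimal sketch of the idea, assuming only two predictors, the VIF reduces to 1 / (1 - r²), where r is the correlation between them. The Python code below uses hypothetical data and hypothetical function names; it illustrates the definition and is not the SAS implementation.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)

# Hypothetical predictor values, NOT the workshop data.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]

vif = vif_two_predictors(x1, x2)
print(f"VIF = {vif:.2f}")
```

A VIF near 1 means the predictor shares little with the others; a commonly quoted rule of thumb treats values above about 10 as a sign of serious multicollinearity.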
Looks as if Obs 7 might be a problem for us…
Any additional questions on Regression?
MODULE 3: Predictive Modeling
| Dependent Variable | Independent (predictor) Variable | Hypothesis Test | Comments |
|---|---|---|---|
| Categorical (Qualitative) | Categorical (Qualitative) | Chi-Square | Tests if variables are statistically independent (i.e. are they related or not?) |
| Quantitative | Categorical (Qualitative) | T-TEST, ANOVA | Determines if categorical variable (factor) affects dependent variable; T-tests for 1 or 2 groups and ANOVA for 3 or more. |
| Quantitative | Quantitative or Dummy | Regression Analysis | Test establishes a regression model; used to explain, predict or control dependent variable |
| Categorical, >2 Categories | Quantitative or Dummy | Discriminant Analysis | Test establishes a discriminant model; used to explain, predict or control dependent variable |
| Binary | Quantitative or Dummy | Logistic Analysis | Test establishes a logistic model; used to explain, predict or control dependent variable |
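The selection logic in this summary can be written as a simple lookup. The Python sketch below is only an illustration of the table; the dictionary name, function name and the exact label strings are assumptions.

```python
# Keys are (dependent variable type, predictor variable type), as in the table.
TECHNIQUE_BY_TYPES = {
    ("categorical", "categorical"): "Chi-Square",
    ("quantitative", "categorical"): "T-test (1 or 2 groups) or ANOVA (3 or more)",
    ("quantitative", "quantitative or dummy"): "Regression Analysis",
    ("categorical, >2 categories", "quantitative or dummy"): "Discriminant Analysis",
    ("binary", "quantitative or dummy"): "Logistic Analysis",
}

def choose_technique(dependent, predictor):
    """Look up the technique for a pair of variable types."""
    key = (dependent.lower(), predictor.lower())
    return TECHNIQUE_BY_TYPES.get(key, "not covered by the table")

print(choose_technique("Binary", "Quantitative or Dummy"))
print(choose_technique("Quantitative", "Categorical"))
```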
MODULE 3: Predictive Modeling –
Discriminant Analysis
What if we need to create a model with a
CATEGORICAL dependent variable?
Regression Analysis is not a viable option. ANOVA
should be executed first, followed by Discriminant
Analysis.
Here is the necessary code to execute Discriminant
Analysis:
Proc Discrim data=Comp1 method=normal
Pool=Yes List Crossvalidate;
Class BuySit;
Var Speed PriceLv PriceFlx Man_Imag Service
Sal_Imag Quality Size Satisf SpecBuy Procure IndType Usage;
Run;
The objective in Discriminant Analysis is to minimize
the errors in the “Hit Matrix”:
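A hit matrix cross-tabulates the actual group of each observation against the group the model assigned; the diagonal holds the hits and everything else is an error. The Python sketch below uses hypothetical group labels and hypothetical function names to show how the matrix and the overall hit rate are tallied.

```python
from collections import Counter

def hit_matrix(actual, predicted):
    """Count each (actual group, predicted group) pair."""
    return Counter(zip(actual, predicted))

def hit_rate(actual, predicted):
    """Share of observations assigned to their actual group."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

# Hypothetical group labels, NOT the workshop results.
actual = [1, 1, 2, 2, 3, 3, 3, 1]
predicted = [1, 2, 2, 2, 3, 1, 3, 1]

matrix = hit_matrix(actual, predicted)
for (a, p), count in sorted(matrix.items()):
    print(f"actual {a} -> predicted {p}: {count}")
print(f"hit rate = {hit_rate(actual, predicted):.2f}")
```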
MODULE 3: Predictive Modeling – Logistic
Analysis
Logistic Analysis is a particularly handy modeling
technique when the dependent variable can assume two
(and only two) values.
Common Applications of Logistic Analysis:
• Will an individual qualify for a loan?
• Does this person have a particular disease?
• Will this person respond to a marketing solicitation?
• Will applicant A be accepted to college?
• Will the product fail in a particular situation?
The general form of the Logistic Equation is:
$\frac{\text{Prob(event)}}{\text{Prob(no event)}} = e^{B_0 + B_1 X_1 + B_2 X_2 + \dots + B_n X_n}$
The objective in Logistic Analysis is to correctly predict the
presence of an event. The accuracy of the model is then
evaluated using several metrics, including a classification
table.
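The left side of the logistic equation is the odds of the event. Since Prob(no event) = 1 - Prob(event), the probability itself is odds / (1 + odds). The Python sketch below shows that conversion; the coefficients and the function name event_probability are hypothetical.

```python
import math

def event_probability(intercept, coefs, xs):
    """Convert a logistic equation into Prob(event) for one observation."""
    linear = intercept + sum(b * x for b, x in zip(coefs, xs))
    odds = math.exp(linear)      # Prob(event) / Prob(no event)
    return odds / (1.0 + odds)   # Prob(event)

# Hypothetical coefficients: B0 = -1.0, B1 = 0.5, with X1 = 2.0.
p = event_probability(-1.0, [0.5], [2.0])
print(f"Prob(event) = {p:.2f}")

# Odds of 3 to 1 correspond to a probability of 0.75.
print(event_probability(math.log(3), [], []))
```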
Here is the code to execute Logistic Analysis:
Proc Logistic descending data=Comp1;
model SpecBuy = Speed PriceLv PriceFlx Man_Imag Service Sal_Imag
Quality/
Selection=Stepwise
CTable PPROB = (0 to 1 by .1)
Lackfit
RISKLIMITS;
run;
quit;
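The CTable and PPROB options request a classification table at probability cutoffs from 0 to 1 in steps of .1. The Python sketch below shows the basic idea with hypothetical predicted probabilities and outcomes. It is a simplified illustration, not a reproduction of the SAS calculation, and it assumes an observation is classified as an event when its probability is at or above the cutoff.

```python
def classification_row(probs, actual, cutoff):
    """Accuracy, sensitivity and specificity at one probability cutoff."""
    predicted = [1 if p >= cutoff else 0 for p in probs]
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    true_neg = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    events = sum(actual)
    non_events = len(actual) - events
    return {
        "cutoff": cutoff,
        "accuracy": (true_pos + true_neg) / len(actual),
        "sensitivity": true_pos / events,      # events correctly predicted
        "specificity": true_neg / non_events,  # non-events correctly predicted
    }

# Hypothetical model output, NOT the workshop results (1 = event).
probs = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]
actual = [0, 0, 1, 1, 1, 1]

for step in range(11):
    print(classification_row(probs, actual, step / 10))
```

Raising the cutoff trades sensitivity for specificity, so the table helps in choosing a cutoff that suits the cost of each kind of error.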
General Questions on Predictive Modeling?