Lab Manual Draft

[Cover photo: Marketing research today increasingly involves the need to understand global markets. This photo was taken in Shanghai, a major shopping and marketing center.]
Lab Manual Prepared by
Dr. Cristanna Cook
PREFACE
The purpose of this lab manual is to support student understanding of and participation in Marketing Research, BA 424. It is intended to act as a guide to hands-on problem development for major topic areas in inferential statistics. It also provides the techniques for completing on the computer, using SPSS version 16.0, the same problems done by hand. By seeing a problem, working it by hand, and then solving it on the computer using a common statistical package, students will gain a more complete understanding of the underlying concepts and of the technology commonly used in the business world for solving such problems.
Students should have had a basic course in statistics before taking BA 424.
Table of Contents
Chapter 1 (Probability)
  Probability Distributions
  The Normal Distribution
Chapter 2 (Lab 1)
  Understanding File Structure
Chapter 3 (Lab 2)
  Frequencies
  Case Summaries
Chapter 4 (Lab 3)
  Descriptive Statistics
Chapter 5 (Lab 4)
  Hypothesis Testing (One Sample)
Chapter 6 (Lab 5)
  Hypothesis Testing (Chi-Square)
Chapter 7 (Lab 6)
  Hypothesis Testing (Two Samples)
Chapter 8 (Labs 7 and 8)
  Hypothesis Testing (ANOVA)
  Hypothesis Testing (Two-Way ANOVA)
Chapter 9 (Lab 9)
  Hypothesis Testing (Correlation)
Chapter 10 (Lab 10)
  Hypothesis Testing (Simple and Multiple Regression)
Chapter 1 (Probability)
PROBABILITY DISTRIBUTIONS
Introduction:
Inferential statistics is based upon the idea that activities have probabilities attached to them. If we think about the possibility of rain tomorrow, we might ask: what is the probability that it will rain tomorrow? In like fashion, research activities often involve probabilities. Probability is the likelihood that an event will occur. How we define "event" depends on the situation. An event is the result of an activity, and the word activity can be replaced with the word experiment. When we flip a coin, that activity can be thought of as an experiment where the outcome is random. We do not know whether we will get a head or a tail, but one or the other will result. So, an experiment has outcomes; the outcomes of flipping a coin are heads and tails. An event might be described as getting a head on a flip of a coin or getting a tail on a flip of a coin. What the event is depends on our point of view.
We often want to associate probabilities with the outcomes of an experiment. Suppose we undertake a survey, which is a kind of experiment, and we are interested in the number of males and females who prefer peaches to pears. We will use the number of respondents who indicate they prefer peaches and the number who say they prefer pears to calculate what we call "empirical probability," which is based on empirical studies, such as surveys, that ask people questions. Empirical probability is represented by the frequency of people who answer our empirical questions a certain way.
There are two other kinds of probability. One is classical
probability and the other is subjective probability. Classical
experiments are often based upon events defined in terms of a
deck of cards, marbles in a bowl, etc. These were the types of
events that early statisticians used to develop the theory of
statistics. Subjective probability is our own view of whether an event will happen and is based upon our own experience, which may or may not be valid.
So, we have activities which we call experiments and from
which we can identify an event for which we can calculate
empirical probability.
The Rules of Probability:
In order to calculate a probability for an event from an experiment, we often have to count the number of times the outcomes of the event happen. To do this, we can use what are called counting rules. These counting rules give us the total number of times something happens. A probability is a ratio with a numerator and a denominator: to calculate a probability, we often have to count the number of times a certain event occurs and put that number in the numerator, then count the total number of outcomes and put that number in the denominator. We can then calculate the probability.
If we flip a coin 2 times and want to find the probability of getting at least 1 head (which is our event), we need to identify all the outcomes in our experiment and the outcomes that make up the event (the different ways of getting at least 1 head). If we flip a coin two times we can get: HH, HT, TH, or TT. So there are 4 total outcomes and 3 ways of getting at least one head (HT, TH, HH). So the probability of getting at least one head is 3 out of 4, or 3/4.
Now sometimes, the actual number of total outcomes or the
number of outcomes in an event may be difficult to calculate,
so we have special counting rules to help us. These counting
rules are:
1. the addition rule;
2. the multiplication rule;
3. the permutation rule; and
4. the combination rule.
You may find these rules in any statistics text. However, the
purpose of the rules is to help us find a probability.
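For example, one of these rules, the combination rule, counts the number of ways to choose r items from n items when order does not matter (this rule appears in any basic statistics text):

$_nC_r = \frac{n!}{r!(n-r)!}$

So the number of ways to choose 2 items from 4 is $_4C_2 = \frac{4!}{2!\,2!} = 6$.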
Probability Distribution:
We now know part of what makes up a probability distribution. One part is the probability. That probability has to be associated with something, and that something is the values that a random variable can take. A variable is anything that varies. A random variable is one for which we do not know the exact result or value when we do an experiment. When we flip a coin, getting a head is a random result. So is getting a tail. We know we will get one or the other, but on any one toss of the coin we do not know which we will get.
When we pair the values of a random variable, such as head or tail, with the probability of getting each value (1/2 for a tail and 1/2 for a head in the case of flipping a coin once), we have a probability distribution.
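For instance, using the two-flip experiment discussed above, the probability distribution of the random variable X = the number of heads in two flips is a small table:

x:      0     1     2
P(x):   1/4   1/2   1/4

The probabilities sum to 1, which is true of every probability distribution.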
Now there are some complicated probability distributions.
The probabilities associated with the values that a random
variable can take have been calculated for us by statisticians
for these more complicated probability distributions.
The Normal Distribution:
This distribution is used often to represent random variables.
It is used to find probabilities associated with values that the
random variable can take.
There are certain types of problems that are commonly
solved by using the normal distribution. The characteristics of
this distribution can be reviewed in your basic statistics text.
Statisticians have calculated the probabilities associated with values of a random variable and have standardized these values. These probabilities have been placed in a table called The Standard Normal Table. To use this table, we take the value of our random variable, change it into a z-score, and find the probability associated with that z-score.
The problems often fall into one of 7 types when we try to find probabilities associated with values of a random variable for which we have the z-score.
Type 1:
Finding the area under the standard normal distribution
curve between 0 and +z or 0 and –z.
Type 2:
Finding the area under the standard normal distribution curve in either tail: from +z to the end of the right side of the distribution, or from –z to the end of the left side of the distribution
Type 3:
Finding the area under the standard normal distribution
between any z values on one side of the distribution or from
+z1 to +z2 or from –z2 to –z1
Type 4:
Finding the area under the standard normal distribution
between any z values on the opposite side of the mean or
between z1 on one side of the mean and z2 on the other side
of the mean
Type 5:
Finding the area under the standard normal distribution to the left of +z, which is to the right of the mean
Type 6:
Finding the area under the standard normal distribution to the right of –z, which is to the left of the mean
Type 7:
Finding the area under the standard normal distribution
curve on any two tails or from +z out to the right side of the
distribution and from –z out to the left side of the
distribution
Examples:
Type 1:
Find the probability or area from z=0 to z= 1.2.
Find the probability or area from z=0 to z= -1.2
Type 2:
Find the probability or area to the right of z=1.2
Find the probability or area to the left of z=-1.2
Type 3:
Find the probability or area between z1=+1.2 and z2=+1.7
Find the probability or area between z1=-1.2 and z2=-1.7
Type 4:
Find the probability or area between –z=-1.2 and +z=1.7
Type 5:
Find the probability or area to the left of +z=1.2
Type 6:
Find the probability or area to the right of –z=-1.2
Type 7:
Find the probability or area to the right of +z=+1.2 and to the
left of –z=-1.2
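As a sketch of the solutions (using the standard normal table, which gives the area from 0 to z=1.2 as 0.3849 and from 0 to z=1.7 as 0.4554):

Type 1: P(0 ≤ z ≤ 1.2) = 0.3849; by symmetry, P(-1.2 ≤ z ≤ 0) = 0.3849 as well.
Type 2: P(z > 1.2) = 0.5 - 0.3849 = 0.1151; likewise P(z < -1.2) = 0.1151.
Type 3: P(1.2 ≤ z ≤ 1.7) = 0.4554 - 0.3849 = 0.0705; the same area lies between -1.7 and -1.2.
Type 4: P(-1.2 ≤ z ≤ 1.7) = 0.3849 + 0.4554 = 0.8403.
Type 5: P(z < 1.2) = 0.5 + 0.3849 = 0.8849.
Type 6: P(z > -1.2) = 0.3849 + 0.5 = 0.8849.
Type 7: P(z > 1.2 or z < -1.2) = 0.1151 + 0.1151 = 0.2302.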
Chapter 2 (Lab 1)
File Structure:
We can think of a file as a table of rows and columns. When
you use SPSS, you put your data in a file with rows and
columns.
As you can see with the file above, this file, called Employee.sav, shows the data for a particular company's employees. A record is composed of the information on one employee. There are 474 employees in this data file. Each file is made up of variables. The variables are called fields and represent the information collected for all the records. So we have the following fields or variables across the top: id, gender, bdate for date of birth, educ for education, jobcat for job category, salary, salbegin for beginning salary, jobtime for how long the person has been in the job, prevexp for previous experience, and minority for minority status.
The view we see here shows the actual data. We can also
have another view which is called variable view. We can
switch between data and variable view easily by just
selecting one or the other view at the bottom left of the
screen. As we can see below, the variable view lists all the
variables and across the top of the screen we see the
attributes for all those variables: name, type, width,
decimals, label, values, missing, columns, align, and measure.
Name: Name is the name of the field or variable. You provide this; it should reflect the meaning of the variable.
Type: The type of field or variable refers to the underlying
nature of the field or variable. Is it numeric or non-numeric?
This is the basic question. You want fields for which you plan
to do math to be numeric. There are other kinds of fields
such as dollar. If you indicate that the field is dollar, then a $
will be inserted into the field. However, you cannot do math
with this kind of field. Some fields may be alphanumeric, meaning that we use letters and numbers to represent data in the field. However, some procedures, and other programs to which you might want to export your data, may not read alphanumeric fields in a way that allows math to be done on the data in those fields.
Width: This refers to the number of positions provided for
the field when the field data were input into the file.
Decimals: If the field should have decimals, you can specify
the number of decimal places.
Label: As each field is actually a variable, you will want to give it a descriptive label that reminds you what the variable means. This label will be printed on any computerized output.
Values: As each variable is a random variable, the value that the variable takes depends on the test unit (with business data, usually a person). So the value will differ depending on the test unit. Thus, we have to indicate the different values the variable can take. Gender, for example, is a variable that can take on the two values male and female.
Missing: Sometimes the person we interview will not answer
a question. If that is the case, then the data will be missing.
So some value will have to represent missing data. Any value can be used; it is often conventional to use a string of 9's to represent a missing value. There is also a built-in missing value in SPSS: the dot (.). Dots are automatically read as missing.
Columns: This is just the column width of the field in the SPSS
spreadsheet.
Align: You can align the fields to the right, left, or center to
make the spreadsheet look nice.
Measure: This is the kind of scaling measurement: nominal, ordinal, interval, or ratio.
We will be working with the fields or variables to develop hypotheses, so we need to know as much about them as possible. If we name them, label them, and give them value labels and appropriate missing values, it will be easier for us to understand the printed output.
Your Data:
Each of you signed up for a case which comes with a data
file. What you have to do is the following:
1. Read the case in your text;
2. Understand the meaning of the variables (to get the
meaning use the questionnaire or survey in the case to
help you);
3. Change the name of the variables to suit you;
4. Check out the label and value labels for the variables;
5. Check out the missing data;
6. Check out whether the variables are numeric or not, as this will determine how you can analyze the data (a syntax sketch for steps 3 through 5 follows below).
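As a minimal sketch of how steps 3 through 5 can also be done with syntax instead of Variable View (the new name, label text, and 999 missing-value code below are illustrative assumptions, not values taken from your case):

RENAME VARIABLES (educ=education).
VARIABLE LABELS salbegin 'Beginning Salary'.
VALUE LABELS gender 'f' 'Female' 'm' 'Male'.
MISSING VALUES prevexp (999).
EXECUTE.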
Please proceed to
instructors.husson.edu/cookc/marketingcourses to find
specific instructions for Lab 1.
Chapter 3 (Lab 2)
Case Summary and Frequencies:
Often we would like to see how the respondents answered the questions we have asked them, or we may want to take a look at the data in order to identify errors in the data set. We can do this by running the Case Summary procedure in SPSS and by running the Frequency procedure in SPSS.
If we run Frequencies, we can tell immediately if an incorrect answer has been coded into the data set, because all data values will appear in our Frequency runs. If a value appears to be incorrect, we can then run Case Summary to identify the record where the error occurs and isolate the record using its ID. All records should have an ID; ID or something similarly named should be a field in the data set. So, if you know the field where the error is, you can run Case Summary and get a listing of that field along with a listing of the ID field, and you can then find the data record where the error is located. We will run a case summary for the Employee.sav data set. We want to limit the number of cases printed; otherwise we tend to waste paper.
Frequencies:
Let's run a frequency for the Employee.sav data set that comes with SPSS and see what the output looks like.
The Employee.sav dataset has a number of fields. We do not want to ask for a frequency on data that is continuous, because we would get page after page of output. We might do this just to identify errors, but we should resist printing it all out, as there would be many pages to print.
For example, the Employee.sav data set has fields called bdate, prevexp, and code. Bdate is the employee's birth date, code is the employee id, and prevexp is the employee's previous experience in months. It probably would not make sense to ask for a frequency on these fields, as it would not be very meaningful. Also, we have the salary and salbegin fields that represent current salary and beginning salary. If we ask for a frequency on this data, it would be a rather long list, because nearly every individual in our dataset will have a different number for salary and beginning salary. So it is best to run the Frequency procedure with fields where not every person's answer is represented by a different value.
Example of the Frequency Procedure:
As you see below, there are three fields we have identified in
the Frequencies procedure. There are frequencies for
gender, previous experience, and minority status. As you can
see, the previous experience field goes on and on because it
is one of the fields that have numbers that are different for
most employees. Each frequency table provides the value
label, frequency, percent, valid percent (which leaves out missing data), and cumulative percent.
FREQUENCIES VARIABLES=gender prevexp minority
/ORDER=ANALYSIS.
Frequencies
[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav
Statistics

             Gender   Previous Experience (months)   Minority Classification
N   Valid     474              474                           474
    Missing     0                0                             0
Frequency Table

Gender
          Frequency   Percent   Valid Percent   Cumulative Percent
Valid
  Female     216        45.6        45.6               45.6
  Male       258        54.4        54.4              100.0
  Total      474       100.0       100.0
Previous Experience (months)
           Frequency   Percent   Valid Percent   Cumulative Percent
Valid
  missing      24         5.1         5.1                 5.1
  2             4          .8          .8                 5.9
  3             5         1.1         1.1                 7.0
  4             4          .8          .8                 7.8
  5            12         2.5         2.5                10.3
  6             7         1.5         1.5                11.8
  [The table continues in the same way for every distinct value of previous experience up to 476 months; most values beyond 60 months occur only once or twice. Table abridged here.]
  476           1          .2          .2               100.0
  Total       474       100.0       100.0
Minority Classification
          Frequency   Percent   Valid Percent   Cumulative Percent
Valid
  No         370        78.1        78.1               78.1
  Yes        104        21.9        21.9              100.0
  Total      474       100.0       100.0
The previous experience variable or field shows 24 records with the value labeled "missing." It is very important to know whether we have missing data, and how many records in a field are missing, because if we have a lot of missing data our analysis may be biased, since we have left people out. In practice, however, it is difficult if not impossible to fill in the missing data records. There are ways of estimating values that can fill in for missing data; your text does mention these methods.
What you need to do now is identify 5 variables in your data
set for which you will run the Frequency procedure.
Please proceed to read the online instructions at
instructors.husson.edu/cookc/marketingcourses to find
specific instructions for completing Lab 2.
Case Summaries:
We can also list the data for the individual cases for all the
variables or fields in our data set. In order to do this, SPSS
has a procedure on the Analyze dropdown menu. The Case
Summary procedure is under Reports. If you click on Reports
and then Case Summary a screen will appear that will ask you
to identify the variables or fields for which you want case
summaries. You can list the data for all the records or you
can specify a specific number of records such as the first 20
records. Using the procedure, we can see below the data for
the first 20 records in the Employee.sav data set for the
specified variables. This procedure is helpful to get a listing of
all the data and identify the record with data that is in error.
We then might be able to track down the source of the error
and make changes in our data file.
Summarize

Notes
Output Created: 2009-09-28T09:24:11.044
Input Data: C:\Documents and Settings\cookc\Desktop\employee.sav
Active Dataset: DataSet1
File Label: 05.00.00
Filter: <none>
Weight: <none>
Split File: <none>
N of Rows in Working Data File: 474
Definition of Missing: For each dependent variable in a table, user-defined missing values for the dependent and all grouping variables are treated as missing.
Cases Used: Cases used for each table have no missing values in any independent variable, and not all dependent variables have missing values.
Syntax:
SUMMARIZE
  /TABLES=gender bdate salary salbegin
  /FORMAT=VALIDLIST NOCASENUM TOTAL LIMIT=20
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT.
Processor Time: 0:00:00.328
Elapsed Time: 0:00:01.328
[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav
Case Processing Summary (a)

                               Cases
                   Included        Excluded         Total
                   N    Percent    N    Percent     N    Percent
Gender             20   100.0%     0     .0%        20   100.0%
Date of Birth      20   100.0%     0     .0%        20   100.0%
Current Salary     20   100.0%     0     .0%        20   100.0%
Beginning Salary   20   100.0%     0     .0%        20   100.0%

a. Limited to first 20 cases.
Case Summaries (a)

        Gender   Date of Birth   Current Salary   Beginning Salary
1       Male     2/03/1952       $57,000          $27,000
2       Male     5/23/1958       $40,200          $18,750
3       Female   7/26/1929       $21,450          $12,000
4       Female   4/15/1947       $21,900          $13,200
5       Male     2/09/1955       $45,000          $21,000
6       Male     8/22/1958       $32,100          $13,500
7       Male     4/26/1956       $36,000          $18,750
8       Female   5/06/1966       $21,900          $9,750
9       Female   1/23/1946       $27,900          $12,750
10      Female   2/13/1946       $24,000          $13,500
11      Female   2/07/1950       $30,300          $16,500
12      Male     1/11/1966       $28,350          $12,000
13      Male     7/17/1960       $27,750          $14,250
14      Female   2/26/1949       $35,100          $16,800
15      Male     8/29/1962       $27,300          $13,500
16      Male     11/17/1964      $40,800          $15,000
17      Male     7/18/1962       $46,000          $14,250
18      Male     3/20/1956       $103,750         $27,510
19      Male     8/19/1962       $42,300          $14,250
20      Female   1/23/1940       $26,250          $11,550
Total N 20       20              20               20

a. Limited to first 20 cases.
Please proceed to read the online instructions at
instructors.husson.edu/cookc/marketingcourses to find
specific instructions for completing Lab 2.
Chapter 4 (Lab 3)
Descriptive Analysis:
You may have the need to look at various kinds of descriptive
data such as means, standard errors, sums, maximums,
minimums, standard deviations, variances, medians, modes,
etc. These concepts you learned about in MS132 or Mat132.
Calculation of a mean can only be carried out on data measured
as interval or ratio data. For instance, calculating a mean
gender or mean minority status is not very sensible. So, be
careful to choose variables that are measured on a continuous
basis (interval or ratio measurement). In the Employee.sav
dataset, prevexp, salary, salbegin, and educ are measured on at
least an interval scale. When we use salary and salbegin, we
must remember to change the type to numeric from dollar.
These two variables have the Type of dollar. This means there
is a dollar sign ($) included in the field. Some procedures cannot
process this ($). It is necessary to change the type to numeric so
we can work with the fields in such a way to do math with
these fields (such as take an average). So, we can calculate a
mean. We cannot calculate a mean on fields such as gender,
jobcat, and minority for example. However, we could still look
at medians, mode, maximum, minimum, or any descriptive
statistic that is meaningful to analyze on count data with fields
such as gender, jobcat, and minority.
There are several procedures that provide descriptive statistics.
We have already seen that these descriptive statistics can also
be calculated using the frequency procedure.
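As a sketch of that option (educ is used here only as an example of a suitable field), the Frequency procedure can return the statistics while suppressing the long value table:

FREQUENCIES VARIABLES=educ
  /STATISTICS=MEAN MEDIAN MODE STDDEV MINIMUM MAXIMUM
  /FORMAT=NOTABLE
  /ORDER=ANALYSIS.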
Explore Procedure:
The Explore procedure allows us to take a field measured on an interval or ratio scale and look at the average of that field by another field measured on a nominal or ordinal scale. Go to Analyze, Descriptive Statistics, and Explore to find the procedure.
You will want to choose a continuous variable such as salary to
analyze. Continuous fields are measured as an interval or ratio
scale. Then choose a factor which should be a variable
measured on a nominal or ordinal scale. You may analyze more
than one continuous variable at a time but make sure that the
variables are continuous or measured on a ratio or interval
scale and the factors are measured on a nominal or ordinal
scale.
You will see that the descriptive statistics for the continuous
variables are printed for each value of the factors. This allows
you to see the value of statistics for each value of a factor. For
example, you may feel that average salary will differ by level of
education, gender or minority status. Explore will tell you what
the average salary is for each value of gender, minority status
or any other nominal or ordinal measured variable (called
factors).
These are the descriptive statistics available: mean, 95% confidence interval, 5% trimmed mean, median, variance, standard deviation, minimum, maximum, range, interquartile range, skewness, and kurtosis.
Let's say we want to look at salary by the minority field or factor (the run shown uses salary; salbegin could be added the same way). We would use the Explore procedure, and our results would look like the following output.
Explore

Notes
Output Created: 2009-09-28T11:25:00.840
Input Data: C:\Documents and Settings\cookc\Desktop\employee.sav
Active Dataset: DataSet1
File Label: 05.00.00
Filter: <none>
Weight: <none>
Split File: <none>
N of Rows in Working Data File: 474
Definition of Missing: User-defined missing values for dependent variables are treated as missing.
Cases Used: Statistics are based on cases with no missing values for any dependent variable or factor used.
Syntax:
EXAMINE VARIABLES=salary BY minority
  /PLOT BOXPLOT STEMLEAF
  /COMPARE GROUP
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.
Processor Time: 0:00:01.078
Elapsed Time: 0:00:03.296
[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav
Minority Classification

Case Processing Summary
                                              Cases
                 Minority          Valid           Missing          Total
                 Classification    N    Percent    N    Percent     N    Percent
Current Salary   No                370  100.0%     0    .0%         370  100.0%
                 Yes               104  100.0%     0    .0%         104  100.0%
Descriptives
Minority Classification                                    Statistic     Std. Error
Current Salary  No   Mean                                  $36,023.31    $938.068
                     95% Confidence     Lower Bound        $34,178.68
                     Interval for Mean  Upper Bound        $37,867.94
                     5% Trimmed Mean                       $34,094.52
                     Median                                $29,925.00
                     Variance                              3.256E8
                     Std. Deviation                        $18,044.096
                     Minimum                               $15,750
                     Maximum                               $135,000
                     Range                                 $119,250
                     Interquartile Range                   $16,200
                     Skewness                              1.896         .127
                     Kurtosis                              4.256         .253
                Yes  Mean                                  $28,713.94    $1,119.984
                     95% Confidence     Lower Bound        $26,492.72
                     Interval for Mean  Upper Bound        $30,935.17
                     5% Trimmed Mean                       $27,092.63
                     Median                                $26,625.00
                     Variance                              1.305E8
                     Std. Deviation                        $11,421.638
                     Minimum                               $16,350
                     Maximum                               $100,000
                     Range                                 $83,650
                     Interquartile Range                   $7,125
                     Skewness                              3.749         .237
                     Kurtosis                              18.249        .469
Current Salary
Stem-and-Leaf Plots

Current Salary Stem-and-Leaf Plot for minority= No

 Frequency    Stem &  Leaf
     22.00       1 .  5566666667777888999999
     84.00       2 .  000000000011111111111111111122222222222222222222223333333333333444444444444444444444
     79.00       2 .  5555555555555555566666666666666777777777777777777788888888888899999999999999999
     62.00       3 .  00000000000000000000011111111111112222223333333333333444444444
     26.00       3 .  55555556666677777788889999
     18.00       4 .  000000001122233334
     11.00       4 .  55556667788
     12.00       5 .  001112234444
     12.00       5 .  555556667889
      8.00       6 .  00001112
     36.00 Extremes    (>=65000)

 Stem width:  10000
 Each leaf:   1 case(s)

Current Salary Stem-and-Leaf Plot for minority= Yes

 Frequency    Stem &  Leaf
      5.00       1 .  66677
      6.00       1 .  899999
      7.00       2 .  0001111
      9.00       2 .  222222333
     16.00       2 .  4444444444555555
     22.00       2 .  6666666666666777777777
      8.00       2 .  88888999
     16.00       3 .  0000000000011111
      1.00       3 .  3
      5.00       3 .  45555
      1.00       3 .  6
      1.00       3 .  8
      1.00       4 .  0
      6.00 Extremes    (>=43950)

 Stem width:  10000
 Each leaf:   1 case(s)
As you can see, a lot of information is provided, including stem-and-leaf plots and box plots, which you learned about in MS132 or Mat132. If we take a look at the numeric descriptive results, we see that they are given for salary by the two categories of minority status: yes and no. We can see the mean, variance, median, etc. We can then look at differences in the interval/ratio measured field by levels of some factor we think may be important, to see whether there are statistically significant differences among the levels of that factor.
Descriptive Procedure:
Perhaps you want to look at the descriptive numerics for the
entire data set. You may use the Descriptive procedure to do
this. Go to Analyze, Descriptive Statistics, and Descriptives.
Select the variables for which you want descriptive statistics
and run the procedure.
Descriptives

Notes
Output Created: 2009-09-28T11:34:21.987
Input Data: C:\Documents and Settings\cookc\Desktop\employee.sav
Active Dataset: DataSet1
File Label: 05.00.00
Filter: <none>
Weight: <none>
Split File: <none>
N of Rows in Working Data File: 474
Definition of Missing: User defined missing values are treated as missing.
Cases Used: All non-missing data are used.
Syntax:
DESCRIPTIVES VARIABLES=salbegin
  /STATISTICS=MEAN STDDEV MIN MAX.
Processor Time: 0:00:00.031
Elapsed Time: 0:00:00.017
[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav
Descriptive Statistics
                     N     Minimum   Maximum   Mean         Std. Deviation
Beginning Salary     474   $9,000    $79,980   $17,016.09   $7,870.638
Valid N (listwise)   474
We then have descriptive statistics such as the number of
observations (N), the minimum value in the data set, the
maximum value in the data set, the mean, and the standard
deviation. These are the default descriptive values provided.
You can specify others if you want: sum, variance, range,
standard error of the mean, skewness, and kurtosis.
Remember that it makes no sense to compute some descriptive statistics on some variables. You can only compute statistics such as means on interval or ratio measured variables. So be careful; otherwise the results of your analysis will be bogus.
Please proceed to read the online instructions at
instructors.husson.edu/cookc/marketingcourses to find specific
instructions for completing Lab 3.
Chapter 5 (Lab 4)
One-Sample t or z Hypothesis Testing:
One sample tests mean that we select our variable to test from one sample; we are not comparing two or more samples. We do have to have a number to compare against, however. This number is some value against which it is appropriate to test our variable. It may be some value that represents what we think exists in the population, or it may be some hypothetical value.
We need to calculate the sample statistic (either a mean or
proportion) and compare it to this hypothetical value.
In SPSS, it is easier to calculate a mean statistic from the
sample to compare to our test number which is also a mean.
In other statistical programs, we can more easily compare a
sample proportion to a test number (proportion) or we can
do the test for comparing a sample proportion to a test
proportion by hand.
In the Employee.sav data set, there are fields or variables for
which we can calculate a mean. We can also hypothesize
some test number to compare this mean against. We can
use variables like educ, salary, salbegin, and prevexp as these
variables are measured more or less continuously. Means are
calculated on data which is at least interval. In the
Employee.sav data set we do not have variables measured
using an interval scale, but we do have variables measured
on a ratio scale and therefore we can calculate means. We
would also be able to calculate a mean on interval data. Your
data set online contains interval data for which you can also
calculate a mean.
Let’s calculate mean salary for the Employee data set and
test this mean salary against the mean salary for the entire
industry that includes the company from which the
Employee.sav data were taken. Let’s say that the mean salary
for this industry is $40,000. We want to know if the mean
salary in the company is significantly different from the mean
salary in the industry which is $40,000.
This is a one sample test. We can do a one-tailed or two-tailed test with this one sample. SPSS automatically does a two-tailed test. So our null (Ho) and alternative (Ha) hypotheses would be:
Ho: μ = $40,000
Ha: μ ≠ $40,000
There are two kinds of one-tailed test: left sided and right
sided. Left sided means we are on the left side of the mean in
the z or t distributions. Right sided means we are on the right
side of the mean in the z or t distribution. We would then
have to set up left sided or right sided tests.
Remember that, no matter what kind of test, the = sign is always part of the null hypothesis (Ho). So the null might be stated as μ ≤ value, where value is some number we are testing, or μ ≥ value. The alternative (Ha) will then be either μ > value or μ < value.
Let's take the salary data and test to see if there is a significant difference from the industry average of $40,000. In SPSS, go to Analyze, Compare Means, One-Sample T Test, and use salary as your sample variable and $40,000 as the test value.
T-TEST
/TESTVAL=40000
/MISSING=ANALYSIS
/VARIABLES=salary
/CRITERIA=CI(.9500).
T-Test

Notes
Output Created: 2009-09-30T11:43:33.519
Input Data: C:\Documents and Settings\cookc\Desktop\employee.sav
Active Dataset: DataSet1
File Label: 05.00.00
Filter: <none>
Weight: <none>
Split File: <none>
N of Rows in Working Data File: 474
Definition of Missing: User defined missing values are treated as missing.
Cases Used: Statistics for each analysis are based on the cases with no missing or out-of-range data for any variable in the analysis.
Syntax:
T-TEST
  /TESTVAL=40000
  /MISSING=ANALYSIS
  /VARIABLES=salary
  /CRITERIA=CI(.9500).
Processor Time: 0:00:00.031
Elapsed Time: 0:00:00.109
[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav
One-Sample Statistics
                 N     Mean         Std. Deviation   Std. Error Mean
Current Salary   474   $34,419.57   $17,075.661      $784.311
One-Sample Test
                                 Test Value = 40000
                                                                95% Confidence Interval
                                                                   of the Difference
                 t        df    Sig. (2-tailed)   Mean Difference   Lower        Upper
Current Salary   -7.115   473   .000              $-5,580.432       $-7,121.60   $-4,039.27
We are given the t value, the degrees of freedom (df), the two-tailed probability level (Sig. (2-tailed)), the mean difference (the difference between the mean in the sample and the test value of $40,000), and the confidence interval at the 95% level.
We are most interested in the probability level or significance level, which is .000. This means that there is a highly significant difference between the sample mean and the test value. Remember that when we set a confidence level such as 95%, we are saying we have a 5% chance of making an error. If we find that the significance level is very low, smaller than the 5%, we can say that we have a small chance of being in error given the conditions of our hypothesis test. Since .000 is much smaller than 5%, we reject our null hypothesis, accept the alternative, and say that there is a significant difference between the mean salary and our test number. We may also want to know the direction of the difference. If you look at the mean difference,
we see that it is negative, so our mean is far lower than the
$40,000 test number. If you look at the One Sample Statistics
table above, you see that the actual sample mean is
$34,419.57 which is lower than the $40,000 test value.
However, the number is not only lower but significantly lower from a statistical point of view.
The computer output above gives a t value. But remember that the t distribution and z distribution give essentially the same probabilities as long as the sample size is 30 or greater. Our sample has a large number of people, so we are effectively calculating a z value even though the computer output does not say this. It is a bit confusing.
The calculations for the sample t and z statistics are given below. These are the formulas used by the computer as well.
t and z Statistics for Means and Proportions:
Means

$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$   or   $z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$
If we use a t statistic, we need to find a t score in the t table, just as we do for z scores. However, to find such a t score we need to calculate the degrees of freedom. In the case of comparing one sample mean to a hypothetical mean, we calculate the degrees of freedom as n - 1, where n is equal to the sample size.
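As a quick check, plugging the numbers from the One-Sample Statistics table above into this formula reproduces the t value on the printout:

$t = \frac{34{,}419.57 - 40{,}000}{17{,}075.661/\sqrt{474}} = \frac{-5{,}580.43}{784.31} = -7.115$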
Remember that we compare the calculated t or z score above with the table t or z score to determine whether we reject the null hypothesis. For a two-tailed test, if the calculated score is more extreme than the table score in either direction, then we reject the null and accept the alternative. For a left-sided one-tailed test, the calculated t or z score must be more negative than the table t or z score in order to reject the null and accept the alternative hypothesis. For a right-sided one-tailed test, the calculated t or z score must be more positive than the table t or z score to reject the null and accept the alternative hypothesis.
In our Employee.sav data set example, we rejected the null hypothesis and accepted the alternative because the calculated t value is more negative than the table t value. From the computer output, we do not see the table t or z value, but the output does give the probability level, and we can easily tell from that probability level whether we accept or reject the null. Since the probability level was .000, which is much smaller than the stated alpha level of 5% (error level), we reject the null and accept the alternative. If the probability level had been higher than the 5%, such as 5.5%, we would have accepted the null hypothesis and rejected the alternative. In
that case, the mean sample salary would not have been
significantly different from the test number.
Please proceed to read the online instructions at
instructors.husson.edu/cookc/marketingcourses to find specific
instructions for completing Lab 4.
Chapter 6 (Lab 5)
Chi-Square:
In the previous lab, we were working with statistics based
upon the Normal Distribution. However, many statistics are
based upon other distributions. If a random variable is
measured nominally, we would not use the Normal
Distribution to find associated probabilities. We would have
to use other types of distributions.
This lab deals with variables that are counts. Although
counts are numeric, we are just counting the number of
times a particular event occurs (an event could be the
number of females in our sample, the number of people who
plan to vote a certain way, or the number of people who fall
in a particular category in which we have an interest).
Chi-square is a statistic based upon counts and uses a different probability distribution, called the Chi-square distribution. This distribution is one sided, unlike the Normal Distribution. The reason it is one sided is that we are only interested in positive values of Chi-square: we are squaring values anyway, so we would not get any negative values. Chi-square values are similar in spirit to z-scores. In using the Chi-square table (at the back of your text), you need to use the idea of degrees of freedom, as you did
for the t distribution. The degrees of freedom in this case depend upon the number of rows and columns in the table you construct, which contains the frequencies or counts of test units (such as individuals) who fall in certain cells of that table. The table is constructed by taking two nominally measured variables with different levels (such as gender, having the two levels male and female) and cross-tabulating the levels of these two variables. We will start with two variables, although you can cross-tabulate more than two.
The Chi-square statistic is:

$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$

where $f_e$ is the expected frequency and $f_o$ is the observed frequency. We find the observed frequencies in each cell of our table. The expected frequencies we have to calculate for each cell in the table: we take the appropriate row and column totals, multiply them together, and divide by the grand total (the total number in our table). We do this for each cell in the table and add the results together. This number is the calculated Chi-square. As in the case of the t or z distributions, we have to find the table value, here the table Chi-square. To find the table Chi-square we need a probability level and the degrees of freedom.
Then, given a certain probability level and the degrees of freedom, which equal the number of rows in the table minus one times the number of columns minus one, we look up the table Chi-square value.
As before (z or t tests), we compare the calculated Chi-square to the table Chi-square. If the calculated value is greater than the table value, we reject our null hypothesis, which in this case is: there is no association between the variables. If we reject the null hypothesis, we can then accept the alternative hypothesis, which is: there is an association between the variables.
Let's take a look at the Employee.sav data set to find two variables to cross-tabulate, and then calculate a Chi-square statistic on this cross-tabulation. We need two nominally measured variables. We can use minority and jobcat, as both are measured nominally. We would then construct a 3 by 2 table: there are 3 levels of job category and two levels of minority status. We want to know if the two variables are related in any way. The Chi-square test will tell us if these two variables are statistically related, but it will not specify how. It does not tell us if one variable causes the other. The variables are only related through some fact of causation, and we do not know what that fact is. There could be
some third variable actually causing the relationship we see
and if we included that third variable the relationship
between the first two variables would disappear.
In SPSS we could go to Analyze, Descriptive Statistics, and
then Crosstabs.
We would then select the variables we want to crosstab, which in this case are jobcat and minority. Our hypothesis may be that there is discrimination in this company: that it tends to hire more minority people into the lower paying jobs, such as clerical and custodial positions.
So the null hypothesis is: There is no relationship between
minority status and job category. The alternative hypothesis
is: There is a relationship between minority status and job
category.
So to test this, we will run a Chi-square test.
CROSSTABS
/TABLES=jobcat BY minority
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT EXPECTED COLUMN
/COUNT ROUND CELL.
Crosstabs
[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav
Case Processing Summary
                                          Cases
                              Valid            Missing          Total
                              N    Percent     N    Percent     N    Percent
Employment Category *
Minority Classification       474  100.0%      0    .0%         474  100.0%

Employment Category * Minority Classification Crosstabulation
                                                  Minority Classification
                                                  No       Yes      Total
Employment   Clerical    Count                    276      87       363
Category                 Expected Count           283.4    79.6     363.0
                         % within Minority        74.6%    83.7%    76.6%
             Custodial   Count                    14       13       27
                         Expected Count           21.1     5.9      27.0
                         % within Minority        3.8%     12.5%    5.7%
             Manager     Count                    80       4        84
                         Expected Count           65.6     18.4     84.0
                         % within Minority        21.6%    3.8%     17.7%
Total                    Count                    370      104      474
                         Expected Count           370.0    104.0    474.0
                         % within Minority        100.0%   100.0%   100.0%

Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             26.172a   2    .000
Likelihood Ratio               29.436    2    .000
Linear-by-Linear Association   9.778     1    .002
N of Valid Cases               474

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.92.
The thinking is that minority status determines job category, although there may be some other factor really causing this relationship. In any case, the relationship is significant. If you look at the Pearson Chi-square value above (26.172), we see that the significance level for a two sided test is .000, which is far lower than the standard probability of 0.05. This means that we are very unlikely to get this high a level of Chi-square by chance. So we can reject the null, accept the alternative, and conclude that there is some kind of relationship between minority status and job category.
Remember that there could be a third variable causing this
relationship. So we might add a third variable to our analysis
such as education level. However, in this data set education
is not measured nominally so we would have to recode that
data into a new variable where education would have a
nominal classification.
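As a sketch of such a recode (the cut point of 12 years, the name edcat, and the labels are illustrative assumptions, not values from the case):

RECODE educ (LOWEST THRU 12=1) (13 THRU HIGHEST=2) INTO edcat.
VALUE LABELS edcat 1 'High school or less' 2 'Beyond high school'.
EXECUTE.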
Example: Calculation of Chi-square by Hand:
We have two variables, smoking and gender, and we want to
know if there is an association between the two variables.
We have found the frequencies within each cell as follows:
                    Smoking
Gender        Yes      No       Total
Male          100      135      235
Female         75      140      215
Total         175      275      450
We need to calculate the expected values for the four cells (non-total cells) and then calculate the Chi-square statistic. Then we need to compare that to a table value associated with an alpha level or confidence level. We get the expected frequencies for each cell as follows:

Cell 1: (235*175)/450 = 91.39     Cell 2: (215*175)/450 = 83.61
Cell 3: (235*275)/450 = 143.61    Cell 4: (215*275)/450 = 131.39
2 
(100  91.39) 2 (75  83.61) 2 (135  143.61) 2 (140  131.39) 2



 .81  .89  .52  .56  2.78
91.39
83.61
143.61
131.61
We must find the Chi-square table value to compare this number against. To use the Chi-square distribution it is necessary to find the degrees of freedom. For this type of test, the degrees of freedom are the number of rows minus one times the number of columns minus one, or (r-1)(c-1). Here we have two rows and two columns, so the d.f. are (2-1)(2-1)=1. We also need to choose the confidence level; let us choose the 95% level. Remember that this table is one sided only. We find that the value associated with 1 d.f. and the 95% confidence level is 3.841. If the calculated value were equal to or greater than this number, we could reject the null and accept the alternative. However, it is not greater in this case, so we accept the null and reject the alternative. There is no association between gender and smoking.
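As a sketch, this same hand example can be checked in SPSS by entering the four cell counts and weighting by them (the variable names gender, smoking, and n are made up for this illustration):

DATA LIST LIST /gender (A6) smoking (A3) n.
BEGIN DATA
Male Yes 100
Male No 135
Female Yes 75
Female No 140
END DATA.
WEIGHT BY n.
CROSSTABS
  /TABLES=gender BY smoking
  /STATISTICS=CHISQ.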
Chapter 7 (Lab 6)
Two Group or Two Sample t test or z test:
Sometimes we want to compare two means or two
proportions from two different samples or compare two
groups taken from one sample. Why do we want to do this?
Because we want to compare one mean or proportion to
another to see if there is a significant difference between the
two groups.
Just because one number is higher or lower than another does not mean that number is statistically significantly different from the other. We have to perform a statistical test to find out. After all, we want to make decisions on the basis of the best information available. Hypothesis testing allows us to back our decisions with a high probability of being correct if we reject the null hypothesis of no difference between the two means or proportions.
Whether we use a t test or z test depends upon the sample size. If the two groups or samples each have 30 or more observations, we can use the z test whether or not we know the standard deviations of the populations from which the samples or groups were taken. If our sample sizes are less than 30 and we do not know the population standard deviations, then we need the t test. If our sample sizes are less than 30 but we know the population standard deviations, we can still use the z test.
We can also compute confidence intervals for the difference
between means and proportions.
We also assume that the two populations from which our samples come are independent. If they are not, we have to use a special test to compare the two samples.
Confidence Interval Formulas for the Difference between
Means and Proportions for Independent Samples:
Here are the formulas for the calculation of confidence
intervals for the difference between means and proportions.
Difference Between Two Means: Confidence Intervals

1. Large sample case with standard deviation of the populations known

$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

2. Large sample case with standard deviation of the populations unknown

$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

3. Small sample case with standard deviation of the populations known

$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2} \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$

4. Small sample case with the standard deviation of the populations unknown for the two samples, but the standard deviations are assumed equal and are estimated by the sample standard deviations

$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

where

$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$

and the degrees of freedom are $n_1 + n_2 - 2$.

5. Small sample case with the standard deviation of the populations unknown and unequal

$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2} \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

with degrees of freedom of

$\frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{(s_1^2/n_1)^2}{n_1 - 1} + \frac{(s_2^2/n_2)^2}{n_2 - 1}}$

which is a lot of work to calculate!
Difference Between Two Proportions: Confidence Intervals
The difference between two proportions is always a z test. We assume we have large enough sample sizes to use the normal distribution. If the sample size is small, we would have to use the binomial distribution. As we have not covered the binomial distribution in this class, we will use the z test only.

$(p_1 - p_2) \pm z_{\alpha/2} \sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$

We can find the corresponding hypothesis tests from the above formulas. Remember that these tests are for the difference between two means or the difference between two proportions.
Hypothesis Testing Formulas for the Difference Between Two Means and Proportions for Independent Samples:

Difference Between Two Means

1. Large sample case with standard deviation of the populations known

$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

2. Large sample case with standard deviation of the populations unknown

$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$

3. Small sample case with standard deviation of the populations known

$z = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$

4. Small sample case with the standard deviation of the populations unknown for the two samples, but the standard deviations are assumed equal and are estimated by the sample standard deviations

$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

5. Small sample case with the standard deviation of the populations unknown and unequal

$t = \frac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Difference Between Two Proportions: Hypothesis Testing
The test for the difference between two proportions is a large sample test. For smaller samples, the binomial distribution may be used.

$z = \frac{(p_1 - p_2) - D_0}{\sqrt{\bar{P}(1 - \bar{P})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$

where

$\bar{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$

For hypothesis testing, we calculate the standard error of the difference between the proportions using this pooled proportion.
Use of the Formulas for Hypothesis Testing:
We can use the above formulas to test any hypotheses that fit
the particular situations for which the formula applies. Let’s
look at these situations: A. Large sample case with standard
deviation of the populations unknown, B. Small sample case
with the standard deviation of the populations unknown for the
two samples but the standard deviations are assumed equal
and are estimated by the sample standard deviations, and C.
The difference between proportions.
Large Sample Case with Standard Deviation of the Populations Unknown:
We want to know if there is a significant difference in beginning salary between men and women who have recently joined a local firm. We might think that this firm discriminates against women. However, we have to be careful in our analysis, as other factors may actually cause any real difference between the beginning salaries of men and women. We find the following:
The mean beginning salary for men is $31,083 and the mean
beginning salary for women is $29,745. The associated sample
standard deviations are $2312 for the sample of men and
$2569 for the sample of women. We have a sample of 40 for
each group.
Our test is:

$z = \frac{(31{,}083 - 29{,}745) - 0}{\sqrt{\frac{(2312)^2}{40} + \frac{(2569)^2}{40}}} = 2.45$
Using a two-tailed test (we could use a one-tailed test depending on how we word the hypothesis), for α=0.05 we find that the z score has to be less than -1.96 or greater than 1.96 for there to be a significant difference in mean beginning salaries. Since 2.45 exceeds 1.96, we can feel confident that there is a significant difference between the two groups, but we really do not know why there is a difference. It could be discrimination; it could be that men have higher entry qualifications; it could be that women have a lower education level; or some other reason.
Let’s take a look at a similar problem using SPSS. We will use
our Employee.sav data set.
We have the variable Beginning Salary and the variable Gender.
Our hypothesis is as in our example above. In our Employee.sav
data set, we have 474 observations. We want to run the independent sample z-test. The SPSS program will compute this, but the test is found under the independent sample t-test procedure. The SPSS program will run a t-test or a z-test based upon sample sizes. The output will also give you results under two different assumptions: equal variances and unequal variances. It is usual that we will not know the standard deviations of the populations and will be working with samples. This procedure covers all the situations we have identified, except for samples that are not independent. We would have to use another procedure for tests of the difference between proportions.
We would go to Analyze, then select Independent-Samples T Test. A screen will appear, and we will select the test variable (dependent variable), which has to be measured on at least an interval basis (not a nominal variable). We also need an independent variable that must not have more than two categories; if it has more than two categories, we would have to use one-way ANOVA. So, using the independent sample t-test, we are just comparing two groups. We also have a Define Groups box in which we have to give the procedure the designations for the categorical variable. In our case, we would have to identify m for male gender and f for female gender. Some data sets may use 1 for female and 2 for male or some other designation. But we have to KNOW what these designations are.
As we see below, we have filled in the Test Variable(s), Grouping Variable, and Define Groups boxes.
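For reference, a sketch of the equivalent syntax for this run (assuming gender is coded with the strings 'f' and 'm', as in Employee.sav):

T-TEST GROUPS=gender('f' 'm')
  /VARIABLES=salbegin
  /CRITERIA=CI(.9500)
  /MISSING=ANALYSIS.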
Now we can run our procedure. The results show that there is a significant difference between male and female salaries. The
two-tailed sig (significance) level is .000 which means that this
is less than 0.05 or 0.01 probability levels. The computer
calculates out to three decimal places. Now, with this
information, we could do more sophisticated analysis which
would try to control for other variables that might explain the
difference in beginning salary.
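Outside of SPSS, the same two rows of output can be reproduced with SciPy's ttest_ind. The salary arrays below are simulated stand-ins (we cannot reproduce the Employee.sav data here), so the exact numbers will differ from the SPSS output:

```python
import numpy as np
from scipy.stats import ttest_ind

# Simulated stand-ins for the Employee.sav salary columns (hypothetical data)
rng = np.random.default_rng(424)
men = rng.normal(loc=31083, scale=2312, size=40)
women = rng.normal(loc=29745, scale=2569, size=40)

# SPSS prints one row assuming equal variances and one row not assuming them
t_eq, p_eq = ttest_ind(men, women, equal_var=True)        # pooled-variance t-test
t_uneq, p_uneq = ttest_ind(men, women, equal_var=False)   # Welch's t-test

print(f"equal variances assumed:     t = {t_eq:.2f}, p = {p_eq:.4f}")
print(f"equal variances not assumed: t = {t_uneq:.2f}, p = {p_uneq:.4f}")
```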
Small Sample Case with Standard Deviation of the Populations
Unknown, Equal Variances Assumed:
Our Malhotra text gives us an example of two groups, one adult
and one teenager. We want to know if there are differences
between the two groups in amusement park preferences.
There are ten respondents in each sample. The mean
amusement park score for the adults was 4 and the mean
amusement park score for the teenagers was 5.5. The standard
deviation for the adult group was 1.080 and the standard
deviation for the teenagers was 1.054. We could do a test of
the equality of variances. Computer outputs often give this
anyway. We will assume that the variances are equal. We will
pool the variances in this case. Each variance is the standard
deviation squared. The pooled variance and standard deviation
are:
s^2 = \frac{(10-1)(1.166) + (10-1)(1.111)}{10 + 10 - 2} = 1.139

s_{\bar{x}_1 - \bar{x}_2} = \sqrt{1.139\left(\frac{1}{10} + \frac{1}{10}\right)} = 0.477
The t-test is:
t = \frac{5.5 - 4}{0.477} = 3.14
with 18 degrees of freedom. Thus, using the t-distribution for
18 degrees of freedom, we find that the critical value in the
t-table is 2.101 for a two-tailed test at α=0.05. Since 3.14 is
greater than 2.101, the null hypothesis of equal means is
rejected. Now, if we had the raw data, this is the same result
we would get if we used the independent sample t-test in SPSS.
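If only the summary statistics are available, as in this example, SciPy can run the pooled t-test directly from them; a minimal sketch (again assuming SciPy rather than SPSS):

```python
from scipy.stats import ttest_ind_from_stats

# Summary statistics from the amusement-park example above
t, p = ttest_ind_from_stats(mean1=5.5, std1=1.054, nobs1=10,  # teenagers
                            mean2=4.0, std2=1.080, nobs2=10,  # adults
                            equal_var=True)  # pool the variances, 18 df
print(f"t = {t:.2f}, p = {p:.4f}")   # t = 3.14, which exceeds 2.101
```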
Hypothesis Test of the Difference in Proportions:
Malhotra gives us an example where we have two independent
samples that give the percentage of users of jeans in the United
States and Hong Kong. We interview a sample of 200
customers in each area and find that 80% of customers in the
US and 60% of customers in Hong Kong use jeans. Is there a
significant difference between these proportions?
The z test is:
z = \frac{(0.8 - 0.6) - 0}{\sqrt{0.7 \times 0.3 \left(\frac{1}{200} + \frac{1}{200}\right)}} = 4.36
Using a two-tailed test, the critical z-score is +/- 1.96. Since
4.36 is greater than 1.96, there is a significant difference
between the two groups. Now, SPSS does not have a procedure
where we can make this calculation directly as a difference
between two proportions. However, we can use a Chi-square
test and get the same result.
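The z test for two proportions is also easy to compute in a few lines of Python, using the pooled proportion exactly as in the formula above:

```python
from math import sqrt
from scipy.stats import norm

p1, n1 = 0.80, 200   # proportion of jeans users in the US sample
p2, n2 = 0.60, 200   # proportion of jeans users in the Hong Kong sample

p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)   # pooled proportion = 0.7
se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = 2 * norm.sf(abs(z))   # two-tailed p-value

print(f"z = {z:.2f}, p = {p_value:.6f}")   # z = 4.36, clearly significant
```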
Paired Sample t-test:
If the two samples in question are dependent (paired), we have
to use a different procedure. We can use the paired samples
t-test or, alternatively, the Chi-square procedure.
The paired-samples t-test is a t-test with n-1 degrees of
freedom and is given by:
t_{n-1} = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}}
where \bar{D} is the mean of the differences between the pairs
of observations and s_D is the standard deviation of those
differences. The standard deviation is calculated by taking each
paired difference, subtracting the mean of the differences,
squaring the result, summing over all paired observations,
dividing by n-1, and taking the square root. This value is then
divided by the square root of the sample size, n. We do not
have paired data to work with in our data sets, so we will not
be using this test in SPSS. Remember, you can also do this test
using Chi-square when you have a large sample size.
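Although our data sets contain no paired data, the test is one line in Python once you have two matched columns. The before/after ratings below are invented purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_rel

# Invented before/after ratings for the same ten respondents
before = np.array([4, 5, 3, 6, 4, 5, 4, 3, 5, 4])
after  = np.array([6, 6, 4, 7, 5, 6, 5, 4, 6, 5])

# Paired-samples t-test with n-1 = 9 degrees of freedom
t, p = ttest_rel(after, before)
print(f"t = {t:.2f}, p = {p:.4f}")
```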
Chapter 8 (Labs 7 and 8)
ANALYSIS OF VARIANCE (ANOVA):
If we have more than two means and if the dependent
variable is measured at least on an interval basis and we
have one independent variable measured on a categorical
basis, we will be using One-Way Analysis of Variance.
There are many other types of ANOVA such as: the completely
randomized design, the randomized block design, and factorial
designs.
A completely randomized design has one dependent variable
and one categorical variable. It assumes that there is no
source of variation other than the categorical variable.
However, there may be other variables that can affect the
dependent variable. If this is the case, we use a randomized
block or factorial design. The randomized block design has
one other categorical variable. Observations are randomly
assigned to
the different combination of the levels of the two categorical
variables. Factorial designs can have more than two
controlling variables and allow for interaction effects. It is
possible for a particular level of one variable to have a
positive or negative effect on the dependent variable for a
particular level of another variable. If this is the case then we
need to use a factorial design. We can also have
independent variables that are not categorical. In this case,
the analysis is called analysis of covariance (ANCOVA). There
are other
models of ANOVA as well such as Repeated Measures
ANOVA. We will just cover One-Way ANOVA, learn how to do
One-Way ANOVA in SPSS, and try a factorial design as well,
although we will not do the math for this design.
One-Way-ANOVA or the Completely Randomized Design:
In this design, the categorical variable is called a factor, and
the different levels of this factor are called treatments. We
want to know if the dependent variable varies by the
different treatments.
We first will have to decompose the total variation in the
dependent variable into the variation explained by the
independent variable and the error left over. So,
SStotal = SSbetween + SSwithin, where SStotal is the total
variation, SSbetween is the explained variation, and
SSwithin is the error variation, or the variation in the
dependent variable not explained by the factor or
independent variable. We are actually comparing means.
When we compare two means it is a t-test. When we have 3
or more means we use ANOVA. We have to be careful
because a larger number of categories for the independent
variable means comparing each mean to every other mean,
so we have to do a lot of comparisons. That could lead to
getting a significant result by random chance. In order to
lower this possibility, we would have to employ what are
called multiple comparison tests, which make it less likely
to get a significant difference by chance.
The null hypothesis for this test is that the means are not
significantly different from one another. The test for this is
an F test which uses an F distribution. This distribution is
defined by two degrees of freedom, one for the numerator
and one for the denominator in the formula. The formula is:
F = \frac{SS_x / (c-1)}{SS_{error} / (N-c)} = \frac{MS_x}{MS_{error}}
We have to know how to calculate the SS terms, or sums of
squares terms. The c-1 degrees of freedom for the numerator
is the number of treatments minus 1, and the N-c degrees of
freedom for the denominator is the total number of
observations minus the number of treatments.
Let’s see how we would calculate the sums of squares and
how we would use the F table to get the critical values.
Our experiment comes from Malhotra and shows the effect
of in-store promotion on sales. We have three levels of the
factor so we have three treatments of high, medium, and
low. Fifteen stores are randomly selected and assigned
randomly to the three levels. Sales have been converted to a
scale from 0 to 10.
The treatment (between-groups) sums of squares is calculated
by taking the mean within each treatment, subtracting off the
grand mean for the entire sample, squaring the result, and
multiplying each squared deviation by the associated number
of observations within each treatment. The means for the
High, Medium, and Low groups are 9, 5, and 4 respectively.
The grand mean is 6.
So

SS_x = 5(9-6)^2 + 5(5-6)^2 + 5(4-6)^2 = 70

And to get the error sums of squares, we take each observation
and subtract off the associated mean from the particular
treatment where the observation comes from. It is

SS_{error} = (10-9)^2 + (9-9)^2 + (10-9)^2 + (8-9)^2 + (8-9)^2 + (6-5)^2 + (4-5)^2 + (7-5)^2 +
(3-5)^2 + (5-5)^2 + (5-4)^2 + (6-4)^2 + (5-4)^2 + (2-4)^2 + (2-4)^2 = 28
The F-test then becomes:
F = \frac{70 / (3-1)}{28 / (15-3)} = 15.0
The 15 is the number of observations and the 3 is the
number of treatments.
In the F-table, we see that for 2 and 12 degrees of freedom,
the critical F is 3.89. Since 15.0 is greater than 3.89, we reject
the null hypothesis and state that mean differences exist. The
F distribution, like the Chi-square distribution, is a one-tailed
distribution.
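The fifteen observations can be read off the error sums-of-squares calculation above, so we can verify the whole ANOVA in a few lines of Python (SciPy again being our choice, not the course package):

```python
from scipy.stats import f_oneway

# Store sales scores by promotion level, taken from the calculation above
high   = [10, 9, 10, 8, 8]   # mean 9
medium = [6, 4, 7, 3, 5]     # mean 5
low    = [5, 6, 5, 2, 2]     # mean 4

f, p = f_oneway(high, medium, low)
print(f"F = {f:.1f}, p = {p:.5f}")   # F = 15.0 on 2 and 12 df, p < 0.05
```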
Using our Employee.sav data set, we think that beginning
salary varies by job category. We go to Analyze, Compare
Means, and One-Way Anova. We put beginning salary in the
dependent list and employment category as the factor.
When we run the ANOVA, we find a significant difference.
However, we really do not know which comparisons are the
significant ones. There are three possible comparisons. We
would have to do a multiple comparison test to figure out
which comparisons are significant. There are a number of
these tests. One is called Scheffe’s test, another is the
Bonferroni test, and another is Duncan’s Multiple Range
test.
There are different reasons for using each test, but that
analysis is beyond our scope of work. You can use either the
Scheffe’s test or the Bonferroni test as these are easy to read
from the computer output. Wherever there is an asterisk,
it means there is a significant difference. Let’s run a Scheffe’s
test. You will select Analyze, Compare Means, One-Way
ANOVA, and then Post Hoc. Then check Scheffe. You can
check any other Multiple Comparison test if you understand
why and how these are used.
The results below do show asterisks for several significant
comparisons. There is a significant difference between clerical
and management beginning salaries and between custodial
and management beginning salaries, but not between
custodial and clerical beginning salaries.
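SciPy does not implement Scheffe’s test, but recent versions do implement Tukey’s HSD, a related multiple-comparison procedure; here is a sketch run on the in-store promotion groups from the hand-worked example:

```python
from scipy.stats import tukey_hsd   # available in recent SciPy versions

# The three in-store promotion groups from the hand-worked example
high   = [10, 9, 10, 8, 8]
medium = [6, 4, 7, 3, 5]
low    = [5, 6, 5, 2, 2]

# Tukey's HSD adjusts the p-values for the three pairwise comparisons
result = tukey_hsd(high, medium, low)
print(result)   # pairwise mean differences, confidence intervals, p-values
```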
Other Analysis of Variance Methods:
The other ANOVA methods are beyond the scope of our text.
However, we can easily implement these methods in SPSS. For
example, we might want to carry out a factorial analysis of
some kind. We can run various factorial analyses by using the
Univariate Procedure in SPSS. If you go to Analyze, General
Linear Model, and Univariate, a screen will come up that will
allow you to put in one dependent variable but several types of
factors. These factors are called fixed factors, random factors,
and covariates. Factorial designs then allow different kinds of
factors as well as interaction terms. Our simple ANOVA does
not allow more than one fixed independent variable or any
interaction terms. Fixed means that the levels of the factor are
set by the researcher rather than drawn at random from a
larger population of possible levels.
There is also an ANOVA model, called MANOVA, where there
can be more than one dependent variable. This can happen
when we have several models where the data are
interdependent from one model to the next. Hypothesis
testing would be inaccurate without the use of such an
interdependent model.
Chapter 9 Correlation (Lab 9)
The Product-Moment Correlation:
There are many types of correlation. The correlation we will be
doing has both variables measured on at least an interval basis.
There are other types of correlation for other situations. There
also is something called partial correlation where we can see
the relationship between two variables while controlling for
other variables. The variables have to be measured on at least
an interval basis. We will look at the product-moment
correlation.
We have to be careful in our analysis. Just because two
variables are correlated does not mean one causes the other.
The correlation may exist just by chance. Thus it is necessary to have
a good theory to explain the correlation. It is like the person
who found an association between the density of storks and
the birth rate. One might conclude that storks bring babies. But
I do not think that is the way babies are created. So having a
good theory will allow us to get a handle on the correlation.
We can plot the relationship between the two variables using a
scatter plot. A scatter plot plots one variable on the x axis and
the other on the y axis.
The sample correlation coefficient is denoted as r and the
population correlation coefficient is denoted as the Greek letter
ρ or rho. The formula for the correlation coefficient is:
r = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y}) / (n-1)}{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n-1}} \sqrt{\dfrac{\sum (Y_i - \bar{Y})^2}{n-1}}}
The numerator is called the covariance between X and Y. The
denominator is the standard deviation of the X variable times
the standard deviation of the Y variable (the square root of the
variance of X times the square root of the variance of Y).
The above is easily calculated by forming a table, taking each X
value and subtracting off the mean of the X variable, taking
each Y value and subtracting off the mean of the Y variable,
multiplying each pair of deviations together, and summing over
all pairs of observations. Then the numerator is divided by n-1.
In the denominator, we take each X value, subtract off the
mean, square the result, and sum these squared deviations
over all X values, then divide the total by n-1. We do the same
for the Y values. Then we take these two values, multiply them
together, and take the square root. This number is then divided
into the numerator.
So if we think that there is a correlation between attitude
toward sports cars and duration of car ownership, then we
would proceed as follows.
\bar{X} = (10 + 12 + 12 + 4 + 12 + 6 + 8 + 2 + 18 + 9 + 17 + 2) / 12 = 9.333

\bar{Y} = (6 + 9 + 8 + 3 + 10 + 4 + 5 + 2 + 11 + 9 + 10 + 2) / 12 = 6.583

\sum (X_i - \bar{X})(Y_i - \bar{Y}) = (10 - 9.33)(6 - 6.58) + (12 - 9.33)(9 - 6.58) + \ldots = 179.6668

\sum (X_i - \bar{X})^2 = (10 - 9.33)^2 + (12 - 9.33)^2 + (12 - 9.33)^2 + \ldots = 304.6668

\sum (Y_i - \bar{Y})^2 = (6 - 6.58)^2 + (9 - 6.58)^2 + (8 - 6.58)^2 + \ldots = 120.9168

r = \frac{179.6668}{\sqrt{(304.6668)(120.9168)}} = 0.9361
The correlation coefficient varies from -1 to +1. The closer the
correlation coefficient is to -1 or +1, the stronger the linear
relationship between the variables. The significance of the
relationship is tested by a t-test with n-2 degrees of freedom.
Sometimes we will want to measure the amount of variation
explained in our model. This measure is called r squared and is
calculated by squaring r. It also equals (Total Variation - Error
Variation) / Total Variation. We know how to calculate error
variation and explained variation already. If we add the two
values together we get total variation. So it is fairly easy to
calculate the percent of variation explained.
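We can confirm the hand calculation with SciPy's pearsonr, which returns both r and the significance level from the t-test just described; r squared is then one line more:

```python
import numpy as np
from scipy.stats import pearsonr

# The X and Y values from the worked example above
x = np.array([10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2])
y = np.array([6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2])

r, p = pearsonr(x, y)
print(f"r = {r:.4f}, r squared = {r**2:.4f}, p = {p:.6f}")   # r = 0.9361
```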
We think that there is a relationship between beginning salary
and current salary. Go to Analyze, Correlate, Bivariate. Put
the variables we want to correlate in the variable box.
The results give the correlation as well as the significance level.
The variables are significantly correlated at the 0.000 level and
the correlation coefficient is 0.88. So there is a correlation
between beginning salary and current salary. Likely, when a
person starts a job with a larger beginning salary, any pay raises
occur on a bigger base, which would produce a correlation
between beginning salary and current salary. There may be
other reasons for this correlation, but theory would have to
provide sensible explanations.
Chapter 10 Regression (Lab 10)