Marketing Research today increasingly involves the need to understand global markets. This photo was taken in Shanghai, a major shopping and marketing center.

Lab Manual
Prepared by Dr. Cristanna Cook

PREFACE

The purpose of this lab manual is to enable student understanding of, and participation in, Marketing Research, BA 424. It is intended to act as a guide to hands-on problem development for the major topic areas in inferential statistics. It also provides the techniques for completing, in version 16.0 of SPSS, the same problems done by hand. By seeing a problem, working it by hand, and then solving it on the computer using a common statistical package, students will gain a more complete understanding of both the underlying concept and the technology available for solving such problems in the way commonly used in the business world. Students are expected to have had a basic course in statistics before taking BA 424.

Table of Contents

Chapter 1 (Probability)
  Probability Distributions
  The Normal Distribution
Chapter 2 (Lab 1)
  Understanding File Structure
Chapter 3 (Lab 2)
  Frequencies
  Case Summaries
Chapter 4 (Lab 3)
  Descriptive Statistics
Chapter 5 (Lab 4)
  Hypothesis Testing (One Sample)
Chapter 6 (Lab 5)
  Hypothesis Testing (Chi-Square)
Chapter 7 (Lab 6)
  Hypothesis Testing (Two Samples)
Chapter 8 (Labs 7 and 8)
  Hypothesis Testing (ANOVA)
  Hypothesis Testing (Two-Way ANOVA)
Chapter 9 (Lab 9)
  Hypothesis Testing (Correlation)
Chapter 10 (Lab 10)
  Hypothesis Testing (Simple and Multiple Regression)

Chapter 1 (Probability)

PROBABILITY DISTRIBUTIONS

Introduction:

Inferential statistics is based upon the idea that activities have probabilities attached to them. If we think about the possibility of rain tomorrow, we might ask: what is the probability that it will rain tomorrow? In like fashion, research activities often involve probabilities. Probability is the likelihood that an event will occur. How we define "event" depends on the situation. An event is the result of an activity, and the word "activity" can be replaced with the word "experiment." When we flip a coin, that activity can be thought of as an experiment whose outcome is random. We do not know whether we will get a head or a tail, but one or the other will result. So an experiment has outcomes, like the outcomes of flipping a coin, which are heads or tails. An event might be described as getting a head on a flip of a coin or getting a tail on a flip of a coin. What the event is depends on our point of view.

We often want to associate probabilities with the outcomes of an experiment. If we undertake a survey, which is a kind of experiment, and we are interested in the number of males or females who prefer peaches to pears, then we will use the number of males and females who indicate they prefer peaches and the number who say they prefer pears to calculate what we call "empirical probability," which is based on empirical studies, like surveys, that ask people questions. Empirical probability is represented by the frequency of people who answer a certain way to our empirical questions.
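For example (the numbers here are made up purely for illustration), if 40 of the 100 people we survey say they prefer peaches, the empirical probability that a randomly chosen respondent prefers peaches is

$$P(\text{prefers peaches}) = \frac{40}{100} = 0.40$$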
There are two other kinds of probability: classical probability and subjective probability. Classical experiments are often based upon events defined in terms of a deck of cards, marbles in a bowl, etc. These were the types of events that early statisticians used to develop the theory of statistics. Subjective probability is our own view of whether an event will happen; it is based upon our own experience, which may or may not be valid. So, we have activities which we call experiments, and from an experiment we can identify an event for which we can calculate an empirical probability.

The Rules of Probability:

In order to calculate a probability for an event from an experiment, we often have to count the number of times the outcomes making up the event happen. To do this, we can use what are called counting rules. These counting rules give us the total number of times something happens. Since a probability is a ratio with a numerator and a denominator, to calculate a probability we often have to count the number of times a certain event occurs and put that number in the numerator, then count the total number of outcomes and put that number in the denominator; we can then calculate the probability.

If we flip a coin 2 times and want to find the probability of getting at least 1 head (which is our event), we need to identify all the outcomes in our experiment and the outcomes that make up the event (the different ways of getting at least 1 head). If we flip a coin two times we can get: HH, HT, TH, or TT. So there are 4 total outcomes and 3 ways of getting at least one head (HT, TH, HH). So the probability of getting at least one head is 3 out of 4, or 3/4.

Sometimes the total number of outcomes, or the number of outcomes in an event, may be difficult to count, so we have special counting rules to help us. These counting rules are: 1. the addition rule; 2. the multiplication rule; 3. the permutation rule; and 4. the combination rule. You may find these rules in any statistics text. The purpose of all of these rules is to help us find a probability.

Probability Distribution:

We now know part of what makes up a probability distribution. One part is the probability. That probability has to be associated with something, and that something is the values that a random variable can take. A variable is anything that varies. A random variable is one for which we do not know the exact result or value when we do an experiment. When we flip a coin, getting a head is a random result; so is getting a tail. We know we will get one or the other, but on any one throw of the coin we do not know which we will get. When we pair the values of a random variable, such as head or tail, with the probability of getting each value (1/2 for a tail and 1/2 for a head in the case of flipping a coin once), we have a probability distribution. There are also more complicated probability distributions. For these, the probabilities associated with the values that the random variable can take have been calculated for us by statisticians.
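As a concrete illustration, the probability distribution for the number of heads in two flips of a fair coin pairs each value of the random variable with its probability (the four equally likely outcomes HH, HT, TH, and TT were listed above):

Number of heads   Outcomes   Probability
0                 TT         1/4
1                 HT, TH     1/2
2                 HH         1/4

The probabilities sum to 1, as they must for any probability distribution.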
The Normal Distribution:

This distribution is often used to represent random variables. It is used to find probabilities associated with the values that a random variable can take, and there are certain types of problems that are commonly solved by using the normal distribution. The characteristics of this distribution can be reviewed in your basic statistics text. Statisticians have calculated the probabilities associated with values of a random variable and have standardized these values. These probabilities have been placed in a table called the Standard Normal Table. To use this table, we take the value of our random variable, change it into a z-score, and find the probability associated with that z-score. The problems usually fall into one of 7 types when we try to find the probability associated with a value of a random variable for which we have the z-score.

Type 1: Finding the area under the standard normal distribution curve between 0 and +z, or between 0 and -z.

Type 2: Finding the area under the standard normal distribution curve in one tail, from +z to the end of the right side of the distribution or from -z to the end of the left side of the distribution.

Type 3: Finding the area under the standard normal distribution between two z values on the same side of the mean, from +z1 to +z2 or from -z2 to -z1.

Type 4: Finding the area under the standard normal distribution between two z values on opposite sides of the mean, between z1 on one side of the mean and z2 on the other side of the mean.

Type 5: Finding the area under the standard normal distribution to the left of a +z that is to the right of the mean.

Type 6: Finding the area under the standard normal distribution to the right of a -z that is to the left of the mean.

Type 7: Finding the area under the standard normal distribution curve in both tails: from +z out to the right side of the distribution and from -z out to the left side of the distribution.

Examples:

Type 1: Find the probability or area from z = 0 to z = 1.2. Find the probability or area from z = 0 to z = -1.2.

Type 2: Find the probability or area to the right of z = 1.2. Find the probability or area to the left of z = -1.2.

Type 3: Find the probability or area between z1 = +1.2 and z2 = +1.7. Find the probability or area between z1 = -1.2 and z2 = -1.7.

Type 4: Find the probability or area between -z = -1.2 and +z = +1.7.

Type 5: Find the probability or area to the left of +z = 1.2.

Type 6: Find the probability or area to the right of -z = -1.2.

Type 7: Find the probability or area to the right of +z = +1.2 and to the left of -z = -1.2.
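If you would rather have SPSS do the table lookup for you, its built-in CDF.NORMAL function returns the area under the normal curve to the left of a value. Here is a sketch for the first three example types (the variable names p_type1 through p_type3 are just illustrative, and COMPUTE needs an open data set to run against):

* Type 1: area between z = 0 and z = 1.2.
COMPUTE p_type1 = CDF.NORMAL(1.2,0,1) - 0.5.
* Type 2: area to the right of z = 1.2.
COMPUTE p_type2 = 1 - CDF.NORMAL(1.2,0,1).
* Type 3: area between z1 = 1.2 and z2 = 1.7.
COMPUTE p_type3 = CDF.NORMAL(1.7,0,1) - CDF.NORMAL(1.2,0,1).
EXECUTE.

The remaining types follow the same pattern of adding and subtracting areas returned by CDF.NORMAL.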
Chapter 2 (Lab 1)

File Structure:

We can think of a file as a table of rows and columns. When you use SPSS, you put your data in a file with rows and columns. As you can see with the file above, this file, called Employee.sav, shows the data for a file with information about employees of a particular company. A record is composed of the information on each employee. There are 474 employees in this data file. Each file is made up of variables. The variables are called fields and represent the information collected for all the records. So we have the following fields or variables across the top: id; gender; bdate for date of birth; educ for education; jobcat for job category; salary; salbegin for beginning salary; jobtime for how long the person has been in the job; prevexp for previous experience; and minority for minority status.

The view we see here shows the actual data. There is also another view, called Variable View. We can switch between Data View and Variable View easily by selecting one or the other at the bottom left of the screen. As we can see below, Variable View lists all the variables, and across the top of the screen we see the attributes for those variables: name, type, width, decimals, label, values, missing, columns, align, and measure.

Name: The name of the field or variable. You provide this, and it should be reflective of the meaning of the variable.

Type: The type of a field or variable refers to its underlying nature. Is it numeric or non-numeric? This is the basic question. You want fields on which you plan to do math to be numeric. There are other kinds of fields, such as dollar. If you indicate that the field is dollar, then a $ will be inserted into the field; however, you cannot do math with this kind of field. Some fields may be alphanumeric, meaning that we use letters and numbers to represent the data in the field. However, some procedures, and other programs to which you might want to export your data, may not read alphanumeric fields in a way that allows math to be done on the data in them.

Width: The number of positions provided for the field when the field data were input into the file.

Decimals: If the field should have decimals, you can specify the number of decimal places.

Label: As each field is actually a variable, you will want to give it a descriptive label which reminds you what the variable means. This label will be printed on any computerized output.

Values: As each variable is a random variable, the value that the variable takes depends on the test unit (with business data, usually a person). So the value will differ depending on the test unit, and we have to indicate the different values the variable can take. Gender, for example, is a variable that can take on two values, male or female.

Missing: Sometimes the person we interview will not answer a question. If that is the case, the data will be missing, so some value will have to represent missing data. Any value can be used; it is a common convention to use a string of 9's to represent a missing value. There is also a built-in missing value in SPSS, the dot (.). Dots are automatically read as missing.

Columns: This is just the column width of the field in the SPSS spreadsheet.

Align: You can align the fields to the right, left, or center to make the spreadsheet look nice.

Measure: This is the kind of scaling measurement: nominal, ordinal, interval, or ratio.

We will be working with the fields or variables to develop hypotheses, so we need to know as much about them as possible. If we name them, label them, and give them value labels and appropriate missing values, it will be easier for us to understand the printed output.
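All of these attributes can also be set with syntax instead of Variable View. As a sketch (assuming jobcat is coded 1, 2, 3, matching the category labels that appear in the crosstab output later in this manual; treating a code of 9 in educ as missing is purely an illustration):

VARIABLE LABELS jobcat 'Employment Category'.
VALUE LABELS jobcat 1 'Clerical' 2 'Custodial' 3 'Manager'.
MISSING VALUES educ (9).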
Your Data:

Each of you signed up for a case which comes with a data file. What you have to do is the following: 1. Read the case in your text; 2. Understand the meaning of the variables (use the questionnaire or survey in the case to help you); 3. Change the names of the variables to suit you; 4. Check the labels and value labels for the variables; 5. Check the missing data; 6. Check whether the variables are numeric or not, as this will determine how you can analyze the data.

Please proceed to instructors.husson.edu/cookc/marketingcourses to find specific instructions for Lab 1.

Chapter 3 (Lab 2)

Case Summary and Frequencies:

Often we would like to see how the respondents answered the questions we have asked them, or we may want to take a look at the data in order to identify errors in the data set. We can do this by running the Case Summary procedure in SPSS and by running the Frequency procedure in SPSS. If we run Frequencies, we can tell immediately if an incorrect answer has been coded into the data set, because all data values will appear in our Frequency runs. If a value appears to be incorrect, we can then run Case Summary to identify the record where the error occurs and isolate that record using its ID. All records should have an ID; ID, or something similarly named, should be a field in the data set. So, if you know the field where the error is, you can run Case Summary and get a listing of that field along with a listing of the ID field, and you can then find the data record where the error is located. We will run a case summary for the Employee.sav data set. We want to limit the number of cases printed; otherwise we tend to waste paper.

Frequencies:

Let's run a frequency for the Employee.sav data set that comes with SPSS and see what the output looks like. The Employee.sav data set has a number of fields. We do not want to ask for a frequency on data that is continuous, because we would have page after page of output. We might do this just to identify an error, but we should resist printing it all out, as there would be many pages to print. For example, the Employee.sav data set has fields called bdate, prevexp, and id. Bdate is the employee's birth date, id is the employee identification number, and prevexp is the employee's previous experience in months. It probably would not make sense to ask for a frequency on these fields, as it would not be very meaningful. Also, we have the salary and salbegin fields, which represent current salary and beginning salary. If we ask for a frequency on this data, it will be a rather long list, because nearly every individual in our data set will have a different number for salary and beginning salary. So it is best to run the Frequency procedure on fields where not every person's answer is represented by a different value.

Example of the Frequency Procedure:

As you see below, there are three fields we have identified in the Frequencies procedure: gender, previous experience, and minority status. As you can see, the previous experience table goes on and on, because it is one of the fields whose values are different for most employees. Each frequency table provides the value label, frequency, percent, valid percent (which leaves out missing data), and cumulative percent.

FREQUENCIES VARIABLES=gender prevexp minority
  /ORDER=ANALYSIS.
Frequencies

[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav

Statistics
              Gender   Previous Experience (months)   Minority Classification
N   Valid     474      474                            474
    Missing   0        0                              0

Frequency Table

Gender
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   Female    216         45.6      45.6            45.6
        Male      258         54.4      54.4            100.0
        Total     474         100.0     100.0

Previous Experience (months)
                  Frequency   Percent   Valid Percent   Cumulative Percent
Valid   missing   24          5.1       5.1             5.1
        2         4           .8        .8              5.9
        3         5           1.1       1.1             7.0
        4         4           .8        .8              7.8
        5         12          2.5       2.5             10.3
        [The table continues in the same fashion for several pages, with one
        row for each remaining distinct value of previous experience, from 6
        up to 476 months, ending with:]
        476       1           .2        .2              100.0
        Total     474         100.0     100.0
Minority Classification
                Frequency   Percent   Valid Percent   Cumulative Percent
Valid   No      370         78.1      78.1            78.1
        Yes     104         21.9      21.9            100.0
        Total   474         100.0     100.0

The previous experience variable or field has 24 records with missing data, and this number is left out of the valid percent calculation. It is very important to know whether we have missing data, and how many records have missing data in a field, because if we have a lot of missing data our analysis may be biased, since we have left people out. In practice, however, it is difficult if not impossible to fill in the missing data records. There are ways of estimating a value that can fill in for missing data; your text does mention these methods.

What you need to do now is identify 5 variables in your data set for which you will run the Frequency procedure. Please proceed to read the online instructions at instructors.husson.edu/cookc/marketingcourses to find specific instructions for completing Lab 2.

Case Summaries:

We can also list the data for the individual cases for all the variables or fields in our data set. To do this, SPSS has a procedure on the Analyze dropdown menu: the Case Summary procedure is under Reports. If you click on Reports and then Case Summary, a screen will appear that will ask you to identify the variables or fields for which you want case summaries. You can list the data for all the records, or you can specify a certain number of records, such as the first 20. Using the procedure, we can see below the data for the first 20 records in the Employee.sav data set for the specified variables. This procedure is helpful for getting a listing of all the data and identifying a record with data that is in error. We then might be able to track down the source of the error and make changes in our data file.

Summarize

[SPSS Notes table omitted; it records the input file (C:\Documents and Settings\cookc\Desktop\employee.sav), the 474 rows in the working data file, the missing-value handling, and the processing times.]

SUMMARIZE
  /TABLES=gender bdate salary salbegin
  /FORMAT=VALIDLIST NOCASENUM TOTAL LIMIT=20
  /TITLE='Case Summaries'
  /MISSING=VARIABLE
  /CELLS=COUNT.

Case Processing Summary(a)
                                         Cases
                   Included         Excluded         Total
                   N      Percent   N      Percent   N      Percent
Gender             20     100.0%    0      .0%       20     100.0%
Date of Birth      20     100.0%    0      .0%       20     100.0%
Current Salary     20     100.0%    0      .0%       20     100.0%
Beginning Salary   20     100.0%    0      .0%       20     100.0%
a. Limited to first 20 cases.
Case Summaries(a)
       Gender   Date of Birth   Current Salary   Beginning Salary
1      Male     2/03/1952       $57,000          $27,000
2      Male     5/23/1958       $40,200          $18,750
3      Female   7/26/1929       $21,450          $12,000
4      Female   4/15/1947       $21,900          $13,200
5      Male     2/09/1955       $45,000          $21,000
6      Male     8/22/1958       $32,100          $13,500
7      Male     4/26/1956       $36,000          $18,750
8      Female   5/06/1966       $21,900          $9,750
9      Female   1/23/1946       $27,900          $12,750
10     Female   2/13/1946       $24,000          $13,500
11     Female   2/07/1950       $30,300          $16,500
12     Male     1/11/1966       $28,350          $12,000
13     Male     7/17/1960       $27,750          $14,250
14     Female   2/26/1949       $35,100          $16,800
15     Male     8/29/1962       $27,300          $13,500
16     Male     11/17/1964      $40,800          $15,000
17     Male     7/18/1962       $46,000          $14,250
18     Male     3/20/1956       $103,750         $27,510
19     Male     8/19/1962       $42,300          $14,250
20     Female   1/23/1940       $26,250          $11,550
Total  N = 20   20              20               20
a. Limited to first 20 cases.

Please proceed to read the online instructions at instructors.husson.edu/cookc/marketingcourses to find specific instructions for completing Lab 2.

Chapter 4 (Lab 3)

Descriptive Analysis:

You may have the need to look at various kinds of descriptive data such as means, standard errors, sums, maximums, minimums, standard deviations, variances, medians, modes, etc. These are concepts you learned about in MS132 or Mat132. Calculation of a mean can only be carried out on data measured as interval or ratio data. For instance, calculating a mean gender or a mean minority status is not very sensible. So, be careful to choose variables that are measured on a continuous basis (interval or ratio measurement). In the Employee.sav data set, prevexp, salary, salbegin, and educ are measured on at least an interval scale.

When we use salary and salbegin, we must remember to change the type from dollar to numeric. These two variables have the Type of dollar, which means a dollar sign ($) is included in the field. Some procedures cannot process this ($), so it is necessary to change the type to numeric in order to work with these fields in a way that lets us do math on them (such as take an average). So, we can calculate a mean on those fields. We cannot calculate a mean on fields such as gender, jobcat, and minority, for example. However, we could still look at the median, mode, maximum, minimum, or any descriptive statistic that is meaningful to analyze on count data for fields such as gender, jobcat, and minority.

There are several procedures that provide descriptive statistics. We have already seen that descriptive statistics can also be calculated using the Frequency procedure.
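As an aside, the dollar-to-numeric change described above can also be made in syntax with the FORMATS command (a sketch; F8.0 is just one reasonable numeric format to switch to):

FORMATS salary salbegin (F8.0).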
Explore Procedure:

The Explore procedure allows us to take a field measured on an interval or ratio scale and look at the average of that field by another field measured on a nominal or ordinal scale. Go to Analyze, Descriptive Statistics, and Explore to find the procedure. You will want to choose a continuous variable, such as salary, to analyze. Continuous fields are measured on an interval or ratio scale. Then choose a factor, which should be a variable measured on a nominal or ordinal scale. You may analyze more than one continuous variable at a time, but make sure that the variables are continuous (measured on a ratio or interval scale) and the factors are measured on a nominal or ordinal scale. You will see that the descriptive statistics for the continuous variables are printed for each value of the factors. This allows you to see the value of the statistics for each value of a factor. For example, you may feel that average salary will differ by level of education, gender, or minority status. Explore will tell you what the average salary is for each value of gender, minority status, or any other nominally or ordinally measured variable (called factors).

These are the descriptive statistics available: mean, 95% confidence interval, 5% trimmed mean, median, variance, standard deviation, minimum, maximum, range, interquartile range, skewness, and kurtosis.

Let's say we want to look at salary by the minority field or factor. We would use the Explore procedure, and our results would look like the following output.

Explore

[SPSS Notes table omitted; it records the input file (C:\Documents and Settings\cookc\Desktop\employee.sav), the 474 rows in the working data file, the missing-value handling, and the processing times.]

EXAMINE VARIABLES=salary BY minority
  /PLOT BOXPLOT STEMLEAF
  /COMPARE GROUP
  /STATISTICS DESCRIPTIVES
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.

Case Processing Summary
                                         Cases
Minority                   Valid           Missing         Total
Classification             N      Percent  N      Percent  N      Percent
Current Salary   No        370    100.0%   0      .0%      370    100.0%
                 Yes       104    100.0%   0      .0%      104    100.0%

Descriptives

Current Salary, Minority Classification = No:
  Mean $36,023.31 (Std. Error $938.068); 95% Confidence Interval for Mean $34,178.68 to $37,867.94; 5% Trimmed Mean $34,094.52; Median $29,925.00; Variance 3.256E8; Std. Deviation $18,044.096; Minimum $15,750; Maximum $135,000; Range $119,250; Interquartile Range $16,200; Skewness 1.896 (Std. Error .127); Kurtosis 4.256 (Std. Error .253).

Current Salary, Minority Classification = Yes:
  Mean $28,713.94 (Std. Error $1,119.984); 95% Confidence Interval for Mean $26,492.72 to $30,935.17; 5% Trimmed Mean $27,092.63; Median $26,625.00; Variance 1.305E8; Std. Deviation $11,421.638; Minimum $16,350; Maximum $100,000; Range $83,650; Interquartile Range $7,125; Skewness 3.749 (Std. Error .237); Kurtosis 18.249 (Std. Error .469).

[Stem-and-leaf plots of Current Salary for minority = No and minority = Yes omitted; they were requested with /PLOT BOXPLOT STEMLEAF and show the strong right skew of both salary distributions.]

As you can see, there is a lot of information provided, including stem-and-leaf plots and box-plots, which you learned about in MS132 or Mat132. If we take a look at the numeric descriptive results, we see that they are given for salary for each of the two categories of minority status: Yes and No. We can see the mean, variance, median, etc.
We can then look at differences in the interval/ratio measured field by levels of some factor we think may be important, and go on to determine statistically whether there are differences among the different levels of the factor.

Descriptive Procedure:

Perhaps you want to look at the descriptive numerics for the entire data set. You may use the Descriptives procedure to do this. Go to Analyze, Descriptive Statistics, and Descriptives. Select the variables for which you want descriptive statistics and run the procedure.

Descriptives

[SPSS Notes table omitted; it records the input file (C:\Documents and Settings\cookc\Desktop\employee.sav), the 474 rows in the working data file, the missing-value handling, and the processing times.]

DESCRIPTIVES VARIABLES=salbegin
  /STATISTICS=MEAN STDDEV MIN MAX.

Descriptive Statistics
                     N     Minimum   Maximum   Mean         Std. Deviation
Beginning Salary     474   $9,000    $79,980   $17,016.09   $7,870.638
Valid N (listwise)   474

We then have descriptive statistics such as the number of observations (N), the minimum value in the data set, the maximum value in the data set, the mean, and the standard deviation. These are the default descriptive values provided. You can specify others if you want: sum, variance, range, standard error of the mean, skewness, and kurtosis. Remember that it makes no sense to compute some descriptive statistics on some variables. You can only compute descriptive statistics such as means on interval or ratio measured variables. So be careful, or otherwise the results of your analysis will be bogus.

Please proceed to read the online instructions at instructors.husson.edu/cookc/marketingcourses to find specific instructions for completing Lab 3.

Chapter 5 (Lab 4)

One-Sample t or z Hypothesis Testing:

One-sample tests mean that we select the variable to test from one sample; we are not comparing two or more samples. We do, however, have to have a number to compare against. This number is some value against which it is appropriate to test our variable. It may be some value that represents what we think exists in the population, or it may be some hypothetical value. We need to calculate the sample statistic (either a mean or a proportion) and compare it to this hypothetical value. In SPSS, it is easier to calculate a mean statistic from the sample to compare to our test number, which is also a mean. In other statistical programs, we can more easily compare a sample proportion to a test number (a proportion), or we can do the test comparing a sample proportion to a test proportion by hand.

In the Employee.sav data set, there are fields or variables for which we can calculate a mean, and we can hypothesize some test number to compare this mean against. We can use variables like educ, salary, salbegin, and prevexp, as these variables are measured more or less continuously. Means are calculated on data which is at least interval. In the Employee.sav data set we do not have variables measured using an interval scale, but we do have variables measured on a ratio scale, and therefore we can calculate means. We would also be able to calculate a mean on interval data. Your data set online contains interval data for which you can also calculate a mean.
Let's calculate the mean salary for the Employee data set and test this mean salary against the mean salary for the entire industry that includes the company from which the Employee.sav data were taken. Let's say that the mean salary for this industry is $40,000. We want to know if the mean salary in the company is significantly different from the mean salary in the industry, which is $40,000. This is a one-sample test. We can do a one-tailed or two-tailed test with this one sample; SPSS automatically does a two-tailed test. So our null (H0) and alternative (Ha) hypotheses would be:

H0: μ = $40,000
Ha: μ ≠ $40,000

There are two kinds of one-tailed test: left sided and right sided. Left sided means we are on the left side of the mean in the z or t distribution; right sided means we are on the right side of the mean in the z or t distribution. We would then have to set up left sided or right sided tests. Remember that, no matter what kind of test, the = sign is always part of the null hypothesis (H0). So the null might be stated as μ ≤ value, where value is some number we are testing, or μ ≥ value, where again value is some number we are testing. The alternative (Ha) will then be either μ > value or μ < value.

Let's take the salary data and test to see if there is a significant difference from the industry average of $40,000. In SPSS, go to Analyze, Compare Means, One-Sample T Test, and use salary as your sample variable and 40000 as the test value.

T-TEST
  /TESTVAL=40000
  /MISSING=ANALYSIS
  /VARIABLES=salary
  /CRITERIA=CI(.9500).

T-Test

[SPSS Notes table omitted; the syntax is shown above, and the input file is C:\Documents and Settings\cookc\Desktop\employee.sav with 474 rows in the working data file.]

One-Sample Statistics
                 N     Mean         Std. Deviation   Std. Error Mean
Current Salary   474   $34,419.57   $17,075.661      $784.311

One-Sample Test
Test Value = 40000
                                                                95% Confidence Interval
                                                                of the Difference
                 t        df    Sig. (2-tailed)   Mean Diff.    Lower        Upper
Current Salary   -7.115   473   .000              $-5,580.432   $-7,121.60   $-4,039.27

We are given the t value, the degrees of freedom (df), the two-tailed test probability level (Sig. (2-tailed)), the mean difference (which is the difference between the mean in the sample and the test value of $40,000), and the confidence interval at the 95% level. We are most interested in the probability level or significance level, which is .000. This means that there is a highly significant difference between the sample mean and the test value. Remember that when we set a confidence level like the 95% level, we are saying we have a 5% chance of making an error. If we find that the significance level is very low, smaller than the 5%, we can say that we have a small chance of being in error given the conditions of our hypothesis test. Since .000 is much smaller than 5%, we reject our null hypothesis, accept the alternative, and say that there is a significant difference between the mean salary and our test number.

We may also want to know the direction of the difference. If you look at the mean difference, we see that it is negative, so our mean is far lower than the $40,000 test number. If you look at the One-Sample Statistics table above, you see that the actual sample mean is $34,419.57, which is lower than the $40,000 test value. Not only is the number lower, it is significantly lower from a statistical point of view.

The computer output above gives a t value. But remember that the t distribution and z distribution give essentially the same probabilities as long as the sample size is 30 or greater. Our sample has a large number of people, so we are really calculating a z value even though the computer output does not say this. It is a bit confusing. The calculations for the sample t and z statistics are given below; these are the formulas used by the computer as well.

t and z Statistics for Means:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}} \qquad \text{or} \qquad z = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

If we use a t statistic, we need to find a t score in the t table, just as we do for z scores. However, to find such a t score we need to calculate the degrees of freedom. In the case of comparing one sample mean to a hypothetical mean, we calculate the degrees of freedom as n - 1, where n is the sample size.
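Plugging the numbers from the One-Sample Statistics table above into this formula reproduces the t value on the printout (the standard error $s/\sqrt{n}$ is the $784.311 shown there):

$$t = \frac{34{,}419.57 - 40{,}000}{784.311} = -7.115, \qquad df = 474 - 1 = 473$$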
If you look at the mean difference, 50 we see that it is negative, so our mean is far lower than the $40,000 test number. If you look at the One Sample Statistics table above, you see that the actual sample mean is $34,419.57 which is lower than the $40,000 test value. However, not only is the number lower but significantly lower from a statistical point of view. The computer output above gives a t value. But remember that the t distribution and z distribution give the same probabilities as long as the sample size is 30 or greater. Our sample has a large number of people, so we are really calculating a z value even though the computer output does not say this. It is a bit confusing. The calculation for the sample t and z statistics are given below. These are the formulas used by the computer as well. t and z Statistics for Means and Proportions: Means 𝑥̅ −𝜇 𝑡 = 𝑠/ √ 𝑥̅ −𝜇 or 𝑧 = 𝑠/ 𝑛 √𝑛 If we use a t statistic, we to find a t score in the t table just like we do for z scores. However, to find such a t score we need to calculate the degrees of freedom. In the case of comparing one sample mean to a hypothetical mean, we 51 calculate the degrees of freedom using n-1 where n is equal to the sample size. Remember that we compare the calculated t or z score above with the table t or z score to determine if we reject the null hypothesis or not. If the calculated score is greater or less than the table score for a two-tailed test, then we reject the null and accept the alternative. For a left sided one-tailed test, the t or z score must be more negative than the table t or z score in order to reject the null and accept the alternative hypothesis. For a right sided one-tailed test, the t or z score must be more positive than the table t or z score to reject the null and accept the alternative hypothesis. In our Emplyee.sav data set example, we rejected the null hypothesis and accept the alternative because the calculated t value is more negative than the table t value. Also, from the computer output, we do not see the table t or z value, but the output does give the probability level. We can easily tell from that probability level if we accept or reject the null. Since the probability level was .000 which is much smaller than the stated alpha level of 5% (error level), we reject the null and accept the alternative. If the probability level had been higher than the 5 % such as 5.5% we would have accepted the null hypothesis and rejected the alternative. In 52 that case, the mean sample salary would not have been significantly different from the test number. Please proceed to read the online instructions at instructors.husson.edu/cookc/marketingcourses to find specific instructions for completing Lab 4. 53 Chapter 6 (Lab 5) Chi-Square: In the previous lab, we were working with statistics based upon the Normal Distribution. However, many statistics are based upon other distributions. If a random variable is measured nominally, we would not use the Normal Distribution to find associated probabilities. We would have to use other types of distributions. This lab deals with variables that are counts. Although counts are numeric, we are just counting the number of times a particular event occurs (an event could be the number of females in our sample, the number of people who plan to vote a certain way, or the number of people who fall in a particular category in which we have an interest). 
Chi-square is a statistic based upon counts, and it uses a different probability distribution, called the Chi-square distribution. This distribution is one sided, unlike the Normal Distribution. The reason it is one sided is that we are only interested in positive values of Chi-square; we are squaring values anyway and would not get any negative values. Chi-square values are similar in idea to a z-score. In using the Chi-square table (at the back of your text), you need to use the idea of degrees of freedom, as you did for the t distribution. The degrees of freedom in this case depend upon the number of rows and columns in the table you construct that contains the frequencies or counts of test units (such as individuals) who fall in certain cells of that table. The table is constructed by taking two nominally measured variables with different levels (such as gender, having the two levels male and female) and cross-tabulating the levels of these two variables. We will start with two variables, although you can cross-tabulate more than two.

The Chi-square statistic is:

$$\chi^2 = \sum \frac{(f_o - f_e)^2}{f_e}$$

where $f_e$ is the expected frequency and $f_o$ is the observed frequency. We find the observed frequencies in each cell of our table. The expected frequencies we have to calculate for each cell in the table. To get the expected frequency for a cell, we take the appropriate row and column totals, multiply them together, and divide by the grand total (the total number in our table). We compute the term above for each cell in the table and add the results together. This number is the calculated Chi-square.

As in the case of the t or z distributions, we have to find the table value, here the table Chi-square. To find the table Chi-square we need a probability level and the degrees of freedom. Then, given a certain probability and the degrees of freedom, which is the number of rows in the table minus one times the number of columns in the table minus one, we look up the table Chi-square value. As before (z or t tests), we compare the calculated Chi-square to the table Chi-square. If the calculated value is greater than the table value, we reject our null hypothesis, which in this case is: there is no association between the variables. If we reject the null hypothesis, we can then accept the alternative hypothesis, which is: there is an association between the variables.

Let's take a look at the Employee.sav data set to find two variables to cross-tabulate and then calculate a Chi-square statistic on the cross-tabulation. We need two nominally measured variables. We can use minority and jobcat, as both variables are measured nominally. We would then construct a 3 by 2 table: there are 3 levels of job category and two levels of minority status. We want to know if the two variables are related in any way, and the Chi-square test will tell us if these two variables are statistically related. This statistical method only tells us if they are related; it does not specify how, and it does not tell us if one variable causes the other. The variables are only related through some source of causation, but we do not know what that source is. There could be some third variable actually causing the relationship we see, and if we included that third variable, the relationship between the first two variables might disappear.

In SPSS we go to Analyze, Descriptive Statistics, and then Crosstabs. We then select the variables we want to crosstab, which in this case are jobcat and minority.
Our hypothesis may be that we suspect discrimination in this company: that they tend to hire more minority people into the lower paying jobs, such as clerical and custodial jobs. So the null hypothesis is: there is no relationship between minority status and job category. The alternative hypothesis is: there is a relationship between minority status and job category. To test this, we will run a Chi-square test.

CROSSTABS
  /TABLES=jobcat BY minority
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT EXPECTED COLUMN
  /COUNT ROUND CELL.

Crosstabs

[DataSet1] C:\Documents and Settings\cookc\Desktop\employee.sav

Case Processing Summary
                                               Cases
                                  Valid          Missing        Total
                                  N     Percent  N     Percent  N     Percent
Employment Category * Minority    474   100.0%   0     .0%      474   100.0%
Classification

Employment Category * Minority Classification Crosstabulation
                                                      Minority Classification
                                                      No       Yes      Total
Employment   Clerical    Count                        276      87       363
Category                 Expected Count               283.4    79.6     363.0
                         % within Minority Class.     74.6%    83.7%    76.6%
             Custodial   Count                        14       13       27
                         Expected Count               21.1     5.9      27.0
                         % within Minority Class.     3.8%     12.5%    5.7%
             Manager     Count                        80       4        84
                         Expected Count               65.6     18.4     84.0
                         % within Minority Class.     21.6%    3.8%     17.7%
Total                    Count                        370      104      474
                         Expected Count               370.0    104.0    474.0
                         % within Minority Class.     100.0%   100.0%   100.0%

Chi-Square Tests
                               Value     df   Asymp. Sig. (2-sided)
Pearson Chi-Square             26.172a   2    .000
Likelihood Ratio               29.436    2    .000
Linear-by-Linear Association   9.778     1    .002
N of Valid Cases               474
a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 5.92.

The thinking is that minority status determines job category, although there may be some other factor really causing this relationship. In any case, the relationship is significant. If you look at the Pearson Chi-square value above (26.172), we see that the significance level for a two-sided test is .000, which is far lower than the standard probability level of 0.05. This means that we are very unlikely to get this high a Chi-square value by chance. So we reject the null, accept the alternative, and find that there is some kind of relationship between minority status and job category. Remember that there could be a third variable causing this relationship, so we might add a third variable to our analysis, such as education level. However, in this data set education is not measured nominally, so we would have to recode that data into a new variable in which education has a nominal classification.

Example - Calculation of Chi-square by Hand:

We have two variables, smoking and gender, and we want to know if there is an association between them. We have found the frequencies within each cell as follows:

                   Gender
Smoking            Male   Female   Total
Yes                100    75       175
No                 135    140      275
Total              235    215      450

We need to calculate the expected values for the four (non-total) cells and then calculate the Chi-square statistic. Then we need to compare that to a table value associated with an alpha level or confidence level. We get the expected frequencies for each cell as follows:

Cell 1: (235 x 175)/450 = 91.39
Cell 2: (215 x 175)/450 = 83.61
Cell 3: (235 x 275)/450 = 143.61
Cell 4: (215 x 275)/450 = 131.39

$$\chi^2 = \frac{(100-91.39)^2}{91.39} + \frac{(75-83.61)^2}{83.61} + \frac{(135-143.61)^2}{143.61} + \frac{(140-131.39)^2}{131.39} = .81 + .89 + .52 + .56 = 2.78$$

We must find the Chi-square table value to compare this number against. To use the Chi-square distribution it is necessary to find the degrees of freedom.
For this type of test, the degrees of freedom are the number of rows minus 1 times the number of columns minus 1, or (r-1)(c-1). Here we have two rows and two columns, so the d.f. are (2-1)(2-1) = 1. We also need to choose the confidence level; let us choose the 95% level. Remember that this table is one sided only. We find that the table value associated with 1 d.f. and the 95% confidence level is 3.841. If our calculated value were equal to or greater than this number, we could reject the null and accept the alternative. However, it is not greater in this case, so we accept the null and reject the alternative: there is no association between gender and smoking.
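As an aside, SPSS can look up both of these numbers for us: IDF.CHISQ returns the table (critical) value for a given confidence level, and SIG.CHISQ returns the significance level of a calculated Chi-square. A sketch (the names crit and pval are illustrative, and COMPUTE needs an open data set to run against):

* Table Chi-square for the 95% level with 1 d.f. (3.841).
COMPUTE crit = IDF.CHISQ(0.95,1).
* Significance level of our calculated Chi-square of 2.78 with 1 d.f.
COMPUTE pval = SIG.CHISQ(2.78,1).
EXECUTE.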
Chapter 7 (Lab 6)

Two Group or Two Sample t test or z test:

Sometimes we want to compare two means or two proportions from two different samples, or compare two groups taken from one sample. Why do we want to do this? Because we want to compare one mean or proportion to another to see if there is a significant difference between the two groups. Just because one number is higher or lower than another does not mean that number is statistically significantly different from the other. We have to perform a statistical test to find out. After all, we want to make decisions on the basis of the best information available, and hypothesis testing allows us to back our decisions with a high probability of being correct if we reject the null hypothesis of no difference between the two means or proportions.

Whether we use a t test or a z test depends upon the sample size. If the two groups or samples are 30 or greater, we can use the z test whether or not we know the standard deviations of the populations from which the samples or groups were taken. If we do not know the population standard deviations and our sample sizes are less than 30, then we need the t test. If our sample sizes are less than 30 but we know the standard deviations of the populations, we can still use the z test. We can also compute confidence intervals for the difference between means and proportions. We also assume that the two populations from which our samples come are independent. If they are not, we have to use a special test to see if the two samples differ (see the paired-samples t-test at the end of this chapter).

Confidence Interval Formulas for the Difference Between Means and Proportions for Independent Samples:

Here are the formulas for the calculation of confidence intervals for the difference between means and proportions.

Difference Between Two Means: Confidence Intervals

1. Large sample case with standard deviations of the populations known:

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

2. Large sample case with standard deviations of the populations unknown:

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

3. Small sample case with standard deviations of the populations known:

$$(\bar{x}_1 - \bar{x}_2) \pm z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

4. Small sample case with the standard deviations of the populations unknown for the two samples, but the standard deviations are assumed equal and are estimated by the sample standard deviations:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{s^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

where

$$s^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

and the degrees of freedom are $n_1 + n_2 - 2$.

5. Small sample case with the standard deviations of the populations unknown and unequal:

$$(\bar{x}_1 - \bar{x}_2) \pm t_{\alpha/2}\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

with degrees of freedom of

$$df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$

which is a lot of work to calculate!

Difference Between Two Proportions: Confidence Intervals

The difference between two proportions is always treated here as a z test. We assume we have large enough sample sizes to use the normal distribution. If the sample size is small, we would have to use the binomial distribution; as we have not covered the binomial distribution in this class, we will use the z test only.

$$(p_1 - p_2) \pm z_{\alpha/2}\sqrt{\frac{p_1 q_1}{n_1} + \frac{p_2 q_2}{n_2}}$$

We can find the correct hypothesis test from the above formulas. Remember that these tests are for the difference between two means or the difference between two proportions.

Hypothesis Testing Formulas for the Difference Between Two Means and Proportions for Independent Samples:

Difference Between Two Means

1. Large sample case with standard deviations of the populations known:

$$z = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

2. Large sample case with standard deviations of the populations unknown:

$$z = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

3. Small sample case with standard deviations of the populations known:

$$z = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

4. Small sample case with the standard deviations of the populations unknown for the two samples, but the standard deviations are assumed equal and are estimated by the sample standard deviations:

$$t = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{\sqrt{s^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

5. Small sample case with the standard deviations of the populations unknown and unequal:

$$t = \frac{\bar{x}_1 - \bar{x}_2 - D_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Difference Between Two Proportions: Hypothesis Testing

The test for the difference between two proportions is a large sample test. For smaller samples, the binomial distribution may be used.

$$z = \frac{p_1 - p_2 - D_0}{\sqrt{\bar{P}(1 - \bar{P})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where $\bar{P}$ is the pooled proportion:

$$\bar{P} = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$

For hypothesis testing, we use this pooled proportion to calculate the standard error of the difference between the proportions.

Use of the Formulas for Hypothesis Testing:

We can use the above formulas to test any hypotheses that fit the particular situations to which the formulas apply. Let's look at three of these situations: A. the large sample case with standard deviations of the populations unknown; B. the small sample case with the standard deviations of the populations unknown for the two samples but assumed equal and estimated by the sample standard deviations; and C. the difference between proportions.
Our hypothesis is as in our example above. In our Employee.sav data set, we have 474 observations. We want to run the independent sample z-test. Now, the SPSS program will compute this but this test is actually under the independent sample t-test procedure. The SPSS program will run a t-test or a z-test based upon sample sizes. The output will also give you two different assumptions: equal variances, and unequal variances. It is usual that we will not know the standard deviations of the populations and will be working with samples. This procedure covers for all the situations we have identified except for samples where the samples are interdependent. We would have to use another procedure for tests of the difference between proportions. We would go to Analyze, and the select Independent Sample ttest. A screen will appear, and we will select the test variable (dependent variable) which as to be measured on at least an interval basis (not a nominal variable). We also need an independent variable that must not have more than two categories. If it has more than two categories, we would then have to use one-way ANOVA. So using the independent sample 70 t-test, we just are comparing two groups. We also have a Define Group box in which we have to give the procedure the designations for the categorical variable. In our case, we would have to identify m for male gender and f for female gender. Some data sets may use 1 for female and 2 for male or some other designation. But we have to KNOW what these designations are. As we see below, we have filled in the Test Variable(s), Grouping Variable, and Define Group boxes. 71 Now we can run our procedure. The results show that there is 72 significant difference between male and female salaries. The two-tailed sig (significance) level is .000 which means that this is less than 0.05 or 0.01 probability levels. The computer calculates out to three decimal places. Now, with this information, we could do more sophisticated analysis which would try to control for other variables that might explain the difference in beginning salary. Small Sample Case with Standard Deviation of the Populations Unknown, Equal Variances Assumed: Our Malhotra text gives us an example of two groups, one adult and one teenager. We want to know if there are differences between the two groups in amusement park preferences. There are ten respondents in each sample. The mean amusement park score for the adults was 4 and the mean amusement park score for the teenagers was 5.5. The standard deviation for the adult group was 1.080 and the standard deviation for the teenagers was 1.080. We could do a test of the quality of variances. Computer outputs often give this anyway. We will assume that the variances are equal. We will pool the variances in this case. Each variance is the standard deviation squared. The pooled variance and standard deviation are: S2 73 (10 1)1.66 (10 1)1.111 1.139 10 10 2 1 1 s x1 x2 1.139( ) 0.477 10 10 The t-test is: t 5.5 4 3.14 0.477 with 18 degrees of freedom. Thus, using the t- distribution for 18 degrees of freedom, we find that the critical value in the t-table is 2.0019 for a two-tailed text. So the null hypothesis of equal means is rejected. Now, if we had the raw data, this is the same result we would get if we used the independent sample t-test in SPSS. Hypothesis Test of the Difference in Proportions: Malhotra gives us an example where we have two independent samples that give the percentage of users of jeans in the United States and Hong Kong. 
Hypothesis Test of the Difference in Proportions:

Malhotra gives us an example where we have two independent samples that give the percentage of users of jeans in the United States and Hong Kong. We interview a sample of 200 customers in each area and find that 80% of customers in the US and 60% of customers in Hong Kong use jeans. Is there a significant difference between these proportions? The pooled proportion is $\bar{P} = \frac{200(0.8) + 200(0.6)}{400} = 0.7$, so the z test is:

$$z = \frac{(0.8 - 0.6) - 0}{\sqrt{(0.7)(0.3)\left(\frac{1}{200} + \frac{1}{200}\right)}} = 4.36$$

Using a two-tailed test, the critical z-scores are +/- 1.96. Since 4.36 is greater than 1.96, there is a significant difference between the two groups. Now, SPSS does not have a procedure where we can make this calculation directly as a difference between two proportions. However, we can use a Chi-square test and get the same result.
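As with the earlier tests, the jeans example is easy to verify with a short Python sketch; the function name two_prop_z is our own.

# Minimal check of the two-proportion z test for the jeans example.
from math import sqrt

def two_prop_z(p1, p2, n1, n2):
    """z statistic for the difference between two sample proportions."""
    p_pool = (n1 * p1 + n2 * p2) / (n1 + n2)    # pooled proportion
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

z = two_prop_z(0.80, 0.60, 200, 200)
print(f"z = {z:.2f}")   # z = 4.36, beyond the +/- 1.96 critical values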
Paired Samples t-test:

If the two samples in question are not independent but paired, we have to use a different procedure. We can use the paired-samples t-test or, alternatively, the Chi-square procedure. The paired-samples t-test is a t-test with n-1 degrees of freedom and is given by:

$$t_{n-1} = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}}$$

where $\bar{D}$ is the mean of the differences between the pairs of observations and $s_D$ is the standard deviation of those differences:

$$s_D = \sqrt{\frac{\sum (D_i - \bar{D})^2}{n - 1}}$$

That is, we take each paired difference, subtract the mean of the differences, square the result, sum over all paired observations, divide by n-1, and take the square root. This value, divided by the square root of the sample size n, gives the standard error in the denominator of the t statistic. We do not have paired data to work with in our data sets, so we will not be using this test in SPSS. Remember, you can also do this test using Chi-square when you have a large sample size.

Chapter 8 (Labs 7 and 8)

ANALYSIS OF VARIANCE (ANOVA):

If we have more than two means, the dependent variable is measured on at least an interval basis, and we have one independent variable measured on a categorical basis, we will be using One-Way Analysis of Variance. There are several other types of ANOVA designs, such as the completely randomized design, the randomized block design, and factorial designs. A completely randomized design has one dependent variable and one categorical variable. It assumes that there is no source of variation other than the categorical variable. However, there may be other variables that can affect the dependent variable. If this is the case, we use a randomized block or factorial design. The randomized block design has one other categorical variable; observations are randomly assigned to the different combinations of the levels of the two categorical variables. Factorial designs can have more than two controlling variables and allow for interaction effects: it is possible for a particular level of one variable to have a positive or negative effect on the dependent variable at a particular level of another variable. If this is the case, then we need to use a factorial design. We can also have independent variables that are not categorical; in this case, the analysis is called analysis of covariance. There are other models of ANOVA as well, such as Repeated Measures ANOVA. We will just cover One-Way ANOVA, learn how to do One-Way ANOVA in SPSS, and try a factorial design as well, although we will not do the math for that design.

One-Way ANOVA or the Completely Randomized Design:

In this design, the categorical variable is called a factor, and the different levels of this factor are called treatments. We want to know if the dependent variable varies by the different treatments. We first will have to decompose the total variation in the dependent variable into the variation explained by the independent variable and the error left over. So,

$$SS_{total} = SS_{between} + SS_{within}$$

where $SS_{total}$ is the total variation, $SS_{between}$ is the explained variation, and $SS_{within}$ is the error variation, or the variation in the dependent variable not explained by the factor or independent variable.

We are actually comparing means. When we compare two means, it is a t-test. When we have three or more means, we use ANOVA. We have to be careful, because a larger number of categories for the independent variable leads to comparing each mean to each other mean, and we have to do a lot of comparisons. That could lead to getting a significant result by random chance. In order to lower this possibility, we would have to employ what are called multiple comparison tests, which make it less likely to get a significant difference by chance. The null hypothesis for this test is that the population means are equal. The test is an F test, which uses an F distribution. This distribution is defined by two degrees of freedom, one for the numerator and one for the denominator in the formula. The formula is:

$$F = \frac{SS_x / (c-1)}{SS_{error} / (N-c)} = \frac{MS_x}{MS_{error}}$$

We have to know how to calculate the SS terms, or sums of squares terms. The c-1 degrees of freedom for the numerator is the number of treatments minus 1, and the N-c degrees of freedom for the denominator is the total number of observations minus the number of treatments. Let's see how we would calculate the sums of squares and how we would use the F table to get the critical values.

Our experiment comes from Malhotra and shows the effect of in-store promotion on sales. We have three levels of the factor, so we have three treatments: high, medium, and low. Fifteen stores are randomly selected and assigned randomly to the three levels. Sales have been converted to a scale from 0 to 10. The treatment (between-groups) sums of squares is calculated by taking the mean within each treatment, subtracting off the grand mean for the entire sample, squaring the result, and multiplying each calculation by the associated number of observations within that treatment. The means for the High, Medium, and Low groups are 9, 5, and 4 respectively, and the grand mean is 6. So:

$$SS_x = 5(9-6)^2 + 5(5-6)^2 + 5(4-6)^2 = 70$$

And to get the error sums of squares, we take each observation and subtract off the associated mean of the particular treatment the observation comes from:

$$SS_{error} = (10-9)^2 + (9-9)^2 + (10-9)^2 + (8-9)^2 + (8-9)^2 + (6-5)^2 + (4-5)^2 + (7-5)^2 + (3-5)^2 + (5-5)^2 + (5-4)^2 + (6-4)^2 + (5-4)^2 + (2-4)^2 + (2-4)^2 = 28$$

The F-test then becomes:

$$F = \frac{70 / (3-1)}{28 / (15-3)} = 15.0$$

The 15 is the number of observations and the 3 is the number of treatments. In the F-table, we see that for 2 and 12 degrees of freedom, the critical F is 3.89. Since 15.0 exceeds this value, we reject the null hypothesis and state that mean differences exist. The F distribution, like the Chi-square distribution, is a one-tailed distribution.
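The decomposition above can be verified with a short Python sketch. The fifteen store-level scores are read off the error sums of squares calculation; the variable names are our own.

# Minimal check of the sums of squares and F statistic for the
# in-store promotion example.
groups = {
    "high":   [10, 9, 10, 8, 8],
    "medium": [6, 4, 7, 3, 5],
    "low":    [5, 6, 5, 2, 2],
}
all_obs = [x for g in groups.values() for x in g]
grand_mean = sum(all_obs) / len(all_obs)

# between-groups (treatment) sum of squares: group size times squared
# deviation of each group mean from the grand mean
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                 for g in groups.values())
# within-groups (error) sum of squares: squared deviation of each
# observation from its own group mean
ss_within = sum((x - sum(g) / len(g)) ** 2
                for g in groups.values() for x in g)

c, n = len(groups), len(all_obs)
f = (ss_between / (c - 1)) / (ss_within / (n - c))
print(ss_between, ss_within, round(f, 2))   # 70.0 28.0 15.0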
Using our Employee.sav data set, we think that beginning salary varies by job category. We go to Analyze, Compare Means, and One-Way ANOVA. We put beginning salary in the Dependent List and employment category as the Factor. When we run the ANOVA, we find a significant difference. However, we really do not know which comparisons are the significant ones. There are three possible comparisons. We would have to do a multiple comparison test to figure out which comparisons are significant. There are a number of these tests: one is called Scheffe's test, another is the Bonferroni test, and another is Duncan's Multiple Range test. There are different reasons for using each test, but that analysis is beyond our scope of work. You can use either the Scheffe test or the Bonferroni test, as these are easy to read from the computer output. Wherever there is an asterisk, it means there is a significant difference.

Let's run a Scheffe test. You will select Analyze, Compare Means, One-Way ANOVA, and then Post Hoc. Then check Scheffe. You can check any other multiple comparison test if you understand why and how these are used. The results below do show asterisks for several significant comparisons. There is a significant difference between clerical beginning salary and management beginning salary, and between custodial beginning salary and management beginning salary, but not between custodial beginning salary and clerical beginning salary.

Other Analysis of Variance Methods:

The other ANOVA methods are beyond the scope of our text. However, we can easily implement these methods in SPSS. For example, we might want to carry out a factorial analysis of some kind. We can run various factorial analyses by using the Univariate procedure in SPSS. If you go to Analyze, General Linear Model, and Univariate, a screen will come up that will allow you to put in one dependent variable but several types of factors. These factors are called fixed factors, random factors, and covariates. Factorial designs thus allow different kinds of factors as well as interaction terms. Our simple one-way ANOVA allows only one fixed independent variable and no interaction terms. Fixed means that the response of the dependent variable for a level of the independent variable has no random distribution. There is also an ANOVA model, multivariate analysis of variance (MANOVA), where there can be more than one dependent variable. This can happen when we have several models where the data are interdependent from one model to the next. Hypothesis testing would be inaccurate without the use of such an interdependent model.
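Although we will not do the math for factorial designs, it may help to see a sketch of what the Univariate procedure is doing. The sketch below uses the Python pandas and statsmodels libraries (assumed to be installed); the data frame and the variable names sales, promotion, and coupon are hypothetical, invented for illustration, and are not from Employee.sav.

# A hedged sketch of a two-factor factorial ANOVA with an interaction term.
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "sales":     [10, 9, 8, 6, 5, 4, 7, 6, 5, 3, 2, 2],   # made-up scores
    "promotion": ["high"] * 6 + ["low"] * 6,               # first fixed factor
    "coupon":    (["yes"] * 3 + ["no"] * 3) * 2,           # second fixed factor
})

# C() marks a categorical factor; '*' requests both main effects
# and the promotion-by-coupon interaction.
model = ols("sales ~ C(promotion) * C(coupon)", data=df).fit()
print(anova_lm(model, typ=2))   # F test for each effect and the interaction

The printed table plays the same role as the SPSS Univariate output: one F test per main effect plus one for the interaction.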
Chapter 9 Correlation (Lab 9)

The Product-Moment Correlation:

There are many types of correlation. The correlation we will be doing has both variables measured on at least an interval basis. There are other types of correlation for other situations. There also is something called partial correlation, where we can see the relationship between two variables while controlling for other variables; there too, the variables have to be measured on at least an interval basis. We will look at the product-moment correlation.

We have to be careful in our analysis. Just because two variables are correlated does not mean one causes the other. The correlation may be just by chance. Thus it is necessary to have a good theory to explain the correlation. It is like the person who found an association between the density of storks and the birth rate. One might conclude that storks bring babies, but I do not think that is the way babies are created. So having a good theory will allow us to get a handle on the correlation. We can plot the relationship between the two variables using a scatter plot, which plots one variable on the x axis and the other on the y axis.

The sample correlation coefficient is denoted as r, and the population correlation coefficient is denoted by the Greek letter ρ (rho). The formula for the correlation coefficient is:

$$r = \frac{\dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}}{\sqrt{\dfrac{\sum (X_i - \bar{X})^2}{n-1}}\sqrt{\dfrac{\sum (Y_i - \bar{Y})^2}{n-1}}}$$

The numerator is the covariance between X and Y. The denominator is the standard deviation of the X variable times the standard deviation of the Y variable. The numerator is easily calculated by forming a table, taking each X value and subtracting off the mean of the X variable, taking each Y value and subtracting off the mean of the Y variable, multiplying each pair of deviations, summing over all pairs of observations, and dividing by n-1. In the denominator, we take each X value, subtract off the mean, square the result, sum over all X values, and divide the total by n-1; we do the same for the Y values. We then multiply these two values together and take the square root, and this number is divided into the numerator. (Because the same n-1 appears in the numerator and the denominator, it cancels, so we can also work with the raw sums directly, as below.)

So, if we think that there is a correlation between attitude toward sports cars and duration of car ownership, we would proceed as follows:

$$\bar{X} = (10 + 12 + 12 + 4 + 12 + 6 + 8 + 2 + 18 + 9 + 17 + 2)/12 = 9.333$$

$$\bar{Y} = (6 + 9 + 8 + 3 + 10 + 4 + 5 + 2 + 11 + 9 + 10 + 2)/12 = 6.583$$

$$\sum (X_i - \bar{X})(Y_i - \bar{Y}) = (10 - 9.33)(6 - 6.58) + (12 - 9.33)(9 - 6.58) + \ldots = 179.6668$$

$$\sum (X_i - \bar{X})^2 = (10 - 9.33)^2 + (12 - 9.33)^2 + (12 - 9.33)^2 + \ldots = 304.6668$$

$$\sum (Y_i - \bar{Y})^2 = (6 - 6.58)^2 + (9 - 6.58)^2 + (8 - 6.58)^2 + \ldots = 120.9168$$

$$r = \frac{179.6668}{\sqrt{(304.6668)(120.9168)}} = 0.9361$$

The correlation coefficient varies from -1 to +1. The closer the correlation coefficient is to -1 or +1, the stronger the linear relationship between the two variables. The significance of the relationship is measured by a t-test with n-2 degrees of freedom.

Sometimes we will want to measure the amount of variation explained in our model. This is called r squared and is calculated by taking r and squaring it. It also equals (Total Variation - Error Variation)/Total Variation. We know how to calculate error variation and explained variation already; if we add the two values together, we get total variation. So it is fairly easy to calculate the percent of variation explained.

We think that there is a relationship between beginning salary and current salary. Go to Analyze, Correlate, Bivariate. Put the variables we want to correlate in the Variables box. The results give the correlation as well as the significance level. The variables are significantly correlated at the 0.000 level, and the correlation coefficient is 0.88. So there is a correlation between beginning salary and current salary. Likely, as a person starts a job with a larger beginning salary, any pay raises occur on a bigger base, so there would be a correlation between beginning salary and current salary. There may be other reasons for this correlation, but theory would have to provide sensible explanations.
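As a final check on the hand calculation, here is a minimal Python sketch of the product-moment correlation using the twelve pairs of values from the worked example; the variable names are our own.

# Minimal check of the product-moment correlation for the sports car example.
from math import sqrt

x = [10, 12, 12, 4, 12, 6, 8, 2, 18, 9, 17, 2]   # X values from the example
y = [6, 9, 8, 3, 10, 4, 5, 2, 11, 9, 10, 2]      # Y values from the example

mx, my = sum(x) / len(x), sum(y) / len(y)
sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))  # sum of cross products
sxx = sum((xi - mx) ** 2 for xi in x)                     # sum of squares for X
syy = sum((yi - my) ** 2 for yi in y)                     # sum of squares for Y

r = sxy / sqrt(sxx * syy)    # the n-1 terms cancel, so raw sums suffice
print(f"r = {r:.4f}, r squared = {r * r:.4f}")   # r = 0.9361, r squared = 0.8763

Chapter 10 Regression (Lab 10)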