STAT 110: Section 3 – Methods for Analyzing a Single Categorical Variable Fall 2015

A STATISTICAL INVESTIGATION

In this section, we will discuss both descriptive and inferential methods that are appropriate only when investigating a single categorical variable with two levels (e.g., yes or no, low birth weight or normal birth weight, smoker or nonsmoker, etc.).

FORCED CHOICE TECHNIQUE IN CRIMINAL INVESTIGATIONS

Example 3.1: A suspected serial-rape murderer, an ex-con with a history of sex crimes, was interrogated by police after he was overheard bragging to others that he raped, killed, and buried a young woman in an isolated valley outside of the city in which he resided. He told police that he had never met the victim and that he had never been to the valley. A series of binary (yes/no) questions embedded within the interrogation was designed to test his knowledge of victim characteristics that only the perpetrator would know. Of the 20 questions regarding victim characteristics, clothing, and information obtained from the family and friends who last saw her, he answered 3 correctly. Does this provide evidence that the suspect was guilty of the crime? Why or why not?

Questions:

1. What is the single categorical variable of interest in this problem (i.e., our response variable)?

2. How many questions did the suspect answer correctly? What percentage is this? Note that calculating and reporting this sample proportion was covered in the Descriptive Statistics section.

3. Suppose the suspect was merely guessing the answers to the 20 questions and had no knowledge of the victim. How many questions would we expect the suspect to answer correctly? What percentage is this?

Note that the observed number of correct answers is less than would be expected. However, even though this is less than expected, this is not necessarily enough statistical evidence to support the suspect’s guilt.
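The guessing model in Question 3 can also be explored in code. The following Python sketch is not part of the original Tinkerplots activity. It assumes every answer is an independent guess with a 50% chance of being correct, and the function names (`simulate_guessing`, `dot_plot`), the number of simulated sets, and the seed are illustrative choices rather than anything prescribed by the handout.

```python
import random
from collections import Counter


def simulate_guessing(n_questions=20, p_correct=0.5, n_sets=100, seed=110):
    """Return the number of correct answers in each simulated set of guesses.

    Assumption: each answer is an independent guess that is correct with
    probability p_correct (the "merely guessing" model).
    """
    rng = random.Random(seed)  # arbitrary fixed seed so reruns give the same counts
    counts = []
    for _ in range(n_sets):
        # One set of guesses: count how many of the n_questions come out correct.
        correct = sum(rng.random() < p_correct for _ in range(n_questions))
        counts.append(correct)
    return counts


def dot_plot(counts):
    """Return a text dot plot of the simulated counts, one row per value."""
    tally = Counter(counts)
    rows = []
    for value in range(min(counts), max(counts) + 1):
        rows.append(f"{value:>3} | " + "." * tally[value])
    return "\n".join(rows)


if __name__ == "__main__":
    counts = simulate_guessing()
    print(dot_plot(counts))
    observed = 3  # number of questions the suspect answered correctly
    as_low = sum(c <= observed for c in counts)
    print(f"{as_low} of {len(counts)} simulated sets had {observed} or fewer correct")
```

Running the script prints a stacked dot plot of the simulated counts and reports how often a guesser did as poorly as the suspect. Changing `seed` or `n_sets` shows how much the picture varies from one simulation to the next.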
A key question is how to determine whether the suspect’s result is surprising under the assumption that he was merely guessing the answers to the 20 questions asked about the victim. To answer this, we will use Tinkerplots 2® to simulate the process of answering 20 questions by merely guessing, with a 50% chance of answering each one correctly, over and over again. Each time we simulate the process, we’ll keep track of how many questions a suspect who was simply guessing would get right. Once we’ve repeated this process a large number of times, we’ll have a pretty good sense for what outcomes would be very surprising, or somewhat surprising, or not so surprising if the suspect was simply guessing answers to the questions about the victim.

The Simulation in Tinkerplots (you will not be doing this yourself right now)

Drag a new Sampler to the workspace.
Click and drag the Spinner over the workspace so that you see a spinner.
Change Attr1 to Answer.
Change the options on the Spinner from a and b to Correct and Wrong.
Click on the drop-down arrow below the spinner and select Show Percentage. Change the proportions to 50% (so that we simulate the situation in which the suspect is merely guessing the answers to the questions; i.e., there is a 50% chance of answering correctly).
Change Draw to 1 to represent answering only one question at a time.
Change the Repeat value to 20 to represent the 20 questions asked by the interrogators about the victim.
Your spinner should look like this:
Change the speed to Fastest, and click Run. A table should appear with 20 entries. This represents one set of answers to the 20 questions where the suspect was merely guessing the answers.
To count the number of correct answers, highlight the column titled Answer and drag a new plot to the workspace.
Click and drag one point from your plot all the way to the right, and the points should organize as follows:
Use the N button to count the number of correct (and wrong) answers. How many questions in your first sample did the guessing suspect answer correctly?
To simulate this process 99 more times, place your cursor over the number of correct answers displayed on the plot, right-click, and select Collect Statistic.
Ask Tinkerplots to collect 99 more samples (note that each is of size 20) and click Collect. A table should appear containing the results of all 100 simulated trials.
To summarize these results, highlight the column containing these counts and drag a new plot to the workspace.
Click and drag a point all the way to the right to organize the points.
Double-click on either endpoint, and change Bin Width to 1.
Use the vertical stack option to better organize the data points.
Select N to count the number of times each outcome occurred in your 100 simulations.
Sketch your results on the graph on the next page.

Questions:

4. What does each dot on the above plot represent?

5. Based on the results of the simulation study, would you consider the result actually obtained during the interrogation of the suspect (3 out of 20 correct answers) to be surprising or unusual if he was simply guessing the answers to questions about the victim?

6. Do you think that the interrogation results provide sufficient evidence of the suspect’s guilt? Explain why or why not.

Mac vs. PC and the WSU Laptop Program

Example 3.2: Consider a survey administered to a random sample of 318 Winona State undergraduate students. One of the items on the survey asked which platform they chose for their laptop, PC or Mac.
Suppose that a college professor hypothesizes that the majority of all WSU undergraduate students prefer Macs. Of the 318 students surveyed, 171 reported that they preferred Macs. Do these data support the professor’s hypothesis? We will carry out a statistical investigation in order to answer this question.

Questions:

1. What is the single categorical variable of interest in this problem (i.e., our response variable)?

2. How many of the students in the sample favored the Mac? What percentage is this? Note that calculating and reporting this sample proportion was covered in the Descriptive Statistics section.

3. Suppose the overall population of WSU undergraduate students has no real preference for either the PC or the Mac; that is, the population of all students is split evenly in terms of their preference. If this is the case, what percentage of the 318 students would you expect to choose a Mac when taking the survey? How many students is this?

Note that the observed number of students who chose the Mac is greater than what would be expected if the population of all WSU undergraduates had no real preference. However, even though more than half of our sample chose the Mac, this is not necessarily enough statistical evidence to support the professor’s claim that the majority of all WSU undergraduates prefer the Mac.

A key question is how to determine whether the survey’s result is surprising under the assumption that the population of WSU undergraduates overall has no preference. To answer this, we will use Tinkerplots 2® to simulate the process of 318 students choosing either a Mac or a PC, each with a 50% probability, over and over again. Each time we simulate the process, we’ll keep track of how many students chose the Mac (note that you could also keep track of the number of students who chose the PC).
Once we’ve repeated this process a large number of times, we’ll have a pretty good sense for what outcomes would be very surprising, or somewhat surprising, or not so surprising if the population of all WSU undergraduates has no real preference.

The Simulation in Tinkerplots

Drag a new Sampler to the workspace.
Click and drag the Spinner over the workspace so that you see a spinner.
Change Attr1 to Computer Preference.
Change the options on the Spinner from a and b to PC and Mac.
Click on the drop-down arrow below the spinner and select Show Proportion. Change the proportions to 0.50 (so that we simulate the situation in which the population has no preference; i.e., a student randomly selected from the population has a 50% chance of choosing a PC).
Change Draw to 1 to represent taking only one student from the population at a time.
Change the Repeat value to 318 to represent taking a random sample of 318 students.
Your spinner should look like this:
Change the speed to Fastest, and click Run. A table should appear with 318 entries. This represents one random sample of 318 students taken from a population with no real preference for either the Mac or the PC.
To count the number of students who chose the Mac in this first sample, highlight the column titled Computer_Preference and drag a new plot to the workspace.
Click and drag one point from your plot all the way to the right, and the points should organize as follows:
Use the N button to count the number who chose the Mac (and the PC). How many students out of the 318 in your first sample chose the Mac?
To simulate this process 99 more times, place your cursor over the number who chose the Mac displayed on the plot, right-click, and select Collect Statistic.
Ask Tinkerplots to collect 99 more samples (note that each is of size 318) and click Collect. A table should appear containing the results of all 100 simulated trials.
To summarize these results, highlight the column containing these counts and drag a new plot to the workspace.
Click and drag a point all the way to the right to organize the points.
Double-click on either endpoint, and change Bin Width to 1.
Use the vertical stack option to better organize the data points.
Select N to count the number of times each outcome occurred in your 100 simulations.
Sketch your results on the graph below.

[Blank graph for sketching; horizontal axis: Number Choosing Macs]

Questions:

4. What does each dot on the above plot represent?

5. Based on the results of the simulation study, would you consider the result actually obtained in the survey study (171 out of 318 preferring Macs) to be surprising or unusual if the population of all WSU students had no real preference for either type of computer?

6. Do you think that the survey data provide evidence that the majority of all WSU undergraduates prefer Macs? Explain why or why not.

Also, note that using the data obtained from the sample to make a generalization about the population of all WSU undergraduates is an example of inferential statistics.

Example 3.3: Evaluating Deafness

Consider the case study presented in an article by Pankratz, Fausti, and Peed titled “A Forced-Choice Technique to Evaluate Deafness in the Hysterical or Malingering Patient.” Source: Journal of Consulting and Clinical Psychology, 1975, Vol. 43, pp. 421-422.

The following is an excerpt from the article:

The patient was a 27-year-old male with a history of multiple hospitalizations for idiopathic convulsive disorder, functional disabilities, accidents, and personality problems. His hospital records indicated that he was manipulative, exaggerated his symptoms to his advantage, and that he was a generally disruptive patient. He made repeated attempts to obtain compensation for his disabilities.
During his present hospitalization he complained of bilateral hearing loss, left-sided weakness, left-sided numbness, intermittent speech difficulty, and memory deficit. There were few consistent or objective findings for these complaints. All of his symptoms disappeared quickly with the exception of the alleged hearing loss.

To assess his alleged hearing loss, testing was conducted through earphones with the subject seated in a sound-treated audiologic testing chamber. Visual stimuli utilized during the investigation were produced by a red and a blue light bulb, which were mounted behind a one-way mirror so that the subject could see the bulbs only when they were illuminated by the examiner. The subject was presented several trials on each of which the red and then the blue light were turned on consecutively for 2 sec each. On each trial, a 1,000-Hz tone was randomly paired with the illumination of either the blue or red visual stimulus, and the subject was instructed to indicate with which stimulus the tone was paired.

Questions:

1. What is the single categorical variable of interest in this problem (i.e., our response variable)?

2. Suppose the subject is presented with 100 trials. If he truly has suffered hearing loss, he is essentially guessing on each trial. If this is the case, in how many trials would you expect the subject to correctly identify with which stimulus the tone was paired?

3. Suppose the subject correctly identifies with which stimulus the tone was paired in only 45 out of 100 trials. A researcher argues that since this was less than the expected number of correct matches, the subject must be intentionally answering incorrectly in order to convince the researchers he can’t hear. What is wrong with this reasoning?

4. Suppose the subject correctly identifies with which stimulus the tone was paired in none of 100 trials.
A researcher believes this result provides evidence that the subject must be intentionally answering incorrectly in order to convince the researchers he can’t hear. Do you agree?

Once again, the key question is how to determine whether the subject’s result on the 100 trials is surprising under the assumption that he truly has hearing loss and is simply guessing on each trial. To answer this, we will simulate the process of guessing on 100 trials of this experiment, over and over again. Each time we simulate the process, we’ll keep track of how many times the subject was incorrect (note that you could also keep track of the number of times he was correct). Once we’ve repeated this process a large number of times, we’ll have a pretty good sense for what outcomes would be very surprising, or somewhat surprising, or not so surprising if the subject is really guessing.

Use the instructions outlined above for Examples 3.1 and 3.2 to carry out the Tinkerplots simulation. Note that you will have to revise a few elements of the simulation that relate to the following questions:

What are the two possible outcomes on each of the trials? Change the values on your spinner accordingly.
What is the probability that each outcome occurs, given that the subject is just guessing on each trial? Change your spinner accordingly.
Be sure to change the Draw value to 1 since the subject is guessing only one color at a time.
In how many trials does the subject participate overall? Keep this value in mind when setting the Repeat value.

Carry out the simulation study 100 times overall, keeping track of the number of times the subject was incorrect in each simulated set of trials. Sketch in your results below:

Questions:

5. What does each dot on this plot represent?

6. Suppose a subject was incorrect on 57 out of 100 trials.
Would you believe they were probably just guessing, even though they had more than the expected number of incorrect answers? Why or why not?

7. The actual subject was incorrect on 64 out of 100 trials. Based on this statistical investigation, do you believe he was just guessing, or do these results indicate that he may have been answering incorrectly on purpose in order to mislead the researchers into thinking he was deaf? Explain your reasoning.

Example 3.4: Are Women Passed Over for Managerial Training?

This example involves possible discrimination against female employees. Suppose a large supermarket chain occasionally selects employees to receive management training. A group of female employees has claimed that they are less likely than male employees of similar qualifications to be chosen for this training. The large employee pool that can be tapped for management training is 60% female and 40% male; however, since the management program began, 9 of the 20 employees chosen for management training were female (only 45%). The question of interest is as follows: Do the data provide evidence of gender discrimination against females?

Questions:

1. What is the population of interest?

2. What is the sample?

3. What is the variable of interest?

Simulation Study

To investigate this research question, we will carry out a simulation in Tinkerplots 2®. Once again, note that you will have to revise a few elements of the simulation that relate to the following questions:

What are the two possible outcomes on each of the trials? Change the values on your spinner accordingly.
What is the probability that each outcome occurs, given that there is no discrimination? Change your spinner accordingly.
Be sure to change the Draw value to 1 since only one employee is selected for management training at a time.
How many employees were selected for management training overall in the study?
Keep this value in mind when setting the Repeat value.

Carry out the simulation study 1,000 times overall, keeping track of the number of females chosen in each simulated set of selections. You should see something similar to the following:

Questions:

1. What does each dot represent?

2. Would you say that this observed result (9 out of 20) provides evidence of gender discrimination against females? Explain.
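The simulation study for Example 3.4 can also be sketched outside of Tinkerplots. The Python code below is an illustration only and is not part of the original activity. It assumes that, with no discrimination, each of the 20 selections is independently female with probability 0.60 (reasonable when the employee pool is large), and the function names (`simulate_selections`, `share_at_most`) and the seed are hypothetical choices made for this sketch.

```python
import random


def simulate_selections(n_chosen=20, p_female=0.60, n_reps=1000, seed=2015):
    """Return the number of females chosen in each simulated set of selections.

    Assumption: with no discrimination, every selection is independently
    female with probability p_female (the share of females in the pool).
    """
    rng = random.Random(seed)  # arbitrary fixed seed so reruns give the same counts
    return [
        sum(rng.random() < p_female for _ in range(n_chosen))
        for _ in range(n_reps)
    ]


def share_at_most(counts, observed):
    """Return the fraction of simulated counts that are at most `observed`."""
    return sum(c <= observed for c in counts) / len(counts)


if __name__ == "__main__":
    counts = simulate_selections()
    observed = 9  # females among the 20 employees actually chosen
    share = share_at_most(counts, observed)
    print(f"Share of simulated sets with {observed} or fewer females: {share:.3f}")
```

The printed share plays the same role as counting dots at or below 9 on the Tinkerplots plot: the smaller it is, the more surprising the observed result would be if selections were made without regard to gender.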