LIS570 Research Methods
Hazel Taylor

LIS570 QUANTITATIVE ANALYSIS PRACTICAL 4
HYPOTHESIS TESTING FOR NOMINAL DATA

STUDENT NAME: ……………………………………………..

The purpose of this exercise is two-fold:

1. To apply what we have learned in class about testing hypotheses about an underlying population from nominal sample data.
2. To use a spreadsheet package (MS Excel) to perform simple non-parametric inferential statistical tests.

You will be working through a research case exercise, using a raw data set extracted from a graduate survey at a New Zealand community college. Read the instructions in each section carefully, to avoid errors that will, in the end, slow you down or prevent you from completing the assignment successfully.

Please note that full step-by-step instructions are NOT given. Instead, at each point you are given some guidelines about where to look for the commands you need, and then you are expected to try and work out the detail of what’s required by yourself. Use the Help functions when you are not sure! Having said that, please do NOT spend hours struggling if you get stuck and can’t find a solution through the Help functions. Instead, put the exercise aside, come back to it the next day and, if you still can’t see any way through, send me an email!

NOTE: This exercise involves the use of MS Excel pivot tables. Review Practical 2 for basic instructions on using pivot tables.

You can submit this exercise via eSubmit, if you wish. Fill in the answers to the questions on this document, and submit the document and your Excel spreadsheet.

Your document should contain answers to: RC4.1a, RC4.1b, RC4.1c, RC4.2a, RC4.2b, RC4.3a, RC4.3b, RC4.3c, RC4.4a, RC4.4b.
Your spreadsheet should show results for: RC4.2, RC4.3, RC4.4.

RESEARCH CASE 4: GENDER IMBALANCE IN COMPUTING GRADUATES

For this exercise, you are provided with a data set, gndrqual.xls, which has been extracted from a graduate survey of students graduating from a New Zealand community college. The data set contains a random sample of responses from 1998 computing graduates, recording the gender and qualification of each graduate.

Load the gndrqual.xls file into MS Excel, and examine the data set. You should see a spreadsheet with 3 columns: Response; Qual; Gender. Note that the coding key is shown on a separate worksheet. The community college offered many certificate and diploma qualifications in computing, in addition to a bachelor’s degree.

As usual, protect your raw data sheet and create a new worksheet for your analyses.

RC4.1 Hypotheses

We have two alternative hypotheses, listed below, that we wish to test. What is the null hypothesis for each alternative hypothesis?

Alternative hypothesis 1: From this sample we can conclude that there is an imbalance in gender in the underlying population of 1998 computing graduates from this community college.

RC4.1a Answer - Null hypothesis 1:

Alternative hypothesis 2: From this sample we can conclude that there is a relationship between gender and qualification in the underlying population of 1998 computing graduates from this community college.

RC4.1b Answer - Null hypothesis 2:

RC4.1c Answer - p value: What p value will you choose for this analysis?

RC4.2 Hypothesis 1

Take another look at the first hypothesis. What variable(s) are we considering in this hypothesis? What type of variable? What sort of test can we use in this situation?

RC4.2a Answer:

The first step is to establish a count of the numbers of males and numbers of females in the sample.
You can do this using the function COUNTIF, or you can prepare a pivot table (see Practical 2 for instructions on pivot tables). Prepare a table of the count of responses by gender.

Your results should show a difference between the numbers of male and female graduates. We want to find out whether this difference in the sample is enough to infer that there is a difference in the whole (1998) population of computing graduates. In order to do this, we compare our observed sample results with the results we would expect if in fact there were no difference in the whole population. Add another column to your table to show the expected results. Double check to ensure there are no calculation errors.

We now need to calculate the values for the one-way chi-square statistic. You can find this formula in the lecture notes on inferential statistics. Add another column to hold the chi-square calculation. Enter the calculation in each cell in the chi-square column, and total this column to find the value of the chi-square statistic.

On its own, this value is no help, because we don't know the probability of getting this value of chi-square for a sample if the null hypothesis is in fact true. MS Excel has a function, CHIDIST (also available as CHISQ.DIST.RT in Excel 2010 and later), which enables you to calculate the probability. However, you first need to work out the degrees of freedom. Enter the degrees of freedom in a cell on your spreadsheet, and type an informative label next to this cell to identify its content.

Now use the CHIDIST function to calculate the probability of getting your value of chi-square with the degrees of freedom you have. Make sure you label this cell too.

Compare your calculated probability with the p value you set at the beginning. Can we conclude that there is a statistically significant difference in gender in the underlying population of 1998 computing graduates from this community college? Why or why not?
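
If you would like to cross-check your spreadsheet outside Excel, the one-way chi-square steps can be sketched in Python. This is an illustrative sketch only, not part of the assignment. The counts (60 and 40) are made up and are NOT the values in gndrqual.xls, and the names observed, expected, chi_square, df and p_value are hypothetical. The sketch assumes the standard one-way formula (the sum over categories of (observed - expected)^2 / expected), a null hypothesis of equal proportions, and degrees of freedom equal to the number of categories minus one. The erfc shortcut used for the probability holds only for one degree of freedom.

```python
import math

# Hypothetical counts for illustration only (NOT taken from gndrqual.xls).
observed = {"Male": 60, "Female": 40}

# Assumed null hypothesis: no imbalance, so each category is expected
# to hold an equal share of the sample.
total = sum(observed.values())
expected = {gender: total / len(observed) for gender in observed}

# One-way chi-square statistic: sum of (observed - expected)^2 / expected.
chi_square = sum(
    (observed[g] - expected[g]) ** 2 / expected[g] for g in observed
)

# Degrees of freedom: number of categories minus one.
df = len(observed) - 1

# Upper-tail probability, i.e. the quantity CHIDIST(chi_square, df) reports.
# This erfc form is valid only when df == 1.
assert df == 1
p_value = math.erfc(math.sqrt(chi_square / 2))

print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.4f}")
```

With these made-up counts the chi-square statistic is 4.0 on 1 degree of freedom, giving a probability of about 0.0455. Your own figures from the data set will differ.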
RC4.2b Answer:

RC4.3 Hypothesis 2

Our second hypothesis requires a different sort of test - what is it? (HINT: how many variables are involved?)

RC4.3a Answer:

First we need to create a contingency or frequency table of qualifications by gender. Use a Pivot Table to do this.

Take a look at the resulting table. Remember that chi-square requires an expected value of at least 5 in every cell. Will you meet this condition?

RC4.3b Answer:

One option to deal with this is to collapse the categories. Of course, that can only be done with suitable categories. For our example, we could choose to compare just three categories: bachelor’s degree; certificate and diploma in computing education (both teaching related); others (all vocational certificates and diplomas).

In order to do this, make sure you have the Pivot Table toolbar displayed. Hold down the control key and click on the categories in the Pivot Table that you want to select for the first group. Select the Pivot Table drop-down menu from the Pivot Table toolbar, and select Group and Show Detail/Group to group these categories. Click on the new Group 1 cell in the table, and click on the Minus sign on the Pivot Table toolbar to hide details. Finally, rename the group with a more meaningful name. Repeat for the other groups. Your Pivot Table should now show only 3 categories of qualification, and the totals for these categories should be summarised.

We now need to create the expected values for this table. You can find this formula in the lecture notes on inferential statistics. Create the expected values alongside your table. Double check for errors.

We are now ready to do the appropriate chi-square calculation. MS Excel has a function called CHITEST (also available as CHISQ.TEST in Excel 2010 and later) that will return the probability for the two-way chi-square value related to the observed and expected results you have entered.
For this function, you don’t need to enter the degrees of freedom because Excel will work it out from your arrays of observed and expected values.

Now use the CHITEST function to calculate the probability of getting your value of chi-square with the degrees of freedom that you have. If necessary, format the CHITEST cell as a number with at least 6 decimal places. Make sure you label this cell too.

Compare your calculated probability with the p value you set at the beginning. Can we conclude that there is a statistically significant relationship between gender and qualification in the whole population of 1998 computing graduates from this community college? Why or why not?

RC4.3c Answer:

RC4.4 Further Analysis of Hypothesis 2

Take another look at the pivot table you created in section RC4.3. Some rows show more males than females, while one row should show more females than males. At this stage, we may wonder whether any relationship between gender and qualification in computing graduates is due to the computing education graduates, who seem to be predominantly female, even though our whole sample has more males than females. Can we conclude that there is a relationship between gender and qualification for the remaining qualifications?

In order to investigate this, we first need to subset our data, and select just the rows that don't contain computing education graduates. First of all, select your grouped Pivot Table and copy it to a new location (so that you don't do anything to affect the calculations you've already done). Click your cursor in the computing education qualification cell in the new Pivot Table, right click, and select Hide. Your Pivot Table should now display just the categories for the bachelor’s degree and vocational qualifications.

Do a chi-square analysis on the new Pivot Table, and find the corresponding CHITEST value. What is your conclusion from this test?
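
The two-way chi-square calculations used in RC4.3 and RC4.4 can also be cross-checked outside Excel. The Python sketch below is illustrative only and is not part of the assignment. The observed counts are invented and are NOT the values in gndrqual.xls, and the names chi2_upper_tail, observed, expected, chi_square, df and p_value are hypothetical. It assumes the standard formulas: each expected value is (row total x column total) / grand total, the statistic is the sum of (observed - expected)^2 / expected over all cells, and the degrees of freedom are (rows - 1) x (columns - 1). The helper uses the closed-form upper-tail probability for whole-number degrees of freedom, which is the quantity CHITEST reports.

```python
import math


def chi2_upper_tail(x, df):
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series.
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * i + 1)
    return total


# Hypothetical observed counts (NOT from gndrqual.xls).
# Rows: qualification group; columns: Female, Male.
observed = [
    [10, 30],  # Bachelor's degree
    [20, 10],  # Computing education
    [20, 60],  # Other vocational qualifications
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected value for each cell: row total * column total / grand total.
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Check the "at least 5 in every cell" condition on the expected values.
smallest_expected = min(min(row) for row in expected)

# Two-way chi-square statistic summed over every cell.
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

df = (len(row_totals) - 1) * (len(col_totals) - 1)
p_value = chi2_upper_tail(chi_square, df)

print(f"smallest expected value = {smallest_expected:.2f}")
print(f"chi-square = {chi_square:.3f}, df = {df}, p = {p_value:.6f}")
```

To mimic the RC4.4 step, delete the computing education row from observed and rerun the sketch; the table then has 2 rows and 2 columns, so the degrees of freedom drop to 1.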
RC4.4a Answer:

A final point relates to the conclusions you can draw from your results. Can you conclude that there is a relationship between gender and the computing education qualifications for all computing graduates from this community college? Explain why or why not.

RC4.4b Answer:

Congratulations! You have completed the final practical for this course.

You can submit this exercise via eSubmit, if you wish. Fill in the answers to the questions on this document, and submit the document and your Excel spreadsheet.

Your document should contain answers to: RC4.1a, RC4.1b, RC4.1c, RC4.2a, RC4.2b, RC4.3a, RC4.3b, RC4.3c, RC4.4a, RC4.4b.

Your spreadsheet should show results for: RC4.2, RC4.3, RC4.4.