Presentation 2

Summarizing One or Two Categorical Variables & Relationships Between Categorical Variables

Types of Variables

Categorical – Possible values define groups or categories, not necessarily in an apparent ordering
  Ex. Color of M&M’s
      Gender
      Stat 200 Section

Ordinal – Categorical variable whose values or categories have a natural ordering
  Ex. Rate the roller coaster on a scale of 1-5 (1 is terrible and 5 is excellent)
      Age groups (child, teen, adult, senior citizen)
      Shirt sizes (S, M, L, XL)

Quantitative – Measurements or counts, recorded as numerical values
  Ex. Height
      Temperature
      # of Red M&M’s

Possible Roles Played by Variables:

Response Variables – the variables whose outcome we want to determine. These are the variables of main interest.

Explanatory Variables – the variables that partially explain the value of the response variable for the individual.

For each of the following, identify the response and the explanatory variables as well as the variable type:

1. Is there a relationship between a person’s gender and their favorite kind of music?
   Response:
   Explanatory:

2. Do men and women listen to the same number of hours of music?
   Response:
   Explanatory:

3. Do people who play musical instruments rate the types of music the same?
   Response:
   Explanatory:

4. Does a person’s hometown influence the amount they would pay for a single CD?
   Response:
   Explanatory:

5. Do people who have a CD burner prefer to buy or burn their CDs?
   Response:
   Explanatory:

Summarizing Categorical Variables:

For one variable:
1. Numerical Summaries: counts and percents
2. Graphical Summaries: Pie Chart or Bar Graph

For two variables:
1. Numerical Summaries: 2-way tables with counts and row percents. The explanatory variable should be the row variable (first variable entered in Minitab) and the response variable should be the column variable (second variable entered in Minitab).
2. Graphical Summaries: Bar Graph

Example for One Categorical Variable: Where do Penn State alumni live?
The PSU Alumni Association would like to obtain the answer to this question from all PSU alumni. They cannot ask all alumni, so they took a random sample of 50 alumni from the directory and determined each person's state of residence from the address. Here are the results:

State      Frequency
PA         25
NJ         10
MD          5
VA          5
OH          2
NY          2
OTHER       1
TOTAL      n = 50

What do these descriptive statistics tell us?

Example for Two Categorical Variables: Do most college students have a credit card?

A study aims to determine whether the percentage of students who have at least one credit card differs by year in school. Four separate samples (Fr, So, Jr, Sr), each of 100 PSU students, were obtained. Each student was asked one question: “Do you currently have at least one credit card?”

Identify the response and the explanatory variable in this case:
Response:
Explanatory:

Credit Card Example

                Yes     No    Row total
Freshman         42     58       100
Sophomore        55     45       100
Junior           76     24       100
Senior           81     19       100
Column total    254    146       400

[Bar Graph: Credit Card Example. Counts of Yes and No responses (vertical axis, 0 to 90) by year in school (Fr, So, Jr, Sr).]

What do these descriptive statistics tell us?

Assessing the Statistical Significance of the Relationship between Two Categorical Variables

Suppose we ask 15 randomly selected students 2 questions:
1. Do you smoke?
2. Did you have a beer last night?

We summarize the results using the Cross Tabulation function in Minitab:

Tabulated Statistics: smoke, beer

Rows: smoke   Columns: beer

             n        y      All
n            9        2       11
         81.82    18.18   100.00
y            1        3        4
         25.00    75.00   100.00
All         10        5       15
         66.67    33.33   100.00

Cell Contents:  Count
                % of Row

Inference about the Population!

How can we tell if there is a relationship between being a smoker and drinking beer last night? Does the relationship seen in the sample data hold in the population represented by this sample?

Techniques used to make generalizations about the population using a sample are known as inferential statistics.
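Before moving from the sample to the population, it may help to see where the row percents in the smoke/beer cross tabulation come from: each count is divided by its row total. The following is a minimal Python sketch of that arithmetic. Python is not part of the course materials (which use Minitab), and the function name `row_percents` is a hypothetical helper introduced only for this illustration.

```python
def row_percents(counts):
    """Return each count as a percent of its row total, rounded to 2 places."""
    result = []
    for row in counts:
        total = sum(row)
        result.append([round(100 * c / total, 2) for c in row])
    return result

# Counts from the smoke/beer table: rows are smoke (n, y), columns are beer (n, y).
observed = [[9, 2],
            [1, 3]]

# Build the "All" row by summing each column.
all_row = [sum(col) for col in zip(*observed)]

for label, pcts in zip(["n", "y", "All"], row_percents(observed + [all_row])):
    print(label, pcts)
```

In this sample, 18.18% of the non-smokers and 75.00% of the smokers had a beer last night, matching the % of Row entries in the Minitab output.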
A statistically significant relationship is one that is large enough to be unlikely to have occurred in the observed sample if there is no relationship in the population.

Null and Alternative Hypotheses

Another way to express our objective is that we are deciding between two possible hypotheses about the population:

Null Hypothesis: The two variables are not related.
Alternative Hypothesis: The two variables are related.

In our example we have:

Null Hypothesis: Being a smoker and drinking beer last night are not related.
Alternative Hypothesis: Being a smoker and drinking beer last night are related.

Chi-square Statistic

We usually use the Chi-square Statistic to handle this type of question. The Chi-square Statistic is used to assess the statistical significance of the association between 2 categorical variables. A large Chi-square Statistic is evidence of a statistically significant relationship between the 2 variables.

How does the Chi-square Statistic work?

It measures the difference between the observed counts and the counts that would be expected if there were no relationship (that is, under the null hypothesis).

Chi-Square Statistic and p-value

A large Chi-square Statistic is evidence of a statistically significant relationship between the 2 variables. However, how large is large? This is why we need the p-value: it tells us whether the Chi-square Statistic is “large enough”. We can obtain the p-value from our Minitab output.

How to use the p-value?

1. The bigger the Chi-square Statistic is, the smaller the p-value will be.
2. Generally, when the p-value is less than 0.05 (5%), we conclude that the observed relationship is unlikely to have occurred by chance alone, and we call it statistically significant.
3. Generally, when the p-value is larger than 0.05 (5%), we say the observed relationship could have occurred just by chance. Therefore, we cannot reject the null hypothesis that there is no relationship.

Example: Part 3 of the activity….
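As an illustration of the steps described above, the sketch below computes the Chi-square Statistic and its p-value for the smoke/beer table by hand. It assumes the usual Pearson form: the expected count in each cell is (row total × column total) / n, and the statistic is the sum of (observed − expected)² / expected over all cells. Python is used only for illustration (the course itself uses Minitab), and `chi_square_2x2` is a hypothetical helper name. The p-value shortcut in the code applies only to a 2-by-2 table.

```python
import math

def chi_square_2x2(observed):
    """Pearson chi-square statistic and p-value for a 2x2 table of counts.

    No continuity correction is applied. The p-value formula used here is
    valid only for 1 degree of freedom, i.e. a 2x2 table.
    """
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Count expected in this cell if the two variables were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected

    # Upper-tail probability of a chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Counts from the smoke/beer table: rows are smoke (n, y), columns are beer (n, y).
observed = [[9, 2],
            [1, 3]]
chi2, p_value = chi_square_2x2(observed)
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.3f}")
```

For these 15 students the statistic is about 4.26 with a p-value of about 0.039, which is below 0.05, so by the rule above the relationship would be called statistically significant. With a sample this small, several expected counts fall below 5, so the chi-square approximation is rough and the result is best read as an illustration of the mechanics.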