TigerSTAT Instructor Guide
Laboratory Exercise: Using Simple Linear Regression to Estimate a Tiger’s Age

Quick Info

Level: Intro/Intermediate Undergraduate Statistics

Brief Description: Students investigate the association of tiger age with particular tiger characteristics by conducting a simple regression analysis. Instructors also have the option to assign a reading from a scientific article that includes descriptive statistics and regression analysis.

Topics Covered: Experimental Design, Data Analysis, Linear Regression, and an introduction to a real application of statistical modeling using the arcsine transformation.

Software Required: Data analysis software such as Excel or Minitab for descriptive statistics and regression analysis. Students will also need computer access to play the TigerSTAT game on the web (this can be done inside or outside of the regularly scheduled class time).

Prerequisites: Descriptive Statistics, Distributions, Hypothesis Testing

Time: 1 to 3 hours in class and 1 to 2 hours of homework

Instructor Resources:
Student Lab
Instructor Guide
TigerSTAT Game Website (http://web.grinnell.edu/individuals/kuipers/stat2labs/tigerstat.html)
Sustainable trophy hunting of African lions (Whitman et al. 2004) (http://www.cbs.umn.edu/lionresearch/publications/articles/Sustainable_trophy_hunting_of_African_lions.pdf)

Why use this lab in your course?

In this lab, students use the online TigerSTAT game to collect data and explore models for estimating the age of a Siberian tiger. In this game, students act as researchers on a national preserve where they are expected to catch tigers, collect data, analyze their data (using simple linear regression on transformed data), and draw appropriate conclusions. Before playing the game, students can read a scientific paper discussing current methods of estimating age in lions, largely through the use of proxy variables.
Through the TigerSTAT game, they are exposed to messy data and to issues associated with data collection. This lab provides an engaging way to practice simple linear regression applied to a real problem. The realism of the lab can be increased if students also read and discuss the research article provided in the introduction. One goal of this lab is to encourage students to consider the implications of more complicated research design topics such as sampling and bias.

What type of course is this TigerSTAT lab designed for?

This lab is designed for any course that introduces the simple linear regression model.

When should you use this lab in your course and what are the prerequisites?

Distributions and hypothesis testing should be familiar topics to the students. Although linear regression is the primary topic of this lab, the game can be used to motivate many topics, such as descriptive statistics and visualizing data. In this sense, the game may be revisited several times during a course.

How should you conduct the lab? How much time should you expect to allocate?

Day 1:

OPTIONAL READING: Ask students to complete the background research exercise (page 1), in which they read the entire article Sustainable trophy hunting of African lions (Whitman et al. 2004) and answer the discussion questions before class. You can then start class on Day 1 by using the questions from the background research exercise to motivate discussion. An alternative approach is to include the reading as part of the first day, perhaps selecting only portions of the article for students to read in class and then discuss together. The game and the lab questions that follow do not require students to read the article or look at the mathematical model used in it. You may also choose to skip page 1 (reading and discussing the research article) and simply start students with page 2 (the lab and the game).

Introduce the game (15 minutes).
If students have computers available, the first day is an excellent opportunity to go to the game’s webpage and play the tutorial. If no computers are available, a brief discussion is warranted. The instructor should explain how the tutorial works, the difference between the two missions, and how to retrieve the dataset. The instructor should have students complete questions 1-4 of the lab in preparation for the next class.

Day 2:

Have students come to class with their game data and complete all the lab questions, working in groups of 2-3. At a few points during the lab, the instructor should solicit discussion on certain blocks of questions. We recommend the following:

Begin with a discussion of student responses to questions 1-4 (10 min).

Have students work on questions 5-9, focusing discussion on the assumptions of a linear model (15 min). Instructors should be aware that 1) some student data may not support the model, 2) small data sets may be very erratic and may not accurately fit any model, and 3) it may be best to group data (make sure there are no duplicate observations in grouped data).

After a majority of student groups have worked on questions 10-12, discuss significance of association and model fit (15 min).

Question 13 will elicit very different responses from the students. For some, the transformation will mean very little in terms of the relationship between noseblack proportion and the model $R^2$ value. For others, the effect of the transformation will be readily apparent. Discussion should focus on the issues of sample size, representative sampling, and bias in the context of the experiment (20 min).

The Background Research exercise and TigerSTAT lab are available in the next section. For more information and ideas on using TigerSTAT in your course, go to: http://web.grinnell.edu/individuals/kuipers/stat2labs/tigerstat.html.

TigerSTAT Background Research

Before conducting the TigerSTAT lab, read the 2004 article by Whitman et
al., Sustainable trophy hunting of African lions (the article can be found at http://www.cbs.umn.edu/lionresearch/publications/articles/Sustainable_trophy_hunting_of_African_lions.pdf), and answer the following questions:

1. Why is estimating the age of a lion a worthwhile question?
2. What are some of the difficulties associated with estimating the age?
3. What are a few approaches to estimating lion ages? Which of these are possibly useful in estimating the age of a tiger?
4. How could you test to see if your model produces good estimates for a tiger’s age?

TigerSTAT
Using Simple Linear Regression to Estimate a Tiger’s Age

You are hired to develop models to use in estimating the age of a population of tigers. The Bolshoy Kosha (Russian for big cat) Reserve is a newly created animal reserve that was uniquely developed to help endangered species prosper. This 10,000-acre wild animal reservation was selected because an abundance of Siberian tigers has been found in the area. The diverse terrain of the reserve provides a wide variety of habitats for many different species of animals. Since the tigers in this area are much more abundant than in any other area in the world, they are starting to draw a significant number of researchers to the region. Your primary responsibility will be to help these researchers as they study the tigers and then incorporate the results of their research into a system to identify the best management practices for this reserve.

Establishing a simple model to estimate the age of a tiger.

While the exact age is not known for most of the tigers in your reserve, the ages of some tigers are known. These tigers have been carefully monitored by keeping them in a smaller research zone within the BK land area. To estimate the age of a tiger that is captured on your reserve, you will need to compare characteristics of the captured tiger to those of the tigers that live in the research zone (whose ages are known).
When data is collected as an indirect measure of the variable of interest, it is often called proxy data. For this task we will examine one model developed for lions and see how well it extends to our tigers. In the Whitman et al. (2004) article, the authors used the percentage of nose blackness (NOSEBLACK%) to develop the model:

$$AGE = \beta_0 + \beta_1 \arcsin(NOSEBLACK\%) \qquad (1)$$

Your mission is to go into the Bolshoy Kosha reserve and gather data on as many tigers as you can in 30 minutes. Using your sample data, answer the questions shown below.

Collecting Data

Go to http://statgames.tietronix.com/TigerStat/ and enter a PlayerName and GroupName. (Use a secret name: any combination of letters and numbers with no spaces. Do not use your name or a term that will identify you.) If you are working in teams, each person on your team should have the same GroupName. Use the Full Screen option to see the entire game on your computer screen.

Questions:

1) Calculate the mean and standard deviation of NOSEBLACK% of the tigers in your sample.

2) Calculate the mean and standard deviation of the AGE of the tigers in your sample.

3) Produce a graph of AGE against NOSEBLACK% for your sample. Describe the relationship you observe. Would a linear model be appropriate for these variables? Why or why not?

4) Create a new variable ANOSEBLACK% by computing the Arcsin of NOSEBLACK% for tigers in your sample. Graph AGE as a function of this new variable. Do you think the assumed linear relationship is reasonable? Why or why not?

5) Use your software package to regress AGE on ANOSEBLACK% in order to estimate the parameters in equation (1).

Checking the assumptions for any statistical model is imperative before any inferences are made. For our simple regression model, we assume that the residuals are normally distributed. Let’s check the validity of this assumption.

6) Using the parameter values obtained in (5), estimate the age of the tigers in your sample.
7) For each estimated value, compute the associated residual $e_i = (y_i - \hat{y}_i)$.

8) Create an appropriate plot you have learned about in class (histogram, qq-plot) for assessing the normality assumption for the set of residual values computed in Question (7). Does our assumption of normality hold?

Before performing our regression, we transformed our explanatory variable NOSEBLACK% using the Arcsin function (ANOSEBLACK%). Let’s evaluate the validity of our transformation.

9) Create a plot of the residuals computed in (7) versus the ANOSEBLACK% of each member of the dataset. Do any patterns emerge? What does it mean if there is a distinctive pattern?

Now that we have checked the assumptions of our model, let’s look at how well the model performs. One measure for this is the coefficient of determination, $R^2$. This is the proportion of variability in the data set that is accounted for by the statistical model, and it gives us insight into how well future outcomes are likely to be predicted by the model. We compute $R^2$ using equation (2) below.

$$R^2 = 1 - \frac{SSE}{SST}, \quad \text{where } SSE = \sum_i (y_i - \hat{y}_i)^2 \text{ and } SST = \sum_i (y_i - \bar{y})^2 \qquad (2)$$

SSE is the sum of squared errors, a measure of the unexplained variance or variability not captured by the model. SST, the total sum of squares, is a measure of the overall variability in the sample.

10) Compute $R^2$ for the model developed for your sample. Based on this value, how well does our model perform?

Before making any inferences or predictions on the mean values of the response variable, we must determine if the parameter associated with the predictor ANOSEBLACK% is significant. That is, we desire to test the null hypothesis that $\beta_1 = 0$ versus the alternative hypothesis $\beta_1 \neq 0$. The test statistic (t) for this hypothesis is

$$t = \frac{\hat{\beta}_1 - \beta_1}{s_{\hat{\beta}_1}} \qquad (3)$$

where $s_{\hat{\beta}_1}$ is the standard error of the estimated slope. The test statistic has a t-distribution with n-2 degrees of freedom when the null hypothesis is true.

11) Compute the test statistic for the null hypothesis $H_0$: $\beta_1 = 0$. Do we reject or fail to reject the null hypothesis? Why or why not?
12) Interpret the results of the hypothesis test in the context of the study.

13) Now perform a similar analysis without the Arcsin transformation. Describe any differences you see from the previous analysis.
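
For instructors who prefer a scripting language to Excel or Minitab, the calculations in Questions 4 through 11 can be sketched in Python as follows. This is a minimal sketch, not the lab's prescribed method. The function name fit_simple_regression and the data values are hypothetical, invented only for illustration and not taken from the game. NOSEBLACK% is assumed to be recorded as a proportion between 0 and 1 (divide by 100 first if your data are percentages).

```python
import math


def fit_simple_regression(x, y):
    """Least-squares fit of y = b0 + b1 * x (hypothetical helper for this lab)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = sxy / sxx            # estimated slope
    b0 = y_bar - b1 * x_bar   # estimated intercept

    fitted = [b0 + b1 * xi for xi in x]                 # Question 6
    residuals = [yi - fi for yi, fi in zip(y, fitted)]  # Question 7

    sse = sum(e ** 2 for e in residuals)        # unexplained variability
    sst = sum((yi - y_bar) ** 2 for yi in y)    # total variability
    r2 = 1 - sse / sst                          # equation (2)

    # Standard error of the slope, using n - 2 degrees of freedom.
    se_b1 = math.sqrt(sse / (n - 2) / sxx)
    # Equation (3) with beta_1 = 0 under the null hypothesis.
    # Note: a perfect fit (sse == 0) would make this division fail.
    t_stat = b1 / se_b1

    return {
        "b0": b0, "b1": b1, "fitted": fitted, "residuals": residuals,
        "sse": sse, "sst": sst, "r2": r2, "se_b1": se_b1,
        "t": t_stat, "df": n - 2,
    }


# Hypothetical sample: nose blackness as a proportion, age in years.
noseblack = [0.10, 0.18, 0.25, 0.33, 0.41, 0.52, 0.60, 0.71]
age = [1.4, 2.1, 2.6, 3.9, 4.3, 5.8, 6.5, 8.2]

# Question 4: arcsine-transformed explanatory variable (ANOSEBLACK%).
anoseblack = [math.asin(p) for p in noseblack]

# Question 5: AGE is the response, ANOSEBLACK% the predictor.
result = fit_simple_regression(anoseblack, age)

print("intercept:", round(result["b0"], 3))
print("slope:", round(result["b1"], 3))
print("R^2:", round(result["r2"], 3))
print("t statistic:", round(result["t"], 3), "with df =", result["df"])
```

To finish Question 11, compare the t statistic with a critical value from a t-distribution with n-2 degrees of freedom, taken from a table or from statistical software. For Question 13, the same function can be called with noseblack in place of anoseblack. The residuals list can be passed to any plotting tool for Questions 8 and 9.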