

TigerSTAT Instructor Guide

Laboratory Exercise: Using Simple Linear Regression to Estimate a Tiger’s Age

Quick Info

Level: Introductory/Intermediate Undergraduate Statistics

Brief Description: Students investigate the association of tiger age with particular tiger characteristics by conducting a simple regression analysis. Instructors also have the option of asking students to read a scientific publication discussing current methods in estimating ages of tigers.

Topics Covered: Simple Linear Regression, prediction, sampling bias, transformations, model assumptions and adequacy.

Software Required: Data analysis software such as Minitab, R, Stata, or Excel for descriptive statistics and regression analysis. Students will also need computer access to play the TigerSTAT game on the web (the game can be played inside or outside of the regularly scheduled class time).

Prerequisites: Prior to this lab, students should know how to analyze data using summary statistics, graphical methods and hypothesis testing. In this lab students create a simple linear regression model, evaluate a residual plot, and conduct a hypothesis test for the slope of the regression line; this material can be learned as part of the lab or prior.

Time: 1 to 3 hours in class + 3 to 5 hours of homework

Instructor Resources: Student Lab, Instructor Guide, TigerSTAT Game Website (web.grinnell.edu/individuals/kuipers/stat2labs/tigerstat.html)

Optional research article: Sustainable trophy hunting of African lions (Whitman et al., 2004) http://www.cbs.umn.edu/sites/default/files/public/downloads/Sustainable_trophy_hunting_of_African_lions.pdf

(if this link does not work, go to the lion research page http://www.cbs.umn.edu/lionresearch and the article is then under the current project section "Trophy hunting" link).

Why use this lab in your course?

This lab uses the online TigerSTAT game to estimate the age of a Siberian tiger. In this game, students act as researchers on a national preserve, where they are expected to walk through an animal reserve, tranquilize tigers, collect data, analyze their data (using simple linear regression), and draw appropriate conclusions. They are exposed to messy data and issues associated with data collection through the TigerSTAT game.

This lab provides an engaging way to practice simple linear regression applied to a real problem. The realism of the lab can be increased if students also read and discuss the research article on current methods of estimating age in lions through the use of proxy variables. One goal of this lab is to encourage students to consider the implications of more complicated research design topics like sampling and bias. The usefulness of a model, rather than simply its statistical significance, is also addressed in a very practical way that students understand since they “own the data”. The lab naturally offers multiple opportunities to highlight subtleties not often addressed in traditional textbook problems. Examples include sampling bias, the cost of data collection, and consideration of how a model is used rather than simply its statistical significance.

What type of course is this TigerSTAT lab designed for?

This lab is designed for any course that introduces the simple linear regression model. However, the game can easily be extended to more complex models appropriate for advanced courses.

When should you use this lab in your course and what are the prerequisites?

Although linear regression is the primary topic of this lab, the game can be used to motivate many topics such as descriptive statistics and visualizing data. In this sense, the game may be visited several times during a course.

For this simple linear regression lab, students should have prior understanding of hypothesis testing. In particular, they are expected to understand the concepts of null and alternative hypotheses, test statistics and p-values.

How should you conduct the lab? How much time should you expect to allocate?

Optional Assignment Prior to 1st Day of Class

Optional background research article lab: Ask students to read the entire article Sustainable Trophy Hunting of African Lions (Whitman et al., 2004) and answer the discussion questions.

Before collecting data to develop a model, it is often important to learn more about the issues, factors, and possibilities for models that have been used or proposed in the past. The research article lab has students read the 2004 Nature article by Whitman et al., “Sustainable trophy hunting of African lions”.

The article can be found at: http://www.cbs.umn.edu/sites/default/files/public/downloads/Sustainable_trophy_hunting_of_African_lions.pdf

(if this link does not work, go to the lion research page http://www.cbs.umn.edu/lionresearch and the article is then under the current project section "Trophy hunting" link).

Questions based on the reading are provided in the student handout on the TigerSTAT website.

Day 1 (last 15-25 minutes of class):

Read the Introduction section of the regression lab and discuss (5-10 Minutes) to motivate this particular application of statistical analysis using a simple linear regression model. An alternative approach is to assign reading prior to day one and have a discussion of the article.

The game and the following lab questions do not require students to read the article or look at the mathematical model used in the article.

Introduce the game (10-15 minutes). If students have computers available, the first day would be an excellent opportunity to go to the game’s webpage and play the tutorial. If no computers are available, the instructor should explain how the tutorial works and how to collect and retrieve the data. If students do not have computers available in class, the time spent on day one is shorter and the students are then asked to play the game between day 1 and day 2.

Homework Assignment:

Collect samples, if computers are not available in class or the instructor does not want to collect data during class time.

Assign questions 1-5 of the lab in preparation for the next class

Day 2:

Begin class with a 5 to 10 minute discussion of student responses to questions 1-5 (10 min).

Students (in groups of 2-3) should then work on questions 6-9 (focus discussion on assumptions of a linear model) (15 min). Instructors should be aware that 1) some student data may not support the model, 2) small data sets may be very erratic and may not accurately fit any model, and 3) it may be best to group data (make sure there are no duplicate observations in grouped data).

After a majority of student groups have worked on questions 10-13, discuss significance of association and model fit (15 min).

Day 3:

Start with a discussion of Question 13. Then proceed to complete questions 14-18. The final question will elicit very different responses from the students. Discussion should focus first on the issues of sample size, representative samples, and bias in the context of the experiment and then, more importantly, on the usefulness and appropriateness of the model. Finally, this is a great opportunity to critically review the journal article and have students discuss whether the model the authors propose is appropriate for their goals. (20 min)

Optional Day 3 Activity 1: work questions 19-23 transforming the data and comparing model results.

The transformation here seems more advanced than what is typical in an introductory course, but we have found that it is not hard to apply and analyze, and students can handle this more advanced topic easily in this lab setting. For some data sets, the transformation will mean very little in terms of the fitted relationship with NoseBlackProportion and the model $R^2$ value. For others, the effect of the transformation will be readily apparent.

Helpful hints are included as comments in the student regression lab below.

What else is in this Instructor Guide?

In the next section we provide detailed comments on the student regression lab. We suggest questions you can ask to promote class discussion and point out common issues you may run into when using the lab. For more information and ideas on using TigerSTAT in your course go to: http://web.grinnell.edu/individuals/kuipers/stat2labs/tigerstat.html

TigerSTAT: Simple Linear Regression Model Lab

Introduction to TigerSTAT

The Bol'shaya Koshka (Russian for big cat) Reserve is a newly created animal reserve that was uniquely developed to help endangered species prosper. This 10,000 acre wild animal reservation was selected because an abundance of Siberian tigers have been found in the area. The diverse terrain of the reserve provides a wide variety of habitats for many different species of animals.

Since the tigers in this area are much more abundant than any other area in the world, they are starting to draw a significant number of researchers to the region. Your primary responsibility will be to help these researchers as they study the tigers and then incorporate the results of their research into a system to identify the best management practices for this reserve.

An important component of monitoring endangered species is to understand the age distribution of the population. Shifts in the distribution could indicate potential issues in sustaining the population.

While the exact age is not known for most of the tigers in your reserve, the ages of some tigers are known. To estimate the age of a tiger that is captured on your reserve, you will need to compare characteristics of the captured tiger to the ones that live on the research zone (whose ages are known).

When data is collected as an indirect measure for the variable of interest, it is often called proxy data. For example, in their 2004 paper, Whitman et al. describe how the color of a lion’s nose can be used to estimate its age. Your mission is to go into the Bol'shaya Koshka reserve and gather sample data on tigers. Then, using your sample data, you are to establish a simple linear regression model to estimate the age of a tiger based on the available proxy variables.

Play the tutorial for the TigerSTAT game briefly so you are familiar with the game controls. The game is found at the web site http://statgames.tietronix.com/TigerStat/ where you enter a PlayerName and GroupName. (The “PlayerName” is a secret name: any combination of letters and numbers with no spaces. Do not use your name or a term that will identify you or your group. All group members should use the same “PlayerName”. The “GroupName” will be provided by your instructor.) You can choose either the Casual or Hard version, then select Continue and Load Tutorial. If you forget commands anytime during game play, you can hit the “p” key to pause the game and see game instructions.

Collect Tiger Data using TigerSTAT. Go to http://statgames.tietronix.com/TigerStat/ and enter your PlayerName and the GroupName provided by your instructor. Select Load Mission 1 and then DataSet1. Use the Full Screen option to see the entire game on your computer screen. You can choose either the Casual or Hard game option. You can type “p” to pause anytime while playing the game. This will allow you to review all the controls, exit the game and save your data.

TASK #1: Preliminary data analysis

For this task we will examine one model developed for lions and see how well it extends to our tigers. We will use the simple linear regression model:

$Y = \beta_0 + \beta_1 x + \epsilon$. (1)

In this case, Y is the age of the tiger and x the proxy variable. For Questions 1-4 you may need to first explore the data collected to determine what proxy variables to consider.

Since this is your first task, you will only be required to collect data from a minimum of five tigers. You have the option to collect more data. Recall that a larger sample size will improve the accuracy of your test results.

1. Calculate the mean and standard deviation of the potential proxy variables of the tigers in your sample.

2. Calculate the mean and standard deviation of the Age of the tigers in your sample.

3. Produce a graph of the Age against each potential proxy variable for your sample. Describe the relationships you observe. Would a linear model be appropriate for these variables?

4. Are there any reasons to suspect your data may be biased? If you could, how would you ensure these issues were addressed in collecting tiger data?

5. In the Whitman et al. (2004) article the authors used the proportion of nose blackness as their proxy to develop the model. Does this seem like the best choice for the tigers in your sample? What additional work would you want to do to choose the best proxy?
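The summary statistics asked for in Questions 1 and 2 can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not part of the original lab: the five observations are made-up values, and only the column names Age and NoseBlackProportion are taken from the lab text.

```python
from statistics import mean, stdev

# Hypothetical sample of five tigers; the values are invented for illustration
# and do not come from the TigerSTAT game.
sample = {
    "Age": [2.0, 3.0, 5.0, 6.0, 9.0],
    "NoseBlackProportion": [0.10, 0.20, 0.30, 0.40, 0.50],
}


def summarize(values):
    """Return the sample mean and the sample standard deviation (n - 1 divisor)."""
    return mean(values), stdev(values)


summary = {name: summarize(values) for name, values in sample.items()}

for name, (m, s) in summary.items():
    print(f"{name}: mean = {m:.3f}, sd = {s:.3f}")
```

With real game data, the same function would be applied to each potential proxy variable that the students exported.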

TASK #2: Preliminary model estimation

Use your software package to regress Age on NoseBlackProportion (Age is the response and NoseBlackProportion is the predictor) in order to estimate the parameters in equation (1). Report the estimated slope value to the instructor, then answer questions #6 - #9 in preparation for classroom discussion.

Before making any inferences or predictions on the mean values of the response variable, we generally first determine if there is a significant relationship between the predictor and response. If there is no relationship, the slope would be zero; hence we test the null hypothesis $H_0: \beta_1 = 0$ versus the alternative $H_a: \beta_1 \neq 0$. The test statistic for this hypothesis is

$t = \hat{\beta}_1 / s_{\hat{\beta}_1}$, (2)

where $\hat{\beta}_1$ is the estimated slope and $s_{\hat{\beta}_1}$ is its standard error. The test statistic has a t-distribution with n-2 degrees of freedom when the null hypothesis is true.
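The slope test described above can be sketched as follows. This is an illustrative calculation under stated assumptions, not the lab's prescribed method: the data are the same made-up five tigers, the function name `slope_t_statistic` is hypothetical, and the standard error is the usual least-squares formula sqrt(MSE / Sxx).

```python
from math import sqrt
from statistics import mean


def slope_t_statistic(x, y):
    """Least-squares fit of y on x; returns (b0, b1, se_b1, t, df)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # estimated slope
    b0 = y_bar - b1 * x_bar             # estimated intercept
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    se_b1 = sqrt(sse / (n - 2) / sxx)   # standard error of the slope
    return b0, b1, se_b1, b1 / se_b1, n - 2


# Hypothetical data (invented values): predictor and response.
nose_black = [0.10, 0.20, 0.30, 0.40, 0.50]
age = [2.0, 3.0, 5.0, 6.0, 9.0]

b0, b1, se_b1, t, df = slope_t_statistic(nose_black, age)
print(f"slope = {b1:.3f}, se = {se_b1:.3f}, t = {t:.3f}, df = {df}")
```

The resulting t value is then compared with a t-distribution with n-2 degrees of freedom. For df = 3, the two-sided 5% critical value from a t table is 3.182.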

6. Compute the test statistic for the null hypothesis $H_0: \beta_1 = 0$. Do we reject or fail to reject the null hypothesis?

7. Interpret the hypothesis test in the context of the study.

8. What is the interpretation of the estimated slope parameter (be specific and be sure the answer is in the context of the tiger age)?

9. Compare your answers in questions 6 through 8 to those of one or two other groups. What issues should you consider in using this model?

TASK #3: Model performance and assessment

Performance: A statistically significant relationship is important, but we must assess the model performance and fit before using it. One measure of performance commonly used is the coefficient of determination, $R^2$. This is the proportion of variability in the data set that is accounted for by the statistical model and gives us insight as to how well future outcomes are likely to be predicted by the model. We compute $R^2$ using equation (3) below.

$R^2 = 1 - \frac{SSE}{SST}$, where $SSE = \sum_i (y_i - \hat{y}_i)^2$ and $SST = \sum_i (y_i - \bar{y})^2$. (3)

SSE is the sum of squared errors, a measure of the unexplained variance or variability not captured by the model.

SST, the sum of squares total, is a measure of the overall sample variance.
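The coefficient of determination defined above, 1 - SSE/SST, can be computed directly from those two sums. The sketch below is one possible illustration: it uses invented data, and the helper names `fit_line` and `r_squared` are hypothetical, not part of any software named in the lab.

```python
from statistics import mean


def fit_line(x, y):
    """Return the least-squares intercept and slope (b0, b1) for y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def r_squared(x, y, b0, b1):
    """Compute R^2 = 1 - SSE/SST for the fitted line b0 + b1 * x."""
    y_bar = mean(y)
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                       # total
    return 1 - sse / sst


# Hypothetical data (invented values).
nose_black = [0.10, 0.20, 0.30, 0.40, 0.50]
age = [2.0, 3.0, 5.0, 6.0, 9.0]

b0, b1 = fit_line(nose_black, age)
r2 = r_squared(nose_black, age, b0, b1)
print(f"R^2 = {r2:.4f}")
```

A value near 1 on a sample this small should be read with caution, which is part of what the questions that follow ask students to discuss.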

10. Compute $R^2$ for the preliminary model developed in Task #2. Based on this value, how well do you expect your model will perform?

11. Based on your estimate of the model, what is the estimated average age of tigers with a NoseBlackProportion of 10%? 50%?

12. Compare your estimates from #11 to those of one or two other groups. Comment on the results in terms of the $R^2$ value of the model.

13. Comment on any strengths/weaknesses you see in your model. What do you think our model might be good for? Is the coefficient of determination found for your model the best way to determine the goodness of this model?

Assessing the model: Checking assumptions for any statistical model is imperative before making inferences.

For our simple regression model, we assume that the errors are randomly distributed, following a normal distribution, with mean zero and a constant variance for all values of the predictor. Let’s check the validity of these assumptions.

14. Using the parameter values estimated in Task #2, produce the model-based estimates of the age of the tigers in your sample, $\hat{y}_i$.

15. For each estimated value, compute the associated residual, the difference between actual and predicted: $e_i = (y_i - \hat{y}_i)$.

16. Create an appropriate plot you have learned about in class (histogram, qq-plot) for assessing the normality assumption for the set of residual values computed. Does the assumption of normality hold?

17. Plot the residuals against the NoseBlackProportion. Does the assumption that the errors are random appear reasonable? Mean of zero? Constant variance?

18. How appropriate does the model seem for the data in the sample? Would you recommend using it to determine the tiger ages in the preserve?
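The fitted values and residuals used in the questions above can be sketched as follows, again with invented data and standard-library Python only. The printed pairs are what a residual plot against NoseBlackProportion would display. Note that residuals from a least-squares line with an intercept always sum to zero, so the mean-zero check is about whether residuals center on zero across the range of the predictor, not about their overall average.

```python
from statistics import mean

# Hypothetical data (invented values).
nose_black = [0.10, 0.20, 0.30, 0.40, 0.50]
age = [2.0, 3.0, 5.0, 6.0, 9.0]

# Least-squares estimates of the intercept (b0) and slope (b1).
x_bar, y_bar = mean(nose_black), mean(age)
sxx = sum((xi - x_bar) ** 2 for xi in nose_black)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(nose_black, age))
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Model-based estimates and residuals e_i = y_i - y_hat_i.
fitted = [b0 + b1 * xi for xi in nose_black]
residuals = [yi - fi for yi, fi in zip(age, fitted)]

# Pairs (predictor, residual) for a residual plot; look for curvature or
# a spread that widens or narrows as the predictor increases.
for xi, ei in zip(nose_black, residuals):
    print(f"NoseBlackProportion = {xi:.2f}, residual = {ei:+.2f}")
```

The same list of residuals can be passed to any plotting tool to draw the histogram or qq-plot used for the normality check.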

TASK #4: Model revision (optional)

It takes a careful reading of the Whitman et al. (2004) article to see what model the authors actually used. It is found in the caption for Table 1. Reread this caption to confirm that they modeled the age of lions by first computing the arcsin of the percentage of nose blackness (NoseBlackProportion), giving the model:

$AGE = \beta_0 + \beta_1 \arcsin(NoseBlackProportion)$ (4)

The use of the arcsin is what is known as a “transformation”. In statistical modeling, when the assumptions of the model do not hold for a data set, this is often a means of solving the problem. The choice of transformation is a more advanced topic. In fact, we could choose to transform the response, age, instead of the predictor. Our interest at this point is not to become experts in transforming data. The choice of the arcsin is actually not uncommon in certain fields when the predictor variable is a percentage or proportion. Our interest is whether the choice, which was used for the lion data, appears reasonable in modeling tiger ages.
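The transformation described above can be tried in a few lines. This sketch assumes the proportions are recorded on a 0 to 1 scale (values recorded as percentages would need to be divided by 100 first), uses invented data, and compares the fit with and without the arcsin transformation through $R^2$. The helper name `r_squared_of_fit` is hypothetical.

```python
from math import asin
from statistics import mean


def r_squared_of_fit(x, y):
    """Fit y = b0 + b1 * x by least squares and return R^2 = 1 - SSE/SST."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return 1 - sse / sst


# Hypothetical data (invented values); proportions are on a 0-1 scale.
nose_black = [0.10, 0.20, 0.30, 0.40, 0.50]
age = [2.0, 3.0, 5.0, 6.0, 9.0]

# Transformed predictor; asin returns radians and requires inputs in [-1, 1].
a_nose_black = [asin(p) for p in nose_black]

r2_raw = r_squared_of_fit(nose_black, age)
r2_arcsin = r_squared_of_fit(a_nose_black, age)
print(f"R^2 without transformation: {r2_raw:.4f}")
print(f"R^2 with arcsin transformation: {r2_arcsin:.4f}")
```

Whether the transformation helps depends on the sample. With small samples the two values can be very close, so the residual checks from Task #3 should be repeated before preferring one model.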

19. Create a new variable ANoseBlackProportion by computing the arcsin of NoseBlackProportion for tigers in your sample (note that most software packages have the arcsin function available; if not, one can first compute this in Excel). Graph AGE as a function of this new variable. Do you think the assumed linear relationship is reasonable? Why or why not?

20. Use your software package to regress AGE on ANoseBlackProportion in order to estimate the parameters in equation (4). Then repeat the key steps used in the model without the transformed data (i.e. answer questions #6-12 for the model with the transformation). Did the transformation improve the model? Do you believe the model using the transformed variable is reasonable for use with tiger age data?

21. To use the transformed data, what is the interpretation of the slope coefficient? How do you then use the model to make age predictions/estimates? Use this new model to produce the estimates in question #14.

22. How do your estimates of the tigers’ ages compare to those from other groups? What is your advice to the research team about the use of the model/data in predicting ages of tigers?

23. What can you do to improve the model?
