Homework 09


UNC-Wilmington ECN 377

Department of Economics and Finance Dr. Chris Dumas

Homework 9 (Due October 1st)

Scatterplots (X-Y Graphs) and Correlations

In this homework, we explore linear correlation analysis, and we practice producing scatterplots (X-Y graphs) and correlation analyses in SAS. For the purposes of this homework, let's agree that a value of r between 0 and 0.39 (whether positive or negative) indicates a weak correlation, between 0.40 and 0.69 indicates a moderate correlation, and between 0.70 and 1.00 indicates a strong correlation.

Multiple Choice

1) Suppose the correlation between high interest rates and high unemployment is r = 0.89, and it is statistically significant at α = 5%. Does this mean that high interest rates necessarily, directly cause high unemployment?
a) Yes, because high interest rates are always bad.
b) Yes, because high correlation means causation.
c) No, because the correlation is not significant at the α = 1% level.
d) No, because high correlation does not necessarily imply causation.

2) Suppose you conduct a correlation analysis for variables INFLATION RATE and UNEMPLOYMENT RATE and find no correlation. Does this mean that the inflation rate is unrelated to the unemployment rate?
a) yes, correlation analysis proved that they are unrelated
b) yes, because correlation does not imply causation
c) no, the correlation analysis indicates that there is no linear relationship between the two variables, but there may be a relationship of some other shape between the two variables
d) no, correlation analysis tests for non-linear relationships only, there might still be a linear relationship between the two variables
e) both c and d are true

3) Which procedure does one use to conduct correlation analysis in SAS?
a) proc ols
b) proc reg
c) proc corr
d) proc tabulate

4) Your firm produces three versions of a sports car: white, black and red. Your new employee submits his weekly report to you, in which he says that he ran a correlation analysis to determine whether a consumer's gender affects the color of sports car purchased. He finds that the correlation coefficient is 0.78 and a t-test of the correlation coefficient is statistically significant. You know that either: (i) he is lying to you about what he did, or (ii) he screwed up his analysis in a big way, because:
a) the correlation coefficient would need to be greater than 0.85 for the results to be statistically significant
b) correlation analysis doesn't apply to categorical variables, such as color
c) he should have conducted an F-test instead of a t-test
d) he should have calculated R² instead

5) Suppose you have data on GDP, TIME and INFLATION RATE, and you would like to make a graph showing data points with TIME as the X variable and GDP as the Y variable. Which procedure would you use in SAS?
a) proc chart
b) proc gplot
c) proc corr
d) proc graph

6) Correlation Analysis can be used to identify ________ relationships, whereas Regression Analysis can be used to identify ________ relationships.
a) only nonlinear /// both linear and nonlinear
b) only linear /// both linear and nonlinear
c) both linear and nonlinear /// only linear
d) both linear and nonlinear /// only nonlinear


7) You work for a company that does business in three regions, Region 1, Region 2 and Region 3. You work in Region 1. Your boss says that the correlation between sales and region is positive, with a statistically significant r = 0.80. Therefore, your boss concludes that your region, Region 1, has the worst sales, and you should be fired. You object by saying:
a) the correlation coefficient would need to be greater than 0.85 for the results to be statistically significant
b) correlation analysis doesn't apply to categorical variables, such as Region
c) the boss should have conducted an F-test instead of a t-test
d) the boss should have calculated R² instead

8) Your intern shows you the following graph and says that there is no correlation between X and Y, because the slope is negative. You reply:
a) You’re right! How smart! You get a raise!
b) No, there is a correlation, it’s just that the correlation is zero.
c) No, there is a correlation, it’s just that the correlation is negative.
d) No, there is a correlation, it’s just that the correlation is nonlinear.

9) Suppose that you calculate the correlation coefficient, r = 0.23, between variables X and Y, based on a sample size of 18. Now you want to test the hypothesis that r > 0. You will need to calculate t_test and find t_critical from the t-table using a significance level of α = 0.05.
a) this is a one-sided test, and you should reject H0: r = 0 and accept H1: r > 0.
b) this is a one-sided test, and you should accept H0: r = 0 and reject H1: r > 0.
c) this is a two-sided test, and you should reject H0: r = 0 and accept H1: r > 0.
d) this is a two-sided test, and you should accept H0: r = 0 and reject H1: r > 0.

10) Suppose the correlation between corporate tax rates (X) and government tax collections (Y) is positive, based on the data that we have, which is for X between 0.15 and 0.45, and for Y between $2 trillion and $3.75 trillion. Suppose an economist claims that if we reduce X below 0.15, then Y will rise (a negative correlation). Can we say that the economist’s claim is wrong based on the positive correlation?
a) yes, of course he’s wrong, the correlation is positive
b) no, he’s not wrong, a reduction in X that leads to a rise in Y is simply another type of positive correlation
c) no, he’s not wrong, because the positive relationship could be nonlinear
d) we don’t know whether he’s wrong, because he’s making claims about what might happen outside the relevant range

11) Your firm has a data set consisting of 10 variables, each of which is a numerical measure of some aspect of quality in your production process. You want to determine whether there are any statistically significant linear relationships among all possible pairs of the 10 variables. Which type of analysis do you need to conduct?
a) chi-squared test for differences between variances
b) correlation matrix
c) dependent (paired) samples t-test
d) contingency table / cross-tabulation

12) What do the following commands in SAS accomplish?
    proc gplot data=dataset01;
        plot Z*X Q*X / overlay;
    run;
a) two separate graphs, one with a plot of Z vs. X, and a separate one with a plot of Q vs. X
b) one combined graph, with a plot of Z vs. X overlaid with a plot of Q vs. X
c) two separate graphs, with correlation analysis results overlaid on the graphs
d) two separate graphs, arranged horizontally (side-by-side), instead of vertically (one below the other)


Consider the following Correlation Matrix of correlation results produced by SAS. The top number in each cell is the correlation coefficient (r), and the bottom number is the p-value for the hypothesis test H0: ρ = 0, H1: ρ ≠ 0.

Pearson Correlation Coefficients, N = 100
Prob > |r| under H0: Rho=0

               LandArea     PopCens   Age65Index   UrbanIndex

LandArea        1.00000     0.26397     -0.24595      0.10178
                             0.0080       0.0136       0.3136

PopCens         0.26397     1.00000     -0.47553      0.68849
                 0.0080                   <.0001       <.0001

Age65Index     -0.24595    -0.47553      1.00000     -0.49199
                 0.0136      <.0001                    <.0001

UrbanIndex      0.10178     0.68849     -0.49199      1.00000
                 0.3136      <.0001       <.0001

13) In the table above, which variables have a positive, linear correlation with LandArea?
a) PopCens only
b) Age65Index only
c) UrbanIndex only
d) PopCens and Age65Index
e) PopCens and UrbanIndex
f) Age65Index and UrbanIndex
g) PopCens and Age65Index and UrbanIndex

14) What does it mean for two variables to have a positive, linear correlation? (CHOOSE ONE BEST ANSWER)
a) if one of the variables increases, then the other increases
b) if one of the variables increases, then the other decreases
c) if one of the variables decreases, then the other increases
d) if one of the variables decreases, then the other decreases
e) a and b
f) c and d
g) a and d
h) b and c

15) In the table above, which pairs of variables have a negative, linear correlation with each other? (CIRCLE ALL THAT APPLY)
a) LandArea and PopCens
b) LandArea and Age65Index
c) LandArea and UrbanIndex
d) PopCens and Age65Index
e) PopCens and UrbanIndex
f) Age65Index and UrbanIndex

16) What does it mean for two variables to have a negative, linear correlation? (CHOOSE ONE BEST ANSWER)
a) if one of the variables increases, then the other increases
b) if one of the variables increases, then the other decreases
c) if one of the variables decreases, then the other increases
d) if one of the variables decreases, then the other decreases
e) a and b
f) c and d
g) a and d
h) b and c

17) In the table above, the linear correlations between which pairs of variables are statistically significant at the α = 0.05 significance level? (CIRCLE ALL THAT APPLY)
a) LandArea and PopCens
b) LandArea and Age65Index
c) LandArea and UrbanIndex
d) PopCens and Age65Index
e) PopCens and UrbanIndex
f) Age65Index and UrbanIndex

18) In the table above, the linear correlations between which pairs of variables are not statistically significant at the α = 0.05 significance level? (CIRCLE ALL THAT APPLY)
a) LandArea and PopCens
b) LandArea and Age65Index
c) LandArea and UrbanIndex
d) PopCens and Age65Index
e) PopCens and UrbanIndex
f) Age65Index and UrbanIndex


Problem Section

19) Suppose that the port of Wilmington hires your consulting firm to investigate the relationship between the price of bunker fuel (the fuel used by many large cargo ships) and the annual tonnage of freight moving through the port. The port is worried that changes in the price of bunker fuel could affect tonnage, which in turn affects the fees collected by the port on each ton. The port gives you a sample of 30 years of data on the average price of bunker fuel each year (X) and the tonnage moving through the port each year (Y). The sample standard deviation of X is s_X = 80.39, the sample standard deviation of Y is s_Y = 494,874.2, and the sample covariance between X and Y is s_XY = -21,629,737. Calculate the Pearson Correlation Coefficient (r) between X and Y. Conduct a test of the hypothesis that there is no linear relationship between X and Y (that is, H0: ρ = 0 vs. H1: ρ ≠ 0). Do the data indicate that there is a statistically significant linear relationship between X and Y? If so, is the relationship positive or negative? Weak, moderate, or strong? SHOW YOUR WORK.

20) Suppose that you are the cheerful business analyst who works for Kiss-Your-Cheek, a toilet paper manufacturer. Your boss is interested in the possible relationship between household income (X) and toilet paper rolls purchased by a household per year (Y). You do a random sample survey of 50 households that purchased your toilet paper last year. The sample standard deviation of X is s_X = 7642.2, the sample standard deviation of Y is s_Y = 19.4, and the sample covariance between X and Y is s_XY = 3297.1. Calculate the Pearson Correlation Coefficient (r) between X and Y. Conduct a test of the hypothesis that there is no linear relationship between X and Y (that is, H0: ρ = 0 vs. H1: ρ ≠ 0). Do the data indicate that there is a statistically significant linear relationship between X and Y? If so, is the relationship positive or negative? Weak, moderate, or strong? SHOW YOUR WORK.
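As a reminder for Problems 19 and 20, the calculation can be sketched with the standard textbook formulas below. This assumes the course uses the usual Pearson formulas; n denotes the sample size (a symbol not used in the problem statements), and the t-statistic has n - 2 degrees of freedom.

```latex
r = \frac{s_{XY}}{s_X \, s_Y},
\qquad
t_{test} = \frac{r \sqrt{n-2}}{\sqrt{1-r^{2}}},
\qquad
df = n - 2
```

For the two-sided test H0: ρ = 0 vs. H1: ρ ≠ 0 at significance level α, reject H0 when |t_test| exceeds the t_critical value found in the t-table with α/2 in each tail.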

SAS--Scatterplots and Correlation Analysis

In this section of the homework, we’ll explore correlations between an index of air pollution in North Carolina counties in the year 2000 and other variables that might be associated with air pollution. We’ll use dataset “ProcCorrData.xls”, which contains data on 9 variables for a random sample of 45 North Carolina counties (out of the total population of 100 North Carolina counties). The dataset is located on the Homework page of the course website. The variable names and definitions are presented below:

Variable Name    Variable Definition

CntyName         Name of county in North Carolina
PopCens          Population in county in year 2000
LandArea         Land area (square miles) in county in year 2000
PM10Area         Air pollution index (estimated emissions in tons of air pollution particles less than 10 micrometers in size) for county in year 2000
HousingUnits     Total of houses, apartments, mobile homes, etc., in county in year 2000
EmpManf2000      Manufacturing employment in county in year 2000
VehRegs          Number of cars and trucks registered (owned and located) in county in year 2000
PavedMiles       Number of miles of paved roads in county in year 2000
MeanFamInc       Average (mean) household income in county in year 2000

Create a SAS Program to Bring the Data into SAS

Download the data file ProcCorrData.xls to the ECN377 folder on your V: drive.

In SAS, write a program that begins with the usual comment lines and option commands. Then, use Proc Import to bring the ProcCorrData.xls data file into SAS, and name it dataset01.
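A minimal sketch of the import step might look like the following. It assumes the file was saved as V:\ECN377\ProcCorrData.xls and that your SAS installation can read .xls files with DBMS=XLS; the header comment and the option settings are placeholders for whatever the course template specifies.

```sas
* HW09.sas: ECN 377 Homework 9 (placeholder header comment);
options linesize=80 nodate;   * assumed option settings, adjust to the course template;

* Import the Excel file and name the resulting SAS dataset dataset01;
proc import datafile="V:\ECN377\ProcCorrData.xls"
            out=dataset01
            dbms=xls
            replace;
    getnames=yes;   * assumes the first spreadsheet row holds the variable names;
run;
```

After running it, check the Log window to confirm that dataset01 was created with 45 observations and 9 variables.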


Use Proc Gplot to Create Individual Scatterplots (X-Y Graphs)

Suppose someone claims that air pollution in a county is directly related to population in the county. In SAS, use Proc Gplot to make a scatterplot of the air pollution index (PM10Area) vs. county population (PopCens). Based on the graph, does there appear to be a linear relationship between these variables? If so, is the relationship positive or negative? Does the relationship appear to be weak, moderate, or strong?
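One way to sketch this scatterplot is shown below, assuming the data were imported as dataset01 and that the pollution index belongs on the vertical axis. The symbol1 statement is an optional styling choice, not something the assignment requires.

```sas
* Scatterplot with PM10Area on the vertical axis and PopCens on the horizontal axis;
symbol1 value=dot interpol=none;   * plot points only, with no connecting line;
proc gplot data=dataset01;
    plot PM10Area*PopCens;
run;
quit;
```

In the plot statement the variable before the asterisk goes on the Y axis and the variable after it goes on the X axis, so swapping PopCens for another variable name produces the other scatterplots in this section.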

Suppose someone else claims that air pollution is not related to population but is instead related to the number of motor vehicles registered in the county. In SAS, use Proc Gplot to make a scatterplot of the air pollution index (PM10Area) vs. vehicle registrations (VehRegs) in the county. Based on the graph, does there appear to be a linear relationship between these variables? If so, is the relationship positive or negative? Does the relationship appear to be weak, moderate, or strong?

Next, someone claims that air pollution is really related to the number of housing units (HousingUnits) in a county. Then someone else claims that, no, it’s related to manufacturing activity (as measured by manufacturing employment in each county, EmpManf2000), and someone else says it’s miles of paved roads in the county (PavedMiles).

Then, someone claims that air pollution is worse in poor counties. Others say, no, it’s worse in rich counties . . .

Use Proc Corr to Investigate Linear Relationships between Pairs of Variables

Okay, you say to yourself, enough playing around with investigating one possible variable at a time. What I need is an overview of all the variables that might be affecting air pollution. Use Proc Corr to make a Scatterplot Matrix for variables PM10Area, PopCens, LandArea, HousingUnits, EmpManf2000, VehRegs, PavedMiles and MeanFamInc. (Be careful, you may not need the commas between the variable names in SAS.) (Remember: SAS puts the Scatterplot Matrix in the Results Window of SAS (not the Output Window), in subfolder “Corr: The SAS System”, filename “MatrixPlot.”) Based on the scatterplot matrix, there appear to be relationships between which pairs of variables? Do any of these relationships appear to be linear? If so, is the relationship positive or negative? Does the relationship appear to be weak, moderate, or strong?
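A possible sketch of the scatterplot matrix request follows, assuming the dataset is named dataset01 and that ODS Graphics is available in your SAS session. The nvar=all setting is an addition here: by default the matrix plot shows only the first five variables in the var list, and this homework uses eight.

```sas
ods graphics on;   * the matrix plot is produced through ODS Graphics;
proc corr data=dataset01 plots=matrix(nvar=all);
    * variable names are separated by spaces, not commas;
    var PM10Area PopCens LandArea HousingUnits EmpManf2000 VehRegs PavedMiles MeanFamInc;
run;
ods graphics off;
```

Each small panel in the matrix is the scatterplot for one pair of variables, so the row or column for PM10Area gives the overview of air pollution against everything else.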

Looking at graphs is a useful way to begin exploring the data, but now it’s time to formally test some hypotheses about whether the data provide enough evidence to conclude that some (linear) relationships exist between air pollution and the other variables.

Use Proc Corr to produce a Correlation Matrix for variables PM10Area, PopCens, LandArea, HousingUnits, EmpManf2000, VehRegs, PavedMiles and MeanFamInc. (Be careful, you may not need the commas between the variable names in SAS.)
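The correlation matrix itself can be sketched as below, again assuming the dataset is named dataset01. The pearson and nosimple options are optional choices made for this illustration: pearson is the default correlation type, and nosimple only suppresses the table of descriptive statistics.

```sas
* Pearson correlation matrix with p-values for H0: Rho=0;
proc corr data=dataset01 pearson nosimple;
    var PM10Area PopCens LandArea HousingUnits EmpManf2000 VehRegs PavedMiles MeanFamInc;
run;
```

The output has the same layout as the correlation matrix shown before Question 13: the top number in each cell is r and the bottom number is the p-value.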

Based on the results in the correlation matrix, do there appear to be any statistically significant linear relationships between any pairs of variables in the matrix (that is, can we reject H0: ρ=0 for any pairs of variables in the matrix, using a significance level of α = 0.05, α/2 = 0.025)? If so, for which pairs of variables, and are the relationships positive or negative, and are the relationships weak, moderate, or strong?

After answering the questions above, summarize, in a sentence or two, what appears to be associated with air pollution in NC counties, and what is the relationship between air pollution and whether a county is rich or poor.

Complete Your Homework

Answer the multiple choice, short answer, and SAS-related questions on paper and turn in. When this homework asks you specific questions about the SAS results, you need to answer in complete sentences, in addition to giving any appropriate numbers. Save your SAS program as HW09.sas. Print out your SAS program (you can copy it from the Editor window of SAS, paste it into Word, and print it), and attach it to the end of your homework. Be sure to put your name, ECN377, your section, and “Homework 9” at the top of your homework.
