UNC-Wilmington Department of Economics and Finance ECN 377 Dr. Chris Dumas

Homework 5 (Due Tuesday, Sept. 8th)

Example of a SAS Program without a Data Step

The SAS program for this homework will not include a Data Step, because we don’t need to create any new variables or modify our variables. We’ll just be working with the variables as they are in dataset FewerVariables.xls.

Import Dataset “FewerVariables.xls” and Run Proc Contents

Write a SAS program that uses Proc Import to import the “FewerVariables.xls” dataset that you created and saved in your ECN377 folder in Homework 4. Use Proc Import to name the dataset “dataset01”. Run Proc Contents on dataset01. Based on the results of Proc Contents in the output window, answer these questions: How many rows of data are in dataset01? How many variables are in dataset01? Which of these variables are character/text variables?

Using Proc Sort and Proc Print

1. Continuing to add commands to your SAS program, use Proc Sort to sort dataset01 by CntyName.
2. Use Proc Print with dataset01 to print the data for the following variables (only these variables) to the output window in SAS: CntyName, PopCens, ManfJobs, and UnempRate. (Reminder: Don’t put commas between the variable names.)

   Note: If you run your SAS program multiple times, which is fine, SAS adds the results of each new run to the bottom of the output window, below the results from any prior runs. So, the newest results will always be at the bottom of the output window. If you want, you can clear the output window before each run to remove the results from prior runs. The same is true for the log window: SAS adds log info from new runs beneath any log info from old runs, so the newest log info is always at the bottom of the log window.

3. Look at the output from Proc Print in the output window (not the output from Proc Contents, but the output from Proc Print, which will be at the bottom of the output window).
   Answer these questions: Which NC county name is first in alphabetical order? Which is last? What was the population in each of these two counties in 2000? What was the unemployment rate in each of these two counties in 2000?
4. Use a new Proc Sort command to sort dataset01 again, this time by variable PopCens.
5. Use a new Proc Print command to print the data again for variables CntyName and PopCens (only) to the output window in SAS. Look at the output (closest to the bottom of the output window) to answer these questions: Which NC county had the largest population in 2000? The smallest? Which had population (PopCens) equal to 160,307?
6. Use another Proc Sort command to sort dataset01 again, this time by variable ManfJobs.
7. Use another Proc Print command to print the data again for variables CntyName and ManfJobs (only) to the output window in SAS. Which NC county had the largest value of ManfJobs in year 2000? The smallest? What was the value of ManfJobs for NewHanover county?
8. Use another Proc Sort command to sort dataset01 again by variable GeoRegion and then by variable ManfJobs within each region, like this:

      proc sort data=dataset01;
         by georegion manfjobs;
      run;

9. Use another Proc Print command to print the data from dataset01 for variables CntyName, GeoRegion and ManfJobs (only) to the output window in SAS. Which NC county in the "coast" region had the largest manufacturing employment per 1000 population in year 2000, and what was the employment? Which NC county in the "mountain" region had the largest manufacturing employment per 1000 population in year 2000, and what was the employment?

Using Proc Means

1. Use a Proc Means command with dataset01 to calculate the following statistics: N, MAX, MIN, MEAN, MEDIAN, CV, showing 2 decimal places to the right of the decimal point, for the following variables: PopCens, ManfJobs, ConstJobs, ServJobs, FarmJobs, UnempRate.
   Proc Means automatically sends the output to the output window; you don’t need to use a separate Proc Print command.
2. Based on the output from Proc Means, what was the smallest unemployment rate in year 2000 in a North Carolina county? What was the largest? What was the mean county unemployment rate? What about the median? Is the mean different from the median? If so, what is this telling you?
3. What is the coefficient of variation (CV) of the unemployment rate across NC counties? What is this telling you?
4. Compare the CVs of ManfJobs, ConstJobs, ServJobs and FarmJobs. What does the comparison of CVs tell you?
5. Why can’t we use Proc Means to analyze variables CntyName, GeoRegion and UnempCat?
6. Use another Proc Means command to calculate the MAX, MEAN and CV of ManfJobs by GeoRegion. Before you use Proc Means to do this, you must use Proc Sort again to sort your data by GeoRegion. Looking at the output of this second Proc Means command in the output window (scroll down to the bottom of the output window; remember, the newest results are at the bottom), which GeoRegion had the county with the largest value of ManfJobs, and what was the value? The counties of which GeoRegion had the largest mean value of ManfJobs? Which GeoRegion had the greatest variation of ManfJobs across the counties in the region?

Using Proc Gchart

We can’t use Proc Means to describe character/categorical/text variables like GeoRegion and UnempCat, but we can use Proc Gchart to describe these variables by creating frequency distributions and percentage distributions:

1. Use a Proc Gchart command with dataset01 to make a vertical frequency chart for variable GeoRegion and a vertical percentage chart for variable UnempCat. You need two “vbar” statements in the command, one for each chart. What number (frequency) of counties is in the "mountain" region of North Carolina? What percentage of counties is in the “high” category of UnempCat?
2. Use another Proc Gchart command with dataset01 to make a horizontal frequency distribution of the UnempCat variable. By default, SAS will print the categories in alphabetical order along the chart axis, but it makes more sense to order the categories from 'Low' to 'Med' to 'High'. Recall that you can control the order of the categories using the "midpoints" option, as shown below. Put the category names in single quotes:

      proc gchart data=dataset01;
         hbar UnempCat / midpoints = 'Low' 'Med' 'High';
      run;

   How many counties (frequency) are in the 'Low' category of UnempCat? What percentage of counties is in the 'High' category? What does the cumulative percent value for the 'Med' category of UnempCat tell us?

Save Your Program and Write up Your Homework

After you run your SAS program and verify that it is working correctly, save the SAS program as HW05.sas. When this homework asks you to answer specific questions about the results, you need to answer in complete sentences, in addition to giving the appropriate numbers. Don’t forget to include the answers to the multiple choice questions below. Also, include a printout of your SAS program “HW05.sas”. Finally, be sure to put your name, ECN377, your section, and “Homework 5” at the top of your homework.

Multiple Choice and Matching Section

1) Suppose you want to summarize the values of a numerical measurement variable. Which descriptive statistics should be used?
   a) frequency distributions, mean, standard deviation
   b) histograms, mean, standard deviation
   c) mean, mode, t-test
   d) mean, median, frequency distributions

2) Suppose you want to summarize the values of a nominal character/text variable. Which descriptive statistics should be used?
   a) skew, mean, histograms
   b) mean, median, frequency distributions
   c) frequency distributions, mode
   d) histograms, mean, mode

3) The procedure used to calculate descriptive statistics for numerical measurement variables in SAS is:
   a) proc contents
   b) proc sort
   c) proc means
   d) proc ttest

4) Suppose you have a dataset with variables State, Year, Revenues and Expenses, and suppose you want to calculate mean Revenues and mean Expenses by State. One must first use ________ to sort the data before using _________ to calculate the means by State.
   a) Proc Print, Proc Sort
   b) Proc Sort, Proc Means
   c) Proc Means, Proc Print
   d) Proc Print, Proc Sort

5) Suppose you are trying to describe the income of a typical household in a small, poor country that has many poor households and a few very rich households. Which measure of central tendency would be better to use?
   a) range
   b) mean
   c) median
   d) coefficient of variation

6) Suppose you are trying to describe the variation in stock price data, and you need to describe the variation using different measurement units (different currencies) for different clients. Which measure of variation would produce results that are comparable across the different measurement units?
   a) variance
   b) standard deviation
   c) coefficient of variation
   d) sum of deviations

7) In SAS, Proc ______ is used to create frequency distribution graphs.
   a) sort
   b) print
   c) gchart
   d) graph

8) Suppose you have descriptive statistics for the variables in a client’s dataset (but you don’t have the actual data), and you are trying to determine which of the patterns below best describes the values of one of the variables in the dataset. Which descriptive statistic would help you?
   a) mean
   b) standard deviation
   c) skewness
   d) kurtosis

9) Suppose you have descriptive statistics for the variables in a client’s dataset (but you don’t have the actual data), and you are trying to determine which of the patterns below best describes the values of one of the variables in the dataset. Which descriptive statistic would help you?
   a) mean
   b) standard deviation
   c) skewness
   d) kurtosis

10) When should a histogram be used to describe the distribution of the values of a variable instead of a frequency distribution?
   a) when the variable is a text/character variable
   b) when the variable is an ordinal numerical variable
   c) when the variable is a measurement numerical variable
   d) when the variable gives the names of the key epochs in European history

11) Suppose you see the following commands in a colleague’s SAS program. What do these commands produce?

      proc gchart data=dataset02;
         vbar Revenue / levels=7;
      run;

   a) a frequency distribution with 7 categories for variable “levels”
   b) a histogram with 7 bars for variable Revenue
   c) descriptive statistics for the 7 sub-types of Revenue in the dataset
   d) a printout summary of the Revenue data in a table with 7 columns

12) Suppose you want to use Proc Print to print the data from dataset02 for only variables A, B and C to the output window of SAS, and you want to print only the rows of data for which B = 4. Which Proc Print command below would accomplish this?
   a) b) proc print data=dataset01; where A B C; var B=4; run; d) c) proc print data=dataset02; var A B C; where B=4; run; e) proc print data=dataset02; where A, B=2, C; run; g) f) proc print data=dataset02; where A B C; var B=4; run; h) proc print data=dataset01; var A B C; var B=4; run; proc print data=dataset01; var A B C; where B=4; run; proc print data=dataset02; var A=all, B=2, C=all; run; i) proc print data=dataset02; where C=2; run; proc print data=dataset01; where A B C; where B=4; run;
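
Reference Example: Sorting Before Summarizing by Group

The Proc Sort, Proc Print and Proc Means steps in this homework all follow the same pattern, which can be sketched on a different dataset. The minimal sketch below assumes that the SASHELP.CLASS sample dataset that ships with SAS (with variables Name, Sex, Height and Weight) is available in your installation. The dataset name class_sorted is made up for illustration, and nothing here uses the homework data or answers the homework questions.

```sas
/* Sketch only: uses the SASHELP.CLASS sample data, not the homework dataset. */

/* BY-group processing requires the data to be sorted by the BY variable first. */
/* class_sorted is a hypothetical name for the sorted copy.                     */
proc sort data=sashelp.class out=class_sorted;
   by sex;
run;

/* Print only selected variables; no commas between the variable names. */
proc print data=class_sorted;
   var name sex height;
run;

/* Selected statistics, shown with 2 decimal places, computed separately */
/* for each value of sex.                                                */
proc means data=class_sorted n max min mean median cv maxdec=2;
   var height weight;
   by sex;
run;
```

The out= option writes the sorted rows to a new dataset in the WORK library, because the sample data in SASHELP is normally read-only. In the homework you can sort dataset01 directly, as in the Proc Sort command shown earlier.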