Introduction to SAS

What is a data set?
• A data set (or dataset) is a collection of data, usually presented in tabular form. Each column represents a particular variable. Each row corresponds to a given member of the data set.

There are three types of data sets:
• Cross-sectional
• Time-series
• Panel (a combination of cross-sectional and time-series data)

Cross-Sectional Data
• Cross-sectional data are data collected by observing many subjects (such as individuals, firms, or countries/regions) at the same point in time, or without regard to differences in time.

Member   Age   Wage   Years of schooling
John     40    100k   14
Paul     34    110k   17
Mary     28    75k    10
Tom      30    130k   16
Sara     37    50k    15

Time-Series Data
• A time series is a sequence of data points, typically measured at successive times spaced at uniform intervals.

Year   GDP xyz   Inflation Rate
2004   34        3.2
2005   30        2.5
2006   37        2.7
2007   38        3
2008   41        2.9
2009   43        3.4

• Frequencies: daily, weekly, monthly, quarterly, annual

Panel Data
• Panel data, also called longitudinal data or cross-sectional time-series data, are data in which multiple cases (people, firms, countries, etc.) are observed at two or more time periods.

Person   Year   Income   Age   Sex
1        2003   1500     27    1
1        2004   1700     28    1
1        2005   2000     29    1
2        2003   2100     41    2
2        2004   2100     42    2
2        2005   2200     43    2

What should you know about your dataset?
• What type of dataset do you have?
• How many variables do you have?
• How many observations do you have?
• What kind of variables do you have?
  – Numeric. A numeric variable is an observed response that is a numerical value.
  – String. A string variable is any combination of one or more characters.
• Are there missing values?

How to store your dataset?
• Microsoft Excel spreadsheets

Accessing SAS
Version 9.2 or 9.3
Click on ENGLISH 9.2 or 9.3

1. What does SAS look like?
[Screenshot of the SAS interface, with labels: Editor window, Log window, Output window, Results window, Explorer window, New Libraries, Execute the Program]

Anatomy of a SAS Program
(1) Data name statement
(2) Input statement (list of all variables to be read into the program)
(3) Transformation statements
(4) Datalines statement (copy & paste from Excel)
(5) Placement of data
(6) PROC statements
  – Means
  – Corr
  – Reg
  – Model
  – Autoreg
(7) Run statement

Examples

Spaghetti Sauce Program

/* Data set name */
data spaghettisauce;
/* Input statement */
input week qclassico pclassico qhunts phunts qnewmans pnewmans qprego pprego qprivl
      pprivl qragu pragu totalexp;
/* Placement of data after the datalines statement */
datalines;
1 16.557905 1.336341 15.409280 1.311940 17.952117 1.381772 15.187799 1.362120 15.651408 1.356528 29.478939 2.311685 80.401300 2.287727 117.049632 1.909622 276.759921 1.677577 937.542909 2 31.380376 2.305388 75.181905 2.299160 125.986697 1.847495 206.699207 1.777097 845.490186 3 31.762660 2.299778 69.281355 2.160420 123.057729 1.870962 218.231648 1.738040 846.008960 4 28.447741 2.341264 68.898908 2.321191 114.953810 1.932617 204.152369 1.752055 804.175192 5 27.772665 2.340832 77.208027 2.249415 113.247798 1.920066 180.526273 1.846330 782.554156 1.099910 41.363767 1.105020 41.584220 1.111880 34.458333 1.108804 33.825571 1.080379 35.508482 2.246208 1.812694 2.203934 1.796701 2.207156 1.809025 2.205486 1.754891 2.216497 1.798351 2.195138 1.692623 2.175770 1.704789 2.168626 1.772264 2.164687 1.671172 2.158183 1.680284 6 28.251703 2.362670 125.877846 1.899778 910.381585 7 26.947404 2.368843 120.413152 1.877365 864.910385 8 26.669631 2.375479 121.300549 1.823129 793.463874 9 29.190977 2.354548 126.792828 1.855721 898.975245 10 30.564590 2.301370 112.731447 1.930341 869.899250 11 29.502039 2.342312 122.730980 1.912570 894.705963 12 29.454762 2.383079 118.288762 1.892754 921.236157 13 28.853887 2.393748 133.727889 1.822013 869.240450 14 30.275710 2.361550 130.808890 1.849916 994.631492 15 34.241497 2.290308 137.464940 1.858437
1011.737278 96.507708 61.823708 1.060712 16.511740 1.133252 217.893723 84.284722 37.019864 1.048924 16.802342 1.353188 222.342183 81.810965 33.521622 1.071276 16.730153 1.359883 187.289016 97.958015 38.925676 1.045915 16.885963 1.343224 229.625749 96.337535 44.831781 1.073289 16.835041 1.315669 212.556985 76.135599 43.597670 1.104502 16.832199 1.272487 244.799346 69.803347 56.155822 1.118029 18.328200 1.161683 260.440575 72.185000 35.419832 1.108094 18.922787 1.334668 219.052937 110.997722 41.793621 0.970290 18.885386 1.296186 286.263290 91.463049 42.349396 1.059148 18.770848 1.337508 287.937805 2.108769 1.787616 2.202753 1.756095 2.238778 1.587328 2.244342 1.604585 2.219749 1.616331 2.184630 1.815537 2.178318 1.779423 2.148674 1.798278 2.127138 1.835858 2.136197 1.820689 16 43.102764 2.246922 150.014841 1.806350 914.592550 17 35.687632 2.329571 124.371155 1.881480 920.101902 18 37.710794 2.327977 136.538891 1.873221 1067.052402 19 36.972091 2.265346 134.412773 1.827838 1015.942051 20 32.236119 2.393364 131.812201 1.822419 973.126818 21 31.584801 2.409353 137.357622 1.816392 856.927770 22 33.133108 2.406975 169.203190 1.780520 940.026516 23 36.753574 2.363383 131.769601 1.897437 924.339307 24 34.855495 2.399628 151.600412 1.848591 884.805764 25 39.940000 2.369996 131.142332 1.913279 868.475691 80.876807 42.336055 1.067257 18.982216 1.277091 204.939009 80.627606 65.104947 1.084763 16.899180 1.132962 230.349266 131.616811 53.952967 0.983176 16.947751 1.180342 310.249254 90.488003 41.157579 1.049932 16.030220 1.263050 313.810158 92.735918 39.314523 1.032483 17.484182 1.331265 290.076840 77.131493 40.795877 1.078318 17.947063 1.295275 196.146760 90.895588 47.883579 1.047093 18.852538 1.285596 202.989618 97.552040 51.256152 1.048465 19.426319 1.274568 210.257725 92.632436 43.226880 1.050174 20.321799 1.305765 176.466149 71.949897 39.030243 1.102729 19.500000 1.317081 192.511247 2.117485 1.863909 2.130235 1.675408 2.071805 1.708309 2.174521 1.734620 2.168873 1.816932 
2.124518 1.666451 2.198892 1.628485 2.177400 1.798869 2.173055 1.816060 2.161882 1.710587 26 33.047390 2.333880 104.450759 1.885650 666.704699 27 38.182377 2.358465 145.691479 1.834819 977.050906 28 39.340907 2.310507 140.632073 1.858516 927.411188 29 42.142760 2.249801 136.383001 1.849351 858.853482 30 33.415941 2.359613 113.736908 1.876736 726.389342 31 38.053214 2.421593 164.054088 1.831481 1044.930162 32 36.574890 2.448129 176.283723 1.757429 1108.443481 33 39.515679 2.460343 162.312382 1.804064 1099.622735 34 49.178044 2.448336 138.827152 1.826863 934.109753 35 42.717913 2.461972 149.951936 1.769869 953.331361 64.127801 30.055429 1.073595 16.576976 1.322393 133.549765 82.609900 50.747411 1.091488 21.204360 1.278392 250.367996 88.645386 39.082103 1.074976 20.679452 1.323507 225.536421 70.741651 39.929939 1.112186 18.398206 1.311964 196.440037 64.464785 33.423684 1.107849 16.740638 1.322969 155.285085 117.697194 50.552102 0.980586 22.297249 1.259899 255.540157 111.200050 47.497145 1.033788 20.088398 1.281080 300.352885 96.845945 45.812445 1.047416 19.503119 1.294824 281.485505 119.427531 40.742757 0.942305 18.707266 1.314488 194.565294 88.517788 57.621016 1.054659 18.438739 1.141232 224.360623 36 41.197544 2.471480 94.505700 1.046608 18.941095 2.156862 135.265353 1.858542 60.168718 1.180721 229.472645 1.719798 958.668060 37 39.788842 2.453568 102.044994 0.985755 18.903699 2.151540 147.601324 1.850108 42.290973 1.306216 234.360074 1.651436 954.238256 38 41.314488 2.395698 85.400518 1.028951 20.735704 2.067240 177.527764 1.656228 36.378665 1.345077 239.807719 1.692423 978.530320 39 42.616783 2.383487 85.074060 1.037431 20.512587 2.059879 160.196547 1.799702 42.088850 1.291107 208.047699 1.687847 925.888402 40 41.664717 2.379957 89.145855 1.069841 18.182713 2.111224 163.945941 1.799987 43.358742 1.282735 226.592166 1.639427 955.119559 41 51.567643 2.222860 83.334685 1.057869 18.994402 2.103811 132.379793 1.884823 53.652463 1.147004 192.771435 1.735189 888.292404 42 
41.421016 2.451080 74.908315 1.067700 16.726950 2.162721 129.406980 1.867044 37.261913 1.295095 227.179097 1.576445 865.683152
43 53.792936 2.439119 78.807962 1.057600 16.198722 2.182931 135.693022 1.840945 40.210292 1.306747 200.088160 1.645527 881.513904
44 43.606944 2.456948 81.652867 1.071207 16.419492 2.195529 146.731878 1.853020 46.255420 1.258085 192.757055 1.708901 890.149666
45 37.815625 2.486153 90.360281 1.013287 15.269039 2.216885 133.537747 1.840366 38.729663 1.333144 201.300751 1.644203 847.795877
46 37.094566 2.487414 72.384919 1.068425 14.802451 2.229075 141.855150 1.734876 35.310976 1.343831 159.132530 1.762777 776.671417
47 33.204738 2.459258 64.056208 1.073800 13.912000 2.230996 107.101937 1.820402 32.968196 1.318080 171.275693 1.729661 716.152377
48 35.602401 2.476466 73.726567 1.061919 15.994960 2.215290 108.142838 1.873297 36.839772 1.326406 169.023852 1.756630 750.253771
49 34.042741 2.504729 76.487107 1.074008 15.018651 2.213809 123.314975 1.854718 40.085731 1.330138 169.606295 1.745948 778.821860
50 34.286204 2.485415 80.908097 1.029931 14.701141 2.202930 149.748097 1.759291 33.764603 1.345225 157.402089 1.821746 796.548881
51 32.317382 2.512603 60.741143 1.093612 15.676934 2.159240 113.224372 1.859247 33.115052 1.325042 131.361039 1.850146 678.906266
52 39.603541 2.371574 68.719874 1.064090 18.118513 2.107728 115.728895 1.868074 35.667096 1.322699 152.943591 1.786516 741.838900
;;;
/* The line of semicolons above is needed after the data */

/* No date will appear on the output */
options nodate;

proc means data=spaghettisauce n mean median std min max cv skewness kurtosis;
var qprego pprego;
run;

proc corr data=spaghettisauce;
var qprego pprego;
run;

/* The OUTPUT statement creates a data set named datareg, which contains the
   predicted values of the dependent variable and the residuals */
proc reg data=spaghettisauce;
model qprego=pprego / dwprob;   /* Model statement */
output out=datareg r=resqprego p=predqprego;
run;

/* NORMAL requests a test of normality of the residuals.
   autoreg also produces AIC, SIC (labeled SBC in the output), and
   within-sample MAE, MAPE, and RMSE */
proc autoreg data=spaghettisauce;
model qprego=pprego / normal;
run;

/* Print the data set created by the OUTPUT statement */
proc print data=datareg;
var week qprego pprego resqprego predqprego;
run;

/* PCORR2: square of partial correlation coefficients
   CLB: confidence intervals associated with the estimated coefficients */
proc reg data=spaghettisauce;
model qprego=pprego / pcorr2 clb cli alpha=.10;
run;

Statistics in SAS
Use PROC MEANS or PROC CORR

proc means data=??? n mean median std min max cv skewness kurtosis;
var var_name1 var_name2…;
run;

The SAS System
The MEANS Procedure

Variable    N          Mean        Median      Std Dev       Minimum       Maximum
----------------------------------------------------------------------------------
qprego     52   134.5458093   132.9587700   17.8065350   104.4507590   177.5277640
pprego     52     1.8458800     1.8515640    0.0517779     1.6562280     1.9326170
----------------------------------------------------------------------------------

Variable   Coeff of Variation     Skewness      Kurtosis
--------------------------------------------------------
qprego             13.2345519    0.5902592    -0.1063091
pprego              2.8050533   -1.0928616     2.5133372
--------------------------------------------------------

Regression in SAS
Use PROC REG, PROC AUTOREG, or PROC MODEL

Simple and Multiple Regression

Using SAS PROC REG for Simple Linear Regression
• The general syntax for PROC REG is
  – PROC REG <options>; <statements>;
• The most commonly used options are:
  – DATA=datasetname
    • Specifies the data set
  – SIMPLE
    • Displays descriptive statistics
• The most commonly used statements are:
  – MODEL dependentvar = independentvar </ options>;
    • Specifies the variable to be predicted (dependentvar) and the variable that is the predictor (independentvar)
    • Several MODEL options are available.
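The general syntax above can be sketched on a small scale as follows. This is only an illustration: the data set name sauce_subset is hypothetical, and the five observations are weeks 36 to 40 of the spaghetti sauce data, restricted to qprego and pprego so that the block runs on its own.

```sas
/* Hypothetical subset: weeks 36-40 of the spaghetti sauce data,
   keeping only the Prego quantity and price */
data sauce_subset;
  input week qprego pprego;
  datalines;
36 135.265353 1.858542
37 147.601324 1.850108
38 177.527764 1.656228
39 160.196547 1.799702
40 163.945941 1.799987
;
run;

/* DATA= names the input data set; SIMPLE adds descriptive statistics
   for the variables used in the MODEL statement */
proc reg data=sauce_subset simple;
  /* dependent variable to the left of =, predictor to the right */
  model qprego = pprego;
run;
quit;
```

With only five observations the estimates will differ from the full 52-week results; the point is the placement of the options on the PROC REG statement and of the variables in the MODEL statement.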
Example

proc reg data=spaghettisauce;
model qprego = pprego / p r cli clb dwprob;
run;

The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: qprego

Number of Observations Read   52
Number of Observations Used   52

Analysis of Variance
Source                  DF   Sum of Squares   Mean Square   F Value   Pr > F
Model (SSR)              1       8631.07541    8631.07541     57.24   <.0001
Error (SSE)             50       7539.63173     150.79263
Corrected Total (SST)   51            16171

Root MSE          12.27977    R-Square   0.5337   (R²)
Dependent Mean   134.54581    Adj R-Sq   0.5244   (adjusted R²)
Coeff Var          9.12683

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1            598.31966         61.32413      9.76     <.0001
pprego       1           -251.24810         33.20935     -7.57     <.0001

Durbin-Watson D               1.132
Pr < DW                      0.0004
Pr > DW                      0.9996
Number of Observations           52
1st Order Autocorrelation     0.422

NOTE: Pr<DW is the p-value for testing positive autocorrelation, and Pr>DW is the p-value for testing negative autocorrelation.

The SAS System
The AUTOREG Procedure
Dependent Variable   qprego

Ordinary Least Squares Estimates
SSE             7539.63173    DFE                        50
MSE              150.79263    Root MSE (RMSE)      12.27977
SBC              414.25971    AIC                410.357222
MAE             9.49555836    AICC                410.60212
MAPE            7.12604319    Regress R-Square       0.5337
Durbin-Watson       1.1321    Total R-Square         0.5337

Miscellaneous Statistics
Statistic      Value     Prob     Label
Normal Test    0.4812    0.7862   Pr > ChiSq     (test of normality of residuals)

Variable    DF    Estimate   Standard Error   t Value   Approx Pr > |t|
Intercept    1    598.3197          61.3241      9.76            <.0001
pprego       1   -251.2481          33.2094     -7.57            <.0001

The SAS System
(resqprego is the residual; predqprego is the predicted value)

Obs   week    qprego    pprego   resqprego   predqprego
  1      1   117.050   1.90962     -1.4811      118.531
  2      2   125.987   1.84750     -8.1534      134.140
  3      3   123.058   1.87096     -5.1863      128.244
  4      4   114.954   1.93262      2.2005      112.753
  5      5   113.248   1.92007     -2.6589      115.907
  6      6   125.878   1.89978      4.8738      121.004
  7      7   120.413   1.87737     -6.2221      126.635
  8      8   121.301   1.82313    -18.9614      140.262
  9      9   126.793   1.85572     -5.2805      132.073
 10     10   112.731   1.93034     -0.5937      113.325
 11     11   122.731   1.91257      4.9409      117.790
 12     12   118.289   1.89275     -4.4800      122.769
 13     13   133.728   1.82201     -6.8145      140.542
 14     14   130.809   1.84992     -2.7229      133.532
 15     15   137.465   1.85844      6.0740      131.391
 16     16   150.015   1.80635      5.5372      144.478
 17     17   124.371   1.88148     -1.2302      125.601
 18     18   136.539   1.87322      8.8625      127.676
 19     19   134.413   1.82784     -4.6661      139.079
 20     20   131.812   1.82242     -8.6281      140.440
 21     21   137.358   1.81639     -4.5970      141.955
 22     22   169.203   1.78052     18.2358      150.967
 23     23   131.770   1.89744     10.1774      121.592
 24     24   151.600   1.84859     17.7357      133.865
 25     25   131.142   1.91328     13.5304      117.612
 26     26   104.451   1.88565    -20.1029      124.554
 27     27   145.691   1.83482      8.3666      137.325
 28     28   140.632   1.85852      9.2610      131.371
 29     29   136.383   1.84935      2.7093      133.674
 30     30   113.737   1.87674    -13.0564      126.793
 31     31   164.054   1.83148     25.8906      138.164
 32     32   176.284   1.75743     19.5148      156.769
 33     33   162.312   1.80406     17.2604      145.052
 34     34   138.827   1.82686     -0.4966      139.324
 35     35   149.952   1.76987     -3.6915      153.643
 36     36   135.265   1.85854      3.9008      131.365
 37     37   147.601   1.85011     14.1178      133.484
 38     38   177.528   1.65623     -4.6678      182.196
 39     39   160.197   1.79970     14.0486      146.148
 40     40   163.946   1.79999     17.8696      146.076
 41     41   132.380   1.88482      7.6183      124.761
 42     42   129.407   1.86704      0.1786      129.228
 43     43   135.693   1.84095     -0.0927      135.786
 44     44   146.732   1.85302     13.9800      132.752
 45     45   133.538   1.84037     -2.3934      135.931
 46     46   141.855   1.73488    -20.5802      162.435
 47     47   107.102   1.82040    -33.8452      140.947
 48     48   108.143   1.87330    -19.5145      127.657
 49     49   123.315   1.85472     -9.0103      132.325
 50     50   149.748   1.75929     -6.5530      156.301
 51     51   113.224   1.85925    -17.9630      131.187
 52     52   115.729   1.86807    -13.2407      128.970

The REG Procedure
Model: MODEL1
Dependent Variable: qprego

Number of Observations Read   52
Number of Observations Used   52

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1       8631.07541    8631.07541     57.24   <.0001
Error             50       7539.63173     150.79263
Corrected Total   51            16171

Root MSE          12.27977    R-Square   0.5337
Dependent Mean   134.54581    Adj R-Sq   0.5244
Coeff Var          9.12683

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Squared Partial Corr Type II   90% Confidence Limits
Intercept    1            598.31966         61.32413      9.76     <.0001   .                               495.54624    701.09307
pprego       1           -251.24810         33.20935     -7.57     <.0001   0.53375                        -306.90382   -195.59238

(The last two columns give the confidence limits of the parameter estimates; the Squared Partial Corr Type II column gives the square of the partial correlation coefficients.)

Using SAS PROC REG for Multiple Linear Regression
• The general syntax for PROC REG is
  – PROC REG <options>; <statements>;
• The most commonly used options are:
  – DATA=datasetname
    • Specifies the data set
  – SIMPLE
    • Displays descriptive statistics
• The most commonly used statements are:
  – MODEL dependentvar = independentvars </ options>;
    • Specifies the variable to be predicted (dependentvar) and the variables that are the predictors (independentvars)

MODEL STATEMENT OPTIONS
(Place after the slash following the list of explanatory variables.)
• P     Requests a table containing predicted values from the model.
• R     Requests that the residuals be analyzed.
• CLI   Requests the upper and lower confidence limits for an individual value of the dependent variable (95 percent by default; ALPHA= changes the level).
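As a rough sketch of how these options are placed after the slash in a multiple regression, the block below uses the first eight observations of the firms data from the example in this section. The data set name firms_subset is hypothetical and chosen only so that the block is self-contained; the ALPHA=0.10 setting is likewise just an illustrative choice.

```sas
/* Hypothetical subset: first eight firms from the firms example */
data firms_subset;
  input firm_id capital labor output;
  datalines;
1 8 23 106
2 9 14 81.08
3 4 38 72.8
4 2 97 57.34
5 6 11 66.79
6 6 43 98.23
7 3 93 82.68
8 6 49 99.77
;
run;

proc reg data=firms_subset;
  /* P: predicted values; R: residual analysis;
     CLI: limits for an individual value of the dependent variable;
     ALPHA=0.10 switches those limits from 95 to 90 percent */
  model output = labor capital / p r cli alpha=0.10;
run;
quit;
```

With so few observations the limits will be wide; the sketch is meant to show option placement, not to reproduce the results reported for the full data set.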
Example

data firms;
input firm_id capital labor output;
/* Transformation statements */
log_output=log(output);
log_capital=log(capital);
log_labor=log(labor);
datalines;
1 8 23 106
2 9 14 81.08
3 4 38 72.8
4 2 97 57.34
5 6 11 66.79
6 6 43 98.23
7 3 93 82.68
8 6 49 99.77
9 8 36 110
10 8 43 118.93
11 4 61 95.05
12 8 31 112.83
13 3 57 64.54
14 6 97 137.22
15 4 93 86.17
16 2 72 56.25
17 3 61 81.1
18 3 97 65.23
19 9 89 149.56
20 3 25 65.43
21 1 81 36.06
22 4 11 56.92
23 2 64 49.59
24 3 10 43.21
25 6 71 121.24
;;;
options nodate;

proc reg data=firms;
model output=labor capital / pcorr2;
run;

proc reg data=firms;
model log_output=log_labor log_capital / pcorr2;
run;

The REG Procedure
Model: MODEL1
Dependent Variable: output

Number of Observations Read   25
Number of Observations Used   25

Analysis of Variance
Source                  DF   Sum of Squares   Mean Square   F Value   Pr > F
Model (SSR)              2            17596    8798.14334     54.08   <.0001
Error (SSE)             22       3578.83410     162.67428
Corrected Total (SST)   24            21175

Root MSE         12.75438    R-Square   0.8310   (R²)
Dependent Mean   84.56080    Adj R-Sq   0.8156   (adjusted R²)
Coeff Var        15.08309

Parameter Estimates
Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Squared Partial Corr Type II
Intercept    1              2.15525          9.01440      0.24     0.8132   .
labor        1              0.47631          0.09215      5.17     <.0001   0.54842
capital      1             11.64477          1.13539     10.26     <.0001   0.82703

(The last column is the square of the partial correlation coefficients.)

Model: MODEL1
Dependent Variable: log_output

Number of Observations Read   25
Number of Observations Used   25

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2          3.01454       1.50727    177.22   <.0001
Error             22          0.18711       0.00851
Corrected Total   24          3.20165

Root MSE         0.09222    R-Square   0.9416   (R²)
Dependent Mean   4.37573    Adj R-Sq   0.9362   (adjusted R²)
Coeff Var        2.10760

Parameter Estimates
Variable      DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Squared Partial Corr Type II
Intercept      1              2.48108          0.12862     19.29     <.0001   .
log_labor      1              0.25734          0.02696      9.55     <.0001   0.80551
log_capital    1              0.64011          0.03473     18.43     <.0001   0.93917
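
One way to read the log-log estimates is as the exponents of a Cobb-Douglas production function. That interpretation is an assumption added here for illustration and is not stated in the output itself; the numbers are taken directly from the parameter estimates for the log_output model.

```latex
% Fitted log-log regression (coefficients from the parameter estimates)
\[
\widehat{\log(\text{output})}
  = 2.48108 + 0.25734\,\log(\text{labor}) + 0.64011\,\log(\text{capital})
\]

% Exponentiating both sides gives a multiplicative (Cobb-Douglas type) form
\[
\widehat{\text{output}}
  \approx e^{2.48108}\,\text{labor}^{0.25734}\,\text{capital}^{0.64011}
  \approx 11.95\,\text{labor}^{0.25734}\,\text{capital}^{0.64011}
\]

% Sum of the two estimated elasticities
\[
0.25734 + 0.64011 = 0.89745
\]
```

Under this reading, a 1 percent increase in labor is associated with roughly a 0.26 percent increase in output, and a 1 percent increase in capital with roughly a 0.64 percent increase, holding the other input fixed. The sum of 0.89745 is below one, but whether it differs significantly from one would need a formal test that the output shown here does not include.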