Introduction to SAS

What is a data set?
• A data set (or dataset) is a collection of data,
usually presented in tabular form. Each
column represents a particular variable. Each
row corresponds to a given member of the
data set in question.
There are three types of datasets:
• Cross-sectional
• Time-series
• Panel (a combination of cross-sectional and time-series data)
Cross-Sectional Data
• Cross-sectional data refers to data collected by
observing many subjects (such as individuals,
firms or countries/regions) at the same point
of time, or without regard to differences in
time.
Members   Age   Wage   Years of schooling
John       40   100k   14
Paul       34   110k   17
Mary       28    75k   10
Tom        30   130k   16
Sara       37    50k   15
Time-Series Data
• A time series is a sequence of data points, measured
typically at successive times spaced at uniform time
intervals.
Year   GDP xyz   Inflation Rate
2004   34        3.2
2005   30        2.5
2006   37        2.7
2007   38        3
2008   41        2.9
2009   43        3.4
• Frequencies: daily, weekly, monthly, quarterly, annual
Panel Data
• Panel data, also called longitudinal data or
cross-sectional time series data, are data
where multiple cases (people, firms, countries
etc.) were observed at two or more time
periods.
Person   Year   Income   Age   Sex
1        2003   1500     27    1
1        2004   1700     28    1
1        2005   2000     29    1
2        2003   2100     41    2
2        2004   2100     42    2
2        2005   2200     43    2
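A minimal sketch of how the panel table above could be entered and summarized in SAS. The data set name panel is an illustrative choice, not something from the slides; the values are those of the table.

```sas
/* Read the panel table: one row per person per year */
data panel;
  input person year income age sex;
  datalines;
1 2003 1500 27 1
1 2004 1700 28 1
1 2005 2000 29 1
2 2003 2100 41 2
2 2004 2100 42 2
2 2005 2200 43 2
;
run;

/* Sort by cross-sectional unit, then by time period */
proc sort data=panel;
  by person year;
run;

/* Summarize income separately for each person */
proc means data=panel n mean min max;
  by person;
  var income;
run;
```

Sorting by person and then year keeps each person's time series together, which BY-group processing in later procedures requires.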
What should you know about your dataset?
• What type of dataset do you have?
• How many variables do you have?
• How many observations do you have?
• What kind of variables do you have?
– Numeric. A numeric variable is an observed response that is a numerical value.
– String. A string variable is any combination of one or more characters.
• Are there missing values?
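Most of these questions can be answered from within SAS. The sketch below assumes the sample data set sashelp.class that ships with SAS; substitute your own data set name.

```sas
/* Number of observations and variables, plus the type
   (numeric or character) of each variable */
proc contents data=sashelp.class;
run;

/* N = non-missing count, NMISS = missing count,
   for every numeric variable in the data set */
proc means data=sashelp.class n nmiss mean;
run;
```

PROC CONTENTS covers the "how many" and "what kind" questions; the NMISS column from PROC MEANS shows whether any numeric variable has missing values.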
How to store your dataset?
• Microsoft Excel Spreadsheets
Accessing SAS
Version 9.2 or 9.3
Click on ENGLISH 9.2 or 9.3
1. What does SAS look like?
[Figure: screenshot of the SAS interface. Callouts: Editor window, Log window, Output window, Results window, Explorer window, New Libraries, Execute the program.]
Anatomy of a SAS Program
(1) Data name statement
(2) Input statement (list of all variables to be read into the
program)
(3) Transformation statements
(4) Datalines statement (copy & paste from Excel)
(5) Placement of data
(6) PROC statements
– Means
– Corr
– Reg
– Model
– Autoreg
(7) Run Statement
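The seven parts listed above can be seen together in a short sketch. It uses the cross-sectional wage table from earlier in these slides; the names wages, member, wage_k and schooling are illustrative assumptions, with wage_k holding the wage in thousands.

```sas
/* (1) Data name statement */
data wages;
  /* (2) Input statement; the $ marks a string (character) variable */
  input member $ age wage_k schooling;
  /* (3) Transformation statement: convert thousands to units */
  wage = wage_k * 1000;
  /* (4) Datalines statement, followed by (5) the data */
  datalines;
John 40 100 14
Paul 34 110 17
Mary 28 75 10
Tom 30 130 16
Sara 37 50 15
;
run;

/* (6) PROC statement */
proc means data=wages n mean min max;
  var age wage schooling;
/* (7) Run statement */
run;
```

The transformation statement sits before datalines because the data must be the last thing in the DATA step.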
Examples
Spaghetti Sauce Program
data spaghettisauce;   /* Data set name */
/* Input statement */
input week qclassico pclassico qhunts phunts qnewmans pnewmans
qprego pprego qprivl pprivl qragu pragu totalexp;
/* Placement of data after the datalines statement */
datalines;
1 29.478939 2.311685 80.401300 1.099910 16.557905 2.287727
  117.049632 1.909622 41.363767 1.336341 276.759921 1.677577 937.542909
2 31.380376 2.305388 75.181905 1.105020 15.409280 2.299160
  125.986697 1.847495 41.584220 1.311940 206.699207 1.777097 845.490186
3 31.762660 2.299778 69.281355 1.111880 17.952117 2.160420
  123.057729 1.870962 34.458333 1.381772 218.231648 1.738040 846.008960
4 28.447741 2.341264 68.898908 1.108804 15.187799 2.321191
  114.953810 1.932617 33.825571 1.362120 204.152369 1.752055 804.175192
5 27.772665 2.340832 77.208027 1.080379 15.651408 2.249415
  113.247798 1.920066 35.508482 1.356528 180.526273 1.846330 782.554156
6 28.251703 2.362670 96.507708 1.060712 16.511740 2.246208
  125.877846 1.899778 61.823708 1.133252 217.893723 1.812694 910.381585
7 26.947404 2.368843 84.284722 1.048924 16.802342 2.203934
  120.413152 1.877365 37.019864 1.353188 222.342183 1.796701 864.910385
8 26.669631 2.375479 81.810965 1.071276 16.730153 2.207156
  121.300549 1.823129 33.521622 1.359883 187.289016 1.809025 793.463874
9 29.190977 2.354548 97.958015 1.045915 16.885963 2.205486
  126.792828 1.855721 38.925676 1.343224 229.625749 1.754891 898.975245
10 30.564590 2.301370 96.337535 1.073289 16.835041 2.216497
  112.731447 1.930341 44.831781 1.315669 212.556985 1.798351 869.899250
11 29.502039 2.342312 76.135599 1.104502 16.832199 2.195138
  122.730980 1.912570 43.597670 1.272487 244.799346 1.692623 894.705963
12 29.454762 2.383079 69.803347 1.118029 18.328200 2.175770
  118.288762 1.892754 56.155822 1.161683 260.440575 1.704789 921.236157
13 28.853887 2.393748 72.185000 1.108094 18.922787 2.168626
  133.727889 1.822013 35.419832 1.334668 219.052937 1.772264 869.240450
14 30.275710 2.361550 110.997722 0.970290 18.885386 2.164687
  130.808890 1.849916 41.793621 1.296186 286.263290 1.671172 994.631492
15 34.241497 2.290308 91.463049 1.059148 18.770848 2.158183
  137.464940 1.858437 42.349396 1.337508 287.937805 1.680284 1011.737278
16 43.102764 2.246922 80.876807 1.067257 18.982216 2.108769
  150.014841 1.806350 42.336055 1.277091 204.939009 1.787616 914.592550
17 35.687632 2.329571 80.627606 1.084763 16.899180 2.202753
  124.371155 1.881480 65.104947 1.132962 230.349266 1.756095 920.101902
18 37.710794 2.327977 131.616811 0.983176 16.947751 2.238778
  136.538891 1.873221 53.952967 1.180342 310.249254 1.587328 1067.052402
19 36.972091 2.265346 90.488003 1.049932 16.030220 2.244342
  134.412773 1.827838 41.157579 1.263050 313.810158 1.604585 1015.942051
20 32.236119 2.393364 92.735918 1.032483 17.484182 2.219749
  131.812201 1.822419 39.314523 1.331265 290.076840 1.616331 973.126818
21 31.584801 2.409353 77.131493 1.078318 17.947063 2.184630
  137.357622 1.816392 40.795877 1.295275 196.146760 1.815537 856.927770
22 33.133108 2.406975 90.895588 1.047093 18.852538 2.178318
  169.203190 1.780520 47.883579 1.285596 202.989618 1.779423 940.026516
23 36.753574 2.363383 97.552040 1.048465 19.426319 2.148674
  131.769601 1.897437 51.256152 1.274568 210.257725 1.798278 924.339307
24 34.855495 2.399628 92.632436 1.050174 20.321799 2.127138
  151.600412 1.848591 43.226880 1.305765 176.466149 1.835858 884.805764
25 39.940000 2.369996 71.949897 1.102729 19.500000 2.136197
  131.142332 1.913279 39.030243 1.317081 192.511247 1.820689 868.475691
26 33.047390 2.333880 64.127801 1.073595 16.576976 2.117485
  104.450759 1.885650 30.055429 1.322393 133.549765 1.863909 666.704699
27 38.182377 2.358465 82.609900 1.091488 21.204360 2.130235
  145.691479 1.834819 50.747411 1.278392 250.367996 1.675408 977.050906
28 39.340907 2.310507 88.645386 1.074976 20.679452 2.071805
  140.632073 1.858516 39.082103 1.323507 225.536421 1.708309 927.411188
29 42.142760 2.249801 70.741651 1.112186 18.398206 2.174521
  136.383001 1.849351 39.929939 1.311964 196.440037 1.734620 858.853482
30 33.415941 2.359613 64.464785 1.107849 16.740638 2.168873
  113.736908 1.876736 33.423684 1.322969 155.285085 1.816932 726.389342
31 38.053214 2.421593 117.697194 0.980586 22.297249 2.124518
  164.054088 1.831481 50.552102 1.259899 255.540157 1.666451 1044.930162
32 36.574890 2.448129 111.200050 1.033788 20.088398 2.198892
  176.283723 1.757429 47.497145 1.281080 300.352885 1.628485 1108.443481
33 39.515679 2.460343 96.845945 1.047416 19.503119 2.177400
  162.312382 1.804064 45.812445 1.294824 281.485505 1.798869 1099.622735
34 49.178044 2.448336 119.427531 0.942305 18.707266 2.173055
  138.827152 1.826863 40.742757 1.314488 194.565294 1.816060 934.109753
35 42.717913 2.461972 88.517788 1.054659 18.438739 2.161882
  149.951936 1.769869 57.621016 1.141232 224.360623 1.710587 953.331361
36 41.197544 2.471480 94.505700 1.046608 18.941095 2.156862
  135.265353 1.858542 60.168718 1.180721 229.472645 1.719798 958.668060
37 39.788842 2.453568 102.044994 0.985755 18.903699 2.151540
  147.601324 1.850108 42.290973 1.306216 234.360074 1.651436 954.238256
38 41.314488 2.395698 85.400518 1.028951 20.735704 2.067240
  177.527764 1.656228 36.378665 1.345077 239.807719 1.692423 978.530320
39 42.616783 2.383487 85.074060 1.037431 20.512587 2.059879
  160.196547 1.799702 42.088850 1.291107 208.047699 1.687847 925.888402
40 41.664717 2.379957 89.145855 1.069841 18.182713 2.111224
  163.945941 1.799987 43.358742 1.282735 226.592166 1.639427 955.119559
41 51.567643 2.222860 83.334685 1.057869 18.994402 2.103811
  132.379793 1.884823 53.652463 1.147004 192.771435 1.735189 888.292404
42 41.421016 2.451080 74.908315 1.067700 16.726950 2.162721
  129.406980 1.867044 37.261913 1.295095 227.179097 1.576445 865.683152
43 53.792936 2.439119 78.807962 1.057600 16.198722 2.182931
  135.693022 1.840945 40.210292 1.306747 200.088160 1.645527 881.513904
44 43.606944 2.456948 81.652867 1.071207 16.419492 2.195529
  146.731878 1.853020 46.255420 1.258085 192.757055 1.708901 890.149666
45 37.815625 2.486153 90.360281 1.013287 15.269039 2.216885
  133.537747 1.840366 38.729663 1.333144 201.300751 1.644203 847.795877
46 37.094566 2.487414 72.384919 1.068425 14.802451 2.229075
  141.855150 1.734876 35.310976 1.343831 159.132530 1.762777 776.671417
47 33.204738 2.459258 64.056208 1.073800 13.912000 2.230996
  107.101937 1.820402 32.968196 1.318080 171.275693 1.729661 716.152377
48 35.602401 2.476466 73.726567 1.061919 15.994960 2.215290
  108.142838 1.873297 36.839772 1.326406 169.023852 1.756630 750.253771
49 34.042741 2.504729 76.487107 1.074008 15.018651 2.213809
  123.314975 1.854718 40.085731 1.330138 169.606295 1.745948 778.821860
50 34.286204 2.485415 80.908097 1.029931 14.701141 2.202930
  149.748097 1.759291 33.764603 1.345225 157.402089 1.821746 796.548881
51 32.317382 2.512603 60.741143 1.093612 15.676934 2.159240
  113.224372 1.859247 33.115052 1.325042 131.361039 1.850146 678.906266
52 39.603541 2.371574 68.719874 1.064090 18.118513 2.107728
  115.728895 1.868074 35.667096 1.322699 152.943591 1.786516 741.838900
;;;
/* A line containing a semicolon is needed after the data */
options nodate;   /* No date will appear on the output */
proc means data=spaghettisauce n mean median std min max cv skewness kurtosis;
var qprego pprego;
run;
proc corr data=spaghettisauce;
var qprego pprego;
run;
proc reg data=spaghettisauce;
model qprego=pprego / dwprob;   /* Model statement */
/* Creation of a data set named datareg, which contains the predicted
   values of the dependent variable and the residuals */
output out=datareg r=resqprego p=predqprego;
run;
proc autoreg data=spaghettisauce;
model qprego=pprego / normal;   /* Test of normality of the residuals */
run;
/* PROC AUTOREG also produces AIC, SIC (labelled SBC in the output),
   and within-sample MAE, MAPE, and RMSE. */
/* Print the data set created by the OUTPUT statement */
proc print data=datareg;
var week qprego pprego resqprego predqprego;
run;
proc reg data=spaghettisauce;
/* pcorr2: squares of the partial correlation coefficients */
/* clb: confidence intervals associated with the estimated coefficients */
model qprego=pprego / pcorr2 clb cli alpha=.10;
run;
Statistics in SAS
Use PROC MEANS or PROC CORR
Proc Means Data = ??? N mean median
std min max cv skewness kurtosis;
var var_name1 var_name2…;
The SAS System
The MEANS Procedure

Variable    N           Mean         Median        Std Dev        Minimum        Maximum
-----------------------------------------------------------------------------------------
qprego     52    134.5458093    132.9587700     17.8065350    104.4507590    177.5277640
pprego     52      1.8458800      1.8515640      0.0517779      1.6562280      1.9326170
-----------------------------------------------------------------------------------------

              Coeff of
Variable     Variation       Skewness       Kurtosis
----------------------------------------------------
qprego      13.2345519      0.5902592     -0.1063091
pprego       2.8050533     -1.0928616      2.5133372
----------------------------------------------------
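As a quick check on the output above, the coefficient of variation reported by PROC MEANS is the standard deviation expressed as a percentage of the mean. Plugging in the reported values (s for Std Dev and x-bar for Mean are notation chosen here):

```latex
\mathrm{CV} = 100 \times \frac{s}{\bar{x}}, \qquad
\mathrm{CV}_{qprego} = 100 \times \frac{17.8065350}{134.5458093} \approx 13.23, \qquad
\mathrm{CV}_{pprego} = 100 \times \frac{0.0517779}{1.8458800} \approx 2.81
```

Relative to its mean, the quantity of Prego varies far more from week to week than its price does.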
Regression in SAS
Use PROC REG, PROC AUTOREG, or
PROC MODEL
Simple and Multiple Regression
Using SAS PROC REG for Simple Linear Regression
• The general syntax for PROC REG is
– PROC REG <options>; <statements>;
• The most commonly used options are:
– DATA=datasetname
• Specifies dataset
– SIMPLE
• Displays descriptive statistics
• The most commonly used statements are:
– MODEL dependentvar = independentvar </ options >;
• Specifies the variable to be predicted (dependentvar) and the
variable that is the predictor (independentvar)
• Several MODEL options are available.
Example
Proc reg data = spaghettisauce;
Model qprego = pprego / p r cli clb dwprob;
run;
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: qprego

Number of Observations Read    52
Number of Observations Used    52

Analysis of Variance
                                  Sum of          Mean
Source                   DF      Squares        Square    F Value    Pr > F
Model (SSR)               1   8631.07541    8631.07541      57.24    <.0001
Error (SSE)              50   7539.63173     150.79263
Corrected Total (SST)    51        16171

Root MSE            12.27977    R-Square    0.5337    (R2)
Dependent Mean     134.54581    Adj R-Sq    0.5244    (adjusted R2)
Coeff Var            9.12683

Parameter Estimates
                   Parameter     Standard
Variable     DF     Estimate        Error    t Value    Pr > |t|
Intercept     1    598.31966     61.32413       9.76      <.0001
pprego        1   -251.24810     33.20935      -7.57      <.0001
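The summary statistics in this output follow from the sums of squares and the parameter table. A short worked check using only the reported values:

```latex
\begin{align*}
R^2 &= \frac{SSR}{SST} = \frac{8631.07541}{8631.07541 + 7539.63173} \approx 0.5337 \\
F &= \frac{SSR/1}{SSE/50} = \frac{8631.07541}{150.79263} \approx 57.24 \\
\text{Root MSE} &= \sqrt{150.79263} \approx 12.28 \\
t_{pprego} &= \frac{-251.24810}{33.20935} \approx -7.57
\end{align*}
```

With a single explanatory variable, the F value is the square of the slope's t value: (-7.57)^2 is about 57.2 after rounding.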
The SAS System
The REG Procedure
Model: MODEL1
Dependent Variable: qprego

Durbin-Watson D                 1.132
Pr < DW                        0.0004
Pr > DW                        0.9996
Number of Observations             52
1st Order Autocorrelation       0.422

NOTE: Pr<DW is the p-value for testing positive autocorrelation,
and Pr>DW is the p-value for testing negative autocorrelation.
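The two autocorrelation figures in this output are linked by a standard large-sample approximation, shown here as a rough check rather than as the formula SAS uses (r denotes the 1st order autocorrelation of the residuals):

```latex
d \approx 2\,(1 - r) = 2\,(1 - 0.422) = 1.156
```

This is close to the reported Durbin-Watson D of 1.132. A value well below 2, together with Pr < DW = 0.0004, points to positive autocorrelation in the residuals.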
The SAS System
The AUTOREG Procedure
Dependent Variable    qprego

Ordinary Least Squares Estimates
SSE               7539.63173    DFE                         50
MSE                150.79263    Root MSE (RMSE)       12.27977
SBC                414.25971    AIC                 410.357222
MAE               9.49555836    AICC                 410.60212
MAPE              7.12604319    Regress R-Square        0.5337
Durbin-Watson         1.1321    Total R-Square          0.5337

Miscellaneous Statistics
Statistic       Value      Prob    Label
Normal Test    0.4812    0.7862    Pr > ChiSq    (test of normality of residuals)

                                Standard               Approx
Variable     DF    Estimate        Error    t Value   Pr > |t|
Intercept     1    598.3197      61.3241       9.76     <.0001
pprego        1   -251.2481      33.2094      -7.57     <.0001
The SAS System
(resqprego = residual; predqprego = predicted value)

Obs   week    qprego     pprego   resqprego   predqprego
  1      1   117.050    1.90962     -1.4811      118.531
  2      2   125.987    1.84750     -8.1534      134.140
  3      3   123.058    1.87096     -5.1863      128.244
  4      4   114.954    1.93262      2.2005      112.753
  5      5   113.248    1.92007     -2.6589      115.907
  6      6   125.878    1.89978      4.8738      121.004
  7      7   120.413    1.87737     -6.2221      126.635
  8      8   121.301    1.82313    -18.9614      140.262
  9      9   126.793    1.85572     -5.2805      132.073
 10     10   112.731    1.93034     -0.5937      113.325
 11     11   122.731    1.91257      4.9409      117.790
 12     12   118.289    1.89275     -4.4800      122.769
 13     13   133.728    1.82201     -6.8145      140.542
 14     14   130.809    1.84992     -2.7229      133.532
 15     15   137.465    1.85844      6.0740      131.391
 16     16   150.015    1.80635      5.5372      144.478
 17     17   124.371    1.88148     -1.2302      125.601
 18     18   136.539    1.87322      8.8625      127.676
 19     19   134.413    1.82784     -4.6661      139.079
 20     20   131.812    1.82242     -8.6281      140.440
 21     21   137.358    1.81639     -4.5970      141.955
 22     22   169.203    1.78052     18.2358      150.967
 23     23   131.770    1.89744     10.1774      121.592
 24     24   151.600    1.84859     17.7357      133.865
 25     25   131.142    1.91328     13.5304      117.612
 26     26   104.451    1.88565    -20.1029      124.554
 27     27   145.691    1.83482      8.3666      137.325
 28     28   140.632    1.85852      9.2610      131.371
 29     29   136.383    1.84935      2.7093      133.674
 30     30   113.737    1.87674    -13.0564      126.793
 31     31   164.054    1.83148     25.8906      138.164
 32     32   176.284    1.75743     19.5148      156.769
 33     33   162.312    1.80406     17.2604      145.052
 34     34   138.827    1.82686     -0.4966      139.324
 35     35   149.952    1.76987     -3.6915      153.643
 36     36   135.265    1.85854      3.9008      131.365
 37     37   147.601    1.85011     14.1178      133.484
 38     38   177.528    1.65623     -4.6678      182.196
 39     39   160.197    1.79970     14.0486      146.148
 40     40   163.946    1.79999     17.8696      146.076
 41     41   132.380    1.88482      7.6183      124.761
 42     42   129.407    1.86704      0.1786      129.228
 43     43   135.693    1.84095     -0.0927      135.786
 44     44   146.732    1.85302     13.9800      132.752
 45     45   133.538    1.84037     -2.3934      135.931
 46     46   141.855    1.73488    -20.5802      162.435
 47     47   107.102    1.82040    -33.8452      140.947
 48     48   108.143    1.87330    -19.5145      127.657
 49     49   123.315    1.85472     -9.0103      132.325
 50     50   149.748    1.75929     -6.5530      156.301
 51     51   113.224    1.85925    -17.9630      131.187
 52     52   115.729    1.86807    -13.2407      128.970
The REG Procedure
Model: MODEL1
Dependent Variable: qprego

Number of Observations Read    52
Number of Observations Used    52

Analysis of Variance
                            Sum of          Mean
Source             DF      Squares        Square    F Value    Pr > F
Model               1   8631.07541    8631.07541      57.24    <.0001
Error              50   7539.63173     150.79263
Corrected Total    51        16171

Root MSE            12.27977    R-Square    0.5337
Dependent Mean     134.54581    Adj R-Sq    0.5244
Coeff Var            9.12683

Parameter Estimates
                                                               Squared
                  Parameter    Standard                        Partial
Variable    DF     Estimate       Error   t Value  Pr > |t|  Corr Type II    90% Confidence Limits
Intercept    1    598.31966    61.32413      9.76    <.0001        .          495.54624    701.09307
pprego       1   -251.24810    33.20935     -7.57    <.0001     0.53375      -306.90382   -195.59238

Squared Partial Corr Type II: square of partial correlation coefficients
90% Confidence Limits: confidence limits of parameter estimates
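The 90% confidence limits can be reproduced from the estimates and standard errors. The sketch below assumes a critical value of about 1.676 for the t distribution with 50 degrees of freedom and 5% in each tail; this value comes from standard t tables, not from the output.

```latex
\begin{align*}
b \pm t_{0.05,\,50}\; se(b):\quad
\text{Intercept} &: 598.31966 \pm 1.676 \times 61.32413 \approx (495.5,\ 701.1) \\
\text{pprego} &: -251.24810 \pm 1.676 \times 33.20935 \approx (-306.9,\ -195.6)
\end{align*}
```

With only one explanatory variable, the squared partial correlation for pprego (0.53375) is the same number as the model R-Square (0.5337).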
Using SAS PROC REG for Multiple Linear Regression
• The general syntax for PROC REG is
– PROC REG <options>; <statements>;
• The most commonly used options are:
– DATA=datasetname
• Specifies dataset
– SIMPLE
• Displays descriptive statistics
• The most commonly used statements are:
– MODEL dependentvar = independentvars </ options >;
• Specifies the variable to be predicted (dependentvar)
and the variables that are the predictors
(independentvars)
MODEL STATEMENT OPTIONS
(Place after slash following the list of explanatory
variables.)
• P
Requests a table containing predicted values
from the model
• R
Requests that the residuals be analyzed.
• CLI
Requests the 95 percent upper and lower
confidence limits for an individual value of
the dependent variable.
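A minimal sketch of these three options in use. It assumes the sample data set sashelp.class that ships with SAS, with weight and height as variables; neither appears in the slides.

```sas
proc reg data=sashelp.class;
  /* p   = table of predicted values
     r   = analysis of the residuals
     cli = confidence limits for an individual predicted value
           (95 percent unless ALPHA= is specified) */
  model weight = height / p r cli;
run;
quit;
```

Adding alpha=.10 after the slash would request 90 percent limits instead, as in the spaghetti sauce example.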
Example
data firms;
input firm_id capital labor output;
/* Transformation statements */
log_output=log(output);
log_capital=log(capital);
log_labor=log(labor);
datalines;
1 8 23 106
2 9 14 81.08
3 4 38 72.8
4 2 97 57.34
5 6 11 66.79
6 6 43 98.23
7 3 93 82.68
8 6 49 99.77
9 8 36 110
10 8 43 118.93
11 4 61 95.05
12 8 31 112.83
13 3 57 64.54
14 6 97 137.22
15 4 93 86.17
16 2 72 56.25
17 3 61 81.1
18 3 97 65.23
19 9 89 149.56
20 3 25 65.43
21 1 81 36.06
22 4 11 56.92
23 2 64 49.59
24 3 10 43.21
25 6 71 121.24
;;;
options nodate;
proc reg data=firms;
model output=labor capital / pcorr2;
run;
proc reg data=firms;
model log_output=log_labor log_capital / pcorr2;
run;
The REG Procedure
Model: MODEL1
Dependent Variable: output

Number of Observations Read    25
Number of Observations Used    25

Analysis of Variance
                                  Sum of          Mean
Source                   DF      Squares        Square    F Value    Pr > F
Model (SSR)               2        17596    8798.14334      54.08    <.0001
Error (SSE)              22   3578.83410     162.67428
Corrected Total (SST)    24        21175

Root MSE           12.75438    R-Square    0.8310    (R2)
Dependent Mean     84.56080    Adj R-Sq    0.8156    (adjusted R2)
Coeff Var          15.08309

Parameter Estimates
                                                               Squared
                  Parameter    Standard                        Partial
Variable    DF     Estimate       Error   t Value  Pr > |t|  Corr Type II
Intercept    1      2.15525     9.01440      0.24    0.8132        .
labor        1      0.47631     0.09215      5.17    <.0001     0.54842
capital      1     11.64477     1.13539     10.26    <.0001     0.82703

Squared Partial Corr Type II: square of partial correlation coefficients
Model: MODEL1
Dependent Variable: log_output

Number of Observations Read    25
Number of Observations Used    25

Analysis of Variance
                            Sum of          Mean
Source             DF      Squares        Square    F Value    Pr > F
Model               2      3.01454       1.50727     177.22    <.0001
Error              22      0.18711       0.00851
Corrected Total    24      3.20165

Root MSE            0.09222    R-Square    0.9416    (R2)
Dependent Mean      4.37573    Adj R-Sq    0.9362    (adjusted R2)
Coeff Var           2.10760

Parameter Estimates
                                                                 Squared
                    Parameter    Standard                        Partial
Variable      DF     Estimate       Error   t Value  Pr > |t|  Corr Type II
Intercept      1      2.48108     0.12862     19.29    <.0001        .
log_labor      1      0.25734     0.02696      9.55    <.0001     0.80551
log_capital    1      0.64011     0.03473     18.43    <.0001     0.93917
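Written out, the parameter estimates above give the fitted log-log equation. Because SAS's LOG function is the natural logarithm, it is written with ln here:

```latex
\widehat{\ln(\text{output})} = 2.48108 + 0.25734\,\ln(\text{labor}) + 0.64011\,\ln(\text{capital})
```

One common reading of a log-log model, offered here as an interpretation rather than something stated in the slides, is that each slope is an elasticity: a 1 percent increase in labor is associated with about a 0.26 percent increase in output, holding capital fixed. The two slopes sum to 0.25734 + 0.64011 = 0.89745.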