Topics in Data Management (PPT)

Topics in Data Management
SAS Data Step
Combining Data Sets I - SET Statement
• The same variables are recorded in different sources:
multiple datasets share variable names but may cover
different sampling/experimental units
– Exam scores from students in various sections of STA 2023
– County level data from different state databases
– Flight departure/arrival data from different months
SECTION 1
Student  id #  Exam1  Exam2  Exam3
Amy      1456  98     78     84
Zed      2234  68     84     75

SECTION 5
Student  id #  Exam1  Exam2  Exam3
Alex     3410  74     68     .
Zach     4561  92     74     88
Combining Data Sets I - SET Statement
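options nodate nonumber ps=54 ls=80;
/* name in columns 1-8, id in 9-12, exams in 14-16, 18-20 and 22-24 */
data one;
  input student $ 1-8 idnum 9-12 exam1 14-16 exam2 18-20 exam3 22-24;
  section=1;
  cards;
Amy     1456 98  78  84
Zed     2234 68  84  75
;
run;
/* Alex has no Exam 3 score; the period is the SAS missing value */
data five;
  input student $ 1-8 idnum 9-12 exam1 14-16 exam2 18-20 exam3 22-24;
  section=5;
  cards;
Alex    3410 74  68  .
Zach    4561 92  74  88
;
run;
/* stack the sections: all rows of ONE followed by all rows of FIVE */
data all;
  set one five;
run;
proc print data=all;
run;
quit;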
Combining Data Sets I - SET Statement
The SAS System

Obs  student  idnum  exam1  exam2  exam3  section
1    Amy      1456   98     78     84     1
2    Zed      2234   68     84     75     1
3    Alex     3410   74     68     .      5
4    Zach     4561   92     74     88     5
Combining Data Sets II - MERGE Statement
• The same sampling/experimental units appear in
several datasets, with different variables/characteristics
measured in each.
– County data from different government sources
– Store sales data updated over time
Store    2003 Sales
Atlanta  1459
Zurich   1383

Store    2004 Sales
Atlanta  1640
Zurich   1561
Combining Data Sets II - MERGE Statement
options nodate nonumber ps=54 ls=80;
/* store name in columns 1-8, sales in columns 10-14 */
data s2003;
  input store $ 1-8 sales03 10-14;
  cards;
Atlanta  1459
Zurich   1383
;
run;
/* 2004 values as given in the store table (1640 and 1561) */
data s2004;
  input store $ 1-8 sales04 10-14;
  cards;
Atlanta  1640
Zurich   1561
;
run;
/* both datasets must be sorted by the BY variable before merging */
proc sort data=s2003; by store; run;
proc sort data=s2004; by store; run;
data s0304;
  merge s2003 s2004;
  by store;
run;
proc print data=s0304;
run;
quit;
The SAS System

Obs  store    sales03  sales04
1    Atlanta  1459     1640
2    Zurich   1383     1561
Creating New Variables From Existing Ones
• Creating Final Grade for Students (Exams 1
and 2 Each Count 30%, Exam 3 40%)
– Total = (0.3*Exam1)+(0.3*Exam2)+(0.4*Exam3)
• Obtaining Sales Growth (%) for stores
– Grow0403=100*(sales04-sales03)/sales03
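The growth formula can be tried on the two stores from the MERGE slides. The following is only a minimal sketch: it reads both years in a single data step rather than merging, and the column positions for sales04 (16-20) and the 6.2 display format are assumptions made for this illustration.

```sas
/* Sketch only: store values are copied from the 2003 and 2004 sales tables */
data s0304;
  input store $ 1-8 sales03 10-14 sales04 16-20;
  /* percent growth from 2003 to 2004 */
  grow0403 = 100*(sales04 - sales03)/sales03;
  cards;
Atlanta  1459  1640
Zurich   1383  1561
;
run;
proc print data=s0304;
  format grow0403 6.2;
run;
```

With these values the growth comes out at roughly 12.4% for Atlanta and 12.9% for Zurich.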
Grades Example
data all;
set one five;
total=(0.3*exam1)+(0.3*exam2)+(0.4*exam3);
run;
proc print;
var student idnum total;
run;
quit;
The SAS System

Obs  student  idnum  total
1    Amy      1456   86.4
2    Zed      2234   75.6
3    Alex     3410   .
4    Zach     4561   85.0
Building Case Histories
• Have multiple observations of same variable on
individual units (not necessarily the same number
across individuals).
• Want to summarize the measurements for each
individual and obtain single “record”.
– Summary of all Delta flights for each ATL route to other
cities for October 2004
– Arrest record for juveniles over a 5 year period
– Sales histories for individual stores in a retail chain
Building Case Histories
• Step 1: SORT dataset on the variable(s) that define(s)
the individual units/cases.
• Step 2: Set the previous dataset into a new one, using
the same BY statement as in the SORT.
– The new dataset “sees” the old dataset as a series of
“blocks” of measurements by individual cases
• Step 3: Define any variables you want to use to
summarize cases in RETAIN statement.
• Step 4: At beginning of each individual, reset
variables in Step 3 (typically to 0)
• Step 5: At end of each individual OUTPUT record
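The five steps above can be sketched on a toy dataset. The dataset names (weekly, history), the variable names and the data values are hypothetical and chosen only for illustration; the stores deliberately have different numbers of weeks.

```sas
/* Hypothetical data: a few weekly sales figures for two made-up stores */
data weekly;
  input storeid sales;
  cards;
2 50
1 100
1 120
2 70
1 90
;
run;

/* Step 1: sort on the variable that defines a case */
proc sort data=weekly;
  by storeid;
run;

data history;
  /* Step 2: same BY statement as the sort; this creates first.storeid and last.storeid */
  set weekly;
  by storeid;
  /* Step 3: keep the running totals from one row to the next */
  retain nweeks totsales;
  /* Step 4: reset the totals at the first row of each store */
  if first.storeid then do;
    nweeks=0;
    totsales=0;
  end;
  nweeks=nweeks+1;
  totsales=totsales+sales;
  /* Step 5: write a single record at the last row of each store */
  if last.storeid then do;
    meansales=totsales/nweeks;
    output;
  end;
  keep storeid nweeks totsales meansales;
run;

proc print data=history;
run;
```

The result has one record per store: store 1 with 3 weeks and total sales of 310, store 2 with 2 weeks and total sales of 120.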
Example - Brookstone Store Sales&Inventory
• 8 EXCEL Spreadsheets - 4 Quarters X 2 Measures
• 520 stores observed over 52 weeks
• Typical Spreadsheet Portion (4 stores X 6 weeks):
10  KING OF PRUSSIA  497,813  488,433  478,034  481,757  472,936  479,246
11  COLUMBIA         323,484  319,688  323,110  323,309  320,954  299,307
13  STAMFORD         372,537  383,548  383,112  383,424  379,760  375,930
14  PERIMETER        371,841  358,286  357,662  366,635  360,219  354,753
Note that the company provides 13 columns per store, one for
each of the 13 weeks in the quarter, which is not the layout we
want for analysis. The commas were also removed in Excel before
exporting to a text file.
Reading the Data in SAS
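data inv1;
  infile 'filename';
  /* read the store identifiers, then hold the line with the trailing @ */
  input storeid 6-8 storename $ 10-38 @;
  do week=1 to 13;
    input inv @;   /* next weekly value on the same line */
    output;        /* one observation per store and week */
  end;
run;
This creates 13 “observations” per store and a single inv variable.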
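To see the wide-to-long reading in action without an external file, here is a small sketch that uses in-stream data instead of infile. It takes the first two stores and their first three weekly values from the spreadsheet portion shown earlier; the dataset name inv_demo and the exact spacing of the data lines are assumptions made so that the fields fall in columns 6-8 and 10-38.

```sas
/* Sketch: same reading logic, two stores and three weeks taken from the slide */
data inv_demo;
  input storeid 6-8 storename $ 10-38 @;
  do week=1 to 3;
    input inv @;
    output;
  end;
  cards;
      10 KING OF PRUSSIA                497813 488433 478034
      11 COLUMBIA                       323484 319688 323110
;
run;
proc print data=inv_demo;
run;
```

This produces six observations, three per store, each with storeid, storename, week and inv. For the later quarters the loop bounds would presumably shift (for example, do week=14 to 26 for the second quarter) so that week numbers run from 1 to 52 after stacking.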
Reading the Data in SAS
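[Diagram: INV1 (Weeks 1-13), INV2 (Weeks 14-26), INV3 (Weeks 27-39) and INV4 (Weeks 40-52) are stacked with SET; SALES1 (Weeks 1-13), SALES2 (Weeks 14-26), SALES3 (Weeks 27-39) and SALES4 (Weeks 40-52) are stacked with SET; the two stacked datasets are then combined with MERGE.]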
Building a Store Record for Year
• Suppose Management wants following summary
measures for each store:
– Total sales
– Average sales to inventory ratio
– Mean and standard deviation of sales
– Correlation between sales and inventory
• We need the following quantities summed across
weeks:
– SALES, SALES^2, INV, INV^2, SALES*INV, SALES/INV
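These sums are sufficient because the sample variance, covariance and correlation all have computational forms that use only running totals. As a sketch, writing $S_w$ and $I_w$ for sales and inventory in week $w$ (notation introduced here, not on the slides) and $n = 52$ weeks per store:

```latex
\begin{align*}
\bar{S} &= \frac{1}{n}\sum_{w=1}^{n} S_w,
&
\overline{S/I} &= \frac{1}{n}\sum_{w=1}^{n} \frac{S_w}{I_w}, \\
s_S^2 &= \frac{\sum_w S_w^2 - \left(\sum_w S_w\right)^2 / n}{n-1},
&
s_I^2 &= \frac{\sum_w I_w^2 - \left(\sum_w I_w\right)^2 / n}{n-1}, \\
s_{SI} &= \frac{\sum_w S_w I_w - \left(\sum_w S_w\right)\left(\sum_w I_w\right) / n}{n-1},
&
r_{SI} &= \frac{s_{SI}}{s_S \, s_I}.
\end{align*}
```

Each numerator uses only the six running sums, so a single pass through a store's weeks is enough.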
SAS Code to Obtain Measures by Store (P1)
/* stack the four quarters of each measure, then sort by store */
data inv; set inv1-inv4; run;
proc sort data=inv; by storeid; run;
data sales; set sales1-sales4; run;
proc sort data=sales; by storeid; run;
/* rows are paired in order within each store, so both datasets
   must list the weeks in the same order */
data invsales; merge inv sales; by storeid; run;
proc sort data=invsales; by storeid; run;
data invsales1;
  set invsales;
  by storeid;
  retain sumsales sumsales2 suminv suminv2 salesxinv sales_inv;
  /* reset the running sums at the first week of each store */
  if first.storeid then do;
    sumsales=0; sumsales2=0; suminv=0; suminv2=0; salesxinv=0; sales_inv=0;
  end;
  sumsales=sumsales+sales; sumsales2=sumsales2+(sales**2);
  suminv=suminv+inv; suminv2=suminv2+(inv**2);
  salesxinv=salesxinv+(sales*inv); sales_inv=sales_inv+(sales/inv);
  /* at the last week, turn the sums into the store summary (n=52 weeks) */
  if last.storeid then do;
    totsales=sumsales; meansal_inv=sales_inv/52;
    meansales=totsales/52;
    varsales=(sumsales2-(sumsales**2)/52)/51; stdsales=sqrt(varsales);
    varinv=(suminv2-(suminv**2)/52)/51; stdinv=sqrt(varinv);
    covslinv=(salesxinv-(sumsales*suminv)/52)/51;
    corrslinv=covslinv/(stdsales*stdinv);
    output;
  end;
run;
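One way to cross-check hand-computed summaries like these is to let SAS procedures produce the same statistics by store. The sketch below is not part of the original slides; it uses a tiny hypothetical dataset (one made-up store, four weeks) in place of the real invsales data, and relies on PROC MEANS and PROC CORR with a BY statement.

```sas
/* Hypothetical data: one made-up store with four weeks of sales and inventory */
data invsales;
  input storeid week sales inv;
  cards;
1 1 10 100
1 2 12 110
1 3 11 105
1 4 15 120
;
run;

/* total, mean and standard deviation of sales for each store */
proc means data=invsales sum mean std;
  by storeid;
  var sales;
run;

/* Pearson correlation between sales and inventory for each store */
proc corr data=invsales;
  by storeid;
  var sales inv;
run;
```

By default PROC MEANS computes the standard deviation with an n-1 divisor, which matches the division by 51 used for 52 weeks in the data step.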