Chapter 1 - Department of Statistics

Chapter 1: Introduction
1.1: Course Logistics
1.2: Measuring Efficiencies
1.3: SAS DATA Step Processing
1.1: Course Logistics
Objectives
 List the tasks in the SAS Programming 3 course.
 Explain the naming convention that is used for the course files.
 Compare the three levels of exercises that are used in the course.
 Describe, at a high level, how data is used and stored at Orion Star Sports & Outdoors.
 Navigate to the Help facility.
Tasks in the SAS Programming 3 Course
The course topics include techniques for the following
data management tasks:
 compressing SAS data sets
 creating indexes for a quick retrieval of subsets
 performing table lookups using arrays, hash objects,
or formats
 combining data by merging, using the SQL procedure,
or using multiple SET statements
 combining summary and detail data
 sorting and grouping data
 developing a program quickly
Resource Utilization
As programmers, you want to perform these tasks
as efficiently as possible and optimize the use of the
following resources:
 programmer time
 I/O
 CPU
 memory
 data storage space
 network bandwidth
Business Scenarios
The business scenarios are opportunities to compare
multiple techniques for performing the tasks.
For example:
 Task: Table Lookups
 Possible Techniques:
– DATA step MERGE statement
– PROC SQL joins
– Formats in PUT functions or in FORMAT
statements
– DATA step arrays
– DATA step hash objects
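As a taste of one of these techniques, the following is a minimal sketch of a format-based table lookup. The format name $ctryfmt, the data set work.lookup_demo, and the country codes are hypothetical and are not taken from the course data.

```sas
/* Hypothetical lookup table stored as a character format */
proc format;
   value $ctryfmt 'AU'  = 'Australia'
                  'US'  = 'United States'
                  other = 'Unknown';
run;

data work.lookup_demo;
   length Country $ 2 Country_Name $ 20;
   input Country $;
   /* The PUT function applies the format, so no merge or join is needed */
   Country_Name = put(Country, $ctryfmt.);
   datalines;
AU
US
ZZ
;
run;

proc print data=work.lookup_demo;
run;
```

The same lookup could be written with a MERGE statement, a PROC SQL join, an array, or a hash object. Which one is most efficient depends on the data, which is why the course benchmarks them.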
1.01 Multiple Answer Poll
What type(s) of SAS programs do you write?
a. Data manipulation with the DATA step
b. Data analysis with procedures
c. Report writing
d. A combination of the above
e. SAS training only; no programs written
f. Other
Filename Conventions
Course file names follow the pattern shown by p304d01x:
p3   course ID
04   chapter #
d    type
01   item #
x    placeholder

Code   Type
a      Activity
d      Demo
e      Exercise
s      Solution

Sample file names: p304a01, p304a02, p304a02s, p304d01, p304d02, p304e01, p304e02, p304s01, p304s02

Example: The SAS Programming 3 course ID is p3, so p304d01 = SAS Programming 3, Chapter 4, Demo 1.
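The naming convention can also be read mechanically. The sketch below assumes the fixed positions implied by the example (two characters for the course ID, two for the chapter, one for the type, two for the item). The variable names are illustrative only.

```sas
data _null_;
   length fname $ 8 course_id $ 2 type $ 1 type_name $ 8;
   fname     = 'p304d01';                        /* example name from the slide */
   course_id = substr(fname, 1, 2);              /* p3 */
   chapter   = input(substr(fname, 3, 2), 2.);   /* 04 */
   type      = substr(fname, 5, 1);              /* d  */
   item      = input(substr(fname, 6, 2), 2.);   /* 01 */
   select (type);
      when ('a') type_name = 'Activity';
      when ('d') type_name = 'Demo';
      when ('e') type_name = 'Exercise';
      when ('s') type_name = 'Solution';
      otherwise  type_name = 'Unknown';
   end;
   /* Write the decoded parts to the log */
   put course_id= chapter= type_name= item=;
run;
```

For p304d01 the log line identifies course p3, chapter 4, a demo, and item 1, which matches the example above.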
Three Levels of Exercises
Level 1
The exercise mimics an example presented in the section.
Level 2
Less information and guidance are provided in the exercise instructions.
Level 3
Only the task you are to perform or the results to be obtained are provided. Typically, you will need to use the Help facility.
You are not expected to complete all of the
exercises in the time allotted. Choose the exercise
or exercises that are at the level with which you are
most comfortable.
Orion Star Sports & Outdoors
Orion Star Sports & Outdoors is a fictitious global sports
and outdoors retailer with traditional stores, an online store,
and a large catalog business.
The corporate headquarters is located in the United States
with offices and stores in many countries throughout the
world.
Orion Star has about 1,000 employees and 90,000
customers, processes approximately 150,000 orders
annually, and purchases products from 64 suppliers.
Orion Star Data
As is the case with most organizations, Orion Star has
a large amount of data about its customers, suppliers,
products, and employees. Much of this information is
stored in transactional systems in various formats.
Using applications and processes such as SAS Data
Integration Studio, this transactional information was
extracted, transformed, and loaded into a data
warehouse.
Data marts were created to meet the needs of specific
departments such as Marketing.
The SAS Help Facility
1.02 Quiz
 Start your SAS session.
 Open the Help facility.
 Determine the path to use to obtain information about the SAS component objects.
1.02 Quiz – Correct Answer
Determine the path to use to obtain information about the
SAS component objects.
Information relevant to this course can be found by following these paths in the SAS Help facility:
Contents tab
 SAS Products
 Base SAS
 SAS 9.2 Language Reference Dictionary
 Dictionary of Component Object Language Elements
SAS OnlineDoc
You can also obtain information from SAS OnlineDoc.
Information relevant to this course can be found by following these paths in SAS OnlineDoc:
Contents tab
 Products Documentation A-Z
 Base SAS
 SAS 9.2 Language Reference Dictionary
 Dictionary of Component Object Language Elements
1.2: Measuring Efficiencies
Objectives
 Identify the resources used by a SAS program.
 Report computer resource usage using SAS system options.
 Interpret resource usage statistics in your operating environment.
 Benchmark resource usage.
Running a SAS Program
What resources are required to run a SAS program?
The programmer must perform the following tasks:
 determine program specifications
 write the program
 test the program
 execute the program
 maintain the program
The computer must perform the following actions:
 load the required SAS software into memory
 compile the program
 read the data
 execute the compiled program
 store output data files
 store output reports
What Resources Are Used?
 programmer time
 CPU
 I/O
 memory
 data storage space
 network bandwidth
1.03 Multiple Answer Poll
Which of the following resources do you need to
conserve?
a. CPU
b. I/O
c. Memory
d. Data storage space
e. Network bandwidth
f. Your time
Understanding Efficiency Trade-offs
When you decrease the use of one resource, the use
of other resources might increase.
Resource usage is dependent on your data. A specific
technique might be more efficient with one data set and
less efficient with another.
Data space and CPU: Decreasing the size of a SAS data set can result in an increase in CPU usage.
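One way to observe the trade-off between data storage space and CPU is to write the same data with and without compression and compare the statistics in the log. This is only a sketch. It assumes that the orion libref is already assigned, and the output data set names are made up for the illustration.

```sas
options fullstimer;

/* Uncompressed copy */
data work.sales_plain;
   set orion.sales_history;
run;

/* Compressed copy: SAS spends CPU time compressing each observation */
data work.sales_comp(compress=yes);
   set orion.sales_history;
run;

options nofullstimer;
```

The log reports how much the compressed data set changed in size. For a real benchmark, each step should be run in its own SAS session and repeated several times.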
I/O and memory: Decreasing the number of I/O operations comes at the expense of increased memory usage.
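The BUFNO= data set option is one place where the trade-off between I/O and memory becomes visible, because it controls how many buffers SAS allocates for a data set. The sketch below assumes that the orion libref is assigned. The value 10 is arbitrary and is not a recommendation.

```sas
options fullstimer;

/* More buffers means more memory in use; whether I/O drops depends on the data */
data _null_;
   set orion.sales_history(bufno=10);
run;

options nofullstimer;
```

Comparing the memory and timing statistics for several BUFNO= values, each in a separate session, shows whether the extra memory is worth it for a particular data set.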
Deciding What Is Important for Efficiency
Consider the following:
 your site
 your programs
 your data
Understanding Efficiency at Your Site
Consider the following:
 hardware
 operating environment
 system load
 SAS environment
1.04 Multiple Choice Poll
This class uses SAS 9.2.
What is the latest version of SAS that you are running?
a. SAS 8.2
b. SAS 9.1
c. SAS 9.2
d. Other
Knowing How Your Program Will Be Used
The importance of efficiency increases with the following:
 the complexity of the program and/or the size of the
files being processed
 the number of times that the program will be executed
Knowing Your Data
1.05 Multiple Answer Poll
What type(s) of data do you use?
a. SAS data sets
b. External files
c. Data from a relational database – for example,
Oracle, Teradata, or SQL Server
d. Excel spreadsheets
e. OLAP cubes
f. Information maps
g. Other
Considering Trade-Offs
In this class, many tasks are performed using one or more
techniques.
To decide which technique is most efficient for a given
task, benchmark (that is, measure and compare) the resource
usage of each technique.
You should benchmark with the actual data to determine
which technique is the most efficient.
The effectiveness of any efficiency technique
depends greatly on the data with which you use
the technique.
Running Benchmarks: Guidelines
To benchmark your programming techniques, do the
following:
 Turn on the appropriate options to report resource
usage.
 Test each technique in a separate SAS session.
 Test only one technique or change at a time, with
as little additional code as possible.
 Run your tests under the conditions that your final
program will use (for example, batch execution,
large data sets, and so on).
 Run each program several times and base your conclusions on averages, not on a single execution. (This is more critical when you benchmark elapsed time.)
 Exclude outliers from the analysis because that data might lead you to tune your program to run less efficiently than it should.
 Turn off the options that report resource usage after testing is finished, because they consume resources.
In a multi-user environment, other computer activities might affect the running of your program.
1.06 Multiple Choice Poll
Which of the following SAS programs should be
benchmarked?
a. A report that shows all the customers in the United
Kingdom in March 2006
b. A report that calculates trends in sales at the end
of every day for every department
c. A report showing the projected total cost of a 5%
cost-of-living increase in employee salaries for a
Human Resources project conducted on January 1,
2007
d. A yearly report that calculates the average sales
of a line of apparel for the clothing manager
1.06 Multiple Choice Poll – Correct Answer
Which of the following SAS programs should be
benchmarked?
b. A report that calculates trends in sales at the end
of every day for every department
This program is executed every day for every department, so it runs far more often than the other reports, and the importance of efficiency increases with the number of executions.
Tracking Resource Usage
The following SAS options report resource usage:
 STIMER
 FULLSTIMER
 STATS (z/OS only)
 MEMRPT (z/OS only)
Tracking Resources with SAS Options
Windows, UNIX
OPTIONS STIMER | NOSTIMER;
OPTIONS NOFULLSTIMER | FULLSTIMER;
z/OS
STIMER | NOSTIMER (invocation option only)
OPTIONS STATS | NOSTATS;
OPTIONS MEMRPT | NOMEMRPT;
OPTIONS NOFULLSTIMER | FULLSTIMER;
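On Windows and UNIX, a typical pattern is to turn the reporting option on around the code being measured and off again afterward. The following is a minimal sketch of that pattern. It assumes that the orion libref is assigned, and the DATA step is only a placeholder for the step under test.

```sas
/* Show the current setting in the log */
proc options option=fullstimer;
run;

options fullstimer;      /* report detailed resource statistics */

data _null_;             /* placeholder for the step being measured */
   set orion.sales_history;
run;

options nofullstimer;    /* reporting itself consumes resources */
```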
Business Scenario
You should benchmark to determine the most efficient
technique for creating a new variable based on a
condition.
The following methods can be used:
 IF-THEN with an assignment statement
 IF-THEN/ELSE with an assignment statement
 SELECT/WHEN with an assignment statement
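The sketches below show what the IF-THEN/ELSE and SELECT/WHEN variants might look like. They are modeled on the IF-THEN program shown in the sample logs and are not the actual course files. The loop count of 1,000,000 is an arbitrary smaller value chosen so that the steps finish quickly.

```sas
options fullstimer;

/* IF-THEN/ELSE: testing stops at the first true condition */
data _null_;
   length var $ 30;
   do x=1 to 1000000;
      var1=10000000*ranuni(x);
      if var1>1000000 then var='Greater than 1,000,000';
      else if var1>=500000 then var='Between 500,000 and 1,000,000';
      else if var1>=100000 then var='Between 100,000 and 500,000';
      else if var1>=10000 then var='Between 10,000 and 100,000';
      else if var1>=1000 then var='Between 1,000 and 10,000';
      else var='Less than 1,000';
   end;
run;

/* SELECT/WHEN: the first WHEN expression that is true is used */
data _null_;
   length var $ 30;
   do x=1 to 1000000;
      var1=10000000*ranuni(x);
      select;
         when (var1>1000000) var='Greater than 1,000,000';
         when (var1>=500000) var='Between 500,000 and 1,000,000';
         when (var1>=100000) var='Between 100,000 and 500,000';
         when (var1>=10000)  var='Between 10,000 and 100,000';
         when (var1>=1000)   var='Between 1,000 and 10,000';
         otherwise           var='Less than 1,000';
      end;
   end;
run;

options nofullstimer;
```

The two steps appear together only for brevity. Following the benchmarking guidelines, each technique should be run in a separate SAS session.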
1.07 Quiz
1. Open and submit p301a01a.
Record the user CPU: ____________
Exit SAS.
2. Start SAS.
Open and submit p301a01b.
Record the user CPU: ____________
Exit SAS.
3. Start SAS.
Open and submit p301a01c.
Record the user CPU: ____________
4. Which technique is most efficient?
In z/OS, record the CPU time instead of the user CPU.
Sample Windows Log
Partial SAS Log (p301a01a)
options fullstimer;
data _null_;
   length var $ 30;
   retain var2-var50 0 var51-var100 'ABC';
   do x=1 to 100000000;
      var1=10000000*ranuni(x);
      if var1>1000000 then var='Greater than 1,000,000';
      if 500000<=var1<=1000000 then var='Between 500,000 and 1,000,000';
      if 100000<=var1<500000 then var='Between 100,000 and 500,000';
      if 10000<=var1<100000 then var='Between 10,000 and 100,000';
      if 1000<=var1<10000 then var='Between 1,000 and 10,000';
      if var1<1000 then var='Less than 1,000';
   end;
run;
NOTE: DATA statement used (Total process time):
      real time           1.26 seconds
      user cpu time       0.98 seconds
      system cpu time     0.04 seconds
      Memory              278k
      OS Memory           4976k
      Timestamp           6/29/2010 12:39:21 PM
Sample UNIX Log
Partial SAS Log (p301a01a)
options fullstimer;
data _null_;
   length var $30;
   retain var2-var50 0 var51-var100 'ABC';
   do x=1 to 10000000;
      var1=10000000*ranuni(x);
      if var1>1000000 then var='Greater than 1,000,000';
      if 500000<=var1<=1000000 then var='Between 500,000 and 1,000,000';
      if 100000<=var1<500000 then var='Between 100,000 and 500,000';
      if 10000<=var1<100000 then var='Between 10,000 and 100,000';
      if 1000<=var1<10000 then var='Between 1,000 and 10,000';
      if var1<1000 then var='Less than 1,000';
   end;
run;
NOTE: DATA statement used (Total process time):
      real time                       6.62 seconds
      user cpu time                   5.14 seconds
      system cpu time                 0.01 seconds
      Memory                          526k
      OS Memory                       5680k
      Timestamp                       6/29/2010 11:55:32 AM
      Page Faults                     82
      Page Reclaims                   0
      Page Swaps                      0
      Voluntary Context Switches      91
      Involuntary Context Switches    48
      Block Input Operations          91
      Block Output Operations         0
Sample z/OS Log
Partial SAS Log (p301a01a)
1.3: SAS DATA Step Processing
Objectives
 List the attributes of a data set page and define how it relates to the structure of SAS data sets.
 Describe how SAS reads and writes data.
SAS Data Set Pages
A SAS data set page has the following attributes:
 It is the unit of data transfer between the operating
system buffers and SAS buffers in memory.
 It includes the number of bytes used by the descriptor
portion, the data values, and any operating system
overhead.
 It is fixed in size when the data set is created, either
to a default value or to a value specified by the
programmer.
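A programmer-specified page size is typically requested with the BUFSIZE= data set option when the data set is created. The sketch below assumes that the orion libref is assigned. The output data set name and the value 32768 are chosen only for illustration.

```sas
/* Request a 32,768-byte page size when the data set is created */
data work.sales_32k(bufsize=32768);
   set orion.sales_history;
run;

/* Confirm the page size that was actually used */
proc contents data=work.sales_32k;
run;
```

The operating environment might adjust the requested value, so it is worth checking the Data Set Page Size reported by PROC CONTENTS.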
Using PROC CONTENTS to Report Page Size
proc contents data=orion.sales_history;
run;
Partial PROC CONTENTS Output
Engine/Host Dependent Information
Data Set Page Size            16384
Number of Data Set Pages      18
First Data Page               1
Max Obs per Page              92
Obs in First Data Page        72
Number of Data Set Repairs    0
File Name                     S:\workshop\sales_history.sas7bdat
Release Created               9.0201M0
Host Created                  XP_PRO
Data set size: 16,384 * 18 = 294,912 bytes
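The same page-size arithmetic can be done in code rather than by hand. The sketch below queries DICTIONARY.TABLES and assumes that its bufsize and npage columns are available in your release, which DESCRIBE TABLE DICTIONARY.TABLES can confirm. It also assumes that the orion libref is assigned.

```sas
proc sql;
   /* Library and member names are stored in uppercase */
   select memname,
          bufsize,
          npage,
          bufsize*npage as total_bytes format=comma15.
      from dictionary.tables
      where libname='ORION' and memname='SALES_HISTORY';
quit;
```

With the values shown in the PROC CONTENTS output (16384 and 18), total_bytes would be 294,912.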
1.08 Quiz (p301a02)
Use one of the following to determine the page size
of the orion.customer_dim SAS data set:
 the CONTENTS procedure
 the DATASETS procedure
 the SAS Explorer window
What is the page size of the SAS data set
orion.customer_dim?
1.08 Quiz – Correct Answer
What is the page size of the SAS data set
orion.customer_dim?
16,384 bytes in Windows
24,576 bytes in UNIX
18,432 bytes in z/OS
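Of the three methods listed in the quiz, the DATASETS procedure is the one not shown elsewhere in this section. A minimal sketch, assuming that the orion libref is assigned:

```sas
proc datasets library=orion nolist;
   /* The CONTENTS statement produces the same report as PROC CONTENTS */
   contents data=customer_dim;
quit;
```

The page size appears as Data Set Page Size in the Engine/Host Dependent Information part of the output.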
Reading External Files
The slides build up the following flow:
1. Input raw data is read from storage. Data might be cached in storage devices. On UNIX and Windows, data can also be cached by the OS file system.
2. Data moves from the caches into buffers in memory. I/O is measured here.
3. A record is moved from the buffers into the input buffer.
4. Data is converted from external format to SAS format as it is moved into the PDV (ID, Gender, Country, Name).
5. Observations are written from the PDV to output buffers in memory.
6. The output buffers are written to the output SAS data. I/O is measured here.
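The flow for an external file corresponds to a DATA step that uses INFILE and INPUT statements. The sketch below is self-contained. It first writes a small temporary raw file and then reads it back. The fileref rawin, the comma-delimited layout, and the two records are hypothetical. Only the variable names ID, Gender, Country, and Name come from the slides.

```sas
filename rawin temp;                     /* hypothetical temporary raw file */

data _null_;
   file rawin;
   put '101,F,AU,Hypothetical Customer A';
   put '102,M,US,Hypothetical Customer B';
run;

data work.customers;
   infile rawin dlm=',' dsd truncover;   /* each record passes through the input buffer */
   length ID 8 Gender $ 1 Country $ 2 Name $ 40;
   input ID Gender $ Country $ Name $;   /* values are converted and placed in the PDV */
run;                                     /* the PDV is written out at the end of each iteration */

proc print data=work.customers;
run;

filename rawin clear;
```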
Reading a SAS Data Set with a SET Statement
The slides build up the following flow:
1. The input SAS data is read from storage. Data might be cached in storage devices. On UNIX and Windows, data can also be cached by the OS file system.
2. Data moves from the caches into buffers in memory. I/O is measured here.
3. Values are moved from the buffers into the PDV (ID, Gender, Country, Name). No data conversion is necessary.
4. Observations are written from the PDV to buffers in memory and then to the output SAS data. I/O is measured here.
5. Sequential processing continues until the pointer reaches the end of the file.
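In code, this flow corresponds to a DATA step with a SET statement. The following minimal sketch assumes that the orion libref is assigned. The output data set name is made up for the illustration.

```sas
data work.customer_copy;
   /* Observations are read sequentially, with no input buffer and no conversion */
   set orion.customer_dim end=last;
   /* END= becomes 1 when the last observation has been read */
   if last then put 'Observations read: ' _n_;
run;
```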
Exercise
These exercises reinforce the concepts
discussed previously.
Chapter Review
1. What are the six resources consumed
by SAS programs?
2. What is the correct way to benchmark SAS programs?
3. What is a SAS data set page size?
Chapter Review Answers
1. What are the six resources consumed
by SAS programs?
 programmer time
 network bandwidth
 CPU
 Memory
 I/O
 disk storage space
2. What is the correct way to benchmark SAS programs?
a. Turn on the system options to report resource
usage.
b. Test each technique in a separate SAS session.
c. Test only one technique or change at a time.
d. Run the test under final conditions.
e. Run each program three to five times and
average the results.
f. Exclude outliers.
g. Turn off the resource usage reporting options.
3. What is a SAS data set page size?
The size of the SAS data set page is the unit of
data transfer between the system buffers and the
SAS buffers in memory. The default transfer is one
data set page at a time.
The page size determines the amount of memory
that is used when data is read and written. The
number of pages affects the I/O.