Getting your data into SAS: entering data with viewtable

An overview of the SAS system and how to run it.
What is SAS?
SAS (Statistical analysis system) is many things. Among other things, it can be used for data
management and report generation, as a statistical package including descriptive and inferential
statistics (for a wide variety of problems) and as a general programming environment, including a
portion called IML which allows for easy handling of matrices. The complete package is extremely
comprehensive, and continues to expand at a rapid rate. This course provides an introduction to
SAS with the primary focus on data management and statistical computing, including how to
program to handle customized analyses.
Important links through OIT:
- Info on OIT Classrooms (which have SAS):
http://www.oit.umass.edu/computer-classrooms
- Info on SAS site licenses if you want to buy your own copy:
http://www.oit.umass.edu/support/software/statistical-software-site-licences
- Basic instructions for using SAS, including online documentation for SAS
http://www.umass.edu/statdata/resources.html

Where do I run SAS?
a. On a PC with Windows: You can run SAS in any of the OIT computer classrooms or
you can buy a copy (renewable each year) for your own machine.
b. For graduate students in Mathematics and Statistics: There is a dedicated PC in
LGRT1537 with SAS on it.
c. SAS on demand. SAS now allows students to register and run SAS directly through a
SAS server. For this class, you will need to register for SAS OnDemand for Academics
and then access Client Bundle (Enterprise Guide and Enterprise Miner). Here is how to
get started:
- Access the following Web site: http://support.sas.com/ondemand/index.html#account
- Review the information and follow the steps at this site.
- If you have additional questions about using SAS® OnDemand for Academics, see http://support.sas.com/ondemand or contact me.

How do I run SAS?
On a PC, SAS is run interactively in the windows environment. When SAS is invoked there are
various sub-windows, the three main ones being the program window, the output window and
the log window. When you are actually using programming statements (as opposed to click and
go as in Analyst, etc.) the programming statements are put in the program window. The
program is then submitted (by clicking on the running figure), with output going to the output
window and the log window containing information (including error messages) about the job.
To run only portions of the program in the program window, you highlight that portion and
then submit. You can open and save files from the various windows. Various other aspects of
running in this environment will be demonstrated in class and are described in the on-line help.

The general structure of running SAS using programming statements versus “click and go”.
a. Using programming statements.
The data step creates a SAS data file. The data step might involve reading data from a
file in a particular way or consist of programming statements which generate a data file
(or a combination of the two). SAS data sets can also be created by importing data.
Once a SAS data set is in place, there is a large number of pre-programmed procedures
(procs) that can be run on that data. These are invoked with a proc statement. For the
most part, each proc has a fixed set of rules and options that must be followed.
b. Running SAS without “programming”.
There are various options now available in SAS for running SAS without programming,
but instead using drop down menus and “click and go”. We will not cover this type of
analysis. The analyst, the main tool for the “click and go” analysis, is easy to use
(although if you are proficient with SAS language, it is sometimes faster and easier to
program directly), creates the accompanying SAS code, produces graphs and has some
statistical tools not available in other parts of SAS (such as power and sample size
calculations). At the same time, it does not have all of the options available through the
procedures.
Getting your data into SAS: entering data with viewtable






Menu: Tools -> Table Editor
A blank viewtable is displayed. The rows (observations) are labeled with numbers and the
columns (variables) are labeled with letters. You can enter the data and SAS will automatically
figure out if your columns are numeric or character, according to the data values.
The column names can be changed by double clicking the default names. The column
attributes, such as the variable type, can be changed by right clicking the column name and
opening the Column Attributes window.
To save the data, click the save icon above the viewtable. Select a library, which corresponds
to a directory, and then specify the member name of your table. Note if you choose the library
Work, the data will only be available for the current SAS session.
Open and edit existing tables by clicking the open icon above the blank viewtable. Or one can
open an existing table through SAS explorer or windows explorer.
To reference the table you created in viewtable, use libraryname.membername. For example,
Proc print data=sasuser.test;
Run;
Getting started using SAS software
The SAS program




A SAS program is a sequence of statements executed in order.
Every SAS statement ends with a semicolon.
The layout of SAS programs is flexible:
1) SAS statements can be in upper or lower case
2) Statements can continue on the next line (don’t split words)
3) Statements can be on the same line as other statements, but use a semicolon to separate them
4) Statements can start in any column
Comments can be inserted into a program. Comments can be enclosed between an asterisk (*) and
a semicolon (;), or between a slash asterisk (/*) and an asterisk slash (*/).
*Add a new variable to the existing data sasuser.test;
data sasuser.test;
set sasuser.test;
height_cm=2.54*height; /* assumes height is recorded in inches */
run;
proc print data=sasuser.test; /* print the results*/
run;

SAS programs are constructed from two basic building blocks: DATA steps and PROC steps.
Most statements work in only one type of step. The basic differences between them:
DATA steps                                  PROC steps
begin with DATA statements                  begin with PROC statements
read and modify data                        perform specific analysis or function
many statements can be used: input data,    only a handful of possible statements
do loops, if-then/else logic and many       for each SAS procedure
functions
create a SAS data set                       produce results or report

A step ends when SAS encounters a new step, a RUN statement, or the end of the program.
RUN statements tell SAS to run all the preceding lines of the step and are among those rare
global statements that are not part of a DATA or PROC step.
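As a minimal sketch of the two building blocks, the program below uses sashelp.class, a sample table that ships with most SAS installations (an assumption about your setup). The data set name work.teens is made up for illustration.

```sas
/* DATA step: build a new data set from an existing one */
data work.teens;
   set sashelp.class;     /* read each observation of the sample table */
   if age >= 13;          /* subsetting IF: keep only ages 13 and up   */
run;                      /* RUN ends the DATA step                    */

/* PROC step: summarize the data set created above */
proc means data=work.teens;
   var height weight;
run;                      /* RUN ends the PROC step                    */
```

Leaving out the first RUN would still work here, because the PROC statement starts a new step and so ends the DATA step.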
SAS data sets





SAS data sets are also called tables, observations are also called rows, and variables are also
called columns.
There are just two types of variables: numeric and character. A variable containing letters or
special characters must be character data. Sometimes data that consist only of numerals make
more sense as character data than as numeric; ZIP codes are one example.
Missing character data are represented by blanks, and missing numeric data are represented by
a single period (.).
Rules for SAS names:
1). Names must be 32 characters or fewer in length
2). Names must start with a letter or an underscore ( _ )
3). Names can contain only letters, numerals, or underscores. No other special characters
4). Names can contain upper and lower case letters. SAS is insensitive to case.
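A small sketch of these points, using made-up names and values: the ZIP code is read as character so its leading zero is kept, a period marks a missing numeric value, and the PROC step refers to the same names in a different case.

```sas
data work.zipdemo;
   /* zip is read as character ($), so the leading zero is preserved */
   input name $ zip $ age;
   datalines;
Ann 01003 34
Bob 01002 .
;
run;

/* SAS names are not case sensitive: WORK.ZipDemo is the same data set */
proc print data=WORK.ZipDemo;
   var NAME Zip age;
run;
```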
Viewing data sets with SAS Explorer: Menu: View -> Explorer, where you can
1). Create new library
2). View existing data sets
3). List the properties of a SAS data set
Submitting SAS programs




SAS programs are entered into the editor window. Highlight the part of the program that you
want to run, then click on the submit icon at the top.
After submission, any notes, errors, or warnings associated with your program, as well as the
program statements themselves, will appear in the log window. It is very important to check
the SAS log for errors before proceeding to the results.
If your program generates any printable results, they will appear in the Output window.
The Results window is like a table of contents for your Output window.
Using SAS system options


System options control SAS: how it works, what the output looks like, how much memory is
used, error handling and a host of other things.
In the SAS windowing environment, there are two common ways of specifying system options:
1) Change selected options in the SAS System Options window. Menu: Tools -> Options
2) Use the OPTIONS global statement as a part of your SAS program:
OPTIONS LINESIZE=80 NODATE;
Common options:
CENTER | NOCENTER          Controls whether output is centered or left-aligned.
                           Default: CENTER
DATE | NODATE              Controls whether or not today’s date will appear at the top
                           of each page of output. Default: DATE
LINESIZE = n               Controls the maximum length of output lines. Possible values
                           for n are 64 to 256. Default varies according to system.
NUMBER | NONUMBER          Controls whether or not page numbers appear on each page of
                           SAS output. Default: NUMBER
ORIENTATION = PORTRAIT     Specifies the orientation for printing output.
ORIENTATION = LANDSCAPE    Default: PORTRAIT
PAGENO = n                 Starts numbering output pages with n. Default is 1.
PAGESIZE = n               Controls the maximum number of lines per page of output.
                           Possible values for n are 15 to 32767. Default varies.
RIGHTMARGIN = n            Specifies the size of the margin (such as 0.75in or 2cm) to be
LEFTMARGIN = n             used for printing output. Default: 0in
TOPMARGIN = n
BOTTOMMARGIN = n
YEARCUTOFF = yyyy          Specifies the cutoff year for DATE informats/functions
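A short sketch of setting several of these options at once and then checking one of them. The particular values are arbitrary choices for illustration, not recommendations.

```sas
/* set several system options in one global statement */
options nocenter nodate number pageno=1 linesize=80 pagesize=60;

/* write the current value of a single option to the log */
proc options option=linesize;
run;
```

An OPTIONS statement stays in effect for the rest of the SAS session, or until another OPTIONS statement changes the same option.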
Getting your data into SAS: continued
Methods of getting data into SAS:
1) Enter data directly with viewtable
2) Creating SAS data sets from raw data files
3) Converting other software’s data files into SAS data sets
4) Reading other software’s data files directly
Reading files with the Import Wizard
Menu: File -> Import Data. Show example: bands.csv
The Import Wizard can read all types of delimited files including comma-separated values
(CSV) files which are a common file type for moving data between applications. If you have
SAS/ACCESS for PC File Formats software, then you can also read a number of popular PC
file types (Excel, Access, Lotus, etc).
Creating SAS data sets from raw data files

Telling SAS where to find your raw data
If your data are in raw data files (also referred to as text, ASCII, sequential, or flat files), using
the DATA step to read the data gives you the most flexibility.
1) Internal raw data: use the CARDS or DATALINES statement to indicate internal data. The
CARDS or DATALINES statement must be the last statement in the DATA step. All lines
in the SAS program following the DATALINES statement are considered data until SAS
encounters a semicolon. The semicolon must be on a line by itself.
DATA uspresidents;
INPUT President $ Party $ Number;
DATALINES;
Adams F 2
Lincoln R 16
Grant R 18
Kennedy D 35
;
RUN;
2) External raw data: use the INFILE statement to tell SAS the filename and location
DATA uspresidents;
INFILE 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\president.txt';
INPUT President $ Party $ Number;
RUN;


The SAS log shows the number of records read from the file, which should be
checked to ensure the file is read correctly.
In some operating systems, SAS assumes external files have a record length of
256 or less. The default can be changed with the LRECL= option in the INFILE
statement:
INFILE 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\president.txt'
LRECL=2000;

Reading Raw Data Separated by Spaces
The simplest INPUT statement simply lists the variable names after the INPUT keyword in the
order they appear in the data file. This type of input is called list input (or free formatted input).
It is appropriate for reading data separated by spaces. It is easy to use but has a few
limitations:
a. You must read all the data in a record – no skipping
b. Any missing data must be indicated with a period
c. Character data cannot have embedded spaces and cannot be greater than eight
characters in length
d. Special data such as dates needs formatted input, therefore cannot be read by list
input
DATA HTWT;
INFILE 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\htwt.txt';
INPUT ID gender$ age height weight;
RUN;
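A self-contained variant of the same idea, using in-stream data so no external file is needed. The data set name and the values are made up for illustration; note the period standing in for each missing number, as limitation b requires.

```sas
data htwt_demo;
   /* list input: variables are named in the order they appear */
   input id gender $ age height weight;
   datalines;
1 F 23 68 155
2 M . 61 102
3 F 55 70 .
;
run;

proc print data=htwt_demo;
run;
```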

Reading Raw Data Arranged in Columns
Column input can be used if each variable’s value is always found in the same place on
the data line. It has the following advantages over list input:
i. Spaces are not required between values
ii. Missing values can be left blank
iii. Character data can have embedded spaces and can be longer than 8 characters
iv. You can skip unwanted variables
DATA sales;
INPUT visitingteam $ 1-20 ConcessionSales 21-24 BleacherSales 25-28
ourhits 29-31 theirhits 32-34 ourRuns 35-37 theirRuns 38-40;
*---+----1----+----2----+----3----+----4;
DATALINES;
Columbia Peaches      35  67  1 10  2  1
Plains Peanuts       210      2  5  0  2
Gilroy Garlics        151035 12 11  7  6
Sacramento Tomatoes  124  85 15  4  9  1
;
RUN;

Reading Raw Data Not in Standard Format
Numbers with embedded commas or dollar signs, and dates like 10-31-2003 or 31OCT03 are
examples of non-standard data. These data can be read with formatted input.
DATA contest;
INPUT name $ 1-16 age type $ date mmddyy10. score1 amount comma9.1;
*format date mmddyy10.;
*format amount comma9.2;
*---+----1----+----2----+----3----+----4----+;
DATALINES;
Alica Grossman   13 c 10-28-2003 7.8 $1,000.8
Matthew Lee       9 D 10-30-2003 6.5 $1,023.4
;
PROC PRINT DATA=contest;
RUN;
General format for informats (input formats):
Character    $informatw.
Numeric      informatw.d
Date         informatw.
Commonly used informats:
a. For character: $w.
b. For numeric: w.d, commaw.d
c. For date: mmddyyw.
Summary on types of input
1. list input: easy to use
2. column input: allows spaces within a variable field; does not require spaces between
variables
3. formatted input: reads special data
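To compare the three styles side by side, here is a sketch that reads similar made-up records each way. The data set names, variable names and column positions are assumptions chosen for this illustration.

```sas
/* 1. list input: values separated by blanks, no embedded spaces */
data list_style;
   input name $ age weight;
   datalines;
Ann 34 120
;
run;

/* 2. column input: each value sits in fixed columns */
data column_style;
   input name $ 1-10 age 11-13 weight 14-17;
   datalines;
Ann Lee    34 120
;
run;

/* 3. formatted input: informats give widths and read special data */
data formatted_style;
   input name $10. age 3. weight 4. +1 visit mmddyy10.;
   format visit mmddyy10.;
   datalines;
Ann Lee    34 120 10-28-2003
;
run;
```

Only the column and formatted versions can keep the embedded space in "Ann Lee", and only the formatted version can read the date.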
/*Exercise1*/
Variable names: Name Age Type Date Score1-Score5
Alicia Grossman  13 c 10-28-2003 7.8 6.5 7.2 8.0 7.9
Matthew Lee       9 d 10-30-2003 6.5 5.9 6.8 6.0 8.1
Elizabeth Garcia 10 C 10-29-2003 8.9 7.9 8.5 9.0 8.8
/*Exercise2*/
Variable names: ID Name Score;
1 stevenson 89
2 cody 100
3 smith 55
4 gettlefinger 92
/*Exercise3*/
Variable names: ParkName State Year Acreage
Yellowstone           ID/MT/WY 1872    4,065,493
Everglades            FL 1934          1,398,800
Yosemite              CA 1864            760,917
Great Smoky Mountains NC/TN 1926         520,269
Wolf Trap Farm        VA 1966                130

Mixing input styles
One needs to be careful when mixing input styles. Note when SAS reads a line of raw data,
it uses a pointer to mark its place. Each style of input uses the pointer a little differently.
After reading a variable,
1. if formatted input or column input, the pointer is left in the column that immediately
follows the last column that is read.
2. List input, on the other hand, uses a scanning method to determine the pointer
location. With list input, the pointer reads until a blank is reached and then stops in
the next column. To read the next variable value, the pointer moves automatically to
the first nonblank column, discarding any leading blanks it encounters.
There are three commonly used column pointers and two line pointers.
Three column pointers:
1. @n, put pointer at column n
2. +n, move pointer to the right by n columns
3. @'character', put pointer immediately after the specified character string
data contest;
input name $16. age 3. +1 type $1. +1 date mmddyy10. score1 5.1
dollar comma10.1;
format dollar comma9.2;
*---+----1----+----2----+----3----+----4----+;
datalines;
Alica Grossman   13 c 10-28-2003 7.8 $1,000.8
Matthew Lee       9 D 10-30-2003 6.5 $1,023.4
;
proc print data=contest;
run;
or equivalently,
data contest;
input name $16. age 3. @21 type $1. @23 date mmddyy10. score1 5.1
@38 dollar comma8.1;
*---+----1----+----2----+----3----+----4----+;
datalines;
Alica Grossman   13 c 10-28-2003 7.8 $1,000.8
Matthew Lee       9 D 10-30-2003 6.5 $1,023.4
;
run;
/* @'character' pointer */
data weblogs;
input @'[' accessdate date11. @'/' file :$20.;
datalines;
130.192.70.235 -- [08/Jun/2001:23:51:32 - 0700] "GET /rover.jpg http/1.1" 200 66820
128.32.236.8 -- [08/Jun/2001:23:51:40 - 0700] "GET /grooming.html http/1.0" 200 8471
;
proc print data=weblogs;
run;
Colon informat modifier (:$20. in the above example): The colon informat modifier indicates that the
value is to be read from the next nonblank column until the pointer reaches the next blank column or
the end of the data line, whichever comes first. Though the data step continues reading until it
reaches the next blank column, it truncates the value of a character variable if the field is longer than its
formatted length. If the length of the variable has not been previously defined, its
value is read and stored with the informat length.
/* Note that pointer can also be used to change the order of reading variables*/
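A minimal sketch of that last point, with made-up data: the @n pointer jumps to the score first and then back to column 1 for the name, so the variables end up stored in a different order from their layout on the data line.

```sas
data reorder;
   /* read columns 12-14 first, then go back to column 1 */
   input @12 score 3. @1 name $10.;
   datalines;
Ann Lee     87
Bob Stone   92
;
run;

proc print data=reorder;   /* score is listed before name */
run;
```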

Multiple lines of data per observation
data highlow;
input city $ state $ normalHigh normalLow RecordHigh RecordLow;
datalines;
Nome AK
55 44
88 29
Miami FL
90 75
97 65
Raleigh NC
88 68
105 50
;
proc print;
run;
/*SAS secret: the pointer moves to the next line if there are more variables than data
on the current line*/
data highlow;
input city $ state $;
input normalHigh normalLow;
input RecordHigh RecordLow;
datalines;
Nome AK
55 44
88 29
Miami FL
90 75
97 65
Raleigh NC
88 68
105 50
;
proc print;
run;
/*SAS secret: the pointer moves to the next line after an input statement;*/
Two line pointers: / moves the pointer to the next line, and #n moves it to the specified line. Using
line pointers, one can skip certain lines and read selected lines.
data highlow;
input city $ state $
/
normalHigh normalLow
#3 RecordHigh RecordLow;
datalines;
Nome AK
55 44
88 29
Miami FL
90 75
97 65
Raleigh NC
88 68
105 50
;
proc print;
run;
SAS secret: Data steps execute line by line and observation by observation
data highlow;
input city $ state $
/
normalHigh normalLow #3;
datalines;
Nome AK
55 44
88 29
Miami FL
90 75
97 65
Raleigh NC
88 68
105 50
;
proc print;
run;

Multiple observations per line and reading part of raw data
Use line holders: the double trailing @@ or the single trailing @. A double trailing @@ at the
end of an INPUT statement instructs SAS to hold that line of data and
continue reading observations from it until SAS either runs out of data or reaches an INPUT statement
that does not end with a double trailing @@. A single trailing @ also holds the line of data, but only
until SAS reaches an INPUT statement that does not end with a trailing @,
or reaches the end of the DATA step for the current observation; the pointer is then released to the
next line. The double trailing @@ holds the line more strongly: it holds a line even when SAS starts
building a new observation.
@@ is mainly used to read data with multiple observations per line
data rainfall;
input city $ state $ normalRain meanDaysRain @@;
datalines;
Nome AK 2.5 15 Miami FL 6.75
18 Raleigh NC . 12
;
proc print;
run;
@ is mainly used to read part of raw data
/* The following data step outputs "freeway" data only*/
data traffic;
input type $ name $ 9-38 AMtraffic PMtraffic;
if type="freeway";
Mtraffic=mean(of AMtraffic PMtraffic);
*---+----1----+----2----+----3----+----4;
datalines;
freeway 408                            3684 3459
surface Martin Luther King Jr. Blvd.   1590 1234
surface Broadway                       1259 1290
surface Rodeo Dr.                      1890 2067
freeway 608                            4583 3860
freeway 808                            2386 2518
surface Lake Shore Dr.                 1590 1234
surface Pennsylvania Ave.              1259 1290
;
proc print;
run;
data traffic;
input type $ @;
if type="surface" then delete;
if type="freeway" then input name $ 9-38 AMtraffic PMtraffic;
datalines;
freeway 408                            3684 3459
surface Martin Luther King Jr. Blvd.   1590 1234
surface Broadway                       1259 1290
surface Rodeo Dr.                      1890 2067
freeway 608                            4583 3860
freeway 808                            2386 2518
surface Lake Shore Dr.                 1590 1234
surface Pennsylvania Ave.              1259 1290
;
proc print;
run;
/*double trailing and single trailing do not make a difference in the following
example*/
data traffic;
input type $ @@;
if type="surface" then
input name_c $ 9-38 AMtraffic PMtraffic;
if type="freeway" then
input name_n $ 9-38 AMtraffic PMtraffic;
datalines;
freeway 408                            3684 3459
surface Martin Luther King Jr. Blvd.   1590 1234
surface Broadway                       1259 1290
surface Rodeo Dr.                      1890 2067
freeway 608                            4583 3860
freeway 808                            2386 2518
surface Lake Shore Dr.                 1590 1234
surface Pennsylvania Ave.              1259 1290
;
proc print;
run;
Exercise:
Each line contains a three-digit number, a two-digit number, and a four-digit
number, with four sets of numbers for each subject;
datalines;
123121234217874444123872345873235432
192837465748392919283747372818182838
;
Solution 1: put the data in a file called, for example, ex.txt in the directory
C:\Users\anna\Documents\stat597.F13. Then input with the following SAS program
data ex;
infile "C:\Users\anna\Documents\stat597.F13\ex.txt";
input num1 3. num2 2. num3 4. @@;
run;
Solution 2: with in-stream data (datalines), SAS automatically pads each data line with spaces
to a width of 80. There is no easy way of removing the padding (see a
discussion of the exact same problem here:
http://www.listserv.uga.edu/cgi-bin/wa?A2=ind0405a&L=sas-l&P=1399). One way to read this
data with datalines:
data ex;
do set=1 to 4;
input num1 3. num2 2. num3 4. @;
output;
end;
datalines;
123567890123456789123567890123456789
321459876321459876321459876321459876
;
run;
proc print;run;

Reading delimited files with the data step
To read files delimited by commas or tabs, use the DLM= option on the INFILE statement.
*Reading comma delimited files;
data HTWT;
infile datalines dlm=',';
input ID name$ age height weight;
datalines;
1, Ali G, 23, 68, 155
2, Mat L, , 61, 102
3, Liz G, 55, 70, 202
;
proc print;run;
*Use dlm='09'X to read tab delimited files;
data HTWT;
infile "C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\htwt_tab.txt" dlm='09'X;
input ID name$ age height weight;
proc print;run;
data HTWT;
infile datalines dlm=',' dsd;
input ID name $ age height weight;
datalines;
1, "G, Ali", 23, 68, 155
2, "L, Mat", , 61, 102
3, "G, Liz", 55, 70, 202
;
proc print data=htwt;
run;
More options with the infile statement
1. DLM= option: specifies the delimiter. Two delimiters in a row are treated as one.
2. DSD option (DSD stands for delimiter-sensitive data). It does three things:
1) Ignores delimiters enclosed in quotation marks
2) Strips off the quotation marks
3) Treats two consecutive delimiters as a missing value
3. firstobs=n: start reading at line n of the file
4. obs=n: stop reading after line n of the file
5. lrecl=n: set the record length to n (for example, lrecl=256)
6. missover: use when missing values are anticipated at the end of a data line; the remaining
variables are set to missing instead of being read from the next line
7. truncover: use truncover if you are using column or formatted input for the last variable and
expect some of its data values to be shorter than others.
8. pad: pad each line of data with blank spaces
data HTWT;
infile 'C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\htwt_tab.txt' dlm='09'X
lrecl=5;
input ID gender$ age height weight;
proc print data=htwt;
run;
Note: DATALINES automatically pads the data lines, whereas lines read from an external file are not padded.
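A small sketch combining two of the options above, firstobs= and missover, on made-up in-stream data. The first data line is a header that should be skipped, and some lines end early.

```sas
data scores;
   /* skip the header line; do not jump to the next line when values run out */
   infile datalines firstobs=2 missover;
   input name $ test1 test2 test3;
   datalines;
name test1 test2 test3
Ann 90 85 77
Bob 88 91
Cat 70
;
run;

proc print data=scores;
run;
```

Without missover, SAS would go to the next data line to look for Bob's third score and would read part of Cat's line into Bob's observation.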
/*Exercise*/
Variable names: Name, Number, Address
John Garcia    114  Maple Ave.
Sylvia Chung  1302  Washington Drive
Martha Newton   45  S.E. 14th St.
;
/*Exercise*/
Variable names: BandName, GigDate, EightPM, NinePM, TenPM, ElevenPM
Lupine Lights,12/3/2003,45,63,70,
Awesome Octaves,12/15/2003,17,28,44,12
"Stop,Drop, and Rock-N-Roll",1/5/2004,34,62,77,91
The Silveyville Jazz Quartet,1/18/2004,38,30,42,43
Catalina Converts,1/13/2004,56,,65,34
;

Reading delimited files or PC files with the import procedure
Proc import datafile='filename' out=data-set DBMS=identifier REPLACE;
If the filename has the extension .csv, .txt, .xls, .dbf, .wk1, .wk3, or .wk4, the DBMS option can be
skipped. Otherwise, use DBMS=DLM, and specify the delimiter with the
DELIMITER='delimiter-character' statement;
Proc import datafile='C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\Bands.csv'
out=bandsData REPLACE;
getnames=no;
run;
proc print;run;
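For a file with some other delimiter, a sketch along the following lines may help. To keep it self-contained it first writes a small pipe-delimited file into the WORK directory; the file name bands_pipe.txt and its contents are made up for this illustration.

```sas
/* write a tiny pipe-delimited file into the WORK directory (made-up data) */
%let demofile = %sysfunc(pathname(work))/bands_pipe.txt;
data _null_;
   file "&demofile";
   put 'Lupine Lights|45|63';
   put 'Awesome Octaves|17|28';
run;

/* DBMS=DLM plus a DELIMITER statement names the delimiter explicitly */
proc import datafile="&demofile" out=work.bands_pipe dbms=dlm replace;
   delimiter='|';
   getnames=no;   /* no header row, so variables are named VAR1, VAR2, ... */
run;

proc print data=work.bands_pipe;
run;
```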

Permanent SAS data sets
When a SAS data set is created and given a two-level name, libname.memberName, it is
permanent unless the libname is WORK or is left unspecified. A library can be defined through
the point-and-click menus. It can also be defined through a SAS statement:
libname mysas "C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12";
Proc import datafile='C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\Bands.csv'
out=mysas.bandsData REPLACE;
getnames=no;
run;
You can see a SAS data set called bandsData.sas7bdat created under the directory to which the
library mysas corresponds.
Another way to create permanent SAS data set is through direct referencing:
data "C:\Documents and Settings\anna\Desktop\MyDesktop\597.F12\bands";
infile datalines dlm="," dsd;
input BandName :$50. GigDate :mmddyy10. EightPM NinePM TenPM ElevenPM;
datalines;
Lupine Lights,12/3/2003,45,63,70,
Awesome Octaves,12/15/2003,17,28,44,12
"Stop,Drop, and Rock-N-Roll",1/5/2004,34,62,77,91
The Silveyville Jazz Quartet,1/18/2004,38,30,42,43
Catalina Converts,1/13/2004,56,,65,34
;
run;
You can see a SAS data set called bands.sas7bdat created under the specified directory.

The contents procedure
proc contents data=datasetName;
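As a quick sketch of what this looks like on a real table, the step below describes sashelp.class, a sample data set that ships with most SAS installations (an assumption about your setup). Substitute any libname.memberName of your own.

```sas
/* list the variables, their types and lengths, and the number of observations */
proc contents data=sashelp.class;
run;
```

The output is a description of the data set rather than a listing of its values, which makes it a convenient check after reading in data.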
[Figure: input data (Observations 1-4), SAS program (Lines 1-3), output data (Observations 1-4)]