Lect_1 2008


Data Processing with PC-SAS

J. Michael Oakes, PhD

Associate Professor

Division of Epidemiology

University of Minnesota oakes007@umn.edu

Lecture 1/4

Lecture 1

• COMPUTER NUMBER SYSTEMS

• DATA AND DATABASES

• PC-SAS INTERFACE

• SAS HELP

• INTRO TO PDV

• RESEARCH ETHICS

Binary Numbers

Decimal Numbers (base-10 system) and Digits:

A digit is a single place that can hold numerical values between 0 and 9.

Digits are combined together in groups to create larger numbers

It is understood that in the number 6,357 the 7 is filling the "1s place," the 5 is filling the 10s place, the 3 is filling the 100s place, and the 6 is filling the 1,000s place.

So you could express things this way if you wanted to be explicit:

(6 * 1000) + (3 * 100) + (5 * 10) + (7 * 1) = 6000 + 300 + 50 + 7 = 6357


Binary Numbers

Another way to express 6357 would be to use powers of 10.

Assuming that we are going to represent exponentiation with the “hat” or ^ symbol, we may express the quantity as,

(6 * 10^3) + (3 * 10^2) + (5 * 10^1) + (7 * 10^0) = 6357

Notice each digit is a placeholder for the next higher power of 10, starting in the first (rightmost) digit with 10 raised to the power of zero (recall X^0 = 1).

Binary Numbers

Definition of “Binary” from the OED -

Binary arithmetic: a method of computation in which the binary scale is used, suggested by Leibnitz.

Binary scale: the scale of notation whose ratio is 2, in which, therefore,

1 of the ordinary (denary) scale is expressed by 1, 2 by 10, 3 by 11, 4 by 100, etc.

Binary Numbers

All programs and data are ultimately recognized as just patterns of 0's and 1's by the digital computer.


Binary Numbers

Why Use 'Em?

For computers, binary numbers are great stuff because:

Noise-resistant

Simple

Mimic Circuit

Not continuous, thus on or off.

They are simple to work with -- no big addition tables and multiplication tables to learn, just do the same things over and over, very fast.

They just use two values of voltage, magnetism, or other signal, which makes the hardware easier to design and more noise resistant.

Binary Numbers

Decimal 1 is binary 0001

Decimal 3 is binary 0011

Decimal 6 is binary 0110

Decimal 9 is binary 1001

Each digit "1" in a binary number represents a power of two, and each "0" represents zero:

0001 is 2 to the zero power, 2^0, or 1

0010 is 2 to the 1st power, 2^1, or 2

0100 is 2 to the 2nd power, 2^2, or 4

1000 is 2 to the 3rd power, 2^3, or 8.

2^0 + 2^1 + 2^2 + 2^3 + 2^4 + … + 2^n

Binary Numbers

When you see a number like "0101" you can figure out what it means by adding the powers of 2:

0101 = 0 + 4 + 0 + 1 = 5

1010 = 8 + 0 + 2 + 0 = 10

0111 = 0 + 4 + 2 + 1 = 7
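The same conversions can be checked in SAS. This is a minimal sketch, assuming the BINARYw. format and informat behave as documented; the variable names n and bits are made up for illustration, and the results are written to the SAS log.

```sas
data _null_;
  /* Decimal to binary: the BINARYw. format writes a number as w binary digits */
  do n = 1, 3, 6, 9;
    bits = put(n, binary4.);
    put n= bits=;
  end;

  /* Binary to decimal: the BINARYw. informat reads a string of 0s and 1s */
  do bits = '0101', '1010', '0111';
    n = input(bits, binary4.);
    put bits= n=;
  end;
run;
```

Compare the log lines with the hand calculations above.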


Binary Numbers

The word bit is a shortening of the words "Binary digIT"

It is the smallest possible unit of information.

Each digit or “place” is a bit:

2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 + 2^8 + 2^9 + 2^10

place  1 2 3 4 5 6 7 8
bit    0 1 2 3 4 5 6 7

Binary Numbers

8 bits is usually called a "byte," which is the size usually used to represent an alphabetic character in ASCII -- "A" is 65, or 01000001.

With 8 bits in a byte, you can represent 256 values ranging from 0 to 255, as shown here:

0 = 00000000

1 = 00000001

2 = 00000010

...

254 = 11111110

255 = 11111111

Binary Numbers

When you start talking about lots of bytes, you get into prefixes like kilo, mega and giga, as in kilobyte, megabyte and gigabyte.

The following table shows the multipliers:

Name   Abbr.   Size
Kilo   K       2^10 = 1,024
Mega   M       2^20 = 1,048,576
Giga   G       2^30 = 1,073,741,824
Tera   T       2^40 = 1,099,511,627,776
Peta   P       2^50 = 1,125,899,906,842,624
Exa    E       2^60 = 1,152,921,504,606,846,976


Character Representation

Bytes are frequently used to hold individual characters in a text document.

Data are usually either numbers or letters (strings).

Of course, memory can only hold binary numbers, so we have to agree how to interpret those numbers (a kind of code) when we want them to represent letters (or numbers, for that matter).

(That computers can process text or string data is what made them popular!)

Character on Screen   Binary Value
1                     0000 0001
2                     0000 0010
3                     0000 0011
a                     0110 0001
A                     0100 0001
b                     0110 0010
B                     0100 0010

Character Representation

String - A sequence of printable characters, delimited by quotes in Stata, SAS, etc.

Examples: “hello world” “String” “1243212”

“ ” is called the null string

Numeric - Easy to code, but there is an issue of precision

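The number 1.1 in binary is 1.0001100110011… repeating… much like 1/11 in decimal, which is 0.09090909…

Stored in 4 bytes (single precision), 1.1 is “rounded” to 1.100000238419, which is off a little. The default 8-byte numerics in SAS come much closer, but are still not exact.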

Character Representation

Also store ID variables as STRINGS!!!

Say that we have a tiny data file which has just ID variables like the one below.

123456789

123456790

123456791

123456792

123456793

123456794

123456795

123456796

If we read these IDs as numbers and list the values, they are displayed in scientific notation, so it is hard to read the values.

        id
1.  1.23e+08
2.  1.23e+08
3.  1.23e+08
4.  1.23e+08
5.  1.23e+08
6.  1.23e+08
7.  1.23e+08
8.  1.23e+08

Displaying more digits does not help: stored in single precision, the IDs are no longer distinct.

        id
1.  123456792
2.  123456792
3.  123456792
4.  123456792
5.  123456792
6.  123456792
7.  123456792
8.  123456800
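In SAS, one way to follow this advice is to read the ID with a character informat. The sketch below is illustrative only: the data set name ids is hypothetical, and the $9. informat assumes every ID is at most nine characters wide.

```sas
data ids;
  /* The $ informat makes id a character variable, so every digit is kept as typed */
  input id $9.;
  datalines;
123456789
123456790
123456791
123456792
;
run;

proc print data=ids;
run;
```

A character ID also keeps any leading zeros, which a numeric variable would drop.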


Character Representation

SAS - Data and storage types

The default length of numeric variables in SAS data sets is 8 bytes.

Control the length of SAS numeric variables with the LENGTH statement in the DATA step (more on this later).

In PC-SAS, the Windows data type of numeric values that have a length of 8 is LONG REAL. The precision of this type of floating-point value is 16 decimal digits.

A character variable with a length of 1 byte can serve the same purpose as a dummy (0,1) variable.
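A short sketch of the LENGTH statement, assuming a Windows session where numeric lengths of 3 to 8 bytes are allowed. The data set and variable names (people, id, female, age) are invented for illustration.

```sas
data people;
  /* LENGTH is placed before the variables are first used */
  length id $9 female $1 age 4;
  input id $ female $ age;
  datalines;
123456789 1 43
123456790 0 31
;
run;

/* PROC CONTENTS lists each variable's type and length */
proc contents data=people;
run;
```

Shortening a numeric variable saves disk space but reduces the largest integer it can hold exactly, so it is best reserved for small whole numbers such as ages or counts.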

Character Representation

Regardless of how much precision is available, there is still the problem that some numbers cannot be represented exactly.

In the decimal number system, the fraction 1/3 cannot be represented exactly in decimal notation. Likewise, most decimal fractions (for example, .1) cannot be represented exactly in the binary numbering system.

Character Representation

Imprecision can also cause problems with comparisons.

Consider the following example, in which the PUT statement is not executed:

data _null_;
  x = 1/3;
  if x = .33333 then put 'MATCH';
run;


Character Representation

If you add the ROUND function, as in the following example, the PUT statement is executed:

data _null_;
  x = 1/3;
  if round(x, .00001) = .33333 then put 'MATCH';
run;

In general, if you are doing comparisons with fractional values, it is good practice to use the ROUND function.
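The same issue shows up when fractions are accumulated. The sketch below adds 0.1 ten times and compares the result with 1, with and without ROUND. The variable names are arbitrary, and the outcome of the exact comparison depends on the platform's floating-point arithmetic, so check your own log rather than take it on trust.

```sas
data _null_;
  total = 0;
  do i = 1 to 10;
    total = total + 0.1;   /* 0.1 has no exact binary representation */
  end;

  if total = 1 then put 'Exact comparison: MATCH';
  else put 'Exact comparison: no match';

  if round(total, 0.00001) = 1 then put 'Rounded comparison: MATCH';
  else put 'Rounded comparison: no match';

  /* HEX16. writes the stored floating-point value in hexadecimal */
  put total= hex16.;
run;
```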

ASCII Data


ASCII

- The American Standard Code for Information Interchange is a standard seven-bit code that was proposed by the American National Standards Institute (ANSI) in 1963 and finalized in 1968. Other sources also credit much of the work on ASCII to work done in 1965 by Robert W. Bemer.

In this system, 7 bits are used to represent 128 (2^7) different letters, numbers, punctuation, and special codes. When the eighth bit is used (why waste it?), we have Extended ASCII, in which 256 characters are available. The "upper" 128 characters are not as standard as the first 128. Refer to an Extended ASCII chart on the web.

ASCII Data

If you use Notepad in Windows 95/98 to create a text file containing the words,

Four score and seven years ago

Notepad would use 1 byte of memory per character (including 1 byte for each space character between the words -- ASCII character 32).

When Notepad stores the sentence in a file on disk, the file will also contain 1 byte per character and per space.


ASCII Data

If you were to look at the file as a computer looks at it, you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character (see below). So on disk, the numbers for the file look like this:

Char  F    o    u    r    (sp) a    n    d    (sp) s    e    v    e    n
Code  70   111  117  114  32   97   110  100  32   115  101  118  101  110

By looking in the ASCII table, you can see a one-to-one correspondence between each character and the ASCII code used.
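To look up the codes without a printed table, SAS offers the RANK and BYTE functions. This is a minimal sketch that assumes an ASCII session (as on PC-SAS); the variable names are arbitrary.

```sas
data _null_;
  length ch $1;
  text = 'Four and seven';
  do i = 1 to length(text);
    ch   = substr(text, i, 1);
    code = rank(ch);    /* character to its code in the session's collating sequence */
    back = byte(code);  /* code back to the character */
    put i= ch= code= back=;
  end;
run;
```

On an EBCDIC system the same program would print different codes, which is the point of the EBCDIC discussion that follows.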

Character Representation

The history of ASCII is the history of codes, beginning with Morse code: http://tronweb.super-nova.co.jp/characcodehist.html

EBCDIC Data

ASCII code was adopted by all U.S. computer manufacturers except IBM, which developed a proprietary character code for its mainframe computers.

IBM created a proprietary 8-bit character code (2^8 = 256 code points) called EBCDIC [pronounced eb-see-dick], which stands for "Extended Binary Coded Decimal Interchange Code."

It was used on the successful IBM System/360 mainframe computer series, which hit the market in April 1964.

Today, many businesses that have data in EBCDIC files are converting them into ASCII and other non-proprietary formats


Hence, Two Kinds of “Data”

Data are either:

Numeric

(e.g., 1 12001 0.4532)

String

(e.g., Y NO “This is totally boring!”)

Databases

Database - a computerized record keeping system.

• More completely, it is a system involving data, the hardware that physically stores that data, the software that utilizes the hardware's file system in order to (1) store the data and (2) provide a standardized method for retrieving or changing the data, and finally, the users who turn the data into information.

• In the 1960s, databases were created to solve the problems with file-oriented systems in that they were compact, fast, easy to use, current, accurate, allowed the easy sharing of data between multiple users, and were secure.

• A database might be as complex and demanding as an account tracking system used by an insurance company to manage the constantly changing accounts of thousands of subscribers, or it could be as simple as a collection of email addresses on your computer.

• The important thing is that a database allows one to store data and then retrieve or modify it, easily and efficiently, whenever needed, regardless of the amount of data being manipulated. What the data are and how demanding you will be when retrieving and modifying them is simply a matter of scale.

Databases

Traditionally, databases ran on large, powerful mainframes for business applications.

Such machines use packages like Oracle 8 or Sybase SQL (structured query language) Server. SAS is really a database program!

However, with the advent of small, powerful personal computers, databases have become more readily usable by the average computer user. Microsoft's Access is a popular PC-based engine.

Today, databases have quickly become integral to the design, development, and services offered by web sites.


Can we talk about data?

[Diagram: Survey / Quex, Docs, Notes, Observation → Data Entry → Data set / base]

Basic Data Structure

        Var-1   Var-2   Var-3   …   Var-m
Obs-1   1       2.5     100
Obs-2   2       1.7     91
Obs-3   3       21.0    211
…
Obs-n

Basic Data Structure

Terms & Synonyms

Row, Record, Tuple

Column, Variable, Field

Linking variable, key

Table or tables


Flat Files

• ASCII or EBCDIC (or Unicode)

• A flat file is a file containing records that have no structured interrelationship. The term is frequently used to describe a textual document from which all word processing or other structure characters have been removed (simple structure).

• File name usually ends in .dat, .txt, or .csv

Flat Files

F231012S2

F190101T2

M181012

F231011T1

M142222S0

How many variables? How many observations?


Flat Files

Observations or records run down the rows; variables or fields run across the columns.

Variable:  12 345678
Obs 1      F231012S2
Obs 2      F190101T2
Obs 3      M181012
Obs 4      F231011T1
Obs 5      M142222S0

Flat Files

This is a fixed-format (flat) file

F231012S2

F190101T2

M181012

F231011T1

M142222S0
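A fixed-format file like this can be read with column input. The sketch below is an assumption-laden illustration: the slides do not name the eight fields, so sex, age, q1-q4, grp, and score are hypothetical names, and the column positions follow the variable ruler shown earlier (field 2 occupies columns 2-3).

```sas
data flat;
  /* TRUNCOVER: a short record (like the third one) yields missing values */
  infile datalines truncover;
  input sex   $ 1
        age     2-3
        q1      4
        q2      5
        q3      6
        q4      7
        grp   $ 8
        score   9;
  datalines;
F231012S2
F190101T2
M181012
F231011T1
M142222S0
;
run;

proc print data=flat;
run;
```

For a real file on disk, the INFILE statement would name the file instead of DATALINES.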

Flat Files

Comma Delimited (*.csv)

F,23,1,0,1,2,S,2

F,19,0,1,0,1,T,2

M,18,1,0,1,2

F,23,1,0,1,1,T,1

M,14,2,2,2,2,S,0
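The comma-delimited version can be read with list input and the DSD option. As before, the variable names are hypothetical, since the slides do not name the fields.

```sas
data csv;
  /* DSD treats commas as delimiters; TRUNCOVER tolerates the short third row */
  infile datalines dsd truncover;
  input sex $ age q1 q2 q3 q4 grp $ score;
  datalines;
F,23,1,0,1,2,S,2
F,19,0,1,0,1,T,2
M,18,1,0,1,2
F,23,1,0,1,1,T,1
M,14,2,2,2,2,S,0
;
run;

proc print data=csv;
run;
```

With delimited data there is no need to count columns, which is one reason CSV files are popular for exchanging data.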


Text Editors

Avoid WORD, WP and other word-processors!

A word processor is aimed at styling text, integrating tables and figures, creating footnotes and indexes, and so on.

A text editor by contrast is predisposed to treat a pure text file quite literally… no junk!!!

See (study!) http://repec.org/docs/textEditors.html

Text Editors

Text editors are useful for

• Writing SAS programs

• Reviewing logs and output

• Examining (large) data files

Text Editors

There are many text editors available to you!

• Windows native Notepad / WordPad

• Emacs, Pico and Vi (Popular Unix/VMS editors)

• SAS Enhanced Text Editor

• Stata native *.do and *.log file editor

• PFE -- (for HUGE data files; no longer supported, but free, and nice) http://www.lancs.ac.uk/people/cpaap/pfe/

• BBEdit (for Macs)


Binary Files

A binary file is computer-readable but not human-readable.

All executable programs (*.exe) are stored in binary files, as are most numeric data files.

Access   *.mdb
Dbase    *.dbf
Excel    *.xls
SAS      *.sas7bdat
SPSS     *.sav
Stata    *.dta

More about Data

If data are in raw “flat file”

• ASCII or EBCDIC

• Fixed or delimited format

• Infile to SAS

If data are in “binary” format

• Excel, Access, Stata

• Convert to SAS

More on this Day 4…

Hierarchical Data

Hierarchical Data (def):

Data that is nested or grouped in another set of data.

It requires a little more thought!


Hierarchical Data

… nested … nested within Schools…

Example: Patients within a Doctor, within a Health-plan, within a Macroeconomy…

Example: Blood-pressures nested within person, over time!

Hierarchical Data

The essential element of hierarchical data is that things within a group are typically more alike than things between groups. There is a clustering or dependence.

This is how nature works!

Hierarchical Data

The similarities within a group are due to common exposures (e.g., toxic dumps and teachers) and to common forces that sort things into more homogeneous groups (e.g., money to buy a house or a good hospital).


Hierarchical Data

Unit of Analysis: When it comes to (hierarchical) data, it all depends on our level of interest in the detail at any given level!

Hierarchical Data

SUBJECT  AGE  SEX  HH  NEIGH
7        43   M    4   2
8        31   F    4   2
9        2    F    4   2
4        22   F    2   1
5        48   F    3   1
6        55   M    3   1
1        67   M    1   1
2        71   F    1   1
3        2    M    1   1

Hierarchical Data

SUBJECT  AGE  SEX  HH  HT  SBP
1        43   M    1   0   142
1        43   M    2   0   131
2        43   M    3   0   122
1        43   M    1   0   128
1        15   F    1   0   118
3        15   .    4   0   119
3        27   .    4   0   159
2        27   F    3   0   122
3        27   .    4   0   118


Relational Data

The "relation" comes from the fact that the tables can be linked to each other. These kinds of relations can be quite complex in nature, and would be hard to replicate in the standard flat-file format.

One major advantage of the relational model is that, if a database is designed efficiently, there should be no duplication of any data, which helps maintain database integrity and can represent a huge saving in file size, which is important when dealing with large volumes of data.

Relational databases also have functions "built in" that help them to retrieve, sort and edit the data in many different ways, and so can go quite some way to speeding things up.

Developed in the 1980s to overcome slowness of hierarchical and multiple flat-file systems.

Relational Data

Graduate School SPH

Bursars Office Financial Aid

Should each group have their own flat file?

We have a group of related tables!

Relational Data

Graduate School

SPH

Bursars Office

Financial Aid

Prof. Oakes’ Query…


Relational Data

Enrollment

Hospital

Clinic

Claims

Manipulating Data Sets

Append

A append B:

A
B
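In SAS, one common way to append is to list both data sets on a single SET statement. A minimal sketch with two made-up data sets, a and b, that share the same variables:

```sas
data a;
  input id sbp;
  datalines;
1 142
2 131
;
run;

data b;
  input id sbp;
  datalines;
3 122
4 128
;
run;

/* Listing both data sets on SET stacks the observations of B below those of A */
data ab;
  set a b;
run;

proc print data=ab;
run;
```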

Manipulating Data Sets

Merge

A merge B:

A B (side by side)

By some linking variable (e.g., ID)
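A sketch of a match-merge in SAS, using two invented data sets that share the linking variable id. Both inputs are sorted by id first, because MERGE with a BY statement expects sorted data.

```sas
data a;
  input id age;
  datalines;
1 43
2 31
3 22
;
run;

data b;
  input id sbp;
  datalines;
1 142
2 131
3 122
;
run;

proc sort data=a; by id; run;
proc sort data=b; by id; run;

/* Observations with the same id are joined side by side */
data ab;
  merge a b;
  by id;
run;

proc print data=ab;
run;
```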


Manipulating Data Sets

Collapse

A → A’

A is collapsed and statistics (e.g., sums, means) are generated.
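One way to collapse a data set in SAS is PROC MEANS with an OUTPUT statement. The sketch below collapses made-up person-level blood pressures to one row per household; the names hh, sbp, and a_collapsed are hypothetical.

```sas
data a;
  input hh sbp;
  datalines;
1 142
1 131
2 122
2 128
2 118
;
run;

/* NWAY keeps only the per-household rows; NOPRINT suppresses the printed table */
proc means data=a nway noprint;
  class hh;
  var sbp;
  output out=a_collapsed n=n_obs mean=mean_sbp sum=sum_sbp;
run;

proc print data=a_collapsed;
run;
```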

Manipulating Data Sets

Subset

A → a1, a2, a3

Use if command
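In SAS the "if command" is the subsetting IF statement. A minimal sketch with an invented data set:

```sas
data a;
  input id sex $ age;
  datalines;
1 M 43
2 F 31
3 F 22
;
run;

data a_female;
  set a;
  /* Subsetting IF: only observations where the condition is true are kept */
  if sex = 'F';
run;

proc print data=a_female;
run;
```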

The SAS System, finally!


What Is the SAS System?

The SAS System is an integrated system of software products that enables you to perform

• data entry, retrieval, and management

• report writing and graphics

• statistical and mathematical analysis

• business planning, forecasting, and decision support

• operations research and project management

• quality improvement

• applications development.

In addition, you can integrate with SAS many SAS business solutions that enable you to perform large scale business functions, such as data warehousing and data mining, human resources management and decision support, financial management and decision support, and others.

What Is the SAS System?

SAS is a computer program for managing and analyzing data… It’s a tool!

What Is the SAS System?

Data → SAS Program → Output

SAS reads data as per instructions.

SAS writes data and/or text as per instructions.


SAS, What’s the Big Deal?

Most popular stats program

Critical to all PH research

SAS programmers, > $60k

Challenging / terrifying to learn

Semi-unique language/terms

Terrible help files

SAS, What’s the Big Deal?

With respect to statistical programs, most applied statisticians/researchers spend most of their time doing data management activities in preparation for analyses.

This is a pre-statistical activity!!!!!

The SAS System

Main Flavors of SAS:

PC-SAS*

VMS SAS - epi

UNIX SAS - biostat


The SAS System

Versions of SAS

Pre 6.12

6.12

8.2*

9.0

Exploring PC-SAS

SAS for PC has three main “windows”

Program Editor

(text editor)

Program Log

Output

Exploring PC-SAS

Let’s demonstrate a couple of simple programs

See ‘day 1 programs.sas’


Help for SAS

Nothing replaces experience / trial and error

Help files from program

SAS Manuals

Online/CD SAS manuals

SAS Books By Users

SAS Listserv

SAS on Web, esp. UCLA!

Data Step Processing

[Flowchart]

SUBMIT DATA STEP PROGRAM

→ COMPILE program

→ CREATE Input Buffer, PDV, Descriptor Info

→ Process DATA statement

→ Set missing values

→ RECORD TO READ?

NO: End Data Step

YES: read INPUT record → execute other STATEMENTS → WRITES observation to SAS data → RETURN

Data Step Processing

SUBMIT DATA STEP PROGRAM → COMPILE

The DATA step begins with the DATA statement in your program.

In this phase, SAS checks the syntax of the SAS statements and compiles them, that is, automatically translates the statements into machine code.

SAS then identifies the type and length of each new variable, and determines whether a type conversion is necessary for each subsequent reference to a variable.


Data Step Processing

CREATE: Input Buffer, PDV, Descriptor Info

In this phase, SAS creates:

Input Buffer: A logical area in RAM into which SAS reads each record of raw data when SAS reads raw data.

Program Data Vector (PDV): A logical area in RAM where SAS builds a data set, one observation at a time. From here, SAS writes the values to a SAS data set as a single observation. Along with data set variables and newly computed variables, the PDV contains two automatic variables, _N_ and _ERROR_.

Descriptor Information: Information that SAS creates and maintains about each SAS data set, including data set attributes and variable attributes. It contains, for example, the name of the data set and its member type, the date and time that the data set was created, and the number, names and data types (character or numeric) of the variables.

Data Step Processing

data total_points (drop=TeamName);
  /* List input: $ marks character variables */
  input TeamName $ ParticipantName $ Event1 Event2 Event3;
  TeamTotal = (Event1 + Event2 + Event3);
  datalines;
Knights Sue 6 8 8
Cardinals Jane 9 7 8
Knights John 7 7 7
Knights Lisa 8 9 9
Knights Fran 7 6 6
Knights Walter 9 8 10
;
run;
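To watch the PDV change, a PUT _ALL_ statement can be placed between the statements of the step; it writes every variable in the PDV, including _N_ and _ERROR_, to the log. This is a sketch for experimentation with only the first two records, not part of the original program, and the label text in the PUT statements is arbitrary.

```sas
data total_points (drop=TeamName);
  put 'Top of iteration:';
  put _all_;

  input TeamName $ ParticipantName $ Event1 Event2 Event3;
  put 'After INPUT:';
  put _all_;

  TeamTotal = (Event1 + Event2 + Event3);
  put 'After assignment:';
  put _all_;
  datalines;
Knights Sue 6 8 8
Cardinals Jane 9 7 8
;
run;
```

The step starts one more iteration than there are records: the final pass stops at the INPUT statement when no record is left to read, so the log shows one extra "Top of iteration" entry.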

Data Step Processing

Build PDV for Named Variables

Record: Knights Sue 6 8 8

TeamName  ParticipantName  Event1  Event2  Event3  TeamTotal  _N_     _ERROR_
(Drop)                                                        (Drop)  (Drop)


Data Step Processing

Set missing values: fill in PDV place-holders for variables

Record: Knights Sue 6 8 8

TeamName  ParticipantName  Event1  Event2  Event3  TeamTotal  _N_     _ERROR_
                           .       .       .       0          1       0
(Drop)                                                        (Drop)  (Drop)

Data Step Processing

Read INPUT record: fill PDV with “data”

Record: Knights Sue 6 8 8

TeamName  ParticipantName  Event1  Event2  Event3  TeamTotal  _N_     _ERROR_
Knights   Sue              6       8       8       0          1       0
(Drop)                                                        (Drop)  (Drop)

Data Step Processing

Execute other STATEMENTS: calculate “TeamTotal” variable

Record: Knights Sue 6 8 8

TeamName  ParticipantName  Event1  Event2  Event3  TeamTotal  _N_     _ERROR_
Knights   Sue              6       8       8       22         1       0
(Drop)                                                        (Drop)  (Drop)


Data Step Processing

WRITES observation to SAS data: write/output to SAS dataset

Record: Knights Sue 6 8 8

ParticipantName  Event1  Event2  Event3  TeamTotal
Sue              6       8       8       22

Data Step Processing

RETURN: return and set _N_ to 2, repeat sequence

Record: Cardinals Jane 9 7 8

TeamName  ParticipantName  Event1  Event2  Event3  TeamTotal  _N_     _ERROR_
                           .       .       .       0          2       0
(Drop)                                                        (Drop)  (Drop)

[Recap slide: the PDV contents from the preceding slides, shown in sequence for the records "Knights Sue 6 8 8" (_N_ = 1) and "Cardinals Jane 9 7 8" (_N_ = 2).]


Research Ethics

• Poorly designed studies are unethical.

• Analysis of unethically collected data is unethical.

• Analysts have an ethical obligation to “get it right.”

• IRBs are legitimate mediators of the research process.

• Study the Belmont Report and 45 CFR 46.

• Identifiers: What are they and how to eliminate them


Lab 1

(First Hour) Directed Learning

• Surf web, esp UCLA

• Find ASCII tables on web

• Tour PC-SAS Interface

• Run first program

• Examine SAS HELP

(Second Hour) Lab Assignment

• Modify SAS windows (color/sizes)

• Edit program: modify libnames and change system options

• Do assignment (see website)

Homework 1

See Syllabus
