J. Michael Oakes, PhD
Associate Professor
Division of Epidemiology
University of Minnesota
oakes007@umn.edu
Lecture 1/4
• COMPUTER NUMBER SYSTEMS
• DATA AND DATABASES
• PC-SAS INTERFACE
• SAS HELP
• INTRO TO PDV
• RESEARCH ETHICS
Binary Numbers
Decimal Numbers (base-10 system) and Digits:
A digit is a single place that can hold numerical values between 0 and 9.
Digits are combined together in groups to create larger numbers
It is understood that in the number 6,357 the 7 is filling the "1s place," the 5 is filling the 10s place, the 3 is filling the 100s place, and the 6 is filling the 1,000s place.
So you could express things this way if you wanted to be explicit:
(6 * 1000) + (3 * 100) + (5 * 10) + (7 * 1) = 6000 + 300 + 50 + 7 = 6357
Binary Numbers
Another way to express 6357 would be to use powers of 10.
Assuming that we represent exponentiation with the “hat” or ^ symbol, we may express the quantity as,
(6 * 10^3) + (3 * 10^2) + (5 * 10^1) + (7 * 10^0) = 6357
Notice each digit is a placeholder for the next higher power of 10, starting in the first (rightmost) digit with 10 raised to the power of zero (recall X^0 = 1 for any nonzero X).
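A minimal sketch of the expansion above, written in Python purely for illustration (the lecture itself uses SAS). The function name expand_base10 is an assumption made up for this example.

```python
def expand_base10(n):
    """Return (digit, power) pairs for a non-negative integer, highest power first."""
    digits = [int(ch) for ch in str(n)]
    top = len(digits) - 1
    return [(d, top - i) for i, d in enumerate(digits)]

terms = expand_base10(6357)

# Rebuild the number from its digits; note that 10 ** 0 evaluates to 1.
total = sum(d * 10 ** p for d, p in terms)

print(" + ".join(f"({d} * 10^{p})" for d, p in terms), "=", total)
```

The same loop works for any base if 10 is replaced by that base, which is the idea the binary slides build on.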
Binary Numbers
Definition of “Binary” from the OED -
Binary arithmetic: a method of computation in which the binary scale is used, suggested by Leibnitz.
Binary scale: the scale of notation whose ratio is 2, in which, therefore,
1 of the ordinary (denary) scale is expressed by 1, 2 by 10, 3 by 11, 4 by 100, etc.
Binary Numbers
All programs and data are ultimately recognized as just patterns of 0's and 1's by the digital computer.
Binary Numbers
Why Use 'Em?
For computers, binary numbers are great stuff because:
Noise-resistant
Simple
Mimic Circuit
Not continuous, thus on or off.
They are simple to work with -- no big addition tables and multiplication tables to learn, just do the same things over and over, very fast.
They just use two values of voltage, magnetism, or other signal, which makes the hardware easier to design and more noise resistant.
Binary Numbers
Decimal 1 is binary 0001
Decimal 3 is binary 0011
Decimal 6 is binary 0110
Decimal 9 is binary 1001
Each digit "1" in a binary number represents a power of two, and each "0" represents zero:
0001 is 2 to the zero power, 2^0, or 1
0010 is 2 to the 1st power, 2^1, or 2
0100 is 2 to the 2nd power, 2^2, or 4
1000 is 2 to the 3rd power, 2^3, or 8.
2^0 + 2^1 + 2^2 + 2^3 + 2^4 + … + 2^n
Binary Numbers
When you see a number like "0101" you can figure out what it means by adding the powers of 2:
0101 = 0 + 4 + 0 + 1 = 5
1010 = 8 + 0 + 2 + 0 = 10
0111 = 0 + 4 + 2 + 1 = 7
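The "add the powers of 2" rule can be sketched in a few lines of Python (used here only for illustration). The helper name binary_to_decimal is hypothetical; Python's built-in int(bits, 2) does the same job.

```python
def binary_to_decimal(bits):
    """Add 2**position for every '1', counting positions from the right."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total

for bits in ("0101", "1010", "0111"):
    print(bits, "=", binary_to_decimal(bits))
```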
Binary Numbers
The word bit is a shortening of the words "Binary digIT"
It is the smallest possible unit of information.
Each digit or “place” is a bit:
2^0 + 2^1 + 2^2 + 2^3 + 2^4 + 2^5 + 2^6 + 2^7 + 2^8 + 2^9 + 2^10
place  1 2 3 4 5 6 7 8
bit    0 1 2 3 4 5 6 7
Binary Numbers
8 bits is usually called a "byte", which is the size usually used to represent an alphabetic character in ASCII -- "A" is 65, or 01000001.
With 8 bits in a byte, you can represent 256 values ranging from 0 to 255, as shown here:
0 = 00000000
1 = 00000001
2 = 00000010
...
254 = 11111110
255 = 11111111
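A short Python sketch (illustration only) of the byte range listed above. The helper name to_byte is an assumption; format and ord are standard Python built-ins.

```python
def to_byte(n):
    """Return the 8-bit pattern for an integer that fits in one byte."""
    if not 0 <= n <= 255:
        raise ValueError("one byte holds only the values 0..255")
    return format(n, "08b")

print(to_byte(0), to_byte(1), to_byte(254), to_byte(255))

# The letter "A" is stored as the number 65.
print(ord("A"), to_byte(ord("A")))
```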
Binary Numbers
When you start talking about lots of bytes, you get into prefixes like kilo, mega and giga, as in kilobyte, megabyte and gigabyte.
The following table shows the multipliers:
Name  Abbr.  Size
Kilo  K      2^10 = 1,024
Mega  M      2^20 = 1,048,576
Giga  G      2^30 = 1,073,741,824
Tera  T      2^40 = 1,099,511,627,776
Peta  P      2^50 = 1,125,899,906,842,624
Exa   E      2^60 = 1,152,921,504,606,846,976
Bytes are frequently used to hold individual characters in a text document.
Data are usually either numbers or letters (strings).
Of course, memory can only hold binary numbers, so we have to agree how to interpret those numbers (a kind of code) when we want them to represent letters (or numbers, for that matter).
(That computers can process text or string data is what made them popular!)
Character on Screen   Binary Value
1                     0000 0001
2                     0000 0010
3                     0000 0011
a                     0110 0001
A                     0100 0001
b                     0110 0010
B                     0100 0010
String - A sequence of printable characters, delimited by quotes in Stata, SAS, etc.
Examples: “hello world” “String” “1243212”
“ ” is called the null string
Numeric - Easy to code, but there is an issue of precision
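The number 1.1 in binary is 1.0001100110011… repeating… much like 1/11 in decimal, which is 0.09090909…
A 4-byte (single-precision) float, such as Stata's default float type, “rounds” 1.1 to 1.1000000238419, which is off a little. SAS stores numerics in 8 bytes by default, so its error is much smaller but still not zero.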
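The size of the rounding error depends on how many bytes hold the number. The sketch below, in Python for illustration only, compares 1.1 stored in 8 bytes with 1.1 stored in 4 bytes. The helper as_float32 is a made-up name; struct and decimal are standard-library modules.

```python
import struct
from decimal import Decimal

def as_float32(x):
    """Round-trip a value through a 4-byte IEEE single-precision float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

stored_double = Decimal(1.1)      # exact value of the 8-byte representation
stored_single = as_float32(1.1)   # value after squeezing into 4 bytes

print(stored_double)
print(f"{stored_single:.13f}")
```

Neither representation is exactly 1.1; the 8-byte version is simply wrong much further to the right of the decimal point.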
Also store ID variables as STRINGS!!!
Say that we have a tiny data file which has just ID variables like the one below.
123456789
123456790
123456791
123456792
123456793
123456794
123456795
123456796
If we list the values, they are displayed in scientific notation, which makes them hard to read.
id
1. 1.23e+08
2. 1.23e+08
3. 1.23e+08
4. 1.23e+08
5. 1.23e+08
6. 1.23e+08
7. 1.23e+08
8. 1.23e+08
Displayed with a wider format, the stored values no longer match the original IDs:
id
1. 123456792
2. 123456792
3. 123456792
4. 123456792
5. 123456792
6. 123456792
7. 123456792
8. 123456800
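The collapsed IDs in the listing are what 4-byte single-precision storage produces. The sketch below reproduces them in Python (illustration only), assuming the IDs were stored as 4-byte floats. The helper name as_float32 is hypothetical.

```python
import struct

def as_float32(x):
    """Round-trip a value through a 4-byte IEEE single-precision float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

ids = list(range(123456789, 123456797))

# Stored as 4-byte floats, eight distinct IDs collapse to two values.
stored_as_float = [int(as_float32(i)) for i in ids]

# Stored as strings, every ID survives unchanged.
stored_as_string = [str(i) for i in ids]

for original, damaged in zip(ids, stored_as_float):
    print(original, "->", damaged)
```

Near 123 million, adjacent single-precision values are 8 apart, so nearby IDs cannot be told apart. This is the reason for storing ID variables as strings.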
SAS - Data and storage types
The default length of numeric variables in SAS data sets is 8 bytes.
Control the length of SAS numeric variables with the LENGTH statement in the DATA step (more on this later)
In PC-SAS, the Windows data type of numeric values that have a length of 8 is LONG REAL. The precision of this type of floating-point value is 16 decimal digits.
A character variable with a length of 1 byte can serve the same purpose as a dummy (0,1) variable.
Regardless of how much precision is available, there is still the problem that some numbers cannot be represented exactly.
In the decimal number system, the fraction 1/3 cannot be represented exactly in decimal notation. Likewise, most decimal fractions (for example, .1) cannot be represented exactly in the binary numbering system.
Imprecision can also cause problems with comparisons.
Consider the following example in which the PUT statement is not executed:
data _null_;
   x=1/3;
   if x=.33333 then put 'MATCH';
run;
If you add the ROUND function, as in the following example, the PUT statement is executed:
data _null_;
   x=1/3;
   if round(x,.00001)=.33333 then put 'MATCH';
run;
In general, if you are doing comparisons with fractional values, it is good practice to use the ROUND function.
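The same comparison problem shows up in any language that uses binary floating point. Below is a hedged Python analogue of the two SAS steps (illustration only). Note that Python's round takes a number of decimal places, whereas SAS ROUND takes a rounding unit such as .00001.

```python
x = 1 / 3

# Direct comparison fails: x holds 0.333333333333..., not 0.33333.
exact_match = (x == 0.33333)

# Rounding to 5 decimal places first makes the comparison succeed.
rounded_match = (round(x, 5) == 0.33333)

print("without rounding:", exact_match)
print("with rounding:   ", rounded_match)
```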
ASCII Data
ASCII - The American Standard Code for Information Interchange is a standard seven-bit code that was proposed by the American National Standards Institute (ANSI) in 1963, and finalized in 1968. Other sources also credit much of the work on ASCII to work done in 1965 by Robert W. Bemer.
In this system, 7 bits are used to represent 128 (2^7) different letters, numbers, punctuation, and special codes. When the eighth bit is used (why waste it?), we have Extended ASCII, in which 256 characters are available. The "upper" 128 characters are not as standard as the first 128. Refer to an Extended ASCII chart on the web.
ASCII Data
If you use Notepad in Windows 95/98 to create a text file containing the words,
Four score and seven years ago
Notepad would use 1 byte of memory per character (including 1 byte for each space character between the words -- ASCII character 32).
When Notepad stores the sentence in a file on disk, the file will also contain 1 byte per character and per space.
ASCII Data
If you were to look at the file as a computer looks at it, you would find that each byte contains not a letter but a number -- the number is the ASCII code corresponding to the character (see below). So on disk, the numbers for the file look like this:
F    o    u    r    (sp) a    n    d    (sp) s    e    v    e    n
70   111  117  114  32   97   110  100  32   115  101  118  101  110
By looking in the ASCII table, you can see a one-to-one correspondence between each character and the ASCII code used.
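The character-to-code correspondence can be checked directly. A small Python sketch (illustration only), using the built-in ord function and the standard "ascii" codec:

```python
text = "Four and seven"

# One ASCII code per character, including the space (code 32).
codes = [ord(ch) for ch in text]
print(codes)

# The full sentence occupies one byte per character on disk.
sentence = "Four score and seven years ago"
n_bytes = len(sentence.encode("ascii"))
print(n_bytes, "bytes for", len(sentence), "characters")
```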
The history of ASCII is part of the longer history of character codes, beginning with Morse code: http://tronweb.super-nova.co.jp/characcodehist.html
EBCDIC Data
ASCII code was adopted by all U.S. computer manufacturers except IBM, which developed a proprietary character code for its mainframe computers
IBM created a proprietary 8-bit character code (2^8 = 256 code points) called EBCDIC [pronounced eb-see-dick], which stands for "Extended Binary Coded Decimal Interchange Code."
It was used on the successful IBM System/360 mainframe computer series, which was announced in April 1964.
Today, many businesses that have data in EBCDIC files are converting them into ASCII and other non-proprietary formats
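A minimal sketch of an EBCDIC-to-ASCII conversion, in Python for illustration only. It assumes the data use IBM code page 037 (EBCDIC, US/Canada), which Python ships as the "cp037" codec. Real files may use a different EBCDIC code page, so treat the codec choice as an assumption.

```python
text = "SAS"

# The same three letters have different byte values in the two codes.
ebcdic_bytes = text.encode("cp037")
ascii_bytes = text.encode("ascii")
print(list(ebcdic_bytes), list(ascii_bytes))

# Converting a record: decode from EBCDIC, then re-encode as ASCII.
converted = ebcdic_bytes.decode("cp037").encode("ascii")
print(converted)
```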
Data are either:
Numeric (e.g., 1 12001 0.4532)
Character or string (e.g., Y NO “This is totally boring!”)
Databases
Database - a computerized record keeping system.
• More completely, it is a system involving data, the hardware that physically stores that data, the software that utilizes the hardware's file system in order to (1) store the data and (2) provide a standardized method for retrieving or changing the data, and finally, the users who turn the data into information.
• In the 1960s, databases were created to solve the problems with file-oriented systems in that they were compact, fast, easy to use, current, accurate, allowed the easy sharing of data between multiple users, and were secure.
• A database might be as complex and demanding as an account tracking system used by an insurance company to manage the constantly changing accounts of thousands of subscribers, or it could be as simple as a collection of email addresses on your computer.
• The important thing is that a database allows one to store data and to retrieve or modify it easily and efficiently when needed, regardless of the amount of data being manipulated. What the data are and how demanding you will be when retrieving and modifying them is simply a matter of scale.
Databases
Traditionally, databases ran on large, powerful mainframes for business applications.
Such machines use packages like Oracle 8 or Sybase SQL (structured query language) Server. SAS is really a database program!
However, with the advent of small, powerful personal computers, databases have become more readily usable by the average computer user. Microsoft's Access is a popular PC-based engine.
Today, databases have quickly become integral to the design, development, and services offered by web sites.
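[Figure: Survey Quex, Docs, Notes, and Observation feed into Data Entry, which produces the Data set / base: a grid of observations (rows Obs-1, Obs-2, Obs-3, … Obs-n) by variables (columns Var-1, Var-2, Var-3, … Var-m) holding values such as 1, 2.5, and 100.]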
Flat Files
• A flat file is a file containing records that have no structured interrelationship. The term is frequently used to describe a textual document from which all word processing or other structure characters have been removed (simple structure).
• The file name usually ends in .dat, .txt, or .csv
F231012S2
F190101T2
M181012
F231011T1
M142222S0
How many variables? How many observations?
Variables or fields (columns):    12 345678
Observations or records (rows): 1 F231012S2
                                2 F190101T2
                                3 M181012
                                4 F231011T1
                                5 M142222S0
This is a fixed-format (flat) file
F231012S2
F190101T2
M181012
F231011T1
M142222S0
Comma Delimited (*.csv)
F,23,1,0,1,2,S,2
F,19,0,1,0,1,T,2
M,18,1,0,1,2
F,23,1,0,1,1,T,1
M,14,2,2,2,2,S,0
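The two layouts hold the same eight variables but are read differently: fixed format by column position, comma-delimited by separator. The Python sketch below (illustration only) assumes the column widths implied by the comma-delimited version: 1, 2, then six single-character fields. The name WIDTHS and the helper parse_fixed are hypothetical.

```python
import csv
import io

# Assumed column widths: sex (1), age (2), then six 1-character fields.
WIDTHS = [1, 2, 1, 1, 1, 1, 1, 1]

def parse_fixed(record, widths=WIDTHS):
    """Cut a fixed-format record into fields by column position."""
    fields, start = [], 0
    for width in widths:
        fields.append(record[start:start + width])  # '' if the record is short
        start += width
    return fields

fixed_lines = ["F231012S2", "F190101T2", "M181012", "F231011T1", "M142222S0"]
fixed_rows = [parse_fixed(line) for line in fixed_lines]

csv_text = "F,23,1,0,1,2,S,2\nM,18,1,0,1,2\n"
csv_rows = list(csv.reader(io.StringIO(csv_text)))

print(fixed_rows[0])
print(fixed_rows[2])   # short record: trailing fields come back empty
print(csv_rows)
```

The third record is shorter than the others in both layouts. A fixed-format reader sees empty trailing columns, while a delimited reader simply sees fewer fields.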
Text Editors
Avoid WORD, WP and other word-processors!
A word processor is aimed at styling text, integrating tables and figures, creating footnotes and indexes, and so on.
A text editor by contrast is predisposed to treat a pure text file quite literally… no junk!!!
See (study!) http://repec.org/docs/textEditors.html
Text Editors
Text editors are useful for
• Writing SAS programs
• Reviewing logs and output
• Examining (large) data files
Text Editors
There are many text editors available to you!
• Windows native Notepad / Wordpad
• Emacs, Pico and Vi (Popular Unix/VMS editors)
• SAS Enhanced Text Editor
• Stata native *.do and *.log file editor
• PFE -- (for HUGE data files; no longer supported, but free, and nice) http://www.lancs.ac.uk/people/cpaap/pfe/
• BBEdit (for Macs)
Binary Files
All executable programs (*.exe) are stored in binary files, as are most numeric data files.
Access  *.mdb
Dbase   *.dbf
Excel   *.xls
SAS     *.sas7bdat
SPSS    *.sav
Stata   *.dta
If data are in raw “flat file” form (ASCII or EBCDIC; fixed or delimited format): Infile to SAS
If data are in “binary” format (Excel, Access, Stata): Convert to SAS
Hierarchical Data
Hierarchical Data
… nested … nested within Schools…
Example: Patients within a Doctor, within a Health-plan, within a Macroeconomy…
Example: Blood-pressures nested within person, over time!
Hierarchical Data
The essential element of hierarchical data is that things within a group are typically more alike than things between groups. There is a clustering or dependence.
Hierarchical Data
The similarities within a group are due to common exposures (e.g., toxic dumps and teachers) and to common forces that sort things into more homogeneous groups (e.g., money to buy a house or a good hospital).
Hierarchical Data
Unit of Analysis: When it comes to (hierarchical) data, it all depends on our level of interest in the detail at any given level!
Hierarchical Data
SUBJECT  AGE  SEX  HH  NEIGH
7        43   M    4   2
8        31   F    4   2
9        2    F    4   2
4        22   F    2   1
5        48   F    3   1
6        55   M    3   1
1        67   M    1   1
2        71   F    1   1
3        2    M    1   1
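The nesting in this table (subjects within households, households within neighborhoods) can be made explicit by grouping. The Python sketch below is for illustration only and assumes the table is read row by row in the order SUBJECT, AGE, SEX, HH, NEIGH. The variable names are made up for this example.

```python
from collections import defaultdict

# (subject, age, sex, household, neighborhood), one tuple per table row
rows = [
    (7, 43, "M", 4, 2), (8, 31, "F", 4, 2), (9, 2, "F", 4, 2),
    (4, 22, "F", 2, 1), (5, 48, "F", 3, 1), (6, 55, "M", 3, 1),
    (1, 67, "M", 1, 1), (2, 71, "F", 1, 1), (3, 2, "M", 1, 1),
]

# neighborhood -> household -> list of subjects
nested = defaultdict(lambda: defaultdict(list))
for subject, age, sex, hh, neigh in rows:
    nested[neigh][hh].append(subject)

for neigh in sorted(nested):
    for hh in sorted(nested[neigh]):
        print("neighborhood", neigh, "household", hh, "subjects", nested[neigh][hh])
```

The unit of analysis then decides which level to work at: nine subjects, four households, or two neighborhoods.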
Hierarchical Data
SUBJECT  AGE  SEX  HH  HT  SBP
1        43   M    1   0   142
1        43   M    2   0   131
2        43   M    3   0   122
1        43   M    1   0   128
1        15   F    1   0   118
3        15   .    4   0   119
3        27   .    4   0   159
2        27   F    3   0   122
3        27   .    4   0   118
Relational Data
The "relation" comes from the fact that the tables can be linked to each other. These kinds of relations can be quite complex in nature, and would be hard to replicate in the standard flat-file format.
One major advantage of the relational model is that, if a database is designed efficiently, there should be no duplication of any data, which helps maintain database integrity and can represent a huge saving in file size, which is important when dealing with large volumes of data.
Relational databases also have functions "built in" that help them to retrieve, sort and edit the data in many different ways, and so can go quite some way to speeding things up.
Developed in the 1980s to overcome slowness of hierarchical and multiple flat-file systems.
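A small sketch of linked tables, in Python for illustration only, using the standard-library sqlite3 module with an in-memory database. The table names, column names, and values are all hypothetical. The point is that each fact is stored once and the link is made at query time.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE enrollment (student_id INTEGER PRIMARY KEY, program TEXT);
    CREATE TABLE aid (student_id INTEGER, amount REAL);
    INSERT INTO enrollment VALUES (1, 'Epidemiology'), (2, 'Biostatistics');
    INSERT INTO aid VALUES (1, 5000), (1, 1500), (2, 3000);
""")

# Link the two tables on student_id instead of copying 'program' into 'aid'.
query = """
    SELECT e.student_id, e.program, SUM(a.amount)
    FROM enrollment AS e
    JOIN aid AS a ON a.student_id = e.student_id
    GROUP BY e.student_id, e.program
    ORDER BY e.student_id
"""
result = con.execute(query).fetchall()
print(result)
con.close()
```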
Relational Data
Graduate School SPH
Bursars Office Financial Aid
Should each group have their own flat file?
We have a group of related tables!
Relational Data
Graduate School
SPH
Bursars Office
Financial Aid
Prof. Oakes’ Query…
Relational Data
Enrollment
Hospital
Clinic
Claims
Sets
Append
A append B: the observations of B are added below those of A.
Sets
Merge
A merge B: A and B are placed side by side, matched by some linking variable (e.g., ID).
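Append and merge can be sketched with plain Python lists of records (illustration only; in SAS these operations are done in the DATA step). The data values and variable names below are invented for the example.

```python
a = [{"id": 1, "age": 43}, {"id": 2, "age": 31}]
b = [{"id": 3, "age": 22}]

# Append: stack the observations of B under those of A.
appended = a + b

# Merge: add variables from another set, matched on the linking variable id.
visits = [{"id": 1, "sbp": 142}, {"id": 2, "sbp": 131}]
visits_by_id = {row["id"]: row for row in visits}
merged = [{**row, **visits_by_id.get(row["id"], {})} for row in a]

print(appended)
print(merged)
```

Append adds rows and leaves the variables alone; merge adds variables and leaves the rows alone.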
Sets
Collapse
A becomes A’: observations are collapsed and statistics (e.g., sums, means) are generated.
Sets
Subset
A is divided into subsets a1, a2, a3. Use the IF command.
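Collapse and subset can be sketched the same way, in Python for illustration only. The records and the age cutoff of 18 are assumptions made up for this example.

```python
from statistics import mean

rows = [
    {"hh": 1, "age": 67}, {"hh": 1, "age": 71}, {"hh": 1, "age": 2},
    {"hh": 3, "age": 48}, {"hh": 3, "age": 55},
]

# Collapse: one output record per household, carrying summary statistics.
ages_by_hh = {}
for row in rows:
    ages_by_hh.setdefault(row["hh"], []).append(row["age"])
collapsed = [
    {"hh": hh, "n": len(ages), "mean_age": mean(ages)}
    for hh, ages in sorted(ages_by_hh.items())
]

# Subset: keep only the observations that satisfy an IF-style condition.
adults = [row for row in rows if row["age"] >= 18]

print(collapsed)
print(adults)
```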
The SAS System is an integrated system of software products that enables you to perform
• data entry, retrieval, and management
• report writing and graphics
• statistical and mathematical analysis
• business planning, forecasting, and decision support
• operations research and project management
• quality improvement
• applications development.
In addition, you can integrate with SAS many SAS business solutions that enable you to perform large scale business functions, such as data warehousing and data mining, human resources management and decision support, financial management and decision support, and others.
SAS Reads Data as per instructions
SAS Writes Data and/or Text as per instructions
Most popular stats program
Critical to all PH research
SAS programmers, > $60k
Challenging / terrifying to learn
Semi-unique language/terms
Terrible help files
Main Flavors of SAS:
Versions of SAS
SAS for PC has three main “windows”
(text editor)
Let’s demonstrate a couple of simple programs
See ‘day 1 programs.sas’
Help files from program
SAS Manuals
Online/CD SAS manuals
SAS Books By Users
SAS Listserve
SAS on Web, esp. UCLA!
Data Step Processing
SUBMIT DATA STEP PROGRAM
COMPILE
CREATE: Input Buffer, PDV, Descript. Info
process DATA statement
Set missing values
RECORD TO READ?
   NO: End Data Step PROGRAM
   YES: read INPUT record
        execute other STATEMENTS
        WRITES observation to SAS data set
        RETURN
Data Step Processing
COMPILE
The DATA step begins with the DATA statement in your program.
In this phase, SAS checks the syntax of the SAS statements and compiles them, that is, automatically translates the statements into machine code.
SAS then identifies the type and length of each new variable, and determines whether a type conversion is necessary for each subsequent reference to a variable.
Data Step Processing
CREATE: Input Buffer, PDV, Descript. Info
In this phase, SAS creates:
Input Buffer: A logical area in RAM into which SAS reads each record of raw data.
Program Data Vector (PDV): A logical area in RAM where SAS builds a data set, one observation at a time. From here, SAS writes the values to a SAS data set as a single observation. Along with data set variables and newly computed variables, the PDV contains two automatic variables, _N_ and _ERROR_.
Descriptor Information: Information that SAS creates and maintains about each SAS data set, including data set attributes and variable attributes.
It contains, for example, the name of the data set and its member type, the date and time that the data set was created, and the number, names and data types (character or numeric) of the variables.
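Data Step Processing
/* TeamName is read from each record but dropped from the output data set */
data total_points (drop=TeamName);
   input TeamName $ ParticipantName $ Event1 Event2 Event3;
   TeamTotal = (Event1 + Event2 + Event3);
   /* the semicolon that ends the data lines must sit on its own line */
   datalines;
Knights Sue 6 8 8
Cardinals Jane 9 7 8
Knights John 7 7 7
Knights Lisa 8 9 9
Knights Fran 7 6 6
Knights Walter 9 8 10
;
run;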
Data Step Processing
Input record: Knights Sue 6 8 8
TeamName   ParticipantName  Event1  Event2  Event3  TeamTotal  _N_   _ERROR_
Drop: TeamName, _N_, _ERROR_
Build PDV for Named Variables
Data Step Processing
Input record: Knights Sue 6 8 8
Set missing values
TeamName   ParticipantName  Event1  Event2  Event3  TeamTotal  _N_   _ERROR_
(blank)    (blank)          .       .       .       .          1     0
Drop: TeamName, _N_, _ERROR_
Fill-in PDV place-holders for variables
Data Step Processing
Input record: Knights Sue 6 8 8
read INPUT record
TeamName   ParticipantName  Event1  Event2  Event3  TeamTotal  _N_   _ERROR_
Knights    Sue              6       8       8       .          1     0
Drop: TeamName, _N_, _ERROR_
Fill PDV with “data”
Data Step Processing
Input record: Knights Sue 6 8 8
execute other STATEMENTS
TeamName   ParticipantName  Event1  Event2  Event3  TeamTotal  _N_   _ERROR_
Knights    Sue              6       8       8       22         1     0
Drop: TeamName, _N_, _ERROR_
Calculate “TeamTotal” variable
Data Step Processing
Input record: Knights Sue 6 8 8
WRITES observation to SAS data set
ParticipantName  Event1  Event2  Event3  TeamTotal
Sue              6       8       8       22
Write/Output to SAS dataset
Data Step Processing
Input record: Cardinals Jane 9 7 8
RETURN
TeamName   ParticipantName  Event1  Event2  Event3  TeamTotal  _N_   _ERROR_
(blank)    (blank)          .       .       .       .          2     0
Drop: TeamName, _N_, _ERROR_
Return and set _N_ to 2, Repeat Sequence
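The execution loop walked through above can be imitated in a few lines. This is a rough Python simulation for illustration only, not how SAS is implemented. The function name data_step and the use of None for a missing value are assumptions of this sketch.

```python
RAW = [
    "Knights Sue 6 8 8", "Cardinals Jane 9 7 8", "Knights John 7 7 7",
    "Knights Lisa 8 9 9", "Knights Fran 7 6 6", "Knights Walter 9 8 10",
]
DROPPED = {"TeamName", "_N_", "_ERROR_"}  # flagged "Drop" in the PDV

def data_step(records):
    """Imitate one pass of the DATA step per input record."""
    output = []
    for n, record in enumerate(records, start=1):   # one record in the buffer
        # Set missing values: blank for character, None (.) for numeric.
        pdv = {"TeamName": "", "ParticipantName": "", "Event1": None,
               "Event2": None, "Event3": None, "TeamTotal": None,
               "_N_": n, "_ERROR_": 0}
        # read INPUT record: fill the PDV from the input buffer.
        team, name, e1, e2, e3 = record.split()
        pdv.update(TeamName=team, ParticipantName=name,
                   Event1=int(e1), Event2=int(e2), Event3=int(e3))
        # execute other statements: compute TeamTotal.
        pdv["TeamTotal"] = pdv["Event1"] + pdv["Event2"] + pdv["Event3"]
        # Write the observation, leaving out the dropped variables.
        output.append({k: v for k, v in pdv.items() if k not in DROPPED})
    return output

total_points = data_step(RAW)
for obs in total_points:
    print(obs)
```

Each pass through the loop corresponds to one trip around the flowchart: reset, read, compute, write, return.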
Research Ethics
• Poorly designed studies are unethical.
• Analysis of unethically collected data is unethical.
• Analysts have an ethical obligation to “get it right.”
• IRBs are legitimate mediators of the research process.
• Study the Belmont Report and 45 CFR 46.
• Identifiers: What are they and how to eliminate them
(First Hour) Directed Learning
• Surf web, esp UCLA
• Find ASCII tables on web
• Tour PC-SAS Interface
• Run first program
• Examine SAS HELP
(Second Hour) Lab Assignment
• Modify SAS windows (color/sizes)
• Edit program: modify libnames and change system options
• Do assignment (see website)
See Syllabus