The first step in data analysis
Using SAS/BASE® to connect to third-party relational data base software to extract data needed for
program evaluation research using administrative data operational reports e.g. routine surveillance
1.
What is a relational database?
2.
Contact your DBA for how to connect to your database(s)?
3.
How to write queries using PROC SQL
SHRUG, 2014-05-02 1
Set of tables
tables made up of rows and columns
Trade names of relational databases (RDB):
Oracle, Teradata, SQL Server, DB2, Access
RDB is software which is designed to retain large amounts of data
transactional DB
reporting/warehousing DB
SHRUG, 2014-05-02 2
Transactional DB designed to increase the speed for frontend users
complex table and table join structures
Warehousing DB designed for efficient storage and retrieval for reporting
simpler table designs and table join structures
Queries for either design use same syntax (code)
queries for warehouses will be simpler to write
SHRUG, 2014-05-02 3
Why use relational databases?
relational databases use a concept called “normalization”
Normalization reduces the amount of redundant data and allows for updates to data with less error
There are degrees of normalization
first degree second degree third degree and higher degrees
SHRUG, 2014-05-02 4
First degree normalization
each row pertains to a single entity: a patient, an encounter, a physician each column pertains to a characteristic of the entity: e.g. date of birth, sex, date of encounter, etc
Table 1: Subjects with demographic information
ID
0001
0002
FirstName
John
Devbani
Gender
M
F
BirthCity
Moncton
Kolkata
BirthCountry
Canada
India
SHRUG, 2014-05-02 5
Table 1: Subjects with improper 1NF
SubjID
0001
FirstName
John
Gender
43
BirthCity
Moncton
BirthCountry
New Brunswick
0002 Raha F West Bengal India
What impact does violating the first degree normalization have on your query
if you want all patients born in Canada?
if you want all male patients?
SHRUG, 2014-05-02 6
Table 2 has employer information about rows in Table 1
Table 2: Business addresses
Name City Prov PostalCode
John
Devbani
Halifax
Halifax
NS
NS
B3K 6R8
B3H 2Y9
The table above has some redundant information:
name is repeated from Table 1, province is embedded in the postal code
Better design – two or even 3 tables
SHRUG, 2014-05-02 7
Table 2: Revised with 2NF
SubjID PostalCode
0001
0002
B3K 6R8
B3H 2Y9
Table 3: Creating a secondary table for 2NF
PostalCode
B3K 6R8
B3H 2Y9
City
Halifax
Halifax
Prov
NS
NS
SHRUG, 2014-05-02 8
Table 2 now no longer contains name – it’s replaced with the subject ID
to get the subject’s name we link the table to the table in the first example, using SUBJID/ID column
we get the province and city by linking Table 2 and 3 using the POSTALCODE column
SUBJID is a primary key in Tables 1 and 2
POSTALCODE is a foreign key in Table 2, but a primary key in
Table 3
SHRUG, 2014-05-02 9
primary key – a column or combination of columns that uniquely identify each row in the table
e.g. patient medical record needs at least 3 columns to identify a unique record: patient ID, date of encounter, and provider ID
foreign key – a column or combination of columns that is used to link data between two tables
SHRUG, 2014-05-02 10
Can you see the advantage of splitting the data into different tables?
share examples of your data where normalization is used
higher degrees of normalization work similarly to the examples above
you have to go through more tables for higher levels of normalization in order to link to the data that you need
SHRUG, 2014-05-02 11
Explain to DBA that you need to query data, but have no need to write to the database
this helps them to determine where you belong on a user matrix
DBA or IT install necessary software on your machine
Google has lots of information on SAS Connect
SAS Connect documentation
SHRUG, 2014-05-02 12
proc sql;
•
• User name is provided by DBA/IT
In this example the password is held in the macro DBPASS
• Statement to have Oracle print any messages to the SAS log connect to oracle
(user = <userid> password="&dbpass” path = prod );
%put &sqlxmsg;
This is an example of “pass-through” code
SHRUG, 2014-05-02 13
Recall that slide 13 showed pass-through facility in SAS
most of the query is done on the database
Can use libname statement to connect instead of passthrough
advantage to this method is that you are programming in
SAS (using SAS functions and formats)
SAS determines which program (SAS or RDB) will handle statements more efficiently
SHRUG, 2014-05-02 14
Example using a libname statement:
1.
2.
3.
libname onco odbc dsn='Oncolog' schema=dbo;
1.
2.
The name of the library
Tells SAS that you are using an ODBC engine
3.
DSN – use the name of the database that was used to set up the odbc connection
NOTE: schema statement is not always required
SHRUG, 2014-05-02 15
Once view is created, you use the EXPLORER tab in SAS and use as normal dataset
SHRUG, 2014-05-02 16
Using the “view columns” in SAS EXPLORER
SHRUG, 2014-05-02 17
• Double click on table to get to see the data
• NOTE: columns that identify personal information have been removed from this screen shot
SHRUG, 2014-05-02 18
You may have software from the RDB:
TOAD (for Oracle)
SQL Developer (for Oracle)
SQL Server
Teradata
All vendors may have some limited function
“development” software that allows:
Viewing data
Viewing the “type” of a column: char, num, date, etc.
Writing SQL queries
SHRUG, 2014-05-02 19
SHRUG, 2014-05-02 20
PROC SQL DATA STEP proc sql; create <table/view> <name> as select <var1>
, <var2>
, etc from <table/view> where <apply data filters> quit; data <dataset name>; set <dataset> ( keep= <list of variables> where=(<apply filters>)); run;
Example: Create a dataset (table) with men aged 50 to 74.
Assume the source table is called “demographics” and contains variables: subjectID, age and sex
SHRUG, 2014-05-02 21
PROC SQL proc sql; create table men5074 as select subjectID
, age from work.demographics
where sex=‘M’ and age
DATA STEP data men5074
(drop=sex); set work.demographics
(keep=subjectid sex age where=(sex='M' and
50<=age<=74));
; quit;
SHRUG, 2014-05-02 22
Request for report received
Realize as you look through data elements needed for complete the request that relevant columns reside in two or more tables
Background Information
1.
Need to know which tables and which columns are relevant. Useful to have data dictionary, otherwise 3 rd party software very helpful
2.
Need to know what filters to apply: sex, time period of interest, diagnosis codes, etc are all commonly applied filters
SHRUG, 2014-05-02 23
• Create a map to guide your query
• names of tables that go in ‘FROM’ statement of
SQL or ‘SET’ statement in DATA step
• names of columns that you need
• use meaningful arrows to connect work. table1 uniqueID filter
<other columns> work.table2
uniqueID filter 1 filter 2
<other columns>
SHRUG, 2014-05-02 24
*** sort the first dataset; proc sort data=<dataset1>; by <var(s)>; run;
*** sort the second dataset; proc sort data=<dataset2>; by <var(s)>; *** same var(s) as first sort; run;
SHRUG, 2014-05-02 25
*** find records common to both tables; data <result dataset>; merge <dataset1> (in=in_a)
<dataset2> (in=in_b); by <var(s)>;
*** we only want a list of records with data in table a AND in table b; if in_a and in_b; run;
SHRUG, 2014-05-02 26
Syntax: Multiple table PROC SQL using temporary tables proc sql feedback; create table <result table name> as select <columns from either or both tables below> from
*** temporary table from table1;
(select <column(s)> from <table 1> where <apply filter(s)>) a inner join
*** temporary table from table2;
(select <column(s)> from <table 2> where <apply filter(s)>) b on a.<pk>=b.<pk>
; quit;
SHRUG, 2014-05-02 27
Syntax: Multiple table PROC SQL. “Oraclestyle (PL/SQL)” proc sql feedback; create table <result table> as select <columns from one or more tables> from <table1> a
, <table2> b where a.pk=b.pk
<apply additional filters>
; quit;
SHRUG, 2014-05-02 28
RIGHT join – join Table A to B only if an observation exists in Table B
SHRUG, 2014-05-02 29
PROC SQL
no need to sort temporary tables needed to think about type of join
in this case wanted patients only if they were in both tables join columns need to be same type but can have different names (slides 6 and 9)
DATA STEP
needed to sort data by subjectID
key variable to join demographic to cancersite table
“by” variables need to have same name and type
What would you do if you found out that one record in table 1 matched to multiple records in table 2?
SHRUG, 2014-05-02 30
Table relationships are important:
one-to-one: each record in first table has a maximum of one record in the second table (through primary key)
one-to-many: each record in one table may have multiple rows in second table. Example:
Table 1 contains all patients with a flag indicating whether or not they are “active”
Table 2 contains all GP appointments for each patient
many-to-many
SHRUG, 2014-05-02 31
Task 1 - single table
Task 2 – two tables
Task 3 – multiple tables
Task 4 – reusing a table multiple times
SHRUG, 2014-05-02 32
You are asked to provide a count of the female participants in a cancer screening program who are aged 50 years as of
May 31, 2013. Break down the birth dates by month
Approach 1
Create view of the table required and use SAS to analyze data
SHRUG, 2014-05-02 33
demographic data for participants is stored in table
“PARTICIPANTS”
sex_cd is a coded variable: 222=F, 223=M, 240=U
birth_dt is the column containing birth dates
although birth_dt appears as a date type column, in SAS
Oracle dates are datetime types in SAS
For a participant to be considered 50 years of age on May
31, 2013, their birthday must occur between June 1, 1962 and May 31, 1963
SHRUG, 2014-05-02 34
Task 1a – Create view using pass-through code proc sql feedback noprint; connect to oracle as myconn (user=&userid password=&pw path=&path); create view participant as select * from connection to myconn
(select * from csprod.participant
where sex_cd= 222 and trunc(birth_dt) between to_date('19620601','YYYYMMDD‘) and to_date('19630501','YYYYMMDD‘) and del_dt is null
); disconnect from myconn; quit;
SHRUG, 2014-05-02 35
Create view participant
this syntax translates to “Create a view called ‘participant’
select *
‘*’ is a wildcard and means select all
where
“%”, “_” – see Task 1b
multiple (%) or single (_) byte of data, in contrast to the entire column. Only used to scan a column.
SHRUG, 2014-05-02 36
trunc()
recall that SAS will treat Oracle, Teradata, SQL dates as
DATETIME
trunc() is an Oracle function that looks only at the DATE part of the column
SHRUG, 2014-05-02 37
to_date(‘<yourdate>’, ‘<yourdate format>’)
in this example to_date(‘19620601’, ‘YYYYMMDD’)
take the string 19620601 and treat it as a date with the format YYYYMMDD
could use other formats: YYYYMONDD, MM-DD-YYYY, etc
BETWEEN operator – works as you would expect, includes both lower and upper limits specified
SHRUG, 2014-05-02 38
Task 1a – Contents of view
Variables in Creation Order
# Variable
1 PARTICIPANT_ID
2 FIRST_NAME
3 LAST_NAME
4 SEX_CD
5 BIRTH_DT
6 MIDDLE_NAME
20 DEL_USER_ID
21 DEL_REASON_CD
22 VERSION_NUM
23 RELIGION_CD
Type Len Flags Format
Num
Num
Num
Char
Num
Num
Num
8 P--
Char 100 P--
Char 100 P--
8 P--
8 P--
Char 100 P--
7 PRFD_NAME
8 PRFD_LANG_CD
9 NO_FUTURE_CONTACT_IND
Char
Num
Char
20 P--
8 P--
1 P--
10 EMAIL Char 100 P--
11 PRFD_CONTACT_METHOD_CD Num 8 P--
12 PART_STATUS_CD
13 STATUS_DT
14 STATUS_SOURCE_CD
Num
Num
Num
8 P--
8 P--
8 P--
Char 1 P-15 TWIN_IND
16 COMMENT_TXT
17 ROW_DT
18 ROW_USER_ID
19 DEL_DT
Char 200 P--
Num
Char
Num
8 P--
20 P--
8 P--
20 P--
8 P--
8 P--
8 P--
21.
$100.
$100.
21.
$100.
$20.
21.
$1.
$100.
21.
21.
21.
$1.
$200.
$20.
$20.
21.
21.
21.
Informat
21.
$100.
$100.
21.
Label
PARTICIPANT_ID
FIRST_NAME
LAST_NAME
SEX_CD
DATETIME20.
DATETIME20.
BIRTH_DT
$100.
$20.
21.
$1.
$100.
21.
21.
MIDDLE_NAME
PRFD_NAME
PRFD_LANG_CD
NO_FUTURE_CONTACT_IND
PRFD_CONTACT_METHOD_CD
PART_STATUS_CD
DATETIME20.
DATETIME20.
STATUS_DT
21.
$1.
$200.
STATUS_SOURCE_CD
TWIN_IND
COMMENT_TXT
DATETIME20.
DATETIME20.
ROW_DT
$20.
ROW_USER_ID
DATETIME20.
DATETIME20.
DEL_DT
$20.
21.
21.
21.
DEL_USER_ID
DEL_REASON_CD
VERSION_NUM
RELIGION_CD
SHRUG, 2014-05-02
NOTE: The contents show exactly the same columns as slide 31
39
dob Frequency Percent
June 1184 7.56
July 1207 7.71
August
September
October
1241
1232
1232
7.92
7.87
7.87
November
December
January
February
March
April
May
1104
1053
3657
1113
1327
1266
45
7.05
6.72
23.35
7.11
8.47
8.08
0.29
Cumulative
Frequency
1184
2391
3632
4864
6096
7200
8253
11910
13023
14350
15616
15661
Cumulative
Percent
7.56
15.27
23.19
31.06
38.92
45.97
52.70
76.05
83.16
91.63
99.71
100.00
• More detailed analysis uncovered that missing month/day combinations were defaulted to January 1
SHRUG, 2014-05-02 40
REQUEST
Find a list of CCI codes for hysterectomy
use of single table
example of PROC SQL with SAS data
filter using “like” operator
SHRUG, 2014-05-02 41
SHRUG, 2014-05-02 42
proc sql feedback; create table hystcd as select a.f1 as cci_code
, a.f3 as long_desc
, substr(a.f1,1,5) as rubric from work.cci_raw as a where upcase(a.f3) like '%HYSTERECTOMY%' or (upcase(a.f3) like '%EXCISION%' and upcase(a.f3) like '%UTERUS%') or (upcase(a.f3) like '%EXCISION%' and upcase(a.f3) like '%CERVIX%') order by a.f1
; quit;
SHRUG, 2014-05-02 43
RESULTING DATASET
SHRUG, 2014-05-02 44
Report the number of men and women who turned 60 as of May 31, 2013, enrolled in the colorectal cancer screening program. Do not include participants with unknown sex
participant table contains demographic information: sex, birth date
participant_program table contains data for participants and screening program
program_id=1 indicates colorectal cancer screening program program_status_cd=263 indicates that a participant is active in the program
SHRUG, 2014-05-02 45
CSPROD.PARTICIPANT
participant_id sex_cd ≠ 240 birth_dt between 01Jun1952 and 31May1953 del_dt is null
CSPROD.PARTICIPANT_PROGRAM
participant_id program_status_cd=263 program_id=1 del_dt is null
SHRUG, 2014-05-02 46
*** METHOD 2 - Oracle pass through. Simple code; proc sql feedback noprint; connect to oracle as myconn (user=&userid password=&pw path=&path); create table part60 as select * from connection to myconn
(select ptc.gender
, count(*) from (select participant_id
, sex_cd
SHRUG, 2014-05-02 47
*** METHOD 2 - Oracle pass through. Simple code;
, case when sex_cd=222 then 'F' else 'M' end as gender from csprod.participant
where trunc(birth_dt) between to_date('19520601','YYYYMMDD') and to_date('19530531','YYYYMMDD') and sex_cd <> 240 and del_dt is null) ptc
SHRUG, 2014-05-02 48
inner join
(select participant_id from csprod.participant_program
where program_id=1 and program_status_cd=263 and del_dt is null) pp on ptc.participant_id=pp.participant_id
group by ptc.gender
; disconnect from myconn; quit;
SHRUG, 2014-05-02 49
proc sql feedback noprint; connect to oracle as myconn
(user=&userid password=&pw path=&path); create table part60 as select * from connection to myconn
Create a SAS dataset called “part60” select all columns from the query (seen after connection statement)
SHRUG, 2014-05-02 50
(select ptc.gender
, count(*) from (select participant_id
, sex_cd
, case when sex_cd=222 then 'F' else 'M' end as gender from csprod.participant
where trunc(birth_dt) between to_date('19520601','Y
YYYMMDD') and to_date('19530531','Y
YYYMMDD‘) and sex_cd <> 240 and del_dt is null) ptc
Put these columns in the SAS dataset part60
Create a temporary table called ‘ptc’
Table PTC contains columns as listed from the
PARTICIPANT table, with the restrictions shown in the
WHERE clause
SHRUG, 2014-05-02 51
inner join
(select participant_id from csprod.participant_program
where program_id=1 and program_status_cd=263 and del_dt is null)pp on ptc.participant_id= pp.participant_id
group by ptc.gender
; disconnect from myconn; quit;
Create temporary table, ‘PP’ from
PARTICIPANT_PROGRAM with restrictions defined in the
WHERE clause
SHRUG, 2014-05-02 52
This example shows an inner join: want participants, and the # males and females participating in CRC screening program age 60 as of May 31, 2013
PTC
C
PP
• Area C is the result of the inner join
• Temporary table PTC: a subset of csprod.participant
• Temporary table PP: a subset of csprod.part_program
SHRUG, 2014-05-02 53
What will be the query result?
What’s the table/dataset name?
How many rows?
How many columns?
What are the columns called?
SHRUG, 2014-05-02 54
SHRUG, 2014-05-02 55
REQUEST
Find number of patients with invasive kidney cancer (ICD-O-
3=C64.9) diagnosed between 2008 and 2010. Breakdown counts by age and sex. Interested in age < 60 and age ≥ 60
BACKGROUND
remove any patients who were deleted
remove any tumors that were deleted
diagnoses are in table called “oldiagnostic”
sex is in table called “olpatient”
birth date in table called “person”
SHRUG, 2014-05-02 56
onco.oldiagnostic
personser ≠ deleted diagnosticser ≠ deleted dxstate=‘NS’ substr(icdohistocode,6,1)='3' year(dateinitial..) in (2008,
2009, 2010) onco.olpatient
personser olsex onco.person
personser persontype=‘patient’ datepart(dateofbirth)
SHRUG, 2014-05-02 57
proc sql feedback; create table onco_coh as select a.*
, b.olsex
, f.birth_dt
, floor(yrdif(f.birth_dt,a.initdx_dt,'act/act')) as ageatdx from
SHRUG, 2014-05-02 58
/*** get cases ***/
(select o.personser
, o.diagnosticser
, datepart(o.DateInitialDiagnosis) as initdx_dt format=date9.
, o.icdositecode
from onco.oldiagnostic
o where o.icdositecode in ('C64.9')
/*** only invasive cancers ***/ and substr(o.icdohistocode,6,1)='3' and 2010 and year(o.dateinitialdiagnosis) between 2008 and o.dxstate
='NS’
SHRUG, 2014-05-02 59
/*** patient not deleted ***/ and o.personser not in
(SELECT ps1.Personser
FROM onco.OlPatientSup ps1
WHERE ps1.PersonSer = o.PersonSer
and ps1.identifier = 'CCRPatientReportingStatu'
AND ps1.String IN ('04','05') and ps1.FieldSeq = 0)
SHRUG, 2014-05-02 60
/*** diagnosis not deleted ***/ and o.diagnosticser not in
(SELECT ds1.diagnosticser
FROM onco.OLdiagnosticsup ds1
WHERE ds1.PersonSer = o.PersonSer
and o.diagnosticser = ds1.diagnosticser
and ds1.identifier = 'CCRPrimaryReportingStatu'
AND ds1.String IN ('04','05') and ds1.FieldSeq = 0)) a
SHRUG, 2014-05-02 61
/*** get patient's sex ***/ left join
(select personser
, olsex from onco.olpatient) b on a.personser=b.personser
/*** get birth date ***/ left join
(select personser
, datepart(DateOfBirth) as birth_dt format=date9.
from onco.person
where lowcase(persontype)='patient') f on a.personser=f.personser
; quit;
SHRUG, 2014-05-02 62
Sex
M
F
Total
Age at diagnosis
Under 60 60 and older Total
128
76
204
247
147
394
375
223
598
SHRUG, 2014-05-02 63
•
•
•
•
Self-join
Correlated sub-query
Outer from and where
UNION
SHRUG, 2014-05-02 64
77 /* select candidates for babes becoming mothers */
78 proc
79 ; sql
80 create table Candidates as
81 select B1.BrthDate
82 , B1.BirthID
83 , B2.DLMBDate
84 , B2.ContctID
85 from SASDM.DelnBrth as B1 /* babes */
86 , SASDM.DelnBrth as B2 /* mums */
87 where B1.BrthDate = B2.DLMBDate;
NOTE: Table WORK.CANDIDATES created, with 855040 rows and 4 columns.
SHRUG, 2014-05-02 65
OB/Research has data in:
Clinical ultrasound db
Maternal serum screening db
Objective: find all mothers with abnormal screening and see if the ultrasound indicated risk for restricted growth
(small baby)
SHRUG, 2014-05-02 66
create table Work.WithAtlee as
/* VP data only available after 2003 not
2000 */ select One18.* /* 18-wk US */
, M.MO365/*perinatal data*/
, M.Wgt4Age
/* 45 lines omitted here */
SHRUG, 2014-05-02 67
, M.DLPrvNND
, M.DLPrvFTD /* no such variable as IUGR in */
, M.MotherID in /* previous pregnancy - back link */
( select Prev.MotherID
from SASDM.DelnBrth as Prev
, M.DLPrvLBW
, M.Prev_PTD
where Prev.MotherID = M.MotherID
and Prev.Wgt4Age in ( 1, 2)/* pick<5th*/ and Prev.BrthDate < M.BrthDate ) as Previous_IUGR
SHRUG, 2014-05-02 68
from Work.VP1USper18 as One18
, SASDM.Monster as M /*inner join-only want */ where (One18.ContctID = M.ContctID)
/* matches, in both databases, and linkable */ and ( M.DLDschD8 between
'01Oct2003'D and '30Sep2008'D ) )
/* 5 full years, 10/2003 on */ and ( not Major_Anom ); exclusion */
/* clarified this as complete
SHRUG, 2014-05-02 69
create table Work.NoPlace as select VS.*
, . as GA
, . as CountyNum from Work.VS_2302 as VS where VS.VS_Deaths_ID not in
( select VS_Deaths_ID from Work.FADBPlace
union select VS_Deaths_ID from Work.NSAPDPlace ) and ( VS.Age_Code in ( 1, 2 ) ) and (BrthD8 between '01Jan2010'D and '31Dec2010'D);
SHRUG, 2014-05-02 70
Review code, functions, operators
Recap learning objectives
SHRUG, 2014-05-02 71
proc sql skeleton
libname to connect to database
pass-through to connect to database
trunc(): an Oracle function
to_date(): an Oracle function
case-when: SQL statements
is null/is not null
wildcards: “*”, “_”
SHRUG, 2014-05-02 72
RDB tables normalization keys: primary/foreign joins
Connecting in SAS
ODBC
Oracle client server info from IT connecting: libname vs pass-through
PROC SQL compare to datastep single table multiple tables self-joins http://support.sas.com/documentation/cdl/en/acreldb/65247/PDF/default/acreldb.pdf
SHRUG, 2014-05-02 73
John Fahey
Biostatistician, Halifax shrug.president@gmail.com
Devbani Raha
Cancer Care Nova Scotia shrugexec@gmail.com
74 Photograph taken by Ken DeBacker, 2011
SHRUG, 2014-05-02