SQL

advertisement
This document is based on SAS SQL Procedure User’s Guide
What is SQL?
Structured Query Language (SQL) is a widely used language for retrieving and updating data in tables and/or views of those tables. It has its origins in and is
primarily used for retrieval of tables in relational databases. PROC SQL is the SQL implementation within the SAS System.
Purpose of SQL
PROC SQL enables you to perform the following tasks:








generate reports
generate summary statistics
retrieve data from tables or views
combine data from tables or views
create tables, views, and indexes
update the data values in PROC SQL tables
update and retrieve data from database management system (DBMS) tables
modify a PROC SQL table by adding, modifying, or dropping columns
Similarity with data step (We will come back to this at the end of this lecture note)
Retrieving Data from a Single Table
PROC SQL;
CREATE TABLE AS
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY;
When you construct a SELECT statement, you must specify the clauses in the above order. Only the SELECT and FROM clauses are required.
Libname mysasdir "C:\Users\anna\Documents\stat597.F13";
proc contents data=Sashelp.demographics;run;
proc sql;
create table twocontdata as
select Cont, sum(Pop)
from Sashelp.demographics
group by Cont
having Cont in (91, 95)
order by Cont;
proc print;run;
1.
The SELECT statement is the primary tool of PROC SQL. You use it to identify, retrieve, and manipulate columns of data from a table. You can also
use several optional clauses within the SELECT statement to place restrictions on a query. A SELECT clause lists the Name column and a FROM
clause lists the table in which the Name column resides.
2.
The WHERE clause enables you to restrict the data that you retrieve by specifying a condition that each row of the table must satisfy.
3.
The ORDER BY clause enables you to sort the output from a table by one or more Columns.
4.
The GROUP BY clause enables you to break query results into subsets of rows. When you use the GROUP BY clause, you use an aggregate function in
the SELECT clause or a HAVING clause to instruct PROC SQL how to group the data.
5.
The HAVING clause works with the GROUP BY clause to restrict the groups in a query’s results based on a given condition.
Selecting Columns in a Table
Syntax: SELECT <distinct> variable1 <attributes>, variable2 <attributes>, variable3 <attributes>
from table;
1.
Use an asterisk in the SELECT clause to select all columns in a table.
2.
3.
You can eliminate the duplicate rows from the results by using the DISTINCT keyword in the select clause
Variables can be variable names, constants in quotes, calculations, or conditional assigned values
proc sql outobs=10;
select 'The 2005 population for', ISONAME, 'is', pop
from Sashelp.demographics;
proc sql outobs=10;
select ISONAME, pop*popUrban as UrbanPopulation
from Sashelp.demographics;
proc sql outobs=10;
select ISONAME, pop*popUrban as UrbanPopulation, pop-calculated UrbanPopulation as
ruralPopulation
from Sashelp.demographics;
Here CALCULATED is a keyword indicating the variable following it is a variable calculated within the query. You can specify a calculated column only in a
SELECT clause or a WHERE clause.
You can use conditional logic within a query by using a CASE expression to conditionally assign a value. You can use a CASE expression anywhere that you
can use a column name.
proc sql outobs=150;
select ISONAME,
case
when pop > min(pop)+2/3*(max(pop)-min(pop)) then "large"
when min(pop)+2/3*(max(pop)-min(pop)) >= pop >
min(pop)+1/3*(max(pop)-min(pop))
then "medium"
else "small"
end as sizegroup
from Sashelp.demographics;
proc sql outobs=20;
select ISONAME,
case cont
when 91 then
when 92 then
when 93 then
when 94 then
when 95 then
"North America"
"South America"
"Europe"
"Africa"
"Asia"
when 96 then "Australia"
else "Antarctica"
end as continent
from Sashelp.demographics;
Variable attributes can be format=, label=, or length=, which determine how SAS data is displayed.
proc sql outobs=20;
select ISONAME,
case cont
when 91 then "North America"
when 92 then "South America"
when 93 then "Europe"
when 94 then "Africa"
when 95 then "Asia"
when 96 then "Australia"
else "Antarctica"
end as continent length=8
from Sashelp.demographics;
Sorting data: You can sort query results with an ORDER BY clause by specifying any of the
columns in the table, including unselected or calculated columns.
proc sql outobs=10;
select ISONAME, pop format=comma10.
from Sashelp.demographics;
order by pop;
proc sql outobs=10;
select ISONAME, pop*popUrban as UrbanPopulation
from Sashelp.demographics
order by UrbanPopulation desc;
Retrieving Rows That Satisfy a Condition: the WHERE clause enables you to retrieve only rows from a table that satisfy a condition. WHERE clauses can
contain any of the columns in a table, including unselected columns.
proc sql outobs=10;
select ISONAME, pop
from Sashelp.demographics
where cont = 93;
Summarizing Data: You can use an aggregate function (or summary function) to produce a statistical summary of data in a table. If you specify one
column as the argument to an aggregate function, then the values in that column are calculated. If you specify multiple arguments, then the arguments or
columns that are listed are calculated. You can use aggregate functions in the SELECT or HAVING clauses.
proc sql outobs=10;
select ISONAME, range(pop, pop*popurban) as RuralPopulation
from Sashelp.demographics
where calculated RuralPopulation < 10000
order by RuralPopulation desc;
proc sql outobs=10;
select ISONAME, pop, max(pop)
from Sashelp.demographics;
proc sql outobs=10;
select count(distinct cont)
from Sashelp.demographics;
Grouping Data: the GROUP BY clause groups data by a specified column or columns. When you use
a GROUP BY clause, you also use an aggregate function in the SELECT clause or in a HAVING clause to instruct PROC SQL in how to summarize the data for
each group. PROC SQL calculates the aggregate function separately for each group.
proc sql outobs=10;
select cont, sum(pop)
from Sashelp.demographics
group by cont;
Without the aggregate function, group by is the same as the order by statement.
Filtering Grouped Data: You can use a HAVING clause with a GROUP BY clause to filter grouped data. The HAVING clause affects groups in a way that
is similar to the way in which a WHERE clause affects individual rows. When you use a HAVING clause, PROC SQL displays only the groups that satisfy the
HAVING expression.
proc sql;
select Cont,
sum(Pop) as TotalPopulation format=comma16.,
count(*) as Count
from Sashelp.demographics
group by Cont
having count(*) gt 15
order by Cont;
Selecting Data from More Than One Table by Using Joins. Types of join:
inner join, outer join (left, right, and full), cross join and union join
data one;
input x y;
datalines;
1 2
2 3
;
data two;
input x z;
datalines;
2 5
3 6
4 9
;
proc sql;
select *
from one, two
where one.x=two.x;
proc sort data=Mapsgfk.world out=Mysasdir.worldnew nodupkey;
by cont id;
run;
proc sql;
select a.cont, a.id, pop, ISONAME, x, y
from Mysasdir.worldnew as a, Sashelp.demographics as b
where a.cont=b.cont and a.id=b.id;
proc sql;
select a.cont, a.id, pop, b.ISONAME, x, y, ISOalpha3
from Mysasdir.worldnew as a, Sashelp.demographics as Maps.names as c
where a.cont=b.cont=c.cont and a.id=b.id=c.id ;
Self join: the following example finds for each continent the continents whose population is less.
proc sql;
create table contpop as
select cont, sum(pop) as totalpop
from Sashelp.demographics
group by Cont;
select a.cont, a.totalpop, '|', b.cont, b.totalpop
from contpop as a, contpop as b
where a.totalpop > b.totalpop;
Outer joins are inner joins that are augmented with rows from one table that do not
match any row from the other table in the join. Use the ON clause instead
of the WHERE clause to specify the column or columns on which you are joining the
tables. However, you can continue to use the WHERE clause to subset the query result.
proc sql;
select a.statecode, a.city, a.pop, b.zip
from Mapsgfk.uscity a left join Sashelp.zipcode b
on a.statecode=b.statecode and a.city=b.city;
proc sql;
select a.statecode, a.city, a.pop, b.zip
from Mapsgfk.uscity a right join Sashelp.zipcode b
on a.statecode=b.statecode and a.city=b.city;
proc sql;
select a.statecode, a.city, a.pop, b.zip
from Mapsgfk.uscity a full join Sashelp.zipcode b
on a.statecode=b.statecode and a.city=b.city;
Cross join: A cross join is a Cartesian product; it returns the product of two tables.
proc sql;
select *
from one,two;
or equivalently
proc sql;
select *
from one cross join two;
Union join combines two tables without attempting to match rows.
proc sql;
select *
from one union join two;
Comparing DATA Step Match-Merges with PROC SQL Joins:
1. DATA Step Match-Merges are full joins
data merged;
merge one two;
by x;
run;
proc print data=merged noobs;
run;
2. DATA Step Match-Merges match according to the position when there are one-to-multiple matches, while PROC SQL
Joins do the cross join.
data FLTSUPER;
input Flight Supervisor $;
datalines;
145 Kang
145 Ramirez
150 Miller
150 Picard
155 Evanko
157 Lei
;
data FLTDEST;
input Flight Destination $;
datalines;
145 Brussels
145 Edmonton
150 Paris
150 Madrid
165 Seattle
;
data merged;
merge fltsuper fltdest;
by flight;
run;
proc print data=merged noobs;
title ’Table MERGED’;
run;
proc sql;
title ’Table JOINED’;
select *
from fltsuper s, fltdest d
where s.Flight=d.Flight;
Combining Queries with Set Operators: set operators stack query results. Set operators combine columns from two
queries based on their position in the referenced tables without regard to the individual column names. Columns in the
same relative position in the two queries must have the same data types. The column names of the tables in the first
query become the column names of the output table.
1.
2.
3.
4.
UNION produces all unique rows from both queries.
EXCEPT produces rows that are part of the first query only.
INTERSECT produces rows that are common to both query results.
OUTER UNION concatenates the query results.
data a;
input x y $;
datalines;
1 one
2 two
2 two
3 three
;
data b;
input x z $;
datalines;
1 one
2 two
4 four
;
proc sql;
select * from a
union
select * from b;
proc sql;
select * from a
union all /*the all option keeps duplicate rows*/
select * from b;
proc sql;
select * from a
except
select * from b;
proc sql;
select * from a
intersect
select * from b;
proc sql;
select * from a
outer union
select * from b;
proc sql;
select * from a
outer union corresponding /*the corresponding or corr option overlays columns with the same
names*/
select * from b;
Using PROC SQL with the SAS Macro Facility
If you specify a single macro variable in the INTO clause, then PROC SQL assigns the variable the value from the first row only of the appropriate column in the
SELECT list.
proc sql noprint;
select x, z
into :xmacro, :zmacro
from b;
%put &xmacro &zmacro;
proc sql noprint;
select x, z
into :xmacro1-:xmacro3, :zmacro1-:zmacro3
from b;
%put &xmacro2 &zmacro3;
Concatenate Values in Macro Variables with the SEPARATED BY keywords to specify a character to delimit the values in the macro variable.
proc sql noprint;
select z
into :zmacros separated by " "
from b;
%put &zmacros;
Example: Generate a subset for each flower type in the tropical sales data and print
proc sql;
select distinct variety
into :varieties separated by " "
from flower;
%macro print;
%let i=1;
%let variety=%scan(&varieties, &i);
%do %while ("&variety" ~= "");
proc print data=flower;
where variety="&variety";
run;
%let i=%eval(&i+1);
%let variety=%scan(&varieties, &i);
%end;
%mend;
%print;
Practical Problem-Solving with PROC SQL
Example1: You want to count the number of duplicate rows in a table and generate an output column that shows how many times each row occurs.
data duplicate;
input Obs LastName $ FirstName $ City $ State $;
datalines;
1 Smith John Richmond Virginia
2 Johnson Mary Miami Florida
3 Smith John Richmond Virginia
4 Reed Sam Portland Oregon
5 Davis Karen Chicago Illinois
6 Davis Karen Chicago Illinois
7 Thompson Jennifer Houston Texas
8 Smith John Richmond Virginia
9 Johnson Mary Miami Florida
;
proc sql;
select lastname, firstname, city, state, count(*)
from duplicate
group by lastname, firstname, city, state;
proc means data=duplicate n;
class lastname firstname city state;
run;
Example 2: Compute weighted average for females and males
data sample;
input Obs Value Weight Gender $;
datalines;
1 2893.35 9.0868 F
2 56.13 26.2171 M
3 901.43 -4.0605 F
4 2942.68 -5.6557 M
5 621.16 24.3306 F
6 361.50 13.8971 M
7 2575.09 29.3734 F
8 2157.07 7.0687 M
9 690.73 -40.1271 F
10 2085.80 24.4795 M
;
proc sql;
select sum(value*weight)/sum(weight) as weightedavg, gender
from sample
where weight > 0
group by gender;
proc means data=sample mean;
var value;
weight weight;
class gender;
run;
Example 3: You have two copies of a table. One of the copies has been updated. You want to see which rows have been changed.
data oldtable;
infile datalines dlm="" dsd;
input id Last $ First $ Middle $ Phone $ Location $;
datalines;
5463 Olsen Mary K. 661-0012 R2342
6574 Hogan Terence H. 661-3243 R4456
7896 Bridges Georgina W. 661-8897 S2988
4352 Anson Sanford "" 661-4432 S3412
5674 Leach Archie G. 661-4328 S3533
7902 Wilson Fran R. 661-8332 R4454
0001 Singleton Adam O. 661-0980 R4457
9786 Thompson Jack "" 661-6781 R2343
;
data newtable;
infile datalines dlm="" dsd;
input id Last $ First $ Middle $ Phone $ Location $;
datalines;
5463 Olsen Mary K. 661-0012 R2342
6574 Hogan Terence H. 661-3243 R4456
7896 Bridges Georgina W. 661-2231 S2987
4352 Anson Sanford "" 661-4432 S3412
5674 Leach Archie G. 661-4328 S3533
7902 Wilson Fran R. 661-8332 R4454
0001 Singleton Adam O. 661-0980 R4457
9786 Thompson John C. 661-6781 R2343
2123 Chen Bill W. 661-8099 R4432
;
proc sql;
create table one as
select *
from oldtable a
except
select *
from newtable b;
create table two as
select *
from newtable a
except
select *
from oldtable b;
create table three as
select * from one
outer union corr
select * from two;
proc print data=three; run;
Example 4: You are forming teams for a new league by analyzing the averages of bowlers when they were members of other bowling leagues. When
possible you will use each bowler’s most recent league average. However, if a bowler was not in a league last year, then you will use the bowler’s average from
the prior year.
data League1;
input Fullname $18. bowler AvgScore;
datalines;
Alexander Delarge 4224 164
John T Chance
4425 .
Jack T Colton
4264 .
1412 141
Andrew Shepherd
4189 185
;
data League2;
input First $7. Last $13. bowler AvgScore;
datalines;
Alex
Delarge
4224 156
Mickey Raymond
1412 .
4264 174
Jack
Chance
4425 .
Patrick O’Malley
4118 164
;
proc sql;
/*create table joined as*/
select
case
when a.fullname is missing then b.first||b.last
else a.fullname
end as fullname,
case
when a.avgscore is missing then b.avgscore
else a.avgscore
end as newscore,
case
when a.bowler is missing then b.bowler
else a.bowler
end as newbowler
from League1 as a full join League2 as b
on a.bowler=b.bowler;
Example 5: create output that shows the full name and ID number of each employee who has a supervisor, along with the full name and ID number of that
employee’s supervisor.
data employees;
input Obs ID $ LastName $ Name $ Supervisor $;
datalines;
1 1001 Smith John 1002
2 1002 Johnson Mary None
3 1003 Reed Sam None
4 1004 Davis Karen 1003
5 1005 Thompson Jennifer 1002
6 1006 Peterson George 1002
7 1007 Jones Sue 1003
8 1008 Murphy Janice 1003
9 1009 Garcia Joe 1002
;
proc sql;
select a.*, b.id as SupervisorID, b.Lastname, b.Name
from employees as a inner join employees as b
on a.supervisor=b.id;
Example 6: You want to analyze answers to a survey question to determine how each state responded. Then you
want to compute the percentage of each answer that a given state contributed. For example, what percentage
of all NO responses came from North Carolina?
data survey;
input obs state $ answer $;
datalines;
1 NY YES
2 CA YES
3 NC YES
4 NY YES
5 NY YES
6 NY YES
7 NY NO
8 NY NO
9 CA NO
10 NC YES
;
proc freq data=survey;
table state*answer;
run;
proc sql;
create table t1 as
select answer, count(*) as countanswer
from survey
group by answer;
create table t2 as
select state, answer, count(*) as countstate
from survey
group by answer, state;
create table percentagetable as
select t1.answer, t2.state, countstate/countanswer as percentage
from t1, t2
where t1.answer=t2.answer;
Example 7: There is one input table, called SALES, that contains detailed sales information.There is one
record for each sale for the first quarter that shows the site, product,invoice number, invoice amount, and
invoice date. You want to use this table to create a summary report that shows the sales for each product
for each month of the quarter.
data sales;
input Site $ Product $ Invoice $ InvoiceAmount InvoiceDate $;
datalines;
V1009 VID010 V7679 598.5 980126
V1019 VID010 V7688 598.5 980126
V1032 VID005 V7771 1070 980309
V1043 VID014 V7780 1070 980309
V421 VID003 V7831 2000 980330
V421 VID010 V7832 750 980330
V570 VID003 V7762 2000 980302
V659 VID003 V7730 1000 980223
V783 VID003 V7815 750 980323
V985 VID003 V7733 2500 980223
V966 VID001 V5020 1167 980215
V98 VID003 V7750 2000 980223
;
proc sql;
select distinct product, sum(InvoiceAmount) as totalsales, substr(InvoiceDate, 4,1) as month
from sales
group by calculated month, product;
Example 8: There is one input table, called CHORES, that contains the following data. You want to reorder
this chore list so that all the chores are grouped by season, starting with spring and progressing through
the year. Simply ordering by Season makes the list appear in alphabetical sequence: fall, spring, summer,
winter.
data chores;
input Project $ Hours Season $;
datalines;
weeding 48 summer
pruning 12 winter
mowing 36 summer
mulching 17 fall
raking 24 fall
raking 16 spring
planting 8 spring
planting 8 fall
sweeping 3 winter
edging 16 summer
seeding 6 spring
tilling 12 spring
aerating 6 spring
feeding 7 summer
rolling 4 winter
;
proc sql;
select Project, Hours, Season
from (select Project, Hours, Season,
case
when Season = ’spring’ then 1
when Season = ’summer’ then 2
when Season = ’fall’ then 3
when Season = ’winter’ then 4
else .
end as Sorter
from chores)
order by Sorter;
Download