This document is based on the SAS SQL Procedure User’s Guide.

What is SQL?

Structured Query Language (SQL) is a widely used language for retrieving and updating data in tables and in views of those tables. It originated in, and is primarily used with, relational databases. PROC SQL is the SQL implementation within the SAS System.

Purpose of SQL

PROC SQL enables you to perform the following tasks:
- generate reports
- generate summary statistics
- retrieve data from tables or views
- combine data from tables or views
- create tables, views, and indexes
- update the data values in PROC SQL tables
- update and retrieve data from database management system (DBMS) tables
- modify a PROC SQL table by adding, modifying, or dropping columns

Similarity with data step (We will come back to this later in this lecture note)

Retrieving Data from a Single Table

PROC SQL;
  CREATE TABLE
  AS SELECT
  FROM
  WHERE
  GROUP BY
  HAVING
  ORDER BY;

When you construct a SELECT statement, you must specify the clauses in the above order. Only the SELECT and FROM clauses are required.

Libname mysasdir "C:\Users\anna\Documents\stat597.F13";
proc contents data=Sashelp.demographics; run;

/* Total population of continents 91 and 95, stored in a new table */
proc sql;
  create table twocontdata as
  select Cont, sum(Pop)
    from Sashelp.demographics
    group by Cont
    having Cont in (91, 95)
    order by Cont;
proc print; run;

1. The SELECT statement is the primary tool of PROC SQL. You use it to identify, retrieve, and manipulate columns of data from a table. You can also use several optional clauses within the SELECT statement to place restrictions on a query. In its simplest form, a SELECT clause lists the columns to retrieve and a FROM clause names the table in which those columns reside.
2. The WHERE clause enables you to restrict the data that you retrieve by specifying a condition that each row of the table must satisfy.
3. The ORDER BY clause enables you to sort the output from a table by one or more columns.
4. The GROUP BY clause enables you to break query results into subsets of rows.
   When you use the GROUP BY clause, you use an aggregate function in the SELECT clause or a HAVING clause to tell PROC SQL how to summarize the data in each group.
5. The HAVING clause works with the GROUP BY clause to restrict the groups in a query’s results based on a given condition.

Selecting Columns in a Table

Syntax:

SELECT <distinct> variable1 <attributes>, variable2 <attributes>, variable3 <attributes>
  from table;

1. Use an asterisk in the SELECT clause to select all columns in a table.
2. You can eliminate duplicate rows from the results by using the DISTINCT keyword in the SELECT clause.
3. Variables can be variable names, constants in quotes, calculations, or conditionally assigned values.

/* Constants in quotes mixed with columns */
proc sql outobs=10;
  select 'The 2005 population for', ISONAME, 'is', pop
    from Sashelp.demographics;

/* A calculated column with an alias */
proc sql outobs=10;
  select ISONAME, pop*popUrban as UrbanPopulation
    from Sashelp.demographics;

/* Reusing a calculated column in the same SELECT clause */
proc sql outobs=10;
  select ISONAME, pop*popUrban as UrbanPopulation,
         pop-calculated UrbanPopulation as ruralPopulation
    from Sashelp.demographics;

Here CALCULATED is a keyword indicating that the variable following it was calculated within the query. You can specify a calculated column only in a SELECT clause or a WHERE clause.

You can use conditional logic within a query by using a CASE expression to conditionally assign a value. You can use a CASE expression anywhere that you can use a column name.
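To make the interplay of CASE and CALCULATED concrete, here is a minimal sketch on a small made-up table. The table scores and the columns name, score, and grade are hypothetical and are not part of the lecture data. The CASE expression assigns a grade, and CALCULATED lets the WHERE clause filter on that newly created column.

```sas
/* Hypothetical toy data, for illustration only */
data scores;
  input name $ score;
  datalines;
Ann 91
Bob 78
Cy 64
;
run;

proc sql;
  select name, score,
         case
           when score >= 90 then 'A'
           when score >= 70 then 'B'
           else 'C'
         end as grade
    from scores
    /* grade does not exist in scores, so WHERE must refer to it as CALCULATED */
    where calculated grade in ('A', 'B')
    /* ORDER BY can use the alias directly */
    order by grade, name;
quit;
```

Under these assumptions the query should list Ann with grade A and Bob with grade B, and drop Cy.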
/* Searched CASE: classify countries by population into three equal-width bands */
proc sql outobs=150;
  select ISONAME,
         case
           when pop > min(pop)+2/3*(max(pop)-min(pop)) then "large"
           when min(pop)+2/3*(max(pop)-min(pop)) >= pop > min(pop)+1/3*(max(pop)-min(pop)) then "medium"
           else "small"
         end as sizegroup
    from Sashelp.demographics;

/* Simple CASE: translate the continent code into a name */
proc sql outobs=20;
  select ISONAME,
         case cont
           when 91 then "North America"
           when 92 then "South America"
           when 93 then "Europe"
           when 94 then "Africa"
           when 95 then "Asia"
           when 96 then "Australia"
           else "Antarctica"
         end as continent
    from Sashelp.demographics;

Variable attributes can be format=, label=, or length=, which determine how SAS data is displayed.

/* Same query with length=8, which limits the new column to 8 characters */
proc sql outobs=20;
  select ISONAME,
         case cont
           when 91 then "North America"
           when 92 then "South America"
           when 93 then "Europe"
           when 94 then "Africa"
           when 95 then "Asia"
           when 96 then "Australia"
           else "Antarctica"
         end as continent length=8
    from Sashelp.demographics;

Sorting data: You can sort query results with an ORDER BY clause by specifying any of the columns in the table, including unselected or calculated columns.

proc sql outobs=10;
  select ISONAME, pop format=comma10.
    from Sashelp.demographics
    order by pop;

proc sql outobs=10;
  select ISONAME, pop*popUrban as UrbanPopulation
    from Sashelp.demographics
    order by UrbanPopulation desc;

Retrieving Rows That Satisfy a Condition: the WHERE clause enables you to retrieve only rows from a table that satisfy a condition. WHERE clauses can contain any of the columns in a table, including unselected columns.

proc sql outobs=10;
  select ISONAME, pop
    from Sashelp.demographics
    where cont = 93;

Summarizing Data: You can use an aggregate function (or summary function) to produce a statistical summary of data in a table. If you specify one column as the argument to an aggregate function, then the function summarizes the values in that column across rows. If you specify multiple arguments, then the function is applied to the listed arguments or columns within each row. You can use aggregate functions in the SELECT or HAVING clauses.
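The difference between one argument and several arguments is easiest to see on a tiny table. The sketch below uses a hypothetical table pairs with columns id, x, and y (made up for illustration, not part of the lecture data).

```sas
/* Hypothetical toy data, for illustration only */
data pairs;
  input id x y;
  datalines;
1 10 1
2 20 2
3 30 3
;
run;

proc sql;
  /* One argument: SUM runs down the column and returns a single row (60) */
  select sum(x) as total_x
    from pairs;

  /* Two arguments: SUM is evaluated within each row (11, 22, 33) */
  select id, sum(x, y) as row_sum
    from pairs;
quit;
```

The first query collapses the table to one value, while the second returns one value per row.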
/* RANGE with two arguments is evaluated within each row */
proc sql outobs=10;
  select ISONAME, range(pop, pop*popurban) as RuralPopulation
    from Sashelp.demographics
    where calculated RuralPopulation < 10000
    order by RuralPopulation desc;

/* MAX with one argument is evaluated over the whole column */
proc sql outobs=10;
  select ISONAME, pop, max(pop)
    from Sashelp.demographics;

proc sql outobs=10;
  select count(distinct cont)
    from Sashelp.demographics;

Grouping Data: the GROUP BY clause groups data by a specified column or columns. When you use a GROUP BY clause, you also use an aggregate function in the SELECT clause or in a HAVING clause to instruct PROC SQL in how to summarize the data for each group. PROC SQL calculates the aggregate function separately for each group.

proc sql outobs=10;
  select cont, sum(pop)
    from Sashelp.demographics
    group by cont;

Without an aggregate function, PROC SQL treats the GROUP BY clause as if it were an ORDER BY clause.

Filtering Grouped Data: You can use a HAVING clause with a GROUP BY clause to filter grouped data. The HAVING clause affects groups in a way that is similar to the way in which a WHERE clause affects individual rows. When you use a HAVING clause, PROC SQL displays only the groups that satisfy the HAVING expression.

proc sql;
  select Cont, sum(Pop) as TotalPopulation format=comma16., count(*) as Count
    from Sashelp.demographics
    group by Cont
    having count(*) gt 15
    order by Cont;

Selecting Data from More Than One Table by Using Joins
Types of join: inner join, outer join (left, right, and full), cross join, and union join.

data one;
  input x y;
  datalines;
1 2
2 3
;

data two;
  input x z;
  datalines;
2 5
3 6
4 9
;

/* Inner join: only rows whose x value appears in both tables */
proc sql;
  select *
    from one, two
    where one.x=two.x;

proc sort data=Mapsgfk.world out=Mysasdir.worldnew nodupkey;
  by cont id;
run;

proc sql;
  select a.cont, a.id, pop, ISONAME, x, y
    from Mysasdir.worldnew as a, Sashelp.demographics as b
    where a.cont=b.cont and a.id=b.id;

/* Inner join of three tables; each pair of keys is compared explicitly */
proc sql;
  select a.cont, a.id, pop, b.ISONAME, x, y, ISOalpha3
    from Mysasdir.worldnew as a, Sashelp.demographics as b, Maps.names as c
    where a.cont=b.cont and b.cont=c.cont
      and a.id=b.id and b.id=c.id;

Self join: the following example lists, for each continent, the continents that have a smaller total population.

proc sql;
  create table contpop as
  select cont, sum(pop) as totalpop
    from Sashelp.demographics
    group by Cont;

  select a.cont, a.totalpop, '|', b.cont, b.totalpop
    from contpop as a, contpop as b
    where a.totalpop > b.totalpop;

Outer joins are inner joins that are augmented with rows from one table that do not match any row from the other table in the join. Use the ON clause instead of the WHERE clause to specify the column or columns on which you are joining the tables. However, you can continue to use the WHERE clause to subset the query result.

proc sql;
  select a.statecode, a.city, a.pop, b.zip
    from Mapsgfk.uscity a left join Sashelp.zipcode b
    on a.statecode=b.statecode and a.city=b.city;

proc sql;
  select a.statecode, a.city, a.pop, b.zip
    from Mapsgfk.uscity a right join Sashelp.zipcode b
    on a.statecode=b.statecode and a.city=b.city;

proc sql;
  select a.statecode, a.city, a.pop, b.zip
    from Mapsgfk.uscity a full join Sashelp.zipcode b
    on a.statecode=b.statecode and a.city=b.city;

Cross join: A cross join is a Cartesian product; it returns the product of two tables.

proc sql;
  select * from one, two;

or equivalently

proc sql;
  select * from one cross join two;

Union join combines two tables without attempting to match rows.
proc sql;
  select * from one union join two;

Comparing DATA Step Match-Merges with PROC SQL Joins:

1. A DATA step match-merge behaves like a full join: rows that have no match in the other table are kept.

data merged;
  merge one two;
  by x;
run;
proc print data=merged noobs; run;

2. When a BY value occurs more than once in both tables, a DATA step match-merge pairs the rows by position within the BY group, while a PROC SQL join returns every combination of the matching rows (a cross join within the group).

data FLTSUPER;
  input Flight Supervisor $;
  datalines;
145 Kang
145 Ramirez
150 Miller
150 Picard
155 Evanko
157 Lei
;

data FLTDEST;
  input Flight Destination $;
  datalines;
145 Brussels
145 Edmonton
150 Paris
150 Madrid
165 Seattle
;

data merged;
  merge fltsuper fltdest;
  by flight;
run;
proc print data=merged noobs;
  title 'Table MERGED';
run;

proc sql;
  title 'Table JOINED';
  select *
    from fltsuper s, fltdest d
    where s.Flight=d.Flight;

Combining Queries with Set Operators: set operators stack query results. Set operators combine columns from two queries based on their position in the referenced tables without regard to the individual column names. Columns in the same relative position in the two queries must have the same data types. The column names of the tables in the first query become the column names of the output table.

1. UNION produces all unique rows from both queries.
2. EXCEPT produces rows that are part of the first query only.
3. INTERSECT produces rows that are common to both query results.
4. OUTER UNION concatenates the query results.
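Because set operators line columns up by position, the order of the select list matters more than the column names. The sketch below illustrates this with two hypothetical tables, first_half and second_half (made-up names and data), that store the same two columns in a different order. Listing the columns explicitly puts each one in the same position in both queries.

```sas
/* Hypothetical toy data: same columns, different physical order */
data first_half;
  input id name $;
  datalines;
1 Ann
2 Bob
;
run;

data second_half;
  input name $ id;
  datalines;
Bob 2
Cy 3
;
run;

proc sql;
  /* SELECT * would pair id (numeric) with name (character) by position, */
  /* which PROC SQL is expected to reject as a type mismatch.            */
  /* Naming the columns aligns them, and UNION then removes the          */
  /* duplicate (2, Bob) row.                                             */
  select id, name from first_half
  union
  select id, name from second_half;
quit;
```

With these assumptions the result should contain three rows: (1, Ann), (2, Bob), and (3, Cy).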
data a;
  input x y $;
  datalines;
1 one
2 two
2 two
3 three
;

data b;
  input x z $;
  datalines;
1 one
2 two
4 four
;

proc sql;
  select * from a
  union
  select * from b;

proc sql;
  select * from a
  union all /* the ALL option keeps duplicate rows */
  select * from b;

proc sql;
  select * from a
  except
  select * from b;

proc sql;
  select * from a
  intersect
  select * from b;

proc sql;
  select * from a
  outer union
  select * from b;

proc sql;
  select * from a
  outer union corresponding /* the CORRESPONDING (or CORR) option overlays columns with the same names */
  select * from b;

Using PROC SQL with the SAS Macro Facility

If you specify a single macro variable in the INTO clause, then PROC SQL assigns the variable the value from the first row only of the appropriate column in the SELECT list.

proc sql noprint;
  select x, z into :xmacro, :zmacro
    from b;
%put &xmacro &zmacro;

/* A range of macro variables receives one row per variable */
proc sql noprint;
  select x, z into :xmacro1-:xmacro3, :zmacro1-:zmacro3
    from b;
%put &xmacro2 &zmacro3;

To concatenate the values from all rows into one macro variable, use the SEPARATED BY keywords to specify the character that delimits the values.

proc sql noprint;
  select z into :zmacros separated by " "
    from b;
%put &zmacros;

Example: Generate a subset for each flower type in the tropical sales data and print it.

proc sql;
  select distinct variety into :varieties separated by " "
    from flower;

%macro print;
  %let i=1;
  %let variety=%scan(&varieties, &i);
  /* Loop until %scan returns an empty string (no more varieties) */
  %do %while ("&variety" ~= "");
    proc print data=flower;
      where variety="&variety";
    run;
    %let i=%eval(&i+1);
    %let variety=%scan(&varieties, &i);
  %end;
%mend;
%print;

Practical Problem-Solving with PROC SQL

Example 1: You want to count the number of duplicate rows in a table and generate an output column that shows how many times each row occurs.
data duplicate;
  input Obs LastName $ FirstName $ City $ State $;
  datalines;
1 Smith John Richmond Virginia
2 Johnson Mary Miami Florida
3 Smith John Richmond Virginia
4 Reed Sam Portland Oregon
5 Davis Karen Chicago Illinois
6 Davis Karen Chicago Illinois
7 Thompson Jennifer Houston Texas
8 Smith John Richmond Virginia
9 Johnson Mary Miami Florida
;

proc sql;
  select lastname, firstname, city, state, count(*)
    from duplicate
    group by lastname, firstname, city, state;

proc means data=duplicate n;
  class lastname firstname city state;
run;

Example 2: Compute the weighted average for females and males.

data sample;
  input Obs Value Weight Gender $;
  datalines;
1 2893.35 9.0868 F
2 56.13 26.2171 M
3 901.43 -4.0605 F
4 2942.68 -5.6557 M
5 621.16 24.3306 F
6 361.50 13.8971 M
7 2575.09 29.3734 F
8 2157.07 7.0687 M
9 690.73 -40.1271 F
10 2085.80 24.4795 M
;

/* Rows with a non-positive weight are excluded from the weighted average */
proc sql;
  select sum(value*weight)/sum(weight) as weightedavg, gender
    from sample
    where weight > 0
    group by gender;

proc means data=sample mean;
  var value;
  weight weight;
  class gender;
run;

Example 3: You have two copies of a table. One of the copies has been updated. You want to see which rows have been changed.

/* Space-delimited data; with DSD, a quoted empty string "" is read as a missing value */
data oldtable;
  infile datalines dlm=" " dsd;
  input id Last $ First $ Middle $ Phone $ Location $;
  datalines;
5463 Olsen Mary K. 661-0012 R2342
6574 Hogan Terence H. 661-3243 R4456
7896 Bridges Georgina W. 661-8897 S2988
4352 Anson Sanford "" 661-4432 S3412
5674 Leach Archie G. 661-4328 S3533
7902 Wilson Fran R. 661-8332 R4454
0001 Singleton Adam O. 661-0980 R4457
9786 Thompson Jack "" 661-6781 R2343
;

data newtable;
  infile datalines dlm=" " dsd;
  input id Last $ First $ Middle $ Phone $ Location $;
  datalines;
5463 Olsen Mary K. 661-0012 R2342
6574 Hogan Terence H. 661-3243 R4456
7896 Bridges Georgina W. 661-2231 S2987
4352 Anson Sanford "" 661-4432 S3412
5674 Leach Archie G. 661-4328 S3533
7902 Wilson Fran R. 661-8332 R4454
0001 Singleton Adam O. 661-0980 R4457
9786 Thompson John C. 661-6781 R2343
2123 Chen Bill W. 661-8099 R4432
;

proc sql;
  /* Rows that exist only in the old table */
  create table one as
  select * from oldtable a
  except
  select * from newtable b;

  /* Rows that exist only in the new table */
  create table two as
  select * from newtable a
  except
  select * from oldtable b;

  /* Stack both sets of changed rows */
  create table three as
  select * from one
  outer union corr
  select * from two;

proc print data=three; run;

Example 4: You are forming teams for a new league by analyzing the averages of bowlers when they were members of other bowling leagues. When possible you will use each bowler’s most recent league average. However, if a bowler was not in a league last year, then you will use the bowler’s average from the prior year.

/* Fullname is read from columns 1-18, so the data lines must stay column-aligned. */
/* The fourth row has a blank name.                                                */
data League1;
  input Fullname $18. bowler AvgScore;
  datalines;
Alexander Delarge  4224 164
John T Chance      4425 .
Jack T Colton      4264 .
                   1412 141
Andrew Shepherd    4189 185
;

/* First is read from columns 1-7 and Last from columns 8-20. */
/* The third row has blank names.                             */
data League2;
  input First $7. Last $13. bowler AvgScore;
  datalines;
Alex    Delarge      4224 156
Mickey  Raymond      1412 .
                     4264 174
Jack    Chance       4425 .
Patrick O'Malley     4118 164
;

/* For each column, take the League1 value and fall back to League2 when it is missing */
proc sql;
  /*create table joined as*/
  select case when a.fullname is missing then b.first||b.last else a.fullname end as fullname,
         case when a.avgscore is missing then b.avgscore else a.avgscore end as newscore,
         case when a.bowler is missing then b.bowler else a.bowler end as newbowler
    from League1 as a full join League2 as b
    on a.bowler=b.bowler;

Example 5: Create output that shows the full name and ID number of each employee who has a supervisor, along with the full name and ID number of that employee’s supervisor.

data employees;
  input Obs ID $ LastName $ Name $ Supervisor $;
  datalines;
1 1001 Smith John 1002
2 1002 Johnson Mary None
3 1003 Reed Sam None
4 1004 Davis Karen 1003
5 1005 Thompson Jennifer 1002
6 1006 Peterson George 1002
7 1007 Jones Sue 1003
8 1008 Murphy Janice 1003
9 1009 Garcia Joe 1002
;

/* Self join: table a holds the employee, table b holds the supervisor */
proc sql;
  select a.*, b.id as SupervisorID, b.Lastname, b.Name
    from employees as a inner join employees as b
    on a.supervisor=b.id;

Example 6: You want to analyze answers to a survey question to determine how each state responded.
Then you want to compute the percentage of each answer that a given state contributed. For example, what percentage of all NO responses came from North Carolina?

data survey;
  input obs state $ answer $;
  datalines;
1 NY YES
2 CA YES
3 NC YES
4 NY YES
5 NY YES
6 NY YES
7 NY NO
8 NY NO
9 CA NO
10 NC YES
;

proc freq data=survey;
  table state*answer;
run;

proc sql;
  /* Total count of each answer */
  create table t1 as
  select answer, count(*) as countanswer
    from survey
    group by answer;

  /* Count of each answer within each state */
  create table t2 as
  select state, answer, count(*) as countstate
    from survey
    group by answer, state;

  /* Share of each answer contributed by each state */
  create table percentagetable as
  select t1.answer, t2.state, countstate/countanswer as percentage
    from t1, t2
    where t1.answer=t2.answer;

Example 7: There is one input table, called SALES, that contains detailed sales information. There is one record for each sale for the first quarter that shows the site, product, invoice number, invoice amount, and invoice date. You want to use this table to create a summary report that shows the sales for each product for each month of the quarter.

data sales;
  input Site $ Product $ Invoice $ InvoiceAmount InvoiceDate $;
  datalines;
V1009 VID010 V7679 598.5 980126
V1019 VID010 V7688 598.5 980126
V1032 VID005 V7771 1070 980309
V1043 VID014 V7780 1070 980309
V421 VID003 V7831 2000 980330
V421 VID010 V7832 750 980330
V570 VID003 V7762 2000 980302
V659 VID003 V7730 1000 980223
V783 VID003 V7815 750 980323
V985 VID003 V7733 2500 980223
V966 VID001 V5020 1167 980215
V98 VID003 V7750 2000 980223
;

/* InvoiceDate is a yymmdd string; its 4th character is the month digit within the first quarter */
proc sql;
  select distinct product, sum(InvoiceAmount) as totalsales,
         substr(InvoiceDate, 4, 1) as month
    from sales
    group by calculated month, product;

Example 8: There is one input table, called CHORES, that contains the following data. You want to reorder this chore list so that all the chores are grouped by season, starting with spring and progressing through the year. Simply ordering by Season makes the list appear in alphabetical sequence: fall, spring, summer, winter.
data chores;
  input Project $ Hours Season $;
  datalines;
weeding 48 summer
pruning 12 winter
mowing 36 summer
mulching 17 fall
raking 24 fall
raking 16 spring
planting 8 spring
planting 8 fall
sweeping 3 winter
edging 16 summer
seeding 6 spring
tilling 12 spring
aerating 6 spring
feeding 7 summer
rolling 4 winter
;

/* The in-line view adds a numeric Sorter column; the outer query sorts by it without displaying it */
proc sql;
  select Project, Hours, Season
    from (select Project, Hours, Season,
                 case
                   when Season = 'spring' then 1
                   when Season = 'summer' then 2
                   when Season = 'fall' then 3
                   when Season = 'winter' then 4
                   else .
                 end as Sorter
            from chores)
    order by Sorter;
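As a possible alternative to the CASE expression, the custom order can be kept in a small lookup table and joined to the data. This is only a sketch: the tables chores_demo and season_order and their contents are hypothetical, chosen so that the example runs on its own, and the lecture's own solution remains the in-line view with CASE.

```sas
/* Hypothetical subset of the chores data, for illustration only */
data chores_demo;
  input Project $ Hours Season $;
  datalines;
weeding 48 summer
pruning 12 winter
mulching 17 fall
seeding 6 spring
;
run;

/* Hypothetical lookup table that stores the desired season order */
data season_order;
  input Season $ Sorter;
  datalines;
spring 1
summer 2
fall 3
winter 4
;
run;

proc sql;
  /* Sort by the lookup value; Sorter itself is not selected.  */
  /* A season missing from season_order gets a missing Sorter  */
  /* and therefore sorts first.                                */
  select c.Project, c.Hours, c.Season
    from chores_demo as c left join season_order as s
    on c.Season = s.Season
    order by s.Sorter, c.Project;
quit;
```

Under these assumptions the rows should appear in the order seeding, weeding, mulching, pruning. One reason to prefer a lookup table is that the ordering lives in data rather than in code, so it can be changed without editing the query.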