Introduction to SQL
Session 2
Retrieving Data From Multiple Tables

DATA FROM MORE THAN ONE TABLE

Objective
• Select data from more than one table by joining tables together.
• Use subqueries to select data from one table based on data values from another table.
• Combine the results of more than one query by using set operators.

You can combine data from multiple sources by using two different methods, joins and subqueries.
• Combining data from multiple tables using joins
   o Inner joins
   o Outer joins
• Combining data from multiple tables using subqueries

JOINS

Definition
To select data from more than one table, join the tables in a query. The join does not affect the original tables. However, there is a good and a not-so-good way to go about this.

• Not-so-good way:

   proc sql;
      select *
         from one, two;
   quit;

   Here "one" and "two" represent two distinct tables.

• The problem: Joining tables in this way returns the Cartesian product of the tables. In other words, each row in the first table is combined with every row in the other. The result can be very large (its row count is the product of the two tables' row counts), so this is not recommended.

• Better way: Use an inner join. Imagine we have tables one and two.

   proc sql;
      select *
         from one, two
         where one.x = two.x;
   quit;

   Here "x" is a variable in both tables.

• Improvement: The inner join returns only the rows that match in both tables. To get that, we use the WHERE clause to limit the selection. Within the WHERE clause, name the columns that you want to be compared for matching values.

Notes:
1. Each column name (variable name) is preceded by its table name. Also, you can select data from more than two tables by separating the table names with commas within the FROM clause.
2. Let's use WidgeOne and Ucdavis2 as example tables now.

You can also use other operators in the WHERE clause of a join.
Examples
1. where Plant like 'D%';
   Returns only those Plant values that begin with D, by using the wildcard "%".
2. where Plant ne "Dallas";
   Returns values not equal to Dallas.
3. where Plant like 'D%' and Gender = 'Female';
   Returns only the rows that satisfy every condition in the WHERE clause.

Note: Most database products treat nulls as distinct entities that never match in a join. PROC SQL instead treats nulls as missing values and as matches for joins, so a missing value in one table matches a missing value of the same type in the other. Unless you have a specific need for this, you will want only nonmissing values. Therefore, use the IS NOT MISSING operator:

   where one.b = two.b and one.b is not missing;

This clause uses only matching b values that are not missing.

Example of Inner Join Using the ON and WHERE Clauses

   ods graphics on;
   ods rtf;
   proc sql;
      select a.plant, a.gender, a.position, a.jobgrade, a.yronjob, b.sex, b.gpa
         from sql.WidgeOne a inner join sql.UcDavis b
         on a.gender = b.sex
         where a.jobgrade in (9,7) and a.yronjob = 11.1;
   quit;
   ods rtf close;
   ods graphics off;

NOTE: The ODS statements are there so that the output can be copied easily. Also, notice that WidgeOne is given the alias "a" while UcDavis is given the alias "b".

INNER JOINS

Notes
• Select as many variables as necessary by listing them in the SELECT clause, separated by commas.
• Remember, the tables from which you retrieve data must be listed in the FROM clause.

OUTER JOINS

Definition
Outer joins are inner joins that are supplemented with rows from one table that do not match any row from the other table in the join, so the result includes more than the matching data. To do this, use the ON clause instead of the WHERE clause. However, you can use a WHERE clause in addition to subset the query result. There are left outer joins, right outer joins, and full outer joins.
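The difference between an inner join and an outer join is easiest to see on a tiny example. The following is only a sketch: work.left_tab and work.right_tab are hypothetical toy tables created for this illustration and are not part of the course data.

```sas
/* Hypothetical toy tables, created only for this illustration. */
data work.left_tab;
   input x a $;
   datalines;
1 a
2 b
4 d
;
run;

data work.right_tab;
   input x b $;
   datalines;
2 x
3 y
5 v
;
run;

proc sql;
   /* Inner join: keeps only the rows whose x appears in both tables. */
   title 'Inner join on x';
   select l.x, l.a, r.b
      from work.left_tab as l inner join work.right_tab as r
      on l.x = r.x
      order by l.x;

   /* Left outer join: keeps every row of the left-hand table;
      b is missing where the right-hand table has no matching x. */
   title 'Left outer join on x';
   select l.x, l.a, r.b
      from work.left_tab as l left join work.right_tab as r
      on l.x = r.x
      order by l.x;
quit;
title;
```

With these rows, the inner join should return only x = 2, while the left join should also keep x = 1 and x = 4 from the left-hand table, with b missing for both.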
LEFT OUTER JOINS

Definition
• Left outer joins list matching rows and rows from, you guessed it, the left-hand table (the first table listed in the FROM clause) that do not match any row in the right-hand table.
• A left join is specified with the keywords LEFT JOIN and ON.

Example

   proc sql;
      select a.plant, b.sex, b.gpa
         from sql.WidgeOne a left join sql.UcDavis b
         on a.gender = b.sex;
   quit;

The above code lists the plants from WidgeOne with the sex and GPA of people from UcDavis by using a left join. The left join lists all plants, whether or not UcDavis has a matching row.

Another Example

   proc sql;
      title 'Widgets and Students';
      /* The label 'Gender' is assigned to the variable sex from the second table. */
      select a.plant, a.jobgrade, b.sex 'Gender', b.tv, b.alcohol
         from sql.WidgeOne a left join sql.UcDavis b
         on a.Gender = b.sex
         where b.alcohol gt 10
         order by a.plant;
   quit;

The above code starts from the plants and job grades of the employees in WidgeOne and adds the matching data (based on the ON clause) from UcDavis. The result is then limited by the WHERE clause and ordered by Plant from WidgeOne. Because the WHERE clause is applied after the join, WidgeOne rows with no match in UcDavis have a missing b.alcohol and are dropped by "b.alcohol gt 10".

RIGHT OUTER JOINS

Definition
• A right join is specified with the keywords RIGHT JOIN and ON.
• It is the opposite of a left join.
• Nonmatching rows from the right-hand table are included with all matching rows in the output.
• The example for this one reverses the direction of the join: it uses a right join to select data.

Right Outer Join
A RIGHT JOIN is the opposite of a LEFT JOIN.
Nonmatching rows from the right-hand table (the second one listed) are included with all matching rows in the output:

   proc sql;
      title 'Widgets and Students';
      select a.plant, a.jobgrade, b.sex 'Gender', b.tv, b.alcohol
         from sql.WidgeOne2 a right join sql.UcDavis2 b
         on a.Gender = b.sex;
   quit;

FULL OUTER JOINS

A full outer join selects all matching and nonmatching rows from both tables.

TO DO
• Specify it with the keywords FULL JOIN and ON.
• The full outer join is another way to grab all information, or all information from a selected group.

EXAMPLE

   proc sql outobs = 10;
      title "Plant Locations and/or Employee GPAs";
      select Plant, GPA
         from sql.WidgeOne full join sql.UcDavis
         on Gender = Sex;
   quit;

The query selects Plant and GPA from all matching and nonmatching rows of the WidgeOne and UcDavis tables, and OUTOBS=10 limits the output to the first 10 rows encountered.

POSITION COUNTS

What if the position of the joined data matters to you? A DATA step merge of the data first might be in order.

Problem
You want to merge two tables and the position of the values is important.

Solution
Use a DATA step merge to combine the data based on the BY variable so that the values appear in the PROC SQL table in a way that makes sense.

*The following two slides explain the DATA step merge.

DATA Step to Merge Data
• Merging combines observations from two or more SAS data sets into a single observation in a new SAS data set.
• In match merging, use a BY statement to combine observations from the input data sets based on common values of the BY variable.
• There may be more than one variable in the BY statement.
   o SAS matches on the first variable listed, then on the subsequent variables.
   o To use a BY statement, remember to sort the data by the BY variables beforehand.

DATA Step to Merge Data: SAS Code

   /* Sort both data sets by the BY variable before merging. */
   proc sort data = sql.WidgeOne;
      by Gender;
   run;

   /* Rename Sex to Gender so that both data sets share the BY variable. */
   proc sort data = sql.UcDavis (rename = (Sex = Gender));
      by Gender;
   run;

   /* Match-merge the two sorted data sets on Gender. */
   data sql.NewSet;
      merge sql.WidgeOne sql.UcDavis;
      by Gender;
   run;

USING SUBQUERIES TO SELECT DATA
• A table join combines multiple tables into a new table.
• A subquery selects rows from one table based on values in another table.
• Another name for it is inner query.
• It is a query-expression that is nested as part of another query-expression.
• It is enclosed in parentheses.
• A subquery can return a single row and column or multiple rows and columns.
• It can be used in a WHERE or HAVING clause with a comparison operator.
• Depending on the clause that contains it, a subquery can return a single value or multiple values.

Single-Value Subqueries
Definition
• A single-value subquery returns a single row and column.
• It can be used in a WHERE or HAVING clause with a comparison operator.
• The subquery must return exactly one value, or else the query fails.
   o If this happens, an error message is written to the log.

Example

   proc sql;
      select plant, jobgrade
         from sql.WidgeOne
         where jobgrade in (select gpa
                               from sql.UcDavis
                               where gpa = 4);
   quit;

The query selects the plant and job grade of the WidgeOne rows whose job grade equals a GPA value returned by the subquery, here 4.0. Because several students may have a GPA of 4.0, the subquery can return more than one row, which is why the example uses IN rather than =. Notice that the output may consist of more than one row or column.

Multiple-Value Subqueries
Definition
• A multiple-value subquery can return more than one value from one column.
• It is used in a WHERE or HAVING expression that contains IN or a comparison operator that is modified by ANY or ALL.

Example

   proc sql;
      select gender, jobgrade
         from sql.WidgeOne
         where jobgrade in (select computer
                               from sql.UcDavis);
   quit;

Notice that you are selecting variables from the first table (gender and jobgrade from WidgeOne) based on criteria from the second table (computer from UcDavis).

JOINS VERSUS SUBQUERIES
• A table join combines multiple tables into a new table.
• A subquery selects rows from one table based on values in another.
• A subquery, or inner query, is nested as part of another query expression.

COMBINING QUERIES WITH SET OPERATORS

PROC SQL can combine the results of two or more queries in various ways by using the following set operators:
• UNION: produces all unique rows from both queries.
• EXCEPT: produces rows that are part of the first query only.
• INTERSECT: produces rows that are common to both query results.
• OUTER UNION: concatenates the query results.

TO DO
• Place a semicolon after the last SELECT statement only.
   o Set operators combine columns from two queries based on their position in the referenced tables without regard to the individual column names.
   o The column names from the first query become the column names of the output table.
• Try each set of code in turn.
• It is helpful to think of the operators as one does in set theory.
   Ask yourself, "What part of the two sets is captured?"
• Draw two intersecting circles representing the tables, shading the relevant areas as in set theory.

UNION
Combines two query results.

   proc sql;
      title 'WidgeOne UNION UcDavis';
      select jobgrade from sql.WidgeOne
      union
      select computer from sql.UcDavis;
   quit;

You can also use the ALL keyword to request that duplicate rows remain in the output.

   select jobgrade from sql.WidgeOne
   union all
   select computer from sql.UcDavis;

EXCEPT
Returns rows that result from the first query but not the second.

   proc sql;
      title 'WidgeOne EXCEPT UcDavis';
      select jobgrade from sql.WidgeOne
      except
      select gpa from sql.UcDavis;
   quit;

The above code returns rows that result from WidgeOne but not from UcDavis. EXCEPT does not return duplicate rows that do not occur in the second query. Adding ALL keeps any duplicate rows that do not occur in the second query.

   select * from sql.WidgeOne
   except all
   select * from sql.UcDavis;

INTERSECT
Works like the mathematical operator: returns rows from the first query that also occur in the second.

   proc sql;
      title 'WidgeOne INTERSECT UcDavis';
      select jobgrade from sql.WidgeOne
      intersect
      select computer from sql.UcDavis;
   quit;

The above code returns rows that result from both WidgeOne and UcDavis, but none that occur in only one of them. Again, adding ALL means that the output contains the rows produced by the first query that are matched one-to-one with a row produced by the second query. In this example, the output would match that of the above code.

OUTER UNION
Concatenates the results of the queries.
   proc sql;
      title 'WidgeOne OUTER UNION UcDavis';
      select gender from sql.WidgeOne
      outer union
      select gpa from sql.UcDavis;
   quit;

The above code returns the rows that result from WidgeOne followed by the rows from UcDavis. Notice that OUTER UNION does not overlay columns from the two tables. To overlay columns that have the same name, use the CORRESPONDING (CORR) keyword.

   select * from sql.WidgeOne
   outer union corr
   select * from sql.UcDavis;

CLOSING NOTES
• It's been my experience that the two variables must be the same type when using set operators.
• A join or a subquery is used whenever you reference information from multiple tables.
• Use a subquery when the result that you want requires more than one query and each subquery provides a subset of the table involved in the query.
• If a membership question is asked, then a subquery is usually used. If the query requires a NOT EXISTS condition, then you must use a subquery because NOT EXISTS operates only in a subquery; the same principle holds true for the EXISTS condition.

CULMINATING CODE EXAMPLE

   ods graphics on;
   ods rtf;
   proc sql;
      create table sql.table as
      select a.plant, a.position, a.jobgrade,
             a.Post_Training_Productivity as Total format=comma14.,
             a.yronjob, b.gpa
         from sql.WidgeOne a inner join sql.UcDavis1 b
         on a.gender = b.gender
         where a.jobgrade in (9,7) and a.yronjob = 11.1;
   quit;
   ods rtf close;
   ods graphics off;
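The closing notes mention that a NOT EXISTS condition works only with a subquery. A minimal sketch of such a membership question follows. The tables work.staff and work.trained are hypothetical, invented for this illustration, and are not part of the course data.

```sas
/* Hypothetical toy tables, created only for this illustration. */
data work.staff;
   input id name $;
   datalines;
1 Ann
2 Bob
3 Cy
;
run;

data work.trained;
   input id;
   datalines;
1
3
;
run;

proc sql;
   title 'Staff with no row in TRAINED';
   /* The subquery is evaluated for each row of STAFF; a row is kept
      only when the subquery finds no TRAINED row with the same id. */
   select s.id, s.name
      from work.staff as s
      where not exists
         (select *
             from work.trained as t
             where t.id = s.id);
quit;
title;
```

With these rows, only Bob (id 2) should be listed, because ids 1 and 3 both appear in the hypothetical TRAINED table. Dropping the NOT turns the query into the EXISTS form of the same membership question.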