Introduction to SQL
Session 1: Retrieving Data From a Single Table

What is SQL?
Definition of SQL:
• The original Structured Query Language was designed at an IBM research center in 1974-75 and introduced commercially by Oracle in 1979.
• There are different dialects of SQL, but it remains as close to a standard query language as you will get.
• Some standard SQL commands are as follows:
  o SELECT
  o INSERT
  o UPDATE
  o DELETE
  o CREATE
  o DROP

What is SQL?
Uses:
PROC SQL is used for the following tasks:
• Generate reports
• Generate summary statistics
• Retrieve data from tables or views
• Combine data from tables or views
• Create tables, views, and indexes
• Update the data values in PROC SQL tables
• Update and retrieve data from database management system (DBMS) tables

What is SQL?
Definition of SAS SQL:
• Structured Query Language (SQL) is a language that is used to retrieve and update data. SQL works with relational tables and databases.
• The SQL procedure is the implementation of SQL as a SAS procedure, PROC SQL.
• PROC SQL can replace many SAS procedures or DATA steps.
• Many options, functions, informats, and formats can be used within PROC SQL.
• This course focuses on using SQL inside of SAS.

What is a Table?
As review, a SAS data file is a type of SAS data set that contains both the data values and the descriptor information. SQL works from tables.
Definition: A PROC SQL table is the same as a SAS data file. It can be thought of as having two dimensions:
• Rows - Observations
• Columns - Variables

TABLE OF CONTENTS
Retrieving Data From A Single File
  Clauses
    SELECT and FROM
    WHERE
    ORDER
    HAVING
Useful Tools
  Eliminating Duplicate Rows
  Determining Structure
  Adding Text To Output
  Calculating Values
  Specific Clauses

Retrieving Data From Single Table: Clauses
1. SELECT and FROM
The SELECT statement is the primary tool of PROC SQL. A PROC SQL query must contain a SELECT clause and a FROM clause.
The following is sufficient for a workable procedure:

    proc sql;
        select Name
            from sql.FileName;
    quit;

where Name is the desired variable identifier.
Notice: There is no semicolon after the SELECT clause. SELECT and FROM are clauses of one statement, and the statement is read in its entirety, so the semicolon goes only at the end of the whole statement.

SELECT
• You select the columns (data) that you want from the source file.
FROM
• You name the file (table) from which the data come.
* Let's use the table (data set) Widgeone for this course.

2. WHERE
The WHERE clause is a restriction tool used to limit the amount of retrieved data. Using WHERE produces a table with only those rows that satisfy the condition of the clause:

    proc sql;
        title 'Widgeone Table';
        select Plant
            from widget.widgeone
            where position = "HRLY" and jobgrade = 5;
    quit;

Note: Both conditions must be satisfied for a row to be printed in the report. The conditions are joined by the word "and," which shows that more than one condition limits the data. Also, widget is the library (libref) created for the course tables; once it has been assigned, it does not need to be assigned again.

3. ORDER
The ORDER BY clause sorts the rows of the result by the values of one or more columns, in ascending or descending order (alphabetical for character values, numerical for numeric values).

    proc sql;
        select Plant, jobgrade, gender, Pre_Training_Productivity
            from widget.widgeone
            where jobgrade in (5,6,7)
            order by plant;
    quit;

Note: If ordering by multiple columns, separate the variables with commas (order by jobgrade, Pre_Training_Productivity desc;).

4. GROUP BY
This clause enables the user to break table results into subsets of rows:

    group by Plant;

5. HAVING
The HAVING clause is another way to restrict the query results; it works along with the GROUP BY clause. Together, they return only the groups that satisfy the condition given in the HAVING clause.
For example:

    proc sql;
        select jobgrade, count(*) as Count
            from widget.widgeone
            group by jobgrade
            having jobgrade gt 5;
    quit;

Retrieving Data From Single Table: SAS Code

    proc sql;
        title 'Widgeone Table';
        create table sql.table as
        select Plant,
               SUM(Post_Training_Productivity) format=comma14. as TotalProductivity,
               count(*) as Count
            from widget.widgeone
            group by plant
            having count(*) gt 17;
    quit;

By inserting the CREATE TABLE clause, the results go to the library sql with the name Table. This is a good way to put the results in a more permanent place. Also, count(*) is an aggregate function that counts the rows in each group, so the HAVING clause keeps only the plants with more than 17 rows. In this case, it will return Dallas only. Notice that several variables can be listed in the SELECT statement, including formatted functions.

USEFUL TOOLS
• Eliminating duplicate rows from the table
To get query results with only unique values, add the word "distinct" to the SELECT clause:

    proc sql;
        select distinct Plant,
               Avg(Post_Training_Productivity) as Average format=8.2
            from Widget.WidgeOne;
    quit;

Note: Because this query has no GROUP BY clause, Avg() is computed over the whole table and the same value is repeated for each distinct Plant. Add group by Plant to get one average per plant.

• Determining the structure of a table
The DESCRIBE TABLE statement allows the user to see information about the data within the file. The information is written to the SAS log.

    proc sql;
        describe table Widget.WidgeOne;
    quit;

• Adding text to output
  o By putting string text within single quotation marks, you can add text to your tables within the rows:

    proc sql;
        select 'Factory', Plant, 'in Production'
            from Widget.WidgeOne;
    quit;

    where Plant is a variable name.
  o By putting in a TITLE statement, you can add a title to your table:

    proc sql;
        title 'Widgets Work!';
        select plant, Gender
            from sql.filename;
    quit;

USEFUL TOOLS: CALCULATING VALUES
• Calculated data within PROC SQL:
  o By putting the calculation within the procedure, values from the data set can be manipulated, with the end result appearing within the table.
  o We have already seen an example, from Slide 13, using Average.
  o Here is another example with a title:

    proc sql;
        title 'What is the Sum of Training Hours?';
        select distinct Plant,
               sum(Post_Training_Productivity) as Sum format=8.2
            from Widget.WidgeOne;
    quit;

USEFUL TOOLS: SUMMARIZING DATA
• AVG, MEAN        mean or average of data
• COUNT, FREQ, N   number of nonmissing values
• MAX/MIN          largest/smallest value
• NMISS            number of missing values
• RANGE            range of values
• STD              standard deviation
• SUM              sum of values
• T                Student's t value for testing H0: population mean = 0
• CSS              corrected sum of squares
• VAR              variance of data

USEFUL TOOLS: SUMMARIZING DATA
• How to use summarizing data, two examples:
  o MEAN

    proc sql;
        select distinct Plant,
               Avg(Post_Training_Productivity) as Average format=8.2
            from Widget.WidgeOne;
    quit;

    Note: Notice that the variable is inside parentheses, as with all functions. If you reference a calculated column by its alias elsewhere in the query, put the word "calculated" before it. A column computed with a summary function, such as Average, can be referenced this way in a HAVING clause but not in a WHERE clause, because WHERE is applied to individual rows before any summarizing takes place.
  o COUNT

    proc sql;
        select count(distinct Plant) as Count
            from Widget.WidgeOne;
    quit;

    where "distinct" is used to count nonduplicates.

USEFUL TOOLS: THE WHERE CLAUSE
• The WHERE clause is very flexible and can be used to select your data in very specific ways.
  o Excluding missing data: IS MISSING and IS NOT MISSING test whether a value is missing.
  o Used in conjunction with AND/OR to combine conditions:

    select Plant, Post_Training_Productivity
        from widget.widgeone
        where Post_Training_Productivity lt 50
              and Plant is not missing;

    Note: Both conditions must be met. (lt means "less than.") The conditions in a WHERE clause must refer to row-level values; a condition on a summary value such as sum(Post_Training_Productivity) belongs in a HAVING clause.
  o Used in conjunction with the LIKE operator:

    select Name
        from widget.widgeone
        where Plant like 'D%';

    Note: The condition is satisfied if Plant begins with "D." The % is a wildcard that matches any sequence of zero or more characters.

USEFUL TOOLS: GROUPING
• Grouping by multiple columns
  o To group by multiple columns (variables), separate the column names with commas within the GROUP BY clause.
    where Plant is not missing
    group by Plant, Post_Training_Productivity;

USEFUL TOOLS: HAVING VERSUS WHERE
• The HAVING clause with the GROUP BY clause affects groups in a way that is similar to the way in which a WHERE clause affects individual rows.
• PROC SQL displays only the groups that satisfy the HAVING clause. It is helpful to know when to use HAVING and when to use WHERE.
• Unlike the WHERE clause, the HAVING clause can use aggregate functions (summarizing functions):

    select Plant,
           SUM(Post_Training_Productivity) format=comma14. as TotalProductivity,
           count(*) as Count
        from widget.widgeone
        group by plant
        having count(*) gt 17;

Note: SQL adds the row count of each group to the table (count(*) counts every row in the group, whether or not its values are missing). The HAVING clause then keeps only the plants whose count is greater than 17.

USEFUL TOOLS: HAVING VERSUS WHERE

HAVING                                      WHERE
__________________________________________________________________________
• Includes groups of rows                   Includes individual rows
• Must follow GROUP BY if used              Must precede any GROUP BY used
  with GROUP BY
• When no GROUP BY, acts like WHERE         Is not affected by GROUP BY clause
• Is processed after GROUP BY clause        Is processed before GROUP BY clause, if
  and any summations                        there is one, and before summations

PROC SQL Vs. SAS Program
Total Post Training Hours Grouped by Plant
• Let's see the difference between using PROC SQL and SAS procedures to find the total hours grouped by the two different plants.

• PROC SQL

    proc sql;
        select Plant,
               sum(Post_Training_Hours) as Total format=comma15.
            from sql.WidgeOne
            where Gender = 'Female'
            group by Plant
            order by Total;
    quit;

• SAS Program

    proc summary data = sql.WidgeOne;
        where Gender = 'Female';
        class Plant;
        var Post_Training_Hours;
        output out = sumPost sum = Total;
    run;

    proc sort data = SumPost;
        by Total;
    run;

    proc print data = SumPost noobs;
        var Plant Total;
        format Total comma15.;
        where _type_ = 1;
    run;

PROC SQL Vs. SAS Program
• The example shows that PROC SQL can achieve the same results as base SAS software, but often with fewer and shorter statements.
• The SELECT statement that is shown performs summation, grouping, sorting, and row selection. It also displays results without PROC PRINT.

SUMMARY
• The following code represents a summary of the clauses we discussed.

    proc sql;
        title 'Widgeone Table';
        create table widget.table as
        select Plant,
               sum(Post_Training_Productivity) format=comma14. as Total
            from widget.widgeone
            where Plant is not missing;
    quit;

SUMMARY CONTINUED
• Now we use HAVING and GROUP BY as well. Note that GROUP BY comes before HAVING.

    proc sql;
        title 'Widgeone Table';
        create table widget.table as
        select Plant,
               sum(Post_Training_Productivity) format=comma14. as Total,
               jobgrade
            from widget.widgeone
            group by plant
            having jobgrade gt 5;
    quit;
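
The difference between WHERE and HAVING can be tried out on a small table. The following is a minimal sketch, not part of the course materials: the table work.demo and all of its values are invented for illustration, and only the column names are borrowed from Widgeone. It assumes that a row-level condition goes in WHERE and that a condition on a summary value goes in HAVING, referenced through the CALCULATED keyword.

```sas
/* Hypothetical toy table: the column names mirror the course table,
   but work.demo and its values are invented for this sketch. */
data work.demo;
    length Plant $ 8;
    input Plant $ jobgrade Post_Training_Productivity;
    datalines;
Dallas 5 40
Dallas 6 55
Dallas 7 60
Austin 5 30
Austin 6 45
;
run;

proc sql;
    title 'WHERE filters rows, HAVING filters groups';
    select Plant,
           count(*) as Count,
           sum(Post_Training_Productivity) as Total
        from work.demo
        where jobgrade gt 5              /* applied to each row, before grouping */
        group by Plant
        having calculated Total gt 50    /* applied to each group, after summation */
        order by Plant;
quit;

title;   /* clear the title so it does not carry over to later output */
```

With these invented values, the WHERE clause first drops the two jobgrade 5 rows. The remaining rows form a Dallas group (Count 2, Total 115) and an Austin group (Count 1, Total 45), and the HAVING clause then keeps only Dallas. Moving the Total condition into the WHERE clause would not work, because WHERE is evaluated before the sums exist.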