SQL IN SAS 1

Introduction to SQL
Session 1
Retrieving Data From a Single Table
What is SQL?
Definition of SQL:
• The original Structured Query Language was designed
by an IBM research center in 1974-75 and introduced
commercially by Oracle in 1979.
• There are different dialects of SQL, but it remains as close
to a standard query language as you will get.
• Some standard SQL commands are as follows:
o SELECT
o INSERT
o UPDATE
o DELETE
o CREATE
o DROP
Uses:
PROC SQL is used for the following tasks:
• Generate reports
• Generate summary statistics
• Retrieve data from tables or views
• Combine data from tables or views
• Create tables, views, and indexes
• Update the data values in PROC SQL tables
• Update and retrieve data from database management
system (DBMS) tables.
Definition of SAS SQL:
• Structured Query Language (SQL) is a language that is
used to retrieve and update data. SQL works with relational
tables and databases.
• The SQL procedure is the SAS implementation of SQL,
invoked as PROC SQL.
• PROC SQL can replace many SAS procedures or DATA
steps.
• There are many options, functions, informats, and
formats that can be used within PROC SQL.
• This course focuses on using SQL inside of SAS.
What is a Table?
As a review, a SAS data file is a type of SAS
data set that contains both the data values and the
descriptor information.
PROC SQL works on tables.
Definition:
A PROC SQL table is the same as a SAS data file. It can be
thought of as having two dimensions:
• Rows – Observations
• Columns – Variables
TABLE OF CONTENTS
• Retrieving Data From a Single Table
  o Clauses
    - SELECT and FROM
    - WHERE
    - ORDER BY
    - GROUP BY
    - HAVING
  o Useful Tools
    - Eliminating Duplicate Rows
    - Determining Structure
    - Adding Text To Output
    - Calculating Values
    - Specific Clauses
Retrieving Data From Single Table:
Clauses
1. SELECT and FROM
The SELECT statement is the primary tool of PROC SQL. A query
must contain a SELECT clause and a FROM clause.
The following is sufficient for a workable procedure:
proc sql;
select Name
from sql.FileName;
quit;
Here Name is the column (variable) to retrieve and sql.FileName is the table.
Note: There is no semicolon after the SELECT clause. The SELECT
statement is read in its entirety, so the semicolon goes after
its last clause.
SELECT
• Names the columns to retrieve from the source table
FROM
• Names the table that supplies the data
* Let's use the table (data set) Widgeone for this course.
2. WHERE
The WHERE clause is a restriction tool used to limit the amount
of retrieved data. Using WHERE will produce a table with only
those rows that satisfy the condition of the clause:
proc sql;
title 'Widgeone Table';
select Plant
from widget.widgeone
where position = "HRLY" and jobgrade = 5;
quit;
Note: A row must satisfy both conditions to appear in the report. The word
“and” joins the conditions, which shows that more than one condition limits
the data. Also, widget is the library that holds the course tables; once it
has been assigned, it does not need to be assigned again for each procedure.
3. ORDER BY
The ORDER BY clause sorts the rows of the result by one or more
columns. Character values sort alphabetically and numeric values
sort numerically; the order is ascending unless DESC follows the
column name.
select Plant, jobgrade, gender, Pre_Training_Productivity
from widget.widgeone
where jobgrade in (5,6,7)
order by plant;
Note: If ordering by multiple columns, separate the columns with commas
(order by jobgrade, Pre_Training_Productivity desc;). DESC applies only to
the column it follows.
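As a minimal sketch of sorting on two columns, the following builds a small hypothetical table, work.demo, with invented rows (the contents of widget.widgeone are not shown in this course material). It sorts by jobgrade in ascending order and, within each job grade, by Pre_Training_Productivity in descending order.

```sas
/* Hypothetical sample data: column names mirror the course table, values are invented. */
data work.demo;
   length Plant $10;
   input Plant $ jobgrade Pre_Training_Productivity;
   datalines;
Dallas 5 40
Austin 6 55
Dallas 6 62
Austin 5 48
;
run;

proc sql;
   title 'Sorted by Job Grade, Then Productivity (Descending)';
   select Plant, jobgrade, Pre_Training_Productivity
      from work.demo
      /* jobgrade ascends by default; DESC affects only the second column */
      order by jobgrade, Pre_Training_Productivity desc;
quit;
title;  /* clear the title so it does not carry over to later output */
```

With these rows, the two grade-5 rows come first (48 before 40), followed by the two grade-6 rows (62 before 55).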
4. GROUP BY
The GROUP BY clause divides the rows into groups, one for each
distinct value of the listed column, so that summary functions
return one value per group:
group by Plant;
5. HAVING
The HAVING clause is another way to restrict the query results.
It works along with the GROUP BY clause: GROUP BY forms the
groups, and HAVING keeps only the groups that satisfy its
condition. For example,
proc sql;
select jobgrade, count(*) as Count
from widget.widgeone
group by jobgrade
having jobgrade gt 5;
quit;
SAS Code
proc sql ;
title 'Widgeone Table';
create table sql.table as
select Plant, SUM(Post_Training_Productivity)
format=comma14. as TotalProductivity, count(*) as Count
from widget.widgeone
group by plant
having count(*) gt 17;
quit;
By inserting the CREATE TABLE clause, the results go to the
library sql under the name Table instead of being printed. This is
a good way to keep the results in a more permanent place. Also,
count(*) is an aggregate function that counts the rows in each
group; the HAVING clause keeps only the plants with more than 17
rows. In this case, it will return Dallas only. Notice that several
columns can be listed in the SELECT clause, including
summary functions with formats.
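The same pattern can be tried on a small scale. This is a sketch under stated assumptions: work.demo and work.plant_totals are hypothetical names, the rows are invented, and the threshold is lowered from 17 to 2 so that the tiny sample produces a result.

```sas
/* Hypothetical sample data: three Dallas rows and one Austin row. */
data work.demo;
   length Plant $10;
   input Plant $ Post_Training_Productivity;
   datalines;
Dallas 40
Dallas 55
Dallas 62
Austin 48
;
run;

proc sql;
   /* Store one summary row per plant, keeping only plants with more than 2 rows. */
   create table work.plant_totals as
      select Plant,
             sum(Post_Training_Productivity) format=comma14. as TotalProductivity,
             count(*) as Count
         from work.demo
         group by Plant
         having count(*) gt 2;

   /* CREATE TABLE prints no report, so query the new table to see the result. */
   select Plant, TotalProductivity, Count
      from work.plant_totals;
quit;
```

With these rows, only Dallas has more than two rows, so the stored table holds a single row with a total of 157 and a count of 3.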
USEFUL TOOLS
• Eliminating duplicate rows from the table
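To get query results with only unique rows, add the word
“distinct” to the SELECT clause:
proc sql;
select distinct Plant, Avg(Post_Training_Productivity) as Average
format=8.2
from Widget.WidgeOne;
quit;
Note: Without a GROUP BY clause, AVG is computed over all rows of
the table, so each plant is listed once with the same overall average.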
• Determining the structure of a table
The DESCRIBE TABLE statement writes the table's definition
(column names, types, lengths, and formats) to the SAS log.
proc sql;
describe table Widget.WidgeOne;
quit;
• Adding text to output
o By putting text within single quotation marks in the SELECT
clause, you can add the same text to every row of the output:
select 'Factory', Plant, 'in Production'
from Widget.WidgeOne;
Here Plant is a column (variable) name.
o By putting in a TITLE statement, you can add a title to
your output:
proc sql;
title 'Widgets Work!';
select plant, Gender
from sql.filename;
quit;
USEFUL TOOLS: CALCULATING VALUES
• Calculated data within PROC SQL:
o By putting the calculation within the procedure, values
can be computed from the data set, with the result
appearing as a column of the output.
o We have already seen an example, using Average, under
eliminating duplicate rows.
o Here is another example with a title:
proc sql;
title 'What is the Sum of Training Hours?';
select distinct Plant, sum(Post_Training_Productivity) as Sum format=8.2
from Widget.WidgeOne;
quit;
USEFUL TOOLS: SUMMARIZING DATA
Function          Description
• AVG, MEAN       mean or average of data
• COUNT, FREQ, N  number of nonmissing values
• MAX/MIN         largest value / smallest value
• NMISS           number of missing values
• RANGE           range of values
• STD             standard deviation
• SUM             sum of values
• T               Student's t for testing H0: population mean = 0
• CSS             corrected sum of squares
• VAR             variance of data
USEFUL TOOLS: SUMMARIZING DATA
• How to use summary functions, two examples:
o MEAN
proc sql;
select distinct Plant, Avg(Post_Training_Productivity) as Average format=8.2
from Widget.WidgeOne;
quit;
Note: The argument goes inside parentheses, as with any
function. To reference a calculated column in the WHERE
clause, put the word “calculated” before its name. This
applies to columns computed row by row; a summary value
such as Average cannot be tested in WHERE and must be
tested in HAVING instead.
o COUNT
proc sql;
select count(distinct Plant) as Count
from Widget.WidgeOne;
quit;
Here “distinct” makes COUNT count each plant only once.
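The CALCULATED keyword can be illustrated with a short sketch. The table work.demo, its rows, and the column name Gain are all hypothetical. The first query filters on a row-level calculation in WHERE; the second filters on a summary value, which has to be done in HAVING.

```sas
/* Hypothetical sample data with invented productivity values. */
data work.demo;
   length Plant $10;
   input Plant $ Pre_Training_Productivity Post_Training_Productivity;
   datalines;
Dallas 40 52
Dallas 55 50
Austin 48 60
Austin 30 45
;
run;

proc sql;
   /* Gain is computed for each row, so CALCULATED lets WHERE refer to it. */
   select Plant,
          Post_Training_Productivity - Pre_Training_Productivity as Gain
      from work.demo
      where calculated Gain gt 0;

   /* An average is a summary value, so the test moves to HAVING. */
   select Plant,
          avg(Post_Training_Productivity) as Average format=8.2
      from work.demo
      group by Plant
      having avg(Post_Training_Productivity) gt 52;
quit;
```

With these rows, the first query drops the one row whose productivity fell, and the second keeps only Austin (average 52.5; Dallas averages 51).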
USEFUL TOOLS: THE WHERE CLAUSE
• The WHERE clause is very flexible and can be used to
select rows in very specific ways.
o Used to include or exclude missing data:
IS MISSING and IS NOT MISSING test whether a value is missing.
o Used in conjunction with AND/OR to combine conditions:
select Plant, Post_Training_Productivity
from widget.widgeone
where Post_Training_Productivity lt 50 and Plant is not missing;
Note: Both conditions must be met. (lt means “less than.”) WHERE tests
individual rows, so it cannot test a summary value such as
sum(Post_Training_Productivity); use HAVING for that.
o Used in conjunction with the LIKE operator:
select Name
from widget.widgeone
where Plant like 'D%';
Note: The condition is met if Plant begins with “D.” The % is a wildcard
that matches any sequence of zero or more characters.
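A small sketch of these WHERE conditions follows. The table work.demo and its names and plants are invented for illustration; a period in the datalines stands for a missing Plant value.

```sas
/* Hypothetical sample data; the last row has a missing Plant. */
data work.demo;
   length Name $10 Plant $10;
   input Name $ Plant $;
   datalines;
Avery Dallas
Blake Denver
Casey Austin
Drew .
;
run;

proc sql;
   /* Plants that begin with D */
   select Name, Plant
      from work.demo
      where Plant like 'D%';

   /* Rows where Plant has no value */
   select Name
      from work.demo
      where Plant is missing;

   /* OR accepts a row when either condition holds */
   select Name, Plant
      from work.demo
      where Plant like 'D%' or Plant is missing;
quit;
```

With these rows, the first query returns Avery and Blake, the second returns Drew, and the third returns all three of them.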
USEFUL TOOLS: GROUPING
• Grouping by multiple columns
o To group by multiple columns (variables), separate the
column names with commas within the GROUP BY
clause.
where Plant is not missing
group by Plant, Post_Training_Productivity;
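As a sketch of grouping by two columns, the example below groups by Plant and Gender rather than by Plant and Post_Training_Productivity, since a categorical second column makes the groups easier to see. The table work.demo and its rows are hypothetical.

```sas
/* Hypothetical sample data; the last row has a missing Plant. */
data work.demo;
   length Plant $10 Gender $6;
   input Plant $ Gender $ Post_Training_Productivity;
   datalines;
Dallas Female 40
Dallas Female 55
Dallas Male 62
Austin Male 48
. Male 30
;
run;

proc sql;
   select Plant, Gender,
          count(*) as Count,
          sum(Post_Training_Productivity) as Total
      from work.demo
      where Plant is not missing   /* drop the row with no plant before grouping */
      group by Plant, Gender;      /* one output row per Plant/Gender combination */
quit;
```

With these rows there are three Plant/Gender combinations, so the result has three rows; the two Dallas Female rows are combined into one row with a count of 2 and a total of 95.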
USEFUL TOOLS: HAVING VERSUS WHERE
• The HAVING clause with the GROUP BY clause affects groups
in a way that is similar to the way in which a WHERE clause
affects individual rows. PROC SQL displays only the groups that
satisfy the HAVING clause.
• It is helpful to know when to use HAVING and when to use
WHERE.
• Unlike the WHERE clause, the HAVING clause can use aggregate
functions (summarizing functions):
select Plant, SUM(Post_Training_Productivity)
format=comma14. as TotalProductivity, count(*) as Count
from widget.widgeone
group by plant
having count(*) gt 17;
Note: count(*) counts every row in each group, including rows with missing
values. The HAVING clause then keeps only the plants with more than 17
rows.
USEFUL TOOLS: HAVING VERSUS WHERE
HAVING                                    WHERE
__________________________________________________________________________
• Includes groups of rows                 Includes individual rows
• Must follow GROUP BY if used            Must precede any GROUP BY used
  with GROUP BY
• When no GROUP BY, acts like             Is not affected by GROUP BY clause
  WHERE
• Is processed after GROUP BY             Is processed before GROUP BY clause, if
  and any summations                      there is one, and before
                                          summations.
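The difference in processing order can be seen when both clauses appear in one query. The following is a minimal sketch on a hypothetical table, work.demo, with invented rows.

```sas
/* Hypothetical sample data. */
data work.demo;
   length Plant $10;
   input Plant $ jobgrade Post_Training_Productivity;
   datalines;
Dallas 5 40
Dallas 6 55
Dallas 7 62
Austin 6 48
Austin 7 30
;
run;

proc sql;
   select Plant,
          count(*) as Count,
          sum(Post_Training_Productivity) as Total
      from work.demo
      where jobgrade gt 5                            /* row filter, applied before grouping */
      group by Plant
      having sum(Post_Training_Productivity) gt 100; /* group filter, applied after summing */
quit;
```

With these rows, WHERE first removes the grade-5 Dallas row, so the Dallas total is 117 rather than 157. HAVING then drops Austin, whose total is 78, leaving one row: Dallas with a count of 2.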
PROC SQL Vs. SAS Program
Total Post Training Hours Grouped by Plant
• Let’s see the difference between using PROC SQL
and other SAS procedures to find the total hours
grouped by the two different plants.
• PROC SQL
proc sql;
select Plant, sum(Post_Training_Hours) as Total format=comma15.
from sql.WidgeOne
where Gender = 'Female'
group by Plant
order by Total;
quit;
• SAS Program
proc summary data = sql.WidgeOne;
where Gender = 'Female';
class Plant;
var Post_Training_Hours;
output out = SumPost sum = Total;
run;
proc sort data = SumPost;
by Total;
run;
proc print data = SumPost noobs;
var Plant Total;
format Total comma15.;
where _type_ = 1;
run;
PROC SQL VS. SAS Program
• The example shows that PROC SQL can
achieve the same results as base SAS software
but often with fewer and shorter statements.
• The SELECT statement that is shown performs
summation, grouping, sorting, and row
selection. It also displays results without PROC
PRINT.
SUMMARY
• The following code represents a summary of the
clauses we discussed.
proc sql;
title 'Widgeone Table';
create table widget.table as
select Plant, sum(Post_Training_Productivity) format=comma14. as Total
from widget.widgeone
where Plant is not missing;
quit;
SUMMARY CONTINUED
• Now we use HAVING and GROUP BY as well. GROUP BY must come
before HAVING.
proc sql;
title 'Widgeone Table';
create table widget.table as
select Plant, sum(Post_Training_Productivity) format=comma14. as Total,
jobgrade
from widget.widgeone
group by plant
having jobgrade gt 5;
quit;