Introduction to Writing SQL

This 7 week course is intended as a self-guided training course for individuals or groups who want to learn how to write custom SQL queries against the Data Warehouse. The textbook for the class is Mastering Oracle SQL, 2nd Edition by Sanjay Mishra & Alan Beaulieu (ISBN: 0596006322), and the SQL portion of the class follows the book closely. Each week, you’ll read one chapter and do a short homework assignment to review & practice what you learned. Skipping the reading & homework will jeopardize your ability to follow the course, as each week builds on those before. It’s also recommended that you keep a business question specific to your role in mind, and try answering it as you develop your SQL & ETL skills.

Sections: Week 1, Week 2, Week 3, Week 4, Week 5, Week 6, Week 7, Answer Key (08/04/2011), Key Tables & Virts (08/09/2011)

SQL Topics:
    SELECT & FROM Clauses
    Table and Column Aliases
    Types of Elements
    Concatenating
    ORDER BY Clause
    WHERE Clause – SQL’s Filter
    Using Comparison Operators
    Using Other Operators
    Handling Cells with No Data – aka NULLs
    Aggregate Queries
    Pulling DISTINCT Records
    Aggregate Functions
    GROUP BY and HAVING Clauses
    Joining 2 or More Tables
    INNER vs. OUTER Joins
    WHERE Clause Conditions w/ OUTER Joins
    One-to-Many Joins
    DATE vs. DATETIME columns
    The TO_CHAR() Function with Dates
    The TRUNC() Function with Dates
    The TO_DATE() Function
    Using BETWEEN with Dates
    Other Date functions
    Subqueries
    Avoiding 1-to-many joins
    The DECODE(), CASE & NVL Functions
    Data Type Consistency

Week 1 – Basic Structure of SQL

SQL Topics:
    SELECT & FROM
    Table Aliases
    Column Aliases with and without Spaces
    Types of Column Content
    Concatenating
    ORDER BY

SQL TOPICS

SQL is a language used (for our purposes) to ask a Database for specific information in a specific format.
We generally use it to pull sets of data, known as Result Sets, which we use to create reports and answer business questions, such as “What was the List Price value of all goods shipped to customers in 2008?” or “When did we receive the first unit of Twilight Book 1 from Hachette?”. There are two necessary sections, or clauses, to a SQL query: SELECT and FROM.

SELECT & FROM

SELECT tells the database a list of one or more Elements that you want to include in your results.
FROM tells the database a list of one or more tables (or views) from which you want to pull the information.

A basic query with just these necessary clauses might be:

    SELECT WAREHOUSE_ID
         , NAME
      FROM D_WAREHOUSES;

This query would pull two elements – in this case columns WAREHOUSE_ID and NAME – from the table named D_WAREHOUSES, producing a Result Set like this one. Note that only part of the result set is shown:

    WAREHOUSE_ID  NAME
    IMJO          Ingram Micro, Jonestown, PA
    TUL1          Coffeyville
    GCWP          Granite City Tools – MN
    NRT3          Ichikawa
    A00L          SED International-Dallas
    MSC7          Bemrose Booth
    ECEL          Amazon Wireless
    TAJ9          Target: Light Source

WAREHOUSE_ID and NAME are the column names that we wanted to pull, so we listed them as elements in the SELECT clause. Notice that the elements in the SELECT clause are separated by commas in the SQL query. The query also ends with a semicolon, which lets Oracle know that’s the end of the query. ETL Manager doesn’t require this, but it’s good practice to include it.

You may also notice that I put the comma at the beginning of a new line followed by the next element, which is a little different than might seem intuitive. This is because the comma is only there because there is a second element. If I wanted to delete the NAME column, I’d also need to delete the comma; otherwise I’d get an error. I find that by putting it at the start of the new line where the new element is, I can easily delete that row when editing and won’t miss it and cause an error.
With the NAME row deleted, the query becomes:

    SELECT WAREHOUSE_ID
      FROM D_WAREHOUSES;

Table Aliases

For reasons that will become apparent when you start joining multiple tables together, it’s best to use Table Aliases when writing your SQL. A Table Alias is a shorthand name, like a nickname, that tells the query from which table each column referenced comes. To alias a table, you simply add your nickname after the table name, with a space separating them (fcs in the example below). You also put that alias at the start of each column name, separated from the column name by a period, like this:

    SELECT WAREHOUSE_ID
         , NAME
      FROM D_WAREHOUSES;

BECOMES:

    SELECT fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs;

The table alias ensures that the system knows exactly which table each column comes from. For example, in a more complex query you might have two tables, each with a WAREHOUSE_ID column, and Oracle needs to know which table the column belongs to.

Column Aliases

Another type of alias is the Column Alias. This is a way to change the name of a column to something that’s more meaningful to you or your customers, and is what shows up as the Column Headers in your result set. A few examples of Column aliases are below:

    SELECT fcs.WAREHOUSE_ID FC_Code
         , fcs.REGION_ID AS Region
         , fcs.NAME AS "FC Name"
      FROM D_WAREHOUSES fcs;

There are two ways you can change the name of a column. The first is to simply put a space after the column name, followed by your alias, as we’ve done above to alias the first column, WAREHOUSE_ID, to FC_Code. You can also put the word AS in between the column name and your alias, as we’ve done to alias the second column, REGION_ID, to Region. Including AS isn’t necessary, but arguably makes it clearer that you’ve aliased the column. If you want to alias a column to something that has a space in it, like we’ve done to alias the column NAME to FC Name, you have to enclose it in double quotes, so the system knows where the alias starts and ends.
I usually avoid including spaces in column aliases, because it can lead to problems in more complex queries, and the standard is to use underscores, as we did with FC_Code. Also, be sure not to start your Column Aliases with a number, or make them a SQL keyword (like DATE, CUBE or FROM), as this will cause confusing errors.

Types of Elements

A SELECT clause can include elements beyond just columns from tables. There are a number of different elements that can be included, depending on your needs, including:

    Literal values, such as numbers (13) or text strings ('Howdy!'), that return exactly what you enter
    Expressions (aka formulas), such as doi.QUANTITY_SHIPPED + 5, which do math or other logical procedures
    Function calls, such as TO_CHAR(ddo.ORDER_DAY,'MM/DD/YYYY'), that transform column information
    Pseudocolumns, such as ROWID, ROWNUM, or LEVEL (Pseudocolumns aren’t columns that actually exist in any table, but are columns you can include in any query for specific uses.)

An example of a query with each of these types of elements is:

    SELECT fcs.WAREHOUSE_ID
         , 13
         , 'Howdy!'
         , fcs.REGION_ID + 5
         , SUBSTR(fcs.NAME,0,5)
         , ROWNUM
      FROM D_WAREHOUSES fcs;

Which yields a result set that includes these rows:

    WAREHOUSE_ID  13  'HOWDY!'  FCS.REGION_ID+5  SUBSTR(FCS.NAME,0,5)  ROWNUM
    TGC2          13  Howdy!    6                Alen                  1
    AAOP          13  Howdy!    6                Trend                 5
    SAIN          13  Howdy!    6                Saint                 7
    JACK          13  Howdy!    6                Jacks                 8
    WCSC          13  Howdy!    6                West                  9
    ABE1          13  Howdy!    6                Allen                 757
    IBEW          13  Howdy!    6                Ingra                 969
    LEX1          13  Howdy!    6                Lexin                 970
    SEA1          13  Howdy!    6                Seatt                 971

You’ll notice that the 13 wasn’t enclosed in single quotation marks, like Howdy!. This is because the system understands the 13 is a number. Text strings, like 'Howdy!' or 'PHL1' or 'I love SQL', need to be enclosed in single quotes whenever you use them. Note that the quotes don’t show up in your results.
Also – the single quote in MS Word, Excel and Outlook is a different character, so it’s best not to edit SQL in these programs (stick to Notepad, Notepad++, or the ETL Manager Profile SQL window). Notepad++ is a favorite of many coders, as you can adjust the ‘Language’ to SQL and get helpful formatting, as well as indent entire sections with tab. It’s available for download in Advertised Programs as “Open Source Notepad++”.

Concatenating

You can also concatenate information together in your SELECT clause, including columns, numbers, text strings, etc. Unlike Excel, where you concatenate using the ampersand (&) symbol, SQL uses two pipes ||. To get two pipes, hold down shift and hit the key just above the Enter key on your keyboard twice. An example of concatenating is in the query below:

    SELECT fcs.WAREHOUSE_ID
         , 'Howdy! from ' || fcs.NAME
      FROM D_WAREHOUSES fcs;

Which yields a result set that includes:

    WAREHOUSE_ID  'HOWDY!FROM'||FCS.NAME
    SBTK          Howdy! from Softbank BB
    DGJP          Howdy! from Digital Goods JP
    ABGM          Howdy! from Step2 UK Limited
    ABGL          Howdy! from Universal Cycles
    LHR2          Howdy! from Plot 8 - Marston Gate

Notice that you need to include the space after 'Howdy! from inside the quotation marks in order to get it in the results. Otherwise, they’d look like:

    WAREHOUSE_ID  'HOWDY!FROM'||FCS.NAME
    SBTK          Howdy! fromSoftbank BB
    DGJP          Howdy! fromDigital Goods JP
    ABGM          Howdy! fromStep2 UK Limited
    ABGL          Howdy! fromUniversal Cycles
    LHR2          Howdy! fromPlot 8 - Marston Gate

ORDER BY

Sometimes, the order of the results is important to answering your question or to displaying the results in the most meaningful way, like ranking the highest units at the top, or alphabetizing a list of vendor codes. To order your results, you add an ORDER BY clause to the end of the query, and then specify in that clause which element(s) to order your results by, and even which direction.
For example, you might want to see a list of warehouse names in alphabetical order:

    SELECT fcs.NAME
      FROM D_WAREHOUSES fcs
     ORDER BY fcs.NAME;

    NAME
    2 Red Hens
    3 GIRLS DESIGN/KITTY A GO
    32 North Corp
    A & W Products Co In
    A C R Logistics
    A Plus Marketing
    A'HOMESTEAD SHOPPE INC
    A-America Inc
    A.S. Diamonds
    AAB Gourmet - Garden City
    ABM Corp - Mira Loma
    Beijing

Notice that numbers, spaces & symbols are ordered before letters, so 2 Red Hens comes before A.S. Diamonds, which comes before AAB Gourmet. The default ordering is Ascending (0-9, A-Z).

You may also want to sort a list in the other direction. A common use case is when you want the highest number of something at the top of a list, like the highest number of glance views in a list of ASINs. To do that, you add a space and the word DESC after the element in your ORDER BY clause, to specify descending order:

    SELECT fcs.NAME
      FROM D_WAREHOUSES fcs
     ORDER BY fcs.NAME DESC;

    NAME
    Beijing
    A'HOMESTEAD SHOPPE INC
    ABM Corp - Mira Loma
    A-America Inc
    AAB Gourmet - Garden City
    A.S. Diamonds
    A Plus Marketing
    A C R Logistics
    A & W Products Co In
    32 North Corp
    3 GIRLS DESIGN/KITTY A GO
    2 Red Hens

You may also want to order your results by multiple elements, in which case you include them all in your ORDER BY clause, in order of importance, separated by commas:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
      FROM D_WAREHOUSES fcs
     ORDER BY fcs.REGION_ID DESC
         , fcs.WAREHOUSE_ID;

    REGION_ID  WAREHOUSE_ID
    3          AARF
    3          AARG
    3          CAN1
    3          DEKN
    3          DGJP
    3          NRT1
    2          GLA1
    2          LEJ1
    2          LHR1
    1          KTKN
    1          MYTK
    1          RNO1

Each element in your ORDER BY clause can be sorted in a different direction. In the example above, we ordered the REGION_ID column descending, and ordered the WAREHOUSE_ID column ascending (which is the default). You can also use any type of element in your ORDER BY clause, just like in your SELECT clause, including function calls and expressions.

Week 1 Homework:
1. Read Chapter One in Mastering Oracle SQL.
2.
Create a query that pulls an alphabetized list of Warehouse IDs from the table D_WAREHOUSES, changing the name of the Warehouse ID column to ‘FC’.
3. Edit the query to add the column REGION_ID, add an element called ‘CALC’ that multiplies the Region ID by 10, add an element called ‘FACTOR’ that is populated with the number 10 for all records, and add an element called FC_REGION that concatenates the WAREHOUSE_ID and the REGION_ID columns with an underscore in between (e.g. PHL1_1).

Here is a description of the D_WAREHOUSES table. We’ll talk more about exploring tables & columns in the future, and what all this information means, but for now, all you need is the list of column names, so you can play around a bit with querying this table using ETL Manager.

Table Name: D_WAREHOUSES

    Column Name            Data Type  Data Length  Nullable?  Num Distinct
    CAN_SHIP_INTERNALLY    CHAR       1            N          2
    DB_NAME                VARCHAR2   8            Y          57
    DW_CREATION_DATE       DATE       7            N          19
    DW_LAST_UPDATED        DATE       7            N          1
    HAS_AMAZON_INVENTORY   CHAR       1            N          2
    IP_ADDRESS_LIST_ID     NUMBER     22           Y          51
    IS_DELAYED_ALLOCATION  CHAR       1            N          2
    IS_DROPSHIP            CHAR       1            N          2
    IS_RETURNS_ONLY        CHAR       1            N          2
    NAME                   VARCHAR2   50           N          3340
    REGION_ID              NUMBER     22           N          3
    WAREHOUSE_ID           CHAR       4            N          3453

    Data Precision: 38

Remember: One of the great things about SQL is that there are usually several ways to get to the same answer. Different people’s minds think about and solve problems in different ways, and you’ll likely find some methods that work for you that may be different than what your peers are doing. A good SQL coder is a creative SQL coder, so don’t be afraid to try something ‘off-book’.
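As a recap of the Week 1 building blocks, here is a rough sketch that puts a table alias, column aliases, a concatenation, an expression, a function call and a multi-element ORDER BY into one query against D_WAREHOUSES. The alias names (FC_Code, Region_Label, Region_Plus, Short_Name) and the values 100 and 10 are made up purely for illustration; it is not the homework answer, just one possible arrangement of the pieces.

```sql
-- Sketch only: the alias names and literal values are illustrative assumptions.
SELECT fcs.WAREHOUSE_ID            AS FC_Code       -- column alias using AS
     , 'Region ' || fcs.REGION_ID  AS Region_Label  -- text string concatenated with a column
     , fcs.REGION_ID + 100         AS Region_Plus   -- expression
     , SUBSTR(fcs.NAME, 1, 10)     AS Short_Name    -- function call
  FROM D_WAREHOUSES fcs                             -- table alias
 ORDER BY fcs.REGION_ID DESC                        -- highest region first
     , fcs.WAREHOUSE_ID;                            -- then FC code, ascending by default
```

Because every element after the first sits on its own line with a leading comma, any one of them can be deleted without touching the others.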
Week 2 – Building Queries to Pull Just the Results You Want

SQL Topics:
    The WHERE Clause – SQL’s Filter
    Using Comparison Operators
    Using Other Operators
    Handling Cells with No Data – aka NULLs

SQL TOPICS

The WHERE Clause – SQL’s Filter

Although the SELECT and FROM clauses are the only required sections of a SQL query, they only allow you to pull every record from a table – not just the ones you want. Imagine querying the D_CUSTOMER_SHIPMENT_ITEMS table, which BI Metadata shows has over 5 billion rows of data. The output would be too large for Excel, and you’d have a lot of information you don’t really want.

That’s where the WHERE clause comes in. I think of WHERE as the filters I put on the table, to filter out what I don’t want, and only let what I do want get through to my result set. Each ‘filter’ in the WHERE clause is a Condition that must be true for a record in order for that record to be returned by the query. The WHERE clause goes after the FROM clause, but before the ORDER BY clause (if you’re using one). The WHERE clause starts with the word WHERE, and then is followed by one or more filters, called conditions. For example, if we wanted to pull an alphabetical list of FCs in Japan & China (REGION_ID 3), we could run the following:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
     ORDER BY fcs.WAREHOUSE_ID;

This would return the following dataset, limited to only WHERE the REGION_ID is equal to 3:

    REGION_ID  WAREHOUSE_ID  NAME
    3          AARF          ¿¿¿¿
    3          AARG          Amazon¿¿
    3          CAN1          Guangzhou
    3          DEKN          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          DGJP          Digital Goods JP
    3          FFSA          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          FMTT          ¿¿¿¿¿¿¿¿¿
    3          FUOS          ¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          KCFK          Kenko.com, INC.
    3          KTKN          ¿¿¿¿¿¿¿
    3          MYTK          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          NRT1          Narita
    3          NRT2          Yachiyo-shi
    3          NRT3          Ichikawa
    3          OSKF          Osakaya Books
    3          OTOS          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          PEK3          Beijing
    3          SBTK          Softbank BB
    3          SHA1          Shanghai
    3          YYGF          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

Of course, you’re not limited to just one condition.
You may want to filter your results by several criteria, and so would have multiple conditions in the WHERE clause. For example, we might limit the above query further to only those FCs with WAREHOUSE_IDs that start with the letter Y. (Don’t worry about what LIKE 'Y%' means exactly right now – we’ll get to that shortly. Just know it means ‘starts with Y’):

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND fcs.WAREHOUSE_ID LIKE 'Y%'
     ORDER BY fcs.WAREHOUSE_ID;

Yielding the following result set:

    REGION_ID  WAREHOUSE_ID  NAME
    3          YYGF          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

Notice that we separated the two conditions in the WHERE clause with AND. This means that BOTH the first condition (fcs.REGION_ID = 3) AND the second condition (fcs.WAREHOUSE_ID LIKE 'Y%') must be true.

You can also separate multiple conditions with OR, in which case only one of the conditions needs to be true. If we change the AND in our query above to an OR, the results are much different:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
        OR fcs.WAREHOUSE_ID LIKE 'Y%'
     ORDER BY fcs.WAREHOUSE_ID;

In this case, we pulled all FCs where the REGION_ID is equal to 3 OR where the WAREHOUSE_ID begins with Y, so we got all the FCs in Region 3 regardless of what letter they start with, plus all the FCs in other regions that start with Y – which happens to only include YAHA.

    REGION_ID  WAREHOUSE_ID  NAME
    3          AARF          ¿¿¿¿
    3          AARG          Amazon¿¿
    3          CAN1          Guangzhou
    3          DEKN          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          DGJP          Digital Goods JP
    3          FFSA          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          FMTT          ¿¿¿¿¿¿¿¿¿
    3          FUOS          ¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          KCFK          Kenko.com, INC.
    3          KTKN          ¿¿¿¿¿¿¿
    3          MYTK          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          NRT1          Narita
    3          NRT2          Yachiyo-shi
    3          NRT3          Ichikawa
    3          OSKF          Osakaya Books
    3          OTOS          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          PEK3          Beijing
    3          SBTK          Softbank BB
    3          SHA1          Shanghai
    1          YAHA          Yamazaki Tableware -- Hackettstown
    3          YYGF          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

And you can even get more complex by using parentheses to change how the AND and OR logic is applied.
For example, maybe you want a list of all FCs where the REGION_ID is 3 and either the WAREHOUSE_ID starts with Y or it starts with D. We could do that using parentheses:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND (fcs.WAREHOUSE_ID LIKE 'Y%'
            OR fcs.WAREHOUSE_ID LIKE 'D%')
     ORDER BY fcs.WAREHOUSE_ID;

    REGION_ID  WAREHOUSE_ID  NAME
    3          DEKN          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          DGJP          Digital Goods JP
    3          YYGF          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

The results include only FCs where REGION_ID = 3 AND where either the WAREHOUSE_ID starts with Y OR the WAREHOUSE_ID starts with D. Pages 20-22 of your textbook have additional examples and some charts on how various logical combinations of AND and OR, with and without parentheses, are evaluated.

Using Comparison Operators

In the examples above, we used two different ‘comparison operators’ in our WHERE clauses to limit our results: the equals symbol (=) and LIKE. There are many more comparison operators available to help us apply conditions in our query.

The equals sign can be used to evaluate if something is equal to something else in a condition, as we did when we put fcs.REGION_ID = 3 in our WHERE clause in the example above. Equality can also be evaluated for columns that contain text strings (CHAR and VARCHAR2 data type columns), in which case you must put that text string in a set of single quotation marks. For example:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID = 'PHL1';

    REGION_ID  WAREHOUSE_ID  NAME
    1          PHL1          New Castle

You can also evaluate whether something is NOT equal to something else, using either of two symbols: <> or !=. If we changed the operator in the query above from = to !=, the query would give you all FCs except for PHL1.
    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID != 'PHL1';

If you want records that are greater than or less than a certain value, you can use the > and < symbols, just as you use the =. You can also evaluate if something is greater than or equal to using >=, and evaluate if something is less than or equal to using <=. And just like = and !=, they can be used on text strings, too.

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID >= 'WYTN';

    REGION_ID  WAREHOUSE_ID  NAME
    1          YAHA          Yamazaki Tableware -- Hackettstown
    1          WYTN          WYNIT, Inc.

The operator IN can also be used in a WHERE clause, when you have a list of things that you want to check for. To pull data from D_WAREHOUSES where the FC was either PHL1 or RNO1, we could do it with two conditions, this way:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID = 'PHL1'
        OR fcs.WAREHOUSE_ID = 'RNO1';

    REGION_ID  WAREHOUSE_ID  NAME
    1          PHL1          New Castle
    1          RNO1          Fernley

Or we could get the same results with a single condition by using the IN operator:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID IN ('PHL1','RNO1');

    REGION_ID  WAREHOUSE_ID  NAME
    1          PHL1          New Castle
    1          RNO1          Fernley

When you use the IN operator, you follow it with a list of values inside a set of parentheses, separated by commas. The condition could be read as WHERE the WAREHOUSE_ID matches any of the values in the list 'PHL1','RNO1', so it returns information for any record where WAREHOUSE_ID matches any value in the list. With just two values, as in the example, either the OR or IN method takes about the same amount of time to write – but when you have many more values to evaluate, such as a list of 50 vendor codes, IN becomes much quicker. (Note that the upper limit on values in the list of an IN condition is reportedly 1000.)
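IN works the same way on number columns. The sketch below is only an illustration (the region values and the 'L%' pattern are arbitrary choices, not taken from the course material): because REGION_ID is a NUMBER column, the values in the list are written without quotes, and the IN condition is combined with a second condition using AND.

```sql
-- Illustrative sketch: IN with numeric values, combined with another condition.
SELECT fcs.REGION_ID
     , fcs.WAREHOUSE_ID
     , fcs.NAME
  FROM D_WAREHOUSES fcs
 WHERE fcs.REGION_ID IN (2, 3)       -- numbers are not enclosed in quotes
   AND fcs.WAREHOUSE_ID LIKE 'L%'    -- 'starts with L', same idea as LIKE 'Y%' earlier
 ORDER BY fcs.REGION_ID
     , fcs.WAREHOUSE_ID;
```

Written with OR instead, the first condition would need to be (fcs.REGION_ID = 2 OR fcs.REGION_ID = 3), with the parentheses, to mean the same thing.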
You can also query for records that are NOT IN a list, by putting NOT in front of IN. If we changed the condition in the last example from IN to NOT IN, we’d get every FC except PHL1 and RNO1.

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID NOT IN ('PHL1','RNO1');

Another great shortcut operator is BETWEEN. To query for warehouses PHL1 and PHL2, we can query:

This way:

    SELECT fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE (fcs.WAREHOUSE_ID='PHL1'
            OR fcs.WAREHOUSE_ID='PHL2');

Or this way:

    SELECT fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID IN ('PHL1','PHL2');

Or even this way:

    SELECT fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID >= 'PHL1'
       AND fcs.WAREHOUSE_ID <= 'PHL2';

But imagine that you had a very long range, such as a date range of many weeks, and only wanted to pull a portion of them. You wouldn’t want to have to list all of them. The quicker way to do this type of query would be to use the BETWEEN operator. To use the BETWEEN operator, you follow the column you’re evaluating by the word BETWEEN, then the first value in the range, followed by AND, and finally the last value in the range. It’s important to remember that the BETWEEN operator is Inclusive, meaning your results will include anything between the two values AND anything that matches either value.

    SELECT fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID BETWEEN 'PHL1' AND 'PHL3';

    WAREHOUSE_ID  NAME
    PHL1          New Castle
    PHL2          Chambersburg
    PHL3          Centerpoint

By using BETWEEN, we can return PHL1, PHL3 (the ends of the range) and PHL2 – which falls between them alphabetically.

Yet another comparison operator that we used earlier is LIKE. The LIKE operator evaluates matching for columns with text strings (CHAR and VARCHAR2 columns), and is usually used with a ‘pattern matching character’. The two ‘pattern matching characters’ (aka wildcards) are % and _.
The percent (%) symbol matches a string of characters of any length, whereas the underscore (_) symbol matches any one character. Now our previous example of FCs starting with the letter Y should make more sense:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND fcs.WAREHOUSE_ID LIKE 'Y%';

    REGION_ID  WAREHOUSE_ID  NAME
    3          YYGF          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

So we are looking for any FC where REGION_ID = 3 and WAREHOUSE_ID is like a text string that starts with a letter Y, and is followed by any number of characters. If we wanted to be more specific, we could query for any FC that starts with PH and ends with 1 using the single character wildcard:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID LIKE 'PH_1';

    REGION_ID  WAREHOUSE_ID  NAME
    1          PHL1          New Castle
    1          PHX1          Phoenix
    1          PH01          LaserShip Philly

So we get back the three FCs that start with PH, end with 1, and have a single character in between: PHL1, PHX1 and PH01. The LIKE operator can also be negated, like IN, with the addition of NOT:

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.WAREHOUSE_ID NOT LIKE 'PH_1';

Using Other Operators in the WHERE clause

Just like in the SELECT clause, you can use mathematical operators like +, -, and * in the WHERE clause to evaluate conditions. The following query would return all FCs in Region 2, given that 2+1 = 3. An odd example, but I promise this is useful when you begin using dates.

    SELECT fcs.REGION_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID + 1 = 3;

Handling Cells with No Data – aka NULLs

A NULL is a blank cell. A void. Nothing. Nada. Zilch. But not Zero. Zero is something, which represents nothing. Confused yet?

    REGION_ID  WAREHOUSE_ID  NAME
    1          PHL4          Carlisle
    1          PHL5
    0          PHL1          New Castle

In the imaginary table above, the third record has a REGION_ID that is 0. But the second record has a NAME value that is NULL. It’s empty.
Since something can never be equal to nothing, you can’t use many of the usual conditional operators to evaluate whether a record in a column is NULL. So SQL has special operators for NULLs, and some special functions for dealing with them, too. If we wanted to look for any records in D_WAREHOUSES where the IP Address is NULL, we’d use the IS NULL operator:

    SELECT fcs.REGION_ID
         , fcs.IP_ADDRESS_LIST_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND fcs.IP_ADDRESS_LIST_ID IS NULL;

    REGION_ID  IP_ADDRESS_LIST_ID  WAREHOUSE_ID  NAME
    3                              NRT1          Narita
    3                              AARG          Amazon¿¿
    3                              AARF          ¿¿¿¿
    3                              OSKF          Osakaya Books
    3                              KCFK          Kenko.com, INC.
    3                              SBTK          Softbank BB
    3                              DGJP          Digital Goods JP
    3                              OTOS          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3                              YYGF          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3                              MYTK          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3                              FFSA          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3                              FMTT          ¿¿¿¿¿¿¿¿¿
    3                              DEKN          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3                              FUOS          ¿¿¿¿¿¿¿¿¿¿¿¿¿
    3                              KTKN          ¿¿¿¿¿¿¿

And just like IN and LIKE, you can negate IS NULL by sticking in a NOT:

    SELECT fcs.REGION_ID
         , fcs.IP_ADDRESS_LIST_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND fcs.IP_ADDRESS_LIST_ID IS NOT NULL;

The IS NOT NULL query will return the opposite results – all FCs where the IP Address field isn’t blank.

Since some columns have nulls (and we can tell which by the Nullable field in BI Metadata), and since the <> and != operators will exclude NULLs, you have to be careful if you want all records where a field is EITHER NULL or is not equal to a specified value. You could write two conditions in your WHERE clause to evaluate the same column, like this:

    SELECT fcs.REGION_ID
         , fcs.IP_ADDRESS_LIST_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND (fcs.IP_ADDRESS_LIST_ID IS NULL
            OR fcs.IP_ADDRESS_LIST_ID != 1035);

But thankfully, SQL has a handy function called NVL, which translates any NULL values to another value that you specify, so you can use standard comparison operators to evaluate the column in a single condition, without much extra work.
    SELECT fcs.REGION_ID
         , fcs.IP_ADDRESS_LIST_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND NVL(fcs.IP_ADDRESS_LIST_ID,0) != 1035;

The format of the NVL function is NVL, followed by a set of parentheses, inside of which are your column name, a comma, and then what you want nulls to be translated to. In the example above, we translated any nulls in the column IP_ADDRESS_LIST_ID to the number 0. Then, we evaluate the results for whether they are not equal to 1035. Since the nulls are converted to 0, they are not equal to 1035, and will appear in the results.

    REGION_ID  IP_ADDRESS_LIST_ID  WAREHOUSE_ID  NAME
    3                              AARF          ¿¿¿¿
    3                              AARG          Amazon¿¿
    3          1039                CAN1          Guangzhou
    3                              DEKN          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          1041                NRT3          Ichikawa
    3                              OSKF          Osakaya Books
    3                              OTOS          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿
    3          1040                PEK3          Beijing
    3                              SBTK          Softbank BB
    3          25                  SHA1          Shanghai
    3                              YYGF          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿¿

Had we left out the NVL function, the results would be very different:

    SELECT fcs.REGION_ID
         , fcs.IP_ADDRESS_LIST_ID
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND fcs.IP_ADDRESS_LIST_ID != 1035;

    REGION_ID  IP_ADDRESS_LIST_ID  WAREHOUSE_ID  NAME
    3          1040                PEK3          Beijing
    3          25                  SHA1          Shanghai
    3          1039                CAN1          Guangzhou
    3          1041                NRT3          Ichikawa

And NVL can also be used in the SELECT clause, to replace NULLs with something more meaningful to the audience of the data. For example, we might change any NULLs in the IP Address column to a zero, like this:

    SELECT fcs.REGION_ID
         , NVL(fcs.IP_ADDRESS_LIST_ID,0)
         , fcs.WAREHOUSE_ID
         , fcs.NAME
      FROM D_WAREHOUSES fcs
     WHERE fcs.REGION_ID = 3
       AND fcs.IP_ADDRESS_LIST_ID IS NULL
       AND fcs.WAREHOUSE_ID LIKE '___K';

    REGION_ID  NVL(FCS.IP_ADDRESS_LIST_ID,0)  WAREHOUSE_ID  NAME
    3          0                              SBTK          Softbank BB
    3          0                              KCFK          Kenko.com, INC.
    3          0                              MYTK          ¿¿¿¿¿¿¿¿¿¿¿¿¿¿-

Week 2 Homework:
1. Read Chapter Two in Mastering Oracle SQL.
2. Make sure you’re signed up to the etl-users@amazon.com mailing list (and have an Outlook rule in place for those emails).
3.
Create a query that pulls a list of warehouses that are in North America (Region 1) and have Amazon inventory from the D_WAREHOUSES table. Be sure to run an Explain Plan on the query before running it.
4. Edit the query to add the FC Name as an element called ‘FC Name’, and include only FCs with the word ‘Logistics’ in their name. Remember to run an Explain Plan first.
5. Check out the table PRODUCT_GROUPS in BI Metadata. How many rows does it have? How many columns? Which columns might have NULLs in them? What type of information is in the PRODUCT_GROUP column? What type of table is it?
6. Create a query that pulls a list of GL Product Group Codes, in numerical order, from the PRODUCT_GROUPS table. Include the column SHORT_DESC in your results, and replace any null values in that column with the word ‘Unknown’. (Although there isn’t a column explicitly named GL_PRODUCT_GROUP in the table, one of the columns contains this information. Use BI Metadata and look at the Data Types of the columns, and make an educated guess about which column to pull.)
7. Edit the query to include the DESCRIPTION column, to only return results with a GL Product Group value of at least 14, and only return results with a DESCRIPTION in the following list: Books, Universal, Shops, Advertising, or Art.
8. If you haven’t studied logic, or are having difficulty wrapping your head around the difference between the results you’d get from WHERE A AND B OR C and WHERE A AND (B OR C), do a little Googling on logic. Concepts like Modus Ponens and Modus Tollens will aid you greatly in writing and understanding SQL.
9. Run Explain Plans on the following queries, but DO NOT RUN THEM.
These are good examples of bad queries:

    SELECT ddo.order_id
      FROM d_distributor_orders ddo
         , d_warehouses fcs
     WHERE ddo.warehouse_id = fcs.warehouse_id;

    SELECT ddo.order_id
      FROM d_distributor_orders ddo
         , d_distributor_order_items doi;

Week 3 – Aggregate Queries and HAVING Clause

SQL Topics:
    Aggregate Queries
    Pulling DISTINCT Records
    Aggregate Functions
    Aggregate Functions with DISTINCT
    The GROUP BY Clause
    The HAVING Clause

SQL TOPICS

Aggregate Queries

So far, we’ve created queries that pull all rows of data from a table using SELECT and FROM and used the WHERE clause to limit which rows we pull. Now we’re going to aggregate (group together) multiple rows of data into a single row in the result set, using the DISTINCT keyword, the GROUP BY clause, and some aggregate functions.

DISTINCT

The simplest form of aggregate query is one where you simply want to know all the unique values in a certain column of a table. For example, you might want a list of all the possible values for the REGION_ID column in D_WAREHOUSES, so you know how to limit your query properly. There are over 3000 rows of data in D_WAREHOUSES, but you can use DISTINCT to pull only the unique values for the REGION_ID column. To do that, write your query as you would to pull all the records for that column, but put the word DISTINCT after the SELECT but before the column, like this:

    SELECT /*+ use_hash(fcs) */
           DISTINCT fcs.REGION_ID
      FROM D_WAREHOUSES fcs;

    REGION_ID
    1
    2
    3

The query returns 3 rows of data, one for each DISTINCT value in the REGION_ID column. Even though each value is in the table many times in many records, the addition of the DISTINCT keyword limits the results to only the unique values.

If your SELECT clause has multiple elements, DISTINCT will return all the unique combinations of elements. Now that you know that the values in REGION_ID are 1, 2, and 3, you might want to know whether each Region has Delayed Allocation warehouses or not.
To do this, you again put DISTINCT before your first column:

    SELECT /*+ use_hash(fcs) */
      DISTINCT fcs.REGION_ID
    , fcs.IS_DELAYED_ALLOCATION
    FROM D_WAREHOUSES fcs;

    REGION_ID  IS_DELAYED_ALLOCATION
    1          Y
    1          N
    2          N
    3          N

Now the results tell us that Region 1 has some warehouses that are Delayed Allocation (Y) and some that are not (N), but the other two regions only have warehouses that are not Delayed Allocation nodes. There are two rows for REGION_ID 1, because the value in the IS_DELAYED_ALLOCATION column for each is distinct, and DISTINCT finds all unique combinations of all the elements in the SELECT clause. (Notice that you only include the DISTINCT keyword once, before the first element, even when there are multiple elements.)

Other examples of using DISTINCT would be to find out what all the unique ORDER_TYPE values are in D_DISTRIBUTOR_ORDERS, or to find a list of all ASINs we’ve ordered from a specific vendor in the past 6 months, and which of those were ever backordered.

Aggregate Functions
The DISTINCT keyword can help you get lists of unique values, and even answer some business questions, but you’ll also find you want to count the number of POs placed, or sum the total quantity we ordered on a PO, or find the first date that we received something from a vendor. All of these require the use of aggregate functions. The main aggregate functions are:
COUNT – which counts how many values there are in a column
MAX – which finds the maximum value in a column
MIN – which finds the minimum value in a column
SUM – which adds together the values in a column
AVG – which averages the values in a column

All of these functions are used in the SELECT clause. The format is to start with the function, and then put the column you want to aggregate in parentheses after it, like SUM(doi.QUANTITY) or COUNT(fcs.REGION_ID). Make sure to put the table alias inside the function along with the column name.
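If you’d like to try the function format described above without touching a Data Warehouse table, here is a minimal sketch. The table sample_items, its columns, and its five rows are made up for illustration; the WITH clause simply builds that tiny table from Oracle’s built-in DUAL table so the query can run on its own.

```sql
-- Hypothetical sample data: five made-up PO lines, one with no quantity.
-- Nothing here comes from a real Data Warehouse table.
WITH sample_items AS (
    SELECT 'A1' AS item_code, 1 AS quantity_submitted FROM dual UNION ALL
    SELECT 'A2', 32   FROM dual UNION ALL
    SELECT 'A3', 2    FROM dual UNION ALL
    SELECT 'A4', NULL FROM dual UNION ALL
    SELECT 'A5', 2    FROM dual
)
SELECT COUNT(*)                     AS row_count            -- counts rows
, COUNT(si.quantity_submitted)      AS non_null_quantities  -- counts non-NULL values only
, MIN(si.quantity_submitted)        AS min_quantity
, MAX(si.quantity_submitted)        AS max_quantity
, SUM(si.quantity_submitted)        AS total_quantity
, AVG(si.quantity_submitted)        AS avg_quantity
FROM sample_items si;
```

On this made-up data, COUNT(*) is 5 but COUNT(si.quantity_submitted) is 4, because aggregate functions skip NULLs. For the same reason, AVG works out to 37 / 4 = 9.25 rather than 37 / 5.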
If we wanted to COUNT how many records there are in the D_WAREHOUSES table, we could write the following:

    SELECT /*+ use_hash(fcs) */
      COUNT(fcs.NAME)
    FROM D_WAREHOUSES fcs;

    COUNT(FCS.NAME)
    4960

Using COUNT to count the number of records in the NAME column, we know that there are 4960 records in the D_WAREHOUSES table, without having to pull all the records and count them ourselves. (COUNT skips NULLs, so this matches the number of records only if every record has a NAME.)

If we wanted to know the first and last dates on which a record was entered into the D_WAREHOUSES table, we could use the MIN and MAX functions:

    SELECT /*+ use_hash(fcs) */
      MIN(fcs.DW_CREATION_DATE)
    , MAX(fcs.DW_CREATION_DATE)
    FROM D_WAREHOUSES fcs;

    MIN(FCS.DW_CREATION_DATE)  MAX(FCS.DW_CREATION_DATE)
    1/20/2009 17:45            8/8/2011 7:07

From the results, we learn that the first record was created by DataWarehouse on 1/20/2009, and the last record was created on 8/8/2011. Notice that the same column name (DW_CREATION_DATE) was evaluated in both fields of the SELECT clause, but in the first field we ran the MIN function on that column, and in the second field we ran the MAX function on that column.

The SUM function allows you to add up everything in a column, and get a total. One example might be if you want to know how many units we submitted on a specific PO. We can find that out using the SUM function:

    SELECT /*+ use_hash(doi) */
      SUM(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.ORDER_DAY = to_date('20090319','YYYYMMDD')
    AND doi.ORDER_ID = 'C4811075';

    SUM(DOI.QUANTITY_SUBMITTED)
    37

If we want to know the average number of units submitted on that PO, we could swap the SUM function for AVG:

    SELECT /*+ use_hash(doi) */
      AVG(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.ORDER_DAY = to_date('20090319','YYYYMMDD')
    AND doi.ORDER_ID = 'C4811075';

    AVG(DOI.QUANTITY_SUBMITTED)
    7.4

This shows us that for the ASINs we ordered on PO C4811075, the average number of units ordered was 7.4.
Putting all these functions together, we could learn a lot about the PO in one query:

    SELECT /*+ use_hash(doi) */
      COUNT(doi.ISBN)
    , MIN(doi.QUANTITY_SUBMITTED)
    , MAX(doi.QUANTITY_SUBMITTED)
    , SUM(doi.QUANTITY_SUBMITTED)
    , AVG(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.ORDER_DAY = to_date('20090319','YYYYMMDD')
    AND doi.ORDER_ID = 'C4811075';

    COUNT(DOI.ISBN)  MIN(DOI.QUANTITY_SUBMITTED)  MAX(DOI.QUANTITY_SUBMITTED)  SUM(DOI.QUANTITY_SUBMITTED)  AVG(DOI.QUANTITY_SUBMITTED)
    5                1                            32                           37                           7.4

We learn that we ordered 5 ASINs on this PO, that the minimum units ordered was 1 but the maximum was 32, that we ordered 37 units total, and the average was 7.4.

Aggregate Functions with DISTINCT
Sometimes, particularly with the COUNT function, you’ll want to find out how many unique records are in a table, which might be different than the count of total records. For example, in the table D_WAREHOUSES, we learned earlier that there are 4960 records, by using COUNT to count the NAME column.

    SELECT /*+ use_hash(fcs) */
      COUNT(fcs.NAME)
    FROM D_WAREHOUSES fcs;

    COUNT(FCS.NAME)
    4960

However, some of those names might repeat, so there might not be 4960 unique NAME values in the table. To find that out, we combine DISTINCT with our aggregate function, but this time putting it inside the function, before the column name. In this example, we put DISTINCT inside the COUNT function, to COUNT the DISTINCT values in the fcs.NAME column:

    SELECT /*+ use_hash(fcs) */
      COUNT(DISTINCT fcs.NAME)
    FROM D_WAREHOUSES fcs;

    COUNT(DISTINCTFCS.NAME)
    4790

By adding DISTINCT to our COUNT function, we find that although there are 4960 values in the NAME column, there are only 4790 DISTINCT values in that column, so some must repeat.

GROUP BY
So far, we’ve aggregated information for a whole table (in the case of D_WAREHOUSES) and for a set of records limited by the WHERE clause (as in our queries against D_DISTRIBUTOR_ORDER_ITEMS to learn about PO C4811075).
Now we’ll talk about how to use those same aggregate functions to group sets of records together for each unique value in certain columns, while aggregating other columns. For example, we might want to know how many units we ordered on each PO we placed with Wiley on a given day. We could query how many units we ordered from Wiley on 3/16/2009, like this:

    SELECT /*+ use_hash(doi) */
      SUM(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.LEGAL_ENTITY_ID = 101
    AND doi.ORDER_DAY = to_date('20090316','YYYYMMDD')
    AND doi.DISTRIBUTOR_ID = 'WILEY';

    SUM(DOI.QUANTITY_SUBMITTED)
    57218

But that doesn’t tell us the total for each PO. Rather than run the query once for each PO, we can add ORDER_ID to the SELECT clause and add the GROUP BY clause with ORDER_ID, so the query SUMs up the number of units ordered for each PO. The GROUP BY clause comes after the WHERE clause (but before the ORDER BY clause, if you’re using one), and indicates which columns from your SELECT clause you want to group the results by. To group our query above by PO, we’d add it to the SELECT clause and to the GROUP BY clause, like this:

    SELECT /*+ use_hash(doi) */
      doi.ORDER_ID
    , SUM(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.LEGAL_ENTITY_ID = 101
    AND doi.ORDER_DAY = to_date('20090316','YYYYMMDD')
    AND doi.DISTRIBUTOR_ID = 'WILEY'
    GROUP BY doi.ORDER_ID;

    ORDER_ID  SUM(DOI.QUANTITY_SUBMITTED)
    M7444521  1
    P4010601  3
    R7453213  57203
    U1897503  5
    Q0625613  6

Now we know how much we ordered on each of the 5 POs we placed with Wiley on that day.

We can add additional columns to our SELECT clause to get more information. If a new column is an aggregate, such as a COUNT or AVG function, then we don’t need to put it in the GROUP BY clause. But if it isn’t an aggregate, we’ll need to also add it to the GROUP BY clause.
For example, we could add a COUNT of DISTINCT ASINs on each PO, as well as add the STATUS column - which is an ASIN level attribute in the table that indicates whether that ASIN was Backordered (BO) or not on that PO. For each PO, there may be some ASINs that are backordered, and some that aren’t.

    SELECT /*+ use_hash(doi) */
      doi.ORDER_ID
    , doi.STATUS
    , SUM(doi.QUANTITY_SUBMITTED)
    , COUNT(DISTINCT doi.ISBN)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.LEGAL_ENTITY_ID = 101
    AND doi.ORDER_DAY = to_date('20090316','YYYYMMDD')
    AND doi.DISTRIBUTOR_ID = 'WILEY'
    GROUP BY doi.ORDER_ID
    , doi.STATUS;

    ORDER_ID                     M7444521  P4010601  Q0625613  Q0625613  R7453213  R7453213  U1897503
    STATUS                       BO  BO
    SUM(DOI.QUANTITY_SUBMITTED)  1  3  2  4  1296  55907  5
    COUNT(DISTINCTDOI.ISBN)      1  1  2  3  416  5930  1

We don’t need to put COUNT(DISTINCT doi.ISBN) in the GROUP BY clause, because that column includes an aggregate function. But we do need to put doi.STATUS in the GROUP BY clause, because it doesn’t include an aggregate function. You’ll notice that since we grouped by two columns, we got some additional rows of data. That’s because POs Q0625613 and R7453213 both had some ASINs that were backordered, and some that were not. Our SUM and COUNT data is now grouped by both ORDER_ID and STATUS.

For people familiar with Excel Pivot tables, it can be helpful to think of queries using GROUP BY as something like a Pivot table, with certain fields being grouped and certain columns being summed, counted, averaged, etc. Each time you add in a new level of grouping, the values in the aggregated columns change.

A GROUP BY clause is only needed if you have BOTH Aggregate functions and non-Aggregate elements in your SELECT clause. One easy way to make sure they’re in sync is to copy all the elements in your SELECT clause and paste them in your GROUP BY clause, then delete any elements with Aggregate functions (SUM, COUNT, MIN, etc). (You also need to delete any Column Aliases from the GROUP BY clause.)
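The GROUP BY rule above can be checked on data small enough to add up by hand. This is only a sketch: the table sample_items, its columns, and its four rows are invented for illustration, and the WITH clause builds them from Oracle’s DUAL table so no Data Warehouse table is involved.

```sql
-- Hypothetical sample data: PO1 has one backordered line and two normal lines,
-- PO2 has a single normal line. STATUS is NULL when the line is not backordered.
WITH sample_items AS (
    SELECT 'PO1' AS order_id, 'BO' AS status, 'A1' AS item_code, 2 AS quantity_submitted FROM dual UNION ALL
    SELECT 'PO1', NULL, 'A2', 4 FROM dual UNION ALL
    SELECT 'PO1', NULL, 'A3', 1 FROM dual UNION ALL
    SELECT 'PO2', NULL, 'A1', 5 FROM dual
)
SELECT si.order_id                   -- not aggregated, so it must be in GROUP BY
, si.status                          -- not aggregated, so it must be in GROUP BY
, SUM(si.quantity_submitted)         AS total_quantity   -- aggregate, left out of GROUP BY
, COUNT(DISTINCT si.item_code)       AS distinct_items   -- aggregate, left out of GROUP BY
FROM sample_items si
GROUP BY si.order_id
, si.status
ORDER BY si.order_id
, si.status;
```

On this made-up data the query should return three rows: PO1 with status BO (2 units, 1 item), PO1 with no status (5 units, 2 items), and PO2 with no status (5 units, 1 item). If you remove si.status from both the SELECT and GROUP BY clauses, PO1 collapses back into a single row with a total of 7.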
The HAVING Clause
Once you begin aggregating, you’ll find that you may want to limit your results to only records where the result of an aggregation meets certain criteria. For example, we might only want to look at POs where we ordered 1 unit on the entire PO. We can’t do this in the WHERE clause, because the conditions in the WHERE clause are evaluated before we aggregate. Going back to our earlier example, where we summed the units submitted on all POs for Wiley on 3/16/2009, if we tried to find all POs where we only ordered one unit on the entire PO by limiting the WHERE clause, we’d get the wrong results:

    SELECT /*+ use_hash(doi) */
      doi.ORDER_ID
    , SUM(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.LEGAL_ENTITY_ID = 101
    AND doi.ORDER_DAY = to_date('20090316','YYYYMMDD')
    AND doi.DISTRIBUTOR_ID = 'WILEY'
    AND doi.QUANTITY_SUBMITTED = 1
    GROUP BY doi.ORDER_ID;

    ORDER_ID  SUM(DOI.QUANTITY_SUBMITTED)
    M7444521  1
    Q0625613  4
    R7453213  1811

We’re actually looking at all the POs WHERE we only ordered one unit of at least one ASIN on that PO, and then summing the quantities of those ASINs – which we can see because the SUM of the QUANTITY_SUBMITTED on two POs is greater than one. This is a totally valid query, but doesn’t answer the question we were asking: which POs submitted to Wiley on 3/16/09 only had one unit submitted on the entire PO?

To get the answer to that, we use a HAVING clause. A HAVING clause is put at the end of an aggregate query, after the GROUP BY, to limit the results AFTER the aggregation is done. It’s a filter, just like the WHERE clause, but the filtering is done after things are summed and counted and averaged.
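To see the two filters side by side on data you can check by hand, here is a small sketch. The table sample_items and its three rows are invented for illustration (built from Oracle’s DUAL table); only the placement of the condition differs between the two queries.

```sql
-- Query 1: the condition is in WHERE, so rows are filtered BEFORE they are summed.
-- Hypothetical data: PO1 has one line of 1 unit; PO2 has lines of 1 and 3 units.
WITH sample_items AS (
    SELECT 'PO1' AS order_id, 'A1' AS item_code, 1 AS quantity_submitted FROM dual UNION ALL
    SELECT 'PO2', 'A1', 1 FROM dual UNION ALL
    SELECT 'PO2', 'A2', 3 FROM dual
)
SELECT si.order_id
, SUM(si.quantity_submitted) AS total_quantity
FROM sample_items si
WHERE si.quantity_submitted = 1
GROUP BY si.order_id;

-- Query 2: the condition is in HAVING, so groups are filtered AFTER they are summed.
WITH sample_items AS (
    SELECT 'PO1' AS order_id, 'A1' AS item_code, 1 AS quantity_submitted FROM dual UNION ALL
    SELECT 'PO2', 'A1', 1 FROM dual UNION ALL
    SELECT 'PO2', 'A2', 3 FROM dual
)
SELECT si.order_id
, SUM(si.quantity_submitted) AS total_quantity
FROM sample_items si
GROUP BY si.order_id
HAVING SUM(si.quantity_submitted) = 1;
```

On this made-up data, the first query returns both PO1 and PO2 with a total of 1 each, even though 4 units were ordered on PO2, because the 3-unit line was thrown away before the SUM. The second query returns only PO1, the one PO whose whole-PO total is 1.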
Applied to our Wiley example, the query becomes:

    SELECT /*+ use_hash(doi) */
      doi.ORDER_ID
    , SUM(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
    AND doi.LEGAL_ENTITY_ID = 101
    AND doi.ORDER_DAY = to_date('20090316','YYYYMMDD')
    AND doi.DISTRIBUTOR_ID = 'WILEY'
    GROUP BY doi.ORDER_ID
    HAVING SUM(doi.QUANTITY_SUBMITTED) = 1;

    ORDER_ID  SUM(DOI.QUANTITY_SUBMITTED)
    M7444521  1

One way to think about HAVING is to imagine the results of the query if we’d run it without the HAVING clause, then filter those by the conditions in the HAVING clause. We actually ran this query without the HAVING clause in an earlier example, getting:

    ORDER_ID  SUM(DOI.QUANTITY_SUBMITTED)
    M7444521  1
    P4010601  3
    R7453213  57203
    U1897503  5
    Q0625613  6

So we could expect the result we got – only PO M7444521 had just a single unit ordered on the entire PO.

It’s worth noting that since the HAVING clause adds a second round of filtering to the query, it can add to the query’s run time, too.

Week 3 Homework:
1. Read Chapter 4 in Mastering Oracle SQL.
2. Check out the table D_DISTRIBUTOR_ORDERS in the BI Metadata. How is it Partitioned?
3. Create a query to count how many POs have been created for the vendor code PRBRC in the US, using the table D_DISTRIBUTOR_ORDERS. Remember to use BI Metadata to determine which columns are Partitioned, and make sure you include those in your WHERE clause. Run the query through an Explain Plan before running it.
4. Add elements to find the first ORDER_DAY and the last ORDER_DAY for PRBRC POs.
5. Add a column to count how many distinct ORDER_DAY values there are.
6. Add a column to sum up the total SHIPPING_COST.
7. Add DISTRIBUTOR_ID as an element of your SELECT clause. You’ll need to add a GROUP BY clause, since this is not an aggregated column.
8. Add HANDLER as an element in the SELECT clause. Who created the most POs for PRBRC? When was the last time danac created one?
9. Use the HAVING clause to limit the results to only handlers who created between 30 and 40 PRBRC POs.
10.
Further limit the results to only handlers who created PRBRC POs on at least 10 different days.
11. Rerun the query, but this time, have it publish to the folder \\ant\dept\BMVDSA\Books\ETL_Practice\, rather than email you the results. Have the file name include both the Job Profile and Job Run Wildcards, with the appropriate (.txt) extension for a tab delimited text file.

Week 4 - Joining Tables

SQL Topics
Joining 2 or More Tables – Old School & New School Approaches
INNER Joins vs. OUTER Joins
WHERE Clause Conditions with OUTER Joins
One-to-Many Joins

SQL TOPICS

Joining 2 or More Tables – Old School & New School Approaches
Getting data out of one table is great, but ETL allows you the flexibility to join multiple tables in the Data Warehouse together and pull custom data sets that meet your business needs. With the rollout of Oracle version 9i, a new method of joining tables was introduced, which is what our text and I will use. But you’ll surely run into code that uses the old syntax, so I recommend reading the Appendix on page 449 of Mastering Oracle SQL, so you aren’t left confused when you find commas in the FROM clause and (+) in the WHERE clause. There are several advantages to the new syntax that you can read about in your text, and I feel it’s easier to understand than the old syntax.

The ‘New School’ approach to joining tables uses the FROM clause to indicate which tables you want information from AND how they are joined together.
For example, if I wanted to join the VENDORS table (which has lots of great Vendor Master data) to the O_AMAZON_BUSINESS_GROUPS table to translate the AMAZON_BUSINESS_GROUP_ID number into the description of the business group that I’m familiar with, I’d do the following:

    SELECT /*+ use_hash(v,o_abg) */
      v.VENDOR_ID
    , v.PRIMARY_VENDOR_CODE
    , v.VENDOR_NAME
    , v.AMAZON_BUSINESS_GROUP_ID
    , o_abg.TYPE
    FROM VENDORS v
    JOIN O_AMAZON_BUSINESS_GROUPS o_abg
    ON v.AMAZON_BUSINESS_GROUP_ID = o_abg.ID
    WHERE v.PRIMARY_VENDOR_CODE = 'RANDO';

    VENDOR_ID  PRIMARY_VENDOR_CODE  VENDOR_NAME   AMAZON_BUSINESS_GROUP_ID  TYPE
    3453       RANDO                Random House  1                         US Books

The syntax is to start your FROM clause and enter the name and alias of the first table. Then specify the type of join (in this case a standard inner JOIN) and the name and alias of the second table. Follow that by the word ON, and then indicate which columns define the join between your two tables, with an equals sign between them.

Above, we joined the VENDORS table to the O_AMAZON_BUSINESS_GROUPS table

    FROM VENDORS v
    JOIN O_AMAZON_BUSINESS_GROUPS o_abg

and returned results where the AMAZON_BUSINESS_GROUP_ID in VENDORS is equal to the ID in O_AMAZON_BUSINESS_GROUPS.
    ON v.AMAZON_BUSINESS_GROUP_ID = o_abg.ID

You can also join ON multiple columns between two tables, by adding them to the ON clause, separated by AND:

    SELECT /*+ use_hash(ddo,doi) */
      ddo.ORDER_ID
    , doi.ISBN
    , doi.QUANTITY_SUBMITTED
    FROM D_DISTRIBUTOR_ORDERS ddo
    JOIN D_DISTRIBUTOR_ORDER_ITEMS doi
    ON ddo.ORDER_ID = doi.ORDER_ID
    AND ddo.DISTRIBUTOR_ID = doi.DISTRIBUTOR_ID
    WHERE ddo.REGION_ID = 1
    AND ddo.ORDER_ID = 'N9161983'
    AND doi.REGION_ID = 1
    AND doi.ORDER_DAY = TO_DATE('20090312','YYYYMMDD');

    ORDER_ID  ISBN        QUANTITY_SUBMITTED
    N9161983  0321357973  1

*When you join two tables, and both have Partitioning Schemes, be sure to include conditions in your WHERE clause to ensure you’re making use of the partitions in both tables.*

You can also join 3 or more tables together, of course, by specifying the JOIN type and JOIN ON condition for each additional table:

    SELECT /*+ use_hash(ddo,doi) */
      ddo.DISTRIBUTOR_ID
    , v.VENDOR_NAME
    , ddo.ORDER_ID
    , doi.ISBN
    , doi.QUANTITY_SUBMITTED
    FROM D_DISTRIBUTOR_ORDERS ddo
    JOIN D_DISTRIBUTOR_ORDER_ITEMS doi
    ON ddo.ORDER_ID = doi.ORDER_ID
    AND ddo.DISTRIBUTOR_ID = doi.DISTRIBUTOR_ID
    JOIN VENDORS v
    ON ddo.DISTRIBUTOR_ID = v.PRIMARY_VENDOR_CODE
    WHERE ddo.REGION_ID = 1
    AND ddo.ORDER_ID = 'N9161983'
    AND doi.REGION_ID = 1
    AND doi.ORDER_DAY = TO_DATE('20090312','YYYYMMDD');

    DISTRIBUTOR_ID  VENDOR_NAME               ORDER_ID  ISBN        QUANTITY_SUBMITTED
    PEAED           Pearson Technology Group  N9161983  0321357973  1

You can see that we joined ddo to doi on two columns, and we joined ddo to v on one column. Each table needs to be joined to at least one other table to avoid a Cartesian join.

Notice that as you begin joining multiple tables, you can begin including columns from all the tables as elements in your SELECT clause, and include conditions in your WHERE clause on columns from each of those tables. This is where the need for table aliases becomes clear – to let Oracle know that you want the DISTRIBUTOR_ID from D_DISTRIBUTOR_ORDERS, not VENDORS.

INNER Joins vs. OUTER Joins
There are two main types of JOINs used in writing SQL: INNER and OUTER JOINs.

An INNER JOIN will likely be what you use most often, and it is the default join type (thus you only need to type JOIN to use it). It returns only results where the condition specified in your JOIN ON section is true. In other words, it returns only records where it finds matching records in both tables. In the example above, the INNER JOIN limits the results to only return records from the table VENDORS that match to records in the table O_AMAZON_BUSINESS_GROUPS where the join condition v.AMAZON_BUSINESS_GROUP_ID = o_abg.ID is true. Because INNER JOIN is the default join type, any query where the join type is simply JOIN is actually an INNER JOIN.

An OUTER JOIN is used when you want to join two tables but you want all the records from one table and any results from the second table that match. OUTER JOINs can be of two main types, which seem confusing at first, but are really quite simple: LEFT and RIGHT OUTER JOINs.

One way to think about the differences between INNER and OUTER joins is with a Venn diagram of two overlapping circles, where each circle represents a table: A is the part found only in the first table, B is the overlap, and C is the part found only in the second table. An INNER JOIN (or simply JOIN) selects only those records that have values in common between both tables (the grey section, labeled B). An OUTER JOIN selects all records from the primary table, and any matching records for the secondary table (where the secondary table has values in common with the primary table). A LEFT OUTER JOIN (or simply LEFT JOIN) would select A + B, whereas a RIGHT JOIN would select B + C. The ON condition(s) specified in the JOIN indicate what values are evaluated for commonality.

To further illustrate the difference between INNER JOINs, LEFT JOINs, and RIGHT JOINs, we’ll use a silly example, joining the tables O_WAREHOUSES and D_WAREHOUSES in several ways. The results will be meaningless from a business sense, but hopefully illustrate the differences in these types of joins.
First off, we’ll look at the contents of these tables for all WAREHOUSE_ID values that start with ‘SDF’:

O_WAREHOUSES

    SELECT /*+ use_hash(ow) */
      ow.WAREHOUSE_ID ow_warehouse_id
    FROM O_WAREHOUSES ow
    WHERE ow.WAREHOUSE_ID LIKE 'SDF_';

    OW_WAREHOUSE_ID
    SDF1
    SDF2
    SDF3
    SDF4
    SDF6

D_WAREHOUSES

    SELECT /*+ use_hash(dw) */
      dw.WAREHOUSE_ID dw_warehouse_id
    FROM D_WAREHOUSES dw
    WHERE dw.WAREHOUSE_ID LIKE 'SDF_';

    DW_WAREHOUSE_ID
    SDF1
    SDF2
    SDF4
    SDF6

As the results above show, the O_WAREHOUSES table has records for SDF1, SDF2, SDF3, SDF4 and SDF6, while the D_WAREHOUSES table only has records for SDF1, SDF2, SDF4 and SDF6.

If we do an INNER JOIN of these two tables, we’ll only get results where a match is found between the two tables (as defined by the columns in our ON condition):

    SELECT /*+ use_hash(ow,dw) */
      ow.WAREHOUSE_ID ow_warehouse_id
    , dw.WAREHOUSE_ID dw_warehouse_id
    FROM O_WAREHOUSES ow
    JOIN D_WAREHOUSES dw
    ON ow.WAREHOUSE_ID = dw.WAREHOUSE_ID
    WHERE ow.WAREHOUSE_ID LIKE 'SDF_';

    OW_WAREHOUSE_ID  DW_WAREHOUSE_ID
    SDF1             SDF1
    SDF2             SDF2
    SDF4             SDF4
    SDF6             SDF6

Since D_WAREHOUSES doesn’t have a record for WAREHOUSE_ID SDF3, no row for SDF3 is returned from either table with an INNER JOIN.

If we change the query to a LEFT JOIN, the results will change:

    SELECT /*+ use_hash(ow,dw) */
      ow.WAREHOUSE_ID ow_warehouse_id
    , dw.WAREHOUSE_ID dw_warehouse_id
    FROM O_WAREHOUSES ow
    LEFT JOIN D_WAREHOUSES dw
    ON ow.WAREHOUSE_ID = dw.WAREHOUSE_ID
    WHERE ow.WAREHOUSE_ID LIKE 'SDF_';

    OW_WAREHOUSE_ID  DW_WAREHOUSE_ID
    SDF1             SDF1
    SDF2             SDF2
    SDF3             (null)
    SDF4             SDF4
    SDF6             SDF6

This time, we got results for all the records in O_WAREHOUSES, and the matching records (where they existed) in D_WAREHOUSES, and got a NULL in the second column where it didn’t find a match.

The difference between a LEFT JOIN and a RIGHT JOIN is simply which tables are listed on the LEFT and RIGHT of the JOIN. In our last example, O_WAREHOUSES is on the LEFT of the LEFT JOIN and D_WAREHOUSES is on the RIGHT of the LEFT JOIN.
In a LEFT JOIN, the table on the LEFT is given priority, and is the table that will return all results, even if no match is found in the table on the RIGHT of the JOIN. The same query could be written as a RIGHT JOIN and get the same results, simply by switching the order of the tables:

    SELECT /*+ use_hash(ow,dw) */
      ow.WAREHOUSE_ID ow_warehouse_id
    , dw.WAREHOUSE_ID dw_warehouse_id
    FROM D_WAREHOUSES dw
    RIGHT JOIN O_WAREHOUSES ow
    ON ow.WAREHOUSE_ID = dw.WAREHOUSE_ID
    WHERE ow.WAREHOUSE_ID LIKE 'SDF_';

    OW_WAREHOUSE_ID  DW_WAREHOUSE_ID
    SDF1             SDF1
    SDF2             SDF2
    SDF3             (null)
    SDF4             SDF4
    SDF6             SDF6

The difference between RIGHT and LEFT JOINs is strictly placement of table names in the SQL. To keep things simple, I always use LEFT JOINs. But it’s no better or worse than switching between LEFT and RIGHT JOINs, or using RIGHT JOINs exclusively. I recommend using whatever works best for you.

There are some additional types of JOINs described in the text, but these are rarely used and often wildly inefficient.

WHERE Clause Conditions with OUTER Joins
Regardless of whether you have an OUTER join specified or not, anything in your WHERE clause will limit your results. If you include a condition in your WHERE clause that applies to the secondary table on the RIGHT of a LEFT JOIN (or on the LEFT of a RIGHT JOIN), the query will not act like an OUTER join, because you’ve limited the results with conditions on both tables, making it behave like an INNER join. You’ve essentially overridden the OUTER JOIN by limiting the results to only records that exist in the secondary table.
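The effect described above is easy to reproduce on a pair of tiny made-up tables. In this sketch, sample_orders and sample_receipts (and their columns and rows) are invented for illustration and built from Oracle’s DUAL table; they are not Data Warehouse tables.

```sql
-- Query 1: plain LEFT JOIN. Every order comes back, matched or not.
-- Hypothetical data: two orders, but only PO1 has a receipt.
WITH sample_orders AS (
    SELECT 'PO1' AS order_id FROM dual UNION ALL
    SELECT 'PO2' FROM dual
)
, sample_receipts AS (
    SELECT 'PO1' AS order_id, 40 AS quantity_unpacked FROM dual
)
SELECT so.order_id
, sr.quantity_unpacked
FROM sample_orders so
LEFT JOIN sample_receipts sr
ON so.order_id = sr.order_id;

-- Query 2: same join, plus a WHERE condition on the secondary (RIGHT-hand) table.
-- PO2 has NULL in sr.quantity_unpacked, so it fails the condition and drops out.
WITH sample_orders AS (
    SELECT 'PO1' AS order_id FROM dual UNION ALL
    SELECT 'PO2' FROM dual
)
, sample_receipts AS (
    SELECT 'PO1' AS order_id, 40 AS quantity_unpacked FROM dual
)
SELECT so.order_id
, sr.quantity_unpacked
FROM sample_orders so
LEFT JOIN sample_receipts sr
ON so.order_id = sr.order_id
WHERE sr.quantity_unpacked > 0;
```

On this made-up data, the first query returns PO1 with 40 and PO2 with a NULL, while the second returns only PO1, which is exactly what an INNER JOIN would have returned.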
For example, if we added a WHERE clause condition that applies to the DESCRIPTION column in O_PAYMENT_ITEM_TYPES – which is on the RIGHT of a LEFT JOIN – we get only results where that condition is true, making the query behave like an INNER join:

    SELECT /*+ use_hash(o_pit,o_pt) */
      o_pit.PAYMENT_ITEM_TYPE_ID
    , o_pit.DESCRIPTION
    , o_pt.PAYMENT_TYPE_ID
    , o_pt.DESCRIPTION
    FROM O_PAYMENT_TYPES o_pt
    LEFT JOIN O_PAYMENT_ITEM_TYPES o_pit
    ON o_pit.PAYMENT_ITEM_TYPE_ID = o_pt.PAYMENT_TYPE_ID
    WHERE o_pit.DESCRIPTION = 'Refund';

    PAYMENT_ITEM_TYPE_ID  DESCRIPTION  PAYMENT_TYPE_ID  DESCRIPTION_1
    2                     Refund       2                zShops

One way around this problem is to place those conditions in the JOIN clause, like this:

    SELECT /*+ use_hash(o_pit,o_pt) */
      o_pit.PAYMENT_ITEM_TYPE_ID
    , o_pit.DESCRIPTION
    , o_pt.PAYMENT_TYPE_ID
    , o_pt.DESCRIPTION
    FROM O_PAYMENT_TYPES o_pt
    LEFT JOIN O_PAYMENT_ITEM_TYPES o_pit
    ON o_pit.PAYMENT_ITEM_TYPE_ID = o_pt.PAYMENT_TYPE_ID
    AND o_pit.DESCRIPTION = 'Refund';

    PAYMENT_ITEM_TYPE_ID  DESCRIPTION  PAYMENT_TYPE_ID  DESCRIPTION_1
    (null)                (null)       1                Auctions
    2                     Refund       2                zShops
    (null)                (null)       5                zMe
    (null)                (null)       6                Marketplace
    (null)                (null)       7                MVP
    (null)                (null)       8                Catalogue

Putting the condition in the JOIN clause no longer limits the full query, as when it was in the WHERE clause, but it does still limit the results. Think of it as limiting only the JOIN when it’s in the JOIN clause, but limiting the whole query when in the WHERE clause.

One-to-Many Joins
As the final topic this week, I wanted to end with a warning about JOINs of all kinds, by introducing the concept of the ‘grain’ of a table. When people talk about the grain of a table, they mean the level of detail in that table. For example, D_DISTRIBUTOR_ORDERS is at the grain of POs. That means it contains just one row of data for each Purchase Order. The related table D_DISTRIBUTOR_ORDER_ITEMS is at the grain of the PO and ASIN, so it has one row for each unique combination of PO and ASIN.
The somewhat related table, D_DISTRIBUTOR_SHIPMENT_ITEMS, contains all the records of PO items that have been received, and its grain is PO, ASIN and Shipment – because a single ASIN can be received to a single PO on multiple occasions. Knowing the grain of a table (usually by looking at some sample data) is important to understanding how to properly join to it.

If I join D_DISTRIBUTOR_ORDER_ITEMS (with a grain of PO/ASIN) to D_DISTRIBUTOR_SHIPMENT_ITEMS (with a grain of PO/ASIN/Shipment) on the PO and ASIN columns (ORDER_ID and ISBN), the results look straightforward for PO L9549101:

    SELECT /*+ use_hash(doi,dsi) */
      doi.ORDER_ID
    , doi.ISBN
    , doi.QUANTITY_SUBMITTED
    , doi.QUANTITY
    , dsi.QUANTITY_UNPACKED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN D_DISTRIBUTOR_SHIPMENT_ITEMS dsi
    ON doi.ORDER_ID = dsi.ORDER_ID
    AND doi.ISBN = dsi.ISBN
    WHERE doi.REGION_ID = 1
    AND doi.ORDER_DAY = to_date('20090115','YYYYMMDD')
    AND doi.ORDER_ID = 'L9549101'
    AND dsi.REGION_ID = 1
    AND dsi.RECEIVED_DAY = to_date('20090119','YYYYMMDD');

    ORDER_ID  ISBN        QUANTITY_SUBMITTED  QUANTITY  QUANTITY_UNPACKED
    L9549101  0316032220  40                  40        40

It shows we submitted 40 units of ASIN 0316032220, 40 units were confirmed (QUANTITY), and 40 units were received (QUANTITY_UNPACKED).
However, for an ASIN on a PO that was received in multiple shipments, things can look a little odd in the results:

    SELECT /*+ use_hash(doi,dsi) */
      doi.ORDER_ID
    , doi.ISBN
    , doi.QUANTITY_SUBMITTED
    , doi.QUANTITY
    , dsi.QUANTITY_UNPACKED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN D_DISTRIBUTOR_SHIPMENT_ITEMS dsi
    ON doi.ORDER_ID = dsi.ORDER_ID
    AND doi.ISBN = dsi.ISBN
    WHERE doi.REGION_ID = 1
    AND doi.ORDER_DAY = to_date('20090126','YYYYMMDD')
    AND doi.ORDER_ID = 'R1735263'
    AND doi.ISBN = '0738210943'
    AND dsi.REGION_ID = 1
    AND dsi.RECEIVED_DAY BETWEEN to_date('20090205','YYYYMMDD') AND to_date('20090213','YYYYMMDD');

    ORDER_ID  ISBN        QUANTITY_SUBMITTED  QUANTITY  QUANTITY_UNPACKED
    R1735263  0738210943  19                  19        6
    R1735263  0738210943  19                  19        13

Because there are two records in the table D_DISTRIBUTOR_SHIPMENT_ITEMS that match to the PO and ASIN we are querying in D_DISTRIBUTOR_ORDER_ITEMS, we get two records back. This is a One-to-Many join. Sometimes that’s just what you want, but in this case, we might mistakenly think that we ordered 38 units (19+19), which is twice what we actually ordered. We’ll explore some ways to avoid this issue later, but I wanted you to begin thinking about table granularity and be aware of how it can result in one-to-many joins and possible double-counting of records.

Week 4 Homework:
1. Read Chapter 3 and the Appendix covering the old join syntax in Mastering Oracle SQL.
2. Check out the tables D_MP_ASINS_ESSENTIALS and D_ASINS_MARKETPLACE_ATTRIBUTES in BI Metadata. What are the Partitioned columns in each table? What do you think the grain is of each table? Which table includes the BINDING column? Which table contains the ITEM_NAME column?
3. Write a single query, joining those two tables together, to determine the name and binding of ASIN 0385240880 in the Marketplace related to your business. Be sure to make use of Partitions in your WHERE clause, and use the ‘use_hash’ hint in your SELECT clause. Run it through Explain Plan, then run it.
4.
Edit the query to add the Manufacturer Code for the ASIN. Be sure to make use of Partitions. Run it through Explain Plan then run it. (hint: Look at the first example query from week 3, in the ETL section).
5. Create a query to pull the Vendor Code, Vendor Name, Business Group ID and Business Group Name for Vendor ID 3453. Run it through Explain Plan then run it. (hint: check out the first example query from week 4.)
6. Edit the query to pull a list of all Business IDs and Business Group Names for Canada that do not match to any Vendor Codes. Run it through Explain Plan then run it. (hint: look for records where the VENDOR_ID IS NULL).
7. Create a query that emails you every day with ASIN level details (including PO, Vendor, ASIN, and Quantity) of all receipts to POs for vendor BATBO in the US. Let it run daily for a few days. Since D_DISTRIBUTOR_SHIPMENT_ITEMS is partitioned by RECEIVED_DAY (in addition to REGION_ID), but we haven’t covered Dates yet, please include this in your WHERE clause: AND RECEIVED_DAY = TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD'), in addition to a condition on the other partitioned column. Run it through Explain Plan before you schedule it.

Week 5 – Dealing with Dates in SQL

SQL Topics
DATE vs. DATETIME columns
The TO_CHAR() Function with Dates
The TRUNC() Function with Dates
The TO_DATE() Function
Using BETWEEN with Dates
Other Date functions

SQL TOPICS

DATE vs. DATETIME columns
While working with Data Warehouse tables, you’ll find two types of DATE columns: DATE columns that are truncated to only the Month, Day, and Year information (e.g. 12/31/2008), and DATE columns that also contain the Hour, Minute, and Seconds (e.g. 12/31/2008 08:13:52) – known as the DATETIME format. Both types of columns are of the Data Type ‘DATE’, and store full date & time information, but the DATE format columns are truncated to the beginning of the first second of the day.
Although it’s not always obvious from just looking at BI Metadata which type a column is, most of the DATETIME fields have DATETIME in their name (like the columns ORDER_DATETIME and CONFIRMATION_DATETIME in D_DISTRIBUTOR_ORDERS), while DATE type columns often use DATE or DAY in their column name (like ORDER_DAY in D_DISTRIBUTOR_ORDERS). This isn’t a hard and fast rule, however, even within a single table. For example, the column CREATION_DATE in D_DISTRIBUTOR_ORDERS is actually a DATETIME field, which we see via this query of the various date fields in D_DISTRIBUTOR_ORDERS for PO M5969483.

    SELECT /*+ use_hash(ddo) */
      ddo.ORDER_ID
    , ddo.CREATION_DATE
    , ddo.ORDER_DAY
    , ddo.ORDER_DATETIME
    , ddo.CONFIRMATION_DATETIME
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
    AND ddo.ORDER_ID = 'M5969483';

    ORDER_ID  CREATION_DATE    ORDER_DAY  ORDER_DATETIME   CONFIRMATION_DATETIME
    M5969483  3/25/2009 18:14  3/25/2009  3/25/2009 11:14  3/25/2009 11:52

The TO_CHAR() Function with Dates
There are many ways to write a date, from the US standard of 03/31/2009 to the UK standard of 31/03/2009, writing them as March 31st, 2009, or combinations of words and numbers, like 31-MAR-09. Some of these formats can be very precise, while others are less so. For example, if a Book was published on 31-MAR-09, do we know if it was published in 2009 or 1909? Unfortunately, we don’t, and programs like Excel may make assumptions that could be wrong.

When writing SQL queries, you may find you want to control the format of a date column in your results, so you always know what format it will be in and so there is never any question of exactly what the date means. To do this, we use the TO_CHAR() function, which converts the DATE to a character string, in a format specified by you.

To use the TO_CHAR() function, you include the column name followed by a comma and then the format (enclosed in single quotes) within the parentheses.
For example, to convert the ORDER_DATETIME to just the Month, Day, and Year format, we put the ORDER_DATETIME column name in the TO_CHAR() function and then enter the format MM/DD/YYYY in single quotes, like this:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DATETIME
     , TO_CHAR(ddo.ORDER_DATETIME,'MM/DD/YYYY')
     , ddo.ORDER_DAY
     , TO_CHAR(ddo.ORDER_DAY,'MM/DD/YYYY HH24:MI:SS')
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.ORDER_ID = 'M5969483';

ORDER_ID  ORDER_DATETIME         TO_CHAR(DDO.ORDER_DATETIME,'MM  ORDER_DAY  TO_CHAR(DDO.ORDER_DAY,'MM/DD/Y
M5969483  3/25/2009 11:14:56 AM  3/25/2009                       3/25/2009  03/25/2009 00:00:00

The format of the column returned in your results is what we specified, without the time stamp information. Also, notice that we formatted the ORDER_DAY column to include the full DATETIME - hours, minutes and seconds - in column 5 of our results. It returns 03/25/2009 00:00:00, because the time data is always stored, but for a DATE format column it is stored as the first second of the day. (Note that minutes are written MI in a format; MM always means the month.)

There are numerous formats you can use to get dates into the style you want, and you can mix-and-match components, as well. A table begins on page 135 in Mastering Oracle SQL with a detailed list of options and their output, but here are some examples:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DATETIME
     , TO_CHAR(ddo.ORDER_DATETIME,'YYYYMMDD')
     , TO_CHAR(ddo.ORDER_DATETIME,'D')
     , TO_CHAR(ddo.ORDER_DATETIME,'DAY')
     , TO_CHAR(ddo.ORDER_DATETIME,'CC')
     , TO_CHAR(ddo.ORDER_DATETIME,'HH AM" on a "Day", the "DDDTH" day of "YYYY"')
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.ORDER_ID = 'M5969483';

ORDER_ID  ORDER_DATETIME   YYYYMMDD  D  DAY        CC  HH AM" on a "Day", the "DDDTH" day of "YYYY"
M5969483  3/25/2009 11:14  20090325  4  WEDNESDAY  21  11 AM on a Wednesday, the 084TH day of 2009

You can get very simple (like finding the Century with CC) or very complex, such as creating a text string.
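If you want to try a few formats without querying a warehouse table, here is a minimal sketch that runs against Oracle's built-in DUAL table. The literal date and the column aliases (US_STYLE, UK_STYLE and so on) are made up for illustration; in a real query you would pass a column such as ddo.ORDER_DATETIME instead.

```sql
-- Illustrative only: a fixed literal stands in for a DATETIME column.
SELECT TO_CHAR(t.SAMPLE_DATETIME,'MM/DD/YYYY')            AS US_STYLE
     , TO_CHAR(t.SAMPLE_DATETIME,'DD/MM/YYYY')            AS UK_STYLE
     , TO_CHAR(t.SAMPLE_DATETIME,'YYYYMMDD')              AS SORTABLE
     -- MI is minutes and SS is seconds; MM would repeat the month here.
     , TO_CHAR(t.SAMPLE_DATETIME,'MM/DD/YYYY HH24:MI:SS') AS WITH_TIME
FROM (SELECT TO_DATE('20090325 111456','YYYYMMDD HH24MISS') AS SAMPLE_DATETIME
      FROM DUAL) t;
```

Because every output column states its own format, the results should read the same no matter what default date format your session or tool uses.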
Think about the format that will be most meaningful to the people using your data. And don’t take for granted that a date field will output MM/DD/YYYY if you don’t specify a format – ETL often seems to default to the troublesome DD-MON-YY format (e.g. 31-MAR-09).

The TRUNC() Function with Dates

Another type of conversion you can do to a DATE field is to truncate the date using the TRUNC() function. TRUNC() is used much like TO_CHAR(), but instead of translating the DATE field into a character string, it truncates it to the level you specify and leaves it in a DATE format. One common example is to truncate a date to the first day of the week, which can be done like this:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DATETIME
     , TRUNC(ddo.ORDER_DATETIME,'D')
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.ORDER_ID = 'M5969483';

ORDER_ID  ORDER_DATETIME   TRUNC(DDO.ORDER_DATETIME,'D')
M5969483  3/25/2009 11:14  3/22/2009

You’ll notice that when we used TRUNC with the ‘D’ option, it truncated the ORDER_DATETIME of 3/25/2009 11:14 to the first second of the first hour of the first day of the week: 3/22/2009. A similar option, ‘DDD’, will truncate a date to the first second of the first hour of the same day – essentially chopping off the timestamp information from a DATETIME field:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DATETIME
     , TRUNC(ddo.ORDER_DATETIME,'DDD')
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.ORDER_ID = 'M5969483';

ORDER_ID  ORDER_DATETIME   TRUNC(DDO.ORDER_DATETIME,'DDD'
M5969483  3/25/2009 11:14  3/25/2009

Superficially, this looks like the same result we got from the TO_CHAR() function, but because TRUNC returns its result still in a DATE format, we can perform math functions on the result, such as adding days, and logical functions like comparing to another date.
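To make that difference concrete, here is a small sketch against Oracle's DUAL table. The literal datetime and the aliases are assumptions chosen for illustration, and how the DATE values display will depend on your tool's default date format.

```sql
-- Illustrative only: the literal below stands in for a DATETIME column.
SELECT TRUNC(t.ORDER_DATETIME)     AS ORDER_DAY_START   -- still a DATE
     , TRUNC(t.ORDER_DATETIME) + 7 AS ONE_WEEK_LATER    -- date math works
FROM (SELECT TO_DATE('20090325 111456','YYYYMMDD HH24MISS') AS ORDER_DATETIME
      FROM DUAL) t
-- A truncated DATE can be compared directly to another DATE.
WHERE TRUNC(t.ORDER_DATETIME) = TO_DATE('20090325','YYYYMMDD');
```

Swapping TRUNC(t.ORDER_DATETIME) + 7 for TO_CHAR(t.ORDER_DATETIME,'MM/DD/YYYY') + 7 would be expected to fail, since TO_CHAR() hands back a character string rather than a DATE.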
Since truncating a DATETIME to the start of that day is probably the most common use of the TRUNC() function, it is the default. So you can get the same result as above by leaving off the format, saving yourself some time:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DATETIME
     , TRUNC(ddo.ORDER_DATETIME)
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.ORDER_ID = 'M5969483';

ORDER_ID  ORDER_DATETIME   TRUNC(DDO.ORDER_DATETIME)
M5969483  3/25/2009 11:14  3/25/2009

Like TO_CHAR(), there are numerous options to choose from when using TRUNC(), which are listed in a table that begins on page 159 of Mastering Oracle SQL. Here are just a few examples, truncating to the beginning of the month, quarter, year, and century:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DATETIME
     , TRUNC(ddo.ORDER_DATETIME,'MM')
     , TRUNC(ddo.ORDER_DATETIME,'Q')
     , TRUNC(ddo.ORDER_DATETIME,'Y')
     , TRUNC(ddo.ORDER_DATETIME,'CC')
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.ORDER_ID = 'M5969483';

ORDER_ID  ORDER_DATETIME   MM        Q         Y         CC
M5969483  3/25/2009 11:14  3/1/2009  1/1/2009  1/1/2009  1/1/2001

The TO_DATE() Function

One frequent use of DATE columns, besides returning them in your results, is to use them in your WHERE clause to limit your results. In fact, DATE columns are commonly used as partitions on tables, so this use is very common. A function called TO_DATE() comes in handy when working with DATE columns in your WHERE clause. It’s essentially the opposite of the TO_CHAR() function – turning a character string into a DATE format. This is vital, because you shouldn’t compare a column that is in a DATE format to a text string – only to a DATE. So when setting a condition in your WHERE clause, you use the TO_DATE() function to translate a text string into a DATE format, and then compare a DATE column to it.
For example, if we wanted to see which POs have an ORDER_DAY of 3/25/2009, we’d compare the ORDER_DAY field to the text string 03/25/2009, but we’d convert that text string to a date before doing the comparison using TO_DATE, like this:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DAY
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.LEGAL_ENTITY_ID = 101
  AND ddo.DISTRIBUTOR_ID = 'RANDO'
  AND ddo.ORDER_DAY = TO_DATE('03/25/2009','MM/DD/YYYY');

ORDER_ID  ORDER_DAY
P0618301  3/25/2009
M5969483  3/25/2009

The TO_DATE() function is taking the text string 03/25/2009, and converting it to a date format. The second part of the TO_DATE() function indicates what format the text string is in, so it knows which numbers are the month, which are the day, and which are the year. We could get the same results using a different format, as long as we change our text string to match that format:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DAY
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.LEGAL_ENTITY_ID = 101
  AND ddo.DISTRIBUTOR_ID = 'RANDO'
  AND ddo.ORDER_DAY = TO_DATE('20090325','YYYYMMDD');

ORDER_ID  ORDER_DAY
P0618301  3/25/2009
M5969483  3/25/2009

If your text string doesn’t match the format you specify, however, you’ll get an error. For example, the following would cause an error, because the format of the text string (‘20090325’) is not the same as the format indicated in the function (‘MM/DD/YYYY’):

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DAY
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.LEGAL_ENTITY_ID = 101
  AND ddo.DISTRIBUTOR_ID = 'RANDO'
  AND ddo.ORDER_DAY = TO_DATE('20090325','MM/DD/YYYY');

ORA-12801: error signaled in parallel query server P054, instance db-dw2-6001.iad6.amazon.com:dw2-1 (1)
ORA-01843: not a valid month

Using BETWEEN with Dates

You can limit a DATE field to a specific date using the equal operator, but you can use other operators to build conditions in your WHERE clause, too.
The BETWEEN operator is commonly used to define a specific date range, that begins with the first date specified, and ends with last date specified. Below, the query is limited to the date range 3/23/2009 through 3/25/2009: SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID , ddo.ORDER_DAY FROM D_DISTRIBUTOR_ORDERS ddo WHERE ddo.REGION_ID = 1 AND ddo.LEGAL_ENTITY_ID = 101 AND ddo.DISTRIBUTOR_ID = 'RANDO' AND ddo.ORDER_DAY BETWEEN TO_DATE('20090323','YYYYMMDD') AND TO_DATE('20090325','YYYYMMDD'); ORDER_ID M9119427 M2666981 U3517863 R5273263 N5183001 T0475345 M5969483 P0618301 ORDER_DAY 3/23/2009 3/23/2009 3/23/2009 3/23/2009 3/23/2009 3/23/2009 3/25/2009 3/25/2009 It’s important when using BETWEEN with DATETIME fields to remember that the second date listed in the range (03/25/2009 in our example) is the end of the range, and that a date of 03/25/2009 means the first second of the first minute of the first hour of that day. It’s actually 03/25/2009 00:00:00. When working with fields that are in the DATE format that isn’t an issue, as the example above shows. However, if we changed the WHERE condition so that it was on the ORDER_DATETIME field, instead of the ORDER_DAY field, we’ll see a problem: SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID , ddo.ORDER_DATETIME FROM D_DISTRIBUTOR_ORDERS ddo WHERE ddo.REGION_ID = 1 AND ddo.LEGAL_ENTITY_ID = 101 AND ddo.DISTRIBUTOR_ID = 'RANDO' AND ddo.ORDER_DATETIME BETWEEN TO_DATE('20090323','YYYYMMDD') AND TO_DATE('20090325','YYYYMMDD'); ORDER_ID M9119427 M2666981 U3517863 R5273263 N5183001 T0475345 ORDER_DATETIME 3/23/2009 19:42 3/23/2009 20:26 3/23/2009 19:41 3/23/2009 19:42 3/23/2009 20:27 3/23/2009 19:41 Even though our DATE range ends with 03/25/2009, we don’t get any results for that day – even though we know from our previous example that 2 POs were created that day for RANDO. That’s because the ORDER_DATETIME value for those 2 POs were after 03/25/2009 00:00:00 – the start of 03/25/2009. 
Another way of saying that is that 03/25/2009 03:03:48 (the order datetime of PO P0618301) is greater than 03/25/2009 00:00:00, so is outside the range specified by the BETWEEN clause. We can solve this problem by using a DATE column for our WHERE clause, if one is available, or by using the TRUNC() function in our WHERE clause, so that we’re comparing the ORDER_DATETIME value truncated to the start of the day to our BETWEEN range.

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DATETIME
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.LEGAL_ENTITY_ID = 101
  AND ddo.DISTRIBUTOR_ID = 'RANDO'
  AND TRUNC(ddo.ORDER_DATETIME) BETWEEN TO_DATE('20090323','YYYYMMDD')
                                    AND TO_DATE('20090325','YYYYMMDD');

ORDER_ID  ORDER_DATETIME
U3517863  3/23/2009 19:41
T0475345  3/23/2009 19:41
R5273263  3/23/2009 19:42
M9119427  3/23/2009 19:42
M2666981  3/23/2009 20:26
N5183001  3/23/2009 20:27
P0618301  3/25/2009 3:03
M5969483  3/25/2009 11:14

Now the results show the two POs placed on 3/25/2009, because the truncated version of the ORDER_DATETIME field is within the date range. You could also change the second date in the range to be one day later (03/26/2009 in our example) without using the TRUNC() function, but then you’d risk getting results that happened to occur at exactly 03/26/2009 00:00:00, which is a possibility with some data sets. Using TRUNC() is a cleaner, safer, and easier method. When in doubt, use TRUNC().

Adding and Subtracting with Dates

Just like numerical fields, you can add to and subtract from DATE column values, both in your SELECT and WHERE clauses. When adding to or subtracting from DATE column values, a value of 1 is equal to 1 day, not 1 hour, 1 minute, or 1 second.
If we add 1 to the ORDER_DATETIME values we returned in our last example, we see it increases the DATE value by 1 full day: SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID , ddo.ORDER_DATETIME , ddo.ORDER_DATETIME + 1 FROM D_DISTRIBUTOR_ORDERS ddo WHERE ddo.REGION_ID = 1 AND ddo.LEGAL_ENTITY_ID = 101 AND ddo.DISTRIBUTOR_ID = 'RANDO' AND TRUNC(ddo.ORDER_DATETIME) BETWEEN TO_DATE('20090323','YYYYMMDD') AND TO_DATE('20090325','YYYYMMDD'); ORDER_ID M9119427 P0618301 M2666981 U3517863 R5273263 M5969483 N5183001 T0475345 ORDER_DATETIME 3/23/2009 19:42 3/25/2009 3:03 3/23/2009 20:26 3/23/2009 19:41 3/23/2009 19:42 3/25/2009 11:14 3/23/2009 20:27 3/23/2009 19:41 DDO.ORDER_DATETIME+1 3/24/2009 19:42 3/26/2009 3:03 3/24/2009 20:26 3/24/2009 19:41 3/24/2009 19:42 3/26/2009 11:14 3/24/2009 20:27 3/24/2009 19:41 Thus, the value 3/23/2009 19:42 becomes 3/24/2009 19:42 – one full day later. (To add hours, minutes or seconds to a date, use a fraction, such as 1/24 to add an hour, or 20/1440 to add twenty minutes.) Perhaps a more common use is to add and subtract days from a date value in your WHERE clause. For example, we could rewrite our query to change the BETWEEN range a bit, like this: SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID , ddo.ORDER_DATETIME FROM D_DISTRIBUTOR_ORDERS ddo WHERE ddo.REGION_ID = 1 AND ddo.LEGAL_ENTITY_ID = 101 AND ddo.DISTRIBUTOR_ID = 'RANDO' AND TRUNC(ddo.ORDER_DATETIME) BETWEEN TO_DATE('20090325','YYYYMMDD')-2 AND TO_DATE('20090325','YYYYMMDD'); ORDER_ID U3517863 T0475345 R5273263 M9119427 M2666981 N5183001 P0618301 M5969483 ORDER_DATETIME 3/23/2009 19:41 3/23/2009 19:41 3/23/2009 19:42 3/23/2009 19:42 3/23/2009 20:26 3/23/2009 20:27 3/25/2009 3:03 3/25/2009 11:14 Instead of the start of the range being 03/23/2009, we’ve made it 2 days prior to the date 03/25/2009. This may seem strange, but we’ll see how that can be very helpful in just a minute, when we talk about the Run Date Wildcard available in ETL Manager. 
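The whole-day and fractional-day arithmetic described above can be tried out with a quick sketch like the one below. It runs against Oracle's DUAL table; the literal datetime and the column aliases are made up for illustration, and TO_CHAR() is used only so the time portion is visible whatever your default date format is.

```sql
-- Illustrative only: a fixed literal stands in for ORDER_DATETIME.
SELECT TO_CHAR(t.ORDER_DATETIME,          'MM/DD/YYYY HH24:MI') AS ORIGINAL
     , TO_CHAR(t.ORDER_DATETIME + 1,      'MM/DD/YYYY HH24:MI') AS PLUS_ONE_DAY
     , TO_CHAR(t.ORDER_DATETIME - 2,      'MM/DD/YYYY HH24:MI') AS MINUS_TWO_DAYS
     -- 1/24 of a day is one hour; 1/1440 of a day is one minute.
     , TO_CHAR(t.ORDER_DATETIME + 1/24,   'MM/DD/YYYY HH24:MI') AS PLUS_ONE_HOUR
     , TO_CHAR(t.ORDER_DATETIME + 20/1440,'MM/DD/YYYY HH24:MI') AS PLUS_TWENTY_MIN
FROM (SELECT TO_DATE('20090323 1942','YYYYMMDD HH24MI') AS ORDER_DATETIME
      FROM DUAL) t;
```

Writing the fraction out as minutes over 1440 (or hours over 24) tends to be easier to read and check than a pre-computed decimal.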
Other Date functions

Although TO_CHAR(), TRUNC(), and TO_DATE() are probably the most commonly used DATE functions, SQL includes several more that you may find useful. These include:

ROUND( date , format ) – used to round a date up or down to the nearest day, month, year, etc.
ADD_MONTHS( date , number of months ) – used to add (or subtract) months from a date
LAST_DAY( date ) – used to determine the last day of the month the date falls in
NEXT_DAY( date , weekday ) – used to find the date of the first occurrence of the specified weekday after the specified date
MONTHS_BETWEEN( later date , earlier date ) – used to determine how many months are between two dates

Here are some examples of these functions in action:

SELECT /*+ use_hash(ddo) */ ddo.ORDER_ID
     , ddo.ORDER_DAY
     , ROUND(ddo.ORDER_DAY,'D')
     , ADD_MONTHS(ddo.ORDER_DAY,-5)
     , LAST_DAY(ddo.ORDER_DAY)
     , NEXT_DAY(ddo.ORDER_DAY,'Friday')
     , MONTHS_BETWEEN(ddo.ORDER_DAY,TO_DATE('20090101','YYYYMMDD'))
FROM D_DISTRIBUTOR_ORDERS ddo
WHERE ddo.REGION_ID = 1
  AND ddo.ORDER_ID = 'M5969483';

ORDER_ID  ORDER_DAY  ROUND()    ADD_MONTHS()  LAST_DAY   NEXT_DAY   MONTHS_BETWEEN
M5969483  3/25/2009  3/22/2009  10/25/2008    3/31/2009  3/27/2009  2.774193548

(We subtracted months using ADD_MONTHS and -5 as our number of months.) There’s more information on using these functions in your text. You can also use many of the standard aggregate functions, like AVG(), COUNT(), MAX(), and MIN(), on DATE fields.

Week 5 Homework:
1. Read Chapter 6 of Mastering Oracle SQL. Pay close attention to tables 6-1 (pg 135) and 6-2 (pg 159).
2. Write a query to determine on what date the record in D_WAREHOUSES for the FC PHL1 was created. Use the TO_CHAR function to ensure the date is returned in the format MM/DD/YYYY.
3. Edit the query to determine what the first and last days of the week that record was created were, and format the dates in the UK standard format (e.g. 31/10/2008).
4.
Write a query to find the WAREHOUSE_ID for all records in the D_WAREHOUSES table that were not created on 1/20/2009, for FCs outside of North America. (hint: you’ll need to use TRUNC() and TO_DATE(), and you should get about 6 records returned.)
5. Edit the query to determine what day of the week each of those records was created.
6. Write a query to pull a list of all PO and ASINs, with their submitted quantities and order dates, for the vendor code ‘DCCOM’, during the date range of 1/1/2009 through 1/15/2009, in the US. Be sure to make use of partitioned columns in your WHERE clause, and run your query through Explain Plan before scheduling it.
7. Edit the query to sum up the quantity field by ASIN, removing the order date and PO fields.
8. Write a query against the table D_MP_ASINS_ESSENTIALS to pull the ASIN, ITEM_NAME, STREET_DAY, and PUBLICATION_DAY columns for any US Books ASIN with a PUBLICATION_DAY greater than 1/1/2020.
9. Edit the query to create an element that returns the STREET_DAY if it’s not null, but returns the PUBLICATION_DAY if STREET_DAY is null. (hint: use the NVL() function to return pub date when street date is null.) This is a standard method used to determine release date.
10. Create a query that emails you a summary of all the POs you created the previous day, with count of ASINs and total units submitted for each PO, as well as any other details you’re interested in, such as order type and vendor code (use BI Metadata to find what fields are available). Schedule this query to run daily, and let it run for at least 7 days. If you’re not a buyer, select the login of a buyer to use for your query. (hint: you’ll need to use D_DISTRIBUTOR_ORDER_ITEMS to get the ASIN level info.)

Week 6 – Subqueries

SQL Topics
Subqueries
Avoiding 1-to-many joins

SQL TOPICS

Subqueries

A subquery is a whole SQL statement that’s nested within another SQL statement – like a query within a query.
The subquery runs first, then its results are stored in memory temporarily - like a temporary table – and discarded when the full SQL statement is done running. Subqueries can be in the FROM clause and incorporated into a JOIN, or (less commonly due to efficiency issues) in the WHERE clause to limit the results of the outer query. Here are examples of each:

FROM Clause JOIN to a Subquery:

SELECT /*+ use_hash(dma,ords) */ dma.ASIN
     , dma.ITEM_NAME
     , ords.QUANTITY_SUBMITTED
FROM D_MP_ASINS_ESSENTIALS dma
JOIN (SELECT /*+ use_hash(doi) */ doi.ISBN
           , doi.QUANTITY_SUBMITTED
      FROM d_distributor_order_items doi
      WHERE doi.REGION_ID = 1
        AND doi.ORDER_DAY = to_date('20090406','YYYYMMDD')
        AND doi.ORDER_ID = 'S2236807') ords
  ON dma.ASIN = ords.ISBN
WHERE dma.REGION_ID = 1
  AND dma.MARKETPLACE_ID = 1;

ASIN        ITEM_NAME                                         QUANTITY_SUBMITTED
037584726X  The Big Book of Princesses (Giant Coloring Book)  7
0789399903  Skylines: American Cities Yesterday and Today     1

WHERE Clause limit using a Subquery:

SELECT /*+ use_hash(dma) */ dma.ASIN
     , dma.ITEM_NAME
FROM D_MP_ASINS_ESSENTIALS dma
WHERE dma.REGION_ID = 1
  AND dma.MARKETPLACE_ID = 1
  AND dma.ASIN IN (SELECT /*+ use_hash(doi) */ doi.ISBN
                   FROM d_distributor_order_items doi
                   WHERE doi.REGION_ID = 1
                     AND doi.ORDER_DAY = to_date('20090406','YYYYMMDD')
                     AND doi.ORDER_ID = 'S2236807');

ASIN        ITEM_NAME
037584726X  The Big Book of Princesses (Giant Coloring Book)
0789399903  Skylines: American Cities Yesterday and Today

Subqueries are just like any SQL statement, but are enclosed in parentheses within another query and have no semicolon of their own; the only semicolon goes at the end of the full statement. I think of the results of that subquery as a table – so when you JOIN to a subquery, you’ll alias it, like you would a table, because you’ll need to identify the columns from each table in the JOIN condition and you may want to return some of the columns from your subquery in your results.
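If you'd like to experiment with both placements without touching the warehouse tables, here is a minimal sketch that builds two tiny data sets from Oracle's DUAL table. Everything in it is hypothetical: the ASIN values 'A1' to 'A3', the titles, and the quantities are invented for illustration, and UNION ALL is used only as a convenient way to stack a few made-up rows.

```sql
-- 1) Subquery in the FROM clause: joined like a table, so it gets an alias (ords).
SELECT items.ASIN
     , items.ITEM_NAME
     , ords.QUANTITY_SUBMITTED
FROM (SELECT 'A1' AS ASIN, 'First Title' AS ITEM_NAME FROM DUAL
      UNION ALL
      SELECT 'A2', 'Second Title' FROM DUAL
      UNION ALL
      SELECT 'A3', 'Third Title' FROM DUAL) items
JOIN (SELECT 'A1' AS ISBN, 7 AS QUANTITY_SUBMITTED FROM DUAL
      UNION ALL
      SELECT 'A2', 1 FROM DUAL) ords
  ON items.ASIN = ords.ISBN;

-- 2) Subquery in the WHERE clause: used only as a filter, so no alias is needed
--    and none of its columns can be returned in the results.
SELECT items.ASIN
     , items.ITEM_NAME
FROM (SELECT 'A1' AS ASIN, 'First Title' AS ITEM_NAME FROM DUAL
      UNION ALL
      SELECT 'A2', 'Second Title' FROM DUAL
      UNION ALL
      SELECT 'A3', 'Third Title' FROM DUAL) items
WHERE items.ASIN IN (SELECT 'A1' FROM DUAL
                     UNION ALL
                     SELECT 'A2' FROM DUAL);
```

Both statements should return only the rows for 'A1' and 'A2'; the difference is that only the first one can show QUANTITY_SUBMITTED, because only a joined subquery contributes columns to the results.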
Stepping back to our first example of a subquery in the FROM clause, we see that we’ve inserted a full SELECT/FROM/WHERE query, enclosed in parentheses in the FROM clause, and inner JOINed to it to effectively limit the ASINs in the table D_MP_ASINS_ESSENTIALS to only those that match to the ASINs returned by the subquery – namely the ASINs on PO S2236807.

SELECT /*+ use_hash(dma,ords) */ dma.ASIN
     , dma.ITEM_NAME
     , ords.QUANTITY_SUBMITTED
FROM D_MP_ASINS_ESSENTIALS dma
JOIN (SELECT /*+ use_hash(doi) */ doi.ISBN
           , doi.QUANTITY_SUBMITTED
      FROM d_distributor_order_items doi
      WHERE doi.REGION_ID = 1
        AND doi.ORDER_DAY = to_date('20090406','YYYYMMDD')
        AND doi.ORDER_ID = 'S2236807') ords
  ON dma.ASIN = ords.ISBN
WHERE dma.REGION_ID = 1
  AND dma.MARKETPLACE_ID = 1;

ASIN        ITEM_NAME                                         QUANTITY_SUBMITTED
037584726X  The Big Book of Princesses (Giant Coloring Book)  7
0789399903  Skylines: American Cities Yesterday and Today     1

The subquery could be run on its own, giving you the list of ASINs – which is the first thing that happens when the SQL statement runs. It runs the subquery, and then stores the results like a temporary table.

SELECT /*+ use_hash(doi) */ doi.ISBN
     , doi.QUANTITY_SUBMITTED
FROM d_distributor_order_items doi
WHERE doi.REGION_ID = 1
  AND doi.ORDER_DAY = to_date('20090406','YYYYMMDD')
  AND doi.ORDER_ID = 'S2236807';

ISBN        QUANTITY_SUBMITTED
0789399903  1
037584726X  7

Then the query JOINs the table D_MP_ASINS_ESSENTIALS to that temporary table, limiting the results because it’s an INNER join, but also returning information from that temporary table – the QUANTITY_SUBMITTED.
Of course, you can also do an OUTER JOIN to a subquery, such as in this example:

SELECT /*+ use_hash(dma,ords) */ dma.ASIN
     , dma.ITEM_NAME
     , ords.QUANTITY_SUBMITTED
FROM D_MP_ASINS_ESSENTIALS dma
LEFT JOIN (SELECT /*+ use_hash(doi) */ doi.ISBN
                , doi.QUANTITY_SUBMITTED
           FROM d_distributor_order_items doi
           WHERE doi.REGION_ID = 1
             AND doi.ORDER_DAY = to_date('20090406','YYYYMMDD')
             AND doi.ORDER_ID = 'S2236807') ords
  ON dma.ASIN = ords.ISBN
WHERE dma.REGION_ID = 1
  AND dma.MARKETPLACE_ID = 1
  AND dma.ASIN IN ('037584726X','0789399903','0394873742');

ASIN        ITEM_NAME                                         QUANTITY_SUBMITTED
0789399903  Skylines: American Cities Yesterday and Today     1
037584726X  The Big Book of Princesses (Giant Coloring Book)  7
0394873742  Richard Scarry's Biggest Word Book Ever!

In this case, the LEFT OUTER JOIN resulted in all results being pulled from the left table (D_MP_ASINS_ESSENTIALS), and results from the table on the right (our subquery) were returned, if available.

Avoiding 1-to-many joins

One of the many uses of subqueries is to avoid 1-to-many joins – situations where the grain of one table is different than the grain of another, which can result in errors in your results. Here’s an example of a 1-to-many join that causes a problem, from data tables that hold Problem Receive information. In the table O_RECEIVE_PROBLEM_ITEMS, we find one record associated with RECEIVE_PROBLEM_ITEM_ID 5739750, which shows a QUANTITY of 1 was received into Problem Receive for ASIN B00158THNW.

SELECT /*+ use_hash(rpi) */ rpi.RECEIVE_PROBLEM_ITEM_ID
     , rpi.ASIN
     , rpi.QUANTITY
FROM O_RECEIVE_PROBLEM_ITEMS rpi
WHERE rpi.RECEIVE_PROBLEM_ITEM_ID IN (5739750);

RECEIVE_PROBLEM_ITEM_ID  ASIN        QUANTITY
5739750                  B00158THNW  1

And in the table O_RPI_PROBLEM_LIST, we find that there are two records associated with that same RECEIVE_PROBLEM_ITEM_ID, one for each of the two problem types found to have occurred for that item.
SELECT /*+ use_hash(rpl) */ rpl.RECEIVE_PROBLEM_ITEM_ID
     , rpl.RECEIVE_PROBLEM_TYPE
FROM O_RPI_PROBLEM_LIST rpl
WHERE rpl.RECEIVE_PROBLEM_ITEM_ID IN (5739750);

RECEIVE_PROBLEM_ITEM_ID  RECEIVE_PROBLEM_TYPE
5739750                  OVERAGE
5739750                  WRONG_DC

Based on these two queries, we know that the one unit of ASIN B00158THNW recorded as RPI ID 5739750 had two problems. It was an OVERAGE on the PO and it was delivered to the WRONG_DC.

If we join the two tables, the one record in the first table is duplicated for each record in the second table, including QUANTITY:

SELECT /*+ use_hash(rpi,rpl) */ rpi.RECEIVE_PROBLEM_ITEM_ID
     , rpi.ASIN
     , rpl.RECEIVE_PROBLEM_TYPE
     , rpi.QUANTITY
FROM O_RECEIVE_PROBLEM_ITEMS rpi
JOIN O_RPI_PROBLEM_LIST rpl
  ON rpi.RECEIVE_PROBLEM_ITEM_ID = rpl.RECEIVE_PROBLEM_ITEM_ID
 AND rpi.WAREHOUSE_ID = rpl.WAREHOUSE_ID
WHERE rpi.RECEIVE_PROBLEM_ITEM_ID IN (5739750);

RECEIVE_PROBLEM_ITEM_ID  ASIN        RECEIVE_PROBLEM_TYPE  QUANTITY
5739750                  B00158THNW  OVERAGE               1
5739750                  B00158THNW  WRONG_DC              1

Based on this data, one might think that there were 2 units that arrived, not 1.
The problem gets even harder to spot when we aggregate the query, counting the RECEIVE_PROBLEM_TYPE and summing the QUANTITY from our results:

SELECT /*+ use_hash(rpi,rpl) */ rpi.RECEIVE_PROBLEM_ITEM_ID
     , rpi.ASIN
     , COUNT(rpl.RECEIVE_PROBLEM_TYPE)
     , SUM(rpi.QUANTITY)
FROM O_RECEIVE_PROBLEM_ITEMS rpi
JOIN O_RPI_PROBLEM_LIST rpl
  ON rpi.RECEIVE_PROBLEM_ITEM_ID = rpl.RECEIVE_PROBLEM_ITEM_ID
 AND rpi.WAREHOUSE_ID = rpl.WAREHOUSE_ID
WHERE rpi.RECEIVE_PROBLEM_ITEM_ID IN (5739750)
GROUP BY rpi.RECEIVE_PROBLEM_ITEM_ID
       , rpi.ASIN;

RECEIVE_PROBLEM_ITEM_ID  ASIN        COUNT(RPL.RECEIVE_PROBLEM_TYPE  SUM(RPI.QUANTITY)
5739750                  B00158THNW  2                               2

One way we could get around this problem is to use a subquery to aggregate the results from the O_RPI_PROBLEM_LIST table first, then join them to the O_RECEIVE_PROBLEM_ITEMS table:

SELECT /*+ use_hash(rpi,rpl2) */ rpi.RECEIVE_PROBLEM_ITEM_ID
     , rpi.ASIN
     , rpl2.PROBLEM_COUNT
     , rpi.QUANTITY
FROM O_RECEIVE_PROBLEM_ITEMS rpi
JOIN (SELECT /*+ use_hash(rpl) */ rpl.RECEIVE_PROBLEM_ITEM_ID
           , rpl.WAREHOUSE_ID
           , COUNT(rpl.RECEIVE_PROBLEM_TYPE) PROBLEM_COUNT
      FROM O_RPI_PROBLEM_LIST rpl
      WHERE rpl.RECEIVE_PROBLEM_ITEM_ID IN (5739750)
      GROUP BY rpl.RECEIVE_PROBLEM_ITEM_ID
             , rpl.WAREHOUSE_ID) rpl2
  ON rpi.RECEIVE_PROBLEM_ITEM_ID = rpl2.RECEIVE_PROBLEM_ITEM_ID
 AND rpi.WAREHOUSE_ID = rpl2.WAREHOUSE_ID
WHERE rpi.RECEIVE_PROBLEM_ITEM_ID IN (5739750);

RECEIVE_PROBLEM_ITEM_ID  ASIN        PROBLEM_COUNT  QUANTITY
5739750                  B00158THNW  2              1

Now we get the proper results, showing the quantity of 1 unit, with 2 problems. Notice that we aliased the column COUNT(rpl.RECEIVE_PROBLEM_TYPE) to PROBLEM_COUNT in our subquery, then referred to the column by its alias in the outer query. Because the subquery is executed by Oracle first, and the results are saved as a table that’s then used for the outer query, any column aliases in the subquery are now the column names of that temporary table, and that’s how you must refer to them in the outer query.
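The same pattern can be reproduced in miniature with made-up rows, which is a handy way to convince yourself of what is happening. The sketch below builds stand-ins for the two tables from Oracle's DUAL table; the shortened column names (ITEM_ID, PROBLEM_TYPE) and the alias TOTAL_QUANTITY are assumptions for illustration, not the real warehouse columns.

```sql
-- Naive join: the single item row is repeated once per problem row,
-- so SUM(QUANTITY) counts the same unit twice.
SELECT rpi.ITEM_ID
     , COUNT(rpl.PROBLEM_TYPE) AS PROBLEM_COUNT
     , SUM(rpi.QUANTITY)       AS TOTAL_QUANTITY
FROM (SELECT 5739750 AS ITEM_ID, 1 AS QUANTITY FROM DUAL) rpi
JOIN (SELECT 5739750 AS ITEM_ID, 'OVERAGE' AS PROBLEM_TYPE FROM DUAL
      UNION ALL
      SELECT 5739750, 'WRONG_DC' FROM DUAL) rpl
  ON rpi.ITEM_ID = rpl.ITEM_ID
GROUP BY rpi.ITEM_ID;

-- Aggregate the "many" side down to one row per ITEM_ID first,
-- then join, so each item row matches exactly one subquery row.
SELECT rpi.ITEM_ID
     , rpl2.PROBLEM_COUNT
     , rpi.QUANTITY
FROM (SELECT 5739750 AS ITEM_ID, 1 AS QUANTITY FROM DUAL) rpi
JOIN (SELECT rpl.ITEM_ID
           , COUNT(rpl.PROBLEM_TYPE) AS PROBLEM_COUNT
      FROM (SELECT 5739750 AS ITEM_ID, 'OVERAGE' AS PROBLEM_TYPE FROM DUAL
            UNION ALL
            SELECT 5739750, 'WRONG_DC' FROM DUAL) rpl
      GROUP BY rpl.ITEM_ID) rpl2
  ON rpi.ITEM_ID = rpl2.ITEM_ID;
```

The first statement should report a quantity of 2 and the second a quantity of 1, mirroring the warehouse results above. The design point is that the subquery's GROUP BY columns must be the same columns you join on, so the subquery can never return more than one row per join key.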
This type of subquery is something that can be used any time you have two tables with a different grain of data that you need to join, such as when you want to join PO ASIN information from D_DISTRIBUTOR_ORDER_ITEMS to PO ASIN Shipment information from D_DISTRIBUTOR_SHIPMENT_ITEMS.

Week 6 Homework:
As always, remember to include WHERE clause conditions on any and all Partitioned columns, and run any query you write through Explain Plan before running it.
1. Read Chapter 5 of Mastering Oracle SQL.
2. Create a segment of the following ASINs:
0394873742
037584726X
0789399903
0345431391
0375847278
037584726X
0887767702
3. Create a query that emails you the list of distinct ASINs in this segment.
4. Use the query you created as a subquery in a query to find a list of all POs that were placed on 4/6/2009 in the US that included those ASINs. In your results, include the PO, Vendor Code, ASIN, and quantity confirmed. (Hint: check out D_DISTRIBUTOR_ORDER_ITEMS. The column QUANTITY_ORDERED indicates the number of units confirmed.)
5. Edit the query to switch out the condition on Legal Entity ID in your WHERE clause to use the Legal Entity ID wildcard, and rerun the query. Make sure your Job is set up to be partitioned by Legal Entity ID.
6. Edit the query to switch out the segment ID in your subquery for the Free Form Tag Wildcard, and edit your job to put the segment ID in the Free Form Tag field.
Extra Credit
7. Edit the query to add a JOIN to the VENDORS table to get the name of each Vendor.
8. Edit the query to add a JOIN to the D_MP_ASINS_ESSENTIALS table to get the title of each ASIN.
9. Edit the query to remove the PO Field, and sum the number of units confirmed per ASIN, per Vendor.
10. Edit the query to return only those ASINs where the sum of units ordered was greater than 3.

Week 7 – DECODE & CASE

SQL Topics
The DECODE() Function
The CASE Function

SQL TOPICS

The DECODE() Function

DECODE() is one of SQL’s functions that fills the need for If-Then functionality.
It’s essentially a way to translate or decode the values in a column to another value. The format is DECODE(A,B,C,D) – which functions as: if A is equal to B, then return C, otherwise, return D. It’s very much like the Excel function = IF(A=B,C,D). The first value (A) is generally a column in one of the tables you’re querying, and B is a value that would be found in that column of that table. C is what you want that value translated to in your results, and D is what you want returned in that column of your results if A doesn’t match B. The B and C spots in the function can be repeated, giving you the ability to translate any of several values in a single column to new values in your results (e.g. DECODE(A,B1,C1,B2,C2,B3,C3,D) ). One example is the need to translate Order Type numbers to Order Type codes – such as translating the number 17 to NP and 9 to LA – because PO Order Type is stored in all the key tables (e.g D_DISTRIBUTOR_ORDERS) as a number. There is a table in the Data Warehouse that translates the number to text, but it doesn’t translate it to the two character code folks are familiar with: SELECT /*+ use_hash(vot) */ vot.VENDOR_ORDER_TYPE , vot.VENDOR_ORDER_TYPE_DESC FROM VENDOR_ORDER_TYPES vot WHERE vot.VENDOR_ORDER_TYPE IN (0,4); VENDOR_ORDER_TYPE 0 2 VENDOR_ORDER_TYPE_DESC None Specified / Distributor O Special Order As the sample above shows, the table includes a description, which isn’t always clear. For example, ‘Pubdirect Order’ is the Advantage Order Type, and ‘None Specified/Distributor O’ is actually DS. Most folks seem to talk about these in terms of Order Type Codes (like DS and LA), so it can be very useful to translate to those values when you run your queries. 
You can use DECODE() to do this:

SELECT /*+ use_hash(vot) */ vot.VENDOR_ORDER_TYPE
     , DECODE(vot.VENDOR_ORDER_TYPE,0,'DS',NULL)
FROM VENDOR_ORDER_TYPES vot
WHERE vot.VENDOR_ORDER_TYPE IN (0,4);

VENDOR_ORDER_TYPE  DECODE(VOT.VENDOR_ORDER_TYPE,0
0                  DS
4

Here we’ve decoded the VENDOR_ORDER_TYPE column, and anytime the value in that column is 0, we return ‘DS’ as the result; otherwise it returns NULL. So for 0 we get DS, and for 4 we get a null returned. Translating one value is useful, but DECODE() can be used for multiple values, allowing you to specify what you want returned for each. Here’s an example where we are decoding multiple values (0, 2, and 4) to what we want to see returned (DS, SP, and PD):

SELECT /*+ use_hash(vot) */ vot.VENDOR_ORDER_TYPE
     , DECODE(vot.VENDOR_ORDER_TYPE,0,'DS',2,'SP',4,'PD','Unknown')
FROM VENDOR_ORDER_TYPES vot
WHERE vot.VENDOR_ORDER_TYPE IN (0,2,3,4);

VENDOR_ORDER_TYPE  DECODE(VOT.VENDOR_ORDER_TYPE,0
0                  DS
2                  SP
3                  Unknown
4                  PD

In this example, the DECODE is translating the values in the column vot.VENDOR_ORDER_TYPE. When it finds a value in that column that’s equal to 0, it returns the text string ‘DS’. When it finds 2, it returns ‘SP’. When it finds 4, it returns ‘PD’, and if it finds anything else (3 in this example) it returns ‘Unknown’. The values returned can be text (as in the examples above), a number, or even another column. For example, we could change the ‘Unknown’ value in the above query to the VENDOR_ORDER_TYPE_DESC column:

SELECT /*+ use_hash(vot) */ vot.VENDOR_ORDER_TYPE
     , DECODE(vot.VENDOR_ORDER_TYPE,0,'DS',2,'SP',4,'PD',vot.VENDOR_ORDER_TYPE_DESC)
FROM VENDOR_ORDER_TYPES vot
WHERE vot.VENDOR_ORDER_TYPE IN (0,2,3,4);

VENDOR_ORDER_TYPE  DECODE(VOT.VENDOR_ORDER_TYPE,0
0                  DS
2                  SP
3                  Publisher Order
4                  PD

Instead of Unknown, we get the value of the column VENDOR_ORDER_TYPE_DESC for any row that doesn’t match one of the values we’ve already defined in the DECODE() statement – in this case, order type number 3.
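One detail worth seeing side by side is what happens when the final default value is left off entirely. The sketch below uses three made-up rows built from Oracle's DUAL table rather than the VENDOR_ORDER_TYPES table; the column name ORDER_TYPE and the aliases WITH_DEFAULT and NO_DEFAULT are assumptions for illustration.

```sql
-- Illustrative only: three hypothetical order type numbers (0, 3 and 4).
SELECT t.ORDER_TYPE
     , DECODE(t.ORDER_TYPE,0,'DS',2,'SP',4,'PD','Unknown') AS WITH_DEFAULT
     -- With no default supplied, unmatched values come back as NULL,
     -- the same as writing NULL explicitly in the last position.
     , DECODE(t.ORDER_TYPE,0,'DS',2,'SP',4,'PD')           AS NO_DEFAULT
FROM (SELECT 0 AS ORDER_TYPE FROM DUAL
      UNION ALL
      SELECT 3 FROM DUAL
      UNION ALL
      SELECT 4 FROM DUAL) t;
```

For the unmatched value 3, WITH_DEFAULT should show Unknown while NO_DEFAULT is left empty. Supplying an explicit default is usually the safer habit, since a blank cell gives the reader no hint that a value went untranslated.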
You can keep adding pairs of values to translate various values, up to about 125 pairs. For example, here’s the full decode to translate the numbers to the code for most of the current Order Type values:

SELECT /*+ use_hash(vot) */ vot.VENDOR_ORDER_TYPE
     , DECODE(vot.VENDOR_ORDER_TYPE
             ,0,'DS',1,'OP',2,'SP',3,'PB',4,'PD',6,'SU',7,'IS',8,'MS',9,'LA'
             ,10,'LB',11,'LC',12,'LD',13,'SA',14,'SB',15,'SC',16,'SD',17,'NP'
             ,18,'RE',19,'VP',20,'MU',21,'T1',22,'T2',23,'T3',24,'B1',25,'B2'
             ,26,'B3',27,'M1',28,'M2',29,'M3',30,'R1',31,'R2',32,'R3',33,'PT'
             ,34,'DR',35,'MX'
             ,vot.VENDOR_ORDER_TYPE) AS ORDER_TYPE
FROM VENDOR_ORDER_TYPES vot;

The CASE Function

The CASE function is similar to DECODE, but with more advanced options. With CASE, you can evaluate not just if a column is equal to a value, but if an expression is true, and return your results depending on whether or not that expression is true. Here’s the same example we explored with DECODE above, but using CASE:

SELECT /*+ use_hash(vot) */ vot.VENDOR_ORDER_TYPE
     , CASE WHEN vot.VENDOR_ORDER_TYPE = 0 THEN 'DS'
            WHEN vot.VENDOR_ORDER_TYPE = 2 THEN 'SP'
            WHEN vot.VENDOR_ORDER_TYPE = 4 THEN 'PD'
            ELSE 'Unknown'
       END
FROM VENDOR_ORDER_TYPES vot
WHERE vot.VENDOR_ORDER_TYPE IN (0,2,3,4);

VENDOR_ORDER_TYPE  CASEWHENVOT.VENDOR_ORDER_TYPE=
0                  DS
2                  SP
3                  Unknown
4                  PD

In this example, we again evaluated the vot.VENDOR_ORDER_TYPE column, using the equal operator to see if it was equal to various values. This is functionally identical to what DECODE does, just in a different way. CASE really shows its value when you use other types of operators (rather than equals), or when it’s evaluating multiple columns. In the example below, we evaluate the DEAL_CODE column to see if it’s NULL, and if it’s not NULL, return ‘Deal Buy’. If it is NULL, then we move on to the next WHEN/THEN combo, which checks the ORDER_TYPE column to see if it’s a 9, in which case it returns ‘LA’, and so on.
    SELECT /*+ use_hash(ddo) */
           ddo.ORDER_ID
         , ddo.DEAL_CODE
         , ddo.ORDER_TYPE
         , CASE
             WHEN ddo.DEAL_CODE IS NOT NULL THEN 'Deal Buy'
             WHEN ddo.ORDER_TYPE = 9 THEN 'LA'
             WHEN ddo.ORDER_TYPE = 2 THEN 'SP'
             ELSE NULL
           END
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.ORDER_ID IN ('L3937793','B3074533','Q9166581');

    ORDER_ID  DEAL_CODE    ORDER_TYPE  CASEWHENDDO.DEAL_CODEISNOTNULL
    B3074533                        2  SP
    L3937793                        9  LA
    Q9166581  D0000001069           9  Deal Buy

It’s important to note that the first WHEN/THEN combination in a CASE statement is the first that’s evaluated, and if it’s true, the following WHEN/THEN combinations aren’t evaluated, even if they’re true. In the example above, the first evaluation found that the DEAL_CODE column was NOT NULL for the third record, so it returned ‘Deal Buy’ and stopped evaluating the rest of the CASE statement. So even though the ORDER_TYPE was 9 (the second WHEN/THEN combination), because the previous WHEN/THEN was true, the CASE statement stopped. So the order you enter your WHEN/THEN combinations in a CASE statement can impact your results.

The NVL2 Function

Back in Week X we discussed the NVL() function, which translates any Null values to whatever you specify, and leaves Non-Null values as is. A related but slightly more powerful function is NVL2(). NVL2() gives you the option of translating the Non-Null values to something else, too. The format is NVL2(A,B,C) – where A is the column or element to evaluate, B is what to return if it’s NOT Null, and C is what to return if it IS Null.
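To see the difference between NVL() and NVL2() side by side, here is a small sketch you can run against DUAL. The sample_rows data, the row labels, and the replacement strings are all made up for illustration and aren’t taken from any warehouse table.

```sql
-- Hypothetical inline rows: one with a value, one with a NULL.
WITH sample_rows AS (
    SELECT 'A' AS row_label, 'unknown' AS textbook_type FROM DUAL UNION ALL
    SELECT 'B',              NULL                       FROM DUAL
)
SELECT sr.row_label
       -- NVL: replaces only the NULL, leaves the Non-Null value as is
     , NVL(sr.textbook_type, 'none')                  AS nvl_result
       -- NVL2(A,B,C): B when A is NOT Null, C when A IS Null
     , NVL2(sr.textbook_type, 'has value', 'is null') AS nvl2_result
FROM sample_rows sr
ORDER BY sr.row_label;
```

Row A should return ‘unknown’ and ‘has value’, while row B should return ‘none’ and ‘is null’. Notice that NVL2() never returns the original value unless you put the column itself in position B.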
For example, we might want to return a ‘Y’ if we find a Non-Null and return an ‘N’ if we find a Null, as when we’re defining which ASINs are Textbooks:

    SELECT /*+ use_hash(dma,dmma) */
           dma.ASIN
         , dma.ITEM_NAME
         , dmma.TEXTBOOK_TYPE
         , NVL2(dmma.TEXTBOOK_TYPE,'Y','N')
    FROM D_MP_ASINS_ESSENTIALS dma
    LEFT JOIN D_MP_MEDIA_ASINS dmma
      ON dma.MARKETPLACE_ID = dmma.MARKETPLACE_ID
     AND dma.ASIN = dmma.ASIN
    WHERE dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1
      AND dma.ASIN IN ('0596006322','B00167YLVA','B004GEB67C');

    ASIN        ITEM_NAME                                                                     TEXTBOOK_TYPE  IS_TEXTBOOK
    B004GEB67C  Beginning SQL Joes 2 Pros: (SQL Exam Prep Series 70-433 Volume 1 of 5) (DVD)                 N
    B00167YLVA  Fiskars SQL-7312 Squeeze Paper Punch, Large, Comma, Comma, Chameleon                         N
    0596006322  Mastering Oracle SQL, 2nd Edition                                             unknown        Y

Notice that a LEFT JOIN was used, because not all ASINs are found in the D_MP_MEDIA_ASINS table.

Data Type Consistency

When using functions like NVL(), DECODE(), CASE and NVL2() that convert values, it’s important to keep data types in mind (meaning character strings, dates and numbers). These functions may fail if you mix data types in the outputs. For example, the query below mixes numerical values (15+2) with text strings (‘N’) in the NVL2() function, resulting in an ORA-01722: invalid number error.

    SELECT /*+ use_hash(dma,dmma) */
           dma.ASIN
         , dma.ITEM_NAME
         , dmma.TEXTBOOK_TYPE
         , NVL2(dmma.TEXTBOOK_TYPE,15+2,'N')
    FROM D_MP_ASINS_ESSENTIALS dma
    LEFT JOIN D_MP_MEDIA_ASINS dmma
      ON dma.MARKETPLACE_ID = dmma.MARKETPLACE_ID
     AND dma.ASIN = dmma.ASIN
    WHERE dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1
      AND dma.ASIN IN ('0596006322','B00167YLVA','B004GEB67C');

So when using these helpful functions, be sure to keep their outputs all of the same data type.

Week 7 Homework:
1. Read Chapter 9 in Mastering Oracle SQL, through page 219.
2.
Write a query to pull a list of all Purchase Orders placed last week for vendor code SIMON (or your favorite vendor code) in the US, using a DECODE statement to translate the Order Type number to the 2-letter code.
3. Add a CASE statement to the query, and when the Deal Code field isn’t blank, return ‘Deal Buy PO’. Otherwise, return ‘Auto’ for any POs that are of order types DS, SP, PD, SU, LA, LD, or NP, and ‘Manual’ for any other POs. (Hint: Use the IN operator to avoid having to enter so many WHEN/THEN combinations.)
4. Schedule the query to run every week for the previous week, and publish to a text file.
5. Link the output of your file to an Excel spreadsheet, so that you can update it every week with the new data.
6. If you haven’t already, go back to Homework #2 from Week 2: Make sure you’re signed up to the etl-users@amazon.com mailing list. This is vital not only as a resource for you when you run into trouble, but as a way to ensure you’re notified when significant changes to the ETL Manager or to specific tables are going to occur. Sign up, create a rule to move all the messages to a specific folder, and check that folder every so often. Read through the emails periodically to see what you can learn. And when you see questions you know the answer to, help out the other folks in the etl-users community.

Continuing Learning:
1. Think of what types of data, if you had it at your fingertips in a report, would make your job easier, and do one or all of the following:
   a. Pick one and use BI Metadata to try to find the tables & columns you need. Create an ETL Job, scheduling it to run daily, weekly, or monthly using wildcards to ensure it will always include the data you need. Link the output file to an Excel spreadsheet. Show your boss what you’ve done.
   b. Pick one that you think might be similar to a report already in existence, and ask the owner of that report for a copy of their SQL. Edit the SQL to fit your use case and set it up to run to meet your needs.
   c. Pick one and email the etl-users@amazon.com mailing list to see if someone has a similar report that you could use as a starting point.
2. Using Google, Mastering Oracle SQL, and other resources, keep learning new functions and operators to expand what you can do with SQL. I recommend reading up on UPPER(), TRIM(), RTRIM(), LTRIM(), SUBSTR(), COALESCE(), RANK(), PARTITION and PARTITION BY, WITH, EXISTS, UNION and UNION ALL.

Answer Key

Below are possible answers to the weekly homework assignments. There are almost always multiple ways to write the SQL to get the correct answer, so these answers present only one of those possibilities.

It’s recommended that you attempt all the homework exercises prior to looking at the answers. If you get stuck, be sure you’ve read the corresponding chapter in Mastering Oracle SQL, and reread the week’s lesson. Some of the homework exercises use functions, operators, and other code that was taught in prior weeks, so you may want to refer back to prior weeks’ lessons if something doesn’t seem familiar or isn’t found in the chapter and lesson for that week’s homework.

Also, remember to use the BI Metadata, Explain Plan, and Wikis as references to help you with your homework exercises. Some of the exercises are specifically designed to encourage your use of these resources, as they will be vital to your success at writing SQL at Amazon. Good Luck!

Week 1 – The Basics of ETL Manager and Basic Structure of SQL

2.
    SELECT d_w.WAREHOUSE_ID FC
    FROM D_WAREHOUSES d_w
    ORDER BY d_w.WAREHOUSE_ID;

3.
    SELECT d_w.WAREHOUSE_ID FC
         , d_w.REGION_ID
         , d_w.REGION_ID * 10 CALC
         , 10 FACTOR
         , d_w.WAREHOUSE_ID || '_' || d_w.REGION_ID FC_REGION
    FROM D_WAREHOUSES d_w
    ORDER BY d_w.WAREHOUSE_ID;

Week 2 – Exploring Tables and Building Queries to Pull Just the Results You Want

3.
    SELECT d_w.WAREHOUSE_ID
    FROM D_WAREHOUSES d_w
    WHERE d_w.REGION_ID = 1
      AND d_w.HAS_AMAZON_INVENTORY = 'Y';

4.
    SELECT d_w.WAREHOUSE_ID
         , d_w.NAME "FC Name"
    FROM D_WAREHOUSES d_w
    WHERE d_w.REGION_ID = 1
      AND d_w.HAS_AMAZON_INVENTORY = 'Y'
      AND d_w.NAME LIKE '%Logistics%';

6.
    SELECT pg.PRODUCT_GROUP
         , NVL(pg.SHORT_DESC,'Unknown')
    FROM PRODUCT_GROUPS pg
    ORDER BY pg.PRODUCT_GROUP;

7.
    SELECT pg.PRODUCT_GROUP
         , NVL(pg.SHORT_DESC,'Unknown')
         , pg.DESCRIPTION
    FROM PRODUCT_GROUPS pg
    WHERE pg.PRODUCT_GROUP >= 14
      AND pg.DESCRIPTION IN ('Books','Universal','Shops','Advertising','Art')
    ORDER BY pg.PRODUCT_GROUP;

Week 3 – Partitions, Scheduling Jobs to Publish to Folders, Using the Job Run Wildcard, Aggregate Queries and HAVING Clause

2. D_DISTRIBUTOR_ORDERS is partitioned by REGION_ID.

3.
    SELECT /*+ use_hash(ddo) */
           COUNT(ddo.ORDER_ID)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC';

4.
    SELECT /*+ use_hash(ddo) */
           COUNT(ddo.ORDER_ID)
         , MIN(ddo.ORDER_DAY)
         , MAX(ddo.ORDER_DAY)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC';

5.
    SELECT /*+ use_hash(ddo) */
           COUNT(ddo.ORDER_ID)
         , MIN(ddo.ORDER_DAY)
         , MAX(ddo.ORDER_DAY)
         , COUNT(DISTINCT ddo.ORDER_DAY)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC';

6.
    SELECT /*+ use_hash(ddo) */
           COUNT(ddo.ORDER_ID)
         , MIN(ddo.ORDER_DAY)
         , MAX(ddo.ORDER_DAY)
         , COUNT(DISTINCT ddo.ORDER_DAY)
         , SUM(ddo.SHIPPING_COST)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC';

7.
    SELECT /*+ use_hash(ddo) */
           ddo.DISTRIBUTOR_ID
         , COUNT(ddo.ORDER_ID)
         , MIN(ddo.ORDER_DAY)
         , MAX(ddo.ORDER_DAY)
         , COUNT(DISTINCT ddo.ORDER_DAY)
         , SUM(ddo.SHIPPING_COST)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC'
    GROUP BY ddo.DISTRIBUTOR_ID;

8.
    SELECT /*+ use_hash(ddo) */
           ddo.DISTRIBUTOR_ID
         , ddo.HANDLER
         , COUNT(ddo.ORDER_ID)
         , MIN(ddo.ORDER_DAY)
         , MAX(ddo.ORDER_DAY)
         , COUNT(DISTINCT ddo.ORDER_DAY)
         , SUM(ddo.SHIPPING_COST)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC'
    GROUP BY ddo.DISTRIBUTOR_ID
           , ddo.HANDLER;

9.
    SELECT /*+ use_hash(ddo) */
           ddo.DISTRIBUTOR_ID
         , ddo.HANDLER
         , COUNT(ddo.ORDER_ID)
         , MIN(ddo.ORDER_DAY)
         , MAX(ddo.ORDER_DAY)
         , COUNT(DISTINCT ddo.ORDER_DAY)
         , SUM(ddo.SHIPPING_COST)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC'
    GROUP BY ddo.DISTRIBUTOR_ID
           , ddo.HANDLER
    HAVING COUNT(ddo.ORDER_ID) BETWEEN 30 AND 40;

10.
    SELECT /*+ use_hash(ddo) */
           ddo.DISTRIBUTOR_ID
         , ddo.HANDLER
         , COUNT(ddo.ORDER_ID)
         , MIN(ddo.ORDER_DAY)
         , MAX(ddo.ORDER_DAY)
         , COUNT(DISTINCT ddo.ORDER_DAY)
         , SUM(ddo.SHIPPING_COST)
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'PRBRC'
    GROUP BY ddo.DISTRIBUTOR_ID
           , ddo.HANDLER
    HAVING COUNT(ddo.ORDER_ID) BETWEEN 30 AND 40
       AND COUNT(DISTINCT ddo.ORDER_DAY) >= 10;

Week 4 - Joining Tables

2. Both tables are partitioned by REGION_ID & MARKETPLACE_ID and are at the REGION_ID/MARKETPLACE_ID/ASIN grain. Both tables include the column ITEM_NAME, though D_MP_ASINS_ESSENTIALS is considered the authority table for this information. D_ASINS_MARKETPLACE_ATTRIBUTES includes the column BINDING.

3.
    SELECT /*+ use_hash(dma,da) */
           da.ASIN
         , dma.ITEM_NAME
         , da.BINDING
    FROM D_ASINS_MARKETPLACE_ATTRIBUTES da
    JOIN D_MP_ASINS_ESSENTIALS dma
      ON da.ASIN = dma.ASIN
    WHERE da.ASIN = '0385240880'
      AND da.REGION_ID = 1
      AND da.MARKETPLACE_ID = 1
      AND dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1;

4.
    SELECT /*+ use_hash(dma,da,dmam) */
           da.ASIN
         , dma.ITEM_NAME
         , da.BINDING
         , dmam.MANUFACTURER_CODE
    FROM D_ASINS_MARKETPLACE_ATTRIBUTES da
    JOIN D_MP_ASINS_ESSENTIALS dma
      ON da.ASIN = dma.ASIN
    JOIN D_MP_ASIN_MANUFACTURER dmam
      ON da.ASIN = dmam.ASIN
    WHERE da.ASIN = '0385240880'
      AND da.REGION_ID = 1
      AND da.MARKETPLACE_ID = 1
      AND dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1
      AND dmam.MARKETPLACE_ID = 1;

5.
    SELECT /*+ use_hash(v,o_abg) */
           v.PRIMARY_VENDOR_CODE
         , v.VENDOR_NAME
         , v.AMAZON_BUSINESS_GROUP_ID
         , o_abg.TYPE
    FROM VENDORS v
    JOIN O_AMAZON_BUSINESS_GROUPS o_abg
      ON v.AMAZON_BUSINESS_GROUP_ID = o_abg.ID
    WHERE v.VENDOR_ID = 3453;

6.
    SELECT /*+ use_hash(v,o_abg) */
           o_abg.ID
         , o_abg.TYPE
    FROM VENDORS v
    RIGHT OUTER JOIN O_AMAZON_BUSINESS_GROUPS o_abg
      ON v.AMAZON_BUSINESS_GROUP_ID = o_abg.ID
    WHERE v.VENDOR_ID IS NULL
      AND o_abg.TYPE LIKE 'CA%';

7.
    SELECT /*+ use_hash(dsi) */
           dsi.ORDER_ID
         , dsi.ISBN
         , dsi.QUANTITY_UNPACKED
    FROM D_DISTRIBUTOR_SHIPMENT_ITEMS dsi
    WHERE dsi.REGION_ID = 1
      AND dsi.LEGAL_ENTITY_ID = 101
      AND dsi.RECEIVED_DAY = TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD')
      AND dsi.DISTRIBUTOR_ID = 'BATBO';

Week 5 – Dealing with Dates in SQL and Using the Run Date Wildcard

2.
    SELECT /*+ use_hash(d_w) */
           d_w.WAREHOUSE_ID
         , TO_CHAR(d_w.DW_CREATION_DATE,'MM/DD/YYYY')
    FROM D_WAREHOUSES d_w
    WHERE d_w.WAREHOUSE_ID = 'PHL1';

3.
    SELECT /*+ use_hash(d_w) */
           d_w.WAREHOUSE_ID
         , TO_CHAR(d_w.DW_CREATION_DATE,'MM/DD/YYYY')
         , TO_CHAR(TRUNC(d_w.DW_CREATION_DATE,'D'),'MM/DD/YYYY')
         , TO_CHAR(TRUNC(d_w.DW_CREATION_DATE,'D')+6,'MM/DD/YYYY')
    FROM D_WAREHOUSES d_w
    WHERE d_w.WAREHOUSE_ID = 'PHL1';

4.
    SELECT /*+ use_hash(d_w) */
           d_w.WAREHOUSE_ID
         , d_w.DW_CREATION_DATE
    FROM D_WAREHOUSES d_w
    WHERE TRUNC(d_w.DW_CREATION_DATE) <> TO_DATE('20090120','YYYYMMDD')
      AND d_w.REGION_ID <> 1;

5.
    SELECT /*+ use_hash(d_w) */
           d_w.WAREHOUSE_ID
         , d_w.DW_CREATION_DATE
         , TO_CHAR(d_w.DW_CREATION_DATE,'Day')
    FROM D_WAREHOUSES d_w
    WHERE TRUNC(d_w.DW_CREATION_DATE) <> TO_DATE('20090120','YYYYMMDD')
      AND d_w.REGION_ID <> 1;

6.
    SELECT /*+ use_hash(doi) */
           doi.ORDER_ID
         , doi.ISBN
         , doi.QUANTITY_SUBMITTED
         , doi.ORDER_DAY
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = 101
      AND doi.ORDER_DAY BETWEEN TO_DATE('20090101','YYYYMMDD')
                            AND TO_DATE('20090115','YYYYMMDD')
      AND doi.DISTRIBUTOR_ID = 'DCCOM';

7.
    SELECT /*+ use_hash(doi) */
           doi.ISBN
         , SUM(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = 101
      AND doi.ORDER_DAY BETWEEN TO_DATE('20090101','YYYYMMDD')
                            AND TO_DATE('20090115','YYYYMMDD')
      AND doi.DISTRIBUTOR_ID = 'DCCOM'
    GROUP BY doi.ISBN;

8.
    SELECT /*+ use_hash(dma) */
           dma.ASIN
         , dma.ITEM_NAME
         , dma.STREET_DAY
         , dma.PUBLICATION_DAY
    FROM D_MP_ASINS_ESSENTIALS dma
    WHERE dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1
      AND dma.GL_PRODUCT_GROUP = 14
      AND dma.PUBLICATION_DAY >= TO_DATE('20200101','YYYYMMDD');

9.
    SELECT /*+ use_hash(dma) */
           dma.ASIN
         , dma.ITEM_NAME
         , dma.STREET_DAY
         , dma.PUBLICATION_DAY
         , NVL(dma.STREET_DAY,dma.PUBLICATION_DAY)
    FROM D_MP_ASINS_ESSENTIALS dma
    WHERE dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1
      AND dma.GL_PRODUCT_GROUP = 14
      AND dma.PUBLICATION_DAY >= TO_DATE('20200101','YYYYMMDD');

10.
    SELECT /*+ use_hash(ddo,doi) */
           ddo.ORDER_ID
         , ddo.ORDER_TYPE
         , ddo.DISTRIBUTOR_ID
         , COUNT(doi.ISBN)
         , SUM(doi.QUANTITY_SUBMITTED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN D_DISTRIBUTOR_ORDERS ddo
      ON doi.ORDER_ID = ddo.ORDER_ID
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = 101
      AND doi.ORDER_DAY = TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD')
      AND ddo.REGION_ID = 1
      AND ddo.HANDLER = 'username'
    GROUP BY ddo.ORDER_ID
           , ddo.ORDER_TYPE
           , ddo.DISTRIBUTOR_ID;

Week 6 – Subqueries, Segments, and More Wildcards

3.
    SELECT DISTINCT ASIN
    FROM PRODUCT_SEGMENT_MEMBERSHIP
    WHERE SEGMENT_ID = 623271;

4.
    SELECT /*+ use_hash(doi,seg) */
           doi.ORDER_ID
         , doi.DISTRIBUTOR_ID
         , doi.ISBN
         , doi.QUANTITY_ORDERED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = 623271) seg
      ON doi.ISBN = seg.ASIN
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = 101
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD');

5.
    SELECT /*+ use_hash(doi,seg) */
           doi.ORDER_ID
         , doi.DISTRIBUTOR_ID
         , doi.ISBN
         , doi.QUANTITY_ORDERED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = 623271) seg
      ON doi.ISBN = seg.ASIN
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = {LEGAL_ENTITY_ID}
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD');

6.
    SELECT /*+ use_hash(doi,seg) */
           doi.ORDER_ID
         , doi.DISTRIBUTOR_ID
         , doi.ISBN
         , doi.QUANTITY_ORDERED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = {FREE_FORM}) seg
      ON doi.ISBN = seg.ASIN
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = {LEGAL_ENTITY_ID}
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD');

7.
    SELECT /*+ use_hash(doi,v) */
           doi.ORDER_ID
         , doi.DISTRIBUTOR_ID
         , v.VENDOR_NAME
         , doi.ISBN
         , doi.QUANTITY_ORDERED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = {FREE_FORM}) seg
      ON doi.ISBN = seg.ASIN
    JOIN VENDORS v
      ON doi.DISTRIBUTOR_ID = v.PRIMARY_VENDOR_CODE
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = {LEGAL_ENTITY_ID}
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD');

You might notice that when you join to VENDORS you lose data for all vendor codes that are less than 5 characters long, like BTM.
This is because the Vendor Code is stored in the DISTRIBUTOR_ID field of D_DISTRIBUTOR_ORDER_ITEMS with trailing spaces, whereas the PRIMARY_VENDOR_CODE field in VENDORS doesn’t include those spaces, so it doesn’t find matches between ‘BTM  ’ and ‘BTM’. This can be remedied by using a function called RTRIM(), which trims spaces off the RIGHT side of whatever column you put in the function. We could rewrite this query using the RTRIM() function in our JOIN, to trim the spaces off the DISTRIBUTOR_ID field when joining to the PRIMARY_VENDOR_CODE field, so they’ll match even for Vendor Codes that are 2, 3, or 4 characters long.

    SELECT /*+ use_hash(doi,v) */
           doi.ORDER_ID
         , doi.DISTRIBUTOR_ID
         , v.VENDOR_NAME
         , doi.ISBN
         , doi.QUANTITY_ORDERED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = {FREE_FORM}) seg
      ON doi.ISBN = seg.ASIN
    JOIN VENDORS v
      ON RTRIM(doi.DISTRIBUTOR_ID) = v.PRIMARY_VENDOR_CODE
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = {LEGAL_ENTITY_ID}
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD');

Check out https://w.amazon.com/?RTrimVendorCode for more information on RTRIM(), including which tables do and don’t include those trailing spaces on the Vendor Code column.

8.
    SELECT /*+ use_hash(doi,v,dma,seg) */
           doi.ORDER_ID
         , doi.DISTRIBUTOR_ID
         , v.VENDOR_NAME
         , doi.ISBN
         , dma.ITEM_NAME
         , doi.QUANTITY_ORDERED
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = {FREE_FORM}) seg
      ON doi.ISBN = seg.ASIN
    JOIN VENDORS v
      ON doi.DISTRIBUTOR_ID = v.PRIMARY_VENDOR_CODE
    JOIN D_MP_ASINS_ESSENTIALS dma
      ON doi.ISBN = dma.ASIN
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = {LEGAL_ENTITY_ID}
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD')
      AND dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1;

9.
    SELECT /*+ use_hash(doi,v,dma,seg) */
           doi.DISTRIBUTOR_ID
         , v.VENDOR_NAME
         , doi.ISBN
         , dma.ITEM_NAME
         , SUM(doi.QUANTITY_ORDERED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = {FREE_FORM}) seg
      ON doi.ISBN = seg.ASIN
    JOIN VENDORS v
      ON doi.DISTRIBUTOR_ID = v.PRIMARY_VENDOR_CODE
    JOIN D_MP_ASINS_ESSENTIALS dma
      ON doi.ISBN = dma.ASIN
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = {LEGAL_ENTITY_ID}
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD')
      AND dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1
    GROUP BY doi.DISTRIBUTOR_ID
           , v.VENDOR_NAME
           , doi.ISBN
           , dma.ITEM_NAME;

10.
    SELECT /*+ use_hash(doi,v,dma,seg) */
           doi.DISTRIBUTOR_ID
         , v.VENDOR_NAME
         , doi.ISBN
         , dma.ITEM_NAME
         , SUM(doi.QUANTITY_ORDERED)
    FROM D_DISTRIBUTOR_ORDER_ITEMS doi
    JOIN (SELECT DISTINCT ASIN
          FROM PRODUCT_SEGMENT_MEMBERSHIP
          WHERE SEGMENT_ID = {FREE_FORM}) seg
      ON doi.ISBN = seg.ASIN
    JOIN VENDORS v
      ON doi.DISTRIBUTOR_ID = v.PRIMARY_VENDOR_CODE
    JOIN D_MP_ASINS_ESSENTIALS dma
      ON doi.ISBN = dma.ASIN
    WHERE doi.REGION_ID = 1
      AND doi.LEGAL_ENTITY_ID = {LEGAL_ENTITY_ID}
      AND doi.ORDER_DAY = TO_DATE('20090406','YYYYMMDD')
      AND dma.REGION_ID = 1
      AND dma.MARKETPLACE_ID = 1
    GROUP BY doi.DISTRIBUTOR_ID
           , v.VENDOR_NAME
           , doi.ISBN
           , dma.ITEM_NAME
    HAVING SUM(doi.QUANTITY_ORDERED) > 3;

Week 7 - DECODE & CASE, Troubleshooting, Stealing SQL from DSS Queries, and Linking ETL Output to Excel

2.
    SELECT /*+ use_hash(ddo) */
           ddo.ORDER_ID
         , DECODE(ddo.ORDER_TYPE,0,'DS',1,'OP',2,'SP',3,'PB',4,'PD',6,'SU',7,'IS',8,'MS',9,'LA',10,'LB',
                  11,'LC',12,'LD',13,'SA',14,'SB',15,'SC',16,'SD',17,'NP',18,'RE',19,'VP',20,'MU',
                  21,'T1',22,'T2',23,'T3',24,'B1',25,'B2',26,'B3',27,'M1',28,'M2',29,'M3',30,'R1',
                  31,'R2',32,'R3',33,'PT',34,'DR',35,'MX',ddo.ORDER_TYPE) AS ORDER_TYPE
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'SIMON'
      AND ddo.ORDER_DAY BETWEEN TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD')-6
                            AND TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD');

3.
    SELECT /*+ use_hash(ddo) */
           ddo.ORDER_ID
         , DECODE(ddo.ORDER_TYPE,0,'DS',1,'OP',2,'SP',3,'PB',4,'PD',6,'SU',7,'IS',8,'MS',9,'LA',10,'LB',
                  11,'LC',12,'LD',13,'SA',14,'SB',15,'SC',16,'SD',17,'NP',18,'RE',19,'VP',20,'MU',
                  21,'T1',22,'T2',23,'T3',24,'B1',25,'B2',26,'B3',27,'M1',28,'M2',29,'M3',30,'R1',
                  31,'R2',32,'R3',33,'PT',34,'DR',35,'MX',ddo.ORDER_TYPE) ORDER_TYPE
         , CASE
             WHEN ddo.DEAL_CODE IS NOT NULL THEN 'Deal Buy PO'
             WHEN ddo.ORDER_TYPE IN (0,2,4,6,9,12,17) THEN 'Auto'
             ELSE 'Manual'
           END ORDER_METHOD
    FROM D_DISTRIBUTOR_ORDERS ddo
    WHERE ddo.REGION_ID = 1
      AND ddo.LEGAL_ENTITY_ID = 101
      AND ddo.DISTRIBUTOR_ID = 'SIMON'
      AND ddo.ORDER_DAY BETWEEN TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD')-6
                            AND TO_DATE('{RUN_DATE_YYYYMMDD}','YYYYMMDD');
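If you want to check the CASE logic from answer 3 before running it against the warehouse, one option is to test it on a handful of hand-built rows. The sketch below is only an illustration: the sample_orders rows and the PO-1 through PO-3 order IDs are made up, and only the CASE expression itself mirrors the answer above.

```sql
-- Hypothetical inline rows stand in for D_DISTRIBUTOR_ORDERS.
WITH sample_orders AS (
    SELECT 'PO-1' AS order_id, 'D0000001069' AS deal_code, 9 AS order_type FROM DUAL UNION ALL
    SELECT 'PO-2',             NULL,                       9               FROM DUAL UNION ALL
    SELECT 'PO-3',             NULL,                       1               FROM DUAL
)
SELECT so.order_id
     , CASE
         -- checked first, so a deal code wins even when the order type is in the Auto list
         WHEN so.deal_code IS NOT NULL            THEN 'Deal Buy PO'
         -- IN replaces seven separate WHEN/THEN combinations
         WHEN so.order_type IN (0,2,4,6,9,12,17)  THEN 'Auto'
         ELSE 'Manual'
       END AS order_method
FROM sample_orders so
ORDER BY so.order_id;
```

PO-1 should come back as ‘Deal Buy PO’, PO-2 as ‘Auto’, and PO-3 as ‘Manual’. PO-1 and PO-2 share order type 9, so comparing them shows that the order of the WHEN/THEN combinations decides the result.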