Ashley Ohmann June 20, 2013 * What is Custom SQL? * What can I do with it? * Join conditions * Unions and Self Joins * Ranks * Derived Tables * Tableau queries each data source using SQL that’s specific to the data type * The differences querying different data source types are in * Syntax * Functions * Field length limitations * Tableau allows the SQL used to query a data source to be customized (relatively easily) in order to manipulate the joins, filters, and field lengths and types to produce a more accurate output * Every data connection, whether it’s to one or multiple tables, has custom SQL underpinning it. * The syntax between different database types varies—for instance, in most flat file and server-based data sources, table and field names are identified by brackets. * With Hive, the bracket has been replaced by ` * The functions in a query must be functions that the database can understand, which is a particular issue with SQL queries that aggregate * For the most part, all databases use some version of SQL, but there are minor differences * For instance, flat Microsoft-based sources, like Access, Excel, and flat files, use the conventions of Microsoft’s JET driver * Another example: the ISNULL function, which is used to replace nulls in SQL Server data sources and is a corollary to IFNULL in Tableau, is not read by Teradata—one would use COALESCE instead. * Join conditions tell the database how the fields in different tables are related to each other * A “table” is a queried set of data, not necessarily all of the rows and columns in a pre-defined table in a database * Custom SQL is a good tool for using joins to transform a data source * Inner join: keeps records common to both tables; this join type usually has the least number of null values, provided that the join condition is the right one * Left join: keeps all records from the first table and the matching records from the second table * Right join: keeps all records from the second table and the matching records from the first table * Outer/full join: keeps all records from both tables * Self join: refers to the same table in the inner and outer query. A union can be a self join. * Union: stacks tables on top of each other. * Unions are particularly useful for transforming a data source, either because it was not set-up properly, or because one would like to create a visualization that needs for it to be set-up in a certain way * The Region Sales example data source presents the current and previous year’s sales by Division and Region for two different battery types. * The format is similar to that of a pivot table, and the problem with analyzing it is that the same measure is duplicated according to a qualitative attribute. * In order to analyze this data source and to create metrics based on the measures, it needs to be restructured: * We need to add a field that identifies the battery type as an attribute * We need to put the sales for this year and last year, respective, in one column * Connect to the Union Data.xls data source, and select the Region Sales sheet. * Click the “Custom SQL” option under Step 2 of the data connection * Click the expansion button to see the whole Custom SQL window * Re-order the fields so that Division comes first, then Region * Highlight and copy all of the text, then underneath it, type UNION on a new row, then create another two carriage returns, and paste the code * In the outer query, create a new field that identifies the battery type as an attribute * Alias the battery type’s sales this year with a generic field name, and do the same for last year’s sales. Then, delete the fields that reference the other battery type * In the bottom query, paste the field that identifies the battery type attribute, but change the value to “Energizer” * Alias the sales fields for that battery type so that they have the same names as the sales fields in the outer query * Make sure that the commas are in the right places * Click OK, and then give the data source a good name * Duplicate the data source, and call it “Union with Calculated Fields” * We’ll use the existing join conditions to add in some other calculations * Custom SQL can be used to apply conditions to different groups of data based on an attribute of the group * It might be useful to add a quota, and in this case, the quote will be the previous year’s sales multiplied by 1.15 for Energizer and 1.10 for Duracell * Edit the Custom SQL to multiple the previous year’s sales by the factors listed above, and call the new field [Quota] * This data is useful for making a nice little bullet chart. Add a calculated field in Tableau identifying whether or not the sales exceeded the quota, and put it on the color shelf. * One of the shortcomings of Tableau is that it doesn’t have a real ranking function * RAWSQL can be used to rank, but for a large data set with ranking conditions on several fields, it’s best to edit the SQL and include a ranking function * It doesn’t look like an Excel data connection supports this, so one would need to pre-rank the data in Excel * For SQL Server or Oracle data connections, RANK and DENSE_RANK functions are appropriate * * Syntax: RANK ( ) OVER ( [ partition_by_clause ] order_by_clause ) Looks like: RANK() OVER (PARTITION BY i.LocationID ORDER BY i.Quantity DESC) AS Rank * I have added the sample data to my SQL Server and written a ranking query—see the next slide. SELECT [Region Sales].[Division] AS [Division], [Region Sales].[Region] AS [Region], 'E' as [Battery Type], [Region Sales].[Energizer] as [TY Sales], DENSE_RANK() OVER (ORDER BY [Region Sales].[Energizer] DESC) AS TY_Rank FROM [dbo].[Region Sales] [Region Sales] Union SELECT [Region Sales].[Division] AS [Division], [Region Sales].[Region] AS [Region], 'D' as [Battery Type], [Region Sales].[Duracell] AS [TY Sales], DENSE_RANK() OVER (ORDER BY [Region Sales].[Duracell] DESC) AS Rank FROM [dbo].[Region Sales] [Region Sales] * Derived tables in a query are very useful for transforming data as well as aggregating at different dimension levels * We’ll use a different version of our previous data set to create a derived table * The YOY Sales sheet in our data source shows, for each Division, Region, Battery Type, and Year, the Sales. * We want to have this year’s sales and last year’s sales as separate columns, though. * Unfortunately, we can’t use a derived table with an Excel data connection, so the example on the next slide is for code from a SQL Server. SELECT [TY].[Battery Type] AS [Battery Type], [TY].[Division] AS [Division], [TY].[Region] AS [Region], [TY].[Sales] AS [TY Sales], [LY].[LY Sales] AS [LY Sales] FROM [YOY Sales] [TY] Left Join (SELECT [YOY Sales].[Battery Type] AS [Battery Type], [YOY Sales].[Division] AS [Division], [YOY Sales].[Region] AS [Region], [YOY Sales].[Sales] AS [LY Sales] FROM [YOY Sales] where [YOY Sales].[Year] = 2012) as [LY] on [TY].[Battery Type] = [LY].[Battery Type] and [TY].[Division] = [LY].[Division] and [LY].[Region] = [LY].[Region] where [TY].[Year] = 2013