Data Profiler. Regular Expressions in Analysis with MS SQL Server

Table of Contents
Introduction
SQL Execution Engine vs Java
Developing of SQL Server Regular Expressions Function
Deploying of Regular Expression Function to SQL Server
Talend Setup
Known Issues

Introduction

Regular expressions are a very powerful tool for data analysis in Data Profiler. They are easy to use and produce good results with minimal effort.

There are two ways to run an analysis in Data Profiler: SQL and Java. The SQL execution engine carries out the analysis on the database server; the Java execution engine retrieves all the data and applies the analysis on the client side. As a result, the Java option takes more time and consumes more resources, because the data is sent over the network to the client computer for processing.

MS SQL Server doesn’t support regular expressions natively, so the SQL option cannot be applied to MS SQL Server data directly. Fortunately, MS SQL Server 2005 introduced CLR support, which makes the rich .NET Framework libraries available in SQL queries in the form of user-defined functions and procedures.

The SQL execution engine gives a significant advantage for MS SQL Server on large data sets.
The Java execution engine performs better on small data sets.

SQL Execution Engine vs Java

The tests were run on a database with 25 million records. A regular expression analysis was applied to one field, using the pattern “^[0-9]{4}$”.

#  MS SQL Server Configuration                       Talend Workstation Configuration              SQL Runtime, seconds  Java Runtime, seconds  Difference, %
1  SQL Server R2 Developer, 4 cores, 8 GB RAM, x64   Ran on the same computer with SQL Server      182                   1090                   599
2  SQL Server R2 Developer, 4 cores, 8 GB RAM, x64   Linux Fedora x64, 2 cores, 4 GB RAM, laptop   151                   569                    379

Developing of SQL Server Regular Expressions Function

First, the SQL Server function has to be developed in Microsoft Visual Studio. Create a new Visual Basic or Visual C# SQL CLR Database project. A sample class in C# is shown below:

    using System.Data.SqlTypes;
    using System.Text.RegularExpressions;

    public partial class RegExpBase
    {
        // Returns true when matchString matches pattern; a NULL matchString counts as no match.
        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlBoolean RegExpMatch(string matchString, string pattern)
        {
            bool isSuccess = false;
            if (matchString != null)
            {
                Regex r = new Regex(pattern);
                isSuccess = r.Match(matchString).Success;
            }
            return isSuccess;
        }
    };

Two files are attached:
1. PrgxRegExpProject.zip – Visual Studio 2010 C# SQL CLR database project
2. PrgxRegExpLibrary.zip – ready-for-deployment library

Deploying of Regular Expression Function to SQL Server

The next stage is to deploy the SQL CLR library to SQL Server.
Deployment requires changing some SQL Server settings to enable CLR:

    ALTER DATABASE [Database Name] SET TRUSTWORTHY ON
    go
    sp_configure 'clr enabled', 1
    go
    reconfigure
    go

To register the new CLR assembly and function, place the DLL in the MS SQL Server folder and run the following script:

    USE [Database Name]
    go
    CREATE ASSEMBLY a_PrgxRegExp
    FROM 'E:\SqlFiles\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\Clr\PrgxRegExp.dll'
    WITH PERMISSION_SET = UNSAFE;
    go
    -- T-SQL parameters map to the CLR method arguments by position:
    -- the string to test comes first, the pattern second.
    CREATE FUNCTION [dbo].[f_PrgxRegExpMatch](@MatchString NVARCHAR(4000), @Pattern NVARCHAR(4000))
    RETURNS BIT
    WITH EXECUTE AS OWNER
    AS EXTERNAL NAME a_PrgxRegExp.RegExpBase.RegExpMatch
    go

The last step is to test the new function. The first statement returns 1, the second returns 0.

    select dbo.f_PrgxRegExpMatch('1254','^[0-9]{4}$');
    select dbo.f_PrgxRegExpMatch('eh5','^[0-9]{4}$');

Talend Setup

Finally, add a new item to the Regular_Expression_Matching indicator. The indicator statement is:

    SELECT COUNT(CASE WHEN dbo.f_PrgxRegExpMatch(<%=__COLUMN_NAMES__%>,<%=__PATTERN_EXPR__%>)=1 THEN 1 END), COUNT(*) FROM <%=__TABLE_NAME__%> <%=__WHERE_CLAUSE__%>

Now everything is ready to create regular expression patterns for data validation.

Known Issues

1. Regular expressions do not work for the PRGX remote project, although I could set them up for a local project. I think either the PRGX remote project was set up improperly or this is a Talend bug.
2. The drill-down functionality that shows valid and invalid rows after an analysis completes does not work. I will send a bug report to Talend support.
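
While drill-down is unavailable, the matching and non-matching rows can probably be inspected by calling the deployed function directly. The sketch below is only an illustration, not part of the original setup. The names dbo.SampleTable and SampleColumn are hypothetical placeholders for the analyzed table and column, and the pattern is the one used in the tests above. Because the C# sample returns false for a NULL input string, NULL values should appear among the non-matching rows.

```sql
-- Hypothetical names: dbo.SampleTable and SampleColumn are placeholders,
-- replace them with the table and column under analysis.

-- Summary counts, in the same shape as the indicator statement.
SELECT COUNT(CASE WHEN dbo.f_PrgxRegExpMatch(SampleColumn, '^[0-9]{4}$') = 1 THEN 1 END) AS MatchCount,
       COUNT(*) AS TotalCount
FROM dbo.SampleTable;

-- First 100 rows whose value does not match the pattern (the "invalid rows" view).
SELECT TOP (100) *
FROM dbo.SampleTable
WHERE dbo.f_PrgxRegExpMatch(SampleColumn, '^[0-9]{4}$') = 0;

-- First 100 rows whose value matches the pattern (the "valid rows" view).
SELECT TOP (100) *
FROM dbo.SampleTable
WHERE dbo.f_PrgxRegExpMatch(SampleColumn, '^[0-9]{4}$') = 1;
```

The TOP (100) limit is an arbitrary choice to keep the result small on a large table; remove or adjust it as needed.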