Data Profiler. Regular Expressions in Analysis with MS SQL Server

Table of Contents
Introduction
SQL Execution Engine vs Java
Developing of SQL Server Regular Expressions Function
Deploying of Regular Expression Function to SQL Server
Talend Setup
Known Issues
Introduction
Regular expressions are a very powerful tool for data analysis in Data Profiler. They are easy to use
and, at the same time, produce useful results with minimal effort.
There are two ways to run an analysis in Data Profiler: SQL and Java. The SQL execution engine carries out
the analysis on the database server; the Java execution engine retrieves all the data and applies the analysis
on the client side. As a result, the Java option takes more time and consumes more resources, because the data
is sent over the network to the client computer for processing.
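To make the contrast concrete, here is a minimal sketch of what client-side evaluation amounts to once the rows have been fetched: every value is tested against the pattern in the client's memory. It is an illustration only, not Talend's actual implementation. The class name, method name and sample values are hypothetical, and the pattern is the one used in the tests later in this document.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration of client-side matching; not Talend code.
final class ClientSideRegexCount {

    // Returns {matching count, total count} for the given values.
    // Null values are counted in the total but never match.
    static long[] countMatches(List<String> values, String regex) {
        Pattern pattern = Pattern.compile(regex); // compile once, reuse for every row
        long matching = 0;
        for (String value : values) {
            if (value != null && pattern.matcher(value).find()) {
                matching++;
            }
        }
        return new long[] { matching, values.size() };
    }

    public static void main(String[] args) {
        // Sample values standing in for a fetched column (assumed data).
        List<String> fetched = Arrays.asList("1254", "eh5", null, "0042", "12345");
        long[] result = countMatches(fetched, "^[0-9]{4}$");
        System.out.println(result[0] + " of " + result[1] + " values match"); // 2 of 5 values match
    }
}
```

The loop itself is cheap; with millions of rows the cost lies in moving every value to the client before it can be tested.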
MS SQL Server does not support regular expressions natively, so the SQL option cannot be applied to MS SQL
Server data directly. Fortunately, starting with MS SQL Server 2005, Microsoft introduced CLR support, so the
rich .NET Framework libraries can be made available in SQL queries in the form of user-defined functions and
procedures.
The SQL execution engine gives a significant advantage for MS SQL Server on large data sets. The Java
execution engine shows better performance on small data sets.
SQL Execution Engine vs Java
The tests were run on a database with 25 million records. A regular expression analysis was applied to one
field, using the regular expression “^[0-9]{4}$”.
#   MS SQL Server               Talend Workstation            SQL Runtime,   Java Runtime,   Difference,
    Configuration               Configuration                 seconds        seconds         %
1   - SQL Server R2 Developer   Ran on the same computer      182            1090            599
    - 4 cores                   with SQL server
    - 8 GB RAM
    - x64
2   - SQL Server R2 Developer   - Linux Fedora x64            151            569             379
    - 4 cores                   - 2 cores
    - 8 GB RAM                  - 4 GB RAM
    - x64                       - Laptop
Developing of SQL Server Regular Expressions Function
First, the SQL Server function has to be developed in Microsoft Visual Studio. Create a new Visual
Basic or Visual C# SQL CLR Database project. A sample class in C# is shown below:
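    using System.Data.SqlTypes;
    using System.Text.RegularExpressions;

    public partial class RegExpBase
    {
        // Returns true when matchString contains a match for pattern.
        // A NULL value or a NULL pattern never matches.
        [Microsoft.SqlServer.Server.SqlFunction(IsDeterministic = true, IsPrecise = true)]
        public static SqlBoolean RegExpMatch(string matchString, string pattern)
        {
            bool isSuccess = false;
            if (matchString != null && pattern != null)
            {
                Regex r = new Regex(pattern);
                isSuccess = r.Match(matchString).Success;
            }
            return isSuccess; // bool converts implicitly to SqlBoolean
        }
    }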
Two files are attached:
1. PrgxRegExpProject.zip – Visual Studio 2010 C# SQL CLR database project
2. PrgxRegExpLibrary.zip – library ready for deployment
Deploying of Regular Expression Function to SQL Server
The next stage is to deploy the SQL CLR library to SQL Server. This requires changing some SQL Server
settings to enable CLR:
    ALTER DATABASE [Database Name] SET TRUSTWORTHY ON
    go
    EXEC sp_configure 'clr enabled', 1
    go
    RECONFIGURE
    go
To register the new CLR assembly and function, copy the DLL to a folder on the MS SQL Server machine and
run the following script. The parameters are bound to the C# method by position, so the value to test comes
first and the pattern second:
    USE [Database Name]
    go
    CREATE ASSEMBLY a_PrgxRegExp
    FROM 'E:\SqlFiles\MSSQL10_50.MSSQLSERVER\MSSQL\DATA\Clr\PrgxRegExp.dll'
    WITH PERMISSION_SET = UNSAFE;
    go
    CREATE FUNCTION [dbo].[f_PrgxRegExpMatch](@MatchString NVARCHAR(4000),
    @Pattern NVARCHAR(4000))
    RETURNS BIT
    WITH EXECUTE AS OWNER
    AS EXTERNAL NAME a_PrgxRegExp.RegExpBase.RegExpMatch
    go
The last step is to test the new function. The first statement returns 1, the second returns 0.
    select dbo.f_PrgxRegExpMatch('1254','^[0-9]{4}$');
    select dbo.f_PrgxRegExpMatch('eh5','^[0-9]{4}$');
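The two results follow from how the pattern is evaluated. Regex.Match in .NET searches for the pattern anywhere in the input, so the ^ and $ anchors are what force the whole value to be exactly four digits. The sketch below runs the same checks on the client with Java's Matcher.find(), which searches in the same unanchored way. It is an illustration only: the class name is hypothetical, and the .NET and Java regular expression dialects are not identical in every detail, although they agree for this simple pattern.

```java
import java.util.regex.Pattern;

// Hypothetical client-side counterpart of the CLR function, for trying out patterns.
final class RegExpMatchCheck {

    // True when the pattern is found somewhere in the input; a null argument never matches.
    static boolean regExpMatch(String matchString, String pattern) {
        if (matchString == null || pattern == null) {
            return false;
        }
        return Pattern.compile(pattern).matcher(matchString).find();
    }

    public static void main(String[] args) {
        System.out.println(regExpMatch("1254", "^[0-9]{4}$"));  // true  (SQL result 1)
        System.out.println(regExpMatch("eh5", "^[0-9]{4}$"));   // false (SQL result 0)
        // Without the anchors, a longer value also counts as a match.
        System.out.println(regExpMatch("12345", "[0-9]{4}"));   // true
        System.out.println(regExpMatch("12345", "^[0-9]{4}$")); // false
    }
}
```

The last two calls show why the anchors matter: a five-digit value passes the unanchored pattern but is rejected by the anchored one.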
Talend Setup
Finally, add a new item to the Regular_Expression_Matching indicator.
The indicator statement is:
    SELECT COUNT(CASE WHEN
    dbo.f_PrgxRegExpMatch(<%=__COLUMN_NAMES__%>,<%=__PATTERN_EXPR__%>)=1 THEN 1 END),
    COUNT(*) FROM <%=__TABLE_NAME__%> <%=__WHERE_CLAUSE__%>
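To see what this statement looks like once the placeholders are filled in, the sketch below substitutes them by plain string replacement. This is only an approximation for illustration: Talend's own template handling may work differently, and the column name zip_code, the table name dbo.customers and the class name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical placeholder substitution; Talend's own template handling may differ.
final class IndicatorTemplateDemo {

    static final String TEMPLATE =
        "SELECT COUNT(CASE WHEN dbo.f_PrgxRegExpMatch(<%=__COLUMN_NAMES__%>,"
        + "<%=__PATTERN_EXPR__%>)=1 THEN 1 END), COUNT(*) "
        + "FROM <%=__TABLE_NAME__%> <%=__WHERE_CLAUSE__%>";

    // Replaces each <%=NAME%> placeholder with its value (literal replacement, no regex).
    static String render(String template, Map<String, String> values) {
        String sql = template;
        for (Map.Entry<String, String> e : values.entrySet()) {
            sql = sql.replace("<%=" + e.getKey() + "%>", e.getValue());
        }
        return sql.trim();
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("__COLUMN_NAMES__", "zip_code");      // assumed column name
        values.put("__PATTERN_EXPR__", "'^[0-9]{4}$'");  // pattern as a SQL string literal
        values.put("__TABLE_NAME__", "dbo.customers");   // assumed table name
        values.put("__WHERE_CLAUSE__", "");              // no filter in this example
        System.out.println(render(TEMPLATE, values));
    }
}
```

The rendered query returns two numbers in one pass over the table: the count of rows whose value matches the pattern, and the total row count.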
Now everything is ready to create regular expression patterns for data validation.
Known Issues
1. Regular expressions do not work for the PRGX remote project, although I could set them up for a local
project. I think the PRGX remote project was set up improperly, or it is a Talend bug.
2. The drill-down functionality that shows valid and invalid rows after an analysis completes does not
work. I will send a bug report to Talend support.