Grid Enabled Distributed Data Mining and Conversion of Unstructured Data

Paul Donachy, Terrence J. Harmer, Ron H. Perrott
Belfast e-Science Centre
www.qub.ac.uk/escience

Jens Rasch, Sarah Bearder, Martin Beckett
Datactics Ltd
www.datactics.co.uk

Abstract

With the explosion in the size of data warehouses and the proliferation of databases, handling large volumes of unstructured data remains the most critical element affecting companies attempting to control their data assets. It presents even experienced data managers with a host of potential problems, including the matching, transformation and integration of disparate data sources. Many current practices are poorly developed and, as the size of datasets will only increase massively in the near future, it is of significant commercial importance to develop scalable, affordable techniques for handling these issues. In terms of both crime prevention and national security, it is imperative to provide a streamlined mechanism for seamless access to, and mining of, such volumes of unstructured data. A Grid-enabled environment has the potential to solve this problem by providing the core data mining engine with secure, reliable and scalable high-bandwidth access to the various distributed data sources and formats across different administrative domains. This paper presents the architecture and a roadmap for a Grid-enabled distributed data matching solution based on such technology.

1 Introduction

With the explosion in the size of data warehouses and the proliferation of databases, handling large volumes of unstructured data remains the most critical element affecting companies attempting to control their data assets. It presents even experienced data managers with a host of potential problems, including the matching, transformation and integration of disparate data sources. The emergence of Grid technology, with its ability to provide secure, reliable and scalable high-bandwidth access to distributed data sources across administrative domains, is set to play an important role in data mining.

This paper presents the background and motivation for the industrial e-Science project GEDDM. The industrial partner in the project, Datactics Ltd, has developed the world's first fully fuzzy parallelised data matching algorithm. We describe the effect fuzzy matching has on data mining and its computational consequences. The paper presents the nature of this industry, with typical business examples of the types of errors encountered in real-world data and the fuzzy approach to data matching. The CPU-intensive demands of this process are described and a solution using message passing is outlined. Results are presented for a wide range of hardware used to implement parallel solutions, from SMP machines to Beowulf clusters. Finally, the architecture for a fully service-oriented, Grid-enabled distributed data matching solution is presented.

2 Current Practice

The volume and structure of data are still the most critical elements affecting companies attempting to control their data assets. Many current practices remain poorly developed and, with dataset sizes set to increase massively in the near future, it is of significant commercial importance to develop scalable, affordable techniques for handling these issues.

In terms of both crime prevention and national security, it is imperative to provide a streamlined mechanism for seamless access to, and mining of, such volumes of unstructured data. Bringing together data from different sources poses a number of difficulties. There are straightforward problems of format and standards in representing addresses and dates, and of source systems that were intended for different uses and emphasise different parts of the data. In addition to this rich source of natural errors, there are deliberate errors introduced for fraud. We are particularly concerned with errors that lead to duplication, that is, multiple physical records in a database that actually refer to the same entity. We have classified the typical database errors that lead to duplicate entries into a number of classes:

• Deletion, insertion and replacement: single-character errors in a field with no obvious cause, often "typos" or extra characters inserted when translating between different data formats.

• Phonetic errors: single errors due to mis-hearing or "mis-thinking", especially common with numbers, e.g. 15 / 50. The specific errors are often dependent on the language being used.

• Visual errors: similar to typing errors, but it is possible to predict which character was intended. These are becoming more common as legacy paper-based data is computerised or systems are used to automate the entry of manual forms. Typical errors include confusing "m" and "n", and OCR systems confusing 1 (one) and l (letter L).

• Equivalent words: an interesting class of errors occurs when words are regarded as equivalent by human operators, such as Ms/Mrs or road/avenue/street/terrace. This can even be deliberate in some areas, to claim a more desirable address. Names are a rich source of these errors: to everyone except a computer, "Richard" is equal to "Dick". The problem is even more prevalent in cultures where names contain honorifics or have different forms depending on the situation.

• Inconsistency errors: errors that can be identified automatically, such as a mismatch between the gender of a title and a first name, or between a postcode and a post town.

Recently, the industrial partner has had requests from major customers in the US banking and legal sectors to interrogate data sources in numerous structures, formats and disparate administrative locations. These business opportunities all present a similar technical problem: interfacing to such vast amounts of information in a common, structured, parallel way across such disparate structures and sources has been a bottleneck. For example, one legal customer held over 45 TB of data in various formats, including email, PDF, web logs and various RDBMSs.

3 Deduplication

The Datactics "DataTrawler" product consists of a graphical user interface running on Windows and most Unix-like systems. It shows the current view of the data and a tree view of all the operations performed. Operations range from simple import and conversion of data to complex fuzzy-logic deduplication of large datasets, but all can be performed with no programming or specialised database knowledge. Each operation is carried out by a separate engine, which may be running locally, on a different platform, or on a cluster; the results of the engine are reflected in the GUI. In addition, a batch script can be generated, allowing the sequence of operations to be performed automatically, and an audit report of all operations is generated for every run.

The range of errors described above needs a variety of approaches, all of which are contained within the tool. The mixture of mostly random point errors and structural errors leads to our unique approach of combining user-supplied domain IP and fuzzy matching technology with HPC computing power.

Match files: All the operations in the tool are first subject to a set of user-editable templates. These allow specific patterns of letters and numbers to be matched, equivalent words to be compared, or certain characters to be removed. This has the advantage of allowing non-programming users to contribute domain-specific business intelligence, and it allows easy specialisation and customisation for a specific country.

Fuzzy: All the matching engines within Duplitrix can employ fuzzy matching, which allows a defined number of allowable errors to be specified for each match. An error is a single missing character or a swapped pair of characters. The level of fuzziness can be tuned depending on the data to be matched; for example, in first names we might specify an overall level of one error, while for long, complicated, foreign or often misspelt names a higher error level would be permitted. Since many of the errors are just as likely to occur in the first character, allowing a fuzzy match precludes many of the indexing techniques commonly used to search data. Ultimately this becomes a brute-force approach of matching every character of every record against every other record.
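As an illustration of this style of matching, the following minimal sketch combines a bounded edit-distance test with a small table of equivalent words of the kind a match file might supply. It is a sketch only, written in Python for readability; the equivalence entries, field values and error threshold are invented for illustration and do not represent the Datactics implementation.

# Minimal sketch of bounded fuzzy matching plus a word-equivalence table.
# Illustrative only: equivalences, field values and thresholds are invented.

EQUIVALENT = {          # the kind of rule a user-editable match file might hold
    "dick": "richard",
    "rd": "road",
    "st": "street",
}

def normalise(value: str) -> str:
    """Lower-case a field and substitute words treated as equivalent."""
    return " ".join(EQUIVALENT.get(w, w) for w in value.lower().split())

def edit_distance(a: str, b: str) -> int:
    """Edit distance counting a missing character or a swapped pair as one error."""
    prev2, prev = None, list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            best = min(prev[j] + 1,                 # character deleted
                       cur[j - 1] + 1,              # character inserted
                       prev[j - 1] + (ca != cb))    # character replaced
            if prev2 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                best = min(best, prev2[j - 2] + 1)  # swapped pair of characters
            cur.append(best)
        prev2, prev = prev, cur
    return prev[-1]

def fuzzy_match(a: str, b: str, allowed_errors: int = 1) -> bool:
    """True when two field values differ by no more than the allowed error count."""
    return edit_distance(normalise(a), normalise(b)) <= allowed_errors

print(fuzzy_match("Richard Smith", "Rchard Smith"))  # missing character -> True
print(fuzzy_match("Richard Smith", "Dick Smith"))    # equivalent word   -> True

Both example pairs are reported as duplicates: the first because it falls within the single allowed error, the second because the equivalence table maps "Dick" to "Richard" before the comparison is made.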
Two recent advances have made this level of error matching possible on large datasets: cheap, commodity, high-performance PCs with large memory and storage coupled with fast networking, and cheap (free) operating systems that allow clusters of machines to be built easily.

Messaging: The Datactics matching engines use MPI (specifically MPICH) to operate seamlessly on SMP machines and clusters under Windows, Linux and a wide range of Unix-like operating systems.
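To show how the brute-force, all-pairs comparison can be divided up with message passing, the sketch below uses the mpi4py binding of MPI (chosen here for readability; it is not the Datactics engine). Each rank takes a strided share of the records, compares it against the later records in the dataset, and the candidate duplicate pairs are gathered at rank 0. The record list and the stand-in similarity test are invented for illustration.

# Sketch: dividing a brute-force all-pairs fuzzy comparison across MPI ranks.
# Run with, for example:  mpiexec -n 4 python dedupe_sketch.py
from mpi4py import MPI
from difflib import SequenceMatcher

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    """Stand-in similarity test; the real engines use a bounded error count."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Toy dataset visible to every rank; in practice records would come from an
# import/conversion engine or be broadcast from rank 0.
records = ["Richard Smith", "Rchard Smith", "Jane Doe", "Dick Smith", "Jane Do"]

# Each rank owns a strided slice of the left-hand records and compares each
# one only against later records, so every unordered pair (i, j) is examined
# by exactly one rank.
local_hits = []
for i in range(rank, len(records), size):
    for j in range(i + 1, len(records)):
        if fuzzy_match(records[i], records[j]):
            local_hits.append((i, j))

# Rank 0 collects the candidate duplicate pairs found by all ranks.
gathered = comm.gather(local_hits, root=0)
if rank == 0:
    duplicates = [pair for part in gathered for pair in part]
    print("candidate duplicate pairs:", duplicates)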
4 Towards a Grid Architecture

The Grid-based distributed data mining architecture presented here is based on the Open Grid Services Architecture (OGSA) model [1], derived from the Open Grid Services Infrastructure specification defined by the OGSI Working Group within the GGF [2]. OGSA represents an evolution towards a Grid architecture based on Web services concepts and technologies. It describes and defines a service-oriented architecture composed of a set of interfaces and their corresponding behaviours that facilitate distributed resource sharing and access in heterogeneous, dynamic environments [3].

[Figure 1: The components of a service-oriented architecture. A service provider PUBLISHes service descriptions to a service directory; a service requestor FINDs a suitable service in the directory and BINDs to the provider, with all interactions carried over a transport medium.]

Figure 1 shows the individual components of the service-oriented architecture (SOA). The service directory is the location where information about all available Grid services is maintained. A service provider that wants to offer services publishes them by placing appropriate entries in the service directory. A service requestor uses the service directory to find an appropriate service that matches its requirements; an example of such a requirement is the maximum price a service requestor is willing to accept for a specific data mining service. When a service requestor locates a suitable service, it binds to the service provider using binding information maintained in the service directory. The binding information contains the specification of the protocol that the service requestor must use, as well as the structure of the request messages and the resulting responses. The communication between the various agents occurs via an appropriate transport mechanism.
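To make the publish, find and bind interactions of Figure 1 concrete, the sketch below models the three roles as plain Python objects: a provider publishes a service description to an in-memory directory, and a requestor finds a service that satisfies its maximum-price requirement and then binds to and invokes it. This illustrates the SOA pattern only; the service name, price attribute and directory API are invented and are not the OGSA/OGSI interfaces themselves.

# Sketch of the publish / find / bind pattern from Figure 1 using an
# in-memory service directory. All names and attributes are illustrative.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class ServiceEntry:
    name: str                      # e.g. a distributed data mining service
    price_per_gb: float            # the "maximum price" requirement above
    binding: Callable[[str], str]  # stand-in for protocol and message details

class ServiceDirectory:
    """Holds entries published by providers and is queried by requestors."""

    def __init__(self) -> None:
        self._entries: List[ServiceEntry] = []

    def publish(self, entry: ServiceEntry) -> None:
        self._entries.append(entry)

    def find(self, name: str, max_price: float) -> Optional[ServiceEntry]:
        """Return a matching service whose price the requestor will accept."""
        for entry in self._entries:
            if entry.name == name and entry.price_per_gb <= max_price:
                return entry
        return None

# Provider side: publish a fuzzy-deduplication service to the directory.
directory = ServiceDirectory()
directory.publish(ServiceEntry(
    name="fuzzy-dedupe",
    price_per_gb=0.05,
    binding=lambda request: f"deduplicated({request})",
))

# Requestor side: find a service within budget, then bind to it and invoke it.
service = directory.find("fuzzy-dedupe", max_price=0.10)
if service is not None:
    print(service.binding("customer_records.csv"))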
5 Summary

Grid technology presents a framework that aims to provide access to heterogeneous resources in a secure, reliable and scalable manner across various administrative boundaries. The data-mining sector is an ideal candidate to exploit the benefits of such a framework. However, before widespread adoption happens within this sector, a number of fundamental areas will need to be addressed.

Heterogeneous resources: The challenge in Grid-enabling a commercial product comes from the requirements to use a mixture of different resources and platforms, to make efficient use of mixtures of nodes with very different performance and capabilities, and to manage external resources with unknown availability. All of this must be provided to a non-technical user in a simple and straightforward manner.

Management: As Grids evolve into heterogeneous arrays of heterogeneous nodes, the monitoring and management of such Grids becomes more and more critical. To date, little or no work has been undertaken to investigate a cohesive strategy for managing such arrays of heterogeneous Grid elements, or how such monitoring and management strategies will be integrated into applications and existing enterprise management solutions.

References

[1] OGSA, http://www.globus.org/ogsa/
[2] OGSI, http://www.gridforum.org/ogsi-wg/
[3] S. Burbeck, "The Tao of e-Business Services", IBM Corporation (2000); see http://www4.ibm.com/software/developer/library/wstao/index.html