How Several Different Investigation Groups Can Share a Single, Secure Database of Rhino Horn Criminal Traffickers

Timothy C. Haas, haas@uwm.edu

Draft of May 21, 2014

1 A Federated Database

Below, a method is given for the development of a single, virtual database of all evidence gathered on persons suspected of participating in illegal rhino horn trafficking networks (hereafter called players or suspects). This database would hold data from several different investigation groups who work within national parks, provincial governments, and the national government. Reports generated from this database would allow sophisticated social network analyses to be conducted to identify key players. The removal of these key players would cause the greatest disruption to each network’s criminal activities.

Criminal trafficking networks consist of several tiers of middlemen. Players at higher tiers receive rhino horn from several different subnetworks of poachers and runners. Combining data on these subnetworks into one network allows analysis to identify increasingly higher-tiered players (typically those in control of funds and with export capabilities), and to reveal the true extent of the network. If, on the other hand, subnetworks are analyzed in isolation from one another, these high-level middlemen can be misidentified as peripheral players.

Federation members would, at their discretion, share data with each other in the course of an investigation. If third-tier middlemen are funding poaching raids in widely separated parts of South Africa, the use of a federated database would allow long-distance links to be discovered and made known to all investigation groups.

The first two steps of any federated database option are as follows.
First, each investigation group’s evidence storage software system is modified to allow it to read input from a report file generated by either MySQL or MS Access (hereafter referred to as the database engine). Next, data entry protocols are changed so that each investigation group enters new evidence directly into their local, secure database engine.

When a new set of evidence is received, the investigation group would enter it into their database engine. If the investigation group possesses a criminal intelligence software system such as IBM’s Analyst’s Notebook or SAS Law Enforcement and Public Safety System (formerly MEMEX), they would also perform the following two steps:

1. have their database engine write an output file of this evidence,
2. read this output file into their criminal intelligence software system.

With these steps, an investigation group would enter new evidence only once into their computers.

In the following sections, different options for implementing a federated database are described. A working example of a zero-cost option is provided. Next, a method is given for sending de-identified network linkage data to an outside specialist for analysis. Finally, a workshop is outlined that would train investigation group personnel on the basics of querying databases and on social network analysis as applied to criminal networks.

2 Implementing a Federated Database

2.1 A Zero Cost Implementation

The following protocol could be used to query a virtual database composed of all evidence aggregated across all investigation groups.

1. A member of the federation of investigation groups, called herein the requester, sends an email containing a Structured Query Language (SQL) query to each federation member.

2. Upon receipt of an email query, a member may choose to ignore it due to lack of trust in the requester.
Or, the member may run the query against their local MySQL database and send the query’s result back to the requester as an encrypted file attached to an email. Several free utilities are available on the web that encrypt files using a shared key. Investigation groups would share this key amongst themselves by meeting once at a centrally located coffee shop and sharing a common key string. MySQL is a free database program (see below).

3. The requester collects all received query responses into a single data file. All of these different, local databases taken together form a virtual, single database (often called a federated database) of all players that these investigation groups are gathering evidence on. Haase et al. (2010) review different architectures for federated databases and conclude that, for data that is changing frequently, querying member databases is more efficient and less expensive than first building a master database and then querying it.

4. The requester uses this single data file to identify all links between criminal network players and creates a Pajek-readable link file. Pajek is a free social network analysis software package (Nooy et al. 2011).

5. The requester reads this link file into Pajek, draws the network, and computes its centrality measures. This analysis produces actionable intelligence in the form of a ranked list of suspects to take into custody.

It is possible to automate the receipt of the incoming query request email, the execution of its query, and the generation of the outgoing email message containing the query’s results by using a free set of macros written by Mehta and Williams (2002). This approach does not require any group’s database to be exposed to web hacking, nor does it require the purchase of any software beyond MS Outlook.
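
Steps 3 and 4 of the protocol above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the package described in this report: the function name build_pajek_net, the input layout (one list of linked player pairs per responding member), and the choice to use the number of reported links as the edge weight are all assumptions made for the example. The output follows Pajek's plain-text network format (a *Vertices block followed by an *Edges block).

```python
from collections import Counter


def build_pajek_net(responses):
    """Merge link pairs from several members into Pajek .net text.

    `responses` is a hypothetical layout: one list per federation member,
    each holding (player_a, player_b) pairs returned by that member's query.
    """
    weights = Counter()
    for pairs in responses:
        for a, b in pairs:
            if a == b:
                continue  # ignore self-links
            # Links are undirected, so store each pair in sorted order.
            weights[tuple(sorted((a, b)))] += 1

    # Pajek numbers vertices from 1.
    players = sorted({p for pair in weights for p in pair})
    index = {name: i for i, name in enumerate(players, start=1)}

    lines = [f"*Vertices {len(players)}"]
    lines += [f'{index[name]} "{name}"' for name in players]
    lines.append("*Edges")
    for (a, b), w in sorted(weights.items()):
        # Assumed weighting: how many times the link was reported overall.
        lines.append(f"{index[a]} {index[b]} {w}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up codenames standing in for two members' query results.
    group_a = [("P1", "P2"), ("P2", "P3")]
    group_b = [("P3", "P2"), ("P3", "P4")]
    print(build_pajek_net([group_a, group_b]), end="")
```

Writing the returned text to a file with a .net extension gives a file that Pajek can open. In this made-up example the link between P2 and P3 is reported by both groups, so it receives a weight of 2 in the merged network.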

2.1.1 MySQL Database of Criminal Wildlife Traffickers

If an investigation group does not have the resources to purchase a criminal intelligence database package, a free package has been written and is available at

www4.uwm.edu/people/haas/sna

This package is written for MySQL, a free database system. To run this example, a version of MySQL that is 5.2.3 or later is needed. One download site that offers a virus-free download is

http://mysql-com.en.softonic.com/

The example consists of three files: createdatabase.sql, addtodatabase.sql, and querydatabase.sql. The first file creates a six-table database: players, phones, cars, guns, random identifiers, and encryption key. The second file gives an example of adding three suspects to the database. The third file runs the query needed to create a file suitable for social network analysis. The query file performs two tasks as follows.

1. First, internally generated random identifier numbers are assigned to suspects, which relieves the data manager of having to maintain a log relating local suspect IDs to local suspect names. These random IDs are then used in lieu of suspect names during the creation of the first output file of the query. Note, however, that random identifiers are not a substitute for encrypted names in a federated database because there is no guarantee that the same random identifier will be assigned to a suspect who is present in the local databases of two or more investigation groups (see the next section).

2. Second, using a given key, a second output file of the query is written wherein suspect names have been encrypted via an algorithm that is currently considered to be unbreakable: the AES encryption algorithm of Daemen and Rijmen (2003).

2.2 A Low Cost Implementation

To have dedicated database support, each federation member would also purchase MS Access or MS Office (MS Access is bundled in this package).
In addition to Microsoft support for MS Access, there is a large user base with several active forums.

2.3 More Expensive Alternatives

Federation members could achieve secure transmissions by jointly purchasing a Virtual Private Network (VPN). Also, there are many web-based database systems that offer greater opportunities for data integration, such as one based on a set of distributed Microsoft SQL Servers. These higher-performance solutions, however, are more expensive and require stronger Information Technology (IT) support.

3 De-identifying Data for Outside Analysis

There are many criminal network analyses that Pajek cannot perform, such as (a) automatic weighting of links by each player’s number of phones, cars, or guns, (b) reconstruction of the network (prediction of unobserved links and group memberships), (c) prediction of who will succeed an arrested player, and (d) an optimal arrest strategy. The services of an outside specialist may be needed to perform such analyses. A data file sent to any outside specialist, however, should not contain any classified, private, or confidential information. The action of replacing suspect names in a database with encrypted or random identifiers is referred to as de-identifying or de-classifying the database.

If the network is changing frequently through time, such analyses may need to be rerun every week. In this case, de-identification needs to be automatic. One approach is as follows.

1. An SQL query script would be provided to each investigation group by the outside specialist. This script, when run against a federation member’s local database, would de-identify each suspect’s name by replacing the name with an encrypted name (hereafter called a codename). Because all investigation groups would use the same encryption key as discussed above, these codenames could not be reverse engineered back to the original names by anyone not having the secret key.
As in the discussion above, because this name encryption step assigns a unique codename to a unique player name, all of these different, local databases taken together would form a virtual, single database of all players that these investigation groups are gathering evidence on.

A second approach is to use a commercial cipher program. Two such programs are:

www.littlelite.net/ncryptxl

and

www.extendoffice.com/order/kutools-for-excel.html

Finally, a third alternative is to use a bit manipulation cipher in Excel. If the AES algorithm is not used, the next best way is to use a bit manipulation (xor) cipher with a key that is almost as long as the total number of characters in the complete suspect name list. This way, the cipher is essentially unbreakable because it is almost a one-time pad cipher. This is what is programmed in the Excel VBA code deident.bas that is available at

www4.uwm.edu/people/haas/sna

To install this macro, do the following:

(a) Start Excel.
(b) ??

The Excel spreadsheet, deident.xlsm, on the above website contains this macro.

Note that all three alternatives will support a federated database only if all participants use the same key. Analyst’s Notebook has a declassify link chart function that replaces all names on a link chart with non-traceable identifier strings. But this feature will not support a federated database because the cipher key used by Analyst’s Notebook is not necessarily the same across all federation members.

2. The SQL query script would ask for: (a) each pair of players that are linked through (say) an intercepted phone call, (b) how many phones, and/or cars, and/or guns each player has, and (c) for each pair of players, the number of evidence items that mention both of them.

3. Each investigation group would then create an email message containing this de-identified data file and send it to the outside specialist for social network analysis.
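
The requirement that every federation member derive the same codename for the same suspect can be illustrated with a short Python sketch. It is not the SQL script or the Excel macro described in this report, and it swaps in a different technique: a keyed hash (HMAC-SHA256) rather than AES, because Python's standard library has no AES routine. Unlike encryption, a keyed hash cannot be reversed even by key holders, who would instead recompute codenames from their own name lists. The function names, the upper-casing and whitespace normalization, and the 12-character truncation are all assumptions made for the example.

```python
import hashlib
import hmac


def codename(name, shared_key, length=12):
    """Derive a codename from a suspect name and the federation's shared key.

    Keyed hashing stands in here for the AES step; it is deterministic, so
    every group holding the same key derives the same codename.
    """
    # Assumed normalization so that spacing and case differences do not matter.
    normalized = " ".join(name.split()).upper()
    digest = hmac.new(shared_key, normalized.encode("utf-8"), hashlib.sha256)
    # Truncation is a readability choice; shorter codenames collide sooner.
    return digest.hexdigest()[:length]


def deidentify_links(links, shared_key):
    """Replace both names in each (name_a, name_b) link with codenames."""
    return [(codename(a, shared_key), codename(b, shared_key)) for a, b in links]


if __name__ == "__main__":
    key = b"example shared key string"  # placeholder, not a real key
    # Two groups record the same fictional suspect with different spacing/case.
    group_a = deidentify_links([("Suspect Alpha", "Suspect Beta")], key)
    group_b = deidentify_links([("suspect  alpha", "Suspect Gamma")], key)
    print(group_a)
    print(group_b)
```

Because both groups use the same key, the first codename in each printed pair is identical, so the outside specialist can join the two groups' links without ever seeing a name. A group using a different key would produce unrelated codenames, which is the same reason the Analyst's Notebook declassify function does not support a federated database.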

Upon receipt of all data files from cooperating federation members, the specialist would:

(a) aggregate all data files into one file,
(b) analyze the aggregated network,
(c) share the results of the analysis with all investigation groups.

Because a particular suspect’s name may differ to some degree across the databases of the federation, the specialist will replace a pair of suspects with one suspect if the pair have nearly identical addresses and, optionally, car registration numbers.

4 Summary

The tasks listed below would enable investigation groups to implement a federated database and grasp the essentials of social network analysis.

1. Agree to set up a federated database to gather information on organized criminals engaged in wildlife trafficking.
2. Implement the federated database in either MySQL or Excel.
3. Compute social network analysis centrality measures on this federated database to identify the network’s current set of key players.
4. Also using this federated database, predict unobserved links, predict who will succeed an arrested player, and develop an optimal strategy for arresting suspects.

References

Daemen, J., and Rijmen, V. (2003), The Rijndael Block Cipher, AES Proposal. Retrieved December 20, 2013, from http://csrc.nist.gov/archive/aes/rijndael/Rijndael-ammended.pdf

Haase, P., Mathäß, T., and Ziller, M. (2010), “An Evaluation of Approaches to Federated Query Processing over Linked Data,” I-SEMANTICS ’10 Proceedings of the 6th International Conference on Semantic Systems, Article No. 5, September 1-3, Graz, Austria. Retrieved November 8, 2013, from http://dl.acm.org/citation.cfm?id=1839713

Mehta, A., and Williams, D. (2002), “SQL and Outlook: Enable Database Access and Updates through Exchange and Any E-mail Client,” MSDN Magazine, Microsoft Corporation. Retrieved November 8, 2013, from msdn.microsoft.com/en-us/magazine/cc301799.aspx

Nooy, Wouter de, Mrvar, A., and Batagelj, V.
(2011), Exploratory Social Network Analysis with Pajek, Second Edition, Cambridge, U.K.: Cambridge University Press. Software may be freely downloaded from http://pajek.imfm.si Retrieved May 11, 2013.