Knowledge Management, Social Network Analysis, and Knowledge Discovery for Homeland Security
Sidd Kaza (sidd@u.arizona.edu)
MIS 480/580
Feb 20, 2007

Outline
• Knowledge Management using COPLINK
• Social Network Analysis of Criminal Networks
• Social Network Concepts using NetDraw

Knowledge Management using COPLINK
• COPLINK is a knowledge management system that integrates information from multiple law-enforcement agencies.
• It incorporates algorithms for cross-jurisdictional social network analysis, knowledge discovery, and visualization for intelligence, border safety, and national security applications.

Multiple Isolated Data Sources within a Single Agency
[Diagram: Records Management System (RMS), Gang Database, Mug Shots Database, Tucson Police Department Records System]

Isolated Agencies Share Limited Information through State and Federal Systems
[Diagram: Pima County Systems, Tucson Police Department Systems, Phoenix Police Department Systems]

Provide Access to Information using One Friendly Interface
[Diagram: Records Management System (RMS), Gang Database, Mugshots Database]

Consolidated Information Provides Opportunities for Analytical and Data Analysis Applications

COPLINK™ Information Retrieval Interface

Query Parameters and Filters
Running the query with filters.

Person Search Results
A search for White males named Mike, aged 20-35, 5’5” to 6’3”, 150 to 250 lbs, returns a generic set of results (24 persons).

Association Retrieval and Visualization

Spatio-temporal Analysis and Visualization

Outline
• Knowledge Management using COPLINK
• Social Network Analysis of Criminal Networks
• Social Network Concepts using NetDraw

Criminal Activity Networks
• Criminal Activity Networks (CAN) are networks of people, vehicles and locations that are linked by law enforcement information.
• These networks allow us to understand the complex relationships between people and vehicles.
• Analysis of the topological characteristics of these networks helps us better understand their governing mechanisms.
• In this study we analyze the topological characteristics of CANs of people and vehicles in a multiple jurisdiction scenario to support border and transportation security.

Literature Review
• Criminal Activity Network extraction
• Previous studies of complex networks
• Topological characteristics of networks
• The theory of growth in networks

Criminal Activity Network Extraction
• The extraction of CANs involves analyzing information from many different datasets.
• Accessing information from multiple sources poses many challenges that are documented in the literature. [Garcia-Molina, 2002; Rahm, 2001]
• This study uses the BorderSafe information sharing and analysis framework. [Marshall et al., 2004]
• Using the framework, law enforcement and other datasets are accessed such that they are amenable to network extraction and analysis.

Complex Networks: Previous Studies
• There have been various studies to understand the characteristics of large and complex networks.
• The studies have explored the topology, evolution, robustness and other properties of real world networks.
  – The World Wide Web [Albert, Jeong and Barabasi, 1999; Kumar et al., 2000]
  – Cellular and metabolism networks [Jeong et al., 2000]
  – Citation networks [Redner, 1998]
• Most real world networks were found to have similar topological and evolutionary characteristics. [Albert and Barabasi, 2002]

Topological Characteristics
• Topological characteristics are used to study networks at a macro level.
• Three concepts dominate the statistical study of topology: [Albert and Barabasi, 2002]
  – Small world
    • Despite the large size of networks, nodes often have relatively short paths between them.
  – Clustering
    • The tendency of nodes to cluster together to form cliques, representing circles of friends in which every member knows every other member.
  – Degree distribution
    • The distribution of edges among nodes, where different nodes may have different numbers of edges.

Small World
• The small world concept is important as it can depict the communications within a network.
• Communication can range from the spread of disease in human populations and the spread of viruses on the Internet to the passage of messages and commands in a criminal network.
• The small world property of a network is measured by the average path length. [Albert and Barabasi, 2002]
• The average shortest path lengths of many real networks have been measured.
  – Movie actors were found to be an average distance of 3.65 from each other. [Watts & Strogatz, 1998]
  – Average paths between co-authors in MEDLINE were 4.6. [Newman, 2001]
• Shortest path lengths of social networks are small due to the presence of shortcuts between otherwise distant people. [Watts, 1999; Nishikawa et al., 2002]

Clustering
• Individuals in social networks often form cliques.
• Examples of cliques in social networks include authors collaborating in a co-authorship network and websites pointing to each other on the web.
• The tendency to form cliques is measured by the clustering coefficient (CC). For a single node, it is the ratio of the number of edges that exist among the node's neighbors to the total number of possible edges among them; the network CC is the average over all nodes. [Albert and Barabasi, 2002]
• Real networks tend to have a high CC compared to random graphs:
  – Movie actors: 0.79 [Watts & Strogatz, 1998]
  – MEDLINE co-authorship: 0.066 [Newman, 2001]
• The CC in a criminal network points to the tendency of individuals to collaborate and partner in crimes.

Degree Distribution
• Nodes in a network have different numbers of edges connecting them. The number of edges connected to a node is called its degree.
• The spread in node degrees is given by a distribution function P(k), which gives the probability that a randomly selected node has exactly ‘k’ edges.
  [Albert and Barabasi, 2002]
• The distribution functions of most real world networks follow power law scaling with varying exponents:
  – Movie actors: exponent of 2.3. [Watts & Strogatz, 1998]
  – MEDLINE co-authorship: exponent of 1.2. [Newman, 2001]
• In criminal networks, a high degree may indicate that an individual is a leader. [Xu and Chen, 2004]
• The degrees of nodes are also used to study the growth and evolution of networks.

Growth in Networks
• Most real world networks (including CANs) are not static and grow due to the addition of nodes and/or edges.
• The growth of networks changes their topological characteristics.
• Two mechanisms govern evolving networks: [Barabasi and Albert, 1999; Dorogovtsev, Mendes and Samukhin, 2000; Newman, 2001]
  – Growth: networks expand continuously by adding new nodes, and
  – Preferential attachment: new nodes attach preferentially to nodes that are already well connected.

Preferential Attachment
• Network growth involves adding new nodes (and edges) to the set of current nodes.
• Preferential attachment assumes that the probability that a new node will connect to an existing node i depends on the degree of that node.
  – The higher the degree of the existing node, the higher the probability that new nodes will attach to it.
• The functional form of preferential attachment (∏(k)) for a network can be measured by observing the nodes present in the network and their degrees. [Albert and Barabasi, 2002]

Preferential Attachment: Previous Studies
• ∏(k) for co-authorship, citation, actor and Internet networks was found to follow a power law. [Jeong, Neda and Barabasi, 2003; Newman, 2001]
• However, in some cases ∏(k) may grow linearly up to a point and then fall off at high degrees. [Newman, 2001]
• This implies that high degree nodes are not able to attract additional new nodes.
• Constraints to growth are also seen in criminal networks.
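
As a rough illustration of the quantities on the preceding slides, the sketch below computes the degree distribution P(k) from an edge list, then one simple estimate of the attachment rate ∏(k) and its cumulative form. It is a minimal sketch under stated assumptions, not the procedure used in the cited studies. The function names and the toy edge lists are hypothetical. So is the choice to credit only links that join a new node to an existing node.

```python
from collections import Counter


def degrees(edges):
    """Return {node: degree} for an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def degree_distribution(deg):
    """P(k): fraction of nodes that have exactly k edges."""
    n = len(deg)
    counts = Counter(deg.values())
    return {k: c / n for k, c in sorted(counts.items())}


def attachment_rate(old_edges, new_edges):
    """Estimate the relative attachment rate for each degree k.

    For every new edge that joins a new node to an existing node, credit
    the existing node's old degree. Divide by the number of existing nodes
    with that degree, then normalise so the rates sum to 1.
    (Illustrative assumption: edges between two existing nodes are ignored.)
    """
    old_deg = degrees(old_edges)
    nodes_with_k = Counter(old_deg.values())
    gained = Counter()
    for u, v in new_edges:
        u_old, v_old = u in old_deg, v in old_deg
        if u_old != v_old:  # exactly one endpoint already existed
            existing = u if u_old else v
            gained[old_deg[existing]] += 1
    raw = {k: gained[k] / n for k, n in nodes_with_k.items()}
    total = sum(raw.values())
    return {k: (raw[k] / total if total else 0.0) for k in sorted(raw)}


def cumulative(rate):
    """Running sum of the attachment rate over increasing k."""
    running = 0.0
    out = {}
    for k in sorted(rate):
        running += rate[k]
        out[k] = running
    return out


# Hypothetical toy data: an existing network and the links added later.
old_edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
new_edges = [("e", "a"), ("f", "a"), ("g", "a"),
             ("h", "b"), ("i", "c"), ("j", "d"),
             ("b", "d")]  # both endpoints existed already, so it is skipped

print(degree_distribution(degrees(old_edges)))
print(attachment_rate(old_edges, new_edges))
print(cumulative(attachment_rate(old_edges, new_edges)))
```

In this toy example the single degree-3 node receives a larger share of new links than the lower degree nodes, which is the pattern preferential attachment predicts. A rate that flattens or falls at high k would instead point to a constraint on growth.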

Constraints on Growth of a Network
• Constraints on the number of links a node can attract may be due to: [Amaral et al., 2000]
  – Aging: Since networks grow over time, some high degree nodes might become too old to participate in the network (e.g., actors in a movie network).
  – Cost: It might become costly for a node to attach to a large number of nodes.
• Constraints on the growth of networks may be domain specific and have been studied in many domains:
  – In plant-animal pollination networks, some animals cannot pollinate certain plants; hence a link cannot be established. [Jordano, Bascompte and Olesen, 2003]
  – In criminal networks, trust may restrict the growth of networks. Criminals and terrorists do not include many people in their inner trust circle. [Klerks, 2001]

Research Questions
• What are the topological characteristics of criminal networks?
• How does cross-jurisdictional data affect the topological characteristics of criminal networks?
• How do criminal networks grow when data from more jurisdictions is added?

Research Testbed
• The testbed for this study contains incident reports of all the individuals and vehicles involved in crimes in the jurisdictions of Tucson Police Department (TPD) and Pima County Sheriff’s Department (PCSD) from 1990 to 2002.

        Incidents      Individuals    Vehicles
  TPD   2.99 million   1.44 million   675,000
  PCSD  2.18 million   1.31 million   520,000

• A CAN consists of individuals and vehicles represented as nodes and police incidents represented as edges.
• Two nodes have an edge between them when they are involved in the same police incident.
• Narcotics networks are extracted from the testbed.

Research Design
• The study is divided into three parts:
  – Characteristics of criminal networks in a single jurisdiction.
    • Narcotics networks that include individuals and incidents reported in a single jurisdiction are analyzed.
  – Characteristics of the networks by combining data from multiple jurisdictions.
    • Narcotics networks including individuals and incidents reported in both TPD and PCSD are analyzed.
  – The implications of the topological properties of these networks are explained in the law enforcement domain.

Experiment Results
Narcotics Networks in a Single Jurisdiction
Basic Statistics

                          TPD                  PCSD
  Nodes                   31,478 individuals   11,173 individuals
  Edges                   82,696               67,106
  Giant component         22,393 (70%)         10,610 (94%)
  2nd largest component   41                   103
  Link density            0.0002               0.0008

Experiment Results
Single Jurisdiction (cont.)
Small World Properties

                                     TPD                   PCSD
  Clustering Coefficient             0.39 (1.39 x 10^-4)   0.53 (4.08 x 10^-4)
  Average Shortest Path Length (L)   5.09                  4.62
  Diameter                           22                    23

Values in parentheses are values for a random network of the same size and average degree.

Implications of the Small World Property
• The narcotics networks in both jurisdictions can be classified as small world networks.
• The clustering coefficients of the networks are much larger than their random counterparts.
  – This suggests that criminals show the tendency to form circles of associates where members commit crimes together.
  – This is not unusual in narcotics networks where an individual commits crimes with friends and people in his trust circle.
  – This property works as an asset to law enforcement in identifying criminal conspiracies.
• A short L in a narcotics network has important implications for both crime and law enforcement:
  – It improves the speed of flow of information and goods in the network.
  – It also suggests that criminals often commit crimes with individuals outside their group. This creates the shortcuts needed to reduce L.
  – A short average path length has positive implications for law enforcement too. Short paths between criminals generate better leads in crime investigations.

Single Jurisdiction (cont.)
Degree Distributions
[Figure: log-log plots of the cumulative degree distribution for the TPD Narcotics Network and the PCSD Narcotics Network; axes: cumulative p(k) vs. k]
These diagrams show the log-log plots of the cumulative degree distribution (p(k)) vs. the degree (k). The insets are p(k) vs. k. The solid line is the truncated power law curve.

Implications of the Scale Free Property
• The narcotics networks in both jurisdictions can be classified as scale free (SF) networks.
• This implies that a large number of individuals do not have many associates, but a few have a large number of associates.
• The exponents in both power law decays are very small (0.85 – 1.3). The distribution decays slowly for lower degrees, indicating that there are a large number of nodes with small degrees.
  – This is not unexpected, as criminals with high degrees attract more attention from law enforcement authorities, so having fewer associates is beneficial.
• The truncated power law fits better (R^2 = 93%) than the power law distribution (R^2 = 85-87%).
  – As the number of links (k) grows, the probability of nodes having ‘k’ links decreases.
  – This might indicate the cost or trust constraint (criminals may not want to attach to many people) to growth.

Growth in Multiple Jurisdictions
This curve shows the preferential attachment when the narcotics network in TPD is augmented with data from PCSD.
[Figure: Preferential Attachment (TPD < PCSD); axes: K(k) vs. k]
The dashed line above the curve shows a linear preferential attachment growth; the solid line shows the state of no preferential attachment.

Preferential Attachment: Implications
• The curves lie above and grow faster than the solid line, offering visual evidence of the presence of preferential attachment.
• Two properties of growth between jurisdictions are worth noting:
  – The curve maintains linearity at low values of k. The linearity breaks down for higher degrees.
  – In totality, the lower degree nodes attract more nodes towards themselves than higher degree nodes.

Preferential Attachment: Implications (cont.)
• Break in Linearity
  – The slow growth of nodes with high degree can be attributed to the nature of the networks being studied.
  – Cost/Trust effect: Criminals may prefer not to be related to a large number of individuals because of the risk of drawing attention. Thus, the cost of acquiring more links is high, which might prevent a node with a large number of links from acquiring more.
  – External influences: Law enforcement limits the number of crimes an individual can commit.

Preferential Attachment: Implications (cont.)
• Lower degree nodes attract more nodes
  – The data on police incidents is drawn from two different jurisdictions.
  – A criminal might be committing more crimes in one jurisdiction and not the other.
  – Thus, one jurisdiction may have incomplete information about the activity of some criminals in the network.
  – These criminals will have a low degree in one jurisdiction.
  – On adding the second jurisdiction, the degrees of these criminals increase since they commit more crimes in the second jurisdiction.
  – This will lead to lower degree nodes attracting more nodes than higher degree nodes.

Conclusions
• This study focused on topological properties of criminal activity networks and their link to law enforcement, border and transportation security.
• Criminal networks are small world networks with scale free distributions. These topological characteristics have important implications for law enforcement and hence transportation security.
• A single jurisdiction contains incomplete information on criminals and cross-jurisdictional data provides an increased number of higher quality investigative leads.
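
The small world measures used throughout this study can be tried out on a toy network. The sketch below builds a network in which two entities are linked when they appear in the same incident, then computes the average clustering coefficient and the average shortest path length. It is only an illustration under stated assumptions: the incident identifiers, person labels and function names are hypothetical, and plain breadth-first search stands in for whatever tooling the study itself used.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_network(incidents):
    """Link every pair of entities that appear in the same incident."""
    adj = defaultdict(set)
    for members in incidents.values():
        for u, v in combinations(sorted(set(members)), 2):
            adj[u].add(v)
            adj[v].add(u)
    return dict(adj)


def clustering_coefficient(adj):
    """Average over nodes of (links among neighbours) / (possible links)."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue  # a node with fewer than two neighbours contributes 0
        links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
        total += links / (k * (k - 1) / 2)
    return total / len(adj) if adj else 0.0


def average_path_length(adj):
    """Mean shortest-path distance over all connected pairs (BFS per node)."""
    total = pairs = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical incident reports: incident id -> people involved.
incidents = {
    "I1": ["p1", "p2", "p3"],
    "I2": ["p3", "p4"],
    "I3": ["p4", "p5"],
}

network = build_network(incidents)
print("clustering coefficient:", clustering_coefficient(network))
print("average path length:", average_path_length(network))
```

A network counts as small world when its clustering coefficient is much higher than that of a random network of the same size and average degree while its average path length stays short, which is the comparison reported in the Small World Properties table.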

Outline
• Knowledge Management using COPLINK
• Social Network Analysis of Criminal Networks
• Social Network Analysis Concepts using NetDraw

Online Sources
• Studies discussed today
  – http://ai.eller.arizona.edu/paper_conf/index.htm
• Visualize your social/organizational networks
  – http://www.touchgraph.com/