Siddharth Kaza
(sidd@u.arizona.edu)
MIS 596A
Mar 21, 2007
1
• Homeland Security Centers of Excellence
• Information Retrieval using COPLINK
• Social Network Analysis of Criminal Networks
• Criminal Network Visualizer: System Demo
• Mutual Information Analysis to Identify High-risk Vehicles
2
• The Department of Homeland Security (DHS) Science and Technology office funds and establishes Centers of Excellence (COE) to conduct fundamental research and education on homeland security issues.
• Each center is led by a university in collaboration with partners from other institutions, agencies, laboratories, think tanks, and the private sector.
• Generate and disseminate knowledge and technology to advance the homeland security mission.
• Create and leverage intellectual capital and nurture a homeland security science and engineering workforce.
3
• The Center for Risk and Economic Analysis of Terrorism Events (CREATE), led by the University of Southern California, evaluates the risks, costs, and consequences of terrorism.
• The National Center for Food Protection and Defense (NCFPD), led by the University of Minnesota, defends the safety of the food system from pre-farm inputs through consumption by establishing best practices.
• The National Center for Foreign Animal and Zoonotic Disease Defense (FAZD), led by Texas A&M University, protects against the introduction of high-consequence foreign animal and zoonotic diseases.
• The National Consortium for the Study of Terrorism and Responses to Terrorism (START), led by the University of Maryland, informs decisions on how to disrupt terrorists and terrorist groups.
4
• The National Center for the Study of Preparedness and Catastrophic Event Response (PACER), led by Johns Hopkins University, optimizes our Nation's preparedness in the event of a high-consequence natural or man-made disaster.
• The Center for Advancing Microbial Risk Assessment (CAMRA), led by Michigan State University, fills critical gaps in risk assessments for decontaminating microbiological threats.
• The University Affiliate Centers to the Institute for Discrete Sciences (IDS-UACs) are led by Rutgers University, USC, UIUC, and the University of Pittsburgh.
• The Regional Visualization and Analytics Centers (RVACs), led by Penn State University, Purdue University, Stanford University and others, conduct research on visually-based analytic techniques that help people gain insight from complex, conflicting, and changing information.
5
• Major research areas:
– Surveillance, Screening, Data Fusion, and Situational Awareness
• How can border inspection processes be strengthened?
• How can high-risk traffic be distinguished from low-risk traffic?
• What emerging methods of collecting, fusing, integrating and analyzing information offer promise of increasing border security?
– Population Dynamics and Immigration Administration
– Command, Control, and Communications
– Immigration Policy and Effects
– Border Risk Management
6
[Figure: research framework. Law-enforcement data (AZ, CA, TX), border crossing data (AZ, CA, TX; vehicles and people), and geo-coded illegal alien/vehicle data feed data fusion and analysis (information retrieval, network analysis, spatio-temporal analysis) for situational awareness. An inset plots the crossings of Vehicle A and Vehicle B, frequent crossers at night, against dates from 2004 to 2005.]
• Screening (high-risk vehicle identification): Identify high-risk vehicles using association techniques like mutual information to incorporate crossing frequency and law-enforcement data.
• Analysis of criminal networks: Study criminal networks using social network analysis techniques to understand and predict crime patterns.
• Homeland Security Centers of Excellence
• Information Retrieval using COPLINK
• Social Network Analysis of Criminal Networks
• Criminal Network Visualizer: System Demo
• Mutual Information Analysis to Identify High-risk Vehicles
8
• COPLINK is an information retrieval and analysis system that integrates information from multiple law-enforcement agencies.
• It incorporates algorithms for cross-jurisdictional social network analysis, knowledge discovery, and visualization for intelligence, border safety, and national security applications.
9
[Figure: Tucson Police Department records systems: Gang Database, Records Management System (RMS), Mug Shots Database.]
10
[Figure: Pima County systems, Tucson Police Department systems, Phoenix Police Department systems.]
11
[Figure: Gang Database, Records Management System (RMS), Mug Shots Database.]
12
Consolidated Information Provides Opportunities for Analytical and Data Analysis Applications
13
14
Running the query with filters.
15
A search for White males named Mike, aged 20-35, 5’5” to 6’3”, 150 to 250 lbs, returns a generic set of results (24 persons).
16
17
18
• Homeland Security Centers of Excellence
• Information Retrieval using COPLINK
• Social Network Analysis of Criminal Networks
• Criminal Network Visualizer: System Demo
• Mutual Information Analysis to Identify High-risk Vehicles
19
• Criminal Activity Networks (CAN) are networks of people, vehicles and locations that are linked by law enforcement information.
• These networks allow us to understand the complex relationships between people and vehicles.
• Analysis of the topological characteristics of these networks helps better understand their governing mechanisms.
• In this study we analyze the topological characteristics of CANs of people and vehicles in a multiple-jurisdiction scenario to support border and transportation security.
20
• Criminal Activity Network extraction
• Previous studies of complex networks
• Topological characteristics of networks
• The theory of growth in networks
21
• The extraction of CANs involves analyzing information from many different datasets.
• Accessing information from multiple sources poses many challenges that are documented in literature.
[Garcia-Molina, 2002; Rahm, 2001]
• This study uses the BorderSafe information sharing and analysis framework.
[Marshall et al., 2004]
• Using the framework, law enforcement and other datasets are accessed in a form amenable to network extraction and analysis.
22
• There have been various studies to understand the characteristics of large and complex networks.
• The studies have explored the topology, evolution, robustness and other properties of real world networks.
– The World Wide Web
[Albert, Jeong and Barabasi, 1999; Kumar et al., 2000]
– Cellular and metabolism networks
[Jeong et al., 2000]
– Citation networks
[Redner, 1998]
• Most real world networks were found to have similar topological and evolutionary characteristics.
[Albert and Barabasi, 2002]
23
• Topological characteristics are used to study networks at a macro level.
• Three concepts dominate the statistical study of topology:
[Albert and Barabasi, 2002]
– Small world
• Despite the large size of networks, nodes often have relatively short paths between them.
– Clustering
• The tendency of nodes to cluster together to form cliques, representing circles of friends in which every member knows every other member.
– Degree distribution
• The distribution of edges among nodes, where different nodes may have different number of edges.
24
• The small world concept is important as it can depict the communications within a network.
• Communication can range from the spread of disease in human populations and spread of viruses on the Internet to passage of messages and commands in a criminal network.
• The small world property of a network is measured by the average path length.
[Albert and Barabasi, 2002]
• The average shortest path length of many real networks has been measured.
– Movie actors were found to be an average distance of 3.65 from each other.
[Watts & Strogatz, 1998]
– Average paths between co-authors in MEDLINE were 4.6.
[Newman, 2001]
• Shortest path lengths of social networks are small due to the presence of shortcuts between otherwise distant people.
[Watts, 1999; Nishikawa et al, 2002 ]
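The effect of shortcuts on average path length can be illustrated with a small sketch. It assumes an unweighted, undirected network stored as a dictionary of neighbor sets; the ring network and the single shortcut are hypothetical toy data, not taken from the study.

```python
from collections import deque

def average_shortest_path_length(adjacency):
    """Mean breadth-first-search distance over all ordered pairs of reachable nodes."""
    total, pairs = 0, 0
    for source in adjacency:
        distances = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in distances:
                    distances[neighbor] = distances[node] + 1
                    queue.append(neighbor)
        total += sum(distances.values())
        pairs += len(distances) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0

# Hypothetical toy network: six people in a ring.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

# The same ring with one shortcut between the two most distant people.
with_shortcut = {node: set(neighbors) for node, neighbors in ring.items()}
with_shortcut[0].add(3)
with_shortcut[3].add(0)

ring_length = average_shortest_path_length(ring)
shortcut_length = average_shortest_path_length(with_shortcut)
print(ring_length, shortcut_length)
```

Adding a single shortcut lowers the average path length of the toy ring, which is the mechanism the last bullet describes.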
25
• Individuals in social networks often form cliques.
• Examples of cliques in social networks include authors collaborating together in a co-authorship network and websites pointing to each other on the web.
• The tendency to form cliques is measured by the clustering coefficient (CC), which for a node is the ratio of the number of edges that exist among its neighbors to the total number of possible edges among them; the CC of a network is the average over its nodes.
[Albert and Barabasi, 2002]
• Real networks tend to have a high CC compared to random graphs:
– Movie actors: 0.79
[Watts & Strogatz, 1998]
– MEDLINE co-authorship: 0.066
[Newman, 2001]
• The CC in a criminal network points to the tendency of individuals to collaborate together and partner in crimes.
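A minimal sketch of the clustering coefficient follows, assuming the neighbor-based definition (edges present among a node's neighbors divided by the edges possible among them, averaged over all nodes). The four-person network is hypothetical.

```python
from itertools import combinations

def local_clustering(adjacency, node):
    """Share of possible links among a node's neighbors that actually exist."""
    neighbors = adjacency[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair to connect
    links = sum(1 for u, v in combinations(neighbors, 2) if v in adjacency[u])
    return links / (k * (k - 1) / 2)

def average_clustering(adjacency):
    """Network-level CC as the mean of the local coefficients."""
    return sum(local_clustering(adjacency, n) for n in adjacency) / len(adjacency)

# Hypothetical co-offending network: a, b, c form a clique; d is tied only to c.
network = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
cc = average_clustering(network)
print(round(cc, 3))
```

In this toy case the clique members a and b score 1, c scores 1/3 because d is not tied to its other associates, and d scores 0.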
26
• Nodes in a network have different number of edges connecting them. The number of edges connected to a node is called its degree.
• The spread in node degrees is given by a distribution function P(k), which gives the probability that a randomly selected node has exactly ‘k’ edges.
[Albert and Barabasi, 2002]
• The distribution functions of most real world networks follow power law scaling with varying exponents:
– Movie actors: exponent of 2.3.
[Watts & Strogatz, 1998]
– Medline co-authorship: exponent of 1.2.
[Newman, 2001]
• In criminal networks, a high degree may indicate that an individual has a leadership role.
[Xu and Chen, 2004]
• The degrees of nodes are also used to study the growth and evolution of networks.
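The degree distribution P(k) and its cumulative form can be sketched as below. The star-shaped network (one hub, four peripheral nodes) is hypothetical and only illustrates the calculation.

```python
from collections import Counter

def degree_distribution(adjacency):
    """P(k): fraction of nodes that have exactly k edges."""
    degrees = [len(neighbors) for neighbors in adjacency.values()]
    counts = Counter(degrees)
    return {k: counts[k] / len(degrees) for k in sorted(counts)}

def cumulative_distribution(p_k):
    """Fraction of nodes with degree greater than or equal to k."""
    cumulative, tail = {}, 0.0
    for k in sorted(p_k, reverse=True):
        tail += p_k[k]
        cumulative[k] = tail
    return dict(sorted(cumulative.items()))

# Hypothetical network: one hub linked to four otherwise unconnected nodes.
star = {"hub": {"n1", "n2", "n3", "n4"},
        "n1": {"hub"}, "n2": {"hub"}, "n3": {"hub"}, "n4": {"hub"}}

p_k = degree_distribution(star)
cumulative = cumulative_distribution(p_k)
print(p_k, cumulative)
```

Plotting the cumulative values against k on log-log axes is the usual way to inspect whether a network shows power law scaling.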
27
• Most real world networks (including CANs) are not static and grow due to the addition of nodes and/or edges.
• The growth of networks changes their topological characteristics.
• Two mechanisms govern evolving networks:
[Barabasi and Albert, 1999; Dorogovtsev, Mendes and Samukhin, 2000; Newman, 2001]
– Growth: networks expand continuously by adding new nodes, and
– Preferential attachment: new nodes attach preferentially to nodes that are already well connected.
28
• Network growth involves adding new nodes (and edges) to the set of current nodes.
• Preferential attachment assumes that the probability that a new node will connect to an existing node i depends on the degree of that node.
– The higher the degree of the existing node, the higher the probability that new nodes will attach to it.
• The functional form of preferential attachment (∏(k)) for a network can be measured by observing the nodes present in the network and their degrees.
[Albert and Barabasi, 2002]
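One way to estimate ∏(k) from two snapshots of a network is sketched below. It assumes that the degrees of the existing nodes are known before new data arrives and that each new link names the existing node it attaches to; the function name, node labels, and data are hypothetical, and this is not necessarily the procedure used in the cited studies.

```python
from collections import Counter

def attachment_rate(existing_degrees, new_link_targets):
    """Relative rate at which existing nodes of degree k receive new links.

    Links received by degree-k nodes are divided by the number of degree-k
    nodes, then normalized so the rates sum to 1.
    """
    nodes_with_degree = Counter(existing_degrees.values())
    links_to_degree = Counter(existing_degrees[t] for t in new_link_targets)
    raw = {k: links_to_degree[k] / n for k, n in nodes_with_degree.items()}
    total = sum(raw.values())
    return {k: (raw[k] / total if total else 0.0) for k in sorted(raw)}

# Hypothetical degrees of existing nodes before new data is added.
existing_degrees = {"n1": 1, "n2": 1, "n3": 1, "n4": 1, "n5": 2, "n6": 2, "n7": 4}
# Hypothetical existing node chosen by each newly added link.
new_link_targets = ["n7"] * 4 + ["n5", "n5", "n6", "n6"] + ["n1", "n2", "n3", "n4"]

pi_k = attachment_rate(existing_degrees, new_link_targets)
print(pi_k)
```

In this toy data the rate doubles when the degree doubles, which is what linear preferential attachment would look like; a flat ∏(k) would indicate no preference.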
29
• ∏(k) for co-authorship, citation, actor and Internet networks was found to follow a power law.
[Jeong, Neda and Barabasi, 2003; Newman, 2001]
• However, in some cases ∏(k) may grow linearly up to a point and then fall off at high degrees.
[Newman, 2001]
• This implies that the high degree nodes are not able to attract more new nodes.
• Constraints to growth are also seen in criminal networks.
30
• Constraints on the number of links a node can attract may be due to:
[Amaral et al, 2000]
– Aging: Since networks grow over time, some high degree nodes might become too old to participate in the network (e.g., actors in a movie network).
– Cost: It might become costly for a node to attach to a large number of nodes.
• Constraints on the growth of networks may be domain specific and have been studied in many domains:
– In plant-animal pollination networks, some animals cannot pollinate certain plants: hence a link cannot be established.
[Jordano, Bascompte and Olesen, 2003]
– In criminal networks, trust may restrict the growth of networks. Criminals and terrorists do not include many people in their inner trust circle.
[Klerks, 2001]
31
• What are the topological characteristics of criminal networks?
• How does cross-jurisdictional data affect the topological characteristics of criminal networks?
• How do criminal networks grow on adding data from more jurisdictions?
32
• The testbed for this study contains incident reports of all the individuals and vehicles involved in crimes in the jurisdictions of the Tucson Police Department (TPD) and the Pima County Sheriff’s Department (PCSD) from 1990 to 2002.
        Incidents       Individuals     Vehicles
TPD     2.99 million    1.44 million    675,000
PCSD    2.18 million    1.31 million    520,000
• A CAN consists of individuals and vehicles represented as nodes and police incidents represented as edges.
• Two nodes have an edge between them when they are involved in the same police incident.
• Narcotics networks are extracted from the testbed.
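The node and edge definition above can be sketched as follows, assuming each incident record is simply a list of the individuals and vehicles involved. The incident identifiers and entity labels are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def extract_can(incidents):
    """Build an undirected network: entities are nodes, shared incidents are edges."""
    adjacency = defaultdict(set)
    for entities in incidents.values():
        for entity in entities:
            adjacency.setdefault(entity, set())  # keep entities with no partners
        for u, v in combinations(sorted(set(entities)), 2):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)

# Hypothetical incident reports (identifiers are illustrative only).
incidents = {
    "I-001": ["person:P1", "person:P2", "vehicle:V1"],
    "I-002": ["person:P2", "person:P3"],
    "I-003": ["person:P4"],
}

can = extract_can(incidents)
edge_count = sum(len(neighbors) for neighbors in can.values()) // 2
print(len(can), edge_count)
```

Restricting the input to narcotics incidents would give a narcotics network in the same form.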
33
• The study is divided into three parts:
– Characteristics of criminal networks in a single jurisdiction.
• Narcotics networks that include individuals and incidents reported in a single jurisdiction are analyzed.
– Characteristics of the networks by combining data from multiple jurisdictions.
• Narcotics networks including individuals and incidents reported in both TPD and PCSD are analyzed.
• The implications of the topological properties of these networks are explained in the law enforcement domain.
34
Narcotics Networks in a Single Jurisdiction: Basic Statistics
                         TPD                   PCSD
Nodes                    31,478 individuals    11,173 individuals
Edges                    82,696                67,106
Giant component          22,393 (70%)          10,610 (94%)
2nd largest component    41                    103
Link density             0.0002                0.0008
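The giant component figures in a table like this can be obtained with a connected-components pass. The sketch below assumes the network is held as a dictionary of neighbor sets; the six-node network is hypothetical.

```python
def connected_components(adjacency):
    """Return the connected components of an undirected network, largest first."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, stack = set(), [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return sorted(components, key=len, reverse=True)

# Hypothetical network: a chain of three, a pair, and one isolated node.
toy = {1: {2}, 2: {1, 3}, 3: {2}, 4: {5}, 5: {4}, 6: set()}

components = connected_components(toy)
giant_share = len(components[0]) / len(toy)
print([len(c) for c in components], giant_share)
```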
35
Experiment Results (cont.): Small World Properties
                                    TPD                    PCSD
Clustering coefficient              0.39 (1.39 x 10^-4)    0.53 (4.08 x 10^-4)
Average shortest path length (L)    5.09                   4.62
Diameter                            22                     23
Values in parentheses are values for a random network of the same size and average degree.
36
• The narcotics networks in both jurisdictions can be classified as small world networks.
• The clustering coefficients of the networks are much larger than their random counterparts.
– This suggests that criminals show the tendency to form circles of associates where members commit crimes together.
– This is not unusual in narcotics networks where an individual commits crimes with friends and people in his trust circle.
– This property works as an asset to law enforcement in identifying criminal conspiracies.
• A short L in a narcotics network has important implications for both crime and law enforcement:
– It improves the speed of flow of information and goods in the network.
– It also suggests that criminals often commit crimes with individuals outside their group. This creates the shortcuts needed to reduce L.
– A short average path length has positive implications for law enforcement too. Short paths between criminals generate better leads in crime investigations.
37
(cont.)
[Figure: Degree Distributions. Panels: TPD Narcotics Network, PCSD Narcotics Network.]
These diagrams show the log-log plots of the cumulative degree distribution (p(k)) vs. the degree (k). The insets are p(k) vs. k. The solid line is the truncated power law curve.
38
• The narcotics networks in both jurisdictions can be classified as scale free (SF) networks.
• This implies that a large number of individuals do not have many associates, but a few have a large number of associates.
• The exponents in both power law decays are very small (0.85 – 1.3). The distribution decays slowly for lower degrees, indicating that there are a large number of nodes with small degrees.
– This is not unexpected, as criminals with high degrees attract more attention from law enforcement authorities, so having fewer associates is beneficial.
• The truncated power law fits better (R² = 93%) than the power law distribution (R² = 85-87%).
– As the number of links (k) grows, the probability of nodes having ‘k’ links decreases.
– This might indicate the cost or trust constraint (criminals may not want to attach to many people) to growth.
39
[Figure: Preferential Attachment (TPD < PCSD), plotted against degree k.]
This curve shows the preferential attachment when the narcotics network in TPD is augmented with data from PCSD.
The dashed line above the curve shows a linear preferential attachment growth; the solid line shows the state of no preferential attachment.
40
• The curves lie above and grow faster than the solid line, offering visual evidence of the presence of preferential attachment.
• Two properties of growth between jurisdictions are worth noting:
– The curve maintains linearity at low value of k. The linearity breaks down for higher degrees.
– In totality the lower degree nodes attract more nodes towards themselves than higher degree nodes.
41
(cont.)
• Break in Linearity
– The slow growth of nodes with high degree can be attributed to the nature of networks being studied.
– Cost/Trust effect: Criminals may prefer not to be linked to a large number of individuals because of the risk of drawing attention. The cost of acquiring more links is therefore high, which might prevent a node with a large number of links from acquiring more.
– External influences: Law enforcement limits the number of crimes an individual can commit.
42
(cont.)
• Lower degree nodes attract more nodes
– The data on police incidents is drawn from two different jurisdictions.
– A criminal might be committing more crimes in one jurisdiction and not the other.
– Thus, one jurisdiction may have incomplete information about the activity of some criminals in the network.
– These criminals will have a low degree in one jurisdiction.
– On adding the second jurisdiction, the degree of these criminals increases since they commit more crimes in the second jurisdiction.
– This will lead to lower degree nodes attracting more nodes than higher degree nodes.
43
• This study focused on topological properties of criminal activity networks and their link to law enforcement, border and transportation security.
• Criminal networks are small world networks with scale free distributions. These topological characteristics have important implications for law enforcement and hence transportation security.
• A single jurisdiction contains incomplete information on criminals and cross-jurisdictional data provides an increased number of higher quality investigative leads.
44
• Homeland Security Centers of Excellence
• Information Retrieval using COPLINK
• Social Network Analysis of Criminal Networks
• Criminal Network Visualizer: System Demo
• Mutual Information Analysis to Identify High-risk Vehicles
45
46
• Vehicles involved in illegal activities (especially smuggling) may operate in groups.
• If the criminal links of one vehicle in a group are known, then its border crossing patterns can be used to identify other partner vehicles.
• CBP agents also suggest that criminal vehicles may cross at certain times of the day to try and evade inspection.
• The concept of mutual information (MI) can be used to include these heuristics and identify high risk vehicles.
47
• Association rule mining
• Mutual information
• Applications of mutual information
48
• Inferring associations between items in the database was motivated by decision support problems faced by retail organizations (Stonebraker 1993).
• An association rule (AR) is a relationship of the form A → B
– A is the antecedent item-set and B is the consequent item-set.
– The antecedent and consequent item-sets can contain multiple items.
• A → B holds in a transaction set D with
– confidence ‘c’ if c% of transactions in D that contain A also contain B,
– support ‘s’ if s% of transactions in D contain both A and B.
• Association mining identifies all the rules that have support and confidence greater than user-specified thresholds.
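The support and confidence definitions above translate directly into code. The sketch below uses a hypothetical set of five market-basket transactions and a hypothetical rule {bread} → {milk}.

```python
def support_and_confidence(transactions, antecedent, consequent):
    """Return (support %, confidence %) of the rule antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    with_a = [t for t in transactions if antecedent <= set(t)]
    with_both = [t for t in with_a if consequent <= set(t)]
    support = 100.0 * len(with_both) / len(transactions)
    confidence = 100.0 * len(with_both) / len(with_a) if with_a else 0.0
    return support, confidence

# Hypothetical transaction set D.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

support, confidence = support_and_confidence(transactions, {"bread"}, {"milk"})
print(support, confidence)
```

Three of the five transactions contain both items (support 60%), and three of the four transactions containing bread also contain milk (confidence 75%). A rule miner would keep the rule only if both values exceed the user-specified thresholds.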
49
• AR mining has been applied in many domains including
– ‘market basket’ data (Agrawal 1993, 1994),
– web log analysis (Mobasher 1996),
– network intrusion detection (Lee 1998),
– recommender systems (Lin 2002), and
– gene regulatory network extraction (Berrar 2001).
• Work has also been done to include domain heuristics in AR mining with
– market basket analysis (Hilderman 1998), and
– gene regulatory networks (Huang forthcoming).
50
• Mutual information is an information theoretic measure that can be used to identify interesting co-occurrences of objects.
• It can be considered a subset of AR mining with 1-item antecedent and consequent item-sets.
• The earliest definitions of MI were given by Shannon et al. (1949) and Fano (1961) as the amount of information provided by the occurrence of an event (y) about the occurrence of another event (x):
$$I(x; y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$
• Intuitively, this concept measures if the co-occurrence of x and y (P(x,y)) is more likely than their separate occurrences (P(x).P(y)).
51
• Phrase extraction from text
– MI works well in text documents as they can be considered as a set of events (words), and the probability of the occurrence of a word can be calculated over the entire document.
– It has been used to study association between words in English texts to identify commonly occurring phrases (Church 1990, Hindle 1990).
– It has also been used for key phrase extraction in Chinese texts (Ong 1999).
• Data sharing and schema matching
– Pantel et al. (2005) used MI to match database columns containing similar information.
52
• Bioinformatics
– MI has been used to extract protein motif patterns from sequences (Tao 2004), and
– identify building blocks of proteins from biomedical abstracts (Weisser 2004).
• Feature selection
– Selecting features in supervised neural network learning (Battiti 1994).
• These studies have not modified the mutual information measure to include domain specific heuristics.
53
• There have been studies that have extended/modified the classical MI measure to include domain heuristics.
• These include studies in
– natural language processing: modified MI to include n-grams (Magerman 1990),
– bioinformatics: extended MI to include transitive biological associations (Wren 2004), and
– feature selection: used conditional MI to identify important features for image classification (Fleuret 2004).
54
• Border-crossing records can be considered as a stream of text (license plates) ordered by the time of crossing.
– MI can be used to identify frequent co-occurrence between a pair of vehicle crossings.
• If one vehicle in the pair has a criminal record, some inferences may be made about the second vehicle if they cross together frequently.
• We use conditional probability to include domain heuristics in the MI formulation.
• The heuristics are derived from information recorded in multiple law-enforcement databases.
55
• Can law enforcement information from border-area jurisdictions be used to identify suspect vehicles at the border?
• How can domain heuristics be incorporated to enhance the performance of mutual information?
• Which domain heuristics are important in identifying suspect vehicles at the border?
56
• Datasets obtained from various LE agencies in the Tucson, AZ metropolitan area
                                  TPD            PCSD           Others
Date range                        1990 – 2006    1990 – 2006    1990 – 2006
Vehicles with police incidents    662,527        640,733        662,034
• CBP data includes information on vehicles crossing the border between Arizona and Mexico at six ports of entry.
Date range            Mar 2004 – Feb 2006
Recorded crossings    17.6 million
Vehicles              2.6 million
57
• To identify interesting pairs of vehicles that cross the border together we use the time of crossing and waiting-time heuristics to enhance MI
– Suggested by domain experts and previous studies (Kaza 2005, Kaza 2006).
• Time of crossing: Criminal vehicle pairs that cross during certain times of the day/night are interesting.
– Crossings at odd times may be considered suspicious
• Waiting-time: The ability to identify high risk vehicles was greatly influenced by the traffic situation and waiting times at a particular port.
– The traffic intensity influences the time interval (waiting time) between the inspections of two potential partner vehicles.
– The waiting-time heuristic is used to adjust the time interval between two paired vehicles included in the MI calculation.
• The MI measure with the time and waiting-time heuristics is referred to as ‘MIW’; classical MI (without heuristics) is referred to as ‘MIC’.
58
[Figure: research design. Components: border wait times (web spider, Internet Archive); law enforcement data (TPD, PCSD); border crossing data (six ports), split into training data (2/3) and testing data (1/3); narcotics vehicles; Set A (criminal vehicles with crossings); Set B (potential target vehicles); heuristic calculation; MIW/MIC scores; evaluation by overlap with a subset of the law enforcement data.]
59
[Figure: aerial photograph of a port of entry, labeled with check points, vehicle lanes, and turn-out points. © 2006 Google – Imagery © 2006 DigitalGlobe, Map data © 2006 NAVTEQ]
• An aerial photograph of a typical U.S. port of entry (southern border).
• Vehicle lanes are backed up with dozens of vehicles during peak times.
• Criminal vehicles operate in groups.
– If one is caught, the others turn back into Mexico.
• They may join the lines one at a time or use turn-out points.
• Thus, the time interval between two related vehicles is likely to be less than or equal to the waiting time if the second vehicle doesn’t join the line until the first vehicle goes through.
• This needs to be taken into consideration in the calculation of MI.
60
• Two external sources
1. CBP publishes hourly wait times on its website (BWT).
• The information is posted only for the current day.
• No publicly available archive is maintained.
• A web-spider was used to systematically download the web-page for every hour over several days in April 2006.
• However, the average waiting times thus obtained cannot be generalized to the entire year.
2. The Internet Archive (IA) contained snapshots of the BWT web-page from April 10, 2004 to March 31, 2005.
• These were used to obtain waiting time statistics for various days over many months in 2004 and 2005.
• The statistics from the spidering process and IA were then used to calculate average waiting times for each port on an hourly basis and used in MIW.
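The final averaging step can be sketched as below, assuming the spidered and archived pages have already been parsed into (port, hour, wait in minutes) samples. The sample values and port labels are hypothetical.

```python
from collections import defaultdict

def hourly_average_wait(samples):
    """Average wait in minutes for each (port, hour) from individual samples."""
    totals = defaultdict(lambda: [0.0, 0])  # (port, hour) -> [sum, count]
    for port, hour, wait_minutes in samples:
        totals[(port, hour)][0] += wait_minutes
        totals[(port, hour)][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}

# Hypothetical parsed samples: (port, hour of day, wait in minutes).
samples = [
    ("A", 22, 15),
    ("A", 22, 25),
    ("A", 23, 10),
    ("B", 22, 5),
]

average_wait = hourly_average_wait(samples)
print(average_wait[("A", 22)])
```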
61
• The mutual information score between any two vehicles is defined as:
$$MI(A, B) = \log_2 \frac{P(A, B)}{P(A)\,P(B)}$$
Here A is a vehicle in Set A and B is a vehicle in Set B.
• P(A, B) is the probability of B crossing within 15 minutes of A. This is calculated based on the number of times A and B are seen crossing together.
– 15 minutes is the average wait time for a vehicle at the six ports of entry in AZ.
• P(A) and P(B) are the probabilities of the vehicles A and B crossing the border.
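A minimal sketch of the classical score follows. It assumes that each probability is estimated as a count divided by the total number of recorded crossings, and that "crossing together" means a crossing of B within the window on either side of a crossing of A; both are assumptions made for illustration. The plates and timestamps are hypothetical.

```python
import math
from datetime import datetime, timedelta

def mi_score(crossings, vehicle_a, vehicle_b, window=timedelta(minutes=15)):
    """log2(P(A,B) / (P(A) P(B))) estimated from (plate, time) crossing records."""
    total = len(crossings)
    times_a = [t for plate, t in crossings if plate == vehicle_a]
    times_b = [t for plate, t in crossings if plate == vehicle_b]
    # Crossings of A that have some crossing of B within the window.
    together = sum(
        1 for ta in times_a if any(abs(tb - ta) <= window for tb in times_b)
    )
    if together == 0:
        return float("-inf")  # the pair was never seen together
    p_a, p_b, p_ab = len(times_a) / total, len(times_b) / total, together / total
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical crossing records.
base = datetime(2005, 1, 10, 22, 0)
day = timedelta(days=1)
crossings = [
    ("AAA111", base),
    ("BBB222", base + timedelta(minutes=5)),
    ("CCC333", base + timedelta(hours=2)),
    ("DDD444", base + timedelta(hours=3)),
    ("AAA111", base + day),
    ("BBB222", base + day + timedelta(minutes=10)),
    ("EEE555", base + day + timedelta(hours=4)),
    ("FFF666", base + 2 * day),
]

score = mi_score(crossings, "AAA111", "BBB222")
print(score)
```

Here both vehicles account for a quarter of the crossings and are always seen together, so the ratio is 4 and the score is 2 bits.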
62
• By using conditional probability we can modify P(A), P(B), and P(A,B) to reflect domain heuristics.
• P(A) is changed to the probability that vehicle A crosses the border and commits crimes. Hereafter this is P’(A).
• P(B) is changed to the probability that vehicle B crosses the border and commits crimes (P’(B)).
• P(A,B) is changed to the probability that vehicles A and B cross the border together and commit crimes (P’(A,B)).
• Thus, a high MI’(A,B) will indicate vehicles that are likely to enter the country together and commit crimes.
63
• Let P_c(a) be the probability that vehicle ‘a’ commits a crime. Let P_b(a) be the probability that ‘a’ crosses the border.
• The day is divided into six time periods corresponding to office travel, travel for lunch, night time, etc.
• The probability of criminal vehicles crossing during these time periods is calculated using historical information in the TPD/PCSD database.
– We can get P_c(V|t) where ‘t’ is a time period (1 ≤ t ≤ 6). P_c(V|t) is the probability that a vehicle in time period ‘t’ will commit a crime.
• By definition,
$$P'(A) = \sum_{t=1}^{6} P[(A_b \cap A_c) \mid t]$$
A_b refers to vehicle A crossing the border, and A_c refers to vehicle A having contact with the police. This is summed over all six periods to obtain P’(A).
64
• Since the probability of a vehicle crossing and having a police contact are independent, we get
$$P'(A) = \sum_{t=1}^{6} P_b(A \mid t)\,P_c(V \mid t)$$
• Similarly, P’(B), P’(A,B), and MI’(A,B) can be defined as:
$$P'(B) = \sum_{t=1}^{6} P[(B_b \cap B_c) \mid t] = \sum_{t=1}^{6} P_b(B \mid t)\,P_c(V \mid t)$$
$$P'(A, B) = \sum_{t=1}^{6} P[((A, B)_b \cap (A, B)_c) \mid t] = \sum_{t=1}^{6} P_b((A, B) \mid t)\,P_c(V \mid t)\,P_c(V \mid t)$$
$$MI'(A, B) = \log_2 \frac{P'(A, B)}{P'(A)\,P'(B)}$$
• MIW(A,B) now includes the time heuristic.
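The effect of the time heuristic can be illustrated with a small sketch. It assumes one reading of the formulation above: single-vehicle terms are weighted by P_c(V|t) once and the joint term twice, with all sums taken over the six periods. The period probabilities below are hypothetical, and the function names are illustrative only.

```python
import math

def weighted_probability(crossing_by_period, crime_by_period, power=1):
    """Sum over periods of P_b(.|t) * P_c(V|t)**power."""
    return sum(
        p_b * p_c ** power for p_b, p_c in zip(crossing_by_period, crime_by_period)
    )

def mi_prime(p_a, p_b, p_ab, p_crime):
    """Time-weighted MI'(A,B); inputs are per-period probability lists."""
    p_a_w = weighted_probability(p_a, p_crime)
    p_b_w = weighted_probability(p_b, p_crime)
    p_ab_w = weighted_probability(p_ab, p_crime, power=2)  # assumed joint weighting
    if p_ab_w == 0:
        return float("-inf")
    return math.log2(p_ab_w / (p_a_w * p_b_w))

# Hypothetical P_c(V|t) for six periods; the last period stands for late night.
p_crime = [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]
# Both vehicles cross evenly across the six periods.
p_single = [0.002] * 6
# One pair crosses together only at night, the other only in the first period.
together_at_night = [0, 0, 0, 0, 0, 0.006]
together_by_day = [0.006, 0, 0, 0, 0, 0]

night_score = mi_prime(p_single, p_single, together_at_night, p_crime)
day_score = mi_prime(p_single, p_single, together_by_day, p_crime)
print(night_score, day_score)
```

Under classical MI the two pairs would tie, because their total crossing probabilities are identical; with the weighting, the pair that travels together in the period with more criminal crossings scores higher.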
65
• To include the waiting-time heuristic, the definition of P’(A,B) was further modified to include variable time intervals.
• The interval between the crossings of Vehicle A and Vehicle B was set at the waiting time.
– For instance, consider that the average waiting time at Port A between 10:00pm – 11:00pm is 20 minutes.
– If Vehicle A crossed at 10:00 pm at Port A then Vehicles A & B are said to cross together only if they crossed within 20 minutes of each other.
• Since the waiting-time periods varied greatly by port and time, this procedure only paired vehicles that were most likely to be present together at that port.
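The variable interval can be sketched as a lookup of the average wait for the port and hour of the first vehicle's crossing. The lookup table, the 15-minute fallback for missing entries, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta

def crossed_together(time_a, time_b, port, average_wait, default_minutes=15):
    """True if the two crossings fall within the average wait for A's port and hour."""
    minutes = average_wait.get((port, time_a.hour), default_minutes)
    return abs(time_b - time_a) <= timedelta(minutes=minutes)

# Hypothetical average waits in minutes, keyed by (port, hour of day).
average_wait = {("A", 22): 20, ("A", 3): 5}

late_evening = datetime(2005, 3, 1, 22, 0)
early_morning = datetime(2005, 3, 1, 3, 0)
gap = timedelta(minutes=18)

paired_at_peak = crossed_together(late_evening, late_evening + gap, "A", average_wait)
paired_off_peak = crossed_together(early_morning, early_morning + gap, "A", average_wait)
print(paired_at_peak, paired_off_peak)
```

The same 18-minute gap pairs the vehicles when the wait is 20 minutes but not when it is 5 minutes.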
66
• The potential target vehicles identified by MI algorithms were evaluated by measuring their overlap with police datasets.
• The overlap with three different datasets was measured: TPD, PCSD and the entire Tucson met. region dataset (that includes TPD and PCSD).
– Each of these datasets covers different geographical areas and records different levels of crimes.
• These numbers were compared with each other and the classical (no domain heuristics) MI measure.
• Both statistical tests and illustrative cases were used in the comparison.
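The overlap measurement can be sketched as below, assuming scored pairs are (Set A vehicle, Set B vehicle, score) tuples and that the count is of distinct Set B vehicles found in a police dataset. The plates, scores, and the choice to count distinct vehicles are hypothetical.

```python
def police_contacts_in_top_n(scored_pairs, police_vehicles, n):
    """Count distinct target vehicles with police contacts among the top-n pairs."""
    ranked = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)[:n]
    targets = {vehicle_b for _, vehicle_b, _ in ranked}
    return len(targets & police_vehicles)

# Hypothetical scored pairs and police dataset.
scored_pairs = [
    ("A1", "B1", 5.2),
    ("A1", "B2", 4.8),
    ("A2", "B3", 3.1),
    ("A2", "B1", 2.0),
]
police_vehicles = {"B1", "B3"}

top2 = police_contacts_in_top_n(scored_pairs, police_vehicles, 2)
top3 = police_contacts_in_top_n(scored_pairs, police_vehicles, 3)
print(top2, top3)
```

Running the same count on pairs ranked by each measure, for increasing n, gives the kind of comparison described above.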
67
Experimental Results
[Figures (a) and (b): crossings by time period, with periods grouped into day and night.]
Time period     (a) All crossings    (b) Crossings by vehicles with police contacts
Midnight-5am    12%                  15%
5am-10am        10%                  10%
10am-2pm        20%                  10%
2pm-4pm         13%                  14%
4pm-8pm         22%                  24%
8pm-Midnight    23%                  27%
• Figure (a) shows the percentage of all crossings over six time periods of the day.
– 23% of all crossings take place between 8pm-Midnight.
• Figure (b) shows the percentage of all crossings by vehicles with police contacts over the six time periods.
– 27% of crossings by vehicles with police contacts happen between 8pm-Midnight.
• The figure suggests that a large number (≈50%) of crossings with police contacts happen after dark.
• MIW uses this information to assign more weight to time periods with more criminal crossings.
68
Experimental Results
[Figure: average number of crossings (Y-axis, 0 to 800) by time of day (X-axis, 0000 to 2200) at ports A to F.]
• This chart shows the average number of crossings at the six ports in Arizona over time of day (X-axis).
• On the Y-axis are the average numbers of vehicles that cross at the port (the ports with zero crossings are closed at certain times).
• The three largest ports (A, D and F) have an average of several hundred crossings per hour between 7am and 8pm.
• There is wide variation in the number of crossing vehicles across different ports and time periods. This supports the assumption that a single, across-the-board time interval for MI calculation is not suited to this problem domain.
– A large time interval is likely to pair many vehicles that may not be related to each other; a small interval is likely to miss many related pairs of vehicles due to large waiting times at peak hours.
69
Experimental Results
[Figure: average waiting time in minutes (Y-axis, 0 to 100) by time of day (X-axis, 0000 to 2200) at ports A to F.]
• This figure shows the average waiting times at each of the six ports over times of the day (X-axis).
• On the Y-axis are the average waiting times in minutes.
• It can be seen that the variations in waiting times at small ports (B, C, and E) are not large and usually stay under 10 minutes.
• The variations at large ports (A, D, and F) roughly mirror their traffic intensity.
• The waiting times were used to define time intervals in the MIW algorithm.
70
Experimental Results
[Figure: three bar charts (TPD dataset, PCSD dataset, Tucson met. dataset) of the number of vehicles with police contacts identified among the top ‘n’ pairs by MIC and MIW.]
                TPD dataset     PCSD dataset    Tucson met. dataset
Top ‘n’ pairs   MIC     MIW     MIC     MIW     MIC     MIW
10              0       1       1       2       1       2
50              1       1       3       3       4       6
100             2       1       5       4       8       10
500             5       6       13      16      28      30
1000            14      13      22      27      50      55
1500            24      21      34      39      73      84
2000            31      31      47      51      97      112
2500            37      40      56      60      118     134
• MIC and MIW scores were calculated for 310,751 pairs of vehicles (the first vehicle from Set A and the second from Set B).
• For comparison, the number of police contact vehicles identified by each was counted using all three police datasets.
• On the X-axis are the top-n pairs of vehicles ordered by their MIC and MIW scores.
• On the Y-axis is the number of vehicles with police contacts identified by the two measures.
• MIW generally identified more potentially criminal vehicles (vehicles with prior police contacts) than MIC.
• The illegal activity of border crossing vehicles was higher in the Pima County region as compared to the city of Tucson as shown by the larger number of police contacts identified.
71
Experimental Results > Selected Case Studies
• This figure shows the crossing patterns of a pair of vehicles with a high MIW score.
[Figure: crossings of Vehicle C and Vehicle D plotted against dates in 2006; annotation: after dark / no fixed work schedule.]
• Vehicle C from Arizona and its occupant were arrested in Tucson for the sale of narcotics.
• Vehicle C crossed 7 times in a one-month period and crossed within a few minutes of Vehicle D.
• The crossings may be considered suspicious since they are almost always after dark and do not fit a standard work schedule.
72
Experimental Results > Selected Case Studies
[Figure: Customs and Border Protection crossing data (frequent crossers at night, paired by MIW) linked to the Tucson met. area narcotics network and criminal network; vehicles shown: A, B, C, D.]
• Vehicle C was found to have strong connections to a narcotics network in the Tucson metropolitan area. It had links to other people and vehicles that had been arrested / suspected for narcotics sales and possession in the region.
• Vehicle D was also involved in criminal activity in the Tucson region.
• MIW identified many other such strong cases.
73
• Exploring the criminal links of border crossing vehicles in local law enforcement databases can be used to enhance border security.
• We found that MI can be used to identify high-risk suspect vehicles that may warrant more inspection at the border.
• The MI measure modified to include domain heuristics like time of crossing and waiting-time performs significantly better than classical MI in the identification of potentially criminal vehicles.
• In addition, the transitive use of MI scores may hold promise for the identification of groups of vehicles.
74
• DHS COEs
– http://www.dhs.gov/xres/programs/editorial_0498.shtm
• Studies discussed today
– http://ai.eller.arizona.edu/paper_conf/index.htm
75