Information Retrieval, Social Network Analysis, and Knowledge Discovery for Homeland Security

Siddharth Kaza

(sidd@u.arizona.edu)

MIS 596A

Mar 21, 2007


Outline

• Homeland Security Centers of Excellence

• Information Retrieval using COPLINK

• Social Network Analysis of Criminal Networks

• Criminal Network Visualizer: System Demo

• Mutual Information Analysis to Identify High-risk Vehicles

Homeland Security Centers of Excellence

• The Department of Homeland Security (DHS) Science and Technology office funds and establishes Centers of Excellence (COE) to conduct fundamental research and education on homeland security issues.

• Each center is led by a university in collaboration with partners from other institutions, agencies, laboratories, think tanks, and the private sector.

• Generate and disseminate knowledge and technology to advance the homeland security mission.

• Create and leverage intellectual capital and nurture a homeland security science and engineering workforce.


Eight COEs

• The Center for Risk and Economic Analysis of Terrorism Events (CREATE), led by the University of Southern California, evaluates the risks, costs, and consequences of terrorism.

• The National Center for Food Protection and Defense (NCFPD), led by the University of Minnesota, defends the safety of the food system from pre-farm inputs through consumption by establishing best practices.

• The National Center for Foreign Animal and Zoonotic Disease Defense (FAZD), led by Texas A&M University, protects against the introduction of high-consequence foreign animal and zoonotic diseases.

• The National Consortium for the Study of Terrorism and Responses to Terrorism (START), led by the University of Maryland, informs decisions on how to disrupt terrorists and terrorist groups.

Eight COEs (cont.)

• The National Center for the Study of Preparedness and Catastrophic Event Response (PACER), led by Johns Hopkins University, optimizes our Nation's preparedness in the event of a high-consequence natural or man-made disaster.

• The Center for Advancing Microbial Risk Assessment (CAMRA), led by Michigan State University, fills critical gaps in risk assessments for decontaminating microbiological threats.

• The University Affiliate Centers to the Institute for Discrete Sciences (IDS-UACs) are led by Rutgers University, USC, UIUC, and the University of Pittsburgh.

• The Regional Visualization and Analytics Centers (RVACs), led by Penn State University, Purdue University, Stanford University, and others, conduct research on visually based analytic techniques that help people gain insight from complex, conflicting, and changing information.


COE for Border Security and Immigration (ninth COE solicitation)

• Major research areas:

– Surveillance, Screening, Data Fusion, and Situational Awareness

• How can border inspection processes be strengthened?

• How can high-risk traffic be distinguished from low-risk traffic?

• What emerging methods of collecting, fusing, integrating and analyzing information offer promise of increasing border security?

– Population Dynamics and Immigration Administration

– Command, Control, and Communications

– Immigration Policy and Effects

– Border Risk Management


Addressing Major Issues

[Figure: three panels. Screening: high-risk vehicle identification from law-enforcement data, a narcotics network, and border crossing data using mutual information, with a chart of frequent night crossers (Vehicle A and Vehicle B) over dates from 2004 to 2005. Data Fusion and Analysis: law-enforcement data, geo-coded illegal alien/vehicle data, and border crossing data for AZ, CA, and TX, covering vehicles and people, feeding information retrieval, network analysis, and spatio-temporal analysis. Situational Awareness: analysis of criminal networks.]

• Screening: Identify high-risk vehicles using association techniques like mutual information to incorporate crossing frequency and law-enforcement data.

• Situational Awareness: Study criminal networks using social network analysis techniques to understand and predict crime patterns.


Information Retrieval using COPLINK

• COPLINK is an information retrieval and analysis system that integrates information from multiple law-enforcement agencies.

• It incorporates algorithms for cross-jurisdictional social network analysis, knowledge discovery, and visualization for intelligence, border safety, and national security applications.


Multiple Isolated Data Sources within a Single Agency

[Figure: the Tucson Police Department records system, in which the Gang Database, the Records Management System (RMS), and the Mug Shots Database are separate, isolated sources.]

Isolated Agencies Share Limited Information through State and Federal Systems

[Figure: separate systems at Pima County, the Tucson Police Department, and the Phoenix Police Department.]

Provide Access to Information using One Friendly Interface

[Figure: the Gang Database, the Records Management System (RMS), and the Mugshots Database accessed through a single interface.]

Consolidated Information Provides Opportunities for Analytical and Data Analysis Applications

COPLINK™ Information Retrieval Interface


Query Parameters and Filters

Running the query with filters.


Person Search Results

A search for White males named Mike, aged 20-35, 5’5” to 6’3” tall, and weighing 150 to 250 lbs returns a generic set of results (24 persons).

Association Retrieval and Visualization


Spatio-temporal Analysis and Visualization



Criminal Activity Networks

• Criminal Activity Networks (CAN) are networks of people, vehicles and locations that are linked by law enforcement information.

• These networks allow us to understand the complex relationships between people and vehicles.

• Analysis of the topological characteristics of these networks helps us better understand their governing mechanisms.

• In this study we analyze the topological characteristics of CANs of people and vehicles in a multiple-jurisdiction scenario to support border and transportation security.

Literature Review

• Criminal Activity Network extraction

• Previous studies of complex networks

• Topological characteristics of networks

• The theory of growth in networks


Criminal Activity Network Extraction

• The extraction of CANs involves analyzing information from many different datasets.

• Accessing information from multiple sources poses many challenges that are documented in the literature. [Garcia-Molina, 2002; Rahm, 2001]

• This study uses the BorderSafe information sharing and analysis framework. [Marshall et al., 2004]

• Using the framework, law enforcement and other datasets are accessed in a form that is amenable to network extraction and analysis.

Complex Networks: Previous Studies

• There have been various studies to understand the characteristics of large and complex networks.

• The studies have explored the topology, evolution, robustness, and other properties of real world networks.

– The World Wide Web [Albert, Jeong and Barabasi, 1999; Kumar et al., 2000]

– Cellular and metabolism networks [Jeong et al., 2000]

– Citation networks [Redner, 1998]

• Most real world networks were found to have similar topological and evolutionary characteristics. [Albert and Barabasi, 2002]

Topological Characteristics

• Topological characteristics are used to study networks at a macro level.

• Three concepts dominate the statistical study of topology: [Albert and Barabasi, 2002]

– Small world

• Despite the large size of networks, nodes often have relatively short paths between them.

– Clustering

• The tendency of nodes to cluster together to form cliques, representing circles of friends in which every member knows every other member.

– Degree distribution

• The distribution of edges among nodes, where different nodes may have different numbers of edges.

Small World

• The small world concept is important as it can depict the communications within a network.

• Communication can range from the spread of disease in human populations and spread of viruses on the Internet to passage of messages and commands in a criminal network.

• The small world property of a network is measured by the average path length.


• The average shortest path length of many real networks has been measured.

– Movie actors were found to be an average distance of 3.65 from each other.

[Watts & Strogatz, 1998]

– Average paths between co-authors in MEDLINE were 4.6.

[Newman, 2001]

• Shortest path lengths of social networks are small due to the presence of shortcuts between otherwise distant people. [Watts, 1999; Nishikawa et al., 2002]
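The effect of a shortcut on the average shortest path length can be illustrated with a small sketch. The code below is not from the study; it assumes an undirected, unweighted network stored as a dictionary of neighbor sets, and the six-node ring is a hypothetical toy example.

```python
from collections import deque

def average_path_length(adj):
    """Mean shortest-path length over all connected ordered node pairs.

    adj: {node: set of neighbor nodes} for an undirected, unweighted graph.
    """
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives shortest hop counts from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0

# Hypothetical toy network: a ring of six people.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}

# The same ring with one shortcut between the two most distant nodes.
shortcut = {node: set(neighbors) for node, neighbors in ring.items()}
shortcut[0].add(3)
shortcut[3].add(0)

print(average_path_length(ring))      # 1.8
print(average_path_length(shortcut))  # smaller than the ring value
```

A single added edge lowers the average for the whole network, which is the mechanism the shortcut argument relies on.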


Clustering

• Individuals in social networks often form cliques.

• Examples of cliques in social networks include authors collaborating in a co-authorship network and websites pointing to each other on the web.

• The tendency to form cliques is measured by the clustering coefficient (CC), which is the ratio of the number of edges that exist among a node’s neighbors to the total number of possible edges among those neighbors.

• Real networks tend to have a much higher CC than random graphs of comparable size:

– Movie actors: 0.79


– MEDLINE co-authorship: 0.066

[Newman, 2001]

• The CC in a criminal network points to the tendency of individuals to collaborate together and partner in crimes.
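As a minimal sketch of how a clustering coefficient can be computed, the code below uses the common per-node (Watts and Strogatz style) definition averaged over all nodes; whether the study used exactly this variant is an assumption. The four-person network is hypothetical.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of pairs of a node's neighbors that are themselves linked."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # fewer than two neighbors: no pair can be linked
    linked = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return linked / (k * (k - 1) / 2)

def average_clustering(adj):
    """Network-level CC: mean of the local coefficients."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Hypothetical toy network: a, b, c form a triangle; d knows only a.
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}

print(local_clustering(toy, "a"))
print(average_clustering(toy))
```

Here node a has three neighbors but only one linked pair among them (b and c), so its local coefficient is 1/3, while b and c each score 1.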


Degree Distribution

• Nodes in a network have different numbers of edges connected to them. The number of edges connected to a node is called its degree.

• The spread in node degrees is given by a distribution function P(k), which gives the probability that a randomly selected node has exactly ‘k’ edges.


• The distribution functions of most real world networks follow power law scaling with varying exponents:

– Movie actors: exponent of 2.3.


– MEDLINE co-authorship: exponent of 1.2.

[Newman, 2001]

• In criminal networks, a high degree may indicate that an individual is a leader.

[Xu and Chen, 2004]

• The degrees of nodes are also used to study the growth and evolution of networks.
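The distribution P(k) and its cumulative form can be sketched in a few lines. This is an illustration only, assuming the same dictionary-of-neighbor-sets representation; the star-shaped network is hypothetical.

```python
from collections import Counter

def degree_distribution(adj):
    """P(k): fraction of nodes that have exactly k edges."""
    n = len(adj)
    counts = Counter(len(neighbors) for neighbors in adj.values())
    return {k: c / n for k, c in sorted(counts.items())}

def cumulative_distribution(p):
    """Fraction of nodes with degree >= k, for each observed k."""
    cumulative, running = {}, 0.0
    for k in sorted(p, reverse=True):
        running += p[k]
        cumulative[k] = running
    return dict(sorted(cumulative.items()))

# Hypothetical toy network: one hub linked to four otherwise isolated nodes.
star = {"hub": {"n1", "n2", "n3", "n4"},
        "n1": {"hub"}, "n2": {"hub"}, "n3": {"hub"}, "n4": {"hub"}}

p = degree_distribution(star)
print(p)
print(cumulative_distribution(p))
```

Plotting the cumulative values against k on log-log axes is the usual way to inspect whether the tail looks like a power law.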


Growth in Networks

• Most real world networks (including CANs) are not static and grow due to the addition of nodes and/or edges.

• The growth of networks changes their topological characteristics.

• Two mechanisms govern evolving networks: [Barabasi and Albert, 1999; Dorogovtsev, Mendes and Samukhin, 2000; Newman, 2001]

– Growth: networks expand continuously by adding new nodes, and

– Preferential attachment: new nodes attach preferentially to nodes that are already well connected.


Preferential Attachment

• Network growth involves adding new nodes (and edges) to the set of current nodes.

• Preferential attachment assumes that the probability that a new node will connect to an existing node i depends on the degree of that node.

– The higher the degree of the existing node, the higher the probability that new nodes will attach to it.

• The functional form of preferential attachment (∏(k)) for a network can be measured by observing the nodes present in the network and their degrees.
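One simple way to estimate the attachment function from data is sketched below. It is an assumption-laden illustration, not the procedure used in the study: it takes the degrees of existing nodes before new data is added, counts how many new links each degree class receives, and normalizes by how many nodes are in that class. The function name and the sample data are hypothetical.

```python
from collections import Counter

def attachment_rate(old_degree, new_links):
    """Relative chance that an existing node of degree k gains a new link.

    old_degree: {node: degree before the new data was added}
    new_links:  list of existing nodes, one entry per new link received
    """
    nodes_with_k = Counter(old_degree.values())
    gained_at_k = Counter(old_degree[node] for node in new_links)
    # Links gained per node in each degree class.
    raw = {k: gained_at_k[k] / nodes_with_k[k] for k in nodes_with_k}
    total = sum(raw.values())
    return {k: (v / total if total else 0.0) for k, v in sorted(raw.items())}

# Hypothetical data: four existing nodes and four newly observed links.
old_degree = {"p1": 1, "p2": 1, "p3": 2, "p4": 4}
new_links = ["p4", "p4", "p3", "p1"]

rates = attachment_rate(old_degree, new_links)
print(rates)
```

In this toy data the rate is proportional to k (1/7, 2/7, 4/7), which is what linear preferential attachment looks like; a curve that flattens at high k would indicate a constraint on growth.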



Preferential Attachment: Previous Studies

• ∏(k) for co-authorship, citation, actor, and Internet networks was found to follow a power law. [Jeong, Neda and Barabasi, 2003; Newman, 2001]

• However, in some cases ∏(k) may grow linearly up to a point and then fall off at high degrees. [Newman, 2001]

• This implies that high-degree nodes are not able to keep attracting new nodes.

• Constraints on growth are also seen in criminal networks.

Constraints on Growth of a Network

• Constraints on the number of links a node can attract may be due to: [Amaral et al., 2000]

– Aging: Since the network grows over time, some high-degree nodes may become too old to participate in the network (e.g., actors in a movie network).

– Cost: It might become costly for a node to attach to a large number of nodes.

• Constraints on the growth of networks may be domain specific and have been studied in many domains:

– In plant-animal pollination networks, some animals cannot pollinate certain plants; hence a link cannot be established. [Jordano, Bascompte and Olesen, 2003]

– In criminal networks, trust may restrict the growth of networks. Criminals and terrorists do not include many people in their inner trust circle. [Klerks, 2001]

Research Questions

• What are the topological characteristics of criminal networks?

• How does cross-jurisdictional data affect the topological characteristics of criminal networks?

• How do criminal networks grow on adding data from more jurisdictions?


Research Testbed

• The testbed for this study contains incident reports of all the individuals and vehicles involved in crimes in the jurisdictions of the Tucson Police Department (TPD) and the Pima County Sheriff’s Department (PCSD) from 1990 to 2002.

        Incidents      Individuals    Vehicles
TPD     2.99 million   1.44 million   675,000
PCSD    2.18 million   1.31 million   520,000

• A CAN consists of individuals and vehicles represented as nodes and police incidents represented as edges.

• Two nodes have an edge between them when they are involved in the same police incident.

• Narcotics networks are extracted from the testbed.
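The construction rule above (entities are nodes, and two entities are linked when they appear in the same police incident) can be sketched as follows. The incident identifiers, entity labels, and function name are hypothetical; the real testbed schema is not described in the slides.

```python
from itertools import combinations

def build_can(incidents):
    """Build an undirected network from incident records.

    incidents: {incident_id: list of entity labels involved}
    Returns {entity: set of entities sharing at least one incident}.
    """
    adj = {}
    for entities in incidents.values():
        unique = sorted(set(entities))
        for entity in unique:
            adj.setdefault(entity, set())  # keep entities with no co-offenders
        for u, v in combinations(unique, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj

# Hypothetical incident reports.
incidents = {
    "inc-001": ["person:A", "person:B", "vehicle:X"],
    "inc-002": ["person:B", "person:C"],
    "inc-003": ["person:D"],
}

can = build_can(incidents)
for entity, linked in sorted(can.items()):
    print(entity, sorted(linked))
```

Merging incident records from a second jurisdiction into the same dictionary before calling the function is one way to picture the cross-jurisdictional case.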


Research Design

• The study is divided into three parts:

– Characteristics of criminal networks in a single jurisdiction.

• Narcotics networks that include individuals and incidents reported in a single jurisdiction are analyzed.

– Characteristics of the networks by combining data from multiple jurisdictions.

• Narcotics networks including individuals and incidents reported in both TPD and PCSD are analyzed.

• The implications of the topological properties of these networks are explained in the law enforcement domain.


Experiment Results: Narcotics Networks in a Single Jurisdiction

Basic Statistics

                        TPD                  PCSD
Nodes                   31,478 individuals   11,173 individuals
Edges                   82,696               67,106
Giant component         22,393 (70%)         10,610 (94%)
2nd largest component   41                   103
Link density            0.0002               0.0008

Experiment Results: Single Jurisdiction (cont.)

Small World Properties

                                   TPD                   PCSD
Clustering coefficient             0.39 (1.39 x 10^-4)   0.53 (4.08 x 10^-4)
Average shortest path length (L)   5.09                  4.62
Diameter                           22                    23

Values in parentheses are values for a random network of the same size and average degree.

Implications of the Small World Property

• The narcotics networks in both jurisdictions can be classified as small world networks.

• The clustering coefficients of the networks are much larger than their random counterparts.

– This suggests that criminals show the tendency to form circles of associates where members commit crimes together.

– This is not unusual in narcotics networks where an individual commits crimes with friends and people in his trust circle.

– This property works as an asset to law enforcement in identifying criminal conspiracies.

• A short L in a narcotics network has important implications for both crime and law enforcement:

– It improves the speed of flow of information and goods in the network.

– It also suggests that criminals often commit crimes with individuals outside their group. This creates the shortcuts needed to reduce L.

– A short average path length has positive implications for law enforcement too. Short paths between criminals generate better leads in crime investigations.

Single Jurisdiction (cont.)

Degree Distributions

[Figure: two panels, TPD Narcotics Network and PCSD Narcotics Network.]

These diagrams show the log-log plots of the cumulative degree distribution (p(k)) vs. the degree (k). The insets are p(k) vs. k. The solid line is the truncated power law curve.

Implications of the Scale Free Property

• The narcotics networks in both jurisdictions can be classified as scale free (SF) networks.

• This implies that a large number of individuals do not have many associates, but a few have a large number of associates.

• The exponents in both power law decays are very small (0.85 – 1.3). The distribution decays slowly for lower degrees, indicating that there are a large number of nodes with small degrees.

– This is not unexpected, as criminals with high degrees attract more attention from law enforcement authorities, so having fewer associates is beneficial.

• The truncated power law fits better (R² = 93%) than the power law distribution (R² = 85-87%).

– As the number of links (k) grows, the probability of nodes having ‘k’ links decreases.

– This might indicate a cost or trust constraint on growth (criminals may not want to attach to many people).

Growth in Multiple Jurisdictions

This curve shows the preferential attachment when the narcotics network in TPD is augmented with data from PCSD.

[Figure: Preferential Attachment (TPD < PCSD), plotted against degree k.]

The dashed line above the curve shows a linear preferential attachment growth; the solid line shows the state of no preferential attachment.

Preferential Attachment: Implications

• The curves lie above and grow faster than the solid line, offering visual evidence of the presence of preferential attachment.

• Two properties of growth between jurisdictions are worth noting:

– The curve maintains linearity at low values of k. The linearity breaks down for higher degrees.

– Overall, lower-degree nodes attract more new nodes than higher-degree nodes.


Preferential Attachment: Implications (cont.)

• Break in Linearity

– The slow growth of nodes with high degree can be attributed to the nature of the networks being studied.

– Cost/Trust effect: Criminals may prefer not to be linked to a large number of individuals because of the risk of drawing attention. Thus, the cost of acquiring more links is high, which might prevent a node with a large number of links from acquiring more.

– External influences: Law enforcement limits the number of crimes an individual can commit.


Preferential Attachment: Implications (cont.)

• Lower degree nodes attract more nodes

– The data on police incidents is drawn from two different jurisdictions.

– A criminal might be committing more crimes in one jurisdiction than in the other.

– Thus, one jurisdiction may have incomplete information about the activity of some criminals in the network.

– These criminals will have a low degree in one jurisdiction.

– On adding the second jurisdiction, the degree of these criminals increases since they commit more crimes in the second jurisdiction.

– This will lead to lower degree nodes attracting more nodes than higher degree nodes.

Conclusions

• This study focused on topological properties of criminal activity networks and their link to law enforcement, border and transportation security.

• Criminal networks are small world networks with scale free distributions. These topological characteristics have important implications for law enforcement and hence transportation security.

• A single jurisdiction contains incomplete information on criminals and cross-jurisdictional data provides an increased number of higher quality investigative leads.



CAN Visualizer Demo


Mutual Information Analysis to Identify High-risk Vehicles

• Vehicles involved in illegal activities (especially smuggling) may operate in groups.

• If the criminal links of one vehicle in a group are known, then its border crossing patterns can be used to identify other partner vehicles.

• CBP agents also suggest that criminal vehicles may cross at certain times of the day to try to evade inspection.

• The concept of mutual information (MI) can be used to include these heuristics and identify high-risk vehicles.

Literature Review

• Association rule mining

• Mutual information

• Applications of mutual information


Association Rule Mining

• Inferring associations between items in the database was motivated by decision support problems faced by retail organizations (Stonebraker 1993).

• An association rule (AR) is a relationship of the form A ⇒ B.

– A is the antecedent item-set and B is the consequent item-set.

– The antecedent and consequent item-sets can contain multiple items.

• A ⇒ B holds in a transaction set D with

– confidence ‘c’ if c% of transactions in D that contain A also contain B,

– support ‘s’ if s% of transactions in D contain both A and B.

• Association mining identifies all the rules that have support and confidence greater than user-specified thresholds.
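Support and confidence, as defined above, can be computed directly from a list of transactions. The sketch below returns fractions rather than percentages; the market-basket data and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    a = set(antecedent)
    both = a | set(consequent)
    with_a = sum(a <= t for t in transactions)
    if with_a == 0:
        return 0.0
    return sum(both <= t for t in transactions) / with_a

# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets
```

A rule such as bread ⇒ milk would be reported only if both values exceed the user-specified thresholds.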


Applications of AR Mining

• AR mining has been applied in many domains including

– ‘market basket’ data (Agrawal 1993, 1994),

– web log analysis (Mobasher 1996),

– network intrusion detection (Lee 1998),

– recommender systems (Lin 2002), and

– gene regulatory network extraction (Berrar 2001).

• Work has also been done to include domain heuristics in AR mining with

– market basket analysis (Hilderman 1998), and

– gene regulatory networks (Huang forthcoming).


Mutual Information

• Mutual information is an information theoretic measure that can be used to identify interesting co-occurrences of objects.

• It can be considered a subset of AR mining with 1-item antecedent and consequent item-sets.

• The earliest definitions of MI were given by Claude et al. (1949) and Fano (1961) as the amount of information provided by the occurrence of an event (y) about the occurrence of another event (x):

$$I(x; y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

• Intuitively, this concept measures whether the co-occurrence of x and y (P(x,y)) is more likely than their independent occurrences (P(x)·P(y)).
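A short numeric sketch of the measure, I(x; y) = log2(P(x,y) / (P(x)·P(y))), may help with the intuition. The probabilities below are made-up illustrative values.

```python
import math

def mutual_information(p_xy, p_x, p_y):
    """log2 of the ratio between the joint probability and the product
    of the individual probabilities."""
    return math.log2(p_xy / (p_x * p_y))

# Events that co-occur five times more often than independence predicts.
print(mutual_information(p_xy=0.05, p_x=0.1, p_y=0.1))

# Events that co-occur exactly as often as independence predicts: about 0.
print(mutual_information(p_xy=0.01, p_x=0.1, p_y=0.1))
```

A positive score means the pair is seen together more often than chance would suggest, a score near zero means no association, and a negative score means the events tend to avoid each other.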

Applications of MI

• Phrase extraction from text

– MI works well in text documents as they can be considered as a set of events (words), and the probability of the occurrence of a word can be calculated over the entire document.

– It has been used to study associations between words in English texts to identify commonly occurring phrases (Church 1990, Hindle 1990).

– It has also been used for key phrase extraction in Chinese texts (Ong 1999).

• Data sharing and schema matching

– Pantel et al. (2005) used MI to match database columns containing similar information.

Applications of MI (cont.)

• Bioinformatics

– MI has been used to extract protein motif patterns from sequences (Tao 2004), and

– identify building blocks of proteins from biomedical abstracts (Weisser 2004).

• Feature selection

– Selecting features in supervised neural network learning (Battiti 1994).

• These studies have not modified the mutual information measure to include domain specific heuristics.

Modifications of MI

• There have been studies that have extended/modified the classical MI measure to include domain heuristics.

• These include studies in

– natural language processing: modified MI to include n-grams (Magerman 1990),

– bioinformatics: extended MI to include transitive biological associations (Wren 2004), and

– feature selection: used conditional MI to identify important features for image classification (Fleuret 2004).


Applications of MI for Border Safety

• Border-crossing records can be considered as a stream of text (license plates) ordered by the time of crossing.

– MI can be used to identify frequent co-occurrence between a pair of vehicle crossings.

• If one vehicle in the pair has a criminal record, some inferences may be made about the second vehicle if they cross together frequently.

• We use conditional probability to include domain heuristics in the MI formulation.

• The heuristics are derived from information recorded in multiple law-enforcement databases.


Research Questions

• Can law enforcement information from border-area jurisdictions be used to identify suspect vehicles at the border?

• How can domain heuristics be incorporated to enhance the performance of mutual information?

• Which domain heuristics are important in identifying suspect vehicles at the border?


Research Testbed

• Datasets were obtained from various law-enforcement (LE) agencies in the Tucson, AZ metropolitan area.

                                 TPD         PCSD        Others
Date range                       1990-2006   1990-2006   1990-2006
Vehicles with police incidents   662,527     640,733     662,034

• CBP data includes information on vehicles crossing the border between Arizona and Mexico at six ports of entry.

Date range           Mar 2004 - Feb 2006
Recorded crossings   17.6 million
Vehicles             2.6 million

Research Design

• To identify interesting pairs of vehicles that cross the border together, we use the time-of-crossing and waiting-time heuristics to enhance MI.

– Suggested by domain experts and previous studies (Kaza 2005, Kaza 2006).

• Time of crossing: Criminal vehicle pairs that cross during certain times of the day/night are interesting.

– Crossings at odd times may be considered suspicious

• Waiting-time: The ability to identify high-risk vehicles is greatly influenced by the traffic situation and waiting times at a particular port.

– The traffic intensity influences the time interval (waiting time) between the inspections of two potential partner vehicles.

– The waiting-time heuristic is used to adjust the time interval between two paired vehicles included in the MI calculation.

• The MI measure with the time and waiting-time heuristics is referred to as ‘MIW’; classical MI (without heuristics) is referred to as ‘MIC’.

Research Design (cont.)

[Figure: research design flow diagram. Inputs: border wait times (web spider, Internet Archive), law enforcement data (TPD, PCSD), and border crossing data (six ports). Steps shown: splitting into training data (2/3) and testing data (1/3); overlap with narcotics vehicles to form Set A (criminal vehicles with crossings); heuristic calculation; MIW/MIC scores; Set B (potential target vehicles); evaluation against a subset of the law enforcement data (TPD, PCSD).]

Estimating Border Wait Times

[Figure: an aerial photograph of a typical U.S. port of entry (southern border), with the check points, vehicle lanes, and turn-out points labeled. © 2006 Google – Imagery © 2006 DigitalGlobe, Map data © 2006 NAVTEQ™]

• Vehicle lanes are backed up with dozens of vehicles during peak times.

• Criminal vehicles operate in groups.

– If one is caught, the others turn back into Mexico.

• They may join the lines one at a time or use turn-out points.

• Thus, the time interval between two related vehicles is likely to be less than or equal to the waiting time if the second vehicle doesn’t join the line until the first vehicle goes through.

• This needs to be taken into consideration in the calculation of MI.

Estimating Border Wait Times (cont.)

• Two external sources were used:

1. CBP publishes hourly border wait times (BWT) on its website.

– The information is posted only for the current day.

– No publicly available archive is maintained.

– A web spider was used to systematically download the web page every hour over several days in April 2006.

– However, the average waiting times thus obtained cannot be generalized to the entire year.

2. The Internet Archive (IA) contained snapshots of the BWT web page from April 10, 2004 to March 31, 2005.

– These were used to obtain waiting time statistics for various days over many months in 2004 and 2005.

• The statistics from the spidering process and the IA were then used to calculate average waiting times for each port on an hourly basis, which were used in MIW.
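The averaging step can be pictured with a small sketch. The record layout (port, hour of day, wait in minutes) and the sample values are assumptions for illustration; the slides do not describe how the downloaded pages were parsed or whether the two sources were weighted differently.

```python
from collections import defaultdict

def hourly_average_wait(observations):
    """Average wait per (port, hour) over observations from any source.

    observations: iterable of (port, hour_of_day, wait_minutes) tuples.
    """
    sums = defaultdict(lambda: [0.0, 0])  # (port, hour) -> [total, count]
    for port, hour, wait in observations:
        entry = sums[(port, hour)]
        entry[0] += wait
        entry[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}

# Hypothetical observations from the two sources.
spider_observations = [("A", 22, 25), ("A", 22, 15)]
archive_observations = [("A", 22, 20), ("B", 9, 5)]

avg_wait = hourly_average_wait(spider_observations + archive_observations)
print(avg_wait[("A", 22)])  # 20.0
print(avg_wait[("B", 9)])   # 5.0
```

The resulting lookup table, keyed by port and hour, is the kind of structure a waiting-time heuristic can consult when deciding how wide a pairing window should be.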


Classical MI Formulation

• The mutual information score between any two vehicles is defined as:

$$MI(A, B) = \log_2 \frac{P(A, B)}{P(A)\,P(B)}$$

Here A is a vehicle in Set A and B is a vehicle in Set B.

• P(A, B) is the probability of B crossing within 15 minutes of A; this is calculated from the number of times A and B are seen crossing together.

– 15 minutes is the average wait time for a vehicle at the six ports of entry in AZ.

• P( A ) and P( B ) are the probabilities of the vehicles A and B crossing the border.
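A minimal sketch of the classical score on a crossing log follows. It assumes that each probability is estimated as a count divided by the total number of recorded crossings, which the slides do not state explicitly; the plate labels, timestamps (in minutes), and function name are hypothetical.

```python
import math

def classical_mi(crossings, a, b, window=15):
    """MI between vehicles a and b from a list of (plate, minute) records.

    A crossing of `a` counts as joint if `b` crossed within `window`
    minutes of it. Probabilities are counts over all recorded crossings
    (an assumption made for this sketch).
    """
    total = len(crossings)
    times_a = [t for plate, t in crossings if plate == a]
    times_b = [t for plate, t in crossings if plate == b]
    together = sum(
        1 for ta in times_a if any(abs(tb - ta) <= window for tb in times_b)
    )
    if together == 0:
        return float("-inf")  # never seen together
    p_a = len(times_a) / total
    p_b = len(times_b) / total
    p_ab = together / total
    return math.log2(p_ab / (p_a * p_b))

# Hypothetical crossing log: (plate, minutes since start of the log).
log = [("A", 0), ("B", 5), ("C", 200), ("A", 1000),
       ("B", 1010), ("D", 1500), ("C", 3000), ("A", 5000)]

print(classical_mi(log, "A", "B"))  # positive: A and B often cross together
print(classical_mi(log, "A", "C"))  # -inf: never within 15 minutes
```

The nested scan is quadratic in the number of crossings per vehicle, which is acceptable for a sketch but would need a sorted sweep on millions of records.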


Incorporating Heuristics

• By using conditional probability we can modify P(A), P(B), and P(A,B) to reflect domain heuristics.

• P(A) is changed to the probability that vehicle A crosses the border and commits crimes. Hereafter this is P’(A).

• P(B) is changed to the probability that vehicle B crosses the border and commits crimes (P’(B)).

• P(A,B) is changed to the probability that vehicles A and B cross the border together and commit crimes (P’(A,B)).

• Thus, a high MI’(A,B) will indicate vehicles that are likely to enter the country together and commit crimes.


Incorporating Time Heuristics

• Let P_c(a) be the probability that vehicle ‘a’ commits a crime. Let P_b(a) be the probability that ‘a’ crosses the border.

• The day is divided into six time periods corresponding to office travel, travel for lunch, night time, etc.

• The probability of criminal vehicles crossing during these time periods is calculated using historical information in the TPD/PCSD database.

– We can get P_c(V|t), where ‘t’ is a time period (1 ≤ t ≤ 6). P_c(V|t) is the probability that a vehicle in time period ‘t’ will commit a crime.

• By definition,

$$P'(A) = \sum_{t=1}^{6} P[(A_b \cap A_c) \mid t]$$

A_b refers to vehicle A crossing the border, and A_c refers to vehicle A having contact with the police. This is summed over all six periods to obtain P’(A).

Incorporating Time Heuristics (cont.)

• Since a vehicle crossing the border and a vehicle having a police contact are treated as independent events, we get

$$P'(A) = \sum_{t=1}^{6} P(A_b \mid t)\, P_c(V \mid t)$$

• Similarly, P’(B), P’(A,B), and MI’(A,B) can be defined as:

$$P'(B) = \sum_{t=1}^{6} P[(B_b \cap B_c) \mid t] = \sum_{t=1}^{6} P(B_b \mid t)\, P_c(V \mid t)$$

$$P'(A, B) = \sum_{t=1}^{6} P[((A \cap B)_b \cap (A \cap B)_c) \mid t] = \sum_{t=1}^{6} P((A \cap B)_b \mid t)\, P_c(V \mid t)\, P_c(V \mid t)$$

$$MI'(A, B) = \log_2 \frac{P'(A, B)}{P'(A)\, P'(B)}$$

• MIW(A,B) now includes the time heuristic.
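The effect of the time heuristic on a single vehicle can be sketched as follows, reading the weighted probability as P’(A) = Σ_t P(A_b|t)·P_c(V|t) over the six periods. All numbers are made up for illustration, including the per-period crime probabilities, and the function name is hypothetical.

```python
def time_weighted_probability(p_cross_by_period, p_crime_by_period):
    """P'(A): sum over time periods of P(A_b | t) * P_c(V | t)."""
    if len(p_cross_by_period) != len(p_crime_by_period):
        raise ValueError("both inputs need one value per time period")
    return sum(pb * pc for pb, pc in zip(p_cross_by_period, p_crime_by_period))

# Hypothetical P_c(V | t) for six periods, ordered from Midnight-5am
# through 8pm-Midnight; night periods are given larger values.
p_crime = [0.03, 0.01, 0.01, 0.01, 0.02, 0.04]

# Two hypothetical vehicles with the same overall crossing frequency.
day_crosser = [0.0, 0.5, 0.5, 0.0, 0.0, 0.0]
night_crosser = [0.5, 0.0, 0.0, 0.0, 0.0, 0.5]

print(time_weighted_probability(day_crosser, p_crime))
print(time_weighted_probability(night_crosser, p_crime))
```

With these inputs the night crosser receives a weighted probability several times larger than the day crosser, even though both cross equally often, which is how periods with more criminal crossings end up carrying more weight in the score.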


Incorporating Waiting-time Heuristics

• To include the waiting-time heuristic, the definition of P’(A,B) was further modified to include variable time intervals.

• The interval between the crossings of Vehicle A and Vehicle B was set at the waiting time.

– For instance, suppose that the average waiting time at Port A between 10:00 pm and 11:00 pm is 20 minutes.

– If Vehicle A crossed at 10:00 pm at Port A, then Vehicles A and B are said to cross together only if they crossed within 20 minutes of each other.

• Since the waiting-time periods varied greatly by port and time, this procedure only paired vehicles that were most likely to be present together at that port.
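The variable-interval pairing rule can be sketched as a small predicate. The sketch assumes both crossings fall on the same day and are given as minutes since midnight, and that the window is looked up by the hour in which the first vehicle crossed; the table of average waits and the function name are hypothetical.

```python
def crossed_together(crossing_a, crossing_b, avg_wait):
    """True if two crossings at the same port fall within that port's
    average waiting time for the hour in which the first vehicle crossed.

    crossing_a, crossing_b: (port, minute_of_day) on the same day
    avg_wait: {(port, hour_of_day): average wait in minutes}
    """
    port_a, minute_a = crossing_a
    port_b, minute_b = crossing_b
    if port_a != port_b:
        return False
    window = avg_wait[(port_a, minute_a // 60)]
    return abs(minute_b - minute_a) <= window

# The example from the slide: a 20-minute average wait at Port A
# between 10:00 pm and 11:00 pm.
avg_wait = {("Port A", 22): 20}

vehicle_a = ("Port A", 22 * 60)            # 10:00 pm
vehicle_b_close = ("Port A", 22 * 60 + 18)  # 10:18 pm
vehicle_b_late = ("Port A", 22 * 60 + 25)   # 10:25 pm

print(crossed_together(vehicle_a, vehicle_b_close, avg_wait))  # True
print(crossed_together(vehicle_a, vehicle_b_late, avg_wait))   # False
```

Replacing the fixed 15-minute window of the classical score with this lookup is the only change needed to make the pairing depend on port and time of day.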


Evaluation Procedure

• The potential target vehicles identified by MI algorithms were evaluated by measuring their overlap with police datasets.

• The overlap with three different datasets was measured: TPD, PCSD, and the entire Tucson metropolitan region dataset (which includes TPD and PCSD).

– Each of these datasets covers a different geographical area and records different levels of crime.

• These numbers were compared with each other and the classical (no domain heuristics) MI measure.

• Both statistical tests and illustrative cases were used in the comparison.
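The overlap measurement can be sketched as a count over the highest-scoring pairs. Whether the study counted pairs or distinct target vehicles is not stated, so counting pairs is an assumption here; the scores, plate labels, and function name are hypothetical.

```python
def overlap_at_n(scored_pairs, police_vehicles, n):
    """Number of the top-n scored pairs whose target vehicle appears in
    a police dataset.

    scored_pairs: list of (score, criminal_vehicle, target_vehicle)
    police_vehicles: set of plates with police records
    """
    top = sorted(scored_pairs, key=lambda pair: pair[0], reverse=True)[:n]
    return sum(1 for _, _, target in top if target in police_vehicles)

# Hypothetical scored pairs and police records.
pairs = [
    (3.2, "A1", "B1"),
    (2.9, "A2", "B2"),
    (1.1, "A3", "B3"),
    (0.4, "A1", "B4"),
]
police = {"B1", "B3"}

print(overlap_at_n(pairs, police, 2))  # 1
print(overlap_at_n(pairs, police, 3))  # 2
```

Running the same function on the rankings from two scoring methods, for several values of n and several police datasets, gives the kind of side-by-side comparison described above.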


Experimental Results

Temporal Patterns of Border Crossings

[Figure: two pie charts, (a) and (b), showing the share of crossings in each of six time periods, grouped into day and night.]

Time period     (a) All crossings   (b) Crossings by vehicles with police contacts
Midnight-5am    12%                 15%
5am-10am        10%                 10%
10am-2pm        20%                 10%
2pm-4pm         13%                 14%
4pm-8pm         22%                 24%
8pm-Midnight    23%                 27%

• Figure (a) shows the percentage of all crossings over six time periods of the day.

– 23% of all crossings take place between 8pm and midnight.

• Figure (b) shows the percentage of all crossings by vehicles with police contacts over the six time periods.

– 27% of crossings by vehicles with police contacts happen between 8pm and midnight.

• The figure suggests that a large share (≈50%) of crossings by vehicles with police contacts happen after dark.

• MIW uses this information to assign more weight to time periods with more criminal crossings.


Traffic Intensity

[Figure: average number of crossings at ports A, B, C, D, E, and F. X-axis: Time of Day (0000 to 2200); Y-axis: average number of crossing vehicles (0 to 800).]

• This chart shows the average number of crossings at the six ports in Arizona over the time of day (X-axis).

• On the Y-axis are the average numbers of vehicles that cross at the port (the ports with zero crossings are closed at certain times).

• The three largest ports (A, D and F) have an average of several hundred crossings per hour between 7am and 8pm.

• The number of crossing vehicles varies widely across ports and time periods. This supports the assumption that a single, across-the-board time interval for MI calculation is not suitable for this problem domain.

– A large time interval is likely to pair many vehicles that are not related to each other, whereas a small interval is likely to miss many related pairs because of long waiting times at peak hours.
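
A rough back-of-the-envelope calculation shows the trade-off. The sketch below assumes vehicles arrive at a steady rate; the rates of 600 and 30 crossings per hour are hypothetical stand-ins for a large and a small port, not values read from the chart.

```python
def expected_candidates(crossings_per_hour, window_minutes):
    """Average number of other vehicles crossing within the window,
    assuming a steady arrival rate (a simplifying assumption)."""
    return crossings_per_hour * window_minutes / 60.0

# Hypothetical traffic rates for a large and a small port.
for label, rate in (("large port", 600), ("small port", 30)):
    for window in (2, 20):
        count = expected_candidates(rate, window)
        print(f"{label}, {window}-minute window: about {count:.0f} candidate partners")
```

With a fixed 20-minute window, every vehicle at the busy port is paired with roughly 200 others, while a fixed 2-minute window at the same port would miss companions separated by a long queue.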

69


Waiting Times

[Figure: average waiting time at ports A, B, C, D, E, and F. X-axis: Time of day (0000 to 2200); Y-axis: average waiting time in minutes (0 to 100).]

• This figure shows the average waiting times at each of the six ports over the time of day (X-axis).

• On the Y-axis are the average waiting times in minutes.

• Waiting times at the small ports (B, C, and E) vary little and usually stay under 10 minutes.

• The variations at large ports (A, D, and F) roughly mirror their traffic intensity.

• The waiting times were used to define time intervals in the MIW algorithm.

70


Comparative Evaluation of MIC and MIW

[Figure: three charts (TPD dataset, PCSD dataset, Tucson met. dataset) comparing MIC and MIW. X-axis: Top ‘n’ pairs; Y-axis: number of vehicles with police contacts. The chart data are tabulated below.]

Top ‘n’ pairs          10    50   100   500  1000  1500  2000  2500

TPD dataset
  MIC                   0     1     2     5    14    24    31    37
  MIW                   1     1     1     6    13    21    31    40

PCSD dataset
  MIC                   1     3     5    13    22    34    47    56
  MIW                   2     3     4    16    27    39    51    60

Tucson met. dataset
  MIC                   1     4     8    28    50    73    97   118
  MIW                   2     6    10    30    55    84   112   134

• MIC and MIW scores were calculated for 310,751 pairs of vehicles (the first vehicle from Set A and the second from Set B).

• For comparison, the number of police contact vehicles identified by each measure was counted using all three police datasets.

• On the X-axis are the top n pairs of vehicles ordered by their MIC and MIW scores.

• On the Y-axis is the number of vehicles with police contacts identified by the two measures.

• MIW generally identified more potentially criminal vehicles (vehicles with prior police contacts) than MIC.

• Illegal activity by border-crossing vehicles was higher in the Pima County region than in the city of Tucson, as shown by the larger number of police contacts identified.
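
The top-n comparison described above can be sketched as follows. This is a minimal illustration with made-up data: the function name `contacts_in_top_n`, the tuple layout, and the decision to count distinct vehicles (rather than pairs) are assumptions, since the slides do not spell out the counting rule.

```python
def contacts_in_top_n(scored_pairs, police_vehicles, n):
    """Count distinct vehicles with police contacts among the n highest-scoring pairs.

    scored_pairs: iterable of (vehicle_a, vehicle_b, score) tuples.
    police_vehicles: set of vehicle ids found in a police dataset.
    """
    top = sorted(scored_pairs, key=lambda pair: pair[2], reverse=True)[:n]
    vehicles = {vehicle for a, b, _ in top for vehicle in (a, b)}
    return len(vehicles & police_vehicles)

# Hypothetical scored pairs and police-contact set, for illustration only.
pairs = [("V1", "V2", 0.9), ("V3", "V4", 0.7), ("V1", "V5", 0.6), ("V6", "V7", 0.2)]
police = {"V2", "V5", "V7"}

for n in (1, 2, 3, 4):
    print(n, contacts_in_top_n(pairs, police, n))
```

Running the same function once with MIC scores and once with MIW scores, against each police dataset, would produce the two series that the charts compare.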

71

Experimental Results > Selected Case Studies

Vehicle Pair Identified by MIW

• This figure shows the crossing patterns of a pair of vehicles with a high MIW score.

[Figure: crossings of Vehicle C and Vehicle D. X-axis: Dates (2006); Y-axis: 0 to 2000. Annotation: After dark/No fixed work schedule.]

• The occupant of Vehicle C, a vehicle from Arizona, was arrested in Tucson for the sale of narcotics.

• Vehicle C crossed 7 times in a one-month period, within a few minutes of Vehicle D.

• The crossings may be considered suspicious since they are almost always after dark and do not fit a standard work schedule.

72


Criminal Activity of Vehicle C & D

[Figure: diagram linking Customs and Border Protection crossing data to networks in the Tucson met. area. Labels: Tucson met. area – Narcotics Network; Tucson met. area Criminal Network; MIW; Frequent Crossers at Night; Vehicle A, Vehicle B, Vehicle C, Vehicle D.]

• Vehicle C was found to have strong connections to a narcotics network in the Tucson metropolitan area. It had links to other people and vehicles that had been arrested for, or suspected of, narcotics sales and possession in the region.

• Vehicle D was also involved in criminal activity in the Tucson region.

• MIW identified many other such strong cases.

73

Conclusions

• Exploring the criminal links of border crossing vehicles in local law enforcement databases can be used to enhance border security.

• We found that MI can be used to identify high-risk vehicles that may warrant more inspection at the border.

• The MI measure modified to include domain heuristics such as time of crossing and waiting time performs significantly better than classical MI in identifying potentially criminal vehicles.

• In addition, the transitive use of MI scores may hold promise for the identification of groups of vehicles.

74

Online Sources

• DHS COEs

– http://www.dhs.gov/xres/programs/editorial_0498.shtm

• Studies discussed today

– http://ai.eller.arizona.edu/paper_conf/index.htm

75