Homeland Security Data Mining using Social (Dark) Network Analysis

Homeland Security Data Mining using
Social (Dark) Network Analysis
ISI 2008, Keynote Address
Hsinchun Chen, Ph.D.
Director, Artificial Intelligence Lab
Director, NSF COPLINK and Dark Web Research Centers
University of Arizona
Acknowledgements: NSF, LOC, ITIC/KDD, DHS, DOJ
© 2005
Overview
September 11, 2001, London Subway bombing
…Iraqi and Afghan Wars…
Spain Madrid bombing, Dutch Hofstad group,
Cairo bombing, Toronto plot, German terrorists…
(After 2004) All relying on Internet…
Leaderless Jihad…Al Qaeda University on the
Web…Cyber seduction for terrorist recruiting…
Eurabia…
ASEAN Regional Forum on fighting terrorism,
separatism, and extremism…
The World is Flat…for good or for worse
Social Movement Organizations (SMO)
(Political) Activism: Political movement,
e.g., Young Democrats, Internet petition, global
warming
Extremism: Radical ideological movement,
e.g., KKK, Skin Head, Militia, animal rights, FLG
Terrorism: Violent political movement,
e.g., ELF, ALF, Aum, Al Qaeda
…They are all using the web…
Terrorism, Terrorist Networks,
Terror on the Internet, Leaderless
Jihad
• Islamic fanatics in the
Global Salafi Jihad (with roots in
Egypt)
• Based on data about 172
Jihadists…social bonds predated
ideological commitment
• small-world network…network
robustness…geographical
distribution…fuzzy
boundaries…cliques…the
strength of weak bonds…the
power of the Internet
M. Sageman, former foreign
service officer in Islamabad,
forensic psychiatrist (2004)
• Drawing on an eight-year study of
WWW
• Terrorist organizations and their
supporters maintain hundreds of
web sites
• Terrorist organizations exploit the
Internet to raise funds, recruit
members, plan and launch attacks,
and publicize their chilling
results.
• New terrorism, new media…The
war over minds…cyberterrorism
• Balancing security and civil
liberties
G. Weimann, communication and
mass media study, U. of Haifa
(2006)
• “The process of radicalization in a
hostile habitat but linked through the
Internet leads to a disconnected
global network, the Leaderless
Jihad.”
• From anecdote to data and from
journalism to social sciences
• Going beyond incident databases;
detailed evidence-based terrorist
(500+) database
• Before 2004, face-to-face
interactions, 26-year old
• After 2004, interaction on the
Internet: Madrid, Dutch Hofstad,
Cairo, Toronto…Irhabi007 and
Muntada, 20-year old
M. Sageman (2007)
Intelligence and Security Informatics
for International Security:
Information Sharing and Data Mining
ISI: Overview
Intelligence and Security Informatics
(ISI)
• “Development of advanced
information technologies, systems,
algorithms, and databases for
international, national and homeland
security related applications, through
an integrated technological,
organizational, and policy-based
approach” (Chen et al., 2003a)
The World After 9/11, 2001
• WTC, Pentagon attacks
• Afghanistan, Iraqi wars
• Bali, Madrid, London bombing
• Sunni, Shia, sectarian wars
• Jihad, E-Jihad, infectious ideas
• Worms, viruses
• Infectious diseases, bioagents, WMDs
• International, regional, cultural, religious conflicts, …
• Traditional crimes, cyber crimes, narcotics, gangs (MS 13), smuggling, domestic extremists (Oklahoma bombing), cyber security, …
Related ISI Fields
• Network security (Corporate/DOD -- intrusion detection)
• System and information security (Corporate/DOD -- firewalls, viruses, worms, hacking)
• Cyber security (NSF/DOD -- network security, cyber crime)
• Forensics, computer forensics (FBI/Police – fingerprint,
DNA, IP addresses, voiceprint, writeprint)
• Crime analysis (Police/FBI -- information sharing, data
mining)
• Intelligence analysis (CIA/NSA – surveillance, intelligence
collection, multilingual data mining)
• Terrorism study (White House -- policy, incident analysis)
• Defense information warfare (DOD -- propaganda, counterintelligence, psychological warfare)
Crime and Security Concerns
[Figure: crime types and security concerns]
A knowledge discovery research
framework for ISI
ISI Research: KDD Techniques
• Information Sharing and Collaboration
• Crime Association Mining
• Crime Classification and Clustering
• Intelligence Text Mining
• Crime Spatial and Temporal Mining
• Criminal Network Analysis
National Security Critical Mission
Areas and AI Lab Projects
• Intelligence and Warning: Dark Web
• Border and Transportation Security: BorderSafe
• Domestic Counter-terrorism: COPLINK, Dark Web
• Protecting Critical Infrastructure and Key Assets
• Defending Against Catastrophic Terrorism: Dark Web, BioPortal
• Emergency Preparedness and Responses
• Intelligence and Security
Informatics (ISI): “Development of
advanced information
technologies, systems,
algorithms, and databases for
national security related
applications, through an
integrated technological,
organizational, and policy-based
approach” (Chen et al., 2003a)
• Data, text, and web mining
• From COPLINK to Dark Web
H. Chen, computer scientist,
artificial intelligence, U. of
Arizona (2006)
The ISI Communities
• IEEE Intelligence and Security Informatics (ISI) Conference: 2003 (Tucson), 2004 (Tucson), 2005 (Atlanta), 2006 (San Diego), 2007 (New Brunswick), 2008 (Taiwan)
• Pacific-Asia ISI Workshop (PAISI): 2006 (Singapore), 2007 (Chengdu, China), 2008 (Taiwan)
• EU-ISI Workshop: 2008, Denmark
COPLINK
• 1996-, DOJ, NIJ, NSF, ITIC, DHS
• Connect
• Detect
• Agent
• STV (Spatio-Temporal Visualization)
• CAN (Criminal Activity Network)
• BorderSafe (Mutual Information)
• AI Lab → Knowledge Computing Corporation
• Tucson, Phoenix → AZ → 1600 agencies in US
• The New York Times, November 2, 2002
• ABC News, April 15, 2003
• Newsweek Magazine, March 3, 2003
New York Times, Nov 2, 2002
Human, field
intelligence
Artificial
intelligence
Dark Web
• 2002-, ITIC, NSF, LOC
• Discussions: FBI, DOD/Dept of Army, NSA, DHS
• Connection:
– Web site spidering
– Forum spidering
– Video spidering
• Analysis and Visualization:
– Link and content analysis (web sites)
– Web metrics analysis (web sites sophistication)
– Authorship analysis (forums; CyberGate)
– Sentiment analysis (forums; CyberGate)
– Video coding and analysis (videos; MCT)
The Dark Web project in the Press
Project Seeks to Track Terror Web
Posts, 11/11/2007
Researchers say tool could trace online posts
to terrorists, 11/11/2007
Mathematicians Work to Help Track Terrorist
Activity, 9/14/2007
Team from the University of Arizona
identifies and tracks terrorists on
the Web, 9/10/2007
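Dark Web hunts terror suspects, with Hsinchun Chen in the lead role
Developing web detection software
Automatically tracking terrorist leaders
Director of the Artificial Intelligence Lab, University of Arizona
Chinese-American scientist 陳炘鈞 (Dr. Hsinchun Chen)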
(Social/Criminal) Network Analysis
Existing Network Analysis Tools
• First generation — manual approach
– Anacapa Chart (Harper & Harris, 1975)
• Second generation — graphics-based
approach
– Analyst’s Notebook, Netmap, Watson
– COPLINK hyperbolic tree view, network view
• Third generation — structural analysis
approach
Anacapa Chart (1st generation)
Association
Matrix
Link chart
Analyst’s Notebook, Netmap, Watson (2nd
generation)
Analyst’s Notebook.
Network nodes are
automatically arranged for
easy interpretation.
Source: i2, Inc.
Netmap.
Different colors
are used to
represent
different entity
types.
Source: Netmap
Analytics, LLC.
Watson.
Relations among a group of people
(the central sphere) based on telephone
records. Source: Xanalysis, Ltd.
A 9/11 Terrorist Network
Analyst’s Notebook & Starlight
• Analyst’s Notebook, by i2: A 2D graph and timeline
layout tool for crime and intelligence analysis
• Starlight, by Pacific Northwest Lab (PNL): A 3D
network visualization and navigation tool for
intelligence analysis
Analyst’s Notebook, i2
Starlight, PNL
SNA
• Social Network Analysis (SNA) has been widely
used to study real-world networks including dark
networks (Kaza, 2005; Koschade, 2006, Xu &
Chen, 2005).
• These include qualitative studies that study the
facilitators of link formation and quantitative
studies that use statistical methods to measure
existing networks.
Characterizing Topological Properties
• L: Average path length
– The average of all-pair shortest path lengths
– Figure examples: L = 2; L = 1
• C: Clustering coefficient
– The tendency to form clusters and groups
– $C_i = \frac{2 m_i}{k_i (k_i - 1)}$
– Figure examples: $C_i = 1.0$; $C_i = 3/6 = 0.5$
• p(k): Degree distribution
– The probability that a randomly selected node has exactly k links
– Figure: plot of p(k) against k; p(k) = ?
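The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used in the talk: the toy graph and the function names are hypothetical, and the reading of $m_i$ as the number of links among the $k_i$ neighbours of node i is the standard definition rather than something stated on the slide.

```python
from collections import Counter, deque

def shortest_path_lengths(adj, source):
    """Breadth-first search: hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_path_length(adj):
    """L: mean shortest-path length over all connected pairs of distinct nodes."""
    total, pairs = 0, 0
    for s in adj:
        for t, d in shortest_path_lengths(adj, s).items():
            if t != s:
                total += d
                pairs += 1
    return total / pairs

def clustering_coefficient(adj, i):
    """C_i = 2 m_i / (k_i (k_i - 1)); taken as 0 here when k_i < 2 (an assumption)."""
    neighbours = adj[i]
    k = len(neighbours)
    if k < 2:
        return 0.0
    # m_i: links that exist among the neighbours of i
    m = sum(1 for u in neighbours for v in neighbours if u < v and v in adj[u])
    return 2 * m / (k * (k - 1))

def degree_distribution(adj):
    """p(k): fraction of nodes that have exactly k links."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}

# Hypothetical toy graph: a triangle a-b-c with one extra node d attached to c.
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
L = average_path_length(toy)
C_a = clustering_coefficient(toy, "a")
p = degree_distribution(toy)
```

In the toy graph, node a sits in a closed triangle, so its two neighbours are linked and its coefficient is 1.0; node c has three neighbours with only one link among them.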
Network Topology Models
• Random model (Erdős & Rényi, 1959)
• Small-world model (Watts & Strogatz,
1998)
• Scale-free model (Barabási & Albert,
1999)
Random Network
• The probability that two arbitrary
nodes are connected is a fixed
number, p
• As a result, all nodes have
roughly the same number of
links (characterizing degree =
average degree <k>)
• Random networks are
characterized by
– Short distance
– Low clustering coefficient
– Poisson degree distribution
(bell-shaped)
Small-World Network
• Average path length:
– Lsw ~ Lrandom
• Clustering coefficient:
– Csw >> Crandom
• Degree distribution
– Similar to that of random
networks
• Applications
– The “19 degrees of separation” on
the Web (Albert, Jeong, & Barabási,
1999)
– The small-world properties of metabolic networks in cells imply that cell functions are modularized and localized
Scale-Free Network
• “Scale free” means there is no single characterizing degree in the network
– Growth
• Instead of having a fixed number of nodes, the network can grow and include new nodes
– Preferential attachment
• $p \sim k_i / \sum_j k_j$
• A node that already has many links is more capable of attracting links from new nodes: “Rich get richer”
• The degree of scale-free networks follows the power-law distribution with a flat tail for large k, $p(k) \sim k^{-\gamma}$
• The ubiquity of SF networks leads to a conjecture that complex systems are governed by the same self-organizing principle
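Growth and preferential attachment can be illustrated with a short simulation. The sketch below is a simplified, hypothetical variant in which every new node brings exactly one link; the function name, the two-node seed network, and the fixed random seed are assumptions made for illustration only.

```python
import random

def grow_network(n_nodes, seed=0):
    """Grow a network one node at a time; each new node links to an existing
    node chosen with probability proportional to that node's current degree."""
    rng = random.Random(seed)
    degree = {0: 1, 1: 1}          # seed network: a single link 0-1
    edges = [(0, 1)]
    # Every node appears in this list once per link end, so a uniform draw
    # from it picks node i with probability k_i / sum_j k_j.
    endpoints = [0, 1]
    for new in range(2, n_nodes):
        target = rng.choice(endpoints)
        edges.append((new, target))
        endpoints.extend([new, target])
        degree[new] = 1
        degree[target] += 1
    return degree, edges

degree, edges = grow_network(500)
max_degree = max(degree.values())
```

Sampling from the list of link ends is a common trick for drawing degree-proportional targets without computing the probabilities explicitly.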
Other Topological Properties
• Number of nodes (n), number of links (m)
• Average degree: $\langle k \rangle = \frac{2m}{n}$
• Density: $d = \frac{2m}{n(n-1)}$
– The number of actual links divided by the possible number of links in the network
• Assortativity
– The Pearson correlation between the degrees of two adjacent nodes
• Global efficiency
– The average of the inverses of the shortest path lengths over all pairs of nodes
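As a rough sketch of how these four quantities might be computed for an undirected, unweighted network: the function names and the three-node toy path are hypothetical, and treating unreachable pairs as contributing zero to global efficiency is an assumption, not something the slide specifies.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search distances from source to all reachable nodes."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def average_degree(n, m):
    return 2 * m / n

def density(n, m):
    return 2 * m / (n * (n - 1))

def global_efficiency(adj):
    """Mean of 1/d(s, t) over all ordered pairs; unreachable pairs add 0."""
    n = len(adj)
    total = 0.0
    for s in adj:
        for t, d in hop_distances(adj, s).items():
            if t != s:
                total += 1.0 / d
    return total / (n * (n - 1))

def assortativity(adj):
    """Pearson correlation between the degrees at the two ends of each link
    (each link is counted once in each direction)."""
    xs, ys = [], []
    for u in adj:
        for v in adj[u]:
            xs.append(len(adj[u]))
            ys.append(len(adj[v]))
    mean = sum(xs) / len(xs)
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys))
    var = sum((x - mean) ** 2 for x in xs)
    return cov / var if var else float("nan")

# Hypothetical toy network: a path a - b - c (n = 3, m = 2).
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
k_avg = average_degree(3, 2)
d = density(3, 2)
eff = global_efficiency(path)
r = assortativity(path)
```

In the toy path the high-degree middle node only touches low-degree ends, so the network is perfectly disassortative.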
Robustness of SF Networks
• Many complex systems display a surprising
degree of robustness against errors, e.g.,
– Organisms grow, persist, and reproduce despite
drastic changes in environment
– Although local area networks often fail, they seldom
bring the whole Internet down
• In addition to redundant rewiring, what else
can play a role in the robustness of networks?
Is it because of the topology (structure)?
Robustness Testing
• How will the connectivity of a
network be affected if some
nodes are removed from the
network?
• How will random node removal
(failure) and targeted node
removal (attack targeting
hubs) affect
– S: the fraction of nodes in the
giant component
– L: the average path length of
the giant component
Robustness Testing (Cont’d)
• SF networks are more robust against failures than random networks due to their skewed degree distribution
• SF networks are more vulnerable to attacks than random networks, again due to their skewed degree distribution
• The power-law degree distribution becomes the Achilles’ heel of SF networks
[Figure: network response under failure and under attack. Adapted from (Albert, Jeong, & Barabási, 2000)]
Dynamic SNA Methods
• Previous studies focused on static network
structures rather than dynamic processes
due to:
– lack of reliable data recovery techniques
(Kossinets & Watts, 2006; Moody et al., 2005)
– few appropriate network measures (Kossinets
& Watts, 2006; Wasserman & Faust, 1994)
– little application of statistical methods for
evolving networks
Network Measurement
• Most empirical studies on longitudinal data
plot descriptive measures over time.
• Three main types of measures are used in
dynamic SNA
– deterministic measures
– probabilistic measures
– temporal measures
Criminal Networks:
Structured Information, Police Reports,
Criminal Associations
COPLINK Connect
Consolidating & Sharing Information promotes
problem solving and collaboration
Records
Management
Systems (RMS)
Gang Database
Mugshots
Database
COPLINK Detect
Consolidated information enables targeted problem solving
via powerful investigative criminal association analysis
COPLINK Detect 2.0/2.5
Association Retrieval and Visualization
System Architecture
[Figure: system architecture diagram. Labels: Criminal-justice Data; Network Creation; Concept Space; Networked Data; Network Partition; Hierarchical Clustering; Structural Analysis; Centrality Measures; Blockmodeling; Network Visualization; MDS]
Network Partition—Hierarchical Clustering
• Major algorithm selection criterion—time complexity
• RNN-based CLINK algorithm (Murtagh 1984)
– $O(n^2)$ time
– $O(n^2)$ space
• Algorithm modification
– Observation: the original network may not be a connected
graph but consists of several disjoint sub-networks, between
which no link exists
– Output contains multiple hierarchies
SNA and Network Visualization
• SNA
– Central member identification
• Degree
– Counting direct links a node has
• Betweenness
– Using Dijkstra’s Shortest-path algorithm (1959)
• Closeness
– Using results from betweenness calculation
– Blockmodeling
• Network visualization—MDS
– Calculating the location (x-y coordinates) of each node
based on distance measure (Torgerson’s algorithm)
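The three centrality measures can be sketched as follows. This is an illustration only: because the toy network is unweighted, breadth-first search stands in for Dijkstra’s algorithm, and betweenness is accumulated with Brandes’ method, which is a swapped-in technique rather than the implementation described on the slide. All names and the toy network are hypothetical.

```python
from collections import deque

def bfs_counts(adj, s):
    """BFS from s: distances, number of shortest paths, predecessors, visit order."""
    dist, sigma, preds, order = {s: 0}, {s: 1}, {s: []}, []
    queue = deque([s])
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                preds[v] = []
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
                preds[v].append(u)
    return dist, sigma, preds, order

def degree(adj):
    """Number of direct links each node has."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def betweenness(adj):
    """Number of shortest paths between other pairs that pass through each node."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        _, sigma, preds, order = bfs_counts(adj, s)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return {v: b / 2 for v, b in score.items()}  # undirected: halve double counts

def closeness(adj):
    """(number of other reachable nodes) / (sum of distances to them)."""
    result = {}
    for s in adj:
        dist = bfs_counts(adj, s)[0]
        total = sum(dist.values())
        result[s] = (len(dist) - 1) / total if total else 0.0
    return result

# Hypothetical network: two triangles (a, b, c) and (d, e, f) joined through g.
net = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "g"},
       "g": {"c", "d"}, "d": {"g", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
deg, btw, clo = degree(net), betweenness(net), closeness(net)
```

Here g has the fewest links among the connecting nodes but the highest betweenness, since every path between the two triangles runs through it.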
System Interface
Nodes represent individual
criminals labeled by their
names
Links represent relationships
between criminals
Adjust the slider to perform
clustering and blockmodeling
System Interface
The reduced star structure
found using blockmodeling
• Circles represent groups.
• The size of a circle is
proportional to the number
of group members.
• Each group is labeled by
its leader’s name.
System Interface
The rankings of each group member
in terms of centrality measures
The first one of each column is the
leader, gatekeeper, and outlier,
respectively
The inner structure of a selected
group
Adjust the slider to do further
blockmodeling
The 744-Member Narcotics Network
The “Meth World”: red nodes represent criminals who had been involved in meth-related crimes since 1995
Subgroup Detection
• Subgroups detected have different characteristics: the subgroups found are consistent with the groups’ specializations or responsibilities in a network
– White gang members who were involved in assaults and murders
– White gang members who were involved in crack cocaine
– Drug dealers
– Offenders who were responsible for stealing, counterfeiting, and cashing checks and providing money to other groups to carry out drug transactions
Central Member Identification
• A member who scores the highest in degree can be a group
leader
A group leader identified by the system
This person has a lot of money and
plays important roles in drug transactions
Interaction Pattern Identification
• Frequency of interaction (represented by thickness of lines)
between subgroups can indicate the strength of between-group
relationship
Frequent interactions between
the two groups (their leaders
were good friends)
Extraction of Overall Network Structure
A chain structure found in a
60-member network
using blockmodel analysis
Usefulness
• Saving investigation time
• Saving training time for new investigators
• Suggesting investigative leads that might
otherwise be overlooked
• Helping prove guilt of criminals in court
Temporal Network Analysis
• Research objectives
– Applying various measures to capture and predict changes
in criminal networks over time
• Unit of analysis
– Individual level
• Centrality measures: Who will be the next key members?
– Group level
• Density: What does the change in density imply about group membership
(recruitment and turnover)?
• Cohesion: Do groups become more cohesive or less cohesive over time? What
does this change imply about the operation of the criminal groups?
– Network level
• Overall structure: How does the overall network structure of a criminal
enterprise change over time? What does this imply about changes in
the organization of a criminal enterprise?
The Evolution of “Meth World”
The network in Year:
1995, 1996, 1998,
1999, 2002
The Evolution of “Meth World”
• Both density and cohesion of the highlighted group dropped in 1994, possibly indicating a turnover
• No connections with people outside of the group existed during 1995 and 1996. The group stayed highly cohesive
• In 1998, 1999, and 2001, group cohesion dropped while density remained high, indicating a tendency to connect to people outside of the group and to recruit new members
Dynamic Network Analysis Research Testbed
• Two related real-world datasets:
– police incident reports from Tucson Police
Department (TPD)
• 2.03 million individuals
• 1.34 million vehicles
• 1990-2005
– inmate information from the Arizona Department of Corrections (ADOC)
• 165,540 jailed individuals
• 1986 to 2006
Facilitator Identification
• In this study, the facilitators included three individual
attributes and five shared affiliations.
• Individual attributes: age, race, gender
• Shared affiliations: mutual acquaintance, inmate
affiliation, vehicle affiliation, phone affiliation, residential
address
• These facilitators were selected based on previous
studies and input by domain experts.
Statistical Analysis
• Cox survival analysis was used to examine the
significance of facilitators.
$h(t, x_1, x_2, x_3, \ldots) = h_0(t)\,\exp(\beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots)$
• h(t,x1,x2,x3…) is instantaneous hazard - the probability
that the event will happen at time t
– given that the event has not happened up until time t
– with the observations of various independent variables (x1, x2,
x3…)
• The dependent variable indicates if a pair of individuals i
and j with dij = 2 would subsequently form a new link at
time t.
Experimental Results
[Figure: hazard ratios (g) with 95% confidence intervals for Vehicle, Mutual acq., Age, Race, and Gender; X-axis: Hazard Ratio (g), ticks from 0 to 45]
Results of multivariate survival analysis (Cox regression) of triadic closure for pairs of individuals.
On the X-axis, the figure shows the hazard ratios and their 95% confidence intervals.
The probability of triadic closure increases by a factor of the hazard ratio (g) when the corresponding independent variable increases by one unit.
Experimental Results (cont.)
Facilitator | Significant/Insignificant in predicting future co-offending
Mutual Acquaintances | Significant: criminals with shared ‘friends’ are likely to co-offend in crimes in the future
Shared Vehicles | Significant: common vehicles point to hidden/future operational links
Homophily in age, race, gender | Insignificant: crime crosses race, gender boundaries (especially in an immigrant city like Tucson)
Common jails | Insignificant: ADOC’s jail segregation system appears to work. Important policy implications.
Link Prediction
• Cox regression can also be used to determine the scale of influence
for each of the facilitators.
• Sharing the same vehicle in different crimes increases the
probability of triadic closure by a factor of 9.38 and each
additional mutual acquaintance increases it by a factor of 10.79.
• Therefore, if two unconnected criminals have used the same vehicle in different crimes and have five mutual acquaintances, then they are $9.38^{1} \times 10.79^{(5-1)} \approx 127141.88$ times more likely to co-offend in the future.
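The arithmetic behind this figure can be checked with a short sketch. The function and dictionary keys are hypothetical names; the exponent of 4 for five mutual acquaintances follows the slide’s own (5 - 1) exponent.

```python
def relative_hazard(hazard_ratios, unit_increases):
    """Multiply hazard ratios, each raised to the number of unit increases
    in its variable (the multiplicative form of the Cox model)."""
    factor = 1.0
    for name, units in unit_increases.items():
        factor *= hazard_ratios[name] ** units
    return factor

# Hazard ratios quoted on the slide.
ratios = {"shared_vehicle": 9.38, "mutual_acquaintance": 10.79}

# One shared vehicle, and five mutual acquaintances counted as 5 - 1 = 4
# additional ones, as in the slide's expression.
factor = relative_hazard(ratios, {"shared_vehicle": 1, "mutual_acquaintance": 5 - 1})
print(round(factor, 2))
```

Because the model is multiplicative in the hazard, each additional unit of a facilitator multiplies the relative likelihood by its hazard ratio rather than adding to it.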
Terrorist Networks:
Unstructured and Multilingual,
Intelligence Reports,
Family/Friendship/Disciple Affiliations
A 9/11 Terrorist Network
The Global Salafi Jihad (GSJ) Network
• Based on Dr. Marc Sageman’s book and data
• Data collected and cross-validated from open sources regarding 366
GSJ members
• Background
– 75% From upper or middle class
– Average age is 26
– Affiliation through friendship, kinship, discipleship, and worship
• Four clumps (based on geographical location)
– Central Staff
– Core Arab
– Maghreb Arab
– Southeast Asian
GSJ (Cont’d)
• Each clump has its Hubs: important, popular
members with many links (high degree)
• The Central Staff clump connects with the other three clumps through Lieutenants: important connectors (high betweenness)
• A clump may contain Cliques: members are
nearly fully connected
• Clumps have different structures
– Scale free network
– Hierarchical network
GSJ (Cont’d)
Clump | Hub | Lieutenants | Network Structure
Central Staff | Osama bin Laden | - | -
Core Arab | Khalid Sheikh Mohammed (KSM) | Waleed Mohd Tawfiq bin Attash, Abdal Rahim al Nashiri, Ramzi Mohd Abdullah bin al Shibh | Scale free
Maghreb Arab | Zain al Abidin Mohd Hussein | Fateh Kamel, Amar Makhlulif | Scale free
Southeast Asian | Abu Bakar Baasyir | Encep Nurjaman, Ali Ghufron | Hierarchical
The Dataset
• EXCEL spreadsheet containing the information about the
366 GSJ members.
• Data characteristics
– Node: individual terrorist
• Short name, full name, DOB, education, marital status, etc.
– Link: relation
• Operational link (based on attacks)
• Personal link
– Acquaintance
– Friends
– Family
– Relative
– Religious
• Post join tie
The Network (with all links)
[Figure: the GSJ network with all links]
Annotations:
• A lieutenant acting as a gatekeeper to connect two clumps
• Southeast Asians are led by their own leader
• Three clumps (central, core, Maghreb) are led by members from the “central staff” clump
Legend:
• Clumps: Central Staff, Core Arab, Southeast Asian, Maghreb Arab
• Node size: Leader; Lieutenant (an important person linking two groups); Other people
• Fate: Dead, Captured
Operational Links
Bali, 2002
Jakarta, 2003
Singapore Plot, 2001
9/11, 2001
Strasbourg, 1999
LAX, 1999
France, 1995
Casablanca, 2003
Emb, 1998
Morocco, 1994
Istanbul, 2003
All Personal Links
Personal Links vs. Operational Links
How did they get involved in 9/11?
Finding the Path Resulting in an Attack
Finding the Network Structure
Use PageRank Algorithm to Calculate the
Importance Value of Each Terrorist
• PageRank (Brin & Page, 1998) is a very famous algorithm
designed to calculate the authoritativeness of Web pages
based on the Web link structure.
• We borrowed the main idea of PageRank to calculate the
importance value of each terrorist in the terrorist network based
on their relationships:
– Step 0: Initially, assign equal PageRank scores (importance
value) to every terrorist in the network.
– Step 1: For every terrorist p in the network, calculate its PageRank score as follows:
• $\mathrm{PageRank}(p) = \frac{1-d}{n} + d \times \sum_{\text{all } q \text{ linking to } p} \frac{\mathrm{PageRank}(q)}{c(q)}$
– Step 2: Repeat Step 1 until the changes of the PageRank score
are smaller than a threshold value (convergence).
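Steps 0 to 2 can be sketched as a simple iteration. The slide does not define d, n, or c(q); the sketch assumes the usual PageRank reading (d is the damping factor, n the number of nodes, c(q) the number of links leaving q) and an assumed value d = 0.85. The function name, tolerance, and toy network are hypothetical, and every node is assumed to have at least one link.

```python
def pagerank(links, d=0.85, tol=1e-9, max_iter=1000):
    """links[q] is the set of nodes that q links to (each node needs >= 1 link)."""
    nodes = list(links)
    n = len(nodes)
    score = {p: 1.0 / n for p in nodes}                       # Step 0: equal scores
    incoming = {p: [q for q in nodes if p in links[q]] for p in nodes}
    for _ in range(max_iter):
        new = {p: (1 - d) / n
                  + d * sum(score[q] / len(links[q]) for q in incoming[p])
               for p in nodes}                                # Step 1: update
        change = max(abs(new[p] - score[p]) for p in nodes)
        score = new
        if change < tol:                                      # Step 2: convergence
            break
    return score

# Hypothetical toy network: a hub h with three associates; each relationship
# is undirected, so it is entered in both directions.
toy = {"h": {"a", "b", "c"}, "a": {"h"}, "b": {"h"}, "c": {"h"}}
scores = pagerank(toy)
top = max(scores, key=scores.get)
```

With relationships entered in both directions, no node is a dead end, so the scores keep summing to one throughout the iteration.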
Build Authority Derivation Graph: Reveal the
Social Hierarchy among Terrorists
• For each terrorist p in the original full network:
– Find all the terrorists {q1, q2, …, qm} that have relationships to p
– Find the qi who has the highest importance value among {q1, q2, …, qm}
– Draw a directed link from p to qi indicating that qi is the direct leader of p.
[Figure: p with neighbors q1 to q6, labeled with importance values between 0.12 and 0.18; the derived directed link goes from p to q1]
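The construction above can be sketched as follows. One detail is an assumption added here and not stated on the slide: a link is drawn only when the best neighbour is more important than p itself, so the top-ranked person ends up with no leader. The names and importance values are hypothetical.

```python
def authority_derivation_graph(adj, importance):
    """Map each person p to the neighbour with the highest importance value.

    Assumption (not on the slide): no link is drawn when p is already more
    important than all of its neighbours."""
    leader_of = {}
    for p, neighbours in adj.items():
        if not neighbours:
            continue
        best = max(neighbours, key=lambda q: importance[q])
        if importance[best] > importance[p]:
            leader_of[p] = best        # directed link p -> best
    return leader_of

# Hypothetical relationships and importance values (e.g. PageRank scores).
adj = {
    "boss": {"lt", "m3"},
    "lt": {"boss", "m1", "m2"},
    "m1": {"lt"},
    "m2": {"lt"},
    "m3": {"boss"},
}
importance = {"boss": 0.4, "lt": 0.3, "m1": 0.1, "m2": 0.1, "m3": 0.1}
adg = authority_derivation_graph(adj, importance)
```

Following the derived links upward from any member traces a chain of command, which is how the hierarchy levels in the graph can be read.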
The ADG of the GSJ Network (n = 1)
Central Staff
Core Arab
Southeast Asian
Maghreb Arab
The ADG of the GSJ Network
• From the ADG of the GSJ network, we can clearly see
that:
– The network has a fanning-out hierarchical structure.
– The people who were stated as leaders in Dr. Sageman’s book
also appear to be leaders in each level of the hierarchy in the
network (their names are marked in red).
– Some leaders have many directly related underlings (e.g., Hambali), while others have fewer directly related underlings but many levels of underlings (e.g., bin Laden).
– The whole network seems to be composed of two parts: the north part led by Hambali and the south part led by bin Laden.
Dark Networks: Topology,
Disruption Strategy
Introduction
• Many “Dark Networks” (e.g., terrorist networks, drug-trafficking networks, arms smuggling networks, etc.) are hidden from our view yet could bring devastating impacts to our society and economy
• Traditionally, due to the difficulty of collecting and accessing reliable data sources, the topology of these networks is largely unknown
– Do dark networks share the same topological properties
with other empirical networks?
– Do they follow the same self-organizing principle?
– How do they achieve efficiency under constant surveillance
and threats from authorities?
– How robust are they against attacks?
The Four Dark Networks
• The Global Salafi Jihad (GSJ) terrorist network
– Nodes: terrorists from four terrorist groups: Central Staff, Core Arabs, Maghreb Arabs, and Southeast Asian
– Links: personal links (kinship, friendship, religious ties) and relations formed after joining the GSJ
• The narcotics-trafficking network (Meth World)
– Nodes: criminals involved in meth-related crimes between 1985-2002
– Links: co-occurrence relations extracted from crime incident reports
• The gang network
– Nodes: criminals involved in gang-related crimes between 1985-2002
– Links: co-occurrence relations extracted from crime incident reports
• The terrorist web sites (Dark Web)
– Nodes: Web sites created by four terrorist groups: Al-Gama’a al-Islamiyya, Hizballa, Al-Jihad, and Palestinian Islamic Jihad and their supporters
– Links: composite hyperlinks
The Dark Networks (Cont’d)
Basic Statistics
Measure | GSJ | Meth World | Gang Network | Dark Web
Number of Nodes | 366 | 1349 | 3917 | 104
Number of Links | 1247 | 2392 | 9051 | 156
Size of the Giant Component | 356 (97.3%) | 924 (68.5%) | 2231 (57.0%) | 80 (77.9%)
Link Density | 0.02 | 0.01 | 0.003 | 0.05
Average Degree | 6.97 | 4.62 | 2.87 | 1.94
Maximum Degree | 44 | 37 | 51 | 33
Assortativity | 0.41** | -0.14** | 0.17** | -0.24*
For the giant component
* p < 0.05
** p < 0.01
Small-World Properties
Measure | GSJ Data | GSJ Random | Meth World Data | Meth World Random | Gang Network Data | Gang Network Random | Dark Web Data | Dark Web Random
Diameter | 9 | 6.00 (0.263) | 17 | 9.57 (0.556) | 22 | 16.40 (0.516) | 12 | 13.16 (0.830)
Average Path Length L | 4.20 | 3.23 (0.040) | 6.49 | 4.52 (0.056) | 9.56 | 4.59 (0.034) | 4.70 | 3.15 (0.108)
Global Efficiency | 0.28 | 0.33 (0.004) | 0.18 | 0.23 (0.003) | 0.12 | 0.23 (0.001) | 0.30 | 0.34 (0.019)
Clustering Coefficient C | 0.55 | 0.020 (0.0029) | 0.60 | 0.005 (0.0014) | 0.68 | 0.002 (0.0005) | 0.47 | 0.049 (0.0155)
Findings about SW Properties
• Dark networks are sparse
• Dark networks are small worlds
– The average path length (and diameter) is small relative to the network size but slightly larger than that of the random graph counterpart
– The clustering coefficient is significantly greater than that of the random graph counterpart
• Network members are extremely close to their
leaders
– GSJ: 2.5 steps to Bin Laden, on average
– Meth World: 3.9 steps to its leader, on average
• GSJ and the gang network are assortative, while the
Meth World and the Dark Web are disassortative
Scale-Free Property
Network | p(k) | $R^2$
GSJ | $p(k) = 0.45\,k^{-1.38}$ | 0.74
Meth World | $p(k) = 0.86\,k^{-1.86}$ | 0.89
Gang Network | $p(k) = 1.14\,k^{-1.95}$ | 0.81
Dark Web | $p(k) = 0.35\,k^{-1.10}$ | 0.82
Cumulative Degree Distributions
[Figure: cumulative degree distributions P(k) against k on logarithmic axes, one panel each for GSJ, Meth World, Gang Network, and Dark Web; each panel shows the data and a power-law fit]
Findings about SF Properties
• All four networks display scale-free
characteristics
• The power-law distributions fit especially well
with the data for large degrees
• The three human networks show somewhat
two-regime scaling behavior which may be
due to (Barabasi et al., 2002)
– New links between existing members
– Rewiring
Implications
• Sparseness and short paths between network members
– Enhanced efficiency in flow and transmission of information
and goods
– Reduced risks of being detected and captured by authorities
• High clustering coefficient
– High tendency to form groups and teams
– Enhanced efficiency in flow of resources within the local group
• High closeness to network leaders
– Short chain of command and high communication efficiency
• The Dark Web is a special case with relatively large path length
(4.70)
– Reluctance to share potential resources with other terrorist
groups
• Dark networks may form following the self-organizing principle
Robustness against Attacks
• Two types of attacks
– Simultaneous attacks (the
degree/betweenness of nodes are not
updated after each removal)
– Progressive attacks (the degree/betweenness
of nodes are updated after each removal)
• Two attack strategies
– Attack on hubs (highest degree)
– Attack on bridge (highest betweenness)
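The difference between the two attack types can be sketched with a hub attack on a small hypothetical network, tracking S, the fraction of the original nodes that remain in the giant component. A bridge attack would rank nodes by betweenness instead of degree. All names and the toy network are assumptions made for illustration.

```python
def giant_fraction(adj, n_original):
    """S: size of the largest connected component over the original node count."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            u = stack.pop()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        best = max(best, size)
    return best / n_original

def remove_node(adj, node):
    return {u: nbrs - {node} for u, nbrs in adj.items() if u != node}

def degree_rank(adj):
    """Nodes sorted by decreasing degree; ties broken by name."""
    return sorted(adj, key=lambda u: (-len(adj[u]), u))

def hub_attack(adj, n_remove, progressive):
    n = len(adj)
    fixed_targets = degree_rank(adj)       # ranking computed once (simultaneous)
    curve = [giant_fraction(adj, n)]
    for step in range(n_remove):
        # Progressive attacks re-rank the surviving nodes after each removal.
        target = degree_rank(adj)[0] if progressive else fixed_targets[step]
        adj = remove_node(adj, target)
        curve.append(giant_fraction(adj, n))
    return curve

# Hypothetical 9-node network.
net = {"h1": {"h2", "a", "b", "c"}, "h2": {"h1", "d", "x"}, "x": {"h2", "e", "f"},
       "a": {"h1"}, "b": {"h1"}, "c": {"h1"}, "d": {"h2"}, "e": {"x"}, "f": {"x"}}
simultaneous = hub_attack(net, 2, progressive=False)
progressive = hub_attack(net, 2, progressive=True)
```

In this toy case, removing h1 lowers the degree of h2, so the progressive attack switches its second target to x and fragments the network further than the fixed ranking does.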
Simultaneous vs. Progressive Attacks
[Figure: bridge attack on the GSJ network. One panel plots S against the fraction of nodes removed for simultaneous and progressive attacks, with points marked fs and fp; the other plots the average path length against the fraction of nodes removed for simultaneous and progressive attacks]
Hub vs. Bridge Attacks
[Figure: S and <s> against the fraction of nodes removed under hub attacks and bridge attacks; panels include GSJ, Meth World, and Gang Network]
Findings and Implications
• Dark networks are more vulnerable to
progressive attacks than simultaneous attacks
• Dark networks are more vulnerable to bridge
attacks than to hub attacks
How Well has the Authority Done?
(close to random!)
Disruption of the GSJ Network
[Figure: disruption of the GSJ network from 1993 to 2003. Panel a plots S and panel b plots l, each comparing the Real removal sequence with Preferential and Random removal]
Dark Web: Unstructured and
Multilingual, Web 1.0 and 2.0, Multifaceted Analysis (Content, Authorship,
Sentiment)
• Drawing on an eight-year study of the WWW
• Terrorist organizations and their supporters maintain hundreds of web sites
• Terrorist organizations exploit the Internet to raise funds, recruit members, plan and launch attacks, and publicize their chilling results.
• New terrorism, new media…The war over minds…cyberterrorism
• Balancing security and civil liberties
G. Weimann, communication and mass media study, U. of Haifa (2006)
• “The process of radicalization in a hostile habitat but linked through the Internet leads to a disconnected global network, the Leaderless Jihad.”
• From anecdote to data and from journalism to social sciences
• Going beyond incident databases; detailed evidence-based terrorist (500+) database
• Before 2004, face-to-face interactions, 26-year-old
• After 2004, interaction on the Internet: Madrid, Dutch Hofstad, Cairo, Toronto…Irhabi007 and Muntada, 20-year-old
M. Sageman (2007)
Web Site Example: Links to Multimedia and Manuals
• Link to “The General of Islam” radio station
• Callouts: Azzam speeches; Berg beheading; videos of Zarqawi; others
• Complete 65-page manual of a 50 caliber rifle in PDF
• Source: http://www.al-ghazawat.110mb.com/, French and Arabic Web site
Web Site Example: Links to Web Sites and Forums
• Links to several Iraqi Jihadist Web sites and forums
• Source: http://almaaber.jeeran.com/, Arabic Web site
Web Link Analysis – Generating Hyperlink Diagrams
1. Calculate a similarity measure for a pair (A, B) of Web sites based on:
   1. Number of hyperlinks between the two Web sites
   2. Level of the hyperlinks in the Web site hierarchy

   $$\mathrm{Similarity}(A, B) = \sum_{\text{all links } L \text{ b/w } A \text{ and } B} \frac{1}{1 + lv(L)}$$

   where L is a hyperlink between sites A and B, and lv(L) is the level of hyperlink L in the Web site hierarchy.
2. The similarity matrix is fed to the multidimensional scaling (MDS) algorithm, which generates a 2-dimensional graph of Web sites with embedded distance (similarity) information.
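The similarity measure in step 1 can be sketched as follows. This is only an illustration under stated assumptions: the home page is taken to be level 0 (the slide does not fix the numbering), links in either direction count as links "between" the two sites, and the site names and crawl records are hypothetical. The MDS step is not implemented here.

```python
from collections import defaultdict

def similarity_matrix(links):
    """Sum 1 / (1 + lv(L)) over all hyperlinks L between each pair of sites.

    links: iterable of (source_site, target_site, level) records, where
    level is the depth of the linking page in its site hierarchy
    (assumption: home page = level 0).
    """
    sim = defaultdict(float)
    for src, dst, level in links:
        if src == dst:
            continue  # ignore links inside one site
        pair = tuple(sorted((src, dst)))  # undirected pair (A, B)
        sim[pair] += 1.0 / (1 + level)
    return dict(sim)

# Hypothetical crawl records: (source site, target site, link level).
LINKS = [
    ("site-a", "site-b", 0),  # link on the home page of site-a
    ("site-b", "site-a", 1),
    ("site-a", "site-b", 1),
    ("site-a", "site-c", 3),  # deep link contributes less
]
SIM = similarity_matrix(LINKS)
print(SIM)
```

Links near the top of a site hierarchy dominate the score, so a single home-page link outweighs several deep links. The resulting pairwise values are what would be handed to an MDS routine.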
Proposed Approach - Content Analysis
Coding Scheme (high-level attribute: low-level attributes)
• Communications: Email; Telephone; Multimedia; Online Feedback Form
• Fundraising: External Aid Mentioned; Fund Transfer; Donation; Charity; Support Groups
• Sharing Ideology: Mission; Doctrine; Justification of the Use of Violence; Pin-pointing Enemies
• Propaganda (insiders): Slogans; Dates; Martyrs; Leaders; Banners and Seals; Narratives of Operations and Events
• Propaganda (outsiders): References to Western Media Coverage; News Reporting
• Command and Control: Tactics; Organization Structure; Recording or Videos from Senior Members of the Group; Documentation of Previous Operations; Operations’ Geographical Area
• Recruitment and Training: Explicit Invitation to Join the Movement or Group
• Virtual Community: Listserv; Text Chat Room; Message Board; E-conferencing; Webring
U.S. Domestic and Middle Eastern Terrorist/Extremist Web Site Testbed

U.S. Domestic
Category | # URLs | Example Group
Black Separatist | 2 | “Nation of Islam”
Christian Identity | 13 | “Kinsman Redeemer Ministries”
Militia | 8 | “Michigan Militia”
Neo Confederate | 4 | “Texas League of the South”
White Supremacy | 7 | “Ku Klux Klan”
Neo-Nazis | 9 | “American Nazi Party”
Ecoterrorism/Animal Rights | 1 | “Animal Liberation Front”
Total | 44 |

Middle Eastern
Category | # URLs | Example Group
Sunni | 24 | “Al-Qaeda”
Shi’a | 5 | “Hizbollah”
Secular | 10 | “Al-Aqsa Martyr’s Brigades”
Total | 39 |
Results – Hyperlink Diagram of U.S.
Domestic Groups’ Web sites
Results - Hyperlink Diagram of Middle Eastern Groups’ Web sites
[Figure: hyperlink diagram with clusters labeled Hizb-ut-Tahrir, Jihad Sympathizers, Tanzeem-e-Islami, Hizbollah, Al-Qaeda linked Web sites, and Palestinian terrorist groups.]
Results – Web Usage Patterns for U.S. Domestic Groups
[Figure: normalized content levels (0 to 0.9) by group category (Black Separatists, Christian Identity, Militia, Neo-confederates, Neo-Nazis/White Supremacists, Eco-Terrorism), with one series per attribute: Communications, Fundraising, Ideology, Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training.]
Results – Summary of Web Usage Patterns for U.S. Domestic Groups
• “Ideology” and “Propaganda towards insiders” were allocated the most Web site resources, followed by “Communications.”
• Eco-terrorism and animal rights groups allocated more Web site resources to “Communications” and “Command and Control”.
• “Propaganda towards outsiders” and “Virtual Community” appeared very rarely on U.S. domestic group Web sites.
Results – Web Usage Patterns for Middle Eastern Groups
[Figure: normalized content levels (0 to 1) by group (Hizb-ut-Tahrir, Hizbollah, Al-Qaeda Linked Websites, Jihad Sympathizers, Palestinian terrorist groups), with one series per attribute: Communications, Fundraising, Sharing Ideology, Propaganda (insiders), Propaganda (outsiders), Virtual Community, Command and Control, Recruitment and Training.]
Results – Summary of Web Usage Patterns for Middle Eastern Groups
• “Sharing Ideology” was the attribute with the highest frequency of occurrence in Middle Eastern terrorist/extremist groups’ Web sites.
• Clandestine groups (e.g., Al-Qaeda) tended to emphasize “Propaganda towards outsiders”, while the more established groups (e.g., Hizbollah, Hamas) directed their propaganda towards insiders.
• The less covert militant groups tended to conduct fundraising on their Web sites; for instance, both Hizb-ut-Tahrir and Hizbollah used the Web to support their fundraising activities.
Forum Interaction Network Analysis:
ClearGuidance.com
• Background
– Forum with some members affiliated with Toronto
terror plot.
– Reportedly had as many as 15,000 members.
– Unfortunately, the site went offline in February of 2004,
before we began spidering forums.
– We were able to retrieve selected content from various
blogs.
Authors: 269 | Messages: 877 | Duration: 9/2002-2/2004
ClearGuidance.com
• Member locations
– Shown to the right are the self-reported member locations.
– Approximately 2/3 of the 269 members specified a country.
– Breakdown for those that did specify:
• Majority located in USA, UK, Canada, and the Middle East.
[Pie chart: Members Reporting Location. Specified 67%, Unspecified 33%.]
[Pie chart: Member Locations by Region. Regions: USA, Canada, UK, Australia, Europe-Other, Middle East, Other. Slice values in extraction order: 8%, 28%, 16%, 3%, 6%, 12%, 27%.]
ClearGuidance.com
• Toronto plot forum Member Interaction Network
– Blue nodes indicate members with the greatest number of in-links.
– These members are the core set of forum “experts” and propagandists
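As a rough illustration of how in-links in a member interaction network might be counted, the sketch below treats each reply as a directed link from the replier to the member being replied to. That reading is an assumption, since the slide does not define how links were built, and the member IDs and reply records are hypothetical.

```python
from collections import Counter

def in_link_counts(replies):
    """Count in-links per member; replies are (replier, replied_to) pairs."""
    counts = Counter()
    for replier, replied_to in replies:
        if replier != replied_to:  # ignore self-replies
            counts[replied_to] += 1
    return counts

def core_members(replies, top_n):
    """Members with the most in-links, ties broken by member ID."""
    ranked = sorted(in_link_counts(replies).items(),
                    key=lambda item: (-item[1], item[0]))
    return [member for member, _ in ranked[:top_n]]

# Hypothetical reply records, not data from the forum.
REPLIES = [("u2", "u1"), ("u3", "u1"), ("u4", "u1"),
           ("u1", "u2"), ("u3", "u2"), ("u4", "u4")]
CORE = core_members(REPLIES, 2)
print(CORE)
```

Members that many others respond to rise to the top of this ranking, which matches the slide's description of the blue nodes as the forum's core set.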
Arabic Feature Set
[Diagram: feature set tree, 418 features in total.
Lexical (79): Char-Based (48): Char-Level, Letter Frequency, Special Char.; Word-Based (31): Word-Level, Vocab. Richness, Word Length Dist., Elongation
Syntactic (262): Punctuation (12); Function Words (200); Word Roots (50)
Structural (62): Word Structure (14): Message Level, Paragraph Level; Technical Structure (48): Contact Information, Font Color, Font Size, Embedded Images, Hyperlinks
Content Specific (15): Race/Nationality (11); Violence (4)]
Arabic Feature Extraction Component
[Diagram: (1) Incoming Message → (2) Elongation Filter (annotations: Count +1, Degree + 5) → Filtered Message → (3) Root Clustering Algorithm, using a Root Dictionary and Similarity Scores (SC) (annotation: max(SC)+1) → (4) Generic Feature Extractor (All Remaining Features Values) → Feature Set]
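A minimal sketch of what an elongation filter could look like follows. It assumes that "elongation" refers to the Arabic tatweel (kashida) character U+0640, and that the filter both records elongation statistics and strips the character before later steps. The exact counters behind the diagram's "Count +1" and "Degree + 5" annotations are not specified, so the two statistics returned here are illustrative stand-ins and the function name is hypothetical.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL (kashida), assumed elongation marker

def elongation_filter(message):
    """Return (filtered_message, elongated_word_count, tatweel_char_count)."""
    words = message.split()
    # Number of words that contain at least one tatweel.
    elongated_words = sum(1 for word in words if TATWEEL in word)
    # Total number of tatweel characters in the message.
    tatweel_chars = message.count(TATWEEL)
    # Strip the elongation so later steps see the plain word forms.
    filtered = message.replace(TATWEEL, "")
    return filtered, elongated_words, tatweel_chars

# First word carries three tatweel characters; second word has none.
MESSAGE = "\u062c\u0645\u0640\u0640\u0640\u064a\u0644 \u0643\u062a\u0627\u0628"
FILTERED, WORD_COUNT, CHAR_COUNT = elongation_filter(MESSAGE)
print(FILTERED, WORD_COUNT, CHAR_COUNT)
```

Stripping elongation before root matching keeps a stretched word from being treated as a different token, while the counts preserve the stylistic signal as features.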
Sliding Window Algorithm Illustration
1. Compute eigenvectors for the 2 principal components of the feature group.
   Eigenvector x: 0.533, -0.541, 0.034, 0.653, 0.975, 0.143
   Eigenvector y: 0.956, 0.445, 0.089, 0.456, -0.085, -0.381
2. Extract feature usage vectors from the message text (e.g., 1,0,0,2,1,2 and Z = 0,1,3,0,1,0).
3. Transform into 2-dimensional space: $x = \sum_i Z_i x_i$, $y = \sum_i Z_i y_i$, where $x_i$ and $y_i$ are the entries of the two eigenvectors.
Repeat steps 2 and 3.
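The projection step can be worked through with the numbers shown on the slide. The sketch below assumes that the transform is a plain dot product of each window's feature usage vector with the two eigenvectors, and that the two usage vectors on the slide belong to two successive windows. The function name is hypothetical.

```python
# Eigenvector entries and usage vectors as shown on the slide.
EIGEN_X = [0.533, -0.541, 0.034, 0.653, 0.975, 0.143]
EIGEN_Y = [0.956, 0.445, 0.089, 0.456, -0.085, -0.381]
WINDOWS = [
    [1, 0, 0, 2, 1, 2],
    [0, 1, 3, 0, 1, 0],  # labeled "Feature Usage Vector Z" on the slide
]

def project(usage_vector):
    """Map one feature usage vector to a 2-D point (x, y)."""
    x = sum(z * e for z, e in zip(usage_vector, EIGEN_X))
    y = sum(z * e for z, e in zip(usage_vector, EIGEN_Y))
    return x, y

# One 2-D point per window; plotting the sequence gives the pattern.
POINTS = [project(window) for window in WINDOWS]
for window, (x, y) in zip(WINDOWS, POINTS):
    print(window, round(x, 3), round(y, 3))
```

For Z = 0,1,3,0,1,0 this gives x = -0.541 + 3(0.034) + 0.975 = 0.536 and y = 0.445 + 3(0.089) - 0.085 = 0.627.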
Author Writeprints
[Figure: writeprint panels labeled Anonymous Messages, Author A (10 messages), and Author B (10 messages).]
Forum “Experts”
The series of overlapping circular patterns for bag-of-words features indicates that the author’s discussion revolves around a related set of topics.
Bag-of-words features are predominantly related to religious topics.
Many large red blots indicate the presence of features unique to this author.
This author attempts to use his religious “expertise”.
This author was later arrested as a major culprit in the Toronto terror plot (“Soldier of God”). He uses many violent affect terms.
Radar chart showing violent affect feature usage.
Text annotation view showing key bag-of-words features highlighted.
Comparison to mean shows several high-occurrence terms (e.g., jihad, martyrdom).
Selected feature (i.e., “jihad”) is shown in red.
Selected feature is use of the term “jihad”, which is the highest in the forum.
This author constantly attempts to justify acts of violence and terrorism.
“…there are so many paid sheikhs stuck in this life….no point going to them for fatwas…personally speaking…cuz they don’t even agree with jihad in the first place”
IED Dark Web Network
• www.albasrah.net: Major Iraqi resistance web site
• www.geocities.com/m_ale3dad4: Training materials
• www.saaid.net: Major Dark Web site
• www.geocities.com/maoso3ah: Training materials
Extraction: Retrieved Pages
• Using the lexicon, we used a search engine to extract all web pages
with these terms from our collection.
– A total of 2541 relevant web pages were collected from 30 web sites.
– Over 90% of these pages came from a core set of 7 web sites
Total Web Sites
No. Web Sites | No. Web Pages
30 | 2541

Core Web Sites
Web Site | No. Web Pages
www.qudsay.com | 1209
www.albasrah.net | 332
www.khayma.com | 162
www.jamaat.org | 141
www.hilafet.com | 66
www.geocities.com | 51

[Figure: Frequency Distribution. No. Web Pages (0 to 3000) by Web Site (1 to 29).]
Link Analysis: Hubs
• www.albasrah.net: a website with links to the former Iraqi Baathist regime. It contains a large collection of war images and reports of military operations by Iraqi insurgents.
• www.geocities.com/m_ale3dad4: a collection of training material. Topics include weapons, their usage, and the manufacturing of IEDs. The website contains video demonstrations, books, and other documents in English and Arabic.
• www.geocities.com/maoso3ah: an “encyclopedia” of military training and preparation for Jihad.
• www.saaid.net: an Islamic directory. Much of the content pertains to unrelated topics. However, some of the contributors support the Jihadi Salafi movement.
Segmentation: Implanting and setting off the device
• Video 28: Insurgents prepare the IED
• Location: Mushahad region, Iraq
• Insurgents: Islamic Front of the Iraqi Resistance
[Video frames: Planting the device; Hiding the device; Setting it off]
Categorization
Extended Arabic Feature Set
Group | Category | Quantity | Description/Examples
Lexical | Word-Level | 5 | total words, % char. per word
Lexical | Character-Level | 5 | total char., % char. per message
Lexical | Character N-Grams | < 18,278 | count of letters, char. bigrams, trigrams (e.g., اب, کك)
Lexical | Digit N-Grams | < 1,110 | count of digits, digit bigrams, digit trigrams (e.g., 1, 12, 123)
Lexical | Word Length Distribution | 20 | frequency distribution of 1-20 letter words
Lexical | Vocabulary Richness | 8 | richness (e.g., hapax legomena, Yule’s K, Honore’s H)
Lexical | Special Characters | 21 | occurrences of special char. (e.g., @#$%^&*+=)
Syntactic | Function Words | 300 | frequency distribution of function words (e.g., of, for, to)
Syntactic | Punctuation | 12 | occurrence of punctuation marks (e.g., !;:,.?)
Syntactic | Word Root N-Grams | varies | roots, bigrams, trigrams (e.g., كتب, كسب)
Structural/HTML | Message-Level | 6 | e.g., has greeting, has url, requoted content
Structural/HTML | Paragraph-Level | 8 | e.g., number of paragraphs, sentences per paragraph
Structural/HTML | Technical Structure | 50 | e.g., file extensions, fonts, use of images, HTML tags
Structural/HTML | HTML Tag N-Grams | < 46,656 | e.g., <head>, <br>, <td>, <message>
Topical | Word N-Grams | varies | bag-of-words n-grams (e.g., “explosive”, “explosive device”)
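Two of the lexical categories above, word length distribution and vocabulary richness, are easy to illustrate. The sketch below is not the authors' extractor: it assumes whitespace tokenization, uses the common form of Yule's K, K = 10^4 (Σ i² V_i − N) / N² with V_i the number of word types occurring i times and N the token count, and all function names are hypothetical.

```python
from collections import Counter

def word_length_distribution(tokens, max_len=20):
    """Counts of words with 1..max_len letters (index 0 = 1-letter words)."""
    dist = [0] * max_len
    for token in tokens:
        if 1 <= len(token) <= max_len:
            dist[len(token) - 1] += 1
    return dist

def vocabulary_richness(tokens):
    """Return (hapax legomena count, Yule's K) for a non-empty token list."""
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    type_freq = Counter(tokens)             # word type -> occurrences
    spectrum = Counter(type_freq.values())  # i -> V_i, types seen i times
    hapax = spectrum.get(1, 0)              # types that occur exactly once
    yule_k = 1e4 * (sum(i * i * v for i, v in spectrum.items()) - n) / (n * n)
    return hapax, yule_k

TOKENS = "to be or not to be".split()
LENGTHS = word_length_distribution(TOKENS)
HAPAX, YULE_K = vocabulary_richness(TOKENS)
print(LENGTHS[:3], HAPAX, round(YULE_K, 2))
```

A higher K means more repetition and therefore a less rich vocabulary. For Arabic text the same functions would apply once the message has been tokenized.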
IED Site Signatures
• Using feature selection, we were able to get 88.8% accuracy.
• We were also able to isolate a subset of approximately 9,000 key features.
• The table and graph summarize the results of the 100 bootstrapping instances.

Technique | Features | Mean Accuracy | Standard Deviation | Range
SVM | 21,333 | 81.938 | 5.313 | 65.00 – 92.50
SVM-IG | 9,268 | 88.838 | 3.238 | 80.00 – 96.25

[Figure: Classification Results. Accuracy (%) per bootstrap instance for SVM and SVM-IG.]
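Assuming "IG" in SVM-IG stands for information gain, the feature selection step might be sketched as below: each feature is scored by how much knowing its value reduces the entropy of the class label, and only features above a threshold are kept for the classifier. The threshold, feature names, and toy data are hypothetical, and the SVM itself is not shown.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """H(class) minus the expected entropy after splitting on the feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def select_features(features, labels, threshold):
    """Keep the names of features whose information gain exceeds threshold."""
    return sorted(name for name, values in features.items()
                  if information_gain(values, labels) > threshold)

# Toy data: 1 = page from an IED-related site, 0 = other page.
LABELS = [1, 1, 0, 0]
FEATURES = {
    "has_term_explosive": [1, 1, 0, 0],  # perfectly separates the classes
    "has_greeting":       [1, 0, 1, 0],  # carries no class information
}
KEPT = select_features(FEATURES, LABELS, threshold=0.0025)
print(KEPT)
```

Dropping zero-gain features is the kind of pruning that would shrink a feature set such as the 21,333 listed above toward a smaller key subset, although the cut-off actually used is not given on the slide.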
Recommendation: Terrorism Informatics Methodology
• Anecdote → Data → Data Mining (SNA)
• Journalism → Social Sciences → Computational Sciences
• Field, classified, and human intelligence → Open source, web and artificial intelligence
Recommendation: Databases and Tools
• Developing evidence-based, open source collections for
the international intelligence community
• Developing advanced open source, web and artificial
intelligence tools and linguistic resources for the
international intelligence community
• Leveraging best existing web intelligence and data mining
tools
• Monitoring and analyzing radical forums and Web 2.0
• Advancing multilingual and multimedia analysis
techniques for intelligence analysis
Recommendation: “Soft Power”
• Identifying and promoting moderate sites, forums,
opinion leaders, and statements
• Removing targeted radical sites and forums
based on community tagging and automated,
“refreshed” spidering
• Enabling stakeholders and “cultural intelligence”
through digital libraries
• Promoting positive alternatives, role models, and
local heroes in the Muslim worlds
Hsinchun Chen …
Artificial Intelligence Lab, COPLINK and
Dark Web Teams …
hchen@eller.arizona.edu …
http://ai.arizona.edu …
EuroISI 2008: December 3-5, 2008,
Copenhagen, Denmark; CFP Deadline:
July 8, 2008
PAISI 2009: Co-locating with PAKDD,
April 27-30, 2009,
IEEE ISI 2009: June 8-11, 2009, Dallas,
Texas